66 Cards in this Set
- Front
- Back
Alignment Algorithms |
1. Global Alignment 2. Local Alignment |
|
Methods of Alignment |
1. Dot Plot 2. Dynamic Programming 3. Word Method |
|
Dot Plot Matrix Method |
A basic sequence alignment method: a graphical way of comparing two sequences in a two-dimensional matrix. The two sequences are written along the vertical and horizontal axes of the matrix. A dot is placed in the matrix wherever the residues match; otherwise the position is left blank. When the sequences share regions of similarity, the dots line up to form diagonal lines. Interruptions between the diagonals mark regions of deletion and insertion.
Parallel diagonals represent repetitive regions of the sequences. |
|
Problems Associated with Dot Plots |
1. When large sequences are compared, the noise level is high. |
|
Solving the Noise Problem of Dot Plots |
A sliding window of fixed length scans across the two sequences and compares all possible matches, so that dots are drawn for matching stretches rather than for single residues. The window size can be adjusted. Sensitivity is lost if the window is too long. |
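As a rough illustration of the dot matrix idea and of the sliding-window filter, here is a minimal Python sketch. The function names `dot_plot` and `render`, the `min_matches` threshold and the example sequences are assumptions made for this example, not part of any dot plot program.

```python
def dot_plot(seq_a, seq_b, window=1, min_matches=1):
    """Return the set of (i, j) positions where a dot would be drawn.

    A dot is placed at (i, j) when at least `min_matches` of the `window`
    residue pairs starting at seq_a[i] and seq_b[j] are identical.
    With window=1 this is the plain, unfiltered dot plot.
    """
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            if matches >= min_matches:
                dots.add((i, j))
    return dots


def render(seq_a, seq_b, dots):
    """Draw the matrix as text: seq_a on the vertical axis, seq_b on top."""
    lines = ["  " + seq_b]
    for i, residue in enumerate(seq_a):
        row = "".join("*" if (i, j) in dots else "." for j in range(len(seq_b)))
        lines.append(residue + " " + row)
    return "\n".join(lines)


if __name__ == "__main__":
    a, b = "GATTACA", "GATCACA"
    print(render(a, b, dot_plot(a, b)))                            # window of 1: noisy
    print(render(a, b, dot_plot(a, b, window=3, min_matches=3)))   # window of 3: filtered
```

With a window of 3 and all three residues required to match, only the start positions of the two identical three-residue stretches keep a dot, which is the noise reduction described above.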
|
Alignment of a Sequence with Itself |
Why: to identify internal repeat elements. Matching residues form a perfect main diagonal. If repeats are present, short diagonals appear above and below the main diagonal. |
|
What can be found via dot plots using the self-complementarity of DNA sequences? |
Inverted repeats. The method can therefore be applied in genomics. |
|
Another Problem of Dot Plots |
The method lacks statistical rigor in assessing the quality of an alignment. It is also restricted to pairwise alignment. |
|
Programs for Dot Plots |
Dotmatcher and Dottup |
|
Dotmatcher |
Displays dot plots of the aligned sequences, which are supplied in FASTA format. |
|
Dottup |
Aligns sequences using the word method. Diagonal lines are drawn only where exact matches of words of a specified length are found. |
|
Dynamic Programming |
It is similar to the dot plot method, but it incorporates scoring schemes and matrices into the alignment and its assessment. It searches for the alignment with the highest score, and thus provides the best option. |
|
Gaps |
Represent Deletion or Insertion |
|
Affine Gap Penalties |
Differential gap penalties: a gap opening penalty and a gap extension penalty. |
|
The total gap penalty is a linear function of what? |
the gap length |
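A small sketch of how an affine gap penalty grows linearly with gap length. It assumes the convention in which the opening penalty covers the first gap position and each further position costs the extension penalty; the function name and the default values 10.0 and 0.5 are illustrative assumptions only.

```python
def affine_gap_penalty(length, gap_open=10.0, gap_extend=0.5):
    """Total penalty for a single gap spanning `length` positions.

    Assumed convention: penalty = gap_open + gap_extend * (length - 1).
    The default penalty values are illustrative, not taken from any program.
    """
    if length <= 0:
        return 0.0
    return gap_open + gap_extend * (length - 1)


if __name__ == "__main__":
    for k in (1, 2, 5, 10):
        print(k, affine_gap_penalty(k))
```

Opening a gap is expensive, while lengthening an existing gap is cheap, so one long gap costs less than several short ones of the same total length.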
|
Needleman-Wunsch Algorithm |
Global alignment using dynamic programming |
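A minimal sketch of global alignment by dynamic programming in the spirit of Needleman-Wunsch. For brevity it assumes a simple match/mismatch score and a linear gap penalty rather than a substitution matrix with affine gaps; the function name and the scoring values are assumptions for illustration.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align a and b; return (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First row and column: aligning a prefix against nothing costs only gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Fill the matrix: each cell takes the best of diagonal, up, and left.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    # Trace back from the bottom-right corner to recover one best alignment.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    print(needleman_wunsch("ACGT", "AGT"))
```

Because the traceback always starts at the bottom-right cell and ends at the top-left, every residue of both sequences appears in the result, which is what makes the alignment global.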
|
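Dynamic Programming for Local Alignment |
Smith-Waterman algorithm. Positive scores are assigned to matches; mismatches and gaps never leave a negative value in the matrix, because any cell score that would fall below zero is set to zero. |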
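A minimal sketch of local alignment in the spirit of Smith-Waterman. The match, mismatch and gap values and the function name are assumptions for illustration; the key difference from the global case is that cell scores are floored at zero and the traceback starts at the highest-scoring cell.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Locally align a and b; return (best_score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # A cell never goes below zero: a bad stretch restarts the alignment.
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + sub,
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until a zero is reached.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    print(smith_waterman("TTTGATTACATTT", "CCCGATTACACCC"))
```

Only the best-matching region is reported, so unrelated flanking residues do not drag the score down.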
|
Scoring Matrices |
Also called substitution matrices; they are used for the analysis of residue substitutions. |
|
Observation in Substitution Matrices |
1. Transitions (substitutions of purines by purines or pyrimidines by pyrimidines) occur more frequently than transversions (substitutions between purines and pyrimidines). |
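To make the distinction concrete, a short sketch that classifies a nucleotide substitution. The function name is an assumption for this example; the purine and pyrimidine sets are the standard DNA bases.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(old, new):
    """Return 'transition' or 'transversion' for a DNA base substitution."""
    old, new = old.upper(), new.upper()
    if old == new:
        raise ValueError("no substitution: both bases are identical")
    if not {old, new} <= PURINES | PYRIMIDINES:
        raise ValueError("expected bases from A, C, G, T")
    same_class = {old, new} <= PURINES or {old, new} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


if __name__ == "__main__":
    print(classify_substitution("A", "G"))  # purine to purine
    print(classify_substitution("A", "C"))  # purine to pyrimidine
```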
|
Complexity of Scoring Matrices for Amino Acids |
They are more complex, because amino acids are scored based on their physicochemical properties. |
|
Amino Acid Scoring Matrices |
20 × 20 matrices. There are two types: a. one is based on the interchangeability of the genetic code or on amino acid properties; b. the other is derived from empirical studies of amino acid substitutions. |
|
Empirical amino acid scoring matrices are? |
PAM and BLOSUM. They are derived from actual alignments of highly similar sequences. |
|
How can a scoring system be developed? |
By giving a high score to a more likely substitution and a low score to a rare substitution. |
|
a positive score means? |
the frequency of substitution is higher than one would expect by random chance |
|
Score zero? |
the frequency is equal to random chance |
|
negative score ? |
the frequency of the substitution is lower than one would expect randomly. |
|
Log-Odds Ratio |
Log-odds scores are logarithms of the ratio of the observed mutation frequency to the substitution frequency expected by random chance. |
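A small sketch of a log-odds score and of how its sign relates to chance expectation. It assumes the expected frequency of a residue pair is simply the product of the two background frequencies and uses base-2 logarithms; the function name and all numbers are made up for illustration.

```python
import math


def log_odds_score(observed_freq, freq_a, freq_b, base=2):
    """Log of (observed pair frequency / frequency expected by chance).

    Assumption: the chance expectation is freq_a * freq_b, the product of
    the background frequencies of the two residues.
    """
    expected = freq_a * freq_b
    return math.log(observed_freq / expected, base)


if __name__ == "__main__":
    # Hypothetical background frequencies of 0.1 for both residues.
    for observed in (0.02, 0.01, 0.005):
        print(observed, round(log_odds_score(observed, 0.1, 0.1), 3))
```

An observed frequency above the chance expectation gives a positive score, one equal to it gives roughly zero, and one below it gives a negative score.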
|
PAM Matrices |
Point Accepted Mutation matrices: point mutations that are accepted by natural selection. |
|
One PAM unit means? |
1% of the amino acid positions have changed, i.e. one mutation per 100 residues. |
|
PAM80 is produced by? |
Multiplying PAM1 by itself 80 times. |
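Higher-numbered PAM matrices are extrapolated by repeated multiplication of the PAM1 mutation probability matrix, so PAM80 corresponds to PAM1 raised to the 80th power. The sketch below shows the mechanics on a toy two-letter alphabet; `TOY_PAM1` is a made-up matrix for illustration, not real PAM data, and the helper names are assumptions.

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [
        [sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def mat_pow(m, power):
    """Raise a square matrix to a positive integer power by repeated multiplication."""
    result = m
    for _ in range(power - 1):
        result = mat_mul(result, m)
    return result


# Hypothetical two-residue "PAM1": each residue has a 1% chance of changing.
TOY_PAM1 = [
    [0.99, 0.01],
    [0.01, 0.99],
]

if __name__ == "__main__":
    toy_pam80 = mat_pow(TOY_PAM1, 80)
    for row in toy_pam80:
        print([round(p, 4) for p in row])
```

Each row still sums to one after extrapolation, while the probability of a residue remaining unchanged falls as the evolutionary distance grows.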
|
How is a PAM1 substitution table constructed? |
A group of closely related sequences whose mutation frequencies correspond to one PAM unit is chosen. |
|
BLOSUM Matrices |
Blocks amino acid substitution matrices. The number in each matrix name is the percentage identity value of the sequences selected to construct that matrix. |
|
How were the blocks constructed? |
Based on more than 2,000 conserved amino acid patterns representing 500 groups of protein sequences.
Blocks are ungapped alignments of fewer than 60 residues in length. The frequencies of amino acid substitutions among the residues in these blocks are calculated to produce the table.
|
|
Comparison between PAM and BLOSUM |
1. PAM matrices, except PAM1, are derived from an evolutionary model, whereas BLOSUM matrices consist entirely of direct observations.
BLOSUM matrices may therefore have less evolutionary meaning than PAM matrices, which is why PAM is used more often for constructing phylogenetic trees.
However, because of the mathematical extrapolation used to derive them, PAM matrices may be less realistic for more divergent sequences. 2. BLOSUM matrices are derived from local alignments of conserved sequence blocks, whereas PAM1 is based on global alignments of full-length sequences composed of both conserved and variable regions.
|
|
P-Value |
Indicates the probability that the original alignment is due to random chance. A value below 10^-100 indicates an exact match between the two sequences; somewhat higher values indicate nearly identical sequences. A value between 10^-5 and 10^-1 indicates possible distant homologs. |
|
Heuristic Algorithms |
BLAST and FASTA. These heuristic word methods are 50-100 times faster than dynamic programming. |
|
BLAST |
Steps: 1. The query sequence is broken down into words (three residues for protein sequences, eleven for DNA sequences). 2. The database sequences are scanned for matches to these words. 3. The scan also includes words that match in only one or two letters. 4. The matches are scored based on BLOSUM62. 5. Each match is extended in both directions until the score of the alignment drops below a threshold score. 6. High-scoring segment pairs (HSPs) above the threshold score are determined.
In the original BLAST, the HSPs are presented in the final report and are called maximum scoring segment pairs.
In the newer, improved version, however, a gapped alignment is presented. |
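The seeding and extension steps can be sketched as follows. This is a simplified illustration, not the BLAST implementation: it assumes exact word matches only, a plain match/mismatch score in place of BLOSUM62, and a simple drop-off rule for stopping the extension. All function names, parameters and sequences are hypothetical.

```python
def words(seq, k):
    """Break a sequence into overlapping words of length k with their positions."""
    return [(i, seq[i:i + k]) for i in range(len(seq) - k + 1)]


def find_seeds(query, subject, k=3):
    """Return (query_pos, subject_pos) pairs where a query word occurs in the subject."""
    index = {}
    for j, word in words(subject, k):
        index.setdefault(word, []).append(j)
    return [(i, j) for i, word in words(query, k) for j in index.get(word, [])]


def _extend_one_way(query, subject, qi, sj, step, match, mismatch, drop):
    """Walk away from the seed in one direction; return (best_gain, best_length)."""
    gain = best_gain = 0
    length = best_len = 0
    while 0 <= qi < len(query) and 0 <= sj < len(subject):
        gain += match if query[qi] == subject[sj] else mismatch
        length += 1
        if gain > best_gain:
            best_gain, best_len = gain, length
        elif best_gain - gain >= drop:
            break  # score has dropped too far below the best seen so far
        qi += step
        sj += step
    return best_gain, best_len


def extend_seed(query, subject, i, j, k=3, match=1, mismatch=-1, drop=2):
    """Extend a seed without gaps in both directions; return (score, query_segment)."""
    right_gain, right_len = _extend_one_way(query, subject, i + k, j + k, 1, match, mismatch, drop)
    left_gain, left_len = _extend_one_way(query, subject, i - 1, j - 1, -1, match, mismatch, drop)
    score = k * match + left_gain + right_gain
    return score, query[i - left_len:i + k + right_len]


if __name__ == "__main__":
    q, s = "MKVLAAGT", "TTKVLAAGC"
    seeds = find_seeds(q, s)
    print(seeds)
    print(extend_seed(q, s, *seeds[0]))
```

A segment found this way plays the role of a high-scoring segment pair if its score is above the chosen threshold.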
|
Statistical Significance in BLAST |
It is presented as the E-value (expectation value), which indicates how likely it is that the alignments resulting from a database search are caused by random chance.
E = m × n × P
m: total number of residues in the database; n: number of residues in the query sequence; P: probability that the HSP is the result of random chance. For example, aligning a query sequence of 100 residues against a database of 10^12 residues, with a P-value of 1 × 10^-20 for the ungapped HSP region in one of the database matches, gives an E-value of 10^12 × 100 × 10^-20 = 10^-6. The lower the E-value, the more significant the match. |
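The worked example can be checked with a few lines of Python. The function name is an assumption; the formula is the E = m × n × P relation from this card.

```python
def e_value(db_residues, query_residues, p_value):
    """E = m * n * P, with m database residues, n query residues, P the HSP P-value."""
    return db_residues * query_residues * p_value


if __name__ == "__main__":
    # Query of 100 residues, database of 10^12 residues, P-value of 1e-20.
    print(e_value(10 ** 12, 100, 1e-20))
```

The same P-value gives a larger, less significant E-value when the database or the query is larger, because more chance matches are possible.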
|
FASTA |
FASTA uses a hashing strategy to find matches for short stretches of identical residues of length k, known as k-tuples or ktups: 2 residues for a protein sequence and 6 residues for DNA, in other words shorter than the words in BLAST. |
|
Steps in FASTA |
1. Identify ktups shared between the two sequences using the hashing strategy. This works by constructing a table that shows the position of each ktup in the two sequences. The positional difference for each word is obtained by subtracting its position in the first sequence from its position in the second sequence; this difference is the offset. 2. Ktups with identical offset values belong to a contiguous identical sequence region that corresponds to a diagonal stretch in a two-dimensional matrix. 3. The ten highest-density diagonals are picked and then scored with a substitution matrix. 4. Neighboring high-scoring segments are joined to form a single alignment; the gapped alignment is scored again, this time incorporating gap penalties. 5. The alignment is then refined using the Smith-Waterman algorithm, and an E-score is given for statistical evaluation. |
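Steps 1 and 2 can be sketched as follows: build a position table for each sequence, compute the offset of every shared ktup, and count how many ktups fall on each diagonal. The function names and example sequences are assumptions for illustration; the later scoring, joining and refinement steps are omitted.

```python
from collections import Counter, defaultdict


def ktup_table(seq, k):
    """Hash table mapping each ktup to the list of positions where it occurs."""
    table = defaultdict(list)
    for i in range(len(seq) - k + 1):
        table[seq[i:i + k]].append(i)
    return table


def diagonal_counts(seq1, seq2, k=2):
    """Count shared ktups per offset (position in seq2 minus position in seq1)."""
    table1, table2 = ktup_table(seq1, k), ktup_table(seq2, k)
    counts = Counter()
    for word, positions1 in table1.items():
        for p1 in positions1:
            for p2 in table2.get(word, []):
                counts[p2 - p1] += 1
    return counts


if __name__ == "__main__":
    counts = diagonal_counts("AMPSDGL", "GPSDNAT", k=2)
    # Offsets with the most shared ktups are the densest diagonals.
    print(counts.most_common(10))
```

Ktups that share the same offset lie on the same diagonal of the dot matrix, so the offsets with the highest counts are the candidates for the highest-density diagonals in step 3.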
|
Comparison of FASTA and BLAST |
1. In the seeding step, BLAST uses a substitution matrix to find matching words, while FASTA identifies matching words using the hashing method. 2. FASTA by default scans smaller window sizes, which gives it more sensitivity. 3. FASTA is slower than BLAST. 4. FASTA gives only one final alignment, while BLAST presents multiple best-scoring alignments. |
|
Multiple Sequence Alignment |
Allows the identification of conserved regions and motifs across a whole sequence family. It is essential for phylogenetic analysis of sequence families and for the prediction of protein secondary and tertiary structure. |
|
Concept of MSA |
The sequences are arranged so that the maximum number of residues are matched up according to a particular scoring function. |
|
What is the scoring function for MSA based on? |
The sum of pairs (SP) score. SP is the sum of the scores of all possible pairs of sequences in the multiple alignment, based on a specific scoring matrix. Each pairwise score takes matches, mismatches and gap costs into account. |
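A minimal sketch of a sum-of-pairs score for an existing multiple alignment. It assumes a simple match/mismatch/gap scheme in place of a real scoring matrix and scores a gap aligned with a gap as zero; the function names and the example alignment are hypothetical.

```python
from itertools import combinations


def pair_score(x, y, match=1, mismatch=-1, gap=-2):
    """Score one aligned column position for a pair of sequences."""
    if x == "-" and y == "-":
        return 0  # assumption: two gaps facing each other cost nothing
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch


def sum_of_pairs(alignment, match=1, mismatch=-1, gap=-2):
    """Sum the pairwise scores over every pair of rows in the alignment."""
    total = 0
    for row1, row2 in combinations(alignment, 2):
        total += sum(pair_score(x, y, match, mismatch, gap) for x, y in zip(row1, row2))
    return total


if __name__ == "__main__":
    msa = ["ACGT", "A-GT", "ACGA"]
    print(sum_of_pairs(msa))
```

For three sequences there are three pairs to score; the number of pairs grows quadratically with the number of sequences.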
|
Approaches of MSA |
1. Heuristic algorithms: progressive alignment, iterative alignment, and block-based alignment. |
|
Progressive Alignment Method |
A multistep process. It first conducts pairwise alignments based on the Needleman-Wunsch algorithm and records the similarity scores. |
|
How are additional sequences aligned via the PA method? |
The two already aligned sequences are converted to a consensus sequence with gap positions. This is treated as a single sequence in the next step. |
|
The best-known PA program |
Clustal. In ClustalW, the W version provides a simple text-based interface. |
|
Advantages of CLUSTAL |
1. It does not rely on a single substitution matrix; instead it applies different scoring matrices when aligning the sequences. The choice of matrix depends on the evolutionary distance measured from the guide tree: for closely related sequences, Clustal uses the BLOSUM62 or PAM120 matrix, while for more divergent sequences BLOSUM45 or PAM250 is preferred. 2. It uses adjustable gap penalties, which allow more deletions and insertions outside conserved regions but fewer inside them. |
|
Disadvantages of PA Method |
1. It is not suitable for the multiple alignment of sequences of different lengths, because it is a global alignment-based method. 2. As a result of the use of affine gap penalties, long gaps are not allowed. 3. An optimal final result cannot be guaranteed: errors made in the initial alignments cannot be corrected later, so errors build up with successive alignments. |
|
Improvement on Clustal |
T-Coffee. It performs both local and global alignments. Because an optimal alignment is chosen at the initial stage, T-Coffee avoids or minimizes errors in the early stages. However, it is slower than Clustal because of its high computational cost. |
|
Evaluation of Alignments Obtained from PA |
Editing: this involves introducing or removing gaps to maximize biologically meaningful matches, thereby correcting misaligned portions.
Program: BioEdit |
|
Computational approaches to protein three-dimensional structural modeling and prediction |
1. Homology modeling 2. Threading 3. Ab initio prediction
The first two are knowledge-based: they model structures using existing protein structural information in databases. |
|
Homology Modeling Overview |
Builds an atomic model based on an experimentally determined structure that is closely related at the sequence level. |
|
Threading overview |
Identifies proteins that are structurally similar, with or without detectable sequence similarities. |
|
Ab initio Prediction overview |
Predicts and models structures based on the physicochemical principles that govern protein folding, without the use of structural templates. |
|
Homology Modeling |
Also known as comparative modeling. Principle: if two sequences share high sequence similarity, they are most likely to share the same 3D structure. |
|
Steps in Homology Modeling |
1. Template selection: identification of homologous sequences in a database to be used for modeling. 2. Alignment of the target and template sequences. 3. Building a framework structure for the target protein consisting of main chain atoms. 4. Addition and optimization of side chain atoms and loops. 5. Energy optimization. 6. Evaluation of the overall quality of the model. |
|
Template selection |
Involves searching the Protein Data Bank for homologous proteins with determined structures. The search can be performed using heuristic methods, e.g. BLAST or FASTA.
Rule of thumb: a database protein must have at least 30% sequence identity with the query sequence in order to be accepted as a template. |
|
Sequence Alignment |
Once the structure with the highest identity has been located, the full-length sequences of the target and template proteins need to be aligned using refined algorithms to obtain an optimal alignment.
This matters because an incorrect alignment leads to incorrect assignment of residues and therefore to incorrect models. T-Coffee is a suitable program for this step. |
|
Loop Modeling |
In the sequence alignment used for modeling, there are often regions of deletion and insertion that appear as gaps in the alignment. Closing these gaps requires loop modeling, which is very difficult and a major source of errors. |
|
Side chain Modeling |
Once the model of the main chain atoms has been built, the positions of the side chain atoms have to be determined.
Side chain geometry modeling is important for determining protein-ligand interactions at active sites.
A side chain can be built by searching every possible conformation at every torsional angle of the side chain and selecting the ones with the lowest interaction energy with neighboring atoms.
Most side chain modeling programs use the concept of rotamers, which are favored side chain torsional angles extracted from known protein crystal structures. Only the possible rotamers with the lowest energy are selected. |
|
Model Refinement |
1. Energy minimization 2. Molecular dynamics simulation. The simulation can be run in a vacuum or in solvent. |
|
Why are protein structures more conserved than protein sequences ? |
Because only a small number of protein folds are available compared with the enormous number of possible protein sequences.
As a result, some proteins share the same fold even without similar sequences. |
|
Threading or structural fold recognition |
Predicts the structural fold of an unknown protein sequence by fitting the sequence against a structural database and selecting the best-fitting fold. Methods: 1. Pairwise energy method 2. Profile method |
|
Pairwise Energy Method |
A protein sequence is searched against a structural fold database to find the best-matching structural fold using energy-based criteria. |