
Bioinformatics

by TimilehinAdunni, Sep. 2014

Subjects: Basics

66 Cards in this Set

Front
Back

	Alignment Algorithms	1. Global alignment 2. Local alignment
	Methods of Alignment	1. Dot plot 2. Dynamic programming 3. Word method
	Dot Plot Matrix Method	A basic sequence alignment method: a graphical way of comparing two sequences in a two-dimensional matrix. One sequence is written along the horizontal axis and the other along the vertical axis. A dot is placed in the matrix wherever the residues match; otherwise the position is left blank. Regions of similarity appear as dots forming diagonal lines. Interruptions between the diagonals mark insertions and deletions. Parallel diagonals represent repetitive regions of the sequences.
	Problems with Dot Plots	1. When comparing large sequences, the noise level is high.
	Solving the Problem of Dot Plots	A sliding window of fixed length scans across the two sequences and compares all possible matches. The window size can be adjusted. Sensitivity is lost if the window is too long.
	Alignment of a sequence with itself	Why: to identify regions of internal repeat elements. The main diagonal is a perfect line of matching residues. If repeats are present, short diagonals appear above and below the main diagonal.
	What can be found via dot plots using self-complementarity of DNA sequences	Inverted repeats. The method can thus be applied in genomics.
	Another Problem of Dot Plots	They lack statistical rigor in assessing an alignment. The method is also restricted to pairwise alignment.
	Programs for Dot Plots	Dotmatcher and Dottup
	Dotmatcher	Displays a dot plot of two sequences supplied in FASTA format.
	Dottup	Aligns sequences based on the word method. Diagonal lines are drawn only where exact matches of words of a specified length are found.
	Dynamic Programming	Similar to dot plots, but it incorporates scoring schemes and matrices into the alignment and its assessment. It searches for the alignment with the highest score, thus providing the best option.
	Gaps	Represent deletions or insertions
	Affine Gap Penalties	Differential gap penalties: a gap opening penalty and a gap extension penalty
	The total gap penalty is a linear function of what	The gap length
	Needleman-Wunsch Algorithm	Global alignment using dynamic programming
	Dynamic Programming for Local Alignment	Smith-Waterman algorithm. Positive scores are assigned to matches and negative scores to mismatches and gaps; any cell score that would fall below zero is reset to zero.
	Scoring Matrices	A scoring matrix, or substitution matrix, is used for the analysis of residue substitutions.
	Observation in substitution matrices	1. Transitions [substitutions of purines by purines or pyrimidines by pyrimidines] occur more frequently than transversions [purines by pyrimidines or vice versa].
	Complexity of Scoring Matrices for Amino Acids	They are more complex, because amino acids are scored based on their physicochemical properties.
	Amino Acid Scoring Matrices	20 × 20 matrices. There are two types: a. one is based on interchangeability of the genetic code or amino acid properties; b. the other is derived from empirical studies of amino acid substitutions.
	Empirical amino acid scoring matrices are?	PAM and BLOSUM. They are derived from actual alignments of highly similar sequences.
	How can a scoring system be developed?	By giving a high score to a more likely substitution and a low score to a rare substitution.
	A positive score means?	The frequency of the substitution is higher than one would expect by random chance.
	Score zero?	The frequency is equal to that expected by random chance.
	Negative score?	The frequency of the substitution is lower than one would expect by random chance.
	Log-odds ratio	Scores are logarithmic ratios of the observed mutation frequency divided by the probability of the substitution expected by random chance.
	PAM Matrices	Point Accepted Mutation matrices. Point mutations that are accepted by natural selection.
	One PAM unit means?	1% of the amino acid positions have changed, i.e. one mutation per 100 residues.
	PAM80 is produced by?	Multiplying the PAM1 matrix by itself 80 times.
	How is a PAM1 substitution table constructed?	A group of closely related sequences, whose mutation frequencies correspond to one PAM unit, is chosen.
	BLOSUM Matrices	Blocks amino acid substitution matrices. The numbers in their names (e.g. the 62 in BLOSUM62) are the percentage identity values of the sequences selected for the construction of these matrices.
	How were the blocks constructed?	Based on more than 2,000 conserved amino acid patterns representing 500 groups of protein sequences. Blocks are ungapped alignments of less than 60 residues in length. The frequencies of amino acid substitutions in these blocks are calculated to produce the table.
	Comparison between PAM and BLOSUM	1. PAM matrices, except PAM1, are derived from an evolutionary model, whereas BLOSUM matrices consist entirely of direct observations. BLOSUM matrices may therefore have less evolutionary meaning than PAM matrices, which is why PAM is used more often for constructing phylogenetic trees. However, because of the mathematical extrapolation used for PAM matrices, they may be less reliable for more divergent sequences. 2. BLOSUM matrices are derived from local alignments of conserved sequence blocks, whereas PAM1 is based on global alignments of full-length sequences composed of both conserved and variable regions.
	P value	Indicates the probability that the original alignment is due to random chance. A value below 10^-100 indicates an exact match between the two sequences. Values between 10^-100 and 10^-50 indicate nearly identical sequences. A value between 10^-5 and 10^-1 indicates possible distant homologs.
	Heuristic Algorithms	BLAST and FASTA. Heuristic word methods, 50-100 times faster than dynamic programming.
	BLAST	Steps: 1. The query sequence is broken down into words [three residues for protein sequences and eleven for DNA sequences]. 2. The database sequences are scanned for matches to these words. 3. Matching is not restricted to identical words: words in which only one or two letters match are included. 4. Matches are scored with a substitution matrix, BLOSUM62. 5. Each match is extended in both directions until the alignment score drops below a threshold. 6. High-scoring segments above the threshold score are determined. In the original BLAST, these high-scoring segment pairs (HSPs), also called maximum scoring segment pairs, are presented as the final report. In the newer, improved version, a gapped alignment is presented.
	Statistical Significance in BLAST	Presented as the E-value (expectation value): the number of alignments from a database search that are expected to arise by random chance. E = m × n × P, where m is the total number of residues in the database, n is the number of residues in the query sequence, and P is the probability that the HSP is the result of random chance. E.g. aligning a query sequence of 100 residues against a database of 10^12 residues, with a P value of 1 × 10^-20 for the ungapped HSP region in one of the database matches, gives an E-value of 10^-6. The lower the value, the more significant the match.
	FASTA	FASTA uses a hashing strategy to find matches for short stretches of identical residues of length k, known as k-tuples or ktups: 2 residues for a protein sequence and 6 residues for DNA. In other words, they are shorter than the words in BLAST.
	Steps in FASTA	1. Identify ktups shared by the two sequences using the hashing strategy. This works by constructing a table that shows the position of each ktup in the two sequences. The positional difference for each word is obtained by subtracting its position in the first sequence from its position in the second sequence; this difference is called the offset. 2. Ktups with identical offset values form a contiguous identical sequence region that corresponds to a diagonal stretch in a 2D matrix. 3. The ten highest-density diagonals are picked and then scored with a substitution matrix. 4. Neighboring high-scoring segments are joined to form a single alignment; the resulting gapped alignment is scored again, this time incorporating gap penalties. 5. The alignment is then refined using the Smith-Waterman algorithm and evaluated statistically to give an E-value.
	Comparison of FASTA to BLAST	1. The seeding step: BLAST uses a substitution matrix to find matching words, while FASTA identifies matching words using the hashing method. 2. FASTA by default scans with smaller window sizes, which gives it more sensitivity. 3. FASTA is slower than BLAST. 4. FASTA gives only one final alignment, while BLAST presents multiple best-scoring alignments.
	Multiple Sequence Alignment	Allows the identification of conserved regions and motifs in a whole sequence family. It is essential for phylogenetic analysis of sequence families and for prediction of protein secondary and tertiary structure.
	Concept of MSA	The sequences are arranged so that the maximum number of residues is matched up according to a particular scoring function.
	The scoring function for MSA is based on what?	The sum-of-pairs (SP) score: the sum of the scores of all possible pairs of sequences in the multiple alignment, based on a specific scoring matrix. Each pair is scored as a pairwise alignment, taking matches, mismatches and gap costs into account.
	Approaches to MSA	1. Heuristic algorithms: progressive alignment, iterative alignment, and block-based alignment.
	Progressive Alignment Method	A multistep process. It first conducts pairwise alignments based on the Needleman-Wunsch algorithm and records the similarity scores.
	To align additional sequences via the PA method?	The two already aligned sequences are converted to a consensus sequence with gap positions. This is treated as a single sequence in the next step.
	The best-known PA program	Clustal. In ClustalW the W stands for weighting; this version provides a simple text-based interface.
	Advantages of CLUSTAL	1. It does not use a single substitution matrix; instead it applies different scoring matrices when aligning the sequences. The choice of matrix depends on the evolutionary distance measured from the guide tree, e.g. for closely related sequences CLUSTAL uses the BLOSUM62 or PAM120 matrix, while for more divergent sequences BLOSUM45 or PAM250 is preferred. 2. It uses adjustable gap penalties, which allow more insertions and deletions outside conserved regions but fewer inside them.
	Disadvantages of PA Method	1. Not suitable for the multiple alignment of sequences of different lengths, because it is a global alignment based method. 2. As a result of the use of affine gap penalties, long gaps are not allowed. 3. An optimal final alignment cannot be guaranteed: errors made in the initial alignments cannot be corrected once made, so errors build up with successive alignments.
	Improvement on Clustal	T-Coffee. It performs both local and global alignments. Because an optimal alignment is chosen at the initial stage, T-Coffee avoids or minimizes errors in the early stages. However, it is slower than Clustal because of its high computational cost.
	Evaluation of Alignments Obtained from PA	Editing: introducing or removing gaps to maximize biologically meaningful matches, thus avoiding misaligned portions. Program: BioEdit.
	Computational Approaches to Protein Three-Dimensional Structural Modeling and Prediction	1. Homology modeling 2. Threading 3. Ab initio prediction. The first two are knowledge based: they model structures using existing protein structural information in databases.
	Homology Modeling Overview	Builds an atomic model based on an experimentally determined structure that is closely related at the sequence level.
	Threading Overview	Identifies proteins that are structurally similar, with or without detectable sequence similarity.
	Ab Initio Prediction Overview	Predicts and models structures based on the physicochemical principles that govern protein folding, without the use of structural templates.
	Homology Modeling	Also known as comparative modeling. Principle: if two sequences share high sequence similarity, they are most likely to share the same 3D structure.
	Steps in Homology Modeling	1. Template selection: identification of homologous sequences in a database to be used for modeling. 2. Alignment of the target and template sequences. 3. Building a framework structure for the target protein consisting of main chain atoms. 4. Addition and optimization of side chain atoms and loops. 5. Energy optimization. 6. Evaluation of the overall quality of the model.
	Template Selection	Involves searching the Protein Data Bank for homologous proteins with determined structures. The search can be performed using heuristic methods, e.g. BLAST or FASTA. Rule of thumb: a database protein must have at least 30% sequence identity with the query sequence to be accepted as a template.
	Sequence Alignment	Once the structure with the highest identity has been located, the full-length sequences of the target protein and the template need to be aligned with refined algorithms to obtain an optimal alignment, because an incorrect alignment leads to incorrect assignment of residues and therefore to incorrect models. T-Coffee is a suitable algorithm.
	Loop Modeling	In the sequence alignment for modeling there are often regions of insertions and deletions, i.e. gaps in the alignment. Closing the gaps requires loop modeling, which is very difficult and a major source of errors.
	Side Chain Modeling	Once the main chain atoms are built, the positions of the side chain atoms have to be determined. Modeling side chain geometry is important for determining protein-ligand interactions at active sites. A side chain can be built by searching every possible conformation at every torsion angle of the side chain and selecting the ones with the lowest interaction energy with neighboring atoms. Most side chain modeling programs use the concept of rotamers, which are favored side chain torsion angles extracted from known protein crystal structures. Only the possible rotamers with the lowest energy are selected.
	Model Refinement	1. Energy minimization 2. Molecular dynamics simulation. The simulation can be done in a vacuum or in solvent.
	Why are protein structures more conserved than protein sequences?	Because only a small number of protein folds is available compared with the numerous possible protein sequences. As a result, some proteins share the same fold in the absence of sequence similarity.
	Threading or Structural Fold Recognition	Predicts the structural fold of an unknown protein sequence by fitting the sequence to the structures in a structural database and selecting the best-fitting fold. Methods: 1. Pairwise energy method 2. Profile method
	Pairwise Energy Method	A protein sequence is searched against a structural fold database to find the best-matching structural fold using energy-based criteria.
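
The Needleman-Wunsch card above describes global alignment by dynamic programming. The following is a minimal sketch of that idea, not a reference implementation. It assumes a simple match/mismatch scheme and a linear gap penalty rather than the affine penalties or substitution matrices the cards mention; the function name and all parameter values are illustrative assumptions.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align strings a and b; return (score, aligned_a, aligned_b).

    Illustrative scoring only: +1 match, -1 mismatch, -2 per gap position.
    """
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap          # a[:i] aligned against gaps only
    for j in range(1, m + 1):
        score[0][j] = j * gap          # b[:j] aligned against gaps only

    def sub(i, j):
        # Score for pairing residue a[i-1] with residue b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # pair the two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(needleman_wunsch("ACGT", "AGT"))
```

Turning this into a Smith-Waterman style local alignment would mainly mean clamping every cell at zero and starting the traceback from the highest-scoring cell instead of the corner.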
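
The first two FASTA steps above (hashing ktups into a position table, then grouping shared ktups by offset) can be sketched as follows. This is a simplified illustration, not FASTA's actual code: the function names are hypothetical, the offset follows the cards' definition (position in the second sequence minus position in the first), and the later scoring and joining steps are omitted.

```python
from collections import Counter, defaultdict


def ktup_positions(seq, k):
    """Hash table mapping each k-tuple in seq to the positions where it starts."""
    table = defaultdict(list)
    for i in range(len(seq) - k + 1):
        table[seq[i:i + k]].append(i)
    return table


def diagonal_counts(seq1, seq2, k=2):
    """Count shared k-tuples per offset (position in seq2 minus position in seq1).

    Shared k-tuples with the same offset lie on the same diagonal of a dot plot,
    so a high count marks a dense diagonal.
    """
    table1 = ktup_positions(seq1, k)
    counts = Counter()
    for j in range(len(seq2) - k + 1):
        # Look up where this k-tuple of seq2 occurs in seq1, if anywhere.
        for i in table1.get(seq2[j:j + k], ()):
            counts[j - i] += 1
    return counts


if __name__ == "__main__":
    # k=2 is the protein setting named in the FASTA card.
    print(diagonal_counts("ACDEFG", "XXACDEFG", k=2).most_common(10))
```

Calling `most_common(10)` on the result mirrors the "ten highest-density diagonals" step, under the assumption that density is measured simply by the number of shared ktups.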
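
The worked example on the BLAST statistical significance card can be written out explicitly, using that card's symbols (m for database size, n for query length, P for the chance probability of the HSP):

```latex
\[
E = m \times n \times P
  = 10^{12} \times 10^{2} \times 10^{-20}
  = 10^{-6}
\]
```

The exponents simply add (12 + 2 - 20 = -6), which shows how a larger database or a longer query raises E for the same P.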
