Sequence Analysis Lab Report

BHS008-6 Computational Biology
Topic: Sequence Analysis
11th December

Introduction: Dynamic programming
In the era of the human genome, with the need to explore all 3 billion DNA base pairs, computer algorithms have become an essential tool. Most of these string algorithms use dynamic programming in their implementation. According to Giegerich (2000), dynamic programming is the most popular programming method applied in bioinformatics. Its applications in computational biology include assembling DNA sequence data from automated sequencing machines, determining the intron and exon structure of eukaryotic genes, and inferring function through homology to proteins of known function.

Sequence and Structure Analysis
Sequence alignment is one of the pillars of bioinformatics. It is the process of comparing two or more sequences to deduce whether they are related to one another. From this comparison one may infer, for instance, whether two proteins have similar functions or contain similar structural motifs, which are conserved sequences within a protein sequence that correspond to structural or functional regions. Sequence alignment is used to determine whether two or more sequences display sufficient similarity to infer homology.

Two terms are prominent in sequence alignment: similarity and homology. Similarity is a quantity that can be measured, for instance as percentage identity. Homology cannot be measured; it is a conclusion drawn from the similarity data about whether the genes or gene products being compared share a common evolutionary history. Homology may therefore be defined as a putative conclusion reached after examining the optimal alignment between two sequences and assessing their similarity. It is not expressed in degrees, and it implies an evolutionary relationship.
Homologs may be classified as orthologs or paralogs. Orthologs are genes separated by speciation event(s), and they may have similar structural domains and functions. Paralogs are genes separated by duplication events, and they may provide insight into how novel genes arising from duplication adapt to modified cellular functions.

Sequence alignment may be either local or global. Global alignment compares two sequences along their entire length, whereas local alignment finds the most similar regions in the two sequences. Although slow, dynamic programming algorithms are more accurate than the heuristic algorithms FASTA and BLAST (Polyanovsky et al., 2011). BLAST and FASTA are nevertheless the most commonly used algorithms for similarity searches, because they compare sequences much faster, even when whole databases are searched (Altschul et al., 1990; Lipman and Pearson, 1985). These heuristic approaches rely on word matching and hash tables, unlike dynamic programming, which is a mathematically rigorous approach.

Computational biology compares nucleotide or protein sequences to analyze their differences and similarities at the level of bases or amino acids. This information is gathered with the objective of inferring structural, functional and evolutionary relationships among the sequences being analyzed (Baxevanis and Ouellette, 2001). Sequence alignment enables one to map the residues (nucleotides or amino acids) of two or more sequences onto one another. There are two kinds of sequence alignment: pairwise alignment, which involves two sequences, and multiple alignment, which involves more than two. In an alignment, various adjustments are made to accommodate the differences between the sequences being aligned. Residues that are aligned but not identical may represent substitutions along the evolutionary path.
In other cases, regions where one sequence has no residues in common with the others in the alignment may represent insertions in the evolutionary process. Some regions of an alignment are more highly conserved (closer to identical) than others. Such regions suggest residues that may play a crucial role in the structural and functional integrity of the expressed gene product, whether RNA or protein.

Materials and Methods
Dynamic Programming
Given the two sequences AGQDEASY and AGCDEESW, an identity relation was used to construct a comparison matrix (1 if residue i of one sequence is identical to residue j of the other, 0 otherwise). The dynamic programming method proceeded through three stages. First, the two sequences were placed on the two axes of a grid, and the matrix was filled to indicate the relationship between each residue in sequence 1 and every residue in sequence 2. Second, the comparison matrix was converted into the ‘maximum match matrix’, which gives the score of the highest-scoring path starting from a given residue. Starting from the bottom right-hand corner of the comparison matrix, the score at each grid point is added to the highest-scoring box in the following row and column. Third, once this had been done, the highest-scoring path was extracted from the matrix, defining the alignment.

Both the comparison matrix and the maximum match matrix were laid out with sequence 1 (A G Q D E A S Y) along the columns and sequence 2 (A G C D E E S W) along the rows.

Sequence and structure analysis
Pairwise alignment of the two provided protein sequences was performed with the blastp algorithm available on the NCBI website (Altschul et al., 1997).
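The first two dynamic programming stages described in the Materials and Methods can be sketched in Python as follows. This is a minimal illustration, not the procedure used to fill the matrices by hand: the function names `comparison_matrix` and `maximum_match_matrix` are hypothetical, and "the following row and column" is assumed to mean the cells of row i+1 to the right of column j together with the cells of column j+1 below row i. Under that assumption individual interior cells may differ from a hand-filled table, while the top row and the overall score of 5 come out the same as in the maximum match matrix reported in the Results.

```python
def comparison_matrix(seq1, seq2):
    """Identity matrix: rows follow seq2, columns follow seq1; 1 marks identical residues."""
    return [[1 if a == b else 0 for a in seq1] for b in seq2]


def maximum_match_matrix(comp):
    """Fill from the bottom right: each cell adds the best score found in
    the following row (right of this column) and following column (below this row)."""
    n_rows, n_cols = len(comp), len(comp[0])
    mm = [row[:] for row in comp]  # last row and last column stay as in comp
    for i in range(n_rows - 2, -1, -1):
        for j in range(n_cols - 2, -1, -1):
            next_row = mm[i + 1][j + 1:]
            next_col = [mm[k][j + 1] for k in range(i + 1, n_rows)]
            mm[i][j] += max(next_row + next_col)
    return mm


comp = comparison_matrix("AGQDEASY", "AGCDEESW")
mm = maximum_match_matrix(comp)
for label, row in zip("AGCDEESW", mm):
    print(label, *row)
```

The top-left cell holds the score of the best path through the whole grid; extracting that path (the third stage) is left out of this sketch.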
>Bovine cationic trypsin
IVGGYTCGANTVPYQVSLNSGYHFCGGSLINSQWVVSAAHCYKSGIQVRLGEDNINVVEGNEQFISASKSIVHPSYNSNTLNNDIMLIKLKSAASLNSRVASISLPTSCASAGTQCLISGWGNTKSSGTSYPDVLKCLKAPILSDSSCKSAYPGQITSNMFCAGYLEGGKDSCQGDSGGPVVCSGKLQGIVSWGSGCAQKNKPGVYTKVCNYVSWIKQTIASN
>Rat tonin
IVGGYKCEKNSQPWQVAVINEYLCGGVLIDPSWVITAAHCYSNNYQVLLGRNNLFKDEPFAQRRLVRQSFRHPDYIPLIVTNDTEQPVHDHSNDLMLLHLSEPADITGGVKVIDLPTKEPKVGSTCLASGWGSTNPSEMVVSHDLQCVNIHLLSNEKCIETYKDNVTDVMLCAGEMEGGKDTCAGDSGGPLICDGVLQGITSGGATPCAKPKTPAIYAKLIKFTSWIKKVMKENP

For the structure-sequence comparison, the coordinates of trypsin (5PTP) were downloaded as a PDB file from the Protein Data Bank website (www.rcsb.org). The sequences were compared using DeepView/Swiss-PdbViewer v4.1.0 (Guex et al., 2012).

Results
Dynamic programming
Comparison Matrix
    A  G  Q  D  E  A  S  Y
A   1  0  0  0  0  1  0  0
G   0  1  0  0  0  0  0  0
C   0  0  0  0  0  0  0  0
D   0  0  0  1  0  0  0  0
E   0  0  0  0  1  0  0  0
E   0  0  0  0  1  0  0  0
S   0  0  0  0  0  0  1  0
W   0  0  0  0  0  0  0  0

Maximum Match Matrix
    A  G  Q  D  E  A  S  Y
A   5  3  3  2  1  2  0  0
G   3  4  3  3  2  1  0  0
C   3  3  3  2  1  1  0  0
D   2  2  2  3  1  1  0  0
E   2  2  2  2  2  1  0  0
E   1  1  1  1  2  1  0  0
S   0  0  0  0  0  0  1  0
W   0  0  0  0  0  0  0  0

Sequence and structure analysis
a. Alignment of two serine protease sequences: bovine cationic trypsin and rat tonin

>lcl|30919 unnamed protein product
Length=235

 Score = 180 bits (457), Expect = 2e-60, Method: Compositional matrix adjust.
 Identities = 95/235 (40%), Positives = 130/235 (55%), Gaps = 13/235 (6%)

Query  1    IVGGYTCGANTVPYQVSLNSGYHFCGGSLINSQWVVSAAHCYKSGIQVRLGEDNINVVEG  60
            IVGGY C  N+ P+QV++ + Y  CGG LI+  WV++AAHCY +  QV LG +N+   E
Sbjct  1    IVGGYKCEKNSQPWQVAVINEY-LCGGVLIDPSWVITAAHCYSNNYQVLLGRNNLFKDEP  59

Query  61   NEQFISASKSIVHPSY-----------NSNTLNNDIMLIKLKSAASLNSRVASISLPTSC  109
              Q     +S  HP Y             +  +ND+ML+ L   A +   V  I LPT
Sbjct  60   FAQRRLVRQSFRHPDYIPLIVTNDTEQPVHDHSNDLMLLHLSEPADITGGVKVIDLPTKE  119

Query  110  ASAGTQCLISGWGNTKSSGTSYPDVLKCLKAPILSDSSCKSAYPGQITSNMFCAGYLEGG  169
               G+ CL SGWG+T  S       L+C+   +LS+  C   Y   +T  M CAG +EGG
Sbjct  120  PKVGSTCLASGWGSTNPSEMVVSHDLQCVNIHLLSNEKCIETYKDNVTDVMLCAGEMEGG  179

Query  170  KDSCQGDSGGPVVCSGKLQGIVSWG-SGCAQKNKPGVYTKVCNYVSWIKQTIASN  223
            KD+C GDSGGP++C G LQGI S G + CA+   P +Y K+  + SWIK+ +  N
Sbjct  180  KDTCAGDSGGPLICDGVLQGITSGGATPCAKPKTPAIYAKLIKFTSWIKKVMKEN  234

Figure 1: Bovine cationic trypsin and rat tonin alignment

Pairwise alignment of the two sequences revealed conserved regions (see Figure 1 above), especially the stretch of amino acids starting at query position 61. This indicates sequence similarity between bovine cationic trypsin and rat tonin. The identity between the two aligned sequences is 40%; this parameter indicates the extent to which the amino acid sequences are invariant. Positives account for 55% of the alignment. Positives are aligned amino acids that are identical or share similar chemical properties and can therefore substitute for each other, as is the case for isoleucine and leucine in the alignment above. Protein scoring systems use different scoring matrices to score amino acid pairs so as to reflect different scenarios.
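The Identities and Gaps figures in the BLAST output above can be re-derived by counting columns in the aligned rows. The sketch below is an illustrative check only, written in Python; the helper names `alignment_stats` and `percent` are hypothetical, and it assumes that percentages are rounded to the nearest whole number. Positives are not counted here because that would need the full substitution matrix.

```python
# Aligned rows copied from the BLAST output above ('-' marks a gap).
QUERY = (
    "IVGGYTCGANTVPYQVSLNSGYHFCGGSLINSQWVVSAAHCYKSGIQVRLGEDNINVVEG"
    "NEQFISASKSIVHPSY-----------NSNTLNNDIMLIKLKSAASLNSRVASISLPTSC"
    "ASAGTQCLISGWGNTKSSGTSYPDVLKCLKAPILSDSSCKSAYPGQITSNMFCAGYLEGG"
    "KDSCQGDSGGPVVCSGKLQGIVSWG-SGCAQKNKPGVYTKVCNYVSWIKQTIASN"
)
SBJCT = (
    "IVGGYKCEKNSQPWQVAVINEY-LCGGVLIDPSWVITAAHCYSNNYQVLLGRNNLFKDEP"
    "FAQRRLVRQSFRHPDYIPLIVTNDTEQPVHDHSNDLMLLHLSEPADITGGVKVIDLPTKE"
    "PKVGSTCLASGWGSTNPSEMVVSHDLQCVNIHLLSNEKCIETYKDNVTDVMLCAGEMEGG"
    "KDTCAGDSGGPLICDGVLQGITSGGATPCAKPKTPAIYAKLIKFTSWIKKVMKEN"
)


def alignment_stats(a, b):
    """Return (identities, gap columns, alignment length) for two aligned rows."""
    if len(a) != len(b):
        raise ValueError("aligned rows must have equal length")
    identities = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    gaps = sum(1 for x, y in zip(a, b) if x == "-" or y == "-")
    return identities, gaps, len(a)


def percent(part, whole):
    """Whole-number percentage (rounding is an assumption of this sketch)."""
    return round(100 * part / whole)


identities, gaps, length = alignment_stats(QUERY, SBJCT)
print(f"Identities = {identities}/{length} ({percent(identities, length)}%)")
print(f"Gaps = {gaps}/{length} ({percent(gaps, length)}%)")
```

Note that the percentages are taken over the alignment length (235 columns, gaps included), not over the length of either sequence.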
Relating the sequence alignment to three-dimensional structure
Figure 2: Structure of hydrolase (serine proteinase, 5PTP)
Figure 3: Carbon alpha trace of hydrolase structure (serine proteinase, 5PTP)
Conserved; color by secondary structure: helices = red, sheets = yellow

Discussion
As the simple matrix example shows, dynamic programming achieves an optimal alignment, but it involves creating an alignment path matrix, calculating score values stepwise, and backtracking to find the optimal path. In dynamic programming, the problem of aligning the full sequences is typically broken down into alignments of subsequences. The quality of an alignment between two or more nucleotide or protein sequences is expressed as a score.

The two sequences, bovine cationic trypsin and rat tonin, were aligned using the BLAST tool at NCBI (Altschul et al., 1997). The blastp algorithm, which compares a protein query to a protein database, was used. In this case, however, a single subject sequence took the place of the protein database, since the objective was a pairwise alignment. An alignment of two or more sequences yields various parameters. The method used to align these two sequences was compositional matrix adjustment, and 13 gap positions, 6% of the alignment length, were introduced to accommodate the alignment. For scoring, BLOSUM62 is the matrix commonly used in BLAST similarity searches, and it was applied in aligning the two protein sequences in this assignment. The expected value (E-value), reported for this alignment as 2e-60, is a measure of statistical significance.

Amino acids have different biochemical and physical features that govern how readily they replace one another along the evolutionary path. For instance, as seen in the alignment above, isoleucine and leucine are both aliphatic amino acids and may therefore replace each other in the course of evolution, which justifies scoring them as positives in the alignment.
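How a substitution matrix turns this kind of replaceability into "positives" can be illustrated with a short Python sketch. The dictionary below is a small hand-copied excerpt of BLOSUM62 scores, not the full matrix, and the names `BLOSUM62_EXCERPT`, `score` and `is_positive` are hypothetical. It assumes that a pair counts as positive when its score is greater than zero, which is how the Positives line of a BLAST report is usually read.

```python
# Hand-copied excerpt of BLOSUM62 (assumed values; verify against the
# full matrix before reusing them elsewhere).
BLOSUM62_EXCERPT = {
    ("I", "I"): 4,   # identical residues
    ("I", "L"): 2,   # both aliphatic
    ("I", "V"): 3,   # both aliphatic
    ("D", "E"): 2,   # both acidic
    ("K", "R"): 2,   # both basic
    ("L", "D"): -4,  # aliphatic vs acidic
}


def score(a, b, matrix=BLOSUM62_EXCERPT):
    """Look up a symmetric substitution score; raises KeyError if the pair is not in the excerpt."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]


def is_positive(a, b):
    """A pair is counted as a positive when its score is above zero."""
    return score(a, b) > 0


print(is_positive("I", "L"))  # isoleucine vs leucine
print(is_positive("L", "D"))  # leucine vs aspartate
```

Identical residues also score above zero, which is why the positives count in a BLAST report includes the identities.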
The E-value is obtained by relating the observed alignment score, S, to an expected distribution of scores. It can be viewed as the number of alignments with scores at least equal to S that would be expected to occur by chance.

Gaps are introduced into a sequence alignment so that an optimal alignment may be achieved. In an optimal alignment, the number of matches is maximized and the number of gaps minimized, with gaps added in order to reduce mismatches. However, gaps should not be introduced arbitrarily, since with enough gaps even non-homologous sequences can be made to appear aligned. Gaps are therefore penalized to limit their occurrence.

Unlike the dynamic programming algorithm, the BLAST heuristic used to align the two sequences relies on word (k-tuple) methods and can therefore be termed an approximate method (Altschul et al., 1997). BLAST makes assumptions about the nature of the alignment: it first finds a short word match and then extends the alignment until the score stops improving or one of the sequences ends. Heuristic algorithms are much faster, though less sensitive, than dynamic programming algorithms.

The BLOcks SUbstitution Matrix (BLOSUM) used in the alignment is based on amino acid substitutions observed in blocks. These blocks represent conserved amino acid patterns drawn from large sets of related proteins, usually referred to as families.

References
Altschul, S.F., Gish, W., Miller, W., Myers, E. and Lipman, D.J., 1990. Basic local alignment search tool. J. Mol. Biol., 215: 403-410.
Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D.J., 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25: 3389-3402.
Baxevanis, A.D. and Ouellette, B.F.F., 2001. Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, 2nd edition. John Wiley & Sons, Inc.
Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N.
and Bourne, P.E., 2000. The Protein Data Bank. Nucleic Acids Research, 28: 235-242. www.rcsb.org
Giegerich, R., 2000. A systematic approach to dynamic programming in bioinformatics. Bioinformatics, 16(8): 665-677.
Guex, N., Peitsch, M., Schwede, T. and Diemand, A., 2012. DeepView/Swiss-PdbViewer. Swiss Institute of Bioinformatics. Available at http://www.expasy.org/spdbv/
Lipman, D.J. and Pearson, W.R., 1985. Rapid and sensitive protein similarity searches. Science, 227: 1435-1441.
