
Sequence Analysis - Lab Report Example

Summary
The paper "Sequence Analysis" discusses dynamic programming algorithms, which have become an essential tool in the era of the human genome and the need to explore all 3 billion DNA base pairs.


BHS008-6 Computational Biology
Topic: Sequence Analysis
11th December

Introduction: Dynamic programming

In the era of the human genome, with the need to explore all 3 billion DNA base pairs, computer algorithms have become an essential tool. Most of these string algorithms use dynamic programming in their implementation. According to Giegerich (2000), dynamic programming is the most popular programming method applied in bioinformatics. Its applications in computational biology include assembling DNA sequence data from automated sequencing machines, determining the intron and exon structure of eukaryotic genes, and inferring function through homology with proteins of known function.

Sequence and Structure Analysis

Sequence alignment is one of the pillars of bioinformatics. It is the process of comparing two or more sequences to deduce whether they are related to one another. From this comparison, one may infer whether two proteins, for instance, have similar functions or contain similar structural motifs, which are conserved sequences within a protein sequence that correspond to structural or functional regions. Sequence alignment is used to determine whether two or more sequences display sufficient similarity to infer homology.

Two terms are prominent in sequence alignment: similarity and homology. Similarity is a quantity that can be measured, for instance as percentage identity. Homology cannot be measured; it is a conclusion drawn from the similarity data about whether the genes or gene products being compared share a common evolutionary history. Homology may therefore be defined as a putative conclusion reached after examining the optimal alignment between two sequences and assessing their similarity. Homology is not measured in degrees, and it implies an evolutionary relationship.
Homologs may be classified as orthologs or paralogs. Orthologs are genes separated by speciation event(s), and they may have similar structural domains and functions. Paralogs are genes separated by duplication events, and they may provide insight into how novel genes arising from duplication adapt to modified cellular functions.

Sequence alignment may be either local or global. Global sequence alignment compares two sequences along their entire length, whereas local sequence alignment finds the most similar regions in the two sequences being aligned. Although slow, dynamic programming algorithms are more accurate than the heuristic algorithms FASTA and BLAST (Polyanovsky et al., 2011). BLAST and FASTA are nowadays the most commonly used algorithms for similarity searches because they compare sequences quickly, even when whole databases are being searched (Altschul et al., 1990; Lipman and Pearson, 1985). These heuristic approaches rely on word matching and hash tables, unlike dynamic programming, which is a rigorous mathematical approach.

Computational biology compares nucleotide or protein sequences to analyze their differences or similarities at the level of bases or amino acids. This information is gathered with the objective of inferring structural, functional and evolutionary relationships among the sequences being analyzed (Baxevanis and Oullette, 2001). Sequence alignment enables one to map the residues (nucleotides or amino acids) of two or more sequences onto one another. There are two kinds of sequence alignment: pairwise alignment, which involves two sequences, and multiple alignment, which involves more than two. In sequence alignment, various adjustments are made to accommodate the differences between the sequences being aligned. Residues that are aligned but not identical may represent substitutions along the evolutionary path.
In other cases, regions where a sequence has no residues in common with the other sequences in the alignment may represent insertions in the evolutionary process. Sequence alignments also reveal regions that are more highly conserved (highly identical) than others. These regions point to residues that may play a crucial role in the structural and functional integrity of the expressed gene product, whether RNA or protein.

Materials and Methods

Dynamic Programming

Given the two sequences AGQDEASY and AGCDEESW, an identity relation was used to construct a comparison matrix (1 if the residues at positions i and j are identical, 0 otherwise). The dynamic programming method proceeded through three stages. First, the two sequences were placed on the two axes of a grid, and the matrix was filled to indicate the relationship between each residue in sequence 1 and all residues in sequence 2. Second, the comparison matrix was converted into the 'maximum match matrix', which indicates the highest-scoring path starting from a given residue. Starting from the bottom right-hand corner of the comparison matrix, the score at each grid point was added to the highest score found in the following row and column. Third, the highest-scoring path was extracted from the matrix, defining the alignment.

Both the comparison matrix and the maximum match matrix were laid out with AGQDEASY (sequence 1) along the columns and AGCDEESW (sequence 2) along the rows.

Sequence and structure analysis

Pairwise alignment of the two provided protein sequences was carried out using the Blastp algorithm available on the NCBI website (Altschul et al., 1997).
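The three dynamic programming stages described above (comparison matrix, maximum match matrix, path extraction) can be sketched in Python as follows. This is a minimal illustration rather than the procedure actually used for the report: the function names are hypothetical, the phrase "the following row and column" is read here as the diagonal neighbour plus the remainder of the next row and the remainder of the next column, and ties are broken in favour of the diagonal step. Under this reading, the G row evaluates to 3 4 3 2 1 1 0 0, which differs in two cells from the hand-filled maximum match matrix reported in the Results; the top score of 5 and the extracted path are the same either way.

```python
def comparison_matrix(seq1, seq2):
    """Identity scores: rows follow seq2, columns follow seq1."""
    return [[1 if a == b else 0 for a in seq1] for b in seq2]


def maximum_match_matrix(comp):
    """Fill from the bottom right: each cell adds its own score to the best
    cell reachable in the following row and column (assumed reading)."""
    rows, cols = len(comp), len(comp[0])
    mm = [row[:] for row in comp]  # last row and last column keep their scores
    for i in range(rows - 2, -1, -1):
        for j in range(cols - 2, -1, -1):
            reachable = [mm[i + 1][j + 1]]                           # diagonal step
            reachable += mm[i + 1][j + 2:]                           # rest of next row
            reachable += [mm[k][j + 1] for k in range(i + 2, rows)]  # rest of next column
            mm[i][j] = comp[i][j] + max(reachable)
    return mm


def extract_path(mm):
    """Follow the highest-scoring cells from the first row/column onward."""
    rows, cols = len(mm), len(mm[0])
    starts = [(0, j) for j in range(cols)] + [(i, 0) for i in range(1, rows)]
    i, j = max(starts, key=lambda cell: mm[cell[0]][cell[1]])
    path = [(i, j)]
    while i + 1 < rows and j + 1 < cols:
        steps = [(i + 1, j + 1)]                           # listed first, so ties prefer it
        steps += [(i + 1, l) for l in range(j + 2, cols)]  # skip residues in seq1
        steps += [(k, j + 1) for k in range(i + 2, rows)]  # skip residues in seq2
        i, j = max(steps, key=lambda cell: mm[cell[0]][cell[1]])
        path.append((i, j))
    return path


seq1, seq2 = "AGQDEASY", "AGCDEESW"
comp = comparison_matrix(seq1, seq2)
mm = maximum_match_matrix(comp)
path = extract_path(mm)
for i, j in path:
    print(seq2[i], seq1[j], "match" if comp[i][j] else "mismatch")
```

For these two sequences the path runs straight down the diagonal, so the alignment contains no gaps and pairs five identical residues (A, G, D, E and S).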
Bovine cationic trypsin:
IVGGYTCGANTVPYQVSLNSGYHFCGGSLINSQWVVSAAHCYKSGIQVRLGEDNINVVEGNEQFISASKSIVHPSYNSNTLNNDIMLIKLKSAASLNSRVASISLPTSCASAGTQCLISGWGNTKSSGTSYPDVLKCLKAPILSDSSCKSAYPGQITSNMFCAGYLEGGKDSCQGDSGGPVVCSGKLQGIVSWGSGCAQKNKPGVYTKVCNYVSWIKQTIASN

Rat tonin:
IVGGYKCEKNSQPWQVAVINEYLCGGVLIDPSWVITAAHCYSNNYQVLLGRNNLFKDEPFAQRRLVRQSFRHPDYIPLIVTNDTEQPVHDHSNDLMLLHLSEPADITGGVKVIDLPTKEPKVGSTCLASGWGSTNPSEMVVSHDLQCVNIHLLSNEKCIETYKDNVTDVMLCAGEMEGGKDTCAGDSGGPLICDGVLQGITSGGATPCAKPKTPAIYAKLIKFTSWIKKVMKENP

For the structure-sequence comparison, the coordinates of trypsin (5PTP) were downloaded as a PDB file from the Protein Data Bank website (www.rcsb.org). DeepView/Swiss-PdbViewer v4.1.0 (Guex et al., 2012) was used to compare the sequences.

Results

Dynamic programming

Comparison Matrix

     A  G  Q  D  E  A  S  Y
A    1  0  0  0  0  1  0  0
G    0  1  0  0  0  0  0  0
C    0  0  0  0  0  0  0  0
D    0  0  0  1  0  0  0  0
E    0  0  0  0  1  0  0  0
E    0  0  0  0  1  0  0  0
S    0  0  0  0  0  0  1  0
W    0  0  0  0  0  0  0  0

Maximum Match Matrix

     A  G  Q  D  E  A  S  Y
A    5  3  3  2  1  2  0  0
G    3  4  3  3  2  1  0  0
C    3  3  3  2  1  1  0  0
D    2  2  2  3  1  1  0  0
E    2  2  2  2  2  1  0  0
E    1  1  1  1  2  1  0  0
S    0  0  0  0  0  0  1  0
W    0  0  0  0  0  0  0  0

Sequence and structure analysis

a. Alignment of two serine protease sequences: bovine cationic trypsin and rat tonin

>lcl|30919 unnamed protein product
Length=235

Score = 180 bits (457), Expect = 2e-60, Method: Compositional matrix adjust.
Identities = 95/235 (40%), Positives = 130/235 (55%), Gaps = 13/235 (6%)

Query  1    IVGGYTCGANTVPYQVSLNSGYHFCGGSLINSQWVVSAAHCYKSGIQVRLGEDNINVVEG  60
            IVGGY C  N+ P+QV++ + Y  CGG LI+  WV++AAHCY +  QV LG +N+   E
Sbjct  1    IVGGYKCEKNSQPWQVAVINEY-LCGGVLIDPSWVITAAHCYSNNYQVLLGRNNLFKDEP  59

Query  61   NEQFISASKSIVHPSY-----------NSNTLNNDIMLIKLKSAASLNSRVASISLPTSC  109
              Q     +S  HP Y             +  +ND+ML+ L   A +   V  I LPT
Sbjct  60   FAQRRLVRQSFRHPDYIPLIVTNDTEQPVHDHSNDLMLLHLSEPADITGGVKVIDLPTKE  119

Query  110  ASAGTQCLISGWGNTKSSGTSYPDVLKCLKAPILSDSSCKSAYPGQITSNMFCAGYLEGG  169
               G+ CL SGWG+T  S       L+C+   +LS+  C   Y   +T  M CAG +EGG
Sbjct  120  PKVGSTCLASGWGSTNPSEMVVSHDLQCVNIHLLSNEKCIETYKDNVTDVMLCAGEMEGG  179

Query  170  KDSCQGDSGGPVVCSGKLQGIVSWG-SGCAQKNKPGVYTKVCNYVSWIKQTIASN  223
            KD+C GDSGGP++C G LQGI S G + CA+   P +Y K+  + SWIK+ +  N
Sbjct  180  KDTCAGDSGGPLICDGVLQGITSGGATPCAKPKTPAIYAKLIKFTSWIKKVMKEN  234

Figure 1: Bovine cationic trypsin and rat tonin alignment

Pairwise alignment of the two sequences revealed conserved regions (see Figure 1 above), especially in the stretch of amino acids from query position 61 onward. This indicates sequence similarity between bovine cationic trypsin and rat tonin. The identities of the two aligned sequences are 40%, a parameter that indicates the extent to which the amino acid sequences are invariant. The positives account for 55% of the alignment. Positives are amino acids that share similar chemical properties and can therefore substitute for each other, as is the case for isoleucine and leucine in the alignment above. In protein scoring systems, different scoring matrices are used to score amino acid pairs so as to reflect different scenarios.
Relating the sequence alignment to three-dimensional structure

Figure 2: Structure of hydrolase (serine proteinase, 5PTP)
Figure 3: Carbon alpha trace of hydrolase structure (serine proteinase, 5PTP)
Color legend (conserved secondary structure): helices = red, sheets = yellow

Discussion

As seen in the simple matrix example, dynamic programming achieves an optimal alignment, but it involves creating an alignment path matrix, calculating score values stepwise, and backtracking to extract the optimal path. In dynamic programming, the subject sequences are usually broken down into subsequences. The quality of a sequence alignment between two or more nucleotide or protein sequences is measured by a score.

The two sequences, bovine cationic trypsin and rat tonin, were aligned using the BLAST tool at NCBI (Altschul et al., 1997). The Blastp algorithm, which compares a protein query to a protein database, was used. In this case, however, the subject sequence took the place of the protein database, since the objective was a pairwise alignment. An alignment of any two or more sequences produces various parameters. These two sequences were aligned with the compositional matrix adjustment method, and 13 gaps, comprising 6% of the alignment length, were introduced to accommodate the alignment. For scoring, the BLOSUM62 matrix, which is commonly used in BLAST similarity searches, was applied in aligning the two protein sequences in this assignment.

Amino acids have different biochemical and physical features that govern how readily they replace one another along the evolutionary path. For instance, as seen in the alignment above, isoleucine and leucine are both aliphatic amino acids and may therefore replace each other in the course of evolution, which justifies scoring them as positives in the alignment.

The expected value (E-value), given for this alignment as 2e-60, is a measure of statistical significance.
The E-value is obtained by relating the observed alignment score, S, to an expected distribution. It can be viewed as the number of alignments with scores at least equal to S that would be expected to occur by chance.

Gaps are introduced into a sequence alignment so that an optimal alignment may be achieved. In an optimal alignment, the number of matches is maximized and the number of gaps minimized, with gaps added only to reduce mismatches. However, gaps should not be introduced arbitrarily, since with enough gaps even non-homologous sequences can be made to appear aligned. Gaps are therefore penalized to reduce their occurrence.

Unlike the dynamic programming algorithm, the BLAST heuristic used to align the two sequences relies on word or k-tuple methods and can therefore be described as an approximate method (Altschul et al., 1997). BLAST makes assumptions about the nature of the alignment: it first finds a short matching word and then extends the alignment from that match for as long as the score continues to improve. Heuristic algorithms are much faster than dynamic programming algorithms, though less sensitive.

The BLOcks SUbstitution Matrix (BLOSUM) used in the alignment is based on amino acid substitutions observed in blocks. These blocks are large sets of conserved amino acid patterns drawn from families of related proteins.

References

Altschul, S.F., Gish, W., Miller, W., Myers, E. and Lipman, D.J., 1990. Basic local alignment search tool. J Mol Biol., 215:403-410.
Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D.J., 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25:3389-3402.
Baxevanis, A.D. and Oullette, B.F.F., 2001. Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, 2nd edition. John Wiley & Sons, Inc.
Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N. and Bourne, P.E., 2000. The Protein Data Bank. Nucleic Acids Research, 28:235-242. www.rcsb.org
Giegerich, R., 2000. A systematic approach to dynamic programming in bioinformatics. Bioinformatics Review, 16(8):665-677.
Guex, N., Peitsch, M., Schwede, T. and Diemand, A., 2012. DeepView/Swiss-PdbViewer. Swiss Institute of Bioinformatics. Available at http://www.expasy.org/spdbv/
Lipman, D.J. and Pearson, W.R., 1985. Rapid and sensitive protein similarity searches. Science, 227:1435-1441.