Statistical Analysis of Sequence Alignments Can Detect Homology

by taratuta

on 19-01-2017

I. The Molecular Design of Life
7. Exploring Evolution
7.1. Homologs Are Descended from a Common Ancestor
Figure 7.3. Two Classes of Homologs. Homologs that perform identical or very similar functions in different organisms
are called orthologs, whereas homologs that perform different functions within one organism are called paralogs.
7.2. Statistical Analysis of Sequence Alignments Can Detect Homology
Conceptual Insights, Sequence Analysis, provides opportunities to
interactively explore issues involved in sequence alignment.
Conceptual Insights, appearing throughout the book, are interactive
animations that help you build your understanding of key biochemical
principles and concepts. To access, go to the Web site www.whfreeman.com/biochem5
and select the chapter, Conceptual Insights, and the title.
A significant sequence similarity between two molecules implies that they are likely to have the same evolutionary
origin and, therefore, the same three-dimensional structure, function, and mechanism. Although both nucleic acid and
protein sequences can be compared to detect homology, a comparison of protein sequences is much more effective for
several reasons, most notably that proteins are built from 20 different building blocks, whereas RNA and DNA are
synthesized from only 4 building blocks.
To illustrate sequence-comparison methods, let us consider a class of proteins called the globins. Myoglobin is a protein
that binds oxygen in muscle, whereas hemoglobin is the oxygen-carrying protein in blood (Section 10.2). Both proteins
cradle a heme group, an iron-containing organic molecule that binds the oxygen. Each human hemoglobin molecule is
composed of four heme-containing polypeptide chains, two identical α chains and two identical β chains. Here, we
consider only the α chain. We wish to examine the similarity between the amino acid sequence of the human α chain
and that of human myoglobin (Figure 7.4). To detect such similarity, methods have been developed for sequence
alignment.
How can we tell where to align the two sequences? The simplest approach is to compare all possible juxtapositions of
one protein sequence with another, in each case recording the number of identical residues that are aligned with one
another. This comparison can be accomplished by simply sliding one sequence past the other, one amino acid at a time,
and counting the number of matched residues (Figure 7.5).
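The sliding comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used to produce Figure 7.5: the function names and the short toy sequences are assumptions introduced here, and the real globin sequences of Figure 7.4 would be substituted for them.

```python
def count_identities(seq_a, seq_b, offset):
    """Count identical residues when seq_b is shifted right by `offset`
    positions relative to seq_a (a negative offset shifts it left)."""
    matches = 0
    for i, residue in enumerate(seq_b):
        j = i + offset  # position in seq_a that faces seq_b[i]
        if 0 <= j < len(seq_a) and seq_a[j] == residue:
            matches += 1
    return matches


def rank_juxtapositions(seq_a, seq_b):
    """Return (offset, identities) for every overlapping juxtaposition,
    sorted so that the offset with the most identities comes first."""
    offsets = range(-(len(seq_b) - 1), len(seq_a))
    scored = [(off, count_identities(seq_a, seq_b, off)) for off in offsets]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy sequences (hypothetical, not the sequences in Figure 7.4).
ranked = rank_juxtapositions("HEAGAWGHEE", "PAWHEAE")
print(ranked[:3])  # the three juxtapositions with the most identities
```

Plotting the identity count against the offset gives a graph of the kind described for Figure 7.5.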
For hemoglobin α and myoglobin, the best alignment reveals 23 sequence identities, spread throughout the central parts
of the sequences. However, a nearby alignment showing 22 identities is nearly as good. In this alignment, the identities
are concentrated toward the amino-terminal end of the sequences. The sequences can be aligned to capture most of the
identities in both alignments by introducing a gap into one of the sequences (Figure 7.6). Such gaps must often be
inserted to compensate for the insertions or deletions of nucleotides that may have taken place in the gene for one
molecule but not the other in the course of evolution.
The use of gaps substantially increases the complexity of sequence alignment because, in principle, the insertion of gaps
of arbitrary sizes must be considered throughout each sequence. However, methods have been developed for the
insertion of gaps in the automatic alignment of sequences. These methods use scoring systems to compare different
alignments, and they include penalties for gaps to prevent the insertion of an unreasonable number of them. Here is an
example of such a scoring system: each identity between aligned sequences results in +10 points, whereas each gap
introduced, regardless of size, results in -25 points. For the alignment shown in Figure 7.6, there are 38 identities and 1
gap, producing a score of (38 × 10 - 1 × 25 = 355). Overall, there are 38 matched amino acids in an average length of
147 residues, so the sequences are 25.9% identical. The next step is to ask: Is this percentage of identity significant?
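The arithmetic of this example scoring system can be checked with a short sketch. The point values (+10 per identity, -25 per gap regardless of size) come from the text, and the lengths 141 and 153 are those given for the two sequences in Figure 7.4; the function names and the convention of writing gaps as '-' are assumptions made for illustration.

```python
import re

IDENTITY_POINTS = 10  # from the text: +10 for each identity
GAP_PENALTY = 25      # from the text: -25 for each gap, regardless of size


def score_gapped_alignment(row_a, row_b):
    """Return (identities, gaps, score) for two aligned rows of equal length.

    Gaps are written as '-'; a run of consecutive '-' counts as one gap.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    identities = sum(1 for a, b in zip(row_a, row_b) if a == b and a != "-")
    gaps = len(re.findall(r"-+", row_a)) + len(re.findall(r"-+", row_b))
    score = identities * IDENTITY_POINTS - gaps * GAP_PENALTY
    return identities, gaps, score


def percent_identity(identities, len_a, len_b):
    """Identities divided by the average length of the two sequences."""
    return 100.0 * identities / ((len_a + len_b) / 2)


# Numbers quoted in the text for the alignment in Figure 7.6.
print(38 * IDENTITY_POINTS - 1 * GAP_PENALTY)       # 355
print(round(percent_identity(38, 141, 153), 1))     # 25.9
```

Because a gap costs the same 25 points whatever its length, this scheme discourages many short gaps more than one long one.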
7.2.1. The Statistical Significance of Alignments Can Be Estimated by Shuffling
The similarities in sequence in Figure 7.5 appear striking, yet there remains the possibility that a grouping of sequence
identities has occurred by chance alone. How can we estimate the probability that a specific series of identities is a
chance occurrence? To make such an estimate, the amino acid sequence in one of the proteins is "shuffled" (that is,
randomly rearranged) and the alignment procedure is repeated (Figure 7.7). This process is repeated to build up a
distribution showing, for each possible score, the number of shuffled sequences that received that score.
When this procedure is applied to the sequences of myoglobin and hemoglobin α, the authentic alignment clearly stands
out (Figure 7.8). Its score is far above the mean for the alignment scores based on shuffled sequences. The odds of such a
deviation occurring owing to chance alone are approximately 1 in 10²⁰. Thus, we can comfortably conclude that the
two sequences are genuinely similar; the simplest explanation for this similarity is that these sequences are
homologous, that is, that the two molecules have descended by divergence from a common ancestor.
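The shuffling procedure can be sketched as follows. This is a simplified illustration under stated assumptions: it scores each alignment by the best ungapped identity count rather than by the gapped scoring scheme used in the text, and the function names, the number of trials, and the use of a z-score to summarize how far the authentic score lies from the shuffled mean are choices made here, not details given in the source.

```python
import random
import statistics


def best_identity_count(seq_a, seq_b):
    """Highest number of identities over all ungapped juxtapositions."""
    best = 0
    for offset in range(-(len(seq_b) - 1), len(seq_a)):
        matches = sum(
            1
            for i, residue in enumerate(seq_b)
            if 0 <= i + offset < len(seq_a) and seq_a[i + offset] == residue
        )
        best = max(best, matches)
    return best


def shuffle_test(seq_a, seq_b, trials=1000, seed=0):
    """Compare the authentic score with scores obtained after shuffling seq_b.

    Returns (real_score, shuffled_scores, z_score), where z_score is the
    number of standard deviations by which the authentic score exceeds the
    mean of the shuffled scores.
    """
    rng = random.Random(seed)
    real = best_identity_count(seq_a, seq_b)
    shuffled_scores = []
    for _ in range(trials):
        residues = list(seq_b)
        rng.shuffle(residues)  # same amino acid composition, random order
        shuffled_scores.append(best_identity_count(seq_a, "".join(residues)))
    mean = statistics.mean(shuffled_scores)
    spread = statistics.pstdev(shuffled_scores)
    if spread > 0:
        z_score = (real - mean) / spread
    else:
        z_score = 0.0 if real == mean else float("inf")
    return real, shuffled_scores, z_score
```

A histogram of `shuffled_scores` corresponds to the distribution described for Figure 7.8, with the authentic score marked for comparison.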
7.2.2. Distant Evolutionary Relationships Can Be Detected Through the Use of
Substitution Matrices
The scoring scheme in Section 7.2.1 assigns points only to positions occupied by identical amino acids in the two
sequences being compared. No credit is given for any pairing that is not an identity. However, not all substitutions are
equivalent. Some are structurally conservative substitutions, replacing one amino acid with another that is similar in size
and chemical properties. Such conservative amino acid substitutions may have relatively minor effects on protein
structure and can thus be tolerated without compromising function. In other substitutions, an amino acid replaces one
that is dissimilar. Furthermore, some amino acid substitutions result from the replacement of only a single nucleotide in
the gene sequence, whereas others require two or three replacements. Conservative and single-nucleotide substitutions
are likely to be more common than are substitutions with more radical effects. How can we account for the type of
substitution when comparing sequences? We can approach this problem by first examining the substitutions that have
actually taken place in evolutionarily related proteins.
From the examination of appropriately aligned sequences, substitution matrices can be deduced. In these matrices, a
large positive score corresponds to a substitution that occurs relatively frequently, whereas a large negative score
corresponds to a substitution that occurs only rarely. The Blosum-62 substitution matrix illustrated in Figure 7.9 is an
example. The highest scores in this substitution matrix indicate that amino acids such as cysteine (C) and tryptophan (W)
tend to be conserved more than those such as serine (S) and alanine (A). Furthermore, structurally conservative
substitutions such as lysine (K) for arginine (R) and isoleucine (I) for valine (V) have relatively high scores. When two
sequences are compared, each substitution is assigned a score based on the matrix. In addition, a gap penalty is often
assigned according to the size of the gap. For example, the introduction of a gap lowers the alignment score by 12 points
and the extension of an existing gap costs 2 points per residue. Using this scoring system, the alignment shown in Figure
7.6 receives a score of 115. In many regions, most substitutions are conservative (defined as those substitutions with
scores greater than 0) and relatively few are strongly disfavored types (Figure 7.10).
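Scoring with a substitution matrix and a size-dependent gap penalty can be sketched as below. Several things here are assumptions made for illustration: the matrix is a tiny made-up excerpt covering five amino acids, and its values are not taken from Figure 7.9; the function names are hypothetical; and the gap cost is read as 12 points for the first residue of a gap plus 2 points for each additional residue, which is one possible interpretation of the penalties quoted in the text.

```python
import re

# Illustrative values only, NOT the Blosum-62 scores shown in Figure 7.9.
TOY_MATRIX = {
    ("K", "K"): 5, ("R", "R"): 5, ("I", "I"): 4, ("V", "V"): 4, ("W", "W"): 11,
    ("K", "R"): 2, ("I", "V"): 3,                      # conservative pairs
    ("K", "I"): -3, ("K", "V"): -2, ("K", "W"): -3,
    ("R", "I"): -3, ("R", "V"): -3, ("R", "W"): -3,
    ("I", "W"): -3, ("V", "W"): -3,
}

GAP_OPEN = 12    # from the text: introducing a gap costs 12 points
GAP_EXTEND = 2   # from the text: each extension costs 2 points per residue


def pair_score(a, b):
    """Look up a substitution score; the matrix is symmetric."""
    return TOY_MATRIX[(a, b)] if (a, b) in TOY_MATRIX else TOY_MATRIX[(b, a)]


def gap_cost(length):
    """Assumed reading: 12 for the first gapped residue, 2 for each further one."""
    return GAP_OPEN + GAP_EXTEND * (length - 1)


def matrix_score(row_a, row_b):
    """Score two aligned rows of equal length; gaps are written as '-'."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    total = sum(
        pair_score(a, b) for a, b in zip(row_a, row_b) if a != "-" and b != "-"
    )
    for row in (row_a, row_b):
        for run in re.findall(r"-+", row):
            total -= gap_cost(len(run))
    return total


print(matrix_score("KIW", "RVW"))      # 2 + 3 + 11 = 16
print(matrix_score("KI--W", "KIVVW"))  # 5 + 4 + 11 - (12 + 2) = 6
```

Unlike identity-only scoring, the first example earns credit for the K/R and I/V pairings even though neither is an identity.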
This scoring system detects homology between less obviously related sequences with greater sensitivity than would a
comparison of identities only. Consider, for example, the protein leghemoglobin, an oxygen-binding protein found in the
roots of some plants. The amino acid sequence of leghemoglobin from the herb lupine can be aligned with that of human
myoglobin and scored by using either the simple scoring scheme based on identities only or the Blosum-62 scoring
matrix (see Figure 7.9). Repeated shuffling and scoring provides a distribution of alignment scores (Figure 7.11).
Scoring based on identities only indicates that the odds of the alignment between myoglobin and leghemoglobin
occurring by chance alone are 1 in 20. Thus, although the level of similarity suggests a relationship, there is a 5% chance
that the similarity is accidental on the basis of this analysis. In contrast, scoring with the substitution matrix
incorporates the effects of conservative substitutions. From such an analysis, the odds of the alignment occurring by
chance are calculated to be approximately 1 in 300. Thus, an analysis performed by using the substitution matrix reaches
a much firmer conclusion about the evolutionary relationship between these proteins (Figure 7.12).
Experience with sequence analysis has led to the development of simpler rules of thumb. For sequences longer than 100
amino acids, sequence identities greater than 25% are almost certainly not the result of chance alone; such sequences are
probably homologous. In contrast, if two sequences are less than 15% identical, pairwise comparison alone is unlikely to
indicate statistically significant similarity. For sequences that are between 15% and 25% identical, further analysis is
necessary to determine the statistical significance of the alignment. It must be emphasized that the lack of a statistically
significant degree of sequence similarity does not rule out homology. The sequences of many proteins that have
descended from common ancestors have diverged to such an extent that the relationship between the proteins can no
longer be detected from their sequences alone. As we will see, such homologous proteins can often be detected by
examining three-dimensional structures.
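These rules of thumb can be written as a small decision function. The thresholds (100 residues, 25%, 15%) come from the text; the function name, the return strings, and the choice to apply the 15% rule only to sequences longer than 100 residues are assumptions made for this sketch.

```python
def rule_of_thumb(percent_identity, length):
    """Classify a pairwise comparison using the thresholds quoted in the text."""
    if length <= 100:
        # The text states the rules for sequences longer than 100 amino acids.
        return "rule of thumb not applicable"
    if percent_identity > 25:
        return "probably homologous"
    if percent_identity < 15:
        return "pairwise comparison unlikely to show significant similarity"
    return "further analysis needed"


print(rule_of_thumb(25.9, 147))  # probably homologous
```

A result of "unlikely to show significant similarity" does not rule out homology; it only means that sequence comparison alone cannot establish it.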
7.2.3. Databases Can Be Searched to Identify Homologous Sequences
When the sequence of a protein is first determined, comparing it with all previously characterized sequences can be a
source of tremendous insight into its evolutionary relatives and, hence, its structure and function. Indeed, an extensive
sequence comparison is almost always the first analysis performed on a newly elucidated sequence. The sequence
alignment methods heretofore described are used to compare an individual sequence with all members of a database of
known sequences.
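Comparing one sequence with every member of a database can be sketched as a loop that scores each entry and ranks the results. This is a toy illustration only: the database entries and their names are hypothetical, the scoring is a simple ungapped identity count rather than the matrix-based scoring described above, and practical database searches rely on faster methods than this exhaustive comparison.

```python
def best_identity_count(query, subject):
    """Highest number of identities over all ungapped juxtapositions."""
    best = 0
    for offset in range(-(len(subject) - 1), len(query)):
        matches = sum(
            1
            for i, residue in enumerate(subject)
            if 0 <= i + offset < len(query) and query[i + offset] == residue
        )
        best = max(best, matches)
    return best


def search_database(query, database, top=3):
    """Score the query against every database entry; return the best hits."""
    hits = [(name, best_identity_count(query, seq)) for name, seq in database.items()]
    hits.sort(key=lambda hit: hit[1], reverse=True)
    return hits[:top]


# Hypothetical entries, invented for this example.
TOY_DATABASE = {
    "entry_1": "MKVLAAGIVG",
    "entry_2": "GGSTPWWQNC",
    "entry_3": "AMKVLSAGIV",
}

for name, score in search_database("MKVLAAGIVG", TOY_DATABASE):
    print(name, score)
```

In practice, each hit would then be assessed for statistical significance, for example by the shuffling approach of Section 7.2.1.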
In 1995, investigators reported the first complete sequence of the genome of a free-living organism, the bacterium
Haemophilus influenzae. Of 1743 identified open reading frames (Section 6.3.2), 1007 (58%) could be linked by
sequence-comparison methods to some protein of known function that had been previously characterized in another
organism. An additional 347 open reading frames could be linked to sequences in the database for which no function had
yet been assigned ("hypothetical proteins"). The remaining 389 sequences did not match any sequence present in the
database at the time at which the Haemophilus influenzae sequence was completed. Thus, investigators were able to
identify likely functions for more than half the proteins within this organism solely through the use of sequence-comparison methods.
Figure 7.4. Amino Acid Sequences of Human Hemoglobin (α chain) and Human Myoglobin. Hemoglobin α is
composed of 141 amino acids; myoglobin consists of 153 amino acids. (One-letter abbreviations designating amino acids
are used; see Table 3.2.)
Figure 7.5. Comparing the Amino Acid Sequences of Hemoglobin α and Myoglobin. (A) A comparison is made by
sliding the sequences of the two proteins past one another, one amino acid at a time, and counting the number of amino
acid identities between the proteins. (B) The two alignments with the largest number of matches are shown above the
graph, which plots the matches as a function of alignment.
Figure 7.6. Alignment with Gap Insertion. The alignment of hemoglobin α and myoglobin after a gap has been
inserted into the hemoglobin α sequence.
Figure 7.7. The Generation of a Shuffled Sequence.
Figure 7.8. Statistical Comparison of Alignment Scores. Alignment scores are calculated for many shuffled
sequences, and the number of sequences generating a particular score is plotted against the score. The resulting plot is a
distribution of alignment scores occurring by chance. The alignment score for hemoglobin α and myoglobin (shown in
red) is substantially greater than any of these scores, strongly suggesting that the sequence similarity is significant.
Figure 7.9. A Graphic View of the Blosum-62 Substitution Matrix. This scoring scheme was derived by examining
substitutions that occur within aligned sequence blocks in related proteins. Amino acids are classified into four groups
(charged, red; polar, green; large and hydrophobic, blue; other, black). Substitutions that require the change of only a
single nucleotide are shaded. To find the score for a substitution of, for instance, a Y for an H, you find the Y in the
column having H (boxed) at the top and check the number at the left. In this case, the resulting score is 3.
Figure 7.10. Alignment with Conservative Substitutions Noted. The alignment of hemoglobin α and myoglobin with
conservative substitutions indicated by yellow shading and identities by orange.