Proteomics
Chapter 25 / Genomics II: Functional Genomics, Proteomics, and Bioinformatics

in turn, should give us important clues about how these pathways work in humans. Moreover, yeast cells can be used as human surrogates to test the effects of knocking out the yeast ortholog of a known human disease gene.

SUMMARY Single-nucleotide polymorphisms can probably account for many genetic conditions caused by single genes, and even multiple genes. They might also be able to predict a person’s response to drugs. A haplotype map with over 1 million SNPs will make it easier to sort out the important SNPs from those with no effect. Structural variation (insertions, deletions, inversions, and other rearrangements of chunks of DNA) is also a surprisingly prominent source of variation in human genomes. Some structural variation can in principle predispose certain people to contract diseases, but some is presumably benign, and some is demonstrably beneficial.

25.2 Proteomics

Earlier in this chapter, we learned that the study of an organism’s proteome, that is, the properties and activities of all the proteins that organism makes in its lifetime, is called proteomics. Whereas the task of analyzing an organism’s genome, or even its transcriptome, is relatively straightforward, the task of analyzing an organism’s proteome is anything but simple, in large part because of the complexity of proteins relative to nucleic acids. Indeed, with current techniques, proteomics studies on complex organisms can examine only a fraction of the total proteome. Given this difficulty, why are scientists even interested in studying gene expression at the protein level, when they already have transcriptomics, in which they can probe the expression of vast numbers of genes simultaneously by looking at the levels of their transcripts?
Part of the answer is that we now know that a large fraction, perhaps 50% or more, of polyadenylated RNAs in human cells do not code for proteins. These are called noncoding RNAs (ncRNAs), and, as we have seen, they are also known as transcripts of unknown function (TUFs). They are interesting in their own right, but their level of expression tells us nothing about protein expression levels. Another part of the answer is that the sequence of a protein-encoding gene and the level of its expression may give little or no information about the activity of its protein product. Another part of the answer is that the level of transcription of a gene gives only a rough idea of the real level of expression of that gene. For one thing, an mRNA may be produced in abundance, but degraded rapidly, or translated inefficiently, so the amount of protein produced is minimal. For another, many proteins experience posttranslational modifications that have a profound effect on their activities. For example, some proteins are not active until they are phosphorylated. Thus, if the cell is not phosphorylating such a protein at a given time, production of a large amount of mRNA for that protein would give a misleading picture of the true level of expression of the corresponding gene. Furthermore, many transcripts give rise to more than one protein—through alternative splicing, or alternative posttranslational modification. So measuring the level of a gene transcript doesn’t necessarily tell what protein products will be made. Finally, many polypeptides form large complexes with other polypeptides, and the true expression of each polypeptide’s function occurs only in the context of the complex. Therefore, if we want to measure real gene expression, we must look at the protein level. To analyze all the proteins in an organism, we need to do two things: First, we need to separate all those proteins from one another. 
Second, we have to analyze each protein by identifying it and measuring its activity. In the next two sections we will introduce some of the ways molecular biologists do these things.

SUMMARY The sum of all proteins produced by an organism is its proteome, and the study of these proteins, even smaller sets of them, is called proteomics. Such studies give a more accurate picture of gene expression than transcriptomics studies do.

Protein Separations

One of the best separation tools available is two-dimensional gel electrophoresis, which was invented in the 1970s (Chapter 5). As powerful as that technique is, it is not up to the job of resolving all the tens of thousands of proteins in the human proteome. An average 2-D gel can resolve only about 2000 proteins, and even the best gel in the best hands can resolve only about 11,000 proteins. This problem is compounded by the fact that the performance of 2-D electrophoresis is unpredictable, and it frequently seems to be more art than science. Another problem is that many very interesting membrane proteins are too hydrophobic to dissolve in the buffers used in 2-D electrophoresis, so they cannot be seen at all. Finally, many proteins are present in such tiny quantities in the cell that a 2-D gel cannot detect them. Most of these problems are presently intractable, but scientists have dealt with the 2-D gel resolution problem by analyzing different cellular compartments separately. For example, they can start with just the nucleus, or even a subcompartment like the nucleolus or a protein assembly like the nuclear pore complex. With many fewer proteins to separate, resolution is not such a serious problem.
Protein Analysis

Once the proteins are separated and quantified, how are they analyzed? First, they have to be identified, and the best method now available works like this: Individual spots are cut out of the gel and cleaved into peptides with proteolytic enzymes. These peptides can then be identified by mass spectrometry. Figure 25.18 illustrates a popular technique known by the cumbersome title matrix-assisted laser desorption-ionization time-of-flight (MALDI-TOF) mass spectrometry. In this procedure, a peptide is placed on a matrix, which causes the peptide to form crystals. Then the peptide on the matrix is ionized with a laser beam (the matrix helps the peptide ionize), and an increase in voltage at the matrix is used to shoot the ions toward a detector. Assuming all the ions have just one charge (and almost all do), the time it takes an ion to reach the detector depends on its mass. The higher the mass, the longer the time of flight of the ion (for a fixed accelerating voltage, the flight time scales with the square root of the mass-to-charge ratio). In a MALDI-TOF mass spectrometer, the ions can also be deflected with an electrostatic reflector that focuses the ion beam and directs it toward a second detector. Thus, we can determine the masses of the ions reaching the second detector with high precision, and these masses can reveal the exact chemical compositions of the peptides.

Figure 25.18 Principle behind MALDI-TOF mass spectrometry. [Diagram labels: laser pulse, sample, matrix, ion beam, detector 1, electrostatic ion reflector, focused ion beam, detector 2.] Place a sample (a peptide in this case) on the matrix at left and ionize it with a laser pulse. An electrical potential difference between the matrix and the sample then accelerates the ionized sample toward detector 1. The time it takes the ions to reach detector 1 depends on their masses, so one can learn much about their masses by analyzing the time of flight to detector 1. Alternatively, one can turn on an electrostatic ion reflector in front of detector 1 to focus the ions and reflect them toward detector 2. This detector gives even more precise data about the masses of the ions, according to their times of flight.

Then these ions can be broken at their peptide bonds by a process known as collision-induced dissociation (CID). Experimenters do this by accelerating the ions and colliding them with a neutral gas to break them, mostly at their peptide bonds, then sending the new peptide ions to another analyzer to determine their molecular makeup. Because this involves two mass spectrometry steps in a row, it is called MS/MS. By comparing the masses of ions differing by just one amino acid, the nature of the lost amino acids can be determined one by one, which leads to a sequence, as illustrated in Figure 25.19.

Figure 25.19 Sequencing a peptide by mass spectrometry (MS/MS). [Spectrum: relative abundance (0 to 100) plotted against m/z (600 to 1800) for the peptide NH2-V-P-T-P-N-V-S-V-V-D-L-T-C-R-COOH.] The molecular ion is the ionized peptide at top, linked through its cysteine residue (C) to an adduct known as ICAT, which we will discuss in the next section. Its nature is not important here. The molecular ion was fragmented by CID, and the fragment ions were then subjected to a second round of MS, yielding the spectrum shown below the sequence. The relative abundance of each ion is plotted against its mass/charge ratio (m/z). The charge of each ion is assumed to be +1 in this experiment. Starting at the right, measuring the exact mass differences between the most prominent ions, one can deduce the amino acid that was lost to generate the next ion to the left. For example, the difference between the masses of the last two ions on the right shows that a threonine (T) was lost. Continuing in this way, and following the top (solid) arrows, one can read the sequence TPNVSVVDLTC-ICAT. The ion also fragmented from the other end, giving the sequence shown on the bottom with dashed arrows between major ions.

If the sequence of the whole genome is known, we know what proteins to expect, so a computer can use the information from the mass spectrometer to match each spot on the 2-D gel with one of the genes in the genome, and therefore predict the sequence of the whole protein. For example, the sequence information determined in Figure 25.19 was enough to identify the peptide as belonging to glyceraldehyde-3-phosphate dehydrogenase. However, knowing the sequence of a protein does not necessarily tell us that protein’s activity, so further research will be necessary to determine the activities of many proteins.

You may be thinking that it would be nice to make a microchip that could identify thousands of proteins at once, as DNA microarrays identify thousands of RNAs at once in functional genomics studies. That would remove the need to separate the proteins because a mixture of many proteins could simply be incubated with the chip to see what binds. One such strategy would be to produce antibodies that can recognize proteins specifically and quantitatively and place them on microchips. But many obstacles stand in the way of realizing that dream. To begin with, antibodies are much more expensive and time-consuming to produce than oligonucleotides. In fact, the task of generating antibodies for every human protein is unthinkably vast at present.
Moreover, the task of detecting low-abundance proteins, already impossible for many proteins using 2-D gels, would only be exacerbated by the miniaturization of microchip technology. On the other hand, the technology to complete the human genome in a reasonable period of time was not available when that project was first proposed in the mid-1980s, but the project stimulated the development of the technology. Perhaps we will experience a similar phenomenon if a full-scale human proteome project is initiated.

SUMMARY Current research in proteomics requires first that proteins be resolved, sometimes on a massive scale. The best tool available for separation of many proteins at once is 2-D gel electrophoresis. After they are separated, proteins must be identified, and the best method for doing that involves digestion of the proteins one by one with proteases, and identifying the resulting peptides by mass spectrometry. Someday microchips with antibodies attached may allow analysis of proteins in complex mixtures without separation.

Quantitative Proteomics

Mass spectrometry is now able to identify proteins as they emerge from high-performance separation procedures, such as capillary chromatography, or even in mixtures without separation. But mass spectrometry is not inherently a quantitative method, so it has been difficult to use it to analyze the expression levels of proteins. However, beginning at the end of the 1990s, analytical chemists developed methods that can tell us how much of a given protein is present in cells under one set of conditions, compared to the concentration of that same protein in cells under a different set of conditions. For example, such a method can measure the increase in concentration of a protein when the gene for that protein is turned on by an inducer. Here is how one such method, using isotope-coded affinity tags (ICATs), works.

Experimenters couple affinity tags to proteins through the sulfhydryl groups of their cysteine side chains. These affinity tags typically contain three parts, illustrated generically in Figure 25.20: a sulfhydryl-reactive group at one end that can link to a protein’s cysteine side chains; a linker in the middle that contains several atoms of either a normal isotope (e.g., hydrogen), or a heavy isotope (e.g., deuterium); and an affinity reagent such as biotin at the other end, which allows convenient purification of a protein or peptide bearing the tag. In the example in Figure 25.20, the heavy tag would be 8 Daltons heavier than the light tag, by virtue of its eight deuteriums. This permits peptides carrying the heavy tag and their light-tagged counterparts to be identified easily in mass spectra because they appear as a pair of peaks 8 Da apart.

Figure 25.20 A generic ICAT tag. [Diagram labels: affinity reagent (e.g., biotin); sulfhydryl-reactive group.] One end (blue) contains a sulfhydryl-reactive group that binds to cysteine side chains. The middle contains a number of positions (red) that can be either all light isotopes (e.g., hydrogen) or all heavy isotopes (e.g., deuterium). The left end (yellow) contains an affinity reagent such as biotin, which allows easy purification of tagged proteins or peptides.

How does this help in quantification? Consider cells grown under two conditions: with and without serum, for example. The question is how much change we see in the concentrations of proteins when serum is added to the medium in which cells are growing. Figure 25.21 shows one approach to this question. In this case, the investigator could add light ICATs to proteins from cells grown in the absence of serum (condition 1), and heavy ICATs to proteins from cells grown in the presence of serum (condition 2).
Figure 25.21 Using ICATs to measure the change in protein concentrations upon shift in growth conditions. [Workflow: (a) combine proteins; (b) proteolyze; (c) affinity-purify tagged peptides; (d) LC-MS. Plot: relative abundance versus retention time for numbered peptides.] Cells are grown under two different conditions (e.g., without [condition 1] and with [condition 2] serum). Proteins are extracted from cells grown under both conditions and tagged with either a light ICAT (condition 1, blue) or a heavy ICAT (condition 2, red). The tagged proteins are combined and proteolyzed, and the resulting tagged peptides are subjected to LC-MS. MS resolves the peptides derived from condition 1 and condition 2 because of their small difference in mass (8 Da, in the example in Figure 25.20). Thus, each peptide appears as a pair of peaks, and the relative areas under these peaks correspond to the change in concentration of the protein to which each peptide belongs. That protein can frequently be identified by sequencing the peptide by MS/MS.

Then the proteins could be mixed, hydrolyzed with a protease such as trypsin, affinity-purified using the affinity reagent, and subjected to liquid chromatography-mass spectrometry (LC-MS). In LC-MS, the peptides are separated by liquid chromatography in a fine capillary, then fed into a mass spectrometer, in which each peptide appears as a pair of peaks, separated by a molecular mass defined by the ICATs in use (e.g., 8 Da). The heavier of the peaks in each pair comes from the cells grown in the presence of serum, and the lighter of the two comes from cells grown without serum. Their relative areas, which can be determined by expanding the spectrum to reveal true peaks instead of lines, tell us the change in the amount of each peptide upon addition of serum to the medium.
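The peak-pair comparison just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `pair_icat_peaks`, the peak list, and the tolerance are hypothetical, each peptide is assumed to carry a single charge and a single tag (one cysteine), and the peak areas are assumed to have been integrated already.

```python
def pair_icat_peaks(peaks, delta=8.0, tolerance=0.1):
    """Pair each light peak with the peak `delta` Da heavier.

    peaks: list of (mass_da, area) tuples (hypothetical, pre-integrated).
    Returns a list of (light_mass, heavy_to_light_area_ratio) tuples.
    """
    peaks = sorted(peaks)          # ascending mass, so the light peak comes first
    used = set()                   # indices already assigned to a pair
    results = []
    for i, (m_light, a_light) in enumerate(peaks):
        if i in used:
            continue
        for j in range(i + 1, len(peaks)):
            if j in used:
                continue
            m_heavy, a_heavy = peaks[j]
            # Accept the pair when the mass gap matches the tag difference.
            if abs((m_heavy - m_light) - delta) <= tolerance:
                used.update((i, j))
                results.append((m_light, a_heavy / a_light))
                break
    return results


# Invented peaks: three peptides, each seen as a light/heavy pair 8 Da apart.
example_peaks = [
    (1000.0, 50.0), (1008.0, 100.0),   # heavy area is twice the light area
    (1200.0, 80.0), (1208.0, 80.0),    # unchanged
    (1500.0, 40.0), (1508.0, 30.0),    # heavy area is three-quarters of light
]
ratios = pair_icat_peaks(example_peaks)
```

A ratio above 1 means the peptide is more abundant in the heavy-tagged (serum) sample, and a ratio below 1 means it is less abundant. Peptides with two or more cysteines would be separated by multiples of the tag difference, which this sketch does not handle.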
Even without expanding the spectrum, we can estimate that the concentration of peptide #1 appears to double, peptide #2 to remain the same, and peptide #3 to fall about 25%, upon addition of serum. Of course, these peptides represent proteins, and many of those proteins can be identified by sequencing the peptides by MS/MS as described earlier in this chapter. In this way, the change in the concentration of a large number of proteins can be quantified relatively quickly and easily.

Since the introduction of the ICAT labeling method, other methods have been developed. For example, proteins can be labeled in vivo by including heavy-isotope-tagged amino acids in the growth medium. This is called stable isotope labeling by amino acids in cell culture (SILAC), and it has the advantage of labeling a wider range of peptides, not just those that contain cysteines. It also largely eliminates variation arising from sample preparation because the two cell cultures are mixed prior to protein preparation.

The power of these techniques led Jürgen Cox and Matthias Mann to ask: “Is proteomics the new genomics?” In other words, can we hope to examine massive numbers of proteins simultaneously, in the same way that a DNA microchip allows us to examine massive numbers of RNAs? Clearly, the proteomic method is more time-consuming than the genomic method, and only a subset of proteins can be identified at one time, because of the time limitations of the MS/MS technique. But, with some readily imaginable improvements, these proteomic techniques will become even more powerful.

Note that the methods described here quantify the change in proteins’ concentrations, rather than the absolute concentrations of proteins. Fortunately, the former is frequently the more useful information. However, if one wants to quantify a particular protein’s absolute cellular concentration, one can take a protein mixture labeled with a light tag and spike it with a known amount of that protein, labeled with a heavy tag.
MS on peptides derived from the tagged protein will reveal the ratio of the known, heavy peak to the unknown, light peak, and therefore the concentration of the protein.

SUMMARY To determine the changes in protein levels upon perturbation of a cell culture, one can label the cells under the first condition with a light isotopic tag, and under the second condition with a heavy isotopic tag. If the proteins are labeled in vivo, the cell cultures can be mixed, and the proteins can then be extracted and fragmented by proteolysis. Then the peptides can be separated and subjected to mass spectrometry. Peptides will appear as pairs of peaks separated by the mass difference in the tags. The ratio of heavy to light peak area tells us the change in protein concentration as the growth conditions change.

Comparative Proteomics

What makes a worm a worm and a fly a fly? As stated in Chapter 3, it is the proteins produced in these organisms that set them apart. And, presumably, not just the sum total of proteins produced, but when and where they are made. Quantitative proteomics techniques such as those described in the previous section can shed a good deal of light on these questions. For example, in 2009, Michael Hengartner and colleagues examined the C. elegans (roundworm) proteome using these techniques, and compared it to the D. melanogaster proteome that had been reported in 2007. They looked at proteins in eggs and worms at various stages of development, and identified 10,977 different proteins, representing 10,631 different genes, which is 54% of the 19,735 predicted genes in the C. elegans genome. When they compared the proteins they identified with the proteins predicted from the genome, they found certain classes of proteins underrepresented. These missing proteins tended to be short (less than 400 amino acids) and to have high hydrophobicity (presumably membrane proteins with many fatty transmembrane domains). Hengartner and colleagues estimated protein concentrations in C.
elegans from their mass spectrometry with ICAT data, and compared them with similar protein concentration data from the previous Drosophila study. They focused on 2695 pairs of orthologs present in both organisms, for which there were also transcript concentration data from microarray and SAGE experiments. The earlier transcript concentration data had shown only a modest correlation between the concentration of a given worm mRNA and its ortholog in the fly. But the protein concentrations of orthologs in worm and fly showed a much better correlation. Indeed, the correlation between orthologous protein concentrations in the two organisms is even better than the correlation between mRNA and corresponding protein concentrations within either organism. Apparently, orthologous proteins are needed in similar concentrations in the two organisms, so differences in mRNA concentrations between the two organisms are compensated by mechanisms affecting protein abundance.

To make these comparisons, Hengartner and colleagues used Spearman’s rank correlation. In this statistical technique, two data sets are arranged in rank order. In this case, the concentrations of the 2695 worm proteins were arranged in rank order from highest to lowest concentration, and the orthologous fly proteins were arranged in the same way. Then the correlation between the two sets of ranks is expressed as Spearman’s rank correlation, RS. A perfect correlation would have an RS of 1.0, and two totally unrelated data sets would have an RS of 0.0, though random similarities in large data sets will raise this number above zero, even if there is no correlation. Figure 25.22 shows the statistical data. Figure 25.22a shows a graphical representation of the protein data.
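The rank-correlation procedure described above can be sketched as follows. This is a minimal pure-Python illustration, not the authors' analysis pipeline: the function names and the abundance values are hypothetical, and tied values are assumed to share their average rank, which is a common convention.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 (smallest) upward; tied values share their mean rank.

    Ranking from lowest to highest instead of highest to lowest does not
    change RS, as long as both data sets are ranked in the same direction.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rs(x, y):
    """Spearman's rank correlation: the Pearson correlation of the two rank lists."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in rx))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in ry))
    return cov / (sd_x * sd_y)


# Invented abundances (ppm) for five orthologous protein pairs.
worm_ppm = [120.0, 45.0, 300.0, 8.0, 60.0]
fly_ppm = [200.0, 30.0, 150.0, 10.0, 90.0]
rs = spearman_rs(worm_ppm, fly_ppm)
```

Because only the ranks enter the calculation, RS is insensitive to the units or scale in which the two organisms' abundances are reported.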
If the two data sets were perfectly correlated, all the dots, each representing a comparison of the abundance of a single orthologous protein in the two organisms, would fall on a line with a slope of 1.0. In this case, there is considerable scatter in the data points, but they cluster around a line with a slope of 1.0. In fact, as shown in Figure 25.22b, the RS for the protein data is high: 0.79, showing a clear correlation between protein concentrations of orthologous proteins in the two organisms. By contrast, the concentrations of orthologous mRNAs in the two organisms have an RS of only 0.47 if measured by microarrays, and only 0.22 if measured by SAGE. Thus, the protein concentrations are much more highly conserved than their corresponding mRNA concentrations. In fact, the protein concentrations in the two organisms are even more highly correlated than the protein and mRNA concentrations in the same organism. The RS values for protein–mRNA correlations in C. elegans are 0.59 with the microarray data and 0.44 with the SAGE data. The RS values in Drosophila are 0.66 and 0.36 with the two data sets.

Figure 25.22 Correlation between abundances of orthologous proteins and transcripts in C. elegans and D. melanogaster. [Panel (a) axes: D. melanogaster and C. elegans protein abundance, each in log10 ppm of total protein.] (a) Abundances (in parts per million [ppm]) of orthologous proteins in the two organisms, determined by mass spectrometry, are plotted against each other. Each dot represents one orthologous pair of proteins. Crosses represent medians of equal-sized bins of values. The “whiskers” at the ends of the crosses represent the range from 25% to 75% of values (where the median, of course, is 50%). The inset contains a similar analysis of the subsets of proteins involved in signal transduction (blue) and translation (red). (b) Correlation coefficients (RS) between proteins and transcripts (measured by microarray [Affymetrix] or SAGE, as noted) in the two species, and between proteins and transcripts within the two species. (Source: Figure 5 from Schrimpf SP, Weiss M, Reiter L, Ahrens CH, Jovanovic M, et al. (2009). Comparative Functional Analysis of the Caenorhabditis elegans and Drosophila melanogaster Proteomes. PLoS Biol 7(3): e1000048. doi:10.1371/journal.pbio.1000048. © 2009 Schrimpf et al.)

SUMMARY Mass spectrometry data can be used to compare protein concentrations in two different organisms. This kind of analysis, applied to C. elegans and Drosophila, showed that the concentrations of orthologous proteins in the two organisms are correlated much better than the orthologous mRNAs in the two organisms, and even better than the proteins and corresponding mRNAs in the same organism.

Protein Interactions

Most proteins do not function in isolation, but collaborate with other proteins, by participating in such things as biochemical or developmental pathways. Signal transduction pathways (Chapter 12) are good examples. Many other proteins form large multiprotein complexes dedicated to a specific task, such as the ribosome (protein synthesis) or the proteasome (protein degradation). So one goal of proteomics is to identify the proteins that interact with one another. This frequently can give important clues about the functions of newly discovered proteins. Traditionally, protein–protein interactions have been detected by yeast two-hybrid analysis (Chapter 5), and some proteome-wide studies of protein–protein interactions have been performed using this technique.
But two-hybrid analysis is indirect, using reporter gene activation to observe interaction between two parts of a chimeric transcription activator, and it suffers from both false positives and false negatives. Nevertheless, in conjunction with validation by an independent technique, yeast two-hybrid screens can be very powerful. In 2005, Erich Wanker and colleagues used a yeast two-hybrid screen, with partial independent validation, to detect over 3000 interactions between human proteins. This was a start down the arduous path toward elucidating the human interactome, the total set of interactions among human proteins.

Investigators have also used ultrasensitive protein mass spectrometry to do a better job of detecting protein–protein interactions. In one such study in 2002, Daniel Figeys and colleagues employed the following procedure (Figure 25.23) to screen protein–protein interactions in yeast: First, they chose a set of 725 “bait” proteins that were likely to interact with other, “fish” proteins. The bait proteins represented several different classes, including protein kinases, protein phosphatases, and proteins that participate in the response to DNA damage. The investigators engineered the genes for each of these proteins to include the coding region for the Flag epitope and then introduced the chimeric genes into yeast cells where they were expressed. (The word “Flag” simply refers to the fact that the epitope serves as a “flag” to make the proteins easy for a single antibody to recognize.) Then the investigators used immunoaffinity chromatography with an anti-Flag antibody to purify protein complexes containing the bait protein from a cell extract. They separated the proteins from the complexes by SDS-PAGE, cut each band out of the gel, digested the protein in each band with trypsin, and subjected the resulting tryptic peptides to mass spectrometry.
Figure 25.23 Using mass spectrometry to detect protein–protein interactions. (a) Generating the tagged bait protein. A yeast gene encoding a bait protein is engineered to include the coding region for a tag, such as the Flag epitope, then placed in yeast cells and expressed to yield the tagged bait protein. (b) Isolating complexes with the bait protein. Immunoaffinity chromatography is performed with a resin containing an antibody directed against the tag on the bait protein. This “fishes out” not only the bait protein, but any “fish” proteins that interact with it. In this case, there are four such proteins, numbered 2–5. (c) Purifying and identifying the proteins. SDS-PAGE is used to separate and purify the proteins in the complex. The proteins are excised from the gel and digested with trypsin, and the resulting peptides are analyzed by mass spectrometry. A computer compares the masses of the tryptic peptides with the predicted masses of peptides from all the proteins encoded in the yeast genome to identify the proteins. (Source: Adapted from Kumar, A. and M. Snyder, Protein complexes take the bait. Nature 415, 2002, p. 123, f. 1.)

Because we know the sequence of the whole yeast genome, a computer can predict all of the proteins encoded in the genome, and the masses of the tryptic peptides that should be obtained from each of them. Thus, this kind of bioinformatic analysis (see next section) can use the mass spectrometer data to identify the tryptic peptides and therefore the proteins. Using 10% of the predicted yeast proteins as bait, Figeys and colleagues fished out and identified 3617 associated proteins, which is about 25% of the predicted yeast proteome. This is about three-fold higher than the success rate for yeast two-hybrid analysis.
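The computer matching step described above, in which observed tryptic peptide masses are compared with masses predicted from the genome, can be sketched as follows. This is a simplified illustration rather than the software used in the study. The function names, the two-protein toy "proteome," and the mass tolerance are hypothetical. Trypsin is assumed to cleave after every lysine (K) or arginine (R) not followed by proline (P), with no missed cleavages and no modifications. The residue masses are standard monoisotopic values.

```python
# Monoisotopic residue masses in daltons (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide for the free N- and C-termini


def tryptic_digest(protein):
    """Cleave after K or R unless the next residue is P (no missed cleavages)."""
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        is_last = i == len(protein) - 1
        if is_last or (residue in "KR" and protein[i + 1] != "P"):
            peptides.append(protein[start:i + 1])
            start = i + 1
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[residue] for residue in peptide) + WATER


def best_match(observed_masses, proteome, tolerance=0.01):
    """Return (protein_name, hits): the protein explaining the most observed masses."""
    scores = {}
    for name, sequence in proteome.items():
        predicted = [peptide_mass(p) for p in tryptic_digest(sequence)]
        scores[name] = sum(
            any(abs(obs - pred) <= tolerance for pred in predicted)
            for obs in observed_masses
        )
    best = max(scores, key=scores.get)
    return best, scores[best]


# Invented two-protein "proteome" and two slightly noisy observed masses.
toy_proteome = {"toyA": "AAKGGRPLLK", "toyB": "MSTKVVDER"}
observed = [peptide_mass("MSTK") + 0.002, peptide_mass("VVDER") - 0.003]
match = best_match(observed, toy_proteome)
```

Real search engines score matches statistically and allow for missed cleavages and modified residues. Counting hits, as done here, is only the simplest way to show the idea.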
Figure 25.24 shows the results obtained with two bait proteins that are protein kinases, Kss1 and Cdc28. Some known interactions (red arrows) were rediscovered, but many new interactions (green arrows) were also found.

Figure 25.24 Examples of protein–protein interactions discovered by Figeys and colleagues. (a) Interactions discovered with Kss1 as bait. (b) Interactions discovered with Cdc28 as bait. In both panels, red arrows represent known interactions, and green arrows represent new interactions discovered in this study. (Source: Adapted from Ho, Y., A. Gruhler, A. Heilbut, G.D. Bader, L. Moore, S.L. Adams, et al., Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature 415, 2002, p. 180, f. 1.)

In a similar study, Anne-Claude Gavin and colleagues discovered 589 yeast protein assemblies that defined 232 distinct multiprotein complexes. Most interesting was the fact that these associations could predict new roles for 344 proteins, including 231 proteins for which no function was previously known. This “guilt by association” technique is a powerful way to assign functions to unknown proteins.

Michael Snyder and colleagues have approached the problem from a different angle. They have used protein microarrays representing most of the yeast proteome to determine which yeast proteins (or lipids) bind to each protein in the array. Each tiny spot on the array contained a yeast protein coupled to glutathione-S-transferase and an oligohistidine tag.
In fact, the proteins were tethered to the nickel-coated chip through their oligohistidine tags. In one test of the method (Figure 25.25), Snyder and colleagues probed the array with a protein or lipid coupled to biotin, then probed with streptavidin bound to a fluorescent tag. The streptavidin binds tightly to biotin, and its tag fluoresces green, indicating a positive interaction. The proteins on the microarray were spotted in duplicate, so true positives should appear as pairs of green spots. Figure 25.25 shows at least one positive interaction in each field. One of the probes, calmodulin, is a calcium-binding protein that interacts with many other proteins that require calcium for activity. The other five probes were liposomes containing biotinylated lipids, most of which are active in intracellular signaling. The arrays were also probed with anti-GST (α-GST) antibody and a secondary antibody that gave red fluorescence. This was a control for protein loading; all the proteins were tagged with GST, so they should all “light up” with the α-GST antibody.

Figure 25.25 Using a protein microchip to detect protein–protein and protein–lipid interactions. Snyder and colleagues made protein microarrays with proteins spotted in duplicate side-by-side and probed them first with an α-GST antibody (first and third rows) or the probes listed beneath the second and fourth rows. The α-GST antibody was in turn detected with a fluorescent probe to yield the red spots. The intensity of the red fluorescence indicated the amount of protein in each spot. The probes in the second and fourth rows were coupled to biotin, which could be detected with streptavidin coupled to a green fluorescent tag. The probes were calmodulin, a protein involved in many processes that require calcium, and liposomes containing the following signaling lipids: phosphatidylinositol(3)phosphate [PI(3)P]; phosphatidylinositol(4,5)bisphosphate [PI(4,5)P2]; phosphatidylinositol(4)phosphate [PI(4)P]; phosphatidylinositol(3,4)bisphosphate [PI(3,4)P2]; and phosphatidylcholine [PC]. Each pair of green spots corresponds to a protein on the microarray, spotted in duplicate, that binds to the protein or lipid probe. The red spots corresponding to the positive (green) spots in rows 2 and 4 are boxed. (Source: Adapted from Zhu et al., Science 293 (2001) Fig. 2A, p. 2102.)

Some proteins have binding modules for particular peptide sequences in other proteins. For example, SH3 and WW domains bind to proline-rich peptides, and SH2 domains bind to peptides containing a phosphotyrosine. Based on this knowledge, Stanley Fields, Charles Boone, and Gianni Cesareni and colleagues (Tong et al., 2002) have developed a procedure that meshes experimental and computational strategies to identify the specific partners of proteins having these and other peptide-binding domains. The procedure employs the following four steps:

First, the investigators used a technique called phage display to discover the consensus sequences recognized by a given peptide-binding domain. In phage display, the gene or gene fragment encoding a protein or peptide is cloned into a phage vector coupled to a phage coat protein gene such that the protein or peptide will be displayed on the surface of the recombinant phage. The phages displaying a protein or peptide that interacts with a second protein can be fished out with the second protein linked to a resin bead. These positive phage clones can then be analyzed to see what protein or peptide they are displaying. These are putative targets for the second protein. In this study, Tong and colleagues identified 24 different SH3 domains in yeast by a PSI-BLAST analysis (see next section) with the oncoprotein Src, which has an SH3 domain, as the query sequence. Twenty of these SH3 domains could be expressed as GST-fusion proteins in E. coli, and Tong and colleagues linked these fusion proteins to resin beads and screened them against a library of random nonapeptides (peptides of 9 amino acids) displayed on phage surfaces. Each SH3 domain bound preferentially to a subset of nonapeptides, which yielded a consensus sequence for the peptide target of each SH3 domain.

Second, Tong and colleagues used computational methods to find the consensus peptide target sequences in the yeast proteome. This process yielded the protein network shown in Figure 25.26a. It is a network because many target proteins have SH3 domains of their own that bind in turn to other targets. The proteins are grouped in “k-cores,” where each protein has at least k interactions with other proteins in the group. For example, the 6-core is a group of proteins, each of which is predicted to interact with at least six other proteins. The 6-core is shown in black, with red connecting lines, in Figure 25.26a, and is expanded in Figure 25.26b.

In the third step, Tong and colleagues detected interactions between SH3 domains and target proteins in a different way, using a yeast two-hybrid analysis.

Finally, in the fourth step, they compared the results of the two methods to find interactions common to both. Of all the interactions, 59 were detected by both methods and, because they were independently identified by two methods, it is very likely that the great majority of them are authentic. As a test, Tong and colleagues chose one protein (Las17) with five different proline-rich domains, which is predicted to interact with nine different SH3 proteins. They then verified all of these interactions with direct in vitro assays. Indeed, the phage display experiments predicted which of the five proline-rich domains on Las17 would be the favorite target of each of the nine proteins. With one exception, the in vitro assays proved these predictions correct.

Each of these techniques for measuring protein–protein interaction is useful, but each has its own problems. All are subject to false-negatives (failure to discover an authentic interaction) and false-positives (detecting an apparent interaction that does not occur in vivo). The best data will probably come from a combination of different techniques.

Figure 25.26 Predicted network of protein–protein interactions involving yeast SH3 domains and their targets. (a) All proteins and interactions predicted by phage display and searching the yeast proteome. Proteins are grouped into k-cores in which each protein makes at least k interactions. For example, a 3-core contains proteins that make at least 3 interactions. Each protein is color-coded by its k-core value as follows: 6-cores, black; 5-cores, cyan; 4-cores, blue; 3-cores, red; 2-cores, green; and 1-cores, yellow. The interactions of the 6-core proteins are represented by red lines. (b) Expansion of the 6-core network to show interactions with specific proteins. (Source: Adapted from Tong et al., Science 295 (2002) Fig. 2, p. 322.)
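Two computational ideas from the four-step procedure lend themselves to a short illustration: grouping a predicted network into k-cores, and keeping only the interactions found by two independent methods. The Python sketch below is a minimal version under stated assumptions. The protein names P1 to P5 and both interaction lists are invented toy data, not results from Tong and colleagues, and the pruning routine is the standard graph-theory way to compute a k-core rather than a description of the authors' own software.

```python
def normalize(pairs):
    """Treat interactions as undirected, so (A, B) equals (B, A)."""
    return {frozenset(pair) for pair in pairs if pair[0] != pair[1]}


def k_core(edges, k):
    """Return the proteins in the k-core of an interaction network.

    Proteins with fewer than k partners are removed repeatedly until
    every remaining protein has at least k remaining partners.
    """
    neighbors = {}
    for edge in edges:
        a, b = tuple(edge)
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    changed = True
    while changed:
        changed = False
        for protein in list(neighbors):
            if len(neighbors[protein]) < k:
                # Drop the protein and erase it from its partners' lists.
                for partner in neighbors.pop(protein):
                    neighbors[partner].discard(protein)
                changed = True
    return set(neighbors)


# Hypothetical toy data (not from the study).
PHAGE_DISPLAY_PREDICTED = normalize(
    [("P1", "P2"), ("P1", "P3"), ("P2", "P3"), ("P3", "P4"), ("P4", "P5")]
)
TWO_HYBRID_DETECTED = normalize(
    [("P2", "P1"), ("P2", "P3"), ("P4", "P5"), ("P1", "P5")]
)

# Interactions supported by both methods (the fourth step).
SUPPORTED_BY_BOTH = PHAGE_DISPLAY_PREDICTED & TWO_HYBRID_DETECTED

if __name__ == "__main__":
    print("2-core:", sorted(k_core(PHAGE_DISPLAY_PREDICTED, 2)))
    for pair in sorted(sorted(p) for p in SUPPORTED_BY_BOTH):
        print("both methods:", pair)
```

In the toy network, P5 has a single partner and is pruned first; that leaves P4 with one partner, so it is pruned as well, and only the mutually connected proteins remain in the 2-core. Taking the intersection trades sensitivity for confidence: it discards authentic interactions that only one method detects, which is the false-negative problem noted above.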