96 252 Proteomics

by taratuta

on 19-01-2017


Chapter 25 / Genomics II: Functional Genomics, Proteomics, and Bioinformatics
in turn, should give us important clues about how these
pathways work in humans. Moreover, yeast cells can be
used as human surrogates to test the effects of knocking
out the yeast ortholog of a known human disease gene.
SUMMARY Single-nucleotide polymorphisms can
probably account for many genetic conditions
caused by single genes, and even multiple genes.
They might also be able to predict a person’s response to drugs. A haplotype map with over 1 million SNPs will make it easier to sort out the
important SNPs from those with no effect. Structural variation (insertions, deletions, inversions, and
other rearrangements of chunks of DNA) is also a
surprisingly prominent source of variation in human genomes. Some structural variation can in
principle predispose certain people to contract diseases, but some is presumably benign, and some is
demonstrably beneficial.
25.2 Proteomics
Earlier in this chapter, we learned that the study of an organism’s proteome, that is, the properties and activities of all
the proteins that organism makes in its lifetime, is called
proteomics. Whereas the task of analyzing an organism’s
genome, or even its transcriptome, is relatively straightforward, the task of analyzing an organism’s proteome is anything but simple, in large part because of the complexity of
proteins relative to nucleic acids. Indeed, with current techniques, proteomics studies on complex organisms can examine only a fraction of the total proteome.
Given this difficulty, why are scientists even interested
in studying gene expression at the protein level, when they
already have transcriptomics, in which they can probe the
expression of vast numbers of genes simultaneously by
looking at the levels of their transcripts? Part of the answer
is that we now know that a large fraction, perhaps 50% or
more, of polyadenylated RNAs in human cells do not code
for proteins. These are called noncoding RNAs (ncRNAs),
and, as we have seen, they are also known as transcripts of
unknown function (TUFs). They are interesting in their
own right, but their level of expression tells us nothing
about protein expression levels. Another part of the answer
is that the sequence of a protein-encoding gene and the
level of its expression may give little or no information
about the activity of its protein product.
A third part of the answer is that the level of transcription of a gene gives only a rough idea of the real level of
expression of that gene. For one thing, an mRNA may be
produced in abundance, but degraded rapidly, or translated
inefficiently, so the amount of protein produced is minimal.
For another, many proteins experience posttranslational
modifications that have a profound effect on their activities. For example, some proteins are not active until they
are phosphorylated. Thus, if the cell is not phosphorylating
such a protein at a given time, production of a large amount
of mRNA for that protein would give a misleading picture
of the true level of expression of the corresponding gene.
Furthermore, many transcripts give rise to more than one
protein—through alternative splicing, or alternative posttranslational modification. So measuring the level of a gene
transcript doesn’t necessarily tell what protein products
will be made. Finally, many polypeptides form large complexes with other polypeptides, and the true expression of
each polypeptide’s function occurs only in the context of
the complex.
Therefore, if we want to measure real gene expression,
we must look at the protein level. To analyze all the proteins
in an organism, we need to do two things: First, we need to
separate all those proteins from one another. Second, we
have to analyze each protein by identifying it and measuring its activity. In the next two sections we will introduce
some of the ways molecular biologists do these things.
SUMMARY The sum of all proteins produced by an
organism is its proteome, and the study of these proteins, even smaller sets of them, is called proteomics.
Such studies give a more accurate picture of gene
expression than transcriptomics studies do.
Protein Separations
One of the best separation tools available is two-dimensional
gel electrophoresis, which was invented in the 1970s
(Chapter 5). As powerful as that technique is, it is not up to
the job of resolving all the tens of thousands of proteins in
the human proteome. An average 2-D gel can resolve only
about 2000 proteins, and even the best gel in the best hands
can resolve only about 11,000 proteins. This problem is
compounded by the fact that the performance of 2-D electrophoresis is unpredictable, and it frequently seems to be
more art than science. Another problem is that many very
interesting membrane proteins are too hydrophobic to dissolve in the buffers used in 2-D electrophoresis, so they cannot be seen at all. Finally, many proteins are present in such
tiny quantities in the cell that a 2-D gel cannot detect them.
Most of these problems are presently intractable, but
scientists have dealt with the 2-D gel resolution problem by
analyzing different cellular compartments separately. For
example, they can start with just the nucleus, or even a
subcompartment like the nucleolus or a protein assembly
like the nuclear pore complex. With many fewer proteins to
separate, resolution is not such a serious problem.
[Figure 25.18 diagram labels: laser pulse, sample, matrix, ion beam, detector 1, electrostatic ion reflector, focused ion beam, detector 2, signal]
Figure 25.18 Principle behind MALDI-TOF mass spectrometry.
Place a sample (a peptide in this case) on the matrix at left and ionize
it with a laser pulse. An electrical potential difference between the
matrix and the sample then accelerates the ionized sample toward
detector 1. The time it takes the ions to reach detector 1 depends on
their masses, so one can learn much about their masses by analyzing
the time of flight to detector 1. Alternatively, one can turn on an
electrostatic ion reflector in front of detector 1 to focus the ions and
reflect them toward detector 2. This detector gives even more precise
data about the masses of the ions, according to their times of flight.
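The time-of-flight principle in this caption can be sketched numerically. The following is a minimal, idealized model of a linear flight tube, assuming every ion gains the same kinetic energy from the accelerating voltage; the voltage (20 kV) and drift length (1 m) are hypothetical values chosen only for illustration, and the function names are not taken from any instrument software.

```python
import math

ELEMENTARY_CHARGE = 1.602176634e-19  # coulombs
DALTON = 1.66053906660e-27           # kilograms per dalton


def flight_time(mass_da, charge=1, voltage=20_000.0, drift_length=1.0):
    """Idealized flight time in seconds for an ion of the given mass.

    Energy balance: z*e*V = (1/2)*m*v**2, so t = L / v = L * sqrt(m / (2*z*e*V)).
    voltage and drift_length are assumed, illustrative values.
    """
    mass_kg = mass_da * DALTON
    kinetic_energy = charge * ELEMENTARY_CHARGE * voltage
    velocity = math.sqrt(2.0 * kinetic_energy / mass_kg)
    return drift_length / velocity


def mass_from_time(time_s, charge=1, voltage=20_000.0, drift_length=1.0):
    """Invert the same relation: recover the mass (Da) from a flight time."""
    velocity = drift_length / time_s
    mass_kg = 2.0 * charge * ELEMENTARY_CHARGE * voltage / velocity ** 2
    return mass_kg / DALTON
```

Under these assumptions the flight time grows with the square root of the mass, so an ion four times heavier takes twice as long to reach the detector.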
Protein Analysis
Once the proteins are separated and quantified, how are
they analyzed? First, they have to be identified, and the best
method now available works like this: Individual spots are
cut out of the gel and cleaved into peptides with proteolytic
enzymes. These peptides can then be identified by mass
spectrometry. Figure 25.18 illustrates a popular technique
known by the cumbersome title matrix-assisted laser
desorption-ionization time-of-flight (MALDI-TOF) mass
spectrometry. In this procedure, a peptide is placed on a
matrix, which causes the peptide to form crystals. Then the
peptide on the matrix is ionized with a laser beam (the matrix helps the peptide ionize), and an increase in voltage at
the matrix is used to shoot the ions toward a detector. Assuming all the ions have just one charge (and almost all
do), the time it takes an ion to reach the detector depends
on its mass. The higher the mass, the longer the time of
flight of the ion. In a MALDI-TOF mass spectrometer, the
ions can also be deflected with an electrostatic reflector
that also focuses the ion beam. Thus, we can determine the
masses of the ions reaching the second detector with high
precision, and these masses can reveal the exact chemical
compositions of the peptides.
Then these ions can be broken at their peptide bonds by
a process known as collision-induced dissociation (CID).
Experimenters do this by accelerating the ions and colliding them with a neutral gas to break them, mostly at their
peptide bonds, then sending the new peptide ions to another analyzer to determine their molecular makeup.
Because this involves two mass spectrometry steps in a row,
it is called MS/MS. By comparing the masses of ions differing
by just one amino acid, the nature of the lost amino acids
can be determined one by one, which leads to a sequence,
as illustrated in Figure 25.19.
If the sequence of the whole genome is known, we
know what proteins to expect, so a computer can use the
information from the mass spectrometer to match each
spot on the 2-D gel with one of the genes in the genome,
and therefore predict the sequence of the whole protein.
For example, the sequence information determined in
[Figure 25.19 spectrum: relative abundance (0 to 100) plotted against m/z (600 to 1800) for fragment ions of the peptide NH2-V-P-T-P-N-V-S-V-V-D-L-T-C-R-COOH, whose cysteine carries the ICAT adduct]
Figure 25.19 Sequencing a peptide by mass spectrometry (MS/MS).
The molecular ion is the ionized peptide at top, linked through its
cysteine residue (C) to an adduct known as ICAT, which we will
discuss in the next section. Its nature is not important here. The
molecular ion was fragmented by CID, and the fragment ions were
then subjected to a second round of MS, yielding the spectrum shown
below the sequence. The relative abundance of each ion is plotted
against its mass/charge ratio (m/z). The charge of each ion is assumed
to be +1 in this experiment. Starting at the right, measuring the exact
mass differences between the most prominent ions, one can deduce
the amino acid that was lost to generate the next ion to the left. For
example, the difference between the masses of the last two ions on
the right shows that a threonine (T) was lost. Continuing in this way,
and following the top (solid) arrows, one can read the sequence
TPNVSVVDLTC-ICAT. The ion also fragmented from the other end,
giving the sequence shown on the bottom with dashed arrows
between major ions.
Figure 25.19 was enough to identify the peptide as
belonging to glyceraldehyde-3-phosphate dehydrogenase. However, knowing the sequence of a protein does
not necessarily tell us that protein’s activity, so further
research will be necessary to determine the activities of
many proteins.
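The ladder-reading step described for Figure 25.19 can be sketched in a few lines. This is only an illustration under stated assumptions: the residue masses are standard monoisotopic values, the ions are taken to be singly charged, the ladder in the usage note is synthetic rather than read from the figure, and the function names are hypothetical. Isoleucine is omitted because it has the same mass as leucine, and modified residues such as the ICAT-labeled cysteine are not handled.

```python
# Monoisotopic residue masses in daltons (Ile omitted: isobaric with Leu).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def residue_for(delta, tolerance=0.02):
    """Return the residue whose mass is closest to delta, or '?' if none fits."""
    best = min(RESIDUE_MASS, key=lambda aa: abs(RESIDUE_MASS[aa] - delta))
    return best if abs(RESIDUE_MASS[best] - delta) <= tolerance else "?"


def read_ladder(peaks, tolerance=0.02):
    """Read residues from a fragment ladder, starting at the heaviest ion.

    Each step down the ladder corresponds to the loss of one amino acid,
    so the mass difference between neighboring ions names that residue.
    """
    ordered = sorted(peaks, reverse=True)
    return "".join(
        residue_for(heavier - lighter, tolerance)
        for heavier, lighter in zip(ordered, ordered[1:])
    )


def make_ladder(sequence, start_mass):
    """Build a synthetic ladder by removing one residue at a time (test helper)."""
    peaks = [start_mass]
    for aa in sequence:
        peaks.append(peaks[-1] - RESIDUE_MASS[aa])
    return peaks
```

For instance, `read_ladder(make_ladder("TPNV", 1800.0))` recovers the residues in the order they were removed. Real spectra contain noise peaks and missing ions, so practical software scores many candidate paths rather than trusting a single ladder.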
You may be thinking that it would be nice to make a
microchip that could identify thousands of proteins at
once, as DNA microarrays identify thousands of RNAs at
once in functional genomics studies. That would remove
the need to separate the proteins because a mixture of
many proteins could simply be incubated with the chip to
see what binds. One such strategy would be to produce
antibodies that can recognize proteins specifically and
quantitatively and place them on microchips. But many
obstacles stand in the way of realizing that dream. To begin
with, antibodies are much more expensive and time-consuming to produce than oligonucleotides. In fact, the
task of generating antibodies for every human protein is
unthinkably vast at present. Moreover, the task of detecting
low-abundance proteins, already impossible for many proteins using 2-D gels, would only be exacerbated by the
miniaturization of microchip technology. On the other
hand, the technology to complete the human genome in a
reasonable period of time was not available when that
project was first proposed in the mid-1980s, but the project
stimulated the development of the technology. Perhaps we
will experience a similar phenomenon if a full-scale human
proteome project is initiated.
SUMMARY Current research in proteomics requires
first that proteins be resolved, sometimes on a massive scale. The best tool available for separation of
many proteins at once is 2-D gel electrophoresis.
After they are separated, proteins must be identified,
and the best method for doing that involves digestion of the proteins one by one with proteases, and
identifying the resulting peptides by mass spectrometry. Someday microchips with antibodies attached
may allow analysis of proteins in complex mixtures
without separation.
Quantitative Proteomics
Mass spectrometry is now able to identify proteins as they
emerge from high-performance separation procedures,
such as capillary chromatography, or even in mixtures
without separation. But mass spectrometry is not a quantitative method, so it has been difficult to use it to analyze
the expression levels of proteins. However, beginning at the
end of the 1990s, analytical chemists developed methods
that can tell us how much of a given protein is present in
cells under one set of conditions, compared to the concentration of that same protein in cells under a different set of
conditions. For example, such a method can measure the increase in
concentration of a protein when the gene for that protein is
turned on by an inducer.
[Figure 25.20 diagram labels: affinity reagent (e.g., biotin); sulfhydryl-reactive group]
Figure 25.20 A generic ICAT tag. One end (blue) contains a
sulfhydryl-reactive group that binds to cysteine side chains. The
middle contains a number of positions (red) that can be either all light
isotopes (e.g., hydrogen) or all heavy isotopes (e.g., deuterium). The
left end (yellow) contains an affinity reagent such as biotin, which
allows easy purification of tagged proteins or peptides.
Here is how one such method, using isotope coded
affinity tags (ICATs), works. Experimenters couple affinity
tags to proteins through the sulfhydryl groups of their cysteine side chains. These affinity tags typically contain three
parts, illustrated generically in Figure 25.20: a sulfhydryl-reactive group at one end that can link to a protein’s cysteine side chains; a linker in the middle that contains several
atoms of either a normal isotope (e.g., hydrogen), or a
heavy isotope (e.g., deuterium); and an affinity reagent
such as biotin at the other end, which allows convenient
purification of a protein or peptide bearing the tag. In the
example in Figure 25.20, the heavy tag would be 8 Daltons
heavier than the light tag, by virtue of its eight deuteriums.
This permits the light-tagged and heavy-tagged versions of a peptide to be identified easily in mass spectra because they
appear as a pair of peaks 8 Da apart.
How does this help in quantification? Consider cells
grown under two conditions: with and without serum, for
example. The question is how much change we see in the
concentrations of proteins when serum is added to the medium in which cells are growing. Figure 25.21 shows one
approach to this question. In this case, the investigator
could add light ICATs to proteins from cells grown in the
absence of serum (condition 1), and heavy ICATs to proteins from cells grown in the presence of serum (condition 2).
Then the proteins could be mixed, hydrolyzed with a
protease such as trypsin, affinity-purified using the affinity
reagent, and subjected to liquid chromatography-mass
spectrometry (LC-MS), in which the peptides are separated
by liquid chromatography in a fine capillary, then fed into
a mass spectrometer, in which each peptide appears as a
pair of peaks, separated by a molecular mass defined by the
ICATs in use (e.g., 8 Da).
The heavier of the peaks in each pair comes from the
cells grown in the presence of serum, and the lighter of the
two comes from cells grown without serum. Their relative
areas, which can be determined by expanding the spectrum
to reveal true peaks instead of lines, tell us the change in the
[Figure 25.21 diagram: proteins from condition 1 (light tag) and condition 2 (heavy tag) are (a) combined, (b) proteolyzed, (c) affinity-purified as tagged peptides, and (d) analyzed by LC-MS; the output plots relative abundance against retention time for peptides 1 through 7]
Figure 25.21 Using ICATs to measure the change in protein
concentrations upon shift in growth conditions. Cells are grown
under two different conditions (e.g., without [condition 1] and with
[condition 2] serum). Proteins are extracted from cells grown under both
conditions and tagged with either a light ICAT (condition 1, blue) or a
heavy ICAT (condition 2, red). The tagged proteins are combined and
proteolyzed, and the resulting tagged peptides are subjected to LC-MS.
MS resolves the peptides derived from condition 1 and condition 2
because of their small difference in mass (8 Da, in the example in
Figure 25.20). Thus, each peptide appears as a pair of peaks, and the
relative areas under these peaks correspond to the change in
concentration of the protein to which each peptide belongs. That protein
can frequently be identified by sequencing the peptide by MS/MS.
amount of each peptide upon addition of serum to the medium. Even without expanding the spectrum, we can estimate that the concentration of peptide #1 appears to
double, peptide #2 to remain the same, and peptide #3 to
fall about 25%, upon addition of serum. Of course, these
peptides represent proteins, and many of those proteins can
be identified by sequencing the peptides by MS/MS as
described earlier in this chapter. In this way, the change in
the concentration of a large number of proteins can be
quantified relatively quickly and easily.
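The peak-pairing arithmetic behind this comparison can be sketched as follows. This is a minimal illustration, assuming singly charged peptides, a nominal 8 Da shift between light and heavy tags, and peak areas that have already been integrated; the function name, the tolerance, and the numbers in the usage note are hypothetical.

```python
def pair_icat_peaks(peaks, mass_shift=8.0, tolerance=0.05):
    """Pair light and heavy peaks and report heavy/light area ratios.

    peaks: list of (m/z, area) tuples for singly charged peptides.
    Returns a list of (light m/z, ratio) tuples; a ratio of 2.0 means the
    peptide doubled in the heavy-tagged condition.
    """
    ordered = sorted(peaks)
    used = set()
    ratios = []
    for i, (light_mz, light_area) in enumerate(ordered):
        if i in used:
            continue
        for j in range(i + 1, len(ordered)):
            heavy_mz, heavy_area = ordered[j]
            if j not in used and abs(heavy_mz - light_mz - mass_shift) <= tolerance:
                used.update((i, j))
                ratios.append((light_mz, heavy_area / light_area))
                break
    return ratios


# Made-up peak list mimicking a peptide that doubles, one that is
# unchanged, and one that falls by 25% when serum is added.
example_peaks = [
    (1000.0, 10.0), (1008.0, 20.0),
    (1200.5, 5.0), (1208.5, 5.0),
    (1500.0, 8.0), (1508.0, 6.0),
]
example_ratios = pair_icat_peaks(example_peaks)
```

With the made-up peak list above, the three ratios are 2.0, 1.0, and 0.75. Multiply charged ions would show a smaller m/z spacing (the shift divided by the charge), which this sketch does not handle.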
Since the introduction of the ICAT labeling method,
other methods have been developed. For example, proteins
can be labeled in vivo by including heavy-isotope-tagged
amino acids in the growth medium. This is called stable
isotope labeling by amino acids in cell culture (SILAC), and
it has the advantage of labeling a wider range of peptides—
not just those that contain cysteines. It also minimizes
variation introduced during sample preparation because the two cell cultures are mixed prior to protein preparation.
The power of these techniques led Jürgen Cox and
Matthias Mann to ask: “Is proteomics the new genomics?”
In other words, can we hope to examine massive numbers
of proteins simultaneously, in the same way that a DNA
microchip allows us to examine massive numbers of RNAs?
Clearly, the proteomic method is more time-consuming
than the genomic method, and only a subset of proteins can
be identified at one time, because of the time limitations of
the MS/MS technique. But, with some readily imaginable
improvements, these proteomic techniques will become
even more powerful.
Note that the methods described here quantify the
change in proteins’ concentrations, rather than the absolute concentrations of proteins. Fortunately, the former is
frequently the more useful information. However, if one
wants to quantify a particular protein’s absolute cellular
concentration, one can take a protein mixture labeled with
a light tag and spike it with a known amount of that protein, labeled with a heavy tag. MS on peptides derived from
the tagged protein will reveal the ratio of the known, heavy
peak to the unknown, light peak, and therefore the concentration of the protein.
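The spike-in calculation amounts to a single proportion, sketched here under the assumption that light and heavy forms of the peptide are detected with equal efficiency. The function name and the numbers in the usage note are hypothetical.

```python
def absolute_amount(light_area, heavy_area, heavy_spike_amount):
    """Estimate the unknown (light) amount from a known heavy spike.

    Assumes equal detection efficiency for both forms, so that
    amount_light / amount_heavy = area_light / area_heavy.
    The result is in the same units as heavy_spike_amount.
    """
    if heavy_area <= 0:
        raise ValueError("heavy peak area must be positive")
    return heavy_spike_amount * light_area / heavy_area
```

For example, if 5 pmol of heavy-tagged protein is added and the light peak has three times the area of the heavy peak, `absolute_amount(30.0, 10.0, 5.0)` gives an estimate of 15 pmol for the endogenous protein.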
SUMMARY To determine the changes in protein levels
upon perturbation of a cell culture, one can label
the cells under the first condition with a light isotopic tag, and under the second condition with a heavy
isotopic tag. If the proteins are labeled in vivo, the
cell cultures can be mixed, and the proteins can be extracted
and fragmented by proteolysis. Then the peptides
can be separated and subjected to mass spectrometry. Peptides will appear as pairs of peaks separated by the mass difference in the tags. The ratio of
heavy to light peak area tells us the change in protein concentration as the growth conditions change.
Comparative Proteomics What makes a worm a worm
and a fly a fly? As stated in Chapter 3, it is the proteins
produced in these organisms that set them apart. And, presumably, not just the sum total of proteins produced, but
when and where they are made. Quantitative proteomics
techniques such as those described in the previous section
can shed a good deal of light on these questions.
For example, in 2009, Michael Hengartner and colleagues examined the C. elegans (roundworm) proteome
using these techniques, and compared it to the D. melanogaster proteome that had been reported in 2007. They
looked at proteins in eggs and worms at various stages of
development, and identified 10,977 different proteins,
representing 10,631 different genes, which is 54% of the
19,735 predicted genes in the C. elegans genome. When
they compared the proteins they identified with the proteins predicted from the genome, they found certain classes
of proteins underrepresented. These missing proteins
tended to be short (less than 400 amino acids) and to have
high hydrophobicity (presumably membrane proteins with
many fatty transmembrane domains).
Hengartner and colleagues estimated protein concentrations in C. elegans from their mass spectrometry with
ICAT data, and compared them with similar protein
concentration data from the previous Drosophila study.
They focused on 2695 pairs of orthologs present in both
organisms, for which there was also transcript concentration
data from microarray and SAGE experiments. The earlier
transcript concentration data had shown only a modest
correlation between the concentration of a given worm
mRNA and its ortholog in the fly. But the protein concentrations of orthologs in worm and fly showed a much better correlation. Indeed, the correlation between orthologous
protein concentrations in the two organisms is even better
than the correlation between mRNA and corresponding
protein concentrations within either organism. Apparently,
orthologous proteins are needed in similar concentrations
in the two organisms, so differences in mRNA concentrations between the two organisms are compensated by
mechanisms affecting protein abundance. To make these
comparisons, Hengartner and colleagues used Spearman’s
rank correlation. In this statistical technique, two data sets
are arranged in rank order. In this case, the concentrations
of the 2695 worm proteins were arranged in rank order
from highest to lowest concentration, and the orthologous
fly proteins were arranged in the same way. Then the correlation between the two ranks is expressed as Spearman’s
rank correlation, RS. A perfect correlation would have an
RS of 1.0, and two totally unrelated data sets would have an
RS of 0.0, though random similarities in large data sets will
raise this number above zero, even if there is no correlation.
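The ranking procedure just described can be sketched with the standard library alone. This is a generic implementation of Spearman's rank correlation (the Pearson correlation of the ranks, with tied values sharing their average rank), not the code used in the study; the function names are hypothetical. Ranking from lowest to highest, as done here, gives the same coefficient as ranking from highest to lowest, provided both data sets are ranked the same way.

```python
import math


def average_ranks(values):
    """Rank values from lowest (rank 1) upward; ties share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Ordinary Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

Because only ranks matter, any monotonic rescaling of the abundances (taking logarithms, for example) leaves the coefficient unchanged, which makes the measure convenient for comparing data from different instruments or organisms.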
Figure 25.22 shows the statistical data. Figure 25.22a
shows a graphical representation of the protein data. If the
two data sets were perfectly correlated, all the dots, each representing a comparison of the abundance of a single orthologous protein in the two organisms, would fall on a line with
a slope of 1.0. In this case, there is considerable scatter in the
data points, but they cluster around a line with a slope of 1.0.
In fact, as shown in Figure 25.22b, the RS for the protein data
is high: 0.79, showing a clear correlation between protein
concentrations of orthologous proteins in the two organisms.
By contrast, the concentrations of orthologous mRNAs in
the two organisms have an RS of only 0.47 if measured by
microarrays, and only 0.22 if measured by SAGE. Thus, the
protein concentrations are much more highly conserved than
their corresponding mRNA concentrations. In fact, the protein concentrations in the two organisms are even more
highly correlated than the protein and mRNA concentrations
in the same organism. The RS values for protein–mRNA correlations in C. elegans are 0.59 with the microarray data and
0.44 with the SAGE data. The RS values in Drosophila are
0.66 and 0.36 with the two data sets.
[Figure 25.22a axes: C. elegans (log10 ppm of total protein) and D. melanogaster (log10 ppm of total protein)]
Figure 25.22 Correlation between abundances of orthologous
proteins and transcripts in C. elegans and D. melanogaster.
(a) Abundances (in parts per million [ppm]) of orthologous proteins in
the two organisms, determined by mass spectrometry, are plotted
against each other. Each dot represents one orthologous pair of
proteins. Crosses represent medians of equal sized bins of values. The
“whiskers” at the ends of the crosses represent the range from 25% to
75% of values (where the median, of course, is 50%). The inset
contains a similar analysis of the subsets of proteins involved in signal
transduction (blue) and translation (red). (b) Correlation coefficients (RS)
between proteins and transcripts (measured by microarray [Affymetrix]
or SAGE, as noted) in the two species, and between proteins and
transcripts within the two species. (Source: Figure 5 from, Schrimpf SP,
Weiss M, Reiter L, Ahrens CH, Jovanovic M, et al. (2009). Comparative Functional
Analysis of the Caenorhabditis elegans and Drosophila melanogaster Proteomes.
PLoS Biol 7(3): e1000048. doi:10.1371/journal.pbio.1000048. © 2009 Schrimpf et al.)
SUMMARY Mass spectrometry data can be used to
compare protein concentrations in two different organisms. This kind of analysis, applied to C. elegans
and Drosophila, showed that the concentrations of
orthologous proteins in the two organisms are correlated much better than the orthologous mRNAs in
the two organisms, and even better than the proteins
and corresponding mRNAs in the same organism.
Protein Interactions
Most proteins do not function in isolation, but collaborate
with other proteins, by participating in such things as
biochemical or developmental pathways. Signal transduction pathways (Chapter 12) are good examples. Many other
proteins form large multiprotein complexes dedicated to a
specific task, such as the ribosome (protein synthesis) or the
proteasome (protein degradation). So one goal of proteomics is to identify the proteins that interact with one
another. This frequently can give important clues about the
functions of newly discovered proteins.
Traditionally, protein–protein interactions have been
detected by yeast two-hybrid analysis (Chapter 5), and
some proteome-wide studies of protein–protein interactions have been performed using this technique. But two-hybrid analysis is indirect, using reporter gene activation to
observe interaction between two parts of a chimeric transcription activator, and it suffers from both false positives
and false negatives. Nevertheless, in conjunction with validation by an independent technique, yeast two-hybrid
screens can be very powerful. In 2005, Erich Wanker and
colleagues used a yeast two-hybrid screen, with partial independent validation, to detect over 3000 interactions between human proteins—a start down the arduous path
toward elucidating the human interactome, the total set of
interactions among human proteins.
Investigators have also used ultrasensitive protein mass
spectrometry to do a better job of detecting protein–protein
interactions. In one such study in 2002, Daniel Figeys and
colleagues employed the following procedure (Figure 25.23)
to screen protein–protein interactions in yeast: First, they
chose a set of 725 “bait” proteins that were likely to interact
with other, “fish” proteins. The bait proteins represented
several different classes, including protein kinases, protein
phosphatases, and proteins that participate in the response
to DNA damage. The investigators engineered the genes
for each of these proteins to include the coding region
for the Flag epitope and then introduced the chimeric
genes into yeast cells where they were expressed. (The
word “Flag” simply refers to the fact that the epitope
serves as a “flag” to make the proteins easy for a single antibody to recognize.)
Then the investigators used immunoaffinity chromatography with an anti-Flag antibody to purify protein complexes containing the bait protein from a cell extract. They
separated the proteins from the complexes by SDS-PAGE,
cut each band out of the gel, digested the protein in each
band with trypsin, and subjected the resulting tryptic peptides to mass spectrometry. Because we know the sequence
of the whole yeast genome, a computer can predict all of
the proteins encoded in the genome, and the masses of the
tryptic peptides that should be obtained from each of them.
Thus, this kind of bioinformatic analysis (see next section)
can use the mass spectrometer data to identify the tryptic
peptides and therefore the proteins.
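The matching step described above can be sketched as an in silico digest followed by a mass comparison. This is a toy illustration, not the software used in the study: the two protein sequences are invented, the cleavage rule (after lysine or arginine, but not before proline) is the commonly used simplification for trypsin, and the function names and tolerance are assumptions. Residue masses are standard monoisotopic values.

```python
# Monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565  # added once per peptide for the free N- and C-termini


def tryptic_digest(sequence):
    """Cleave after K or R, except when the next residue is P."""
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        next_aa = sequence[i + 1] if i + 1 < len(sequence) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def identify(observed_masses, proteome, tolerance=0.01):
    """Score each protein by how many observed masses match its peptides."""
    scores = {}
    for name, sequence in proteome.items():
        predicted = [peptide_mass(p) for p in tryptic_digest(sequence)]
        scores[name] = sum(
            any(abs(obs - pred) <= tolerance for pred in predicted)
            for obs in observed_masses
        )
    best = max(scores, key=scores.get)
    return best, scores


# Invented sequences standing in for proteins predicted from a genome.
toy_proteome = {
    "toy_protein_A": "MKTAYIAKQR",
    "toy_protein_B": "GSHMLEDPVRLLNK",
}
observed = [peptide_mass(p) for p in tryptic_digest(toy_proteome["toy_protein_B"])]
best_match, match_scores = identify(observed, toy_proteome)
```

Here the observed masses are generated from one of the toy proteins, so that protein scores highest. Real searches must also allow for missed cleavages, modifications, and measurement error, which is why dedicated search engines use statistical scoring rather than a simple count.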
Using 10% of the predicted yeast proteins as bait,
Figeys and colleagues fished out and identified 3617 associated proteins, which is about 25% of the predicted yeast
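[Figure 25.23 diagram: (a) a tagged bait protein (1) is expressed and its protein complex is isolated; (b) the complex, containing proteins 1 through 5, is purified on an affinity column and separated by SDS-PAGE; (c) bands are excised and digested with trypsin, and the peptides are analyzed by mass spectrometry and identified by bioinformatics]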
Figure 25.23 Using mass spectrometry to detect protein–protein
interactions. (a) Generating the tagged bait protein. A yeast gene
encoding a bait protein is engineered to include the coding region for
a tag, such as the Flag epitope, then placed in yeast cells and
expressed to yield the tagged bait protein. (b) Isolating complexes
with the bait protein. Immunoaffinity chromatography is performed
with a resin containing an antibody directed against the tag on the bait
protein. This “fishes out” not only the bait protein, but any “fish”
proteins that interact with it. In this case, there are four such proteins,
numbered 2–5. (c) Purifying and identifying the proteins. SDS-PAGE is
used to separate and purify the proteins in the complex. The proteins
are excised from the gel and digested with trypsin, and the resulting
peptides are analyzed by mass spectrometry. A computer compares
the masses of the tryptic peptides with the predicted masses of
peptides from all the proteins encoded in the yeast genome to identify
the proteins. (Source: Adapted from Kumar, A. and M. Snyder, Protein
complexes take the bait. Nature 415, 2002, p. 123, f. 1.)
proteome. This is about three-fold higher than the success
rate for yeast two-hybrid analysis. Figure 25.24 shows the
results obtained with two bait proteins that are protein
kinases, Kss1 and Cdc28. Some known interactions (red
arrows) were rediscovered, but many new interactions
(green arrows) were also found.
[Figure 25.24 diagram: interaction maps for (a) Kss1 and (b) Cdc28. Node labels recovered from the figure: Ste7, Ste11, Ste12, Tec1, Dig1, Dig2, Pph3, Rpn6, Rpn10, Bem3, Bck3, Msg5, Dop1, Hsl7, Kel1, Swe1, Clb2, Clb3, Clb5, Net1, Cks1, Fkh1, Fkh2, Mbp1, Cln1, Cln2, Ubp15, Sin3, Cdh1, Nap1]
Figure 25.24 Examples of protein–protein interactions discovered
by Figeys and colleagues. (a) Interactions discovered with Kss1 as
bait. (b) Interactions discovered with Cdc28 as bait. In both panels,
red arrows represent known interactions, and green arrows represent
new interactions discovered in this study. (Source: Adapted from Ho, Y.,
A. Gruhler, A. Heilbut, G.D. Bader, L. Moore, S.L. Adams, et al., Systematic
identification of protein complexes in Saccharomyces cerevisiae by mass
spectrometry. Nature 415, 2002, p. 180, f. 1.)
[Figure 25.25 panel labels: probes α-GST, PI(3)P, PI(3,4)P2, PI(4,5)P2, and PC]
In a similar study, Anne-Claude Gavin and colleagues
discovered 589 yeast protein assemblies in 232 distinct
multiprotein complexes. Most interesting was the fact that
these associations could predict new roles for 344 proteins,
including 231 proteins for which no function was previously known. This “guilt by association” technique is a
powerful way to assign functions to unknown proteins.
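The "guilt by association" idea can be sketched in a few lines. The example below is only an illustration under stated assumptions, not the analysis Gavin and colleagues performed: the function name, the protein names, and the simple majority-vote rule are all hypothetical. Each unannotated protein is assigned the most common known function among the other members of its complex.

```python
from collections import Counter


def predict_functions(complexes, known_function):
    """Assign each unannotated protein the majority function of its complex."""
    votes_per_protein = {}
    for members in complexes:
        votes = Counter(known_function[p] for p in members if p in known_function)
        if not votes:
            continue  # no annotated partner, so nothing can be inferred
        top_function = votes.most_common(1)[0][0]
        for protein in members:
            if protein not in known_function:
                votes_per_protein.setdefault(protein, Counter())[top_function] += 1
    return {p: c.most_common(1)[0][0] for p, c in votes_per_protein.items()}


# Hypothetical complexes and annotations, invented for illustration.
complexes = [
    ["P1", "P2", "Unk1"],
    ["P3", "P4", "P5", "Unk2"],
    ["Unk3"],
]
known_function = {
    "P1": "RNA processing",
    "P2": "RNA processing",
    "P3": "transport",
    "P4": "transport",
    "P5": "RNA processing",
}
predictions = predict_functions(complexes, known_function)
print(predictions)
```

A protein with no annotated partners (Unk3 here) receives no prediction, which reflects the limit of the approach: it can only transfer knowledge that some member of the complex already carries.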
Michael Snyder and colleagues have approached the
problem from a different angle. They have used protein
microarrays representing most of the yeast proteome to
determine which yeast proteins (or lipids) bind to each protein in the array. Each tiny spot on the array contained a
yeast protein coupled to glutathione-S-transferase and an
oligohistidine tag. In fact, the proteins were tethered to the
nickel-coated chip through their oligohistidine tags. In one
test of the method (Figure 25.25), Snyder and colleagues
probed the array with a protein or lipid coupled to biotin,
then probed with streptavidin bound to a fluorescent tag.
The streptavidin binds tightly to biotin, and its tag fluoresces green, indicating a positive interaction. The proteins
on the microarray were spotted in duplicate, so true positives should appear as pairs of green spots. Figure 25.25
shows at least one positive interaction in each field.
Calmodulin is a calcium-binding protein that interacts
with many other proteins that require calcium for activity.
The other five probes were liposomes containing biotinylated lipids, most of which are active in intracellular signaling. The arrays were also probed with anti-GST antibody
and a secondary antibody that gave red fluorescence. This
was a control for protein loading; all the proteins were
tagged with GST, so they should all “light up” with the
α-GST antibody.
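The logic of reading such an array (duplicate spots must both light up green, and the red anti-GST signal confirms that protein is present) can be expressed as a small filter. This is a sketch under stated assumptions, not the image analysis used by Snyder and colleagues: the function name, the intensity values, and both cutoffs are hypothetical, and discarding poorly loaded spots is a design choice made here for illustration.

```python
def call_positives(spots, green_cutoff=1000.0, min_red=200.0):
    """Return proteins whose duplicate spots are both loaded and both green.

    spots maps a protein name to two (green, red) intensity pairs,
    one pair for each duplicate spot.
    """
    hits = []
    for protein, duplicates in spots.items():
        loaded = all(red >= min_red for _, red in duplicates)
        bound = all(green >= green_cutoff for green, _ in duplicates)
        if loaded and bound:
            hits.append(protein)
    return sorted(hits)


# Hypothetical intensities (arbitrary units), invented for illustration.
spots = {
    "ProtA": [(5200.0, 900.0), (4800.0, 950.0)],   # both duplicates green
    "ProtB": [(4000.0, 800.0), (150.0, 820.0)],    # only one spot green
    "ProtC": [(90.0, 1000.0), (110.0, 980.0)],     # loaded but no binding
    "ProtD": [(3000.0, 50.0), (3100.0, 40.0)],     # green but poorly loaded
}
positives = call_positives(spots)
print(positives)
```

Requiring both duplicates to pass is what filters out a stray bright spot such as the single green signal for ProtB.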
Figure 25.25 Using a protein microchip to detect protein–protein
and protein–lipid interactions. Snyder and colleagues made protein
microarrays with proteins spotted in duplicate side-by-side and
probed them with an α-GST antibody (first and third rows) or with the
probes listed beneath the second and fourth rows. The α-GST
antibody was in turn detected with a fluorescent probe to yield the red
spots. The intensity of the red fluorescence indicated the amount of
protein in each spot. The probes in the second and fourth rows were
coupled to biotin, which could be detected with streptavidin coupled
to a green fluorescent tag. The probes were calmodulin, a protein
involved in many processes that require calcium, and liposomes
containing the following signaling lipids: phosphatidylinositol(3)phosphate [PI(3)P]; phosphatidylinositol(4,5)bisphosphate [PI(4,5)P2];
phosphatidylinositol(4)phosphate [PI(4)P]; phosphatidylinositol(3,4)bisphosphate [PI(3,4)P2]; and phosphatidylcholine [PC]. Each pair of
green spots corresponds to a protein on the microarray, spotted in
duplicate, that binds to the protein or lipid probe. The red spots
corresponding to the positive (green) spots in rows 2 and 4 are
boxed. (Source: Adapted from Zhu et al., Science 293 (2001) Fig. 2A, p. 2102.)
Some proteins have binding modules for particular peptide sequences in other proteins. For example, SH3 and
WW domains bind to proline-rich peptides, and SH2 domains bind to peptides containing a phosphotyrosine.
Based on this knowledge, Stanley Fields, Charles Boone,
and Gianni Cesareni and colleagues (Tong et al., 2002)
have developed a procedure that meshes experimental and
computational strategies to identify the specific partners of
proteins having these and other peptide-binding domains.
Figure 25.26 Predicted network of protein–protein interactions
involving yeast SH3 domains and their targets. (a) All proteins and
interactions predicted by phage display and searching the yeast
proteome. Proteins are grouped into k-cores in which each protein
makes at least k interactions. For example, a 3-core contains proteins that
make at least 3 interactions. Each protein is color-coded by its k-core value as
follows: 6-cores, black; 5-cores, cyan; 4-cores, blue; 3-cores, red;
2-cores, green; and 1-cores, yellow. The interactions of the 6-core
proteins are represented by red lines. (b) Expansion of the 6-core
network to show interactions with specific proteins. (Source: Adapted
from Tong et al., Science 295 (2002) Fig. 2, p. 322.)
The procedure employs the following four steps: First,
the investigators used a technique called phage display to
discover the consensus sequences recognized by a given
peptide-binding domain. In phage display, the gene or gene
fragment encoding a protein or peptide is cloned into a
phage vector coupled to a phage coat protein gene such
that the protein or peptide will be displayed on the surface
of the recombinant phage. The phages displaying a protein or
peptide that interacts with a second protein can be fished
out with the second protein linked to a resin bead. These
positive phage clones can then be analyzed to see what
protein or peptide they are displaying. These are putative
targets for the second protein.
In this study, Tong and colleagues identified 24 different
SH3 domains in yeast by a ψ-BLAST analysis (see next section) with the oncoprotein Src, which has an SH3 domain,
as the query sequence. Twenty of these SH3 domains could
be expressed as GST-fusion proteins in E. coli, and Tong
and colleagues linked these fusion proteins to resin beads
and screened them against a library of random nonapeptides (peptides of 9 amino acids) displayed on phage surfaces. Each SH3 domain bound preferentially to a subset of
nonapeptides, which yielded a consensus sequence for the
peptide target of each SH3 domain.
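Deriving a consensus from the selected nonapeptides can be pictured as a column-by-column tally. The sketch below is a simplified illustration, not the procedure of Tong and colleagues: it assumes the peptides are already aligned without gaps, the five peptide sequences are invented, and the function name and the 50% cutoff are hypothetical. Positions with no sufficiently common residue are written as "x".

```python
from collections import Counter


def consensus(peptides, min_fraction=0.5):
    """Most common residue per column, or 'x' where no residue dominates."""
    length = len(peptides[0])
    if any(len(p) != length for p in peptides):
        raise ValueError("peptides must be pre-aligned and of equal length")
    result = []
    for i in range(length):
        residue, count = Counter(p[i] for p in peptides).most_common(1)[0]
        result.append(residue if count / len(peptides) >= min_fraction else "x")
    return "".join(result)


# Hypothetical nonapeptides recovered with one SH3 domain (invented).
selected = [
    "RALPPLPRY",
    "RSLPPLPKF",
    "RTLPALPRW",
    "RELPPLPAY",
    "RQLPPLPRL",
]
motif = consensus(selected)
print(motif)
```

In practice the recovered peptides must first be aligned, and weighted profiles are often preferred to a single consensus string because they retain information about the less common residues.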
Second, Tong and colleagues used computational
methods to find the consensus peptide target sequences in
the yeast proteome. This process yielded the protein network shown in Figure 25.26a. It is a network because
many target proteins have SH3 domains of their own that
bind in turn to other targets. The proteins are grouped in
“k-cores,” where each protein has at least k interactions with
other proteins in the same core. For example, the 6-core is a group of proteins,
each of which is predicted to interact with at least
six other proteins. The 6-core is shown in black, with red
connecting lines, in Figure 25.26a, and is expanded in
Figure 25.26b.
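Both parts of this second step, scanning protein sequences for a consensus and extracting k-cores from the resulting network, can be sketched briefly. This is an illustration under stated assumptions rather than the published method: the consensus string, the toy sequences, the toy network, and all function names are hypothetical, and "x" in the consensus is taken to mean any residue. The k-core is found by repeatedly removing proteins that have fewer than k remaining partners.

```python
import re


def consensus_to_regex(consensus):
    """Turn a consensus such as 'RxLPPLPRx' into a regular expression."""
    return re.compile(consensus.replace("x", "."))


def find_targets(pattern, proteome):
    """Names of proteins whose sequence contains the consensus."""
    return sorted(name for name, seq in proteome.items() if pattern.search(seq))


def k_core(edges, k):
    """Proteins that keep at least k partners after repeated pruning."""
    neighbors = {}
    for a, b in edges:
        if a == b:
            continue  # ignore self-interactions in this sketch
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    changed = True
    while changed:
        changed = False
        for node in list(neighbors):
            if len(neighbors[node]) < k:
                for other in neighbors.pop(node):
                    neighbors[other].discard(node)
                changed = True
    return set(neighbors)


# Hypothetical consensus and sequences, invented for illustration.
pattern = consensus_to_regex("RxLPPLPRx")
toy_proteome = {"T1": "MSRALPPLPRYAA", "T2": "MKKKKAAGG"}
targets = find_targets(pattern, toy_proteome)

# Hypothetical network: a triangle (A, B, C) with a tail (C-D, D-E).
toy_edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D"), ("D", "E")]
core2 = k_core(toy_edges, 2)
print(targets, sorted(core2))
```

In the toy network, D and E drop out of the 2-core one after the other, because removing E leaves D with a single partner. This cascading removal is why a k-core is defined by interactions within the core and not by a protein's total number of partners.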
In the third step, Tong and colleagues detected interactions between SH3 domains and target proteins in a different way, using a yeast two-hybrid analysis. Finally, in
the fourth step, they compared the results of the two
methods to find interactions common to both. Of all the
interactions, 59 were detected by both methods and, because they were independently identified by two methods,
it is very likely that the great majority of them are authentic. As a test, Tong and colleagues chose one protein
(Las17) with five different proline-rich domains, which is
predicted to interact with nine different SH3 proteins.
They then verified all of these interactions with direct in
vitro assays. Moreover, the phage display experiments had predicted which of the five proline-rich domains on Las17
would be the preferred target of each of the nine proteins.
With one exception, the in vitro assays proved these predictions correct.
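The fourth step, keeping only the interactions found by both approaches, amounts to a set intersection. The sketch below is a minimal illustration with invented protein names; treating each interaction as an unordered pair (so that A with B equals B with A) is an assumption made here, and the function names are hypothetical.

```python
def as_unordered_pairs(interactions):
    """Represent each interaction as an order-independent pair."""
    return {frozenset(pair) for pair in interactions}


def common_interactions(method_a, method_b):
    """Interactions reported by both methods."""
    return as_unordered_pairs(method_a) & as_unordered_pairs(method_b)


# Hypothetical results from the two methods, invented for illustration.
phage_display = [("Sh3A", "T1"), ("Sh3A", "T2"), ("Sh3B", "T3")]
two_hybrid = [("T1", "Sh3A"), ("Sh3B", "T4"), ("Sh3B", "T3")]

shared = common_interactions(phage_display, two_hybrid)
print(sorted(sorted(pair) for pair in shared))
```

The intersection is smaller than either list, which is the price of higher confidence: an authentic interaction missed by one method is discarded along with the false positives.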
Each of these techniques for measuring protein–
protein interaction is useful, but each has its own problems.
All are subject to false negatives (failure to discover an
authentic interaction) and false positives (detecting an apparent interaction that does not occur in vivo). The best
data will probably come from a combination of different
techniques.
from Tong et al., Science 295 (2002) Fig. 2, p. 322.)