wea25324_ch25_789-826.indd Page 820 820 23/12/10 8:44 AM user-f467 /Volume/204/MHDQ268/wea25324_disk1of1/0073525324/wea25324_pagefile Chapter 25 / Genomics II: Functional Genomics, Proteomics, and Bioinformatics

SUMMARY Most proteins work with other proteins to perform their functions. Several techniques are available to probe these protein–protein interactions. Traditionally, yeast two-hybrid analysis has been done, but now other methods are available. These include protein microarrays, immunoaffinity chromatography followed by mass spectrometry, and combinations of experimental methods such as phage display with computational methods. One of the most useful fruits of such analyses is the discovery of functions for new proteins.

25.3 Bioinformatics

As our databases swell with billions of bases of sequence from the human and other genomes, and countless protein structures and protein–protein interactions, one crucial problem will be to access and manipulate all those data. Accordingly, a new specialty has arisen, known as bioinformatics. Practitioners of bioinformatics must understand both biology and computerized data processing, so they can manage the data collected during genomic and proteomic studies and write programs that allow scientists to use the data. For example, BLAST is a program that searches a database for a DNA or protein sequence similar to a sequence of interest and shows how the two sequences line up, and GRAIL is a program that identifies genes in a database.

Two types of databases are already established. First, we have generalized databases that include DNA and protein sequences from all organisms. Two generalized databases for DNA sequences are GenBank (http://www.ncbi.nlm.nih.gov/) and EMBL (http://www.ebi.ac.uk/embl/). Swissprot (http://www.ebi.ac.uk/swissprot/) is a generalized protein sequence database. Second, we have specialized databases that deal with a particular organism.
For example, FlyBase is a database of the genome of the fruit fly Drosophila melanogaster. You can access it online at http://flybase.bio.indiana.edu:82, and search it for genetic maps, genes, DNA sequences, and other information. A similar site, WormBase, provides the same kind of data for the nematode C. elegans.

The problem, as William Gelbart has pointed out, is that we are functional illiterates in understanding the genomic sequence. He uses a language analogy: We know a few of the “nouns,” or polypeptide coding regions of the genome, but we don’t know the “verbs,” “adjectives,” and “adverbs” that tell when and how much of each gene to express. And we don’t know the “grammar” that tells how polypeptides assemble into complexes to do their jobs, such as catalyzing biochemical pathways. Bioinformatics will supply the databases and annotation that will be needed to understand genomic grammar fully.

SUMMARY Bioinformatics involves the building and use of biological databases, some of which contain the DNA sequences of genomes. Bioinformatics is essential for mining the massive amount of biological data for meaningful knowledge about gene structure and expression.

Finding Regulatory Motifs in Mammalian Genomes

Here is an example of scientists using purely computational tools to discover regulatory motifs in mammalian genomes. In the discovery phase of their study, these scientists did not use test tubes (which would be an in vitro study) or whole cells or animals (an in vivo study). Thus, their work could be described as in silico, in reference to the silicon-based chips in their computers.

Earlier in this chapter, we saw an example of an experimental approach to identifying target sites for some known transcription factors. But what about regulatory sites that interact with molecules nobody has identified yet? In 2005, Eric Lander and Manolis Kellis and their colleagues reported the results of a bioinformatic approach to this question.
They reasoned that regulatory motifs (6–10 bp long) are most likely to be found in the upstream regulatory regions of genes, where transcription factors are likely to bind, and in the 3′-untranslated regions (UTRs) of genes, where miRNAs and other regulatory molecules bind and regulate mRNA stability and translatability. They further reasoned that regulatory motifs are likely to be conserved among related organisms. So they compared the human, mouse, rat, and dog genomes to find conserved sequences in the 5′-flanking regions and in the 3′-UTRs of genes.

These researchers focused on about 17,000 genes in the four species that were well annotated, so there was little doubt that they were real genes. They defined the promoter region of each gene as the noncoding sequence within a 4-kb region centered on the transcription start site, and they defined the 3′-UTR as the region between the translation stop codon and the polyadenylation signal as annotated for each mRNA. As a control, they looked at approximately 123 Mb of sequence from the last two introns in many genes. The terminal introns are thought to be poor in regulatory motifs and so should provide a good negative control.

The authors defined conservation as follows: A “conserved occurrence” is a motif that is absolutely conserved in all four species. The “conservation rate” is the ratio of conserved occurrences of a motif to total occurrences of that motif in the part of the human genome under study (promoter regions, for example). Finally, the “motif conservation score,” or “MCS,” is the number of standard deviations by which the conservation rate of a motif exceeds the conservation rate for random motifs of the same size.

To illustrate conservation, the authors chose the 8-mer TGACCTTG, which is a binding site for the Err-α transcription factor.
This motif occurs 434 times in human promoter regions, of which 162 are conserved occurrences. Thus, the conservation rate is 162/434, or 37%. On the other hand, a random 8-mer has a conservation rate of only 6.8% in promoter regions. Furthermore, the conservation of the 8-mer is specific to promoter regions. In introns, the conservation of this sequence is only 6.2%. Statistical analysis of these and other data allowed the authors to calculate a conservation score. The MCS for the Err-α motif is 25.2 standard deviations, which reflects the very small probability of finding a conservation rate of 37% against a background rate of 6.8%.

To get a more general idea of conservation of regulatory motifs, the authors calculated the MCSs of known transcription factor binding sites from the TRANSFAC database. They found that 63% of these motifs had MCS > 3 and nearly 50% had MCS < 6. So they defined “highly conserved motifs” as those having MCS > 6. The authors listed three reasons why so many known regulatory motifs failed to achieve an MCS as high as 3: They may be erroneously identified; they may not be conserved in the four species studied; or they may not be common enough.

The authors identified 174 highly conserved motifs in promoter regions, of which 59 strongly matched, and 10 weakly matched, previously identified regulatory motifs in TRANSFAC. The other 105 motifs are likely to represent new regulatory elements. If these new motifs are authentic regulatory elements, the genes with which they are associated are likely to show some tissue specificity of expression. That is because genes that are controlled by a common factor are likely to be active in the same tissues. The authors consulted databases listing gene expression data from 75 tissues and found that 86% of the known motifs and 50% of the new motifs were associated with genes whose activity was significantly enriched in one or more tissues.
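The definitions of conserved occurrence, conservation rate, and MCS can be made concrete with a short Python sketch. It is only an illustration under stated assumptions: the four aligned sequences are invented toy data, the function names are hypothetical, and the MCS is computed with a simple binomial z-score, which may differ in detail from the statistics the authors actually used. With the numbers quoted above for the Err-α motif (162 conserved out of 434 occurrences, background rate 6.8%), this approximation gives a value close to the reported 25.2 standard deviations.

```python
import math

def conserved_occurrences(alignment, motif):
    """Count motif occurrences in the human row of a gap-free alignment,
    and how many of them are identical in every species (toy definition)."""
    human = alignment["human"]
    k = len(motif)
    total = conserved = 0
    for i in range(len(human) - k + 1):
        if human[i:i + k] == motif:
            total += 1
            if all(seq[i:i + k] == motif for seq in alignment.values()):
                conserved += 1
    return total, conserved

def motif_conservation_score(conserved, total, background_rate):
    """Binomial z-score (an assumed model): standard deviations by which the
    observed number of conserved occurrences exceeds the random expectation."""
    expected = total * background_rate
    sd = math.sqrt(total * background_rate * (1.0 - background_rate))
    return (conserved - expected) / sd

# Invented toy alignment: the second occurrence is mutated in the rat row.
toy = {
    "human": "AATGACCTTGCCTGACCTTGAA",
    "mouse": "AATGACCTTGCCTGACCTTGAA",
    "rat":   "AATGACCTTGCCTGATCTTGAA",
    "dog":   "GATGACCTTGCCTGACCTTGAT",
}
toy_total, toy_conserved = conserved_occurrences(toy, "TGACCTTG")

# Numbers quoted in the text for the Err-alpha motif in human promoters.
err_rate = 162 / 434
err_mcs = motif_conservation_score(162, 434, 0.068)
print(toy_total, toy_conserved, round(err_rate, 3), round(err_mcs, 1))
```

The point of the z-score form is that a motif must be both well conserved and reasonably common to score highly, which fits the authors' remark that some known motifs may simply not be common enough.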
Another check on authenticity is to see if the elements show positional bias with respect to the transcription start site. In fact, the highly conserved elements showed a strong tendency to cluster within 100 bp of the start site, while random elements were randomly distributed across the 4-kb region analyzed. Taken together, these data demonstrated that most of the identified motifs are likely to be part of authentic regulatory elements.

The authors found 106 highly conserved motifs in the 3′-UTRs of genes. However, because there was no database of 3′-UTR elements similar to TRANSFAC, they had to use other means to check their authenticity. Fortunately, two characteristics stood out. First, the 3′-UTR motifs, unlike those in the promoter region, showed a strong directional bias: they tended to be found on one strand and not the other. This is consistent with the hypothesis that the 3′-UTR motifs act in mRNAs, where they bind to miRNAs and other molecules to regulate mRNA stability or translatability. That is because the motif must be on the correct strand to be transcribed into mRNA. On the other hand, motifs in promoter regions act at the DNA level. They typically bind to activators, which can work in either orientation, so there is no strand bias.

The second characteristic of the highly conserved motifs in the 3′-UTRs of genes is that they had a strong preference for an 8-base length, and for A in the last position. The motifs in the promoter regions showed no such biases. These characteristics are consistent with the hypothesis that the 8-mers are sites for hybridization to miRNAs, which tend to begin with a U, followed by seven bases that are complementary to sites in the mRNAs they regulate.
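The relationship between an 8-mer in a 3′-UTR and a miRNA can be sketched in a few lines of Python. This is a minimal illustration of the description above, not the authors' procedure: the site is taken to be the reverse complement of miRNA bases 2–8 followed by an A, the function names are hypothetical, and the let-7a sequence is supplied here as an example input rather than taken from the text.

```python
# RNA base-pairing table (Watson-Crick pairs only; G-U wobble is ignored).
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

def predicted_site(mirna):
    """Return the 8-base mRNA site (5'->3') implied by a miRNA:
    the reverse complement of miRNA bases 2-8, followed by an A."""
    mirna = mirna.upper().replace("T", "U")
    seed = mirna[1:8]  # bases 2-8 in 1-based numbering
    rev_comp = "".join(COMPLEMENT[base] for base in reversed(seed))
    return rev_comp + "A"

def motif_matches_mirna(motif_dna, mirna):
    """True if a DNA 8-mer from a 3'-UTR equals the site predicted for the miRNA."""
    return motif_dna.upper().replace("T", "U") == predicted_site(mirna)

# Example input (an assumption for illustration): human let-7a.
LET7A = "UGAGGUAGUAGGUUGUAUAGUU"
print(predicted_site(LET7A))                     # CUACCUCA
print(motif_matches_mirna("CTACCTCA", LET7A))    # True
```

Because the site must be read from the mRNA strand, the same 8-mer on the opposite DNA strand would not match, which is one way to see why these motifs show a strand bias.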
The authors were interested in the apparent relationship of the highly conserved 8-mers and miRNAs, so they searched the miRNA registry, which contained 207 different human miRNAs, for matches with the 8-mers, and found that 43.5% of the known human miRNAs matched one of the 8-mers perfectly, while only 2% matched an equal number of control 8-mers. The 8-mers that did not match a known miRNA were evolving faster than those that did, suggesting that the matching 8-mers cannot alter their sequences without impairing hybridization to miRNAs, which is important to gene regulation.

Finally, the authors used the conserved 8-mer motifs to find new miRNA genes. They searched the four genomes for conserved sequences complementary to the highly conserved 8-mer motifs. Then they examined the sequences surrounding the conserved sequences for ability to form the stable stem-loop structures that are characteristic of miRNAs. They found 242 such stable stem-loop structures, which presumably encode miRNAs. Of these, 113 encode known miRNAs, leaving 129 more that encode predicted miRNAs. The authors chose 12 of these at random and checked for expression in pooled adult tissues (the only in vitro experimental part of their work). They found that six were expressed. Thus, many of the 129 predicted miRNA genes probably really do encode miRNAs. This means that many miRNA genes probably remain to be discovered, and the control of gene expression by miRNAs is probably even more widespread than had been believed.

SUMMARY Using computational biology techniques, Lander and Kellis have discovered highly conserved sequence motifs in the promoter regions and 3′-UTRs of four mammalian species, including humans. The motifs in the promoter regions probably represent binding sites for transcription factors. Most of the motifs in the 3′-UTRs probably represent binding sites for miRNAs.
Using the Databases Yourself

Some very useful databases are kept at the National Center for Biotechnology Information (NCBI). In this section we will see a few simple examples of how to access and use the data.

To see how a search works, let us imagine you are a physician treating a patient with warts. Suspecting a viral cause, and even having a candidate virus (papilloma virus) in mind, you excise the warts, homogenize them in a buffer containing a detergent to break open the virus particles, purify the DNA, and perform PCR with primers specific for the candidate virus. You obtain a band, which confirms that the suspected viral DNA is present in the warts. To investigate further the exact strain of the virus, you sequence part of the DNA that you have amplified by PCR and obtain the sequencing gel pictured in Figure 5.19. If you need practice reading a sequencing gel, ignore the sequence below and write down the sequence of the first 21 bases from the gel, beginning with the C at the extreme bottom of the gel. If you want to skip that step, here is the sequence, written in lowercase, which is less ambiguous than capitals because it is harder to confuse a g with a c than a G with a C:

caaaaaacggaccgggtgtac

To begin a search, go to the NCBI home page and click BLAST, or just start at the NCBI BLAST home page: http://www.ncbi.nlm.nih.gov/BLAST/. To search in a nucleotide database, look under the Nucleotide heading and click Nucleotide-nucleotide BLAST [blastn]. The large box near the top asks for the query sequence (the sequence you want to compare to the database). You can type in a sequence, but it is easier to copy and paste a sequence from another document.
When you have finished entering your sequence, you can choose just part of it to search for using the Query subrange boxes. For example, if you wanted to search for residues 10–21 of your sequence, enter 10 in the “From” window and 21 in the “To” window (but we will use all 21 nucleotides in our search). You can also select the database in which you want to search using the top Choose Search Set box. Because we think this is a viral sequence we ignore the human and mouse database options and select Others. The default database under Others is Nucleotide collection (nr/nt), where nr stands for “nonredundant.” This includes all the nucleotide sequences in several different databases and is the most comprehensive of all. To start searching, click the BLAST button near the bottom of the page. You should receive a search status message, including a request ID (RID) number. If you receive your results promptly, you can proceed with your analysis. However, you may have to wait. In that case, remember the RID, so you can log onto the NCBI website later and retrieve the results for that ID number. You will receive your results in several forms. First, you will see a colored bar graph indicating the rough extent of match between your query sequence and various sequences in the database. In this case, blue indicates the best match. You can mouse-over each bar to see the identity of the DNA sequence that matches your query sequence. If you click on a bar you will get more details about the matching sequence. Below the bar graph are the sequences that match the query sequence, within certain limits, which we will discuss later. Each of the matching sequences is identified, with the best match given first and then the others in descending order of closeness of match. Two scores are assigned to each sequence. The first is a bit score (S), which is related to the number of matches between the query sequence and the sequence from the database. 
The larger the bit score, the better the match. For the best match in this case, it is 42.1, which is good. The second score is the expect value (E value). This is the number of matches yielding the corresponding bit score that we would expect to see by chance. Thus, the lower the E value, the better the match. Really good matches give E values much less than 1.0. For the best match in this case it is 0.021.

What is the identity of the database sequence with the best match to your query sequence? The mouse-over on the top bar says it is the human papilloma virus (HPV) type 31, which suggests that this is the strain of virus that caused the warts in your patient. You can get the same information from the short list of matching sequences below the bars. The top black bar corresponds to a Mus musculus (house mouse) gene, which also gives a fairly good match.

Moving down the page in the results, we come to the alignments between the query sequence and the database sequences. You can see that your query sequence matches the HPV 31 sequence perfectly, but matches the mouse gene sequence in only 19 out of 21 positions. However, that apparently minor difference makes a big difference in the E value. The mouse gene has an E value of 0.33, which is significantly less exciting than 0.011.

The query sequence we have entered is only 21 bases long, which is unusually short. To see the effect of increasing the length of the query sequence, either read the sequencing gel further (to 42 bases), or enter the following sequence into a new search: caaaaaacggaccgggtgtacaacttttactatggcgtgaca. The new E value is 5e-14, which is a very small number and indicates that the perfect match in all 42 positions has a high degree of significance.

What if you do not have a DNA sequence of your own to investigate, but you want to use the NCBI database for general information? For example, you may be interested in finding genes that are associated with certain human diseases, such as colon cancer.
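Before moving on, the link between the bit score and the E value in the BLAST results above can be sketched numerically. The relation used here, E = m × n × 2^(−S) for a query of length m, a database of total length n, and a bit score S, is the standard textbook form; the real program applies length corrections that this sketch ignores. The database length below is not a published figure: it is back-calculated from the scores quoted in the text, and the doubled bit score for the 42-base query is an assumption made only to show the trend.

```python
def expect_value(bit_score, query_len, db_len):
    """Expected number of chance hits: E = m * n * 2**(-S)."""
    return query_len * db_len * 2.0 ** (-bit_score)

def implied_db_len(e_value, bit_score, query_len):
    """Solve the same relation for the database length n."""
    return e_value * 2.0 ** bit_score / query_len

# Scores quoted in the text for the 21-base query (best match).
db_len = implied_db_len(0.021, 42.1, 21)

# Assumption: a perfect 42-base match scores about twice as many bits.
e_short = expect_value(42.1, 21, db_len)
e_long = expect_value(2 * 42.1, 42, db_len)
print(f"implied database length: {db_len:.2e}")
print(f"E (21 bases): {e_short:.3g}   E (42 bases, assumed score): {e_long:.3g}")
```

Because the bit score sits in an exponent, doubling the length of a perfect match lowers the E value by many orders of magnitude, which is why the 42-base query gives such a small number.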
To start such a search for disease-associated genes, you could go to the NCBI website and click on Genes and Expression in the menu on the left. Then enter “colon cancer” in the box at top, leave the database option on the default, “all databases,” and click “Search.” The next page asks you to limit your search, so click on “Gene: gene-centered information.” The next page gives you a list of genes associated with colon cancer. The fifth entry is MLH1 (at least it was in December 2010, but the order will change with time). Click on the MLH1 link to receive data on this gene. The summary reveals that MLH1 is the human homolog of the mutL gene in E. coli, which encodes a protein involved in mismatch repair (Chapter 20). The MLH1 gene in humans is also involved in mismatch repair, and MLH1 mutations cause mismatches to build up and, therefore, mutations to accumulate. This presumably predisposes people to develop cancer, and colon cancer in particular, because genes that normally keep cancer in check (tumor suppressor genes) can be inactivated by mutation, and genes that predispose cells to lose control of their growth (oncogenes) can be activated by mutation.

We can also use the NCBI site as a source of information about protein structure. For example, suppose we wanted to see the structure of the p53 protein (a tumor suppressor gene product), whose inactivation is a feature of the majority of human cancers. Go to the NCBI website. In the box at the top, type “p53 complexed with DNA” and click “Go.” You will be back at the Entrez page, from which you selected “Gene” when looking for information about colon cancer. This time, select “Structure.” You will be presented with a list of entries. Scroll down to the structure named “1TUP.” In December 2010, this was entry number 18, but that will change with time. Click on the structure.
This will bring up a page of information about this structure. To see it in 3-D, you will need the appropriate free software. If you already have the Cn3D software on your computer, simply click “Structure view in Cn3D.” If not, click “Download Cn3D.” Once you have installed Cn3D, click the “Structure view in Cn3D” button. You will see a structure based on an x-ray crystallography study of p53 complexed with DNA. The Cn3D software allows you to rotate the structure any way you wish with your mouse. Start with the mouse pointer on the left of the structure. Left click and hold the button down and move the mouse to the right. The structure will rotate from left to right. You can also rotate it from top to bottom, or through any angle in between horizontal and vertical. Rotate it so you can clearly see the interaction between the zinc module and the major groove of the DNA.

You can also look at the 3-D structures of some of the proteins we have studied in previous chapters. For example, look for the structures of GAL4 and the glucocorticoid receptors. In both cases, the rotation will make the structures even clearer than they were in this book.

SUMMARY The NCBI website contains a vast store of biological information, including genomic and proteomic data. You can start with a sequence and discover the gene it belongs to, and compare that sequence with that of similar genes. You can also start with a topic you want to study and query the database for information on that topic. Or you can look up a protein of interest and view the structure of that protein in three dimensions by rotating the structure on your computer screen.

SUMMARY

Functional genomics is the study of the expression of large numbers of genes. One branch of this study is transcriptomics, which is the study of transcriptomes (all the transcripts an organism makes at any given time).
One approach to transcriptomics is to create DNA microarrays or DNA microchips, holding thousands of cDNAs or oligonucleotides, then to hybridize labeled RNAs (or corresponding cDNAs) from cells to these arrays or chips. The intensity of hybridization to each spot reveals the extent of expression of the corresponding gene. Such arrays can be used to analyze the timing and location of expression of many genes at once.

SAGE (serial analysis of gene expression) allows one to determine which genes are expressed in a given tissue and the extent of that expression. Short tags, characteristic of particular genes, are generated from cDNAs and ligated together between linkers. The ligated tags are then sequenced to determine which genes are expressed and how abundantly.

Cap analysis of gene expression (CAGE) gives the same information as SAGE about which genes are expressed, and how abundantly, in a given tissue. However, because it focuses on the 5′-ends of mRNAs, it also allows the identification of transcription start sites and, therefore, helps locate promoters.

High-density whole chromosome transcriptional mapping studies have shown that the majority of sequences in cytoplasmic polyadenylated RNAs derive from non-exon regions of 10 human chromosomes. Furthermore, almost half of the transcription from these same 10 chromosomes is nonpolyadenylated. Taken together, these results indicate that the great majority of stable nuclear and cytoplasmic transcripts of these chromosomes comes from regions outside the exons. This may help to explain the great differences between species, such as humans and chimpanzees, whose exons are almost identical.

Genomic functional profiling can be performed by creating mutants in an organism by replacing genes one at a time with an antibiotic resistance gene flanked by oligomers that serve as a barcode to identify each mutant.
Then the whole group of mutants can be grown together under various conditions to see which mutants disappear most rapidly. Functional profiling can also be done by inactivating genes via RNAi.

Tissue-specific expression profiling can be done by examining the spectrum of mRNAs whose levels are decreased by an exogenous miRNA, and comparing that to the spectrum of expression of genes at the mRNA level in various tissues. If the miRNA in question causes a decrease in the levels of the mRNAs that are naturally low in cells in which the miRNA is expressed, it suggests that the miRNA is at least part of the cause of those natural low levels. This kind of analysis has implicated miR-124 in destabilizing mRNAs in brain tissue, and miR-1 in destabilizing mRNAs in muscle tissue.

Chromatin immunoprecipitation followed by DNA microarray analysis (ChIP-chip analysis) can be used to identify DNA-binding sites for activators and other proteins. In organisms with small genomes, such as yeast, all of the intergenic regions can be included in the microarray. But with large genomes, such as the human genome, that is now impractical. To narrow the field, CpG islands can be used, since they are associated with gene control regions. Also, if the timing or conditions of an activator’s activity are known, the control regions of genes known to be activated at those times, or under those conditions, can be used.

The mouse can be used as a human surrogate in large-scale expression studies that would be ethically impossible to perform on humans. For example, scientists have studied the expression of almost all the mouse orthologs of the genes on human chromosome 21.
They have followed the expression of these genes through various stages of embryonic development and have catalogued the embryonic tissues in which the genes are expressed.

Single-nucleotide polymorphisms can probably account for many genetic conditions caused by single genes, and even multiple genes. They might also be able to predict a person’s response to drugs. A haplotype map with over 10 million SNPs will make it easier to sort out the important SNPs from those with no effect. Structural variation (insertions, deletions, inversions, and other rearrangements of chunks of DNA) is also a surprisingly prominent source of variation in human genomes. Some structural variation can in principle predispose certain people to contract diseases, but some is presumably benign, and some is demonstrably beneficial.

The sum of all proteins produced by an organism is its proteome, and the study of these proteins, even smaller sets of them, is called proteomics. Current research in proteomics requires first that proteins be resolved, sometimes on a massive scale. One of the best tools available for separation of many proteins at once is 2-D gel electrophoresis. After they are separated, proteins must be identified, and the best method for doing that involves digestion of the proteins one by one with proteases and identifying the resulting peptides by mass spectrometry. Someday microchips with antibodies attached may allow analysis of proteins in complex mixtures without separation.

Most proteins work with other proteins to perform their functions. Several techniques are available to probe these protein–protein interactions. Traditionally, yeast two-hybrid analysis has been done, but now other methods are available. These include protein microarrays, immunoaffinity chromatography followed by mass spectrometry, and combinations of experimental methods such as phage display and computational methods.
One of the most useful fruits of such analyses is the discovery of functions for new proteins.

Bioinformatics involves the building and use of biological databases, some of which contain the DNA sequences of genomes. Bioinformatics is essential for mining the massive amount of genomic information for meaningful knowledge about gene structure and expression.

Using computational biology techniques, Lander and Kellis have discovered highly conserved sequence motifs in the promoter regions and 3′-UTRs of four mammalian species, including humans. The motifs in the promoter regions probably represent binding sites for transcription factors. Most of the motifs in the 3′-UTRs probably represent binding sites for miRNAs.

The NCBI website contains a vast store of biological information, including genomic and proteomic data. You can start with a sequence and discover the gene it belongs to, and compare that sequence with that of similar genes. You can also start with a topic you want to study and query the database for information on that topic. Or you can look up a protein of interest and view the structure of that protein in three dimensions by rotating the structure on your computer screen.

REVIEW QUESTIONS

1. Describe the process of making a DNA microchip (oligonucleotide array).
2. Describe a SAGE experiment to measure transcription in cancer cells of a certain type. Show how the production of a ditag works, with actual sequences of your own invention.
3. Explain how the cap-trapper in a CAGE experiment ensures that only full-length cDNAs are captured.
4. Explain the roles of the MmeI, XmaJI, and XbaI restriction sites in the CAGE procedure.
5. Describe how genomic functional profiling can be performed by gene knockout in yeast.
6. Describe how genomic functional profiling can be performed by RNAi in higher eukaryotes.
7. Describe a tissue-specific functional profiling method that shows the effects of miRNAs on gene expression. Give a hypothetical example of positive results.
8. Explain how ChIP-chip analysis works. Show how it can be used to find DNA regions (enhancers) that bind to a particular activator.
9. Explain how tag sequencing (ChIP-seq) works. What problems in ChIP-chip analysis are solved by ChIP-seq?
10. What are cis-regulatory modules (CRMs)? Why are they easier to find than single enhancers?
11. Outline a genomic strategy for finding enhancers that bind to unknown proteins. Describe at least one drawback to this strategy.
12. Ren and colleagues employed ChIP-chip analysis using an anti-TAF1 antibody to locate promoters in human cells. They found that these promoters were not enriched in TATA boxes. Given that TAF1 is part of a transcription factor (TFIID) that binds to TATA boxes, why is it not surprising that many of the promoters identified lack TATA boxes? You will need information from Chapter 11 to answer this question.
13. Describe how in situ expression analysis works. Give a hypothetical example of a positive result.
14. What are SNPs? Why are most of them unimportant? How can some of them be useful? How can they be abused?
15. Compare and contrast in a general way the techniques used in, and information obtained in, transcriptomics and proteomics.
16. Describe a bioinformatic approach to identifying human gene control motifs.
17. Explain how MS/MS analysis can yield the sequence of a protein. Present hypothetical results.
18. Explain how isotope coded affinity tags (ICATs) can enable you to quantify the changes in protein concentration in cells grown under two different conditions.
19. In Figure 25.21, estimate what has happened to the concentrations of peptides 4–7 when cells are shifted from condition 1 (no serum) to condition 2 (+serum).
20. Explain how you would use stable isotope labeling by amino acids in cell culture (SILAC) to quantify the changes in protein concentration in cells grown under two different conditions. Show sample results.
21. How would you measure the absolute concentration of a particular protein in a cell?
22. What do the data in Figure 25.22 tell us about the accuracy of estimating a protein’s concentration from its mRNA’s concentration?
23. What would happen to the gray data points in Figure 25.22 if there were a lower correlation between the abundances of orthologous proteins in the two organisms? What would happen if there were a higher correlation?
24. Describe how affinity tagging and mass spectrometry can be used to examine an organism’s interactome.
25. Explain how a protein microarray could be used to examine an organism’s interactome.
26. Describe an experiment in which you would use phage display to investigate an organism’s interactome.

ANALYTICAL QUESTIONS

1. List in order the steps you would perform to make an oligonucleotide array in which two of the spots contain the dinucleotides AC and AT. You may ignore all of the other spots.
2. Describe a hypothetical experiment using a DNA microarray to measure the transcription from viral genes at two stages of infection of cells by the virus. Present sample results.
3. Perform a BLAST search on the first 20 nt of this sequence (the sequence is divided into blocks of 10 nt): ttaagtgaaa taaagagtga atgaaaaaat aatatcctta. What gene did you identify? What was the best E value you obtained? Now try again with all 40 nt. Did you still retrieve the same gene? What is the best E value you obtained this time? Why is the E value different this time? On what chromosome is this gene located? Is there any relationship between this gene and prostate cancer in men? If so, what is the relationship?
4. You are an MD/PhD developmental biologist (highly trained in techniques of molecular biology) studying the pathogenesis of Type I insulin-dependent diabetes mellitus (IDDM). You have several patients who are predisposed to becoming diabetic, and control subjects who have no family history or predisposition to becoming diabetic, enrolled in a clinical study that will involve the removal of a small section of pancreatic beta cells. You want to analyze the differences in gene expression between cells from these two groups of subjects. Describe the experimental method(s) you would use, and what information you hope to obtain from this study.

SUGGESTED READINGS

General References and Reviews

Abbott, A. 1999. A post-genomic challenge: Learning to read patterns of protein synthesis. Nature 402:715–20.
Cheung, V.G., M. Morley, F. Aguilar, A. Massimi, R. Kucherlapati, and G. Childs. 1999. Making and reading microarrays. Nature Genetics Supplement 21:15–19.
Cox, J. and M. Mann. 2007. Is proteomics the new genomics? Cell 130:395–98.
Hieter, P. and M. Boguski. 1997. Functional genomics: It’s all how you read it. Science 278:601–02.
Kruglyak, L. and D.L. Stern. 2007. An embarrassment of switches. Science 317:758–59.
Kumar, A. and M. Snyder. 2002. Protein complexes take the bait. Nature 415:123–24.
Lipshutz, R.J., S.P.A. Fodor, T.R. Gingeras, and D.J. Lockhart. 1999. High density synthetic oligonucleotide arrays. Nature Genetics Supplement 21:20–24.
Marx, J. 2006. A clearer view of macular degeneration. Science 311:1704–05.
Service, R.F. 1998. Microchip arrays put DNA on the spot. Science 282:396–99.
Young, R.A. 2000. Biomedical discovery with DNA arrays. Cell 102:9–15.

Research Articles

Arbeitman, M.N., E.E.M. Furlong, F. Imam, E. Johnson, B.H. Null, B.S.
Baker, M.A. Krasnow, M.P. Scott, R.W. Davis, and K.P. White. 2002. Gene expression during the life cycle of Drosophila melanogaster. Science 297:2270–75.
Blanchette, M., A.R. Bataille, X. Chen, C. Poitras, J. Laganière, G. Deblois, V. Giguère, V. Ferretti, D. Bergeron, B. Coulombe, and F. Robert. 2006. Genome-wide computational prediction of transcriptional regulatory modules reveals new insights into human gene expression. Genome Research 16:656–68.
Cheng, J., T.R. Gingeras, et al. 2005. Transcriptional maps of 10 human chromosomes at 5-nucleotide resolution. Science 308:1149–54.
Gavin, A.-C., M. Bösche, R. Krause, P. Grandi, M. Marzioch, A. Bauer, et al. 2002. Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature 415:141–47.
Giaever, G., A.M. Chu, L. Ni, C. Connelly, L. Riles, S. Véronneau, et al. 2002. Functional profiling of the Saccharomyces cerevisiae genome. Nature 418:387–91.
Gygi, S.P., B. Rist, S.A. Gerber, F. Turecek, M.H. Gelb, and R. Aebersold. 1999. Quantitative analysis of complex protein mixtures using isotope-coded affinity tags. Nature Biotechnology 17:994–99.
Ho, Y., A. Gruhler, A. Heilbut, G.D. Bader, L. Moore, S.L. Adams, D. Figeys, et al. 2002. Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature 415:180–83.
Iyer, V.R., M.B. Eisen, D.T. Ross, G. Schuler, T. Moore, J. Lee, et al. 1999. The transcriptional program in the response of human fibroblasts to serum. Science 283:83–87.
Lim, L.P., N.C. Lau, P. Garrett-Engele, A. Grimson, J.M. Schelter, J. Castle, D.P. Bartel, P.S. Linsley, and J.M. Johnson. 2005. Microarray analysis shows that some microRNAs downregulate large numbers of target mRNAs. Nature 433:769–73.
Pennacchio, L.A., et al. 2006. In vivo enhancer analysis of human conserved non-coding sequences. Nature 444:499–502.
Ren, B., F. Robert, J.J. Wyrick, O. Aparicio, E.G. Jennings, I. Simon, et al. 2000. Genome-wide location and function of DNA binding proteins. Science 290:2306–09.
Sönnichsen, B., et al. 2005. Full-genome RNAi profiling of early embryogenesis in Caenorhabditis elegans. Nature 434:462–69.
Stelzl, U., et al. 2005. A human protein–protein interaction network: A resource for annotating the proteome. Cell 122:957–68.
Tong, A.H., B. Drees, G. Nardelli, G.D. Bader, B. Brannetti, L. Castagnoli, et al. 2002. A combined experimental and computational strategy to define protein interaction networks for peptide recognition modules. Science 295:321–24.
Velculescu, V.E., L. Zhang, B. Vogelstein, and K.W. Kinzler. 1995. Serial analysis of gene expression. Science 270:484–87.
Wang, D.G., J.B. Fan, C.J. Siao, A. Berno, P. Young, R. Sapolsky, et al. 1998. Large-scale identification, mapping, and genotyping of single-nucleotide polymorphisms in the human genome. Science 280:1077–82.
Xie, X., J. Lu, E.J. Kulbokas, T.R. Golub, V. Mootha, K. Lindblad-Toh, E.S. Lander, and M. Kellis. 2005. Systematic discovery of regulatory motifs in human promoters and 3′ UTRs by comparison of several mammals. Nature 434:338–45.
Zhu, H., M. Bilgin, R. Bangham, D. Hall, A. Casamayor, P. Bertone, et al. 2001. Global analysis of protein activities using proteome chips. Science 293:2101–05.