95 251 Functional Genomics Gene Expression on a Genomic Scale
by taratuta
Chapter 25 / Genomics II: Functional Genomics, Proteomics, and Bioinformatics

25.1 Functional Genomics: Gene Expression on a Genomic Scale

First of all, one can focus on expression of genomes at the RNA level. If we consider all the transcripts an organism makes at any given time, we call that the organism’s transcriptome, by analogy with the term “genome,” which refers to all the genes in an organism. Functional genomics studies that measure the levels of RNAs produced from many genes at a time are part of a field called transcriptomics. Second, one can use genomic information to try to determine the pattern of expression of all the genes in an organism at all stages of the organism’s life. This kind of analysis is called genomic functional profiling.

Third, one can compare many individuals’ genomes to find significant differences. For example, differences in single nucleotides are called single-nucleotide polymorphisms (SNPs). Sometimes these SNPs are associated with genetic disorders or other, less dramatic characteristics, such as susceptibilities to drugs. But SNPs are not the only common differences among human genomes. The more geneticists look, the more they find major chromosomal structural variations, such as inversions, duplications, and deletions. Moreover, at least some of these variations appear to have important consequences. For example, one long inversion has been found to be common in Europeans, but not in Africans and Asians, and women with this inversion have more children than those without it. Thus, the inversion seems to provide an evolutionary advantage. Finally, one can study the structures and functions of the protein products of genomes.
To the extent that it focuses on protein structure, this latter enterprise can be called structural genomics, but the whole endeavor is called proteomics, and will be the subject of a later section of this chapter. In this section, we will consider transcriptomics, genomic functional profiling, and SNPs.

Transcriptomics

To discover the pattern of expression of a gene in a given tissue over time, one can perform a dot blot analysis as described in Chapter 5. In a classical dot blot, one makes spots a few mm in diameter containing a single-stranded DNA from the gene in question on filters and then hybridizes these dot blots to labeled RNAs made in the tissue in question at different times. But suppose one wants to know the pattern of expression of all the genes in that tissue over time. In principle, one could make a large dot blot with tens of thousands of single-stranded DNAs corresponding to all the potential mRNAs in a cell and hybridize labeled cellular RNAs to that monster dot blot. But the sheer size of that blot would present a serious problem. Fortunately, molecular biologists have devised some methods to miniaturize such blots, and some novel methods to analyze the expression of whole genomes. We will look first at DNA arrays and gene microchips, and then at a more exotic method.

Figure 25.1 Schematic diagram of a DNA microarray. This drawing represents a standard, 1″ × 3″ (25.4 mm × 76.2 mm) glass microscope slide with an array of 7500 tiny spots of DNA. Each dot is 200 μm in diameter, and the distance between the dot centers is 400 μm. This is by no means the highest density of spots presently attainable. It is actually possible to place more than 50,000 spots on a slide of this size. (Source: Adapted from Cheung, V.G., M. Morley, F. Aguilar, A. Massimi, R. Kucherlapati, and G. Childs, Making and reading microarrays. Nature Genetics Supplement Vol. 21 (1999) f. 2, p. 17.)
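As a rough check on the spot densities quoted for Figure 25.1, the following sketch counts how many spot centers fit on a slide at a given center-to-center spacing. It assumes an idealized square grid that uses the entire 25.4 mm × 76.2 mm surface with no margins, which a real array would not do, and the name `grid_capacity` is purely illustrative.

```python
# Idealized spot-count estimate for a microarray slide.
# Assumption: a square grid covering the whole slide with no margins.
# All dimensions are in micrometers so the arithmetic stays in integers.
SLIDE_WIDTH_UM = 25_400   # 1 inch
SLIDE_LENGTH_UM = 76_200  # 3 inches


def grid_capacity(pitch_um: int) -> int:
    """Number of grid positions at the given center-to-center spacing."""
    return (SLIDE_WIDTH_UM // pitch_um) * (SLIDE_LENGTH_UM // pitch_um)


if __name__ == "__main__":
    for pitch in (400, 250, 200):
        print(f"{pitch} um spacing: up to {grid_capacity(pitch)} spots")
```

Under this idealization, the 400 μm spacing of Figure 25.1 allows 11,970 positions, so 7500 spots fit comfortably, while exceeding 50,000 spots would need a spacing somewhat tighter than 200 μm.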
DNA Microarrays and Microchips

To circumvent the problem of size, molecular biologists have adapted inkjet printer technology to spot tiny volumes of DNA on a chip, so the dots of the dot blot are very small. This allows many different DNAs to be spotted on one chip, called a DNA microarray. One system, developed by Vivian Cheung and colleagues, uses a robot with 12 parallel pens, each of which can squirt out a tiny volume of DNA solution: 0.25–1.0 nL (billionths of a liter). The spots are exquisitely small, only 100–150 μm in diameter, and the centers of the spots are only 200–250 μm apart. The result looks like the schematic diagram in Figure 25.1, but even better, as the figure represents a DNA microarray with only 7500 DNA spots on a common microscope slide. After spotting, the DNAs are air dried, and covalently attached by ultraviolet radiation to a thin silane layer on top of the glass.

Another strategy for reducing the size of a blot has been to synthesize many oligonucleotides simultaneously, right on the surface of a chip. Steven Fodor and his colleagues pioneered this method in 1991, using the same kind of photolithographic techniques employed in computer chip manufacture, to build short DNAs (oligonucleotides) on tiny, closely spaced spots on a small glass microchip. In a 1999 version of this technique (Figure 25.2), these workers started with a small glass slide coated with a synthetic linker that was blocked with a photoreactive group that can be removed by light. They masked some of the areas of the slide and illuminated it, so the blocking agent was removed only from the unmasked areas. Then they added a nucleotide (also blocked with a photoreactive group) and chemically coupled it to all the areas of the slide that had been unblocked in the previous step. The result: A nucleotide was attached to a subset of the tiny spots on the chip.
Next, they masked a different subset of spots, illuminated the others to remove the blocking groups, and attached another nucleotide. On the spots that were unmasked in both steps, dinucleotides were formed. By repeating this process, they could build up different oligonucleotides on each spot.

Figure 25.2 Growing oligonucleotides on a glass substrate. The glass is coated with a reactive group that is blocked with a photosensitive agent (red). This blocking agent can be removed with light, but parts of the plate are masked (blue) so the light cannot get through. In the first cycle, four of the six spots pictured are masked, so the light reaches only two unmasked spots and removes the blocking agent. Then a blocked guanosine nucleotide is chemically coupled to the unblocked spots. In the second cycle, three spots are masked, and the other three are therefore exposed to the light. This removes the blocking agent from three spots, including the first one, which already has a G attached. Thus, after a blocked adenosine nucleotide is chemically coupled to the three unblocked spots, the first spot has a G–A dinucleotide, the third and sixth spots have an A mononucleotide, the fourth has a G mononucleotide, and the second and fifth spots, which were masked in both cycles, have no nucleotides attached yet. As the cycle is repeated over and over with different masking patterns and different nucleotides, unique oligonucleotides are built up in each spot.
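The two cycles described for Figure 25.2 can be mimicked with a toy model in which each spot is a string and each cycle appends one nucleotide to the unmasked spots only. This is a minimal sketch of the bookkeeping, not of the chemistry; the function names and the 1-based spot numbering are illustrative assumptions.

```python
# Toy model of light-directed oligonucleotide synthesis (Figure 25.2).
# A cycle deprotects the unmasked spots with light, then couples one
# nucleotide to exactly those spots. Names here are illustrative only.


def run_cycle(spots: list[str], unmasked: set[int], nucleotide: str) -> list[str]:
    """Return the spots after one cycle; `unmasked` holds 1-based spot numbers."""
    return [
        oligo + nucleotide if number in unmasked else oligo
        for number, oligo in enumerate(spots, start=1)
    ]


def synthesize(n_spots: int, cycles: list[tuple[set[int], str]]) -> list[str]:
    """Apply a series of (unmasked spots, nucleotide) cycles to empty spots."""
    spots = [""] * n_spots
    for unmasked, nucleotide in cycles:
        spots = run_cycle(spots, unmasked, nucleotide)
    return spots


if __name__ == "__main__":
    # Cycle 1 exposes spots 1 and 4 and couples G;
    # cycle 2 exposes spots 1, 3, and 6 and couples A.
    figure_cycles = [({1, 4}, "G"), ({1, 3, 6}, "A")]
    print(synthesize(6, figure_cycles))
```

The result matches the figure legend: spot 1 carries G–A, spots 3 and 6 carry A, spot 4 carries G, and spots 2 and 5 remain empty.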
The resulting chip is known as a DNA microchip or oligonucleotide array, although these terms and “DNA microarray” are often used interchangeably. In fact, the generic term “microarray” can be used to refer to any kind of DNA or oligonucleotide microarray. The technology is so miniaturized that about 300,000 oligonucleotides can be built on a chip only 1.28 × 1.28 cm (about ½″ square). And the process is so efficient that a set of $4^n$ different oligonucleotides can be built in only $4 \times n$ cycles. So if our goal is to generate all the possible 9-mers ($4^9$, or about 260,000 different oligonucleotides), we can do it in only $4 \times 9 = 36$ cycles.

How long must an oligonucleotide be to uniquely identify one human gene product in a mixture of all the others? Knowing the sequence of the human genome helps us answer this question with great accuracy. However, even without that information, we can do a calculation to give us a minimum estimate. A given sequence of $n$ bases will occur in a DNA about every $4^n$ bases. In other words, a DNA sequence needs to be $n$ bases long to occur about once in a DNA $4^n$ bases long. Thus, we need to solve the following equation for $n$ to find the minimum size of an oligonucleotide we would expect to find only once in the whole human genome, which may be as much as $3.5 \times 10^9$ bases long:

$$4^n = 3.5 \times 10^9$$

The answer is that if $n = 16$, $4^n > 3.5 \times 10^9$ (since $4^{16} \approx 4.3 \times 10^9$, whereas $4^{15} \approx 1.1 \times 10^9$). So our oligonucleotides need to be at least 16 bases long, and that would require $4 \times 16 = 64$ cycles to build them all on an oligonucleotide array. Again, however, this is a minimum estimate, so it would be a good idea to start with longer oligonucleotides to be reasonably sure that they occur only once in the human genome and therefore uniquely identify human genes.

Even before the publication of the sequence of the first human chromosome, scientists at Affymetrix, Inc. were already producing microchips containing 25-mers designed to recognize single genes. They based their design on the sequence that was available, including the many ESTs already in the database.
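The minimum-length estimate and the cycle count can be checked numerically. The sketch below finds the smallest $n$ with $4^n$ larger than a genome size, and simulates a masking schedule to confirm that $4 \times n$ cycles are enough to build every $n$-mer. It assumes a random-sequence model of the genome, as the estimate itself does; the function names and the particular schedule (one cycle per base per position) are illustrative assumptions rather than a description of any manufacturer's process.

```python
# Minimum oligonucleotide length for expected uniqueness, plus a toy check
# that 4 * n masking cycles suffice to build all 4**n n-mers.
from itertools import product

BASES = "ACGT"


def min_unique_length(genome_size: int) -> int:
    """Smallest n such that 4**n exceeds genome_size (random-sequence model)."""
    n = 0
    while 4 ** n <= genome_size:
        n += 1
    return n


def build_all_nmers(n: int) -> tuple[list[str], int]:
    """Simulate synthesis of every n-mer; return (oligos, cycles used)."""
    targets = ["".join(p) for p in product(BASES, repeat=n)]
    spots = [""] * len(targets)
    cycles = 0
    for position in range(n):
        for base in BASES:
            # One cycle: unmask only the spots whose next base is `base`.
            spots = [
                oligo + base if target[position] == base else oligo
                for oligo, target in zip(spots, targets)
            ]
            cycles += 1
    return spots, cycles


if __name__ == "__main__":
    n = min_unique_length(3_500_000_000)
    print(f"minimum length: {n} bases, cycles needed: {4 * n}")
    oligos, cycles = build_all_nmers(3)
    print(f"built {len(set(oligos))} distinct 3-mers in {cycles} cycles")
```

For a genome of $3.5 \times 10^9$ bases this gives 16 bases and 64 cycles, in agreement with the text.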
To enhance the reliability of their chips, the Affymetrix scientists included multiple oligonucleotides designed to hybridize to single transcripts, so the results obtained with each of these oligonucleotides could be checked against one another.

The oligonucleotides on a microchip or the cDNAs on a microarray can be hybridized to labeled RNA isolated from cells (or to corresponding cDNAs) to see which genes in the cell were being transcribed. For example, consider a study by Patrick Brown and colleagues in which they used the DNA microarray technique to examine the effect of serum on the RNAs made by a human cell. They isolated RNA from cells grown in the presence and absence of serum, then reverse transcribed the two RNA samples in the presence of nucleotides tagged with fluorescent dyes, so the cDNA products would be labeled with the fluorescent tags. They used a green-fluorescing nucleotide to label the cDNA from serum-deprived cells, and a red-fluorescing nucleotide to label the cDNA from serum-stimulated human cells. Then they mixed the cDNAs, hybridized them to DNA microarrays containing unlabeled cDNAs corresponding to 8613 different human genes, and detected the resulting fluorescence. Figure 25.3 shows the same region of the microarray from triplicate hybridizations. The red spots correspond to genes that are turned on by serum, and the green spots represent genes that are active in serum-deprived cells. The yellow spots result from hybridization of both probes to the same spot (the green and red fluorescence together produce a yellow color). Thus, the yellow spots correspond to genes that are active in both the presence and absence of serum.

Figure 25.3 Using a DNA chip. Brown and colleagues made cDNAs from RNAs from serum-starved and serum-stimulated human cells. They labeled the cDNAs corresponding to RNAs from serum-starved cells with a green fluorescent nucleotide; they labeled the cDNAs corresponding to RNAs from serum-stimulated cells with a red fluorescent nucleotide. Then they hybridized these fluorescent cDNAs together to DNA chips containing cDNAs corresponding to over 8600 human genes. The figure shows the same part of the DNA chip from three different hybridizations. The red spots (e.g., spots 2 and 4) correspond to genes that are more active in the presence of serum. The green spots (e.g., spot 3) correspond to genes that are more active in the absence of serum. The yellow spots (e.g., spot 1) correspond to genes that are roughly equally active in the presence or absence of serum. (Source: Iyer, V.R., M.B. Eisen, D.T. Ross, G. Schuler, T. Moore, J.C. Lee, et al., The transcriptional program in the response of human fibroblasts to serum. Science 283 (1 Jan 1999) f. 1, p. 83. Copyright © AAAS.)

Microarrays allow one to examine changes in gene expression in systems much more complex than the one we have just described. For example, our knowledge of the complete yeast genome sequence has enabled molecular biologists to use DNA chips to analyze the expression of every yeast gene at once, under a variety of conditions. In another example, Kevin White and colleagues used DNA chips in 2002 to follow the expression of 4028 Drosophila genes during 66 distinct periods throughout the fly’s life cycle. Figure 25.4a shows the 66 developmental stages at which RNAs were collected for gene expression analysis. Notice that almost half (30) of these time points were in the embryonic phase of development, in which gene expression changes most rapidly. In fact, early in the embryonic phase, when gene expression is most dynamic, RNAs were collected every half-hour. This analysis yielded several conclusions:

■ A large number of genes (3219) experienced a substantial change in expression (four-fold or more) during the fly’s life cycle. Figure 25.4b shows all of these developmentally regulated genes, ordered by time of onset of the first increase in expression. That is, the topmost genes in the figure were stimulated earliest in the life cycle, and the bottommost genes were stimulated last.

■ More than 88% of the developmentally regulated genes are active during the first 20 h of development, which is before the end of the embryonic phase (see Figure 25.4c).

■ RNAs from about 33% of the developmentally regulated genes are already present at the very earliest time point (Figure 25.4c). These represent maternal genes, or maternal effect genes, those that are expressed during oogenesis in the mother. Thus, the maturing oocyte either transcribes these genes or receives their transcripts from surrounding nurse cells so the mRNAs are already present in the egg and are available for translation as soon as fertilization occurs.

■ As illustrated in Figure 25.4d, expression of some genes is maintained throughout the life cycle, whereas expression of others peaks and declines. In particular, as further illustrated in Figure 25.4e, genes that reach peak expression during early embryonic life tend to peak again in early pupal development, whereas genes that peak in the late embryonic phase tend to achieve another peak in late pupal development. A related phenomenon, not illustrated here, is that genes that peak in larval development tend to reach another peak of expression during adult life.

■ Genes encoding components of a given supramolecular complex tended to be coexpressed. Thus, the genes encoding the ribosomal proteins tended to be regulated coordinately, as did the genes encoding the proteins in the mitochondrion.

■ Genes encoding proteins with related functions tended to be coexpressed, even if the proteins did not form complexes. Thus, genes encoding transcription factors, or cell cycle regulators, tended to be expressed together.

■ Coexpression of some genes was tissue-specific. For example, one cluster of 23 coregulated genes included eight genes that were already known to be expressed in muscle cells. Upon further examination, the control regions of 15 of the genes in this cluster had pairs of binding sites for the transcription factor dMEF2, which is known to activate genes in differentiating muscle cells. Seven of the genes in the cluster had unknown function, and six of these had dMEF2-binding sites and were expressed in differentiating muscle. Thus, this analysis allowed White and colleagues to assign a function in muscle differentiation to these six unknown genes.

This is important because it is very difficult to determine the function of genes based solely on their sequences. The additional clues about timing and location of expression are a tremendous help. Indeed, they allowed White and colleagues to assign functions to 53% of the genes they analyzed.

[Figure 25.4, panels (a) to (c): timeline of RNA collections from fertilization through the embryo, larva, pupa, and adult stages; gene expression profiles with a fold-induction color scale running from <0.25 to >4; fraction of genes used plotted against developmental time in hours (inset) and in days.]

Figure 25.4 Patterns of expression of Drosophila genes during development. (a) Outline of RNA collection periods. White and colleagues collected RNAs from whole animals at the indicated times during development (E, embryonic; L, larval; P, pupal; A, the first 40 days of the adult phase). The embryonic period is expanded to show all of the overlapping collection periods. They purified poly(A)+ RNA by oligo(dT)-cellulose chromatography and made fluorescent cDNAs by reverse transcribing the poly(A)+ RNAs in the presence of a fluorescent nucleotide. Then they hybridized the fluorescent cDNA from a given time point to a microarray and measured the extent of hybridization. They normalized all such hybridization values against the extent of hybridization of a reference standard cDNA prepared from a mixture of RNAs from all phases of the life cycle. (b) Gene expression profiles. The profiles of 3219 genes whose expression levels changed by more than four-fold during the fly life cycle are arranged in order of the onset of the first increase in abundance of transcript. The developmental phase is indicated at top, with the same abbreviations and color coding as in (a).
The expression level is indicated by color, as shown at bottom: blue stands for low expression and yellow stands for high expression. (c) Graphic representation of the cumulative fraction of genes that have shown a strong increase in expression. Note that a large fraction (about 33%) of genes are already represented by a large amount of RNA at the earliest time point. These are labeled maternal genes. The inset is an expansion of the first 20 h of the embryonic phase, which also shows the large proportion of transcripts already present in the first hour of development. (d) Expression patterns of four selected genes. At upper left, gene CG5958 shows an induction in early embryonic phase to a high level that is largely maintained throughout the life cycle. At upper right, the Amalgam gene shows an induction in the early embryonic phase, a decrease in the larval phase, and a reinduction at the boundary between the larval and pupal stages. At lower left, gene CG1733 shows a distinct peak of expression at the larval–pupal boundary. At lower right, gene CG17814 shows one burst of induction that begins in the late embryonic phase and lasts through the larval phase, and a reinduction in the late pupal phase. (e) Reinduction patterns. The percent of genes expressed either early (blue) or late (red) in the embryonic phase that show a reinduction at the given times later in development. Note that the genes expressed in early embryogenesis tend to be reinduced in the early pupal stage (P1, bracket over blue bar), whereas the genes expressed in late embryogenesis tend to be reinduced in the late pupal stage (P3, bracket over red bar). (Source: Adapted from Arbeitman et al., Science 297, 2002. Fig. 1, p. 2271. © 2002 by the AAAS.)

[Figure 25.4, panels (d) and (e): fold change plotted across the E, L, P, and A phases for CG5958 (induced and maintained), Amalgam (early embryo/early pupa), CG1733 (transiently induced), and CG17814 (late embryo/late pupa); percent of embryonic genes with a second peak in each developmental interval (L1 to A3), for genes whose first embryonic peak is early (0–3 h) or late (9–19 h).]

SUMMARY Functional genomics is the study of the expression of large numbers of genes. One branch of this study is transcriptomics, which is the study of transcriptomes, that is, all the transcripts an organism makes at any given time. One approach to transcriptomics is to create DNA microarrays or DNA microchips, holding thousands of cDNAs or oligonucleotides, then to hybridize labeled RNAs (or corresponding cDNAs) from cells to these arrays or chips. The intensity of hybridization to each spot reveals the extent of expression of the corresponding gene. With a microarray one can canvass the expression patterns (both temporal and spatial) of many genes at once. The clustering of expression of genes in time and space suggests that the products of these genes collaborate in some process. This can give clues about the functions of genes of unknown function if the unknown gene is expressed together with one or more well-studied genes.

Serial Analysis of Gene Expression

In 1995, Victor Velculescu, working with Kenneth Kinzler and colleagues, developed a novel method of analyzing the range of genes expressed in a given cell. They called this method serial analysis of gene expression (SAGE). The underlying strategy of SAGE is to synthesize short cDNAs, or tags, from all the mRNAs in a cell, and then link these tags together in clones that can be sequenced to learn the nature of the tags, and therefore the nature of the genes expressed in the cell, and the extent of expression of each gene. Figure 25.5 shows how Velculescu and colleagues carried out this strategy.
First, they used a biotinylated oligo(dT) primer to prime reverse transcription of the mRNAs present in human pancreatic tissue, yielding double-stranded cDNAs. The goal was to reduce the size of the cDNAs to short tags that could be ligated together and sequenced readily. Because of the shortness of the tags (9 bp in the example in Figure 25.5), it is important to confine them to a small region of the cDNAs to increase the chance that they will uniquely identify one cDNA. To begin the shortening process, Velculescu and colleagues cleaved the biotinylated cDNAs with an anchoring enzyme (AE) to chop off a short 3′-terminal fragment. They chose as their anchoring enzyme NlaIII, which recognizes 4-base restriction sites and therefore yields fragments averaging 250 bp long. They bound these biotinylated 3′-fragments to streptavidin beads, which bind biotin.

Next, they divided the bead-bound cDNA fragments into two pools and ligated one pool to a linker (Y) and the other pool to a second linker (Z). Both linkers contained the recognition site for a type IIS restriction endonuclease (the tagging enzyme [TE]) that cuts at a defined distance, up to 20 bp, downstream of this recognition site. The result of cleavage of the cDNA fragments with the tagging enzyme FokI was a set of short fragments, each containing the linker (Y or Z) followed by the 4-bp anchoring enzyme site, followed by 9 bp from the cDNA. That 9-bp piece of cDNA is the tag. If the tagging enzyme leaves overhangs, these can be filled in to yield blunt ends.

[Figure 25.5, panels (a) to (f): (a) synthesize double-stranded cDNAs using a biotinylated oligo(dT) primer; (b) cleave with anchoring enzyme (AE) and bind 3′-terminal fragments to streptavidin beads; (c) divide in half and ligate to linkers (Y and Z); (d) cleave with tagging enzyme (TE) and blunt the ends; (e) ligate and amplify by PCR with primers Y and Z; (f) cleave with anchoring enzyme, isolate ditags, join together, and clone. In the example sequences, the TE site is GGATG, the AE site is CATG, and the two tags are CATCATCAT and GAGGAGGAG.]

Figure 25.5 Serial analysis of gene expression (SAGE). (a) Double-stranded cDNAs are formed from cellular mRNAs, using biotinylated oligo(dT) to prime first-strand cDNA synthesis. Orange balls represent biotin. (b) Biotinylated cDNAs are cleaved with an anchoring enzyme (AE, NlaIII in this case), and the biotinylated 3′-end fragments are bound to streptavidin beads (blue). (c) The bead-bound fragments are divided into two pools; the fragments in one pool are ligated to linker Y (blue) and the fragments in the other pool are ligated to linker Z (pink). (d) The fragments are cleaved with the tagging enzyme (TE), and ends are filled in if necessary to create blunt ends. In this case, the tagging enzyme is FokI, which leaves 9-bp tags attached to the linkers. The tag attached to linker Y is represented by the arbitrary sequence CATCATCAT and its complement highlighted in yellow, and the tag attached to linker Z is represented by the arbitrary sequence GAGGAGGAG and its complement (light purple highlight). (e) Tag-containing fragments are blunt-end-ligated together and amplified by PCR with primers that hybridize to primer Y and primer Z regions in each linker. Only fragments ligated with tags joined tail to tail (ditags) will be amplified by PCR. (f) The amplified ditag-containing fragments are cleaved with the anchoring enzyme to yield ditags with sticky ends. The ditags are ligated together to form concatemers, which are cloned. Part of a concatemer of ditags is shown, with the 4-base recognition sites for the anchoring enzyme shown in green. Note that these 4-base sites set off each ditag so it can be recognized easily. The clones are then sequenced to discover which tags are represented, and in what quantity. This tells which genes are expressed, and how actively. (Source: Adapted from Velculescu, V.E., L. Zhang, B. Vogelstein, and K.W. Kinzler, Serial analysis of gene expression. Science 270:484, 1995.)

Velculescu and colleagues’ next task was to ligate the tags together, along with defined DNA so they could tell where one tag left off and another began. To do this, they blunt-end-ligated the tagged fragments together to form fragments with two tags abutting each other in the middle (forming a ditag) and linkers on each end. The linkers contain sites that are complementary to a pair of primers that can be used to amplify the whole fragment by PCR. After the PCR amplification, Velculescu and colleagues cleaved the products with the anchoring enzyme, ligated these restriction fragments together, and cloned the products. Now the ditags can be easily identified because each one is flanked by the 4-bp anchoring enzyme recognition sites. And, of course, half of each ditag belongs to one tag, and half to the other. Clones with at least 10 tags (some had more than 50) can be identified by PCR analysis and sequenced.
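The final read-out step, recovering tags from a sequenced concatemer and counting them, can be sketched as follows. The sketch splits the sequence at the CATG anchoring sites, keeps fragments of exactly two tag lengths, and treats the first half of each ditag as one tag and the reverse complement of the second half as the other. Reverse-complementing the second half is an assumption made here because tags joined tail to tail lie on opposite strands; the schematic sequences in Figure 25.5 are arbitrary and do not show this. The function names are illustrative, and a tag that happens to contain CATG would be missed by this simple splitting.

```python
# Sketch: count SAGE tags in a sequenced concatemer of ditags.
# Assumptions: 9-bp tags, CATG anchoring sites between ditags, and the
# second tag of each ditag read as a reverse complement.
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def count_tags(concatemer: str, anchor: str = "CATG", tag_len: int = 9) -> Counter:
    """Split a concatemer at anchoring sites and tally the tags found."""
    counts: Counter = Counter()
    for ditag in concatemer.split(anchor):
        if len(ditag) != 2 * tag_len:
            continue  # skip empty flanks and fragments of unexpected length
        counts[ditag[:tag_len]] += 1
        counts[reverse_complement(ditag[tag_len:])] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical concatemer built from the figure's arbitrary tags and
    # the two pancreatic tags quoted in the text.
    ditag_1 = "CATCATCAT" + reverse_complement("GAGGAGGAG")
    ditag_2 = "GAGCACACC" + reverse_complement("TTCTGTGTG")
    concatemer = "CATG" + "CATG".join([ditag_1, ditag_2, ditag_1]) + "CATG"
    for tag, n in count_tags(concatemer).most_common():
        print(tag, n)
```

Tags that appear more often in the tally correspond to more abundant transcripts, which is the quantitative read-out SAGE relies on.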
If enough clones are sequenced, we can get an idea of the range of genes expressed, and tags that show up repeatedly indicate genes that are very actively expressed.

Velculescu and colleagues’ examination of expression in the human pancreas by SAGE had predictable, and therefore encouraging, results. The most common tags (GAGCACACC and TTCTGTGTG) corresponded to the genes for procarboxypeptidase A1 and pancreatic trypsinogen 2, respectively. These are two abundantly expressed pancreatic proenzymes, which, after cleavage to the mature enzyme forms, digest proteins in the small intestine. Many other familiar pancreatic genes were identified among the plentiful tags, but many of the tags did not match any gene sequences in the database, so their identities were unknown. As the database expands to include all human genes, all tags should at least be correlated to genes, even if the functions of some of those genes remain obscure.

SUMMARY SAGE allows us to determine which genes are expressed in a given tissue and the extent of that expression. Short tags, characteristic of particular genes, are generated from cDNAs and ligated together between linkers. The ligated tags are then sequenced to determine which genes are expressed and how abundantly.

Cap Analysis of Gene Expression (CAGE)

SAGE is a useful method for global analysis of gene expression, but it focuses on the 3′-ends of transcripts. Sometimes it is necessary to identify the 5′-ends of transcripts, for example, if one is interested in identifying promoters on a genomic scale. In that case, a related method known as cap analysis of gene expression (CAGE, Figure 25.6) is available. The CAGE procedure starts with reverse transcription (RT), as SAGE does, but with two important differences that ensure production of full-length cDNAs that copy the mRNA all the way to the 5′-end. First, the RT reaction includes a disaccharide known as trehalose.
This substance stabilizes reverse transcriptase at high temperature, so the RT reaction can be run at 60°C. This elevated temperature weakens mRNA secondary structure that otherwise would stop the RT reaction before it reached the 5′-end of the mRNA. Second, a cap trapper method is used: The caps of the mRNAs in the mRNA–cDNA hybrids are tagged with biotin. As we will see, this allows hybrids with full-length cDNAs to be purified away from hybrids containing less-than-full-length cDNAs.

[Figure 25.6, steps shown: (a) reverse transcription, giving full-length and non-full-length products; (b) biotinylation; (c) RNase I; (d) magnetic bead capture; (e) base hydrolysis; (f) biotin-linker ligation (linker with an MmeI site, TCCGAC); (g) second-strand synthesis; (h) MmeI digestion, cutting 20 nt and 18 nt away; magnetic bead capture and ligation to linker 2; XmaJI and XbaI digestion to release the 20-nt tag.]

Figure 25.6 Use of CAGE to produce 20-nt tags representing the 5′-ends of mRNAs. The procedure is described in the text. After the tags are produced as shown here, they can be ligated together via their identical sticky ends to form concatemers, cloned, and sequenced.

Figure 25.6 shows how the tagging works. First, the RT priming is done, not with oligo(dT) alone, but with oligo(dT) preceded by a stretch of random nucleotides that do not hybridize with the poly(A) tail.
The importance of this feature will become apparent shortly. After first-strand cDNA synthesis, both ends of the mRNA are tagged with biotin by reacting the RNA–DNA hybrid with a biotin-containing reagent that attaches to diols. There are only two diols (pairs of adjacent hydroxyl groups) in a capped mRNA: the free 2′- and 3′-hydroxyl groups of the cap nucleotide and those of the 3′-terminal nucleotide. One would like to tag just the cap, but the 3′-terminal nucleotide is unavoidably tagged in the same step. That problem is resolved in the next step, in which the hybrids are treated with RNase I. The RNase degrades any single-stranded RNA that is not hybridized to the cDNA. Thus, it not only removes the biotin tag from any hybrids that contain incomplete cDNAs, it also removes the biotin tag from the 3′-hydroxyl group at the end of every mRNA’s poly(A) tail, which cannot hybridize to the random tail at the beginning of the primer. After the RNase treatment, the only remaining biotin-tagged hybrids are those containing full-length cDNAs, and these are collected using magnetic beads coated with the biotin-binding protein streptavidin. After the hybrids are purified, their mRNA parts, including the biotin-tagged caps, are destroyed by base hydrolysis, leaving just the single-stranded cDNAs. Next, the full-length, single-stranded cDNAs are ligated to biotin-tagged linkers (linker 1) that contain a recognition site for the tagging enzyme MmeI, which cleaves 20 and 18 nt away on the two strands. Thus, after second-strand cDNA synthesis, the tagged cDNAs can be cut with MmeI to yield 20-nt tags that can be purified via their biotin parts, and ligated to a second linker (linker 2) via their 2-nt overhangs. Linker 1 also contains a recognition site for XmaJI and linker 2 contains a recognition site for XbaI, so the tags can be cut with those two enzymes, ligated together into concatemers, cloned, and sequenced as in the SAGE procedure.
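Once the concatemers have been sequenced, the remaining work is bookkeeping: split out the 20-nt tags and count how often each one appears. The following is only a minimal sketch of that counting step, not the procedure's actual software. It assumes the 5′-end sequences are already available in sense orientation, and the function names and toy reads are hypothetical.

```python
from collections import Counter

TAG_LENGTH = 20  # MmeI digestion yields 20-nt tags, per the text


def extract_tags(five_prime_sequences, tag_length=TAG_LENGTH):
    """Return the first `tag_length` nt of each sequence, skipping short reads."""
    return [seq[:tag_length].upper() for seq in five_prime_sequences
            if len(seq) >= tag_length]


def count_tags(tags):
    """Count occurrences of each tag; frequent tags suggest abundant transcripts."""
    return Counter(tags)


# Hypothetical 5'-end sequences (sense orientation), invented for illustration.
reads = [
    "ACGTACGTACGTACGTACGTTTTT",
    "ACGTACGTACGTACGTACGTGGGG",
    "GGGGCCCCAAAATTTTGGGGCCCC",
    "ACGT",  # too short to yield a tag, so it is skipped
]
tag_counts = count_tags(extract_tags(reads))
for tag, n in tag_counts.most_common():
    print(tag, n)
```

In this toy input, two reads share the same first 20 nt, so their tag is counted twice, which is the sense in which tag frequency reflects transcript abundance.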
Any given 20-nt sequence would be expected to occur by chance only once every 4^20, or about 1.1 × 10^12, base pairs. Thus, since the human genome contains only about 3 × 10^9 bp, most of the 20-nt tags should identify a unique sequence in even the large human genome, which can be found by consulting the known human genome sequence. This sequence should begin with the transcription start site, so the promoter should be in the immediate neighborhood. When Piero Carninci and colleagues performed this kind of CAGE analysis on mouse mRNAs from whole brain and three distinct brain regions, they found many CAGE tags that mapped close to previously mapped start sites, but many more that did not. This could help identify a number of new promoters and alternative start sites.

SUMMARY Cap analysis of gene expression (CAGE) gives the same information as SAGE about which genes are expressed, and how abundantly, in a given tissue. Because it focuses on the 5′-ends of mRNAs, it also allows the identification of transcription start sites and, therefore, helps locate promoters.

Whole Chromosome Transcriptional Mapping
Transcriptomics studies have become sophisticated enough that they can map transcripts with great accuracy to sites in whole chromosomes. This kind of study, called transcriptional mapping, is shedding light on a paradox mentioned earlier in this chapter: The number of protein-encoding genes in humans is scarcely larger than the number of such genes in a lowly roundworm! How can we reconcile that fact with the vastly greater complexity of human beings? One emerging answer is that transcripts of protein-encoding genes make up only a small fraction of the whole human transcriptome. And the closer we look at this problem, the more complex the human transcriptome becomes. If we consider only exons in protein-coding genes, we would predict that only 1–2% of the whole human genome would be expressed in RNAs found in the cytoplasm of cells.
However, as early as 2002, Thomas Gingeras and colleagues, using microarrays to study expression of human chromosomes 21 and 22, discovered that polyadenylated RNAs in the cytoplasm of human cells covered about an order of magnitude more of those two chromosomes than could be accounted for by protein-encoding exons. These unexpected transcripts have been dubbed transcripts of unknown function, or TUFs. All of the transcribed regions (exons and TUFs alike) detected by such arrays are called transcribed fragments, or transfrags. Furthermore, approximately two-thirds of the transcripts in human cells and hamster cells have been reported to be nonpolyadenylated [poly(A)−]. These poly(A)− transcripts therefore represent another chunk of the human genome, whose extent is unknown, but apparently large. Taken together, these findings suggest that protein-encoding exons make up only a small fraction of the total genomic sequences represented by cytoplasmic RNAs.

To investigate this intriguing conclusion further, Gingeras and colleagues used high-density oligonucleotide arrays with 25-mers spaced on average only 5 bp apart, thus providing an average overlap of 20 bp between neighboring probes. Why use such a high density? For one thing, it allows one to detect shorter exons, and, for another, hybridizations to overlapping oligonucleotides give greater confidence that transcription in that region really occurs. The oligonucleotides on the arrays came from the sequences of ten human chromosomes (6, 7, 13, 14, 19, 20, 21, 22, X, and Y), representing 30% of the total length of the human genome.
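The tiling geometry just described (25-mers whose start positions are 5 bp apart) can be made concrete with a short sketch. This illustrates only the arithmetic and is not the array design software used in the study. The toy sequence and function names are hypothetical, and real arrays space probes 5 bp apart only on average.

```python
PROBE_LENGTH = 25  # 25-mers, per the text
STEP = 5           # spacing between probe start positions, per the text


def tile_probes(sequence, probe_length=PROBE_LENGTH, step=STEP):
    """Yield (start, probe) pairs for probes tiled every `step` bases."""
    for start in range(0, len(sequence) - probe_length + 1, step):
        yield start, sequence[start:start + probe_length]


def probes_covering(position, sequence_length,
                    probe_length=PROBE_LENGTH, step=STEP):
    """Count tiled probes that include a given 0-based position."""
    last_start = sequence_length - probe_length
    return sum(1 for start in range(0, last_start + 1, step)
               if start <= position < start + probe_length)


toy_sequence = "ACGT" * 25  # hypothetical 100-bp region
probes = list(tile_probes(toy_sequence))
overlap = PROBE_LENGTH - STEP  # bases shared by neighboring probes
print(len(probes), overlap, probes_covering(50, len(toy_sequence)))
```

Each interior base is interrogated by five different probes in this layout, which is why agreement among overlapping probes gives added confidence that a region is really transcribed.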
To the arrays, Gingeras and colleagues hybridized double-stranded cDNAs representing cytoplasmic poly(A)+ RNAs from eight different human cell lines, or cytoplasmic and nuclear poly(A)+ and poly(A)− RNAs from a single cell line (HepG2). In all cases, transfrags that overlapped pseudogenes or repetitive DNA regions were dropped from consideration. About 9% of more than 74 million probe pairs (both strands) hybridized to cDNAs from poly(A)+ RNA, per cell line. Applying a “1 of 8” rule, in which a probe pair needs to hybridize to a cDNA from only one of the eight cell lines, the percentage of positive probes rose to 16.5%. This is the “1 of 8 map.” An average of 4.9% of the nucleotides in the 10 chromosomes were expressed as cytoplasmic RNA in each cell line. In the 1 of 8 map, this figure rose to 10.1%. These findings suggest that about 10.1% of the sequences in the 10 human chromosomes are expressed as polyadenylated RNA in the cytoplasm in at least one cell line. Furthermore, the difference between 4.9% and 10.1% indicates that considerable cell-line-specific transcription occurs.

Figure 25.7 shows the proportions of each of the 10 chromosomes from which cytoplasmic polyadenylated transcripts are made. Such transcripts from intergenic regions and introns are, by definition, unannotated. And these regions make up the majority (57%) of the transcripts from the 10 chromosomes as a whole (central pie chart). The annotated transcripts overlap with one of three annotations: Known, which is a combination of two exon databases; mRNA, which contains the mRNAs from a third database that do not overlap with the Known exons; and EST, which contains all publicly available ESTs that do not overlap with either the Known or mRNA databases.

What about poly(A)− transcripts? For this analysis, Gingeras and colleagues focused on a single cell line, HepG2. They looked for stable poly(A)+, poly(A)−, and bimorphic transcripts in both the nucleus and cytoplasm of these cells.
(Bimorphic transcripts start out polyadenylated, but then lose their poly(A) tail.) They found that fully 15.4% of nucleotides in the 10 chromosomes are represented in one of these classes of transcripts (almost half of which are poly(A)−). Thus, about 10 times as much of the genome is represented in stable transcripts as we would expect on the basis of exons alone. Of course, the majority of most human genes is in introns, so this result may not sound surprising at first. But if spliced-out introns have no function, we would expect them to be degraded rapidly and not contribute so heavily to the cDNAs made from presumably stable nuclear RNAs.

Figure 25.7 Transcription maps of 10 human chromosomes. The percentages of different categories of sequences found in polyadenylated cytoplasmic transcripts in the 1 of 8 map are represented by the wedges of each pie chart. Each of the chromosomes represented by the small pie charts (6, 7, 13, 14, 19, 20, 21, 22, X, and Y) is identified in boldface, as is the collective of all 10 chromosomes (large pie chart in the middle). Sequence categories are given in the collective pie chart (Known, 26%; mRNA, 5%; EST, 12%; intronic, 26%; intergenic, 31%), and the same color coding is used throughout. The unannotated sequences are intergenic and intronic. The annotated sequences are designated Known, mRNAs, and ESTs. (Source: Cheng, J., T.R. Gingeras, et al. 2005. Transcriptional maps of 10 human chromosomes at 5-nucleotide resolution. Science 308:1149–54.)
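The “1 of 8” rule used for the transcription maps amounts to taking the union of positive probes across cell lines. A minimal sketch follows. It uses three toy cell lines and made-up probe IDs rather than the study's data, and the function names are hypothetical.

```python
def fraction_positive(positive_probes, total_probes):
    """Fraction of all probes that scored positive."""
    return len(positive_probes) / total_probes


def one_of_n_map(per_cell_line_positives):
    """Union of positive probe IDs: a probe counts if any cell line detects it."""
    combined = set()
    for positives in per_cell_line_positives:
        combined |= set(positives)
    return combined


# Hypothetical probe IDs 0-19 and three toy cell lines (the study used eight).
TOTAL_PROBES = 20
cell_lines = [{0, 1}, {1, 2}, {7}]

per_line = [fraction_positive(p, TOTAL_PROBES) for p in cell_lines]
combined = one_of_n_map(cell_lines)
print(sum(per_line) / len(per_line), fraction_positive(combined, TOTAL_PROBES))
```

The gap between the per-line average and the combined fraction grows with the amount of cell-line-specific signal, which is the reasoning the text applies to the jump from 4.9% to 10.1%.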
Another conclusion from this study is that about half of the human transcriptome appears to be overlapping. There are two kinds of overlaps: those on the same strand, and those on opposite strands. Of course, transcripts that overlap on opposite strands represent sense/antisense pairs, which should invoke an RNAi response. Thus, this may represent a kind of gene expression control mechanism. Studies like this that show abundant cytoplasmic poly(A)+ and poly(A)− transcripts of non-exon regions may help to explain the differences between organisms. Although the exons of humans and chimpanzees are extremely similar, the non-exon regions have diverged considerably more. And transcription of those regions may give rise to some of the differences we see in the two species.

SUMMARY High-density whole chromosome transcriptional mapping studies have shown that the majority of sequences in cytoplasmic polyadenylated RNAs derive from non-exon regions of 10 human chromosomes. Furthermore, almost half of the transcription from these same 10 chromosomes is nonpolyadenylated. Taken together, these results indicate that the great majority of stable nuclear and cytoplasmic transcripts of these chromosomes comes from regions outside the exons. This may help to explain the great differences between species, such as humans and chimpanzees, whose exons are almost identical.

Genomic Functional Profiling
The ultimate goal of genomic functional profiling is to determine the pattern of expression of all the genes in an organism at all stages of the organism’s life. That is a daunting task even in the simplest of eukaryotes, but it is even more difficult in complex multicellular organisms. So far, the puzzle for each organism is being put together piece by piece, with each research group contributing its own piece. Let us consider some general techniques for attacking the problem.
Deletion Analysis
Once all the genes in a genome have been identified, one can investigate what happens when each of them is removed. That kind of experiment is ethically impossible in humans, of course, but it can be done in other vertebrates as their genomes are completely sequenced, at least in principle. Logistical problems may delay this kind of analysis of a genome as large as that of a vertebrate, but the yeast genome has already been profiled in this way. In 2002, a large consortium of investigators led by Ronald Davis reported that they had generated a set of yeast mutants, in each of which one gene had been replaced with an antibiotic resistance gene flanked by 20-mer sequences that were different for each replaced gene. Thus, each gene replacement has a “molecular barcode” so it can be uniquely identified. In all, these investigators replaced over 96% of the annotated ORFs in Saccharomyces cerevisiae. Next, they examined the mutants for ability to grow in a mixed culture under six different conditions: high salt; sorbitol; galactose; pH 8; minimal medium; and the antifungal agent nystatin. They also examined gene expression under each of these conditions by hybridization of RNA to oligonucleotide microarrays.

To do this genomic functional profile, Davis and colleagues grew a mixed culture of all 5916 mutants under each of the conditions, collected cells at various times, and tested for each barcode by hybridization to an oligonucleotide array containing sequences complementary to the barcodes. If a gene is important for dealing with a given condition, such as the presence of galactose, then mutants lacking that gene should disappear rapidly from the mixture when that condition is imposed. In fact, the rate at which the mutant disappears should correlate with the importance of the deleted gene in dealing with the condition.
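One simple way to turn “the rate at which the mutant disappears” into a number is to fit a straight line to the logarithm of each barcode's signal over time. The sketch below is an illustration under stated assumptions, not the analysis the consortium actually used. It assumes each barcode's hybridization signal is proportional to that mutant's share of the culture, and the strain names and signal values are invented.

```python
import math


def depletion_rate(times_h, barcode_fractions):
    """Least-squares slope of ln(fraction) versus time; negative means depletion."""
    logs = [math.log(f) for f in barcode_fractions]
    n = len(times_h)
    mean_t = sum(times_h) / n
    mean_y = sum(logs) / n
    num = sum((t - mean_t) * (y - mean_y) for t, y in zip(times_h, logs))
    den = sum((t - mean_t) ** 2 for t in times_h)
    return num / den


# Hypothetical time course (hours) and barcode signal fractions, for illustration.
times = [0, 5, 10, 15]
signals = {
    "neutral_mutant": [0.010, 0.010, 0.010, 0.010],
    "sensitive_mutant": [0.010, 0.005, 0.0025, 0.00125],  # halves every 5 h
}
rates = {name: depletion_rate(times, fr) for name, fr in signals.items()}
ranked = sorted(rates, key=rates.get)  # most rapidly depleted mutant first
print(ranked, rates)
```

Sorting by slope puts the mutants whose deleted genes matter most under the imposed condition at the top of the list.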
When the investigators applied this kind of profiling to yeast mutants responding to the presence of galactose, they found several genes that were already known through years of study to be involved in yeast metabolism of galactose. But they also found 10 new genes that had previously not been implicated in galactose metabolism. Wild-type yeast and 11 of the mutants identified by the profiling as important in galactose metabolism were tested individually, and the results are presented in Figure 25.8. As predicted, all 11 mutant strains grew more slowly in galactose than the wild-type strain did. Their growth rates varied from 44% to 91% of wild-type.

SUMMARY Genomic functional profiling can be performed in several ways. In one kind of mutation analysis, called deletion analysis, mutants are created by replacing genes one at a time with an antibiotic resistance gene flanked by oligomers that serve as a barcode to identify each mutant. Then, a functional profile can be obtained by growing the whole group of mutants together under various conditions to see which mutants disappear most rapidly.

Figure 25.8 Growth curves of various mutants discovered by profiling to be deficient in response to galactose. Davis and colleagues tested wild-type yeast cells and 11 deletion mutants individually for growth in galactose-containing medium. The plot shows growth (A600) versus time (h) for the strains WT, gal4, gal3, gal1, yml090w, msn2, gal2, yml077w, ykl037w, ftr1, fet3, and gef1; growth rates of the mutants range from 44.2% to 91.0% of wild-type. All of the mutants had been identified by profiling in a mixture of strains as defective in growth with galactose. A600 (absorbance of 600-nm light) is a measure of turbidity, which in turn is a measure of yeast growth.
(Source: Adapted from Giaever, G., A.M. Chu, L. Ni, C. Connelly, L. Riles, S. Veronneau, et al., Functional profiling of the Saccharomyces cerevisiae genome. Nature 418, 2002, p. 388, f. 2.)

RNAi Analysis
“Knocking out” genes by mutagenesis is laborious, and has so far been accomplished on a genomewide scale only in yeast. But some more complex organisms are amenable to a simpler alternative: “knocking down” genes by RNA interference (RNAi, Chapter 16). The nematode worm Caenorhabditis elegans is particularly susceptible to RNAi, which even affects the progeny of treated worms; it can reproduce parthenogenically, which means that only one parent is required; it contains fewer than 1000 cells; and its whole genome has been sequenced. Thus, this organism is an obvious target for genomic functional profiling by RNAi analysis. Birte Sönnichsen and colleagues have exploited this technique to inactivate 19,075 of the worm’s genes, over 98% of the total, and observe the effects on early embryogenesis, that is, the first two cell divisions after fertilization. They injected 25-bp double-stranded RNAs into worms and then followed the first two cell divisions in the progeny of the injected worms by time-lapse microscopy. They also checked for the viability of the embryos beyond the two-cell stage and for gross phenotypic alterations in the larval and adult stages. In all, inactivation of 1668 genes by RNAi produced detectable phenotypic defects. Of these 1668, inactivation of 661 genes gave reproducible defects in the first two cell divisions; the rest gave defects at later stages of development (Figure 25.9). (It is not surprising that inactivating virtually all of the 661 genes that gave defects in early embryogenesis also produced embryonic lethality.) One problem with RNAi is that it sometimes fails to inactivate genes (false-negatives), so negative results are difficult to interpret.
As a check on their procedure, Sönnichsen and colleagues evaluated the 65 genes that had previously been shown by mutagenesis to affect the first cell division. Of these genes, 62 (95%) had been detected by the RNAi analysis. The three genes that had been missed the first time were rechecked by RNAi analysis, and two were detected the second time, increasing the success rate to 98%. It is also true that mutations are detected only if they give clear phenotypes, so the mutagenesis strategy also produces false-negatives. Thus, as another check on their procedure, the researchers compared their data to other RNAi analyses that targeted early embryogenesis, and found that they had detected 75% of the genes that others had found. Accordingly, Sönnichsen and colleagues concluded conservatively that their RNAi analysis could detect 75–90% of genes involved in early embryogenesis.

Next, the researchers grouped the 661 genes according to their specific phenotypes. They found that inactivation of about half (326) of the genes produced defects in embryogenesis per se, while the remainder (335) simply affected the general cell metabolism required to keep the embryo alive long enough to divide twice. By careful annotation of the specific defects, the researchers were able to group the former 326 genes into defects in 23 aspects of embryogenesis, such as spindle assembly (9 genes) and sister chromatid separation (64 genes).

Figure 25.9 Distribution of phenotypes from a genomic functional profile of C. elegans using RNAi. (a) Initial screen. Sönnichsen and colleagues targeted 19,075 genes with dsRNAs. Of these, 17,426 (“Wild-type,” blue; 89%) caused no change in phenotype in the screens the authors used, and 1,668 (“Mutant,” red; 9%) showed an alteration in phenotype. Four hundred sixty-nine genes (“No dsRNA,” yellow; 2%) were not targeted in this experiment. (b) Distribution of mutant phenotypes. Starting with the 1668 genes whose inactivation yielded mutant phenotypes, Sönnichsen and colleagues sorted the developmental stages at which defects were seen: early embryo (661; 40%), late embryo (605; 36%), larva (268; 16%), and adult (134; 8%). For example, 661 of these (red) exhibited defects in the early embryo stage (first two cell divisions). (Source: Adapted from Sönnichsen, et al., Full-genome RNAi profiling of early embryogenesis in Caenorhabditis elegans. Nature. Vol. 434 (2005) f. 2, p. 465.)

SUMMARY Genomic functional analysis on complex organisms can be done by inactivating genes via RNAi. An application of this approach targeting the genes involved in early embryogenesis in C. elegans has identified 661 important genes, 326 of which are involved in embryogenesis per se.

Tissue-Specific Functional Profiling
Another approach to genomic functional profiling is to observe the tissue specificity of the genes that are inactivated by mutation or other means. In one notable study, Lee Lim and colleagues used two miRNAs to knock down expression of genes in human (HeLa) cells in culture, and then looked at the profile of genes whose expression was significantly reduced. Remarkably, miR-124, an miRNA expressed in brain, knocked down expression of genes that are expressed at low levels in brain, while miR-1, an miRNA expressed in muscle, knocked down expression of genes that are expressed at low levels in muscle. In other words, these two miRNAs shifted the expression of genes in HeLa cells towards that seen in the tissues in which the respective miRNAs are prominent. This is exactly what we would expect if these two miRNAs play a major role in turning down the expression of these same genes in vivo.

A further striking feature of this study is that the miRNAs reduced the concentrations of the mRNAs in question, even though, as we learned in Chapter 16, animal miRNAs generally affect mRNA translation, not mRNA concentrations. Thus, Lim and colleagues introduced double-stranded miRNAs into HeLa cells and then used microarrays to measure the levels of mRNAs purified from the treated cells. The result was clear reduction in the concentrations of 100 or more mRNAs with each miRNA.

Figure 25.10 Tissue-specific down-regulation by miRNAs. Panels (a) and (b) plot number of genes against cerebral cortex rank; panels (c) and (d) plot the Log10 of the P-value for each of the 46 tissues, with brain tissues marked in (c) and heart and skeletal muscle marked in (d). (a) Ranking of expression of genes in cerebral cortex. The rankings of all 10,000 genes in each of 46 tissues are plotted as follows: The left-most bar (rank 1) represents the genes that are expressed at a higher level in cerebral cortex than in any other tissue; the next bar (rank 2) represents genes that are expressed at a higher level in cerebral cortex than in any other tissue except one, and the last bar (rank 46) represents the genes that are expressed at a lower level in cerebral cortex than in any other tissue. (b) Ranking of genes whose mRNA levels are significantly decreased by miR-124. Note the skew toward genes that are poorly expressed in cerebral cortex compared to the background in panel (a), which gives a P-value of significance of about 10^−12. (c) Plot of the Log10 of P-values derived from plots like that in panel (b) for all 46 tissues. The only tissues with significant P-values (<0.001) are brain tissues: 5, whole brain; 6, amygdala; 7, caudate nucleus; 8, cerebellum; 9, cerebral cortex; 10, fetal brain; 11, hippocampus; 12, postcentral gyrus; and 13, thalamus. (d) Similar to (c), except that the analysis was performed on cells to which miR-1, instead of miR-124, had been added. (Source: Adapted from Lim et al., Microarray analysis shows that some microRNAs downregulate large numbers of target mRNAs. Nature. Vol. 433 (2005) f. 1, p. 770.)

Here is how Lim and colleagues did their analysis, considering miR-124 first. They began by plotting the expression levels of 10,000 human genes in each of 46 tissues, using data from a previous genome-wide survey. The histogram in Figure 25.10a contains the data for gene expression in cerebral cortex. Each bar represents the number of genes expressed at a given level in cerebral cortex. The left-most bar represents the genes that are more highly expressed, and the right-most bar represents the genes less highly expressed in this tissue than in any other tissue. The other bars represent genes that are intermediate in expression, from highly expressed, to poorly expressed in cerebral cortex.
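The ranking behind these histograms can be sketched in a few lines: for each gene, find where the tissue of interest falls among all tissues, then count the genes at each rank. This is a hedged illustration with invented expression values and hypothetical function names. The study used 46 tissues and 10,000 genes, and the statistical test behind the reported P-values is not described in the text, so it is not reproduced here.

```python
from collections import Counter


def tissue_rank(expression_by_tissue, tissue):
    """Rank of `tissue` for one gene: 1 = highest expression among all tissues.

    Tied tissues share the better (lower-numbered) rank.
    """
    level = expression_by_tissue[tissue]
    return 1 + sum(1 for other in expression_by_tissue.values() if other > level)


def rank_histogram(genes, tissue):
    """Count genes at each rank for the chosen tissue."""
    return Counter(tissue_rank(levels, tissue) for levels in genes.values())


# Hypothetical expression levels for three genes in four tissues.
genes = {
    "geneA": {"cortex": 9.0, "heart": 2.0, "muscle": 1.0, "liver": 3.0},
    "geneB": {"cortex": 0.5, "heart": 4.0, "muscle": 6.0, "liver": 5.0},
    "geneC": {"cortex": 0.2, "heart": 7.0, "muscle": 3.0, "liver": 1.0},
}
background = rank_histogram(genes, "cortex")
print(sorted(background.items()))
```

Running the same function on only the genes knocked down by an miRNA gives the second histogram, whose shape is then compared with this background.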
All 10,000 genes are represented in this panel, so a random set of genes should produce something similar, which we can consider background. The histogram in Figure 25.10b contains the ranking of genes whose expression was significantly decreased by miR-124 in HeLa cells. Instead of a background plot, as in panel (a), we see a plot that is significantly skewed toward genes that are naturally poorly expressed in cerebral cortex. Notice the predominance of bars on the right-hand side of the histogram, which yields a P-value of significance that is much less than 0.001. In fact, it is of the order of 10^−12.

Next, Lim and colleagues expanded their analysis of the effect of miR-124 to all 46 tissues and plotted the Log10 of P-values (Figure 25.10c). Using a threshold of significance of a P-value less than 0.001, brain tissues were the only ones whose P-values were significantly different from background (bars 5–13). In a similar analysis of the effect of miR-1 (Figure 25.10d), Lim and colleagues found that the only tissues whose P-values were significantly different from background were muscle tissues. Thus, the pattern of depression of HeLa cell gene expression by miR-124 matched the pattern of low gene expression levels only in brain cells. Similarly, the pattern of depression of HeLa cell gene expression by miR-1 matched the pattern of low gene expression levels only in muscle cells.

Note again that these studies used microarrays, which detect mRNA levels. Thus, it is likely that the miRNAs are affecting the steady-state levels of particular mRNAs, presumably by destabilizing them. If this is so, we would expect to see evidence of complementarity between the miRNAs and the destabilized mRNAs, probably in the 3′-UTRs of the mRNAs, where such complementarity has typically been found. So Lim and colleagues compared the sequences of the miRNAs to the sequences of the 3′-UTRs of the mRNAs whose levels were significantly depressed.
They used a “motif discovery tool” called MEME to do the matching, and obtained striking results. Fully 88% of the mRNAs down-regulated by miR-1 had strings of at least six bases, with the consensus sequence CAUUCC, that are complementary to a string of bases in miR-1. And 76% of the mRNAs down-regulated by miR-124 had strings of at least six bases, with the consensus sequence GUGCCU, that are complementary to a string of bases in miR-124. This is strong evidence that the miRNAs really do interact with the 3′-UTRs of their target mRNAs, and presumably destabilize them.

An attractive hypothesis emerges from these studies: miRNAs play an important role in cell differentiation by inhibiting the expression of gene batteries, or sets of functionally related effector genes. For example, miR-124 inhibits the expression of a battery of hundreds of non-neuronal genes that help to keep a human cell in an undifferentiated state. Presumably, suppression of these non-neuronal genes is a key to differentiation of neuronal cells. Gail Mandel and her colleagues have provided support for this hypothesis by identifying a protein factor, RE1 silencing transcription factor (REST), that inhibits the expression of a battery of neuron-specific genes, including miR-124 and a number of other miRNAs. REST inhibits miR-124 expression in non-neuronal and pre-neuronal cells. However, during differentiation of neuronal cells, REST dissociates from the miR-124 gene and allows its expression. The newly made miR-124 then inhibits the expression of non-neuronal genes, helping the cell develop into a neuronal cell. Indeed, one of the mRNAs targeted by miR-124 encodes one of the subunits of REST. Thus, miR-124 and REST antagonize each other’s expression, as we might expect of two factors that lead to different developmental fates.
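The hexamer matching reported above for miR-1 and miR-124 targets can be illustrated with a simple string search. This sketch is not MEME, which discovers motifs de novo. It only scans for the two consensus hexamers quoted in the text and works out the stretch of miRNA each would pair with. The 3′-UTR fragments and function names are invented for illustration.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna):
    """Reverse complement of an RNA string (Watson-Crick pairs only)."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))


def utrs_with_site(utrs, site):
    """Return names of 3'-UTRs that contain `site` at least once."""
    return [name for name, seq in utrs.items() if site in seq]


MIR1_SITE = "CAUUCC"    # consensus in miR-1 targets, from the text
MIR124_SITE = "GUGCCU"  # consensus in miR-124 targets, from the text

# Hypothetical 3'-UTR fragments, invented for illustration.
utrs = {
    "utr1": "AAGCAUUCCGUA",
    "utr2": "UUGUGCCUAAGC",
    "utr3": "ACACACACACAC",
}
print(reverse_complement(MIR1_SITE), utrs_with_site(utrs, MIR1_SITE))
print(reverse_complement(MIR124_SITE), utrs_with_site(utrs, MIR124_SITE))
```

The reverse complement of each site is the hexamer one would look for within the miRNA itself to confirm the pairing.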
SUMMARY Tissue-specific expression profiling can be done by examining the spectrum of mRNAs whose levels are decreased by an exogenous miRNA, and comparing that to the spectrum of expression of genes at the mRNA level in various tissues. If the miRNA in question causes a decrease in the levels of the mRNAs that are naturally low in cells in which the miRNA is expressed, it suggests that the miRNA is at least part of the cause of those natural low levels. This kind of analysis has implicated miR-124 in destabilizing mRNAs in brain tissue, and miR-1 in destabilizing mRNAs in muscle tissue. By inhibiting the expression of batteries of genes, miRNAs can influence the differentiation of cells. For example, miR-124 inhibits the expression of non-neuronal genes. Thus, expression of miR-124 in a pre-neuronal cell pushes the cell toward neuronal differentiation.

Locating Target Sites for Transcription Factors
As we learned in Chapter 12, genes are stimulated by activators, which bind to enhancers. Many activators have many enhancer targets in a genome and therefore activate many genes. Such a set of genes that tend to be regulated together is sometimes called a regulon. To understand fully the effects of a given activator, it is important to identify all the genes that respond to that activator, and several methods have been developed to accomplish this task. The most straightforward method is to compare the microarray hybridization patterns of RNAs from organisms that do not express, express at a low level, or overexpress the gene for a given activator. This analysis reveals the genes that are turned on by high expression of the activator and has been useful for that purpose. But two problems limit the utility of this sort of experiment.
First, the genes that are turned on may not be direct targets of the activator, but may be targets of other activators whose genes were stimulated by the first activator. Second, the genes that are turned on when the activator is overexpressed may not be turned on in vivo by physiological levels of the activator. Still, there are ways to get around these problems by examining directly the interaction of an activator with the control regions of specific genes. One such strategy, employed by Richard Young and colleagues (Ren et al., 2000), melds two different techniques: chromatin immunoprecipitation (ChIP, Chapter 13) and hybridization to a DNA microarray, or chip. The technique is therefore called ChIP-chip or, sometimes, ChIP on chip.

Figure 25.11 shows the general plan of the method, which Young and colleagues adapted to identify the binding sites for the activator GAL4 throughout the yeast genome. First, they chemically cross-linked proteins to DNA in chromatin so they could not separate. Then they broke open the cells and sheared the chromatin into small segments. Next, they immunoprecipitated the sheared yeast chromatin with an antibody against GAL4 to precipitate DNA bound to GAL4. Then they reversed the cross-links between the protein and DNA, and labeled copies of this DNA with a red fluorescent dye (Cy5) by PCR. By a parallel procedure, they labeled copies of DNA that was not immunoprecipitated by the anti-GAL4 antibody with a green fluorescent dye (Cy3). Then they probed DNA microarrays representing all the intergenic regions of the yeast genome with the two labeled DNAs. Figure 25.12 shows the results for a small section of the array. One spot, denoted by the arrow, clearly shows a preponderance of red fluorescence, suggesting that it hybridized preferentially to the DNA that was associated with GAL4. Using this technique, Young and colleagues identified DNA sequences associated with 10 genes, all of which are known to be activated by GAL4.
Thus, the method worked well in this trial. This method is well suited for yeast because of the limited size of the yeast genome and the fact that the yeast genome has been completely sequenced. But could one perform a similar experiment with the human genome? There would be a serious problem, because the whole intergenic fraction of the human genome is almost as large as the genome itself, so a microarray containing all those sequences would be very complex and difficult to produce. But there are some ways to narrow the field of DNA sequences to make the experiment practical. Two of these were reported in work on the same activator, human E2F4, in 2002. In their approach to narrowing the field, Peggy Farnham and coworkers used a microarray containing only CpG islands (7776 of them). As we learned in Chapter 24, such CpG islands are associated with gene control regions and therefore should be highly enriched in the activator-binding sequences being sought by this technique. Using that strategy, Farnham and coworkers identified 68 target sites for their activator. Instead of CpG islands, David Dynlacht and colleagues chose the control regions of approximately 1200 genes that were known to be activated as cells entered the cell cycle (a time when E2F4 is active). From this panel of DNAs on the microarray, they found that 127 bound to E2F4 in human fibroblasts. Thus, some foreknowledge of the timing and selectivity of an activator can be very useful in designing a microarray to seek out more target genes.

[Figure 25.11 panel labels: Wild-type; Deletion mutant. (a) Cross-link proteins to DNA. (b) Extract and shear cross-linked DNA. (c) Immunoprecipitate with specific antibody. (d) Reverse cross-links, amplify and label DNA. (e) Hybridize to microarray containing all intergenic regions.]

Figure 25.11 Genome-wide search for DNA–protein interactions in yeast by ChIP-chip analysis. (a) First, proteins are chemically cross-linked to DNA in yeast cells. This is done in wild-type cells and in reference cells missing the gene encoding the protein of interest (red). (b) The protein–DNA complexes (cross-linked chromatin) are extracted from the cells and sheared by sonication. (c) Sheared chromatin is immunoprecipitated with an antibody directed against the protein of interest. (d) After precipitation, the cross-links are reversed, and the precipitated DNA is amplified and labeled by PCR. (e) The labeled DNA from both kinds of cells is hybridized to a microarray containing DNA representing all intergenic regions in the yeast genome. The precipitated DNA from the wild-type cells is labeled with a red fluorescent dye, and the precipitated DNA from the mutant cells lacking the protein of interest is labeled with a green fluorescent dye. Thus, if a DNA spot on the microarray hybridizes to DNA that binds to the protein of interest more than to other proteins, that spot will fluoresce red. If the DNA hybridizes to DNA that binds to other proteins preferentially, the spot will fluoresce green. If it hybridizes to both DNA probes, it will fluoresce yellow. Careful normalization of the relative intensities of fluorescence of the two DNA probes allows one to determine the ratio of red and green fluorescence at each spot and therefore the significance of the preference a given DNA region has for binding to the protein of interest. (Source: Adapted from Nature 409: from Iyer et al., 2001, Fig. 1, p. 534.)

[Figure 25.12 panel labels: Binding site; IP-enriched DNA; Unenriched DNA; Merged.]

Figure 25.12 Identifying a DNA sequence that binds to GAL4. Young and colleagues prepared a red fluorescent DNA probe by performing PCR on DNA from chromatin immunoprecipitated by an anti-GAL4 antibody. Then they prepared a similar, green fluorescent DNA probe by PCR on DNA that was not immunoprecipitated by the antibody. Then they hybridized these two probes to a DNA microarray with DNAs representing all the intergenic regions in the yeast genome. This is a small section of that array, showing one red spot (arrow) that indicates a putative GAL4-binding DNA, several green spots, indicating DNA that does not bind GAL4, and several yellow spots (binding both red and green probes) that do not show significant preferential binding of GAL4. (Source: Adapted from Ren et al., Science 290 (2000) Fig. 1A, p. 2306.)

One problem with the ChIP-chip technique for finding transcription factor binding sites is that it is limited to the sequences placed on the chip. In order to contain all the possible sequences in the euchromatic part of the human genome, such a chip (or chips) would have to contain on the order of a billion spots, which is beyond the reach of current technology. Even when chips with tiling arrays (DNAs with overlapping sequences) approach the resolution of just a few nucleotides, they are predicted to be quite expensive, at least at first. Another problem is that hybridization efficiency to spots on a chip is different for different DNAs, so some binding sites will be missed because their hybridization conditions are not met. Also, it is an unfortunate fact of life that hybridization specificity is not perfect: Sometimes one DNA will hybridize to more than one spot, or will fail to hybridize where it should because of DNA secondary structure. Finally, excellent coverage of the genome by ChIP-chip will be realized in the near future only for the human genome, for which high-resolution tiling arrays will be available. Investigators studying other genomes will not have that advantage.
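To make the idea of a tiling array concrete, here is a minimal sketch in Python of how overlapping probes can be laid across a region, with the step between probe starts setting the resolution. It is an illustration only, not a description of any commercial array design; the function name, the 22-nt toy sequence, the probe length of 10, and the step of 4 are all hypothetical choices.

```python
def tiling_probes(sequence, probe_len, step):
    """Return (start, probe) pairs that tile `sequence` with overlapping probes.

    Consecutive probes overlap by probe_len - step nucleotides, so a smaller
    step gives finer resolution but requires more spots on the chip.
    """
    if probe_len <= 0 or step <= 0:
        raise ValueError("probe_len and step must be positive")
    last_start = len(sequence) - probe_len
    return [(start, sequence[start:start + probe_len])
            for start in range(0, last_start + 1, step)]


# Toy 22-nt region (hypothetical sequence, for illustration only).
region = "ACGTTGCAAGCTTAGGCTAACG"
probes = tiling_probes(region, probe_len=10, step=4)
for start, probe in probes:
    print(start, probe)
```

Halving the step roughly doubles the number of probes needed for the same region, which is one way to see why near-nucleotide resolution across a whole mammalian genome demands so many spots.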
An alternative that solves these problems is a technique called tag sequencing, in which the amplified pieces of DNA precipitated in the ChIP procedure are not hybridized to a chip, but repeatedly sequenced using one of the new high-throughput, next-generation techniques described in Chapter 5. With 2007 technology, one instrument could do about 400,000 200-nt reads, or 40 million 25-nt reads at a time. Barbara Wold and colleagues tested such a method, which they dubbed ChIPSeq (more commonly known as ChIP-seq), in 2007. They performed millions of 25-nt reads on DNAs isolated by ChIP with an antibody specific for a transcription factor called neuron-restrictive silencing factor (NRSF), which represses neuronal genes in non-neuronal cells and in neuronal precursor cells. Then they used a computer program to show where these 25-nt reads mapped to the human genome. They counted as significant any site where 13 or more reads clustered, and where this clustering was at least five-fold enriched over a control in which no antibody was used during the ChIP procedure. Figure 25.13 depicts a cluster of reads that defines a binding site for a hypothetical protein. NRSF binding sites were attractive subjects because they had already been carefully studied by other techniques, and a canonical binding site sequence had been recognized. The ChIP-seq procedure identified almost all of the canonical binding sites, and found new binding sites as well. Some of these had canonical half-sites separated by noncanonical spacers. Others had only one half-site. Thus, this technique appears to be comprehensive in its ability to identify binding sites.

[Figure 25.13 panels: (a) Exp.; (b) Control. Axes: Reads vs. Position on genome.]

Figure 25.13 Mapping a transcription factor binding site by ChIP-seq. (a) Short (25-nt) reads of sequence of DNAs precipitated by ChIP using an antibody specific for a transcription factor are plotted vs. genome position at one particular place in the genome. Each red block represents one read. The peak defines the binding site for the transcription factor. (b) A control is run without an antibody in the ChIP step, and this shows only background binding.

Mathieu Blanchette, François Robert, and their colleagues adopted a different approach to finding transcription factor binding sites in the human genome. Instead of searching for binding sites for a single protein, they looked for clusters of such binding sites (cis regulatory modules [CRMs], Chapter 12). Whereas each individual transcription factor binding site can be quite variable in sequence, and thus escape notice, clusters of such sites are relatively easy to find. Blanchette, Robert, and colleagues took advantage of the Transfac database, containing binding site sequence information for 229 different transcription factors. They also realized that CRMs are well conserved, relative to surrounding DNA sequences. Accordingly, they focused on nonrepetitive, noncoding DNA regions that are conserved in the human, mouse, and rat genomes and searched in those regions for transcription factor binding sites from the Transfac database. This scan, encompassing the 34% of the human genome that can be aligned with both the mouse and rat genomes, yielded 118,402 predicted CRMs (pCRMs). This number surely includes some false positives, but the scan covered only about one-third of the human genome. While that part of the genome is likeliest to be enriched in CRMs, we can still conclude that the human genome probably contains at least two hundred thousand CRMs. That number may seem surprisingly large, but the authors have validated their data in several ways.
For example, they found a strong enrichment of their pCRMs in known promoter regions (defined as DNA regions within 1 kb upstream of the transcription start site), particularly promoters within CpG islands. They also found good correspondence between the pCRMs and DNase hypersensitive regions, which, as we learned in Chapter 13, tend to contain gene regulatory elements.

One somewhat surprising result of this work was the large number of pCRMs that lie in regions thought to be devoid of genes. This finding could be explained in several ways: (1) It may reflect our inability to identify all the genes in the human genome. (2) It may indicate that some genes have cryptic transcription start sites that lie far upstream of the canonical start sites. (3) The pCRMs may be regulating the production of noncoding RNAs. (4) The pCRMs may be regulating the transcription of genes a great distance away.

Figure 25.14 depicts the frequency of pCRMs within and surrounding known genes. As expected, there is a strong preference for pCRMs in the immediate 5′-flanking region of a gene, where enhancers are classically found. But there is also a preponderance of pCRMs in regions where we would not expect them, beginning with the region just downstream of the transcription start site. This could reflect alternative, downstream transcription start sites, or it could be the first indication of widespread regulatory elements within genes. A second surprise in Figure 25.14 is the abundance of pCRMs in the region surrounding the transcription termination site. Again, this has at least two possible explanations. It could indicate a large class of enhancers just downstream of the genes they control, or it could represent antisense transcripts that could play a negative role in gene expression. There is a dearth of pCRMs in the regions 10–50 kb upstream and 10–30 kb downstream of genes, and at the edges of introns (except the first and last ones). Some of this may be only apparent.
For example, there could be a selection in these regions for pCRMs with few enough factor binding sites that they escaped notice in this study.

SUMMARY ChIP-chip analysis can be used to identify DNA-binding sites for activators and other proteins. In organisms with small genomes, such as yeast, all of the intergenic regions can be included in the microarray. But with large genomes, such as the human genome, that is now impractical. To narrow the field, CpG islands can be used, since they are associated with gene control regions. Also, if the timing or conditions of an activator’s activity are known, the control regions of genes known to be activated at those times, or under those conditions, can be used. Tag sequencing, or ChIP-seq, in which the chromatin pieces precipitated by ChIP are repeatedly sequenced, can also be used to identify transcription factor binding sites. Knowledge of the sequences of multiple mammalian genomes also allows one to narrow the search for human transcription factor binding sites by beginning with conserved regions of the genome. In addition, it is easier to search for CRMs, which contain several transcription factor binding sites. There are more than 100,000 CRMs in the human genome. They tend to cluster in the regions surrounding the transcription start and termination sites, but a surprising number are found in gene deserts far from any known genes.

Locating Enhancers that Bind Unknown Proteins

The “gene-centric” strategy we have just studied is applicable only to enhancers that bind known proteins. But there are still many enhancers whose protein partners are unknown. In order to identify such enhancers, Len Pennacchio and colleagues reasoned that they needed a genomic approach, and they described a very effective one in 2006. They started their search for vertebrate enhancers by looking for highly conserved noncoding DNA regions.
These DNA regions could meet their definition of “highly conserved” in two ways: They were either conserved in distantly related species (say, human and pufferfish), or 100% conserved over at least 200 base pairs in more closely related species (e.g., human and mouse). Pennacchio and colleagues found 167 such enhancer candidates.

[Figure 25.14 graphs: vertical axis, fraction of bases included in a pCRM; regions shown along the gene diagram are the 5′-UTR, first intron, middle introns, last intron, and 3′-UTR, with the TSS marked; scale bars of 10, 20, 50, and 100 kb.]

Figure 25.14 Distribution of pCRMs within and surrounding genes. (a) The fraction of bases included in a pCRM is plotted vs. position within or outside a gene. Colors in the graph, and in the gene diagram below, represent various gene regions as follows: Dark blue, upstream and downstream flanking regions; red, 5′-UTR; yellow, first intron; light blue, middle introns; brown, last intron; aqua, 3′-UTR. (The fraction of bases in a pCRM is off scale for the 3′-UTR, so no aqua line is visible.) (b) Same as in (a), except that the horizontal scale has been lengthened to show the individual regions more clearly.

To test these DNA sequences for enhancer activity, they hooked them up to lacZ reporter genes under the control of a mouse minimal promoter. Then they placed these constructs into mouse zygotes, creating transgenic mice. They allowed the transgenic embryos to grow to embryonic day 11.5, then stained whole embryo mounts with X-gal to detect β-galactosidase. Strong blue staining with X-gal indicates abundant β-galactosidase, and therefore strong transcription stimulated by proteins binding to an enhancer. Pennacchio and colleagues chose day 11.5 embryos for several reasons: First, they can be stained and visualized as whole embryo mounts. Furthermore, major organ systems are visible by this stage. Finally, highly conserved enhancers are known to be clustered near genes that are expressed during embryonic life.

Of the 167 enhancer candidates tested in this way, Pennacchio and colleagues found that 75 (45%) were positive in this transgenic mouse enhancer assay. Figure 25.15 shows the number of enhancers that operated in each of several different tissues, and the pattern of staining that demonstrates each of the tissue-specificities. The numbers add up to more than 75, because many of the enhancers are active in more than one tissue. It is striking that nervous tissue is by far the most common locus of enhancer activity in this experiment, but that is not surprising, considering that a large percentage of vertebrate genes are expressed in nervous tissue, and that the development of the nervous system is complex and requires the function of many genes.

Figure 25.15 Expression patterns driven by enhancers discovered by transgenic mouse enhancer assay. The expression patterns are pictured in typical X-gal-stained mouse 11.5-day embryonic whole mounts, below the bar graph. The number of DNA elements giving rise to each expression pattern is shown. Some enhancers produced more than one expression pattern, which explains why the number of elements is higher than the total number (75) of enhancers tested. (Source: Reprinted by permission from Macmillan Publishers Ltd: Nature, 444, 499–502, 23 November 2006. Pennacchio et al., In vivo enhancer analysis of human conserved non-coding sequences. © 2006.)

Thus, this strategy has a remarkably high success rate: 45%, achieved by sampling only one stage of embryonic development. One expects that many of the sequences that gave negative results in this experiment would be positive if other stages of life were sampled. Also, it is already known that some of the negative sequences are in fact silencers, so they are also interesting gene control elements. Pennacchio and colleagues reported that there are 5500 more noncoding sequences in the human genome that are conserved between humans and pufferfish, and are thus good candidates for additional enhancers. This strategy therefore shows great promise for locating enhancers in the human and in other genomes.

As successful as this method may be for locating gene control regions, it suffers from the drawback that it only detects highly conserved sequences. And there is reason to believe that not all important gene control regions are conserved. We have already seen examples of poorly conserved control regions in different species of yeast earlier in this chapter, and the same phenomenon is also found in vertebrates. In 2008, Duncan Odom and colleagues reported their studies on gene expression in mouse cells carrying a copy of human chromosome 21. They found that the levels of transcription of human chromosome 21 genes in mouse cells more closely resemble their transcription levels in human cells than the levels of transcription of homologous mouse genes in mouse cells. This implies that mouse transcription factors recognize human gene control regions and homologous mouse gene control regions differently. Indeed, Odom and colleagues also showed by ChIP analysis that mouse transcription factors bind to human chromosome 21 in a more human-like than mouse-like pattern.
The most likely reason for these differences is a difference in sequence between the human and mouse gene control regions. Thus, one probably misses important gene control regions if one focuses only on highly conserved sequences, even between closely related species.

SUMMARY To find enhancers whose protein partners are unknown, one can look for noncoding sequences that are highly conserved between moderately related species, or absolutely conserved between closely related species. These putative enhancers can then be verified by linking them to a reporter gene, such as lacZ, and looking for reporter gene activity in embryos, in which many genes are active. In the case of the lacZ reporter gene, one looks for blue tissue in the presence of the indicator X-gal. One limitation to this kind of study is that some important gene control regions are not well conserved, even between closely related species.

Locating Promoters

In principle, class II promoters should be easier to locate than enhancers, as they lie at or very near the transcription start sites of genes, which are usually known. Nevertheless, when Bing Ren and colleagues performed a genome-wide search for human promoters, they got a surprise: Many genes have alternative promoters that are located hundreds of base pairs away from the primary ones.

Ren and colleagues searched for promoters in human fibroblasts using a ChIP-chip strategy. As mentioned earlier in this chapter, the ChIP-chip technique seeks to identify regions in the genome that bind a particular protein. Ren and colleagues performed ChIP using a monoclonal antibody against the TAF1 subunit of TFIID, reasoning that preinitiation complexes forming at promoters should contain this key general transcription factor. Then they amplified the DNA precipitated by ChIP and used it to probe DNA microarrays containing about 14.5 million 50-mers representing all the nonrepetitive DNA in the human genome.
Figure 25.16 summarizes the method and presents some of their findings. They found 12,150 TFIID-binding sites, of which 10,553 (87%) mapped within 2.5 kb of a known transcription start site. They had to use the fairly large window of 2.5 kb to allow for uncertainties in the mapping of transcript 5′-ends and uncertainties in the ChIP-chip mapping of TFIID-binding sites due to noise in the microarray data. Some TFIID-binding sites mapped to the same transcript 5′-ends; by eliminating these redundancies, Ren and colleagues settled on 9328 binding sites that mapped to unique transcripts.

They subjected these 9328 binding sites to four tests for promoter-like character. First, they performed ChIP-chip analysis with an anti-RNA polymerase II antibody and found that 97% of the TFIID-binding sites also bound polymerase II. Second, they selected 28 of these sites at random and performed standard ChIP analysis with an anti-RNA polymerase II antibody to verify polymerase II binding. All but one site passed this test. Third, they searched for CpG islands and Inr, DPE, and TATA box core promoter elements in the 9328 TFIID-binding sites. They found enrichment for the first three but not for TATA boxes (Figure 25.16c). Fourth, they used ChIP-chip analysis to look for histone modifications (acetylated histone H3 and dimethylated lysine 4 on histone H3) that are associated with gene activity. Again, 97% of the TFIID-binding sites were associated with these modifications. In summary, the ChIP-chip method appears to have selected promoters very accurately, and most of these promoters lack TATA boxes, in accord with other data showing a paucity of TATA boxes in yeast and Drosophila.

Ren and colleagues discovered that over 1600 of the genes they identified had multiple promoters. In most cases, these promoters gave rise to transcripts that differed only in the lengths of their 5′-UTRs, or in having a distinct first exon, but did not affect the protein products of the genes.
In other cases, they gave rise to transcripts that were spliced, polyadenylated, or translated differently. These latter cases could provide another layer of control over gene expression, if cells can select which promoter to use at a given time.

SUMMARY Class II promoters can be identified using ChIP-chip analysis with an anti-TAF1 antibody. In one such study with human fibroblasts, over 9000 promoters were identified, and over 1600 genes had multiple promoters.

[Figure 25.16 panels: (a) region ENr231, chr1:148,374,643-148,874,642, with tracks for TFIID ChIP (log2 R), RefGene, and TFIID ChIP replicate 1 and replicate 2 peaks at TCFL1; (b) count (n = 9328) vs. position of TFIID binding site relative to the matched 5′-end (kb); (c) percentage of occurrence of promoter elements (CpG, Inr, TATA, DPE) for Control, IMR90, and DBTSS.]

Figure 25.16 Finding promoters. Ren and colleagues performed ChIP-chip analysis using an anti-TAF1 antibody to identify TFIID-binding sites in human fibroblasts. (a) Representative results from a relatively small region of human chromosome 1. The top panel presents the logarithmic ratio (log2 R) of hybridization of DNA precipitated by TAF1-ChIP to hybridization of a control DNA. Peaks show putative TFIID-binding sites. The middle panel shows a gene annotation of this DNA region from the RefSeq database. Note that the peaks in the top panel generally align with the 5′-ends of the annotated genes. The bottom panel presents a blow-up of two replicate ChIP analyses of the TCFL1 gene. Arrows show the peak of hybridization, determined by a peak-finding algorithm, and the position of the gene is given below, with the 5′-end on the right. (b) Alignment of TFIID-binding sites with 5′-ends of genes. The bulk of the binding sites (83%) fall within 500 bp of the 5′-ends of genes. (c) Association of CpG islands and three core promoter elements with promoters. Red, TFIID-binding sites identified in this study; blue, promoters from the DBTSS database; yellow, control DNA.

In Situ Expression Analysis

Consider the following opportunity: As is well known, human chromosome 21 is involved in Down syndrome. To discover which gene(s) on this chromosome are responsible for the disorder, it would be useful to know the pattern of expression during embryonic life of all the genes on this chromosome. Such studies are routinely done in lower organisms, typically by performing in situ hybridization (Chapter 5) with cDNA probes in embryonic sections. But that presents a serious problem: Such studies are ethically problematic when performed on human embryos. Fortunately, now that we have the sequence of the mouse genome, there is a way around this problem. The mouse genome harbors orthologs for 161 of the 178 confirmed genes on human chromosome 21. So the expression of these genes can be followed through time and space during development of mouse embryos, and we can assume a similar pattern of expression applies to the homologous genes in the human embryo.

Two research groups applied this strategy to the mouse orthologs of the genes on human chromosome 21. In one, Gregor Eichele, Stylianos Antonarakis, and Andrea Ballabio and their colleagues looked at expression of 158 of the mouse orthologs at three times during gestation by in situ hybridization. They also checked the expression patterns of all 161 orthologs in adults by RT-PCR (Chapter 5). They found patterned expression (expression confined to specific sites at specific times) of several genes. Moreover, some of this patterned expression was in sites (central nervous system, heart, gastrointestinal tract, and limbs) that are consistent with the pathology of Down syndrome.
For example, Figure 25.17 shows the expression of the Pcp4 gene in day 10.5 mouse embryos (by in situ hybridization to whole mount embryos) and in day 14.5 embryos (by in situ hybridization to embryonic sections). At day 10.5, the gene is expressed in the eye (black arrow), brain, and dorsal root ganglia (white arrow). At day 14.5, the gene is expressed in many tissues, including the cortical plate (red arrow) in the brain, the midbrain, cerebellum, spinal cord, intestine, heart, and dorsal root ganglia. All of these are areas affected by Down syndrome, so the Pcp4 gene is a candidate for one of the genes involved in the disorder.

Figure 25.17 Expression of two genes in mouse embryos. Gene expression was assayed by in situ hybridization (Chapter 5), using either a whole mount embryo (panel a), or a sectioned embryo (panel b). (a) Expression of Pcp4 in a whole mount of a day 10.5 embryo. The black arrow indicates the eye, and the white arrow indicates a dorsal root ganglion. (b) Expression of Pcp4 in a section of a day 14.5 embryo. The red arrow indicates the cortical plate of the brain. Dark staining denotes expression of the gene. (Source: Adapted from Nature 420: from Reymond et al., fig. 2, p. 583, 2002.)

Another example combines work from Eichele and colleagues and another group headed by Ariel Ruiz i Altaba, Bernhard Herrmann, and Marie-Laure Yaspo on the expression of the mouse SH3BGR gene at days 9.5, 10.5, and 14.5 of gestation. These studies show that this gene is prominently expressed in the heart at all three stages of development. Because the heart is one of the organs affected by Down syndrome, the SH3BGR gene is another candidate for involvement in the disorder.
SUMMARY The mouse can be used as a human surrogate in large-scale expression studies that would be impermissible to perform on humans. For example, scientists have studied the expression of almost all the mouse orthologs of the genes on human chromosome 21. They have followed the expression of these genes through various stages of embryonic development and have catalogued the embryonic tissues in which the genes are expressed.

Single-Nucleotide Polymorphisms: Pharmacogenomics

Now that we have a finished draft of the human genome sequence, we can look for differences among individuals. So far, most of these are differences in single nucleotides, and we classify them as single-nucleotide polymorphisms, or SNPs (pronounced “snips”) if the minor variant is present in at least 1% of the population. The human genome contains at least 10 million such SNPs, and on average, any two unrelated people differ in millions of SNPs. If we can link these SNPs to human diseases governed by defects in single genes, we could then screen individuals for the tendency to develop those diseases simply by screening for SNPs. We might also be able to find sets of SNPs that associate with polygenic traits, such as susceptibility to such disorders as cardiovascular disease and cancer and thus pin down the genes responsible for these traits. We may also be able to identify SNPs that correlate with good or poor response to certain drugs. Using this information, physicians should be able to screen a patient for key SNPs, then custom design a drug treatment program for that patient based on his or her predicted responses to a range of drugs. This field of study is called pharmacogenomics.

However, these tasks will not be easy. Already, geneticists are discovering that the vast majority of SNPs are not in genes at all, but in intergenic regions of DNA. Most of these do not affect gene function, but a few will if they are located in gene control regions.
Even when they are found within genes, they tend to be silent mutations that do not alter the structure of the protein product, and thus do not usually cause any malfunction that could lead to a disease. (For an exception, see Chapter 18.) The reason for the preponderance of silent SNPs is clear: Polymorphisms caused by mutations that change the products of genes are generally deleterious, and are therefore selected against. That is, the individuals with these damaging mutations generally die before they can reproduce and thus the mutations are lost. Finally, if history is a guide, even knowing which SNPs correlate with diseases may not be of immediate benefit. It will take time to figure out how to use this information. One can detect SNPs correlating with disease or other traits in any given individual by a variety of genotyping techniques. One of these is to hybridize a primer adjacent to a SNP and then perform primer extension with fluorescent nucleotides and observe which nucleotide is incorporated in the SNP position. Another is to hybridize a person’s DNA to DNA microarrays containing oligonucleotides with the wild-type and mutated sequences. Still another is sequencing: either shotgun sequencing, or amplifying a region surrounding a SNP by PCR and then sequencing it. Such knowledge can be useful in helping to prevent or treat disease. How do SNPs differ from RFLPs? RFLPs are identical to SNPs if the single-nucleotide difference between two individuals lies in a restriction site, as we observed in Chapter 24 in the RFLPs involving HindIII sites in Huntington disease patients. In such a case, a single-nucleotide difference makes a difference in the pattern of restriction fragments. However, RFLPs can also result from insertion of a chunk of DNA between two restriction sites in one individual, but not another—VNTRs, for example. 
That would not be a SNP because it involves more than just a single-nucleotide difference.

For those who are enthusiastic about the potential of SNPs to help identify the causes of common diseases, 2005 was a banner year. The International HapMap Consortium published a haplotype map including over 1 million human SNPs, discovered by genotyping 269 DNA samples from four distinct human populations (one in Nigeria; one in Utah, USA; one in China; and one in Japan). A haplotype map shows the locations of haplotypes, blocks of DNA that tend to be inherited intact, because of the low rate of recombination within the block. We have already seen in our discussion of the human genome that the rate of recombination varies considerably from spot to spot, and regions of high recombination rate alternate with regions in which recombination is rare. The latter regions are likely to contain genetic markers that are inherited together and therefore make up a haplotype.

By focusing on certain well-chosen SNPs (tag SNPs), the International HapMap Consortium was able to identify other SNPs in the same region, thus cutting down on the total amount of genotyping they had to do. They did this genotyping largely by hybridizing labeled human DNA fragments to DNA microarrays designed to detect tag SNPs. This procedure is highly automated, allowing one worker to scan 500,000 SNPs covering the whole genome in only two days.

One immediate payoff of the project was the identification of millions of new SNPs (only 1.7 million were known at the beginning of the project). Another was new insight into recombination and natural selection in human evolution. But the potential payoff that attracts the most attention is the identification of genes that are involved in human diseases.
This process was straightforward in the case of Huntington disease (HD) and other diseases caused by a mutation in a single gene, because people with particular mutations are all but certain to have the disease. But it is vastly more difficult when many genes contribute to a disease, because each mutation may contribute only a little bit, and so each is difficult to spot. Unfortunately, the diseases that kill and disable most people (cancer, heart disease, and dementia, for example) are of the latter kind.

In principle, the HapMap should make this job easier. Indeed, in 2005, Josephine Hoh, Margaret Pericak-Vance, and Albert Edwards and their colleagues reported their work on age-related macular degeneration (AMD), a common cause of blindness in elderly people. They scanned 116,204 human SNPs looking for linkage to AMD and found one with a high degree of correlation. That is, one allele is found significantly more frequently in AMD patients than in normal controls. These workers traced this SNP to a gene called CFH, which encodes complement factor H. This protein regulates the complement cascade, which governs inflammation.

Later in 2005, Gregory Hageman, Rando Allikmets, Bert Gold, Michael Dean, and their colleagues confirmed the linkage between CFH and AMD, finding a high-risk variant of the gene and also several variants of the gene that appeared to be protective. These results led Hageman, Allikmets, Gold, Dean, and their colleagues to look for participation of other components of the complement cascade. Sure enough, they found a strong association between AMD and the factor B gene, and both high-risk and protective variants. These findings validated Hageman’s earlier hypothesis that inflammation is central to the disease process in AMD, and suggest that controlling inflammation may be a way to help prevent or control the disease. But genes in the complement cascade are not the only ones linked to AMD.
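The comparison that underlies a scan of this kind, asking whether one allele is more frequent in patients than in controls, can be illustrated with a chi-square test on allele counts. The Python sketch below is illustrative only: the function name `allele_chi_square`, the allele counts, and the Bonferroni correction are assumptions made for the example, not the analysis the AMD studies actually reported. Only the figure of 116,204 SNPs comes from the text.

```python
import math

def allele_chi_square(case_risk, case_other, ctrl_risk, ctrl_other):
    """Pearson chi-square test (1 degree of freedom) on a 2x2 table
    of allele counts: rows are cases/controls, columns are alleles.
    Assumes every row and column total is greater than zero."""
    table = [[case_risk, case_other], [ctrl_risk, ctrl_other]]
    row_totals = [sum(row) for row in table]
    col_totals = [case_risk + ctrl_risk, case_other + ctrl_other]
    total = sum(row_totals)

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Count expected if allele frequency were the same in both groups
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected

    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

N_SNPS_SCANNED = 116_204  # number of SNPs scanned, from the text
# Assumed multiple-testing correction: divide 0.05 by the number of tests
BONFERRONI_THRESHOLD = 0.05 / N_SNPS_SCANNED

# Hypothetical allele counts, invented for illustration only
chi2, p = allele_chi_square(case_risk=120, case_other=72,
                            ctrl_risk=35, ctrl_other=65)
print(f"chi2 = {chi2:.2f}, p = {p:.2e}")
print("passes corrected threshold:", p < BONFERRONI_THRESHOLD)
```

The corrected threshold shows why such scans are demanding: when more than 100,000 SNPs are tested at once, an allele must differ very sharply between patients and controls before the difference can be distinguished from chance.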
Another group has linked a gene (LOC387715), with a product of unknown function, to AMD, and there are sure to be others.

Other workers have looked beyond SNPs in comparing the genomes of different people, and have been surprised by what they found. The genomes of seemingly normal people frequently contain not just SNPs, but deletions, insertions, inversions, and other rearrangements of whole chunks of DNA. Geneticists are now calling such differences in genomes structural variation. For example, Michael Wigler and his colleagues examined the genomes of 20 healthy individuals and found 221 places where these people had different numbers of copies of particular chunks of DNA. While these variations in copy number had no apparent effect on health in these people, it is possible that, in combination with certain environmental factors, they could predispose other people to disease. On the other hand, some structural variants appear to be beneficial. Sunil Ahuja and colleagues have shown that extra copies of a particular immune system gene help protect people against AIDS. And a team of Icelandic scientists has discovered a large inversion that is carried by 20% of Europeans. Strikingly, women carrying this inversion have more children than those who do not, suggesting that the inversion confers some kind of reproductive advantage, and that it is therefore probably spreading.

The complete sequences of genomes of simpler organisms can also be important in understanding and treating human diseases. For example, as soon as the complete yeast genome had been sequenced, molecular biologists began systematically mutating every one of the 6000 yeast genes to see what effects those mutations would have. They also began systematically screening all 18 million possible protein–protein interactions using a yeast two-hybrid screen (Chapter 5, and see later in this chapter).
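The figure of 18 million follows from counting unordered pairs among roughly 6000 proteins. The short Python sketch below shows the arithmetic; the function name `count_pairs` and the option to include self-interactions are assumptions made for illustration, and the text does not say which convention was used.

```python
def count_pairs(n_proteins: int, include_self: bool = True) -> int:
    """Number of unordered pairs that can be formed from n_proteins.

    include_self=True also counts each protein paired with itself
    (a possible homodimer); include_self=False counts only pairs
    of two different proteins.
    """
    if include_self:
        return n_proteins * (n_proteins + 1) // 2
    return n_proteins * (n_proteins - 1) // 2

N_YEAST_GENES = 6000  # approximate gene count, from the text

print(count_pairs(N_YEAST_GENES))                      # 18003000
print(count_pairs(N_YEAST_GENES, include_self=False))  # 17997000
```

Under either convention the total rounds to 18 million, which is why a screen of this kind must be systematic and automated.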
The results of such experiments can tell us much about the activities of gene products that are still uncharacterized. And knowing the activities of all the proteins in an organism, and the other proteins with which they interact, should lead to greater understanding of biochemical pathways, such as the ones that metabolize drugs, or signal transduction pathways that control gene expression. This understanding,