95 251 Functional Genomics Gene Expression on a Genomic Scale

by taratuta

Chapter 25 / Genomics II: Functional Genomics, Proteomics, and Bioinformatics
25.1 Functional Genomics: Gene Expression on a Genomic Scale
First of all, one can focus on expression of genomes at
the RNA level. If we consider all the transcripts an organism makes at any given time, we call that the organism’s
transcriptome, by analogy with the term “genome,” which
refers to all the genes in an organism. And functional
genomics studies that measure the levels of RNAs produced
from many genes at a time are part of a field called
transcriptomics. Second, one can use genomic information
to try to determine the pattern of expression of all the genes
in an organism at all stages of the organism’s life. This kind
of analysis is called genomic functional profiling.
Third, one can compare many individuals’ genomes to
find significant differences. For example, differences in single nucleotides are called single-nucleotide polymorphisms
(SNPs). Sometimes these SNPs are associated with genetic
disorders or other, less dramatic characteristics, such as
susceptibilities to drugs. But SNPs are not the only common differences among human genomes. The more geneticists look, the more they find major chromosomal structural
variations, such as inversions, duplications, and deletions.
Moreover, at least some of these variations appear to have
important consequences. For example, one long inversion
has been found to be common in Europeans, but not in
Africans and Asians, and women with this inversion have
more children than those without it. Thus, the inversion
seems to provide an evolutionary advantage.
Finally, one can study the structures and functions of
the protein products of genomes. To the extent that it
focuses on protein structure, this latter enterprise can be
called structural genomics, but the whole endeavor is called
proteomics, and will be the subject of a later section of this
chapter. In this section, we will consider transcriptomics,
genomic functional profiling, and SNPs.
Transcriptomics
To discover the pattern of expression of a gene in a given
tissue over time, one can perform a dot blot analysis as
described in Chapter 5. In a classical dot blot, one makes spots
a few mm in diameter containing a single-stranded DNA
from the gene in question on filters and then hybridizes
these dot blots to labeled RNAs made in the tissue in question at different times. But suppose one wants to know the
pattern of expression of all the genes in that tissue over
time. In principle, one could make a large dot blot with tens
of thousands of single-stranded DNAs corresponding to all
the potential mRNAs in a cell and hybridize labeled cellular
RNAs to that monster dot blot. But the sheer size of that
blot would present a serious problem. Fortunately, molecular
biologists have devised some methods to miniaturize such
blots, and some novel methods to analyze the expression of
whole genomes. We will look first at DNA arrays and gene
microchips, and then at a more exotic method.
Figure 25.1 Schematic diagram of a DNA microarray. This drawing
represents a standard, 1″ × 3″ (25.4 mm × 76.2 mm) glass microscope
slide with an array of 7500 tiny spots of DNA. Each dot is 200 μm in
diameter, and the distance between the dot centers is 400 μm. This is
by no means the highest density of spots presently attainable. It is
actually possible to place more than 50,000 spots on a slide of this
size. (Source: Adapted from Cheung, V.G., M. Morley, F. Aguilar,
A. Massimi, R. Kucherlapati, and G. Childs, Making and reading
microarrays. Nature Genetics Supplement Vol. 21 (1999) f. 2, p. 17.)
DNA Microarrays and Microchips To circumvent the
problem of size, molecular biologists have adapted inkjet
printer technology to spot tiny volumes of DNA on a chip,
so the dots of the dot blot are very small. This allows many
different DNAs to be spotted on one chip, called a DNA
microarray. One system, developed by Vivian Cheung and
colleagues, uses a robot with 12 parallel pens, each of
which can squirt out a tiny volume of DNA solution:
0.25–1.0 nL (billionth of a liter). The spots are exquisitely
small, only 100–150 μm in diameter, and the centers of the
spots are only 200–250 μm apart. The result looks like the
schematic diagram in Figure 25.1, but even better, as
the figure represents a DNA microarray with only 7500 DNA
spots on a common microscope slide. After spotting, the
DNAs are air dried, and covalently attached by ultraviolet
radiation to a thin silane layer on top of the glass.
Another strategy for reducing the size of a blot has
been to synthesize many oligonucleotides simultaneously,
right on the surface of a chip. Steven Fodor and his colleagues pioneered this method in 1991, using the same
kind of photolithographic techniques employed in computer chip manufacture, to build short DNAs (oligonucleotides) on tiny, closely spaced spots on a small glass
microchip. In a 1999 version of this technique (Figure
25.2), these workers started with a small glass slide coated
with a synthetic linker that was blocked with a photoreactive group that can be removed by light. They masked
some of the areas of the slide and illuminated it, so the
blocking agent was removed only from the unmasked
areas. Then they added a nucleotide (also blocked with a
photoreactive group) and chemically coupled it to all the
areas of the slide that had been unblocked in the previous
step. The result: A nucleotide was attached to a subset of
the tiny spots on the chip. Next, they masked a different
subset of spots, illuminated the others to remove the blocking groups, and attached another nucleotide. On the spots
that were unmasked in both steps, dinucleotides were
formed. By repeating this process, they could build up different oligonucleotides on each spot.
Figure 25.2 Growing oligonucleotides on a glass substrate.
The glass is coated with a reactive group that is blocked with a
photosensitive agent (red). This blocking agent can be removed with
light, but parts of the plate are masked (blue) so the light cannot get
through. In the first cycle, four of the six spots pictured are masked,
so the light reaches only two unmasked spots and removes the
blocking agent. Then a blocked guanosine nucleotide is chemically
coupled to the unblocked spots. In the second cycle, three spots are
masked, and the other three are therefore exposed to the light. This
removes the blocking agent from three spots, including the first one,
which already has a G attached. Thus, after a blocked adenosine
nucleotide is chemically coupled to the three unblocked spots, the
first spot has a G–A dinucleotide, the third and sixth spots have an
A mononucleotide, the fourth has a G mononucleotide, and the
second and fifth spots, which were masked in both cycles, have no
nucleotides attached yet. As the cycle is repeated over and over with
different masking patterns and different nucleotides, unique
oligonucleotides are built up in each spot.
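The masking cycles described above can be imitated in a few lines of Python. This is only an illustrative sketch: the function names `synthesize` and `schedule_for` are invented here, spots are numbered from 1 as in Figure 25.2, and the chemistry (deprotection, coupling efficiency) is ignored. Each cycle is modeled simply as "every unmasked spot gains one nucleotide."

```python
from itertools import product

BASES = "ACGT"


def synthesize(n_spots, cycles):
    """Simulate light-directed synthesis on n_spots spots.

    cycles is a list of (unmasked_spots, nucleotide) pairs. Only the
    unmasked spots are deprotected by light, so only they receive the
    nucleotide coupled in that cycle.
    """
    spots = [""] * n_spots
    for unmasked, nucleotide in cycles:
        for spot in unmasked:
            spots[spot - 1] += nucleotide
    return spots


def schedule_for(targets):
    """Return a mask schedule that builds targets[i] on spot i + 1.

    Assumes all targets have the same length. For each position there
    are four cycles, one per base; each cycle unmasks exactly the spots
    whose target needs that base at that position.
    """
    cycles = []
    for pos in range(len(targets[0])):
        for base in BASES:
            unmasked = {i + 1 for i, t in enumerate(targets) if t[pos] == base}
            cycles.append((unmasked, base))
    return cycles


# The two cycles drawn in Figure 25.2 (six spots).
figure_cycles = [
    ({1, 4}, "G"),     # first cycle: spots 1 and 4 unmasked, G coupled
    ({1, 3, 6}, "A"),  # second cycle: spots 1, 3 and 6 unmasked, A coupled
]
print(synthesize(6, figure_cycles))

# All 64 possible 3-mers, one per spot, built with 4 x 3 = 12 cycles.
all_3mers = ["".join(p) for p in product(BASES, repeat=3)]
cycles = schedule_for(all_3mers)
print(len(cycles), synthesize(len(all_3mers), cycles) == all_3mers)
```

The first call reproduces the state described in the Figure 25.2 caption (G–A on spot 1, G on spot 4, A on spots 3 and 6, nothing on spots 2 and 5). The second shows why the number of cycles grows with oligonucleotide length rather than with the number of different oligonucleotides.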
The resulting chip is known as a DNA microchip or oligonucleotide array, although these terms and “DNA microarray” are often used interchangeably. In fact, the generic term “microarray” can be used to refer to any kind of DNA or oligonucleotide microarray. The technology is so miniaturized that about 300,000 oligonucleotides can be built on a chip only 1.28 × 1.28 cm (about ½ inch square). And the process is so efficient that a set of $4^n$ different oligonucleotides can be built in only 4 × n cycles. So if our goal is to generate all the possible 9-mers ($4^9$, or about 250,000 different oligonucleotides), we can do it in only 4 × 9 = 36 cycles. How long must an oligonucleotide be to uniquely identify one human gene product in a mixture of all the others? Knowing the sequence of the human genome helps us answer this question with great accuracy. However, even without that information, we can do a calculation to give us a minimum estimate. A given sequence of n bases will occur in a DNA about every $4^n$ bases. In other words, a DNA sequence needs to be n bases long to occur about once in a DNA $4^n$ bases long. Thus, we need to solve the following equation for n to find the minimum size of an oligonucleotide we would expect to find only once in the whole human genome, which may be as much as $3.5 \times 10^9$ bases long:

$$4^n = 3.5 \times 10^9$$

The answer is that if n = 16, $4^n > 3.5 \times 10^9$. So our oligonucleotides need to be at least 16 bases long, and that would require 4 × 16 = 64 cycles to build them all on an oligonucleotide array. Again, however, this is a minimum estimate, so it would be a good idea to start with longer oligonucleotides to be reasonably sure that they occur only once in the human genome and therefore uniquely identify human genes.
Even before the publication of the sequence of the first human chromosome, scientists at Affymetrix, Inc. were already producing microchips containing 25-mers designed to recognize single genes. They based their design on the sequence that was available, including the many ESTs already in the database. To enhance the reliability of their chips, they included multiple oligonucleotides designed to hybridize to single transcripts, so the results obtained with each of these oligonucleotides could be checked against one another.
The oligonucleotides on a microchip or the cDNAs on a microarray can be hybridized to labeled RNA isolated from cells (or to corresponding cDNAs) to see which genes in the cell were being transcribed. For example, consider a study by Patrick Brown and colleagues in which they used the DNA microarray technique to examine the effect of serum on the RNAs made by a human cell. They isolated RNA from cells grown in the presence and absence of serum, then reverse transcribed the two RNA samples in the presence of nucleotides tagged with fluorescent dyes, so the cDNA products would be labeled with the fluorescent tags. They used a green-fluorescing nucleotide to label the cDNA from serum-deprived cells, and a red-fluorescing nucleotide to label the cDNA from serum-stimulated human cells. Then they mixed the cDNAs, hybridized them to DNA microarrays containing unlabeled cDNAs corresponding to 8613 different human genes, and detected the resulting
fluorescence. Figure 25.3 shows the same region of the microarray from triplicate hybridizations. The red spots correspond to genes that are turned on by serum, and the green spots represent genes that are active in serum-deprived cells. The yellow spots result from hybridization of both probes to the same spot (the green and red fluorescence together produce a yellow color). Thus, the yellow spots correspond to genes that are active in both the presence and absence of serum.
Microarrays allow one to examine changes in gene expression in systems much more complex than the one we have just described. For example, our knowledge of the complete yeast genome sequence has enabled molecular biologists to use DNA chips to analyze the expression of every yeast gene at once, under a variety of conditions.
In another example, Kevin White and colleagues used DNA chips in 2002 to follow the expression of 4028 Drosophila genes during 66 distinct periods throughout the fly’s life cycle. Figure 25.4a shows the 66 developmental stages at which RNAs were collected for gene expression analysis. Notice that almost half (30) of these time points were in the embryonic phase of development, in which gene expression changes most rapidly. In fact, early in the embryonic phase, when gene expression is most dynamic, RNAs were collected every half-hour. This analysis yielded several conclusions:
■ A large number of genes (3219) experienced a substantial change in expression (four-fold or more) during the fly’s life cycle. Figure 25.4b shows all of these developmentally regulated genes, ordered by time of onset of the first increase in expression. That is, the topmost genes in the figure were stimulated earliest in the life cycle, and the bottommost genes were stimulated last.
■ More than 88% of the developmentally regulated genes are active during the first 20 h of development, which is before the end of the embryonic phase (see Figure 25.4c).
■ RNAs from about 33% of the developmentally regulated genes are already present at the very earliest time point (Figure 25.4c). These represent maternal genes, or maternal effect genes, those that are expressed during oogenesis in the mother. Thus, the maturing oocyte either transcribes these genes or receives their transcripts from surrounding nurse cells so the mRNAs are already present in the egg and are available for translation as soon as fertilization occurs.
■ As illustrated in Figure 25.4d, expression of some genes is maintained throughout the life cycle, whereas expression of others peaks and declines. In particular, as further illustrated in Figure 25.4e, genes that reach peak expression during early embryonic life tend to peak again in early pupal development, whereas genes that peak in the late embryonic phase tend to achieve another peak in late pupal development. A related phenomenon, not illustrated here, is that genes that peak in larval development tend to reach another peak of expression during adult life.
■ Genes encoding components of a given supramolecular complex tended to be coexpressed. Thus, the genes encoding the ribosomal proteins tended to be regulated coordinately, as did the genes encoding the proteins in the mitochondrion.
■ Genes encoding proteins with related functions tended to be coexpressed, even if the proteins did not form complexes. Thus, genes encoding transcription factors, or cell cycle regulators, tended to be expressed together.
■ Coexpression of some genes was tissue-specific. For example, one cluster of 23 coregulated genes included eight genes that were already known to be expressed in muscle cells. Upon further examination, the control regions of 15 of the genes in this cluster had pairs of binding sites for the transcription factor dMEF2, which is known to activate genes in differentiating muscle cells. Seven of the genes in the cluster had unknown function, and six of these had dMEF2-binding sites and were expressed in differentiating muscle. Thus, this analysis allowed White and colleagues to assign a function in muscle differentiation to these six unknown genes. This is important because it is very difficult to determine the function of genes based solely on their sequences. The additional clues about timing and location of expression are a tremendous help. Indeed, they allowed White and colleagues to assign functions to 53% of the genes they analyzed.
Figure 25.3 Using a DNA chip. Brown and colleagues made cDNAs from RNAs from serum-starved and serum-stimulated human cells. They labeled the cDNAs corresponding to RNAs from serum-starved cells with a green fluorescent nucleotide; they labeled the cDNAs corresponding to RNAs from serum-stimulated cells with a red fluorescent nucleotide. Then they hybridized these fluorescent cDNAs together to DNA chips containing cDNAs corresponding to over 8600 human genes. The figure shows the same part of the DNA chip from three different hybridizations. The red spots (e.g., spots 2 and 4) correspond to genes that are more active in the presence of serum. The green spots (e.g., spot 3) correspond to genes that are more active in the absence of serum. The yellow spots (e.g., spot 1) correspond to genes that are roughly equally active in the presence or absence of serum. (Source: Iyer, V.R., M.B. Eisen, D.T. Ross, G. Schuler, T. Moore, J.C. Lee, et al., The transcriptional program in the response of human fibroblasts to serum. Science 283 (1 Jan 1999) f. 1, p. 83. Copyright © AAAS.)
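The red, green, and yellow readout of the serum experiment in Figure 25.3 amounts to comparing two fluorescence intensities per spot. The sketch below shows one way to express that comparison. The two-fold cutoff, the function name `classify_spot`, and the example intensities are illustrative assumptions, not values from Brown and colleagues' study.

```python
import math


def classify_spot(red, green, fold=2.0):
    """Classify one spot from its two fluorescence intensities.

    red   -- signal from cDNA of serum-stimulated cells (must be > 0)
    green -- signal from cDNA of serum-starved cells (must be > 0)
    fold  -- assumed cutoff for calling a difference; not from the text
    """
    log_ratio = math.log2(red / green)
    threshold = math.log2(fold)
    if log_ratio >= threshold:
        return "red"     # more active in the presence of serum
    if log_ratio <= -threshold:
        return "green"   # more active in the absence of serum
    return "yellow"      # roughly equally active in both conditions


# Hypothetical (red, green) intensities for three spots.
spots = {"spot A": (4000, 500), "spot B": (300, 2400), "spot C": (1200, 1000)}
for name, (red, green) in spots.items():
    print(name, classify_spot(red, green))
```

Working with the logarithm of the ratio treats a two-fold increase and a two-fold decrease symmetrically, which is why it is used here instead of the raw ratio.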
[Figure 25.4, panels (a)–(c): (a) timeline of RNA collections from fertilization through the embryonic (0–24 h), larval, pupal, and adult (to 40 days) phases, with blastoderm, gastrulation, hatching, metamorphosis, and eclosion marked; (b) expression profiles across the E, L, P, and A phases, with a fold-induction color scale running from <0.25 to >4; (c) fraction of genes used versus developmental time (days), with an inset covering the first 20 h and the maternal genes marked.]
Figure 25.4 Patterns of expression of Drosophila genes during development. (a) Outline of RNA collection periods. White and colleagues collected RNAs from whole animals at the indicated times during development (E, embryonic; L, larval; P, pupal; A, the first 40 days of the adult phase). The embryonic period is expanded to show all of the overlapping collection periods. They purified poly(A)+ RNA by oligo(dT)-cellulose chromatography and made fluorescent cDNAs by reverse transcribing the poly(A)+ RNAs in the presence of a fluorescent nucleotide. Then they hybridized the fluorescent cDNA from a given time point to a microarray and measured the extent of hybridization. They normalized all such hybridization values against the extent of hybridization of a reference standard cDNA prepared from a mixture of RNAs from all phases of the life cycle. (b) Gene expression profiles. The profiles of 3219 genes whose expression levels changed by more than four-fold during the fly life cycle are arranged in order of the onset of the first increase in abundance of transcript. The developmental phase is indicated at top, with the same abbreviations and color coding as in (a). The expression level is indicated by color, as indicated at bottom: blue stands for low expression and yellow stands for high expression. (c) Graphic representation of the cumulative fraction of genes that have shown a strong increase in expression. Note that a large fraction (about 33%) of genes are already represented by a large amount of RNA at the earliest time point. These are labeled maternal genes. The inset is an expansion of the first 20 h of the embryonic phase, which also shows the large proportion of transcripts already present in the first hour of development.
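The selection and ordering behind Figure 25.4b (keep genes whose expression changes at least four-fold, then sort them by the onset of the first increase) can be sketched as follows. The definition of "onset" used here, the first time point that exceeds the previous one by an assumed factor `rise`, is a simplification chosen for illustration, and the gene names and profiles are invented. The source does not give the exact criteria White and colleagues applied.

```python
def fold_range(profile):
    """Ratio of the highest to the lowest level in a time course.

    Levels are assumed to be positive, normalized hybridization values.
    """
    return max(profile) / min(profile)


def onset_index(profile, rise=1.5):
    """Index of the first time point that exceeds the previous one by
    the factor `rise` (an assumed definition of 'first increase')."""
    for i in range(1, len(profile)):
        if profile[i] > profile[i - 1] * rise:
            return i
    return len(profile)  # no clear increase at any time point


def developmentally_regulated(profiles, min_fold=4.0):
    """Keep genes whose expression changes at least min_fold overall."""
    return {g: p for g, p in profiles.items() if fold_range(p) >= min_fold}


def order_by_onset(profiles, rise=1.5):
    """Gene names sorted so the earliest-induced genes come first."""
    return sorted(profiles, key=lambda g: onset_index(profiles[g], rise))


# Hypothetical normalized expression levels at four time points.
profiles = {
    "geneA": [1.0, 1.1, 5.0, 5.2],  # induced late
    "geneB": [1.0, 4.5, 4.4, 4.6],  # induced early
    "geneC": [1.0, 1.2, 1.1, 1.3],  # never changes four-fold
}
regulated = developmentally_regulated(profiles)
print(order_by_onset(regulated))
```

Genes that are already abundant at the first time point, such as the maternal genes in panel (c), show no increase under this definition and would need separate handling in a fuller analysis.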
[Figure 25.4, panels (d) and (e): (d) fold change (−16 to +16) across the E, L, P, and A phases for four genes: CG5958 (induced and maintained), Amalgam (early embryo/early pupa), CG1733 (transiently induced), and CG17814 (late embryo/late pupa); (e) percentage of embryonic genes with a second peak in each developmental interval (L1–L3, P1–P3, A1–A3), plotted separately for genes whose first embryonic peak was early (0–3 h) or late (9–19 h).]
Figure 25.4 Continued (d) Expression patterns of four selected genes. At upper left, gene CG5958 shows an induction in early embryonic phase to a high level that is largely maintained throughout the life cycle. At upper right, the Amalgam gene shows an induction in the early embryonic phase, a decrease in the larval phase, and a reinduction at the boundary between the larval and pupal stages. At lower left, gene CG1733 shows a distinct peak of expression at the larval–pupal boundary. At lower right, gene CG17814 shows one burst of induction that begins in the late embryonic phase and lasts through the larval phase, and a reinduction in the late pupal phase. (e) Reinduction patterns. The percent of genes expressed either early (blue) or late (red) in the embryonic phase that show a reinduction at the given times later in development. Note that the genes expressed in early embryogenesis tend to be reinduced in the early pupal stage (P1, bracket over blue bar), whereas the genes expressed in late embryogenesis tend to be reinduced in the late pupal stage (P3, bracket over red bar). (Source: Adapted from Arbeitman et al., Science 297, 2002. Fig. 1, p. 2271. © 2002 by the AAAS.)
SUMMARY Functional genomics is the study of the expression of large numbers of genes. One branch of this study is transcriptomics, which is the study of transcriptomes, that is, all the transcripts an organism makes at any given time. One approach to transcriptomics is to create DNA microarrays or DNA microchips, holding thousands of cDNAs or oligonucleotides, then to hybridize labeled RNAs (or corresponding cDNAs) from cells to these arrays or chips. The intensity of hybridization to each spot reveals the extent of expression of the corresponding gene. With a microarray one can canvass the expression patterns (both temporal and spatial) of many genes at once. The clustering of expression of genes in time and space suggests that the products of these genes collaborate in some process. This can give clues about the functions of genes of unknown function if the unknown gene is expressed together with one or more well-studied genes.
Serial Analysis of Gene Expression In 1995, Victor Velculescu, working with Kenneth Kinzler and colleagues, developed a novel method of analyzing the range of genes expressed in a given cell. They called this method serial analysis of gene expression (SAGE). The underlying strategy of SAGE is to synthesize short cDNAs, or tags, from all the mRNAs in a cell, and then link these tags together in clones that can be sequenced to learn the nature of the tags, and therefore the nature of the genes expressed in the cell, and the extent of expression of each gene.
Figure 25.5 shows how Velculescu and colleagues carried out this strategy. First, they used a biotinylated
oligo(dT) primer to prime reverse transcription of the
mRNAs present in human pancreatic tissue, yielding double-stranded cDNAs. The goal was to reduce the size of the
cDNAs to short tags that could be ligated together and sequenced readily. Because of the shortness of the tags (9 bp
in the example in Figure 25.5), it is important to confine
them to a small region of the cDNAs to increase the chance
that they will uniquely identify one cDNA. To begin the
shortening process, Velculescu and colleagues cleaved
the biotinylated cDNAs with an anchoring enzyme (AE) to
chop off a short 3′-terminal fragment. They chose as their
anchoring enzyme NlaIII, which recognizes 4-base restriction sites and therefore yields fragments averaging 250 bp
long. They bound these biotinylated 3′-fragments to streptavidin beads, which bind biotin.
Next, they divided the bead-bound cDNA fragments into
two pools and ligated one pool to a linker (Y) and the other
pool to a second linker (Z). Both linkers contained the recognition site for a type IIS restriction endonuclease (the tagging
enzyme [TE]) that cuts 20 bp downstream of this recognition
site. The result of cleavage of the cDNA fragments with the
tagging enzyme FokI was a set of short fragments, each containing the linker (Y or Z) followed by the 4-bp anchoring
enzyme site, followed by 9 bp from the cDNA. That 9-bp
piece of cDNA is the tag. If the tagging enzyme leaves overhangs, these can be filled in to yield blunt ends.
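The bench steps above have a simple computational counterpart: given a cDNA sequence, the tag is the 9 bp that lie immediately 3′ of the anchoring-enzyme site closest to the poly(A) tail. The sketch below assumes the NlaIII site CATG shown in Figure 25.5. The function name `sage_tag` and the example cDNA sequence are invented for illustration, and the sequence is written as the sense strand, 5′ to 3′.

```python
ANCHOR_SITE = "CATG"  # NlaIII recognition site, as drawn in Figure 25.5
TAG_LENGTH = 9        # tag length used in the example in the text


def sage_tag(cdna, anchor=ANCHOR_SITE, tag_length=TAG_LENGTH):
    """Return the tag next to the 3'-most anchoring site, or None.

    Only the 3'-terminal fragment stays on the streptavidin beads, so
    the relevant site is the last occurrence of the anchor sequence.
    """
    site = cdna.rfind(anchor)
    if site == -1:
        return None  # transcript has no anchoring site: no tag
    start = site + len(anchor)
    tag = cdna[start:start + tag_length]
    return tag if len(tag) == tag_length else None


# Hypothetical cDNA with two CATG sites; only the 3'-most one counts.
example_cdna = "GGCATGTTTACGCATGGAGCACACCTTGACAAAAAAAA"
print(sage_tag(example_cdna))

# A 4-base site is expected about once every 4**4 bases in random
# sequence, consistent with the roughly 250-bp average fragment size.
print(4 ** len(ANCHOR_SITE))
```

A transcript that lacks the anchoring site yields no tag at all, which is one limitation of the approach that this sketch makes visible.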
[Figure 25.5 diagram, steps: (a) Synthesize double-stranded cDNAs using a biotinylated oligo(dT) primer. (b) Cleave with anchoring enzyme (AE); bind 3′-terminal fragments to streptavidin beads. (c) Divide in half; ligate to linkers (Y and Z). (d) Cleave with tagging enzyme (TE), and blunt the ends. (e) Ligate and amplify by PCR with primers Y and Z. (f) Cleave with anchoring enzyme; isolate ditags; join together and clone.]
Figure 25.5 Serial analysis of gene expression (SAGE). (a) Double-stranded cDNAs are formed from cellular mRNAs, using biotinylated oligo(dT) to prime first-strand cDNA synthesis. Orange balls represent biotin. (b) Biotinylated cDNAs are cleaved with an anchoring enzyme (AE, NlaIII in this case), and the biotinylated 3′-end fragments are bound to streptavidin beads (blue). (c) The bead-bound fragments are divided into two pools; the fragments in one pool are ligated to linker Y (blue) and the fragments in the other pool are ligated to linker Z (pink). (d) The fragments are cleaved with the tagging enzyme (TE), and ends are filled in if necessary to create blunt ends. In this case, the tagging enzyme is FokI, which leaves 9-bp tags attached to the linkers. The tag attached to linker Y is represented by the arbitrary sequence CATCATCAT and its complement highlighted in yellow, and the tag attached to linker Z is represented by the arbitrary sequence GAGGAGGAG and its complement (light purple highlight). (e) Tag-containing fragments are blunt-end-ligated together and amplified by PCR with primers that hybridize to primer Y and primer Z regions in each linker. Only fragments ligated with tags joined tail to tail (ditags) will be amplified by PCR. (f) The amplified ditag-containing fragments are cleaved with the anchoring enzyme to yield ditags with sticky ends. The ditags are ligated together to form concatemers, which are cloned. Part of a concatemer of ditags is shown, with the 4-base recognition sites for the anchoring enzyme shown in green. Note that these 4-base sites set off each ditag so it can be recognized easily. The clones are then sequenced to discover which tags are represented, and in what quantity. This tells which genes are expressed, and how actively. (Source: Adapted from Velculescu, V.E., L. Zhang, B. Vogelstein, and K.W. Kinzler, Serial analysis of gene expression. Science 270:484, 1995.)
Velculescu and colleagues’ next task was to ligate the
tags together, along with defined DNA so they could tell
where one tag left off and another began. To do this, they
blunt-end-ligated the tagged fragments together to form
fragments with two tags abutting each other in the middle
(forming a ditag) and linkers on each end. The linkers contain sites that are complementary to a pair of primers that
can be used to amplify the whole fragment by PCR. After
the PCR amplification, Velculescu and colleagues cleaved
the products with the anchoring enzyme, ligated these
restriction fragments together, and cloned the products.
Now the ditags can be easily identified because each one is
flanked by the 4-bp anchoring enzyme recognition sites.
And, of course, half of each ditag belongs to one tag, and
half to the other. Clones with at least 10 tags (some had
more than 50) can be identified by PCR analysis and
sequenced. If enough clones are sequenced, we can get an
idea of the range of genes expressed, and tags that show up
repeatedly indicate genes that are very actively expressed.
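Reading tags back out of a sequenced concatemer is essentially string processing: split at the 4-bp anchoring sites, cut each ditag in half, and count. The sketch below is a simplified illustration. It assumes that the second tag of each ditag is read from the opposite strand and is therefore recovered as a reverse complement (a detail the schematic figure does not spell out), it ignores tags that happen to contain the anchoring site, and the concatemer and its counts are invented. The two tag sequences are the ones reported for the pancreas experiment.

```python
from collections import Counter

ANCHOR_SITE = "CATG"  # NlaIII site that flanks every ditag
TAG_LENGTH = 9
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def tags_from_concatemer(concatemer):
    """Split a concatemer at anchoring sites and return all its tags."""
    tags = []
    for ditag in concatemer.split(ANCHOR_SITE):
        if len(ditag) != 2 * TAG_LENGTH:
            continue  # skip flanking sequence or malformed pieces
        tags.append(ditag[:TAG_LENGTH])
        # Assumption: the second tag lies on the opposite strand.
        tags.append(reverse_complement(ditag[TAG_LENGTH:]))
    return tags


# Hypothetical concatemer of two ditags.
ditag_1 = "GAGCACACC" + reverse_complement("TTCTGTGTG")
ditag_2 = "GAGCACACC" + reverse_complement("GAGCACACC")
concatemer = ANCHOR_SITE.join(["", ditag_1, ditag_2, ""])

counts = Counter(tags_from_concatemer(concatemer))
for tag, n in counts.most_common():
    print(tag, n)
```

Tallying the tags over many sequenced clones gives the relative abundance of each transcript, which is the quantity SAGE is designed to measure.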
Velculescu and colleagues’ examination of expression
in the human pancreas by SAGE had predictable, and
therefore encouraging, results. The most common tags
(GAGCACACC and TTCTGTGTG) corresponded to the
genes for procarboxypeptidase A1 and pancreatic trypsinogen 2, respectively. These are two abundantly expressed
pancreatic proenzymes, which, after cleavage to the mature
enzyme forms, digest proteins in the small intestine. Many
other familiar pancreatic genes were identified among the
plentiful tags, but many of the tags did not match any gene
sequences in the database, so their identities were unknown. As the database expands to include all human
genes, all tags should at least be correlated to genes, even if
the functions of some of those genes remain obscure.
SUMMARY SAGE allows us to determine which
genes are expressed in a given tissue and the extent
of that expression. Short tags, characteristic of particular genes, are generated from cDNAs and ligated
together between linkers. The ligated tags are then
sequenced to determine which genes are expressed
and how abundantly.
Cap Analysis of Gene Expression (CAGE) SAGE is a useful method for global analysis of gene expression, but it
focuses on the 3′-ends of transcripts. Sometimes it is necessary to identify the 5′-ends of transcripts, for example, if
one is interested in identifying promoters on a genomic
scale. In that case, a related method known as cap analysis
of gene expression (CAGE, Figure 25.6) is available.
The CAGE procedure starts with reverse transcription (RT), as SAGE does, but with two important differences that ensure production of full-length cDNAs that copy the mRNA all the way to the 5′-end. First, the RT reaction includes a disaccharide known as trehalose. This substance stabilizes reverse transcriptase at high temperature, so the RT reaction can be run at 60°C. This elevated temperature weakens mRNA secondary structure that otherwise would stop the RT reaction before it reached the 5′-end of the mRNA. Second, a cap trapper method is used: The caps of the mRNAs in the mRNA–cDNA hybrids are tagged with biotin. As we will see, this allows hybrids with full-length cDNAs to be purified away from hybrids containing less-than-full-length cDNAs.
[Figure 25.6 diagram, steps: (a) reverse transcription, giving full-length and non-full-length cDNAs; (b) biotinylation; (c) RNase I; (d) magnetic bead capture; (e) base hydrolysis; (f) biotin-linker ligation (linker carrying an MmeI site); (g) second-strand synthesis; (h) MmeI digestion, then magnetic bead capture and ligation to linker 2, followed by XmaJI and XbaI digestion to release the 20-nt tag.]
Figure 25.6 Use of CAGE to produce 20-nt tags representing the 5′-ends of mRNAs. The procedure is described in the text. After the tags are produced as shown here, they can be ligated together via their identical sticky ends to form concatemers, cloned, and sequenced.
Figure 25.6 shows how the tagging works. First, the RT
priming is done, not with oligo(dT), but with oligo(dT), preceded by a stretch of random nucleotides that do not hybridize with the poly(A) tail. The importance of this feature will
become apparent shortly. After first strand cDNA synthesis,
both ends of the mRNA are tagged with biotin by reacting
the RNA–DNA hybrid with a biotin-containing reagent that
attaches to diols. There are only two diols (adjacent hydroxyl
groups) in a capped mRNA: the free 2′- and 3′-hydroxyl
groups in the cap and the 3′-terminal nucleotide.
One would like to tag just the cap, but the 3′-terminal
nucleotide is unavoidably tagged in the same step. But that
problem is resolved in the next step, in which the hybrids
are treated with RNase I. The RNase degrades any single-stranded RNA that is not hybridized to the cDNA. Thus, it
not only removes the biotin tag from any hybrids that contain incomplete cDNAs, it also removes the biotin tag from
the 3′-hydroxyl group at the end of every mRNA’s poly(A)
tail, which cannot hybridize to the random tail at the beginning of the primer. After the RNase treatment, the only
remaining biotin-tagged hybrids are those containing full-length cDNAs, and these are collected using magnetic
beads coated with the biotin-binding protein streptavidin.
After the hybrids are purified, their mRNA parts, including
the biotin-tagged caps, are destroyed by base hydrolysis,
leaving just the single-stranded cDNAs.
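The selection logic of the cap trapper step can be sketched in a few lines of Python. This is a toy model, not a simulation of the chemistry: the `Hybrid` class, its field, and the function names are illustrative inventions, and the only rule encoded is the one stated above, namely that RNase I removes any biotin attached to RNA that is not base-paired with cDNA.

```python
from dataclasses import dataclass


@dataclass
class Hybrid:
    """Toy mRNA-cDNA hybrid (illustrative model only)."""
    name: str
    cdna_reaches_cap: bool  # True if the cDNA extends to the 5'-end of the mRNA


def biotin_after_rnase(hybrid):
    """Return the set of biotin tags that survive RNase I treatment."""
    tags = {"cap", "3'-end"}   # both diols are biotinylated in the same step
    # The end of the poly(A) tail cannot pair with the primer's random tail,
    # so it is single-stranded and its biotin is always removed.
    tags.discard("3'-end")
    # If the cDNA stops short, the 5' region of the mRNA is single-stranded
    # and the biotinylated cap is cut away as well.
    if not hybrid.cdna_reaches_cap:
        tags.discard("cap")
    return tags


def captured_on_streptavidin(hybrids):
    """Keep only the hybrids that still carry at least one biotin tag."""
    return [h.name for h in hybrids if biotin_after_rnase(h)]


hybrids = [Hybrid("full-length", True), Hybrid("truncated", False)]
print(captured_on_streptavidin(hybrids))
```

Only the hybrid whose cDNA reaches the cap is retained, which is the outcome the procedure is designed to produce.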
Next, the full-length, single-stranded cDNAs are ligated
to biotin-tagged linkers (linker 1) that contain a recognition site for
the tagging enzyme MmeI, which dictates cleavage 20 and
18 nt away. Thus, after second-strand cDNA synthesis, the
tagged cDNAs can be cut with MmeI to yield 20-nt tags
that can be purified via their biotin parts, and ligated to a
second linker (linker 2) via their 2-nt overhangs. Linker 1
also contains a recognition site for XmaJI and linker 2 contains a recognition site for XbaI, so the tags can be cut with
those two enzymes, ligated together into concatemers,
cloned, and sequenced as in the SAGE procedure.
A given 20-nt sequence would be expected to occur by chance only once
every 4²⁰, or about 1.1 × 10¹², base pairs. Thus, since the human
genome contains only about 3 × 10⁹ bp, most of the 20-nt
tags should identify a unique sequence in even the large
human genome, which can be found by consulting the
known human genome sequence. This sequence should
begin with the transcription start site, so the promoter
should be in the immediate neighborhood. When Piero
Carninci and colleagues performed this kind of CAGE
analysis on mouse mRNAs from whole brain and three
distinct brain regions, they found many CAGE tags that
mapped close to previously mapped start sites, but many
more that did not. This could help identify a number of
new promoters and alternative start sites.
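The uniqueness argument for 20-nt tags is simple arithmetic, and it may help to see it worked through. The sketch below assumes, as the argument in the text implicitly does, that the four bases are equally frequent and independent and that only one strand is considered; real genomes violate both assumptions, so the numbers are rough guides only. The function name `min_unique_length` is ours.

```python
TAG_LENGTH = 20
GENOME_SIZE_BP = 3e9  # approximate size of the human genome, as in the text

# Number of distinct sequences of length 20 over a 4-letter alphabet.
possible_tags = 4 ** TAG_LENGTH

# Expected number of chance occurrences of one particular tag in the genome.
expected_chance_matches = GENOME_SIZE_BP / possible_tags


def min_unique_length(genome_size_bp):
    """Smallest tag length n for which 4**n exceeds the genome size."""
    n = 1
    while 4 ** n <= genome_size_bp:
        n += 1
    return n


print(possible_tags)                      # 1099511627776, about 1.1e12
print(round(expected_chance_matches, 4))  # 0.0027
print(min_unique_length(GENOME_SIZE_BP))  # 16
```

Under these idealized assumptions, a tag of 16 nt would already be expected to occur less than once per genome, so a 20-nt tag leaves a comfortable margin.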
SUMMARY Cap analysis of gene expression (CAGE)
gives the same information as SAGE about which
genes are expressed, and how abundantly, in a given
tissue. Because it focuses on the 5′-ends of mRNAs,
it also allows the identification of transcription start
sites and, therefore, helps locate promoters.
Whole Chromosome Transcriptional Mapping Transcriptomics studies have become sophisticated enough that they
can map transcripts with great accuracy to sites in whole
chromosomes. This kind of study, called transcriptional
mapping, is shedding light on a paradox mentioned earlier
in this chapter: The number of protein-encoding genes in
humans is scarcely larger than the number of such genes in
a lowly roundworm! How can we reconcile that fact with
the vastly greater complexity of human beings? One emerging answer is that transcripts of protein-encoding genes
make up only a small fraction of the whole human transcriptome. And the closer we look at this problem, the more
complex the human transcriptome becomes.
If we consider only exons in protein-coding genes, we
would predict that only 1–2% of the whole human genome
would be expressed in RNAs found in the cytoplasm of
cells. However, as early as 2002, Thomas Gingeras and colleagues, using microarrays to study expression of human
chromosomes 21 and 22, discovered that polyadenylated
RNAs in the cytoplasm of human cells covered about an
order of magnitude more of those two chromosomes than
could be accounted for by protein-encoding exons. This
excess of unexpected transcripts has been dubbed transcripts of unknown function, or TUFs. All of the transcribed regions (exons and TUFs alike) detected by such
arrays are called transcribed fragments, or transfrags.
Furthermore, approximately two-thirds of the transcripts in human cells and hamster cells have been reported
to be nonpolyadenylated [poly(A)−]. These poly(A)− transcripts therefore represent another chunk of the human
genome, whose extent is unknown, but apparently large.
Taken together, these findings suggest that protein-encoding
exons make up only a small fraction of the total genomic
sequences represented by cytoplasmic RNAs.
To investigate this intriguing conclusion further, Gingeras
and colleagues used high-density oligonucleotide arrays
with 25-mers spaced on average only 5 bp apart, thus providing an average of a 20-bp overlap. Why use such a high
density? For one thing, it allows one to detect shorter
exons, and, for another, hybridizations to overlapping oligonucleotides give greater confidence that transcription in
that region really occurs. The oligonucleotides on the arrays
came from the sequences of ten human chromosomes (6, 7,
13, 14, 19, 20, 21, 22, X, and Y), representing 30% of the
total length of the human genome. To the arrays, Gingeras
and colleagues hybridized double-stranded cDNAs representing cytoplasmic poly(A)+ RNAs from eight different
human cell lines, or cytoplasmic and nuclear poly(A)+ and
poly(A)− RNAs from a single cell line (HepG2). In all cases,
transfrags that overlapped pseudogenes or repetitive DNA
regions were dropped from consideration.
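The geometry of such a tiling array can be illustrated with a short sketch. It assumes perfectly regular 5-bp spacing, whereas the text gives 5 bp only as an average; the function names and the 100-bp test region are hypothetical.

```python
PROBE_LENGTH = 25  # 25-mers, as in the text
SPACING = 5        # average distance between probe start sites, in bp


def tiling_probes(region_length, probe_length=PROBE_LENGTH, spacing=SPACING):
    """Return (start, end) coordinates (0-based, half-open) of probes tiling a region."""
    last_start = region_length - probe_length
    return [(s, s + probe_length) for s in range(0, last_start + 1, spacing)]


def probes_covering(position, probes):
    """Count the probes that include a given base position."""
    return sum(start <= position < end for start, end in probes)


probes = tiling_probes(100)                # hypothetical 100-bp region
overlap = PROBE_LENGTH - SPACING           # bases shared by adjacent probes
print(len(probes), overlap, probes_covering(50, probes))  # 16 20 5
```

Away from the edges of the region, every base is interrogated by five different probes, which is why a signal from several overlapping oligonucleotides gives more confidence than a signal from one.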
About 9% of more than 74 million probe pairs (both
strands) hybridized to cDNAs from poly(A)+ RNA, per
cell line. Applying a “1 of 8” rule, in which a probe pair
needs to hybridize to a cDNA from only one of the eight
cell lines, the percentage of positive probes rose to 16.5%.
This is the “1 of 8 map.” An average of 4.9% of the nucleotides in the 10 chromosomes were expressed as cytoplasmic RNA in each cell line. In the 1 of 8 map, this figure
rose to 10.1%. These findings suggest that about 10.1% of
the sequences in the 10 human chromosomes are expressed
as polyadenylated RNA in the cytoplasm in at least one
cell line. Furthermore, the difference between 4.9% and
10.1% indicates that considerable cell-line-specific transcription occurs.
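The difference between the per-cell-line average and the "1 of 8" figure is the difference between averaging set sizes and taking a set union. The sketch below shows this with three hypothetical cell lines and 20 hypothetical probes; none of these values comes from the study, and the function name is ours.

```python
def positive_fractions(positives_by_cell_line, n_probes):
    """Return (mean fraction positive per cell line, fraction positive in any cell line)."""
    per_line = [len(p) / n_probes for p in positives_by_cell_line.values()]
    mean_per_line = sum(per_line) / len(per_line)
    # "1 of N" rule: a probe counts if it is positive in at least one cell line.
    union = set().union(*positives_by_cell_line.values())
    return mean_per_line, len(union) / n_probes


# Hypothetical data: indices of positive probes in each of three cell lines.
positives = {
    "line_1": {1, 2, 3},
    "line_2": {2, 3, 4, 5},
    "line_3": {3, 9},
}
mean_fraction, any_fraction = positive_fractions(positives, n_probes=20)
print(round(mean_fraction, 2), round(any_fraction, 2))  # 0.15 0.3
```

The more the union exceeds the per-line average, the less the cell lines share their positive probes, which is the reasoning behind the conclusion about cell-line-specific transcription.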
Figure 25.7 shows the proportions of each of the 10
chromosomes from which cytoplasmic polyadenylated
transcripts are made. Such transcripts from intergenic regions and introns are, by definition, unannotated. And
these regions make up the majority (57%) of the transcripts
from the 10 chromosomes as a whole (central pie chart).
The annotated transcripts overlap with one of three annotations: Known, which is a combination of two exon databases; mRNA, which contains the mRNAs from a third
database that do not overlap with the Known exons; and
EST, which contains all publicly available ESTs that do not
overlap with either the Known or mRNA databases.
What about poly(A)− transcripts? For this analysis,
Gingeras and colleagues focused on a single cell line,
HepG2. They looked for stable poly(A)+, poly(A)−, and
bimorphic transcripts in both the nucleus and cytoplasm of
these cells. (Bimorphic transcripts start out polyadenylated,
but then lose their poly(A) tail.) They found that fully
15.4% of nucleotides in the 10 chromosomes are represented
in one of these classes of transcripts (almost half of which
are poly(A)−). Thus, about 10 times as much of the genome
is represented in stable transcripts as we would expect
on the basis of exons alone. Of course, the majority of most
human genes is in introns, so this result may not sound
surprising at first. But if spliced-out introns have no function, we would expect them to be degraded rapidly and not
contribute so heavily to the cDNAs made from presumably
stable nuclear RNAs.
Figure 25.7 Transcription maps of 10 human chromosomes.
The percentages of different categories of sequences found in
polyadenylated cytoplasmic transcripts in the 1 of 8 map are
represented by the wedges of each pie chart. Each of the chromosomes
represented by the small pie charts is identified in boldface, as is the
collective of all 10 chromosomes (large pie chart in the middle).
Sequence categories are given in the collective pie chart, and the
same color coding is used throughout. The unannotated sequences are
intergenic and intronic. The annotated sequences are designated
Known, mRNAs, and ESTs. In the collective chart the wedges are
Known, 26%; mRNA, 5%; EST, 12%; intronic, 26%; and intergenic, 31%.
(Source: Cheng, J., T.R. Gingeras, et al. 2005.
Transcriptional maps of 10 human chromosomes at 5-nucleotide resolution.
Science 308:1149–54.)
Another conclusion from this study is that about half of
the human transcriptome appears to be overlapping. There
are two kinds of overlaps: those on the same strand, and
those on opposite strands. Of course, transcripts that overlap on opposite strands represent sense/antisense pairs,
which should invoke an RNAi response. Thus, this may
represent a kind of gene expression control mechanism.
Studies like this that show abundant cytoplasmic
poly(A)+ and poly(A)− transcripts of non-exon regions
may help to explain the differences between organisms.
Although the exons of humans and chimpanzees are extremely similar, the non-exon regions have diverged considerably more. And transcription of those regions may give
rise to some of the differences we see in the two species.
SUMMARY High-density whole chromosome transcriptional
mapping studies have shown that the
majority of sequences in cytoplasmic polyadenylated RNAs derive from non-exon regions of 10 human chromosomes. Furthermore, almost half of the
transcription from these same 10 chromosomes is
nonpolyadenylated. Taken together, these results indicate that the great majority of stable nuclear and
cytoplasmic transcripts of these chromosomes
comes from regions outside the exons. This may
help to explain the great differences between species, such as humans and chimpanzees, whose exons
are almost identical.
Genomic Functional Profiling
The ultimate goal of genomic functional profiling is to determine the pattern of expression of all the genes in an organism at all stages of the organism’s life. That is a daunting task
even in the simplest of eukaryotes, but it is even more difficult in complex multicellular organisms. So far, the puzzle
for each organism is being put together piece by piece, with
each research group contributing its own piece. Let us consider some general techniques for attacking the problem.
Deletion Analysis Once all the genes in a genome have
been identified, one can investigate what happens when
each of them is removed. That kind of experiment is ethically
impossible in humans, of course, but it can be done in other
vertebrates as their genomes are completely sequenced—at
least in principle. Logistical problems may delay this kind of
analysis of a genome as large as that of a vertebrate, but the
yeast genome has already been profiled in this way.
In 2002, a large consortium of investigators led by Ronald
Davis reported that they had generated a set of yeast
mutants, in each of which one gene had been replaced with
an antibiotic resistance gene flanked by 20-mer sequences
that were different for each replaced gene. Thus, each gene
replacement has a “molecular barcode” so it can be
uniquely identified. In all, these investigators replaced over
96% of the annotated ORFs in Saccharomyces cerevisiae.
Next, they examined the mutants for ability to grow in a
mixed culture under six different conditions: high salt; sorbitol; galactose; pH 8; minimal medium; and the antifungal
agent nystatin. They also examined gene expression under
each of these conditions by hybridization of RNA to oligonucleotide microarrays.
To do this genomic functional profile, Davis and colleagues grew a mixed culture of all 5916 mutants under
each of the conditions and collected cells at various times
and tested for each barcode by hybridization to an oligonucleotide array containing sequences complementary to
the barcodes. If a gene is important for dealing with a given
condition, such as the presence of galactose, then mutants
lacking that gene should disappear rapidly from the mixture when that condition is imposed. In fact, the rate at
which the mutant disappears should correlate with the importance of the deleted gene in dealing with the condition.
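One way to turn "rate of disappearance" into a number is to fit a line to the logarithm of each barcode's signal over time. The sketch below does this with an ordinary least-squares slope. It is an illustration under our own assumptions, not the scoring method of the original study, which is not described here; the sampling times, the strain labels, and the intensities are all hypothetical.

```python
import math


def depletion_rate(times, signals):
    """Least-squares slope of log2(signal) versus time (more negative = faster loss)."""
    logs = [math.log2(s) for s in signals]
    n = len(times)
    mean_t = sum(times) / n
    mean_y = sum(logs) / n
    numerator = sum((t - mean_t) * (y - mean_y) for t, y in zip(times, logs))
    denominator = sum((t - mean_t) ** 2 for t in times)
    return numerator / denominator


times = [0, 5, 10, 15]  # hypothetical sampling times
barcode_signals = {     # hypothetical array intensities for three barcodes
    "neutral": [1000, 1000, 1000, 1000],
    "mild": [1000, 500, 250, 125],
    "severe": [1000, 125, 15.625, 1.953125],
}

rates = {name: depletion_rate(times, s) for name, s in barcode_signals.items()}
# Sort so that the mutant that disappears fastest comes first.
ranked = sorted(rates, key=rates.get)
print(ranked)  # ['severe', 'mild', 'neutral']
```

In this toy data set, the deleted gene in the "severe" strain would be judged most important for coping with the condition, because its barcode is lost most rapidly from the mixed culture.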
When the investigators applied this kind of profiling to
yeast mutants responding to the presence of galactose, they
found several genes that were already known through years
of study to be involved in yeast metabolism of galactose.
But they also found 10 new genes that had previously not
been implicated in galactose metabolism. Wild-type yeast
and 11 of the mutants identified by the profiling as important in galactose metabolism were tested individually, and
the results are presented in Figure 25.8. As predicted, all 11
mutant strains grew more slowly in galactose than the
wild-type strain did. Their growth rates varied from 44%
to 91% of wild-type.
SUMMARY Genomic functional profiling can be
performed in several ways. In one kind of mutation
analysis, called deletion analysis, mutants are created by replacing genes one at a time with an antibiotic resistance gene flanked by oligomers that serve
as a barcode to identify each mutant. Then, a functional profile can be obtained by growing the whole
group of mutants together under various conditions
to see which mutants disappear most rapidly.
Figure 25.8 Growth curves of various mutants discovered by
profiling to be deficient in response to galactose. Davis and
colleagues tested wild-type yeast cells and 11 deletion mutants
individually for growth in galactose-containing medium. All of the
mutants had been identified by profiling in a mixture of strains as
defective in growth with galactose. A600 (absorbance of 600-nm light)
is a measure of turbidity, which in turn is a measure of yeast growth.
The plot shows growth (A600) against time (h). The legend lists the
strains in this order: WT, gal4, gal3, gal1, yml090w, msn2, gal2,
yml077wΔ, ykl037w, ftr1, fet3, and gef1. It lists the growth rates
relative to wild-type in this order: 100%, 51.3%, 53.4%, 49.0%, 44.2%,
62.5%, 73.5%, 91.0%, 60.1%, 85.6%, 86.9%, and 65.4%.
(Source: Adapted from Giaever, G., A.M. Chu, L. Ni, C. Connelly, L. Riles,
S. Veronneau, et al., Functional profiling of the Saccharomyces cerevisiae genome.
Nature 418, 2002, p. 388, f. 2.)
RNAi Analysis “Knocking out” genes by mutagenesis is
laborious, and has so far been accomplished on a genome-wide scale only in yeast. But some more complex organisms are amenable to a simpler alternative: “knocking
down” genes by RNA interference (RNAi, Chapter 16).
The nematode worm Caenorhabditis elegans is particularly
susceptible to RNAi, which even affects the progeny of
treated worms; it can reproduce parthenogenically, which
means that only one parent is required; it contains fewer
than 1000 cells, and its whole genome has been sequenced.
Thus, this organism is an obvious target for genomic functional profiling by RNAi analysis.
Birte Sönnichsen and colleagues have exploited this
technique to inactivate 19,075 of the worm’s genes, over
98% of the total, and observe the effects on early
embryogenesis—the first two cell divisions after fertilization. They injected 25-bp double-stranded RNAs into
worms and then followed the first two cell divisions in the
progeny of the injected worms by time-lapse microscopy.
They also checked for the viability of the embryos beyond
the two-cell stage and for gross phenotypic alterations
in the larval and adult stages.
In all, inactivation of 1668 genes by RNAi produced
detectable phenotypic defects. Of these 1668, inactivation
of 661 genes gave reproducible defects in the first two cell
divisions; the rest gave defects at later stages of development (Figure 25.9). (It is not surprising that inactivating
virtually all of the 661 genes that gave defects in early embryogenesis also produced embryonic lethality.)
One problem with RNAi is that it sometimes fails to inactivate genes (false-negatives), so negative results are difficult to interpret. As a check on their procedure, Sönnichsen
and colleagues evaluated the 65 genes that had previously
been shown by mutagenesis to affect the first cell division.
Of these genes, 62 (95%) had been detected by the RNAi
analysis. The three genes that had been missed the first time
were rechecked by RNAi analysis, and two were detected
the second time, increasing the success rate to 98%.
It is also true that mutations are detected only if they
give clear phenotypes, so the mutagenesis strategy also produces false-negatives. Thus, as another check on their procedure, the researchers compared their data to other RNAi
analyses that targeted early embryogenesis, and found that
they had detected 75% of the genes that others had found.
Accordingly, Sönnichsen and colleagues concluded conservatively that their RNAi analysis could detect 75–90% of
genes involved in early embryogenesis.
Next, the researchers grouped the 661 genes according
to their specific phenotypes. They found that inactivation
of about half (326) of the genes produced defects in embryogenesis per se, while the remainder (335) simply affected the general cell metabolism required to keep the
embryo alive long enough to divide twice. By careful annotation of the specific defects, the researchers were able to
group the former 326 genes into defects in 23 aspects of
embryogenesis, such as spindle assembly (9 genes) and sister chromatid separation (64 genes).
Figure 25.9 Distribution of phenotypes from a genomic
functional profile of C. elegans using RNAi. (a) Initial screen.
Sönnichsen and colleagues targeted 19,075 genes with dsRNAs. Of
these, 17,426 (“wild-type,” blue) caused no change in phenotype in
the screens the authors used, and 1,668 (“Mutant,” red) showed an
alteration in phenotype. Four hundred sixty-nine genes (“No dsRNA,”
yellow) were not targeted in this experiment. (b) Distribution of
mutant phenotypes. Starting with the 1668 genes whose inactivation
yielded mutant phenotypes, Sönnichsen and colleagues sorted the
developmental stages at which defects were seen. For example, 661
of these (red) exhibited defects in the early embryo stage (first two
cell divisions). The wedges in (a) are Wild-type (17,426), 89%;
Mutant (1668), 9%; and No dsRNA (469), 2%. The wedges in (b) are
Early embryo (661), 40%; Late embryo (605), 36%; Larva (268), 16%;
and Adult (134), 8%. (Source: Adapted from Sönnichsen, et al., Full-genome RNAi
profiling of early embryogenesis in Caenorhabditis elegans. Nature. Vol. 434
(2005) f. 2, p. 465.)
SUMMARY Genomic functional analysis on complex
organisms can be done by inactivating genes
via RNAi. An application of this approach targeting the
genes involved in early embryogenesis in C. elegans
has identified 661 important genes, 326 of which
are involved in embryogenesis per se.
Tissue-Specific Functional Profiling Another approach
to genomic functional profiling is to observe the tissue-specificity of the genes that are inactivated by mutation or
other means. In one notable study, Lee Lim and colleagues
used two miRNAs to knock down expression of genes in
human (HeLa) cells in culture, and then looked at the profile
of genes whose expression was significantly reduced. Remarkably, miR-124, an miRNA expressed in brain, knocked
down expression of genes that are expressed at low levels
in brain, while miR-1, an miRNA expressed in muscle,
knocked down expression of genes that are expressed at
low levels in muscle. In other words, these two miRNAs
shifted the expression of genes in HeLa cells towards that
seen in the tissues in which the respective miRNAs are
prominent. This is exactly what we would expect if these
two miRNAs play a major role in turning down the expression of these same genes in vivo.
A further striking feature of this study is that the miRNAs
reduced the concentrations of the mRNAs in question, even
though, as we learned in Chapter 16, animal miRNAs
generally affect mRNA translation, not mRNA concentrations. Thus, Lim and colleagues introduced double-stranded
miRNAs into HeLa cells and then used microarrays to
measure the levels of mRNAs purified from the treated
cells. The result was a clear reduction in the concentrations
of 100 or more mRNAs with each miRNA.
Figure 25.10 Tissue-specific down-regulation by miRNAs.
(a) Ranking of expression of genes in cerebral cortex. The rankings of all
10,000 genes in each of 46 tissues are plotted as follows: The left-most
bar (rank 1) represents the genes that are expressed at a higher level in
cerebral cortex than in any other tissue; the next bar (rank 2) represents
genes that are expressed at a higher level in cerebral cortex than in any
other tissue except one, and the last bar (rank 46) represents the genes
that are expressed at a lower level in cerebral cortex than in any other
tissue. (b) Ranking of genes whose mRNA levels are significantly
decreased by miR-124. Note the skew toward genes that are poorly
expressed in cerebral cortex compared to the background in panel (a),
which gives a P-value of significance of about 10⁻¹². (c) Plot of the Log10
of P-values derived from plots like that in panel (b) for all 46 tissues. The
only tissues with significant P-values (<0.001) are brain tissues: 5, whole
brain; 6, amygdala; 7, caudate nucleus; 8, cerebellum; 9, cerebral cortex;
10, fetal brain; 11, hippocampus; 12, postcentral gyrus; and 13, thalamus.
(d) Similar to (c), except that the analysis was performed on cells to
which miR-1, instead of miR-124, had been added. Panels (a) and (b) plot
number of genes against cerebral cortex rank; panels (c) and (d) plot
tissues 1–46 against P-value (Log10), with brain tissues, heart, and
skeletal muscle labeled. (Source: Adapted
from Lim et al., Microarray analysis shows that some microRNAs downregulate
large numbers of target mRNAs. Nature. Vol. 433 (2005) f. 1, p. 770.)
Here is how Lim and colleagues did their analysis, considering miR-124 first. They began by plotting the expression levels of 10,000 human genes in each of 46 tissues,
using data from a previous genome-wide survey. The histogram in Figure 25.10a contains the data for gene expression
in cerebral cortex. Each bar represents the number of genes
expressed at a given level in cerebral cortex. The left-most
bar represents the genes that are more highly expressed, and
the right-most bar represents the genes less highly expressed
in this tissue than in any other tissue. The other bars represent genes that are intermediate in expression, from highly
expressed, to poorly expressed in cerebral cortex. All 10,000
genes are represented in this panel, so a random set of genes
should produce something similar, which we can consider
background.
The histogram in Figure 25.10b contains the ranking
of genes whose expression was significantly decreased by
miR-124 in HeLa cells. Instead of a background plot, as in
panel (a), we see a plot that is significantly skewed toward
genes that are naturally poorly expressed in cerebral cortex. Notice the predominance of bars on the right-hand
side of the histogram, which yields a P-value of significance that is much less than 0.001. In fact, it is of the order
of 10⁻¹².
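The ranking that underlies these histograms can be sketched as follows. For each gene we ask how many tissues express it more highly than the tissue of interest; rank 1 means the tissue of interest is the highest. The three tissues, four genes, and expression values below are hypothetical stand-ins for the 46 tissues and 10,000 genes of the study, and the sketch stops at the histogram because the statistical test used to obtain the P-values is not specified here.

```python
from collections import Counter


def tissue_rank(expression_by_tissue, tissue):
    """Rank of `tissue` for one gene: 1 = expressed more highly here than anywhere else."""
    level = expression_by_tissue[tissue]
    # Count the tissues with strictly higher expression, then add one.
    return 1 + sum(other > level for other in expression_by_tissue.values())


# Hypothetical expression levels (arbitrary units) for four genes in three tissues.
genes = {
    "geneA": {"cortex": 9.0, "heart": 2.0, "muscle": 1.0},
    "geneB": {"cortex": 0.5, "heart": 4.0, "muscle": 3.0},
    "geneC": {"cortex": 2.0, "heart": 5.0, "muscle": 1.0},
    "geneD": {"cortex": 0.1, "heart": 0.2, "muscle": 0.3},
}

# Histogram of cortex ranks: how many genes fall at each rank.
histogram = Counter(tissue_rank(expr, "cortex") for expr in genes.values())
print(sorted(histogram.items()))  # [(1, 1), (2, 1), (3, 2)]
```

A background histogram uses all genes, as in panel (a); restricting the same calculation to the genes knocked down by an miRNA gives a histogram like panel (b), and a pile-up at the highest rank numbers is the skew described above.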
Next, Lim and colleagues expanded their analysis of the
effect of miR-124 to all 46 tissues and plotted the Log10 of
P-values (Figure 25.10c). Using a threshold of significance
of a P-value less than 0.001, brain tissues were the only
ones whose P-values were significantly different from background (bars 5–13). In a similar analysis of the effect of
miR-1 (Figure 25.10d), Lim and colleagues found that the
only tissues whose P-values were significantly different
from background were muscle tissues. Thus, the pattern of
depression of HeLa cell gene expression by miR-124
matched the pattern of low gene expression levels only in
brain cells. Similarly, the pattern of depression of HeLa cell
gene expression by miR-1 matched the pattern of low gene
expression levels only in muscle cells.
Note again that these studies used microarrays, which
detect mRNA levels. Thus, it is likely that the miRNAs are
affecting the steady-state levels of particular mRNAs, presumably by destabilizing them. If this is so, we would expect to see evidence of complementarity between the
miRNAs and the destabilized mRNAs, probably in the
3′-UTRs of the mRNAs, where such complementarity has
typically been found.
So Lim and colleagues compared the sequences of the
miRNAs to the sequences of the 3′-UTRs of the mRNAs
whose levels were significantly depressed. They used a
“motif discovery tool” called MEME to do the matching,
and obtained striking results. Fully 88% of the mRNAs
down-regulated by miR-1 had strings of at least six bases,
with the consensus sequence CAUUCC, that are complementary to a string of bases in miR-1. And 76% of the
mRNAs down-regulated by miR-124 had strings of at
least six bases, with the consensus sequence GUGCCU,
that are complementary to a string of bases in miR-124.
This is strong evidence that the miRNAs really do interact
with the 3′-UTRs of their target mRNAs, and presumably
destabilize them.
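The notion of complementarity used here can be made concrete with a short sketch. It computes the reverse complement of each consensus motif, which is the stretch of miRNA the motif would pair with, and then counts how many 3′-UTRs in a small set contain the motif. This is plain substring matching, far simpler than the de novo motif discovery that MEME performs, and the three UTR sequences are hypothetical.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna):
    """Return the RNA sequence that pairs with `rna`, written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))


def fraction_with_motif(utrs, motif):
    """Return (fraction of UTRs containing the motif, names of those UTRs)."""
    hits = [name for name, seq in utrs.items() if motif in seq]
    return len(hits) / len(utrs), hits


# Consensus motifs reported in the text for each miRNA.
motifs = {"miR-1": "CAUUCC", "miR-124": "GUGCCU"}
paired_with = {mirna: reverse_complement(m) for mirna, m in motifs.items()}
print(paired_with)  # {'miR-1': 'GGAAUG', 'miR-124': 'AGGCAC'}

# Hypothetical 3'-UTR fragments, invented for illustration only.
utrs = {
    "utr_1": "AAACAUUCCGG",
    "utr_2": "GGGGGGAAAA",
    "utr_3": "UUCAUUCCAA",
}
fraction, hits = fraction_with_motif(utrs, motifs["miR-1"])
print(hits)  # ['utr_1', 'utr_3']
```

Applied to real data, the same count would be made over the 3′-UTRs of all the down-regulated mRNAs, giving percentages comparable to the 88% and 76% quoted above.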
An attractive hypothesis emerges from these studies:
miRNAs play an important role in cell differentiation by
inhibiting the expression of gene batteries, or sets of functionally related effector genes. For example, miR-124 inhibits the expression of a battery of hundreds of non-neuronal
genes that help to keep a human cell in an undifferentiated
state. Presumably, suppression of these non-neuronal genes
is a key to differentiation of neuronal cells.
Gail Mandel and her colleagues have provided support
for this hypothesis by identifying a protein factor, RE1
silencing transcription factor (REST), that inhibits the expression of a battery of neuron-specific genes, including
miR-124 and a number of other miRNAs. REST inhibits
miR-124 expression in non-neuronal and pre-neuronal
cells. However, during differentiation of neuronal cells,
REST dissociates from the miR-124 gene and allows its
expression. The newly made miR-124 then inhibits the expression of non-neuronal genes, helping the cell develop
into a neuronal cell. Indeed, one of the mRNAs targeted
by miR-124 encodes one of the subunits of REST. Thus,
miR-124 and REST antagonize each other’s expression,
as we might expect of two factors that lead to different
developmental fates.
SUMMARY Tissue-specific expression profiling can
be done by examining the spectrum of mRNAs whose
levels are decreased by an exogenous miRNA, and
comparing that to the spectrum of expression of genes
at the mRNA level in various tissues. If the miRNA
in question causes a decrease in the levels of the
mRNAs that are naturally low in cells in which the
miRNA is expressed, it suggests that the miRNA is at
least part of the cause of those natural low levels.
This kind of analysis has implicated miR-124 in destabilizing mRNAs in brain tissue, and miR-1 in
destabilizing mRNAs in muscle tissue. By inhibiting
the expression of batteries of genes, miRNAs can
influence the differentiation of cells. For example,
miR-124 inhibits the expression of non-neuronal
genes. Thus, expression of miR-124 in a pre-neuronal
cell pushes the cell toward neuronal differentiation.
Locating Target Sites for Transcription Factors As we
learned in Chapter 12, genes are stimulated by activators,
which bind to enhancers. Many activators have many enhancer targets in a genome and therefore activate many
genes. Such a set of genes that tend to be regulated together
is sometimes called a regulon. To understand fully the effects of a given activator, it is important to identify all the
genes that respond to that activator, and several methods
have been developed to accomplish this task.
The most straightforward method is to compare the
microarray hybridization patterns of RNAs from organisms
that do not express, express at a low level, or overexpress
the gene for a given activator. This analysis reveals the
genes that are turned on by high expression of the activator
and has been useful for that purpose. But two problems
limit the utility of this sort of experiment. First, the genes
that are turned on may not be direct targets of the activator, but may be targets of other activators whose genes
were stimulated by the first activator. Second, the genes
that are turned on when the activator is overexpressed may
not be turned on in vivo by physiological levels of the activator. Still, there are ways to get around these problems by
examining directly the interaction of an activator with the
control regions of specific genes.
One such strategy, employed by Richard Young and
colleagues (Ren et al., 2000), melds two different techniques: chromatin immunoprecipitation (ChIP, Chapter 13)
and hybridization to a DNA microarray, or chip. The technique is therefore called ChIP-chip
or, sometimes, ChIP on chip. Figure 25.11 shows the general plan of the method, which Young and colleagues
adapted to identify the binding sites for the activator GAL4
throughout the yeast genome. First, they chemically cross-linked proteins to DNA in chromatin so they could not
separate. Then they broke open the cells and sheared
the chromatin into small segments. Next, they immunoprecipitated the sheared yeast chromatin with an antibody
against GAL4 to precipitate DNA bound to GAL4. Then
they reversed the cross-links between the protein and DNA,
and labeled copies of this DNA with a red fluorescent dye
(Cy5) by PCR. By a parallel procedure, they labeled copies
of DNA that was not immunoprecipitated by the anti-GAL4 antibody with a green fluorescent dye (Cy3). Then
they probed DNA microarrays representing all the intergenic regions of the yeast genome with the two labeled
DNAs. Figure 25.12 shows the results of a small section of
the array. One spot, denoted by the arrow, clearly shows a
preponderance of red fluorescence, suggesting that it hybridized preferentially to the DNA that was associated
with GAL4. Using this technique, Young and colleagues
identified DNA sequences associated with 10 genes, all of
which are known to be activated by GAL4. Thus, the
method worked well in this trial.
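The red/green comparison at each spot can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the spot names and intensities are invented, the median-centering step assumes that most spots are not bound by the protein, and the cutoff of 1.0 (a two-fold enrichment) is an arbitrary choice, not the statistical treatment Young and colleagues applied.

```python
import math
from statistics import median

def log2_ratios(cy5, cy3):
    """Return median-centered log2(Cy5/Cy3) values, one per spot."""
    raw = [math.log2(r / g) for r, g in zip(cy5, cy3)]
    center = median(raw)  # assumption: the typical spot shows no enrichment
    return [x - center for x in raw]

def enriched_spots(names, cy5, cy3, min_log2=1.0):
    """Names of spots whose normalized log2 ratio is at least min_log2."""
    ratios = log2_ratios(cy5, cy3)
    return [n for n, x in zip(names, ratios) if x >= min_log2]

# Hypothetical intensities for five intergenic spots (not real data).
names = ["spot1", "spot2", "spot3", "spot4", "spot5"]
cy5 = [1200.0, 900.0, 8000.0, 1100.0, 500.0]    # red: immunoprecipitated DNA
cy3 = [1000.0, 1000.0, 1000.0, 1000.0, 1000.0]  # green: unenriched DNA
print(enriched_spots(names, cy5, cy3))
```

With these invented numbers only spot3 is reported, which corresponds to the single red spot in a field of yellow and green ones.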
This method is well suited for yeast because of the limited size of the yeast genome and the fact that the yeast
genome has been completely sequenced. But could one perform a similar experiment with the human genome? There
would be a serious problem, because the whole intergenic
fraction of the human genome is almost as large as the genome itself, so a microarray containing all those sequences
would be very complex and difficult to produce. But there
are some ways to narrow the field of DNA sequences to
make the experiment practical. Two of these were reported
in work on the same activator, human E2F4, in 2002.
In their approach to narrowing the field, Peggy Farnham
and coworkers used a microarray containing only CpG
[Figure 25.11 diagram: wild-type and deletion-mutant cells are carried in parallel through (a) cross-linking proteins to DNA, (b) extracting and shearing cross-linked DNA, (c) immunoprecipitation with a specific antibody, (d) reversing cross-links and amplifying and labeling the DNA, and (e) hybridization to a microarray containing all intergenic regions.]
Figure 25.11 Genome-wide search for DNA–protein interactions in
yeast by ChIP-chip analysis. (a) First, proteins are chemically cross-linked to DNA in yeast cells. This is done in wild-type cells and in
reference cells missing the gene encoding the protein of interest (red).
(b) The protein–DNA complexes (cross-linked chromatin) are extracted
from the cells and sheared by sonication. (c) Sheared chromatin is
immunoprecipitated with an antibody directed against the protein of
interest. (d) After precipitation, the cross-links are reversed, and the
precipitated DNA is amplified and labeled by PCR. (e) The labeled DNA
from both kinds of cells is hybridized to a microarray containing DNA
representing all intergenic regions in the yeast genome. The precipitated
DNA from the wild-type cells is labeled with a red fluorescent dye, and
the precipitated DNA from the mutant cells lacking the protein of interest
is labeled with a green fluorescent dye. Thus, if a DNA spot on the
microarray hybridizes to DNA that binds to the protein of interest more
than to other proteins, that spot will fluoresce red. If the DNA hybridizes
to DNA that binds to other proteins preferentially, the spot will fluoresce
green. If it hybridizes to both DNA probes, it will fluoresce yellow. Careful
normalization of the relative intensities of fluorescence of the two DNA
probes allows one to determine the ratio of red and green fluorescence
at each spot and therefore the significance of the preference a given
DNA region has for binding to the protein of interest. (Source: Adapted from
Iyer et al., Nature 409 (2001) Fig. 1, p. 534.)
islands (7776 of them). As we learned in Chapter 24, such
CpG islands are associated with gene control regions and
therefore should be highly enriched in the activator-binding sequences being sought by this technique. Using
that strategy, Farnham and coworkers identified 68 target
[Figure 25.12 image panels: IP-enriched DNA, unenriched DNA, and merged; an arrow marks the binding site.]
Figure 25.12 Identifying a DNA sequence that binds to GAL4.
Young and colleagues prepared a red fluorescent DNA probe by
performing PCR on DNA from chromatin immunoprecipitated by an
anti-GAL4 antibody. Then they prepared a similar, green fluorescent
DNA probe by PCR on DNA that was not immunoprecipitated by the
antibody. Then they hybridized these two probes to a DNA microarray
with DNAs representing all the intergenic regions in the yeast genome.
This is a small section of that array, showing one red spot (arrow)
that indicates a putative GAL4-binding DNA, several green spots,
indicating DNA that does not bind GAL4, and several yellow spots
(binding both red and green probes) that do not show significant
preferential binding of GAL4. (Source: Adapted from Ren et al., Science 290
(2000) Fig. 1A. p. 2306.)
sites for their activator. Instead of CpG islands, David
Dynlacht and colleagues chose the control regions of approximately 1200 genes that were known to be activated as cells
entered the cell cycle (a time when E2F4 is active). From this
panel of DNAs on the microarray, they found that 127 bound
to E2F4 in human fibroblasts. Thus, some foreknowledge of
the timing and selectivity of an activator can be very useful in
designing a microarray to seek out more target genes.
One problem with the ChIP-chip technique for finding
transcription factor binding sites is that it is limited to the
sequences placed on the chip. In order to contain all the
possible sequences in the euchromatic part of the human
genome, such a chip (or chips) would have to contain on the
order of a billion spots, beyond the reach of current technology. Even when chips with tiling arrays (DNAs with overlapping sequences) approach the resolution of just a few
nucleotides, they are predicted to be quite expensive, at least
at first. Another problem is that hybridization efficiency to
spots on a chip is different for different DNAs, so some binding sites will be missed because their hybridization conditions are not met. Also, it is an unfortunate fact of life that
hybridization specificity is not perfect: Sometimes one DNA
will hybridize to more than one spot, or will fail to hybridize
where it should because of DNA secondary structure. Finally,
excellent coverage of the genome by ChIP-chip will be
realized in the near future only for the human genome, for
which high-resolution tiling arrays will be available. Investigators studying other genomes will not have that advantage.
An alternative that solves these problems is a technique
called tag sequencing, in which the amplified pieces of
DNA precipitated in the ChIP procedure are not hybridized to a chip, but repeatedly sequenced using one of the
new high-throughput, next-generation techniques described
in Chapter 5. With 2007 technology, one instrument could
do about 400,000 200-nt reads, or 40 million 25-nt reads
at a time. Barbara Wold and colleagues tested such a
method, which they dubbed ChIPSeq (more commonly
known as ChIP-seq) in 2007. They performed millions of
25-nt reads on DNAs isolated by ChIP with an antibody
specific for a transcription factor called neuron-restrictive
silencing factor (NRSF), which represses neuronal genes in
non-neuronal cells and in neuronal precursor cells. Then
they used a computer program to show where these 25-nt
reads mapped to the human genome. They counted as significant any site where 13 or more reads clustered, and
where this clustering was at least five-fold enriched over a
control in which no antibody was used during the ChIP
procedure. Figure 25.13 depicts a cluster of reads that
defines a binding site for a hypothetical protein.
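The two criteria just described (at least 13 clustered reads, and at least five-fold enrichment over the no-antibody control) can be illustrated with a short Python sketch. It is not the program Wold and colleagues used. The fixed 100-bp windows, the treatment of an empty control window as one read, and the assumption that both samples were sequenced to the same depth are simplifications introduced here, and the read positions are invented.

```python
from collections import Counter

def call_sites(chip_reads, control_reads, window=100, min_reads=13, min_fold=5.0):
    """Return (window_start, chip_count, control_count) for windows passing both cutoffs."""
    chip = Counter(pos // window for pos in chip_reads)
    ctrl = Counter(pos // window for pos in control_reads)
    sites = []
    for w, n in sorted(chip.items()):
        c = ctrl.get(w, 0)
        # Treat an empty control window as one read to avoid dividing by zero.
        fold = n / max(c, 1)
        if n >= min_reads and fold >= min_fold:
            sites.append((w * window, n, c))
    return sites

# Hypothetical mapped read start positions on one chromosome (not real data).
chip_reads = ([1000 + 2 * i for i in range(15)]    # 15 reads clustered near 1000
              + [5000 + 3 * i for i in range(14)]  # 14 reads clustered near 5000
              + [200, 7300, 9100])                 # scattered background reads
control_reads = [1005, 5001, 5002, 5003, 5004]

print(call_sites(chip_reads, control_reads))
```

In this toy example the cluster near position 5000 has enough reads but is only 3.5-fold over the control, so only the cluster near position 1000 is reported.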
NRSF binding sites were attractive subjects because
they had already been carefully studied by other techniques, and a canonical binding site sequence had been
recognized. The ChIPSeq procedure identified almost all of
the canonical binding sites, and found new binding sites as
well. Some of these had canonical half-sites separated by
noncanonical spacers. Others had only one half-site. Thus,
this technique appears to be comprehensive in its ability to
identify binding sites.
Mathieu Blanchette, François Robert, and their colleagues adopted a different approach to finding transcription
[Figure 25.13 plot: reads vs. position on genome for (a) the experimental sample and (b) the control.]
Figure 25.13 Mapping a transcription factor binding site by
ChIPSeq. (a) Short (25-nt) reads of sequence of DNAs precipitated by
ChIP using an antibody specific for a transcription factor are plotted
vs. genome position at one particular place in the genome. Each red
block represents one read. The peak defines the binding site for the
transcription factor. (b) A control is run without an antibody in the ChIP
step, and this shows only background binding.
factor binding sites in the human genome. Instead of searching for binding sites for a single protein, they looked for
clusters of such binding sites (cis regulatory modules [CRMs],
Chapter 12). Whereas each individual transcription factor
binding site can be quite variable in sequence, and thus escape notice, clusters of such sites are relatively easy to find.
Blanchette, Robert, and colleagues took advantage of
the Transfac database, containing binding site sequence information for 229 different transcription factors. They also
realized that CRMs are well conserved, relative to surrounding DNA sequences. Accordingly, they focused on
nonrepetitive, noncoding DNA regions that are conserved
in the human, mouse, and rat genomes and searched in
those regions for transcription factor binding sites from the
Transfac database.
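The idea of looking for clusters of sites rather than single sites can be sketched as follows. This is a toy version only: the three motifs and their labels are hypothetical stand-ins, exact string matching replaces the scoring of real binding site profiles, and the gap and site-count cutoffs are arbitrary. The input sequence is assumed to be a conserved, nonrepetitive, noncoding region that has already been selected.

```python
def find_sites(seq, motifs):
    """Return sorted (position, factor) pairs for exact matches of each motif."""
    hits = []
    for factor, motif in motifs.items():
        start = seq.find(motif)
        while start != -1:
            hits.append((start, factor))
            start = seq.find(motif, start + 1)
    return sorted(hits)

def cluster_sites(hits, max_gap=30, min_sites=3):
    """Return (first_pos, last_pos) for runs of sites separated by at most max_gap."""
    clusters, current = [], []
    for pos, factor in hits:
        if current and pos - current[-1][0] > max_gap:
            if len(current) >= min_sites:
                clusters.append(current)
            current = []
        current.append((pos, factor))
    if len(current) >= min_sites:
        clusters.append(current)
    return [(c[0][0], c[-1][0]) for c in clusters]

# Hypothetical motifs and a made-up sequence (not database entries or real data).
motifs = {"factorA": "TGACTCA", "factorB": "GGGCGG", "factorC": "CACGTG"}
seq = ("A" * 20 + "TGACTCA" + "A" * 5 + "GGGCGG" + "A" * 8 + "CACGTG"
       + "A" * 100 + "GGGCGG" + "A" * 20)

print(cluster_sites(find_sites(seq, motifs)))
```

The isolated site near the end of the sequence is ignored, while the three closely spaced sites are reported as one candidate module.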
This scan, encompassing the 34% of the human genome that can be aligned with both the mouse and rat
genomes, yielded 118,402 predicted CRMs (pCRMs). This
number surely includes some false positives, but it represents only about one-third of the human genome. While
that part of the genome is likeliest to be enriched in CRMs,
we can still conclude that the human genome probably
contains at least two hundred thousand CRMs. That number may seem surprisingly large, but the authors have validated their data in several ways. For example, they found
a strong enrichment of their pCRMs in known promoter
regions (defined as DNA regions within 1 kb upstream of
the transcription start site), particularly promoters within
CpG islands. They also found good correspondence between the pCRMs and DNase hypersensitive regions,
which, as we learned in Chapter 13, tend to contain gene
regulatory elements.
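The arithmetic behind this kind of extrapolation is easy to check. The calculation below is a naive scaling, not the authors' own estimate: it assumes the unscanned two-thirds of the genome has the same pCRM density as the scanned part and that every pCRM is real. Because neither assumption is likely to hold, the result is best read as a rough ceiling, consistent with the more cautious figure given in the text.

```python
predicted_crms = 118_402   # pCRMs found in the scan
fraction_scanned = 0.34    # share of the human genome alignable with mouse and rat

# Naive scaling: assume the same pCRM density across the whole genome.
naive_total = predicted_crms / fraction_scanned
print(round(naive_total))
```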
One somewhat surprising result of this work was the
large number of pCRMs that lie in regions thought to be
devoid of genes. This finding could be explained in several
ways: (1) It may reflect our inability to identify all the
genes in the human genome. (2) It may indicate that some
genes have cryptic transcription start sites that lie far upstream of the canonical start sites. (3) The pCRMs may be
regulating the production of noncoding RNAs. (4) The
pCRMs may be regulating the transcription of genes a
great distance away.
Figure 25.14 depicts the frequency of pCRMs within
and surrounding known genes. As expected, there is a
strong preference for pCRMs in the immediate 5′-flanking
region of a gene, where enhancers are classically found. But
there is also a preponderance of pCRMs in regions where
we would not expect them, beginning with the region just
downstream of the transcription start site. This could
reflect alternative, downstream transcription start sites, or
it could be the first indication of widespread regulatory
elements within genes. A second surprise in Figure 25.14 is
the abundance of pCRMs in the region surrounding the
transcription termination site. Again, this has at least two
possible explanations. It could indicate a large class of
enhancers just downstream of the genes they control, or it
could represent antisense transcripts that could play a negative role in gene expression. There is a poverty of pCRMs
in the regions 10–50 kb upstream and 10–30 kb downstream of genes, and at the edges of introns (except the first
and last ones). Some of this may be only apparent. For example, there could be a selection in these regions for
pCRMs with few enough factor binding sites that they escaped notice in this study.
SUMMARY ChIP-chip analysis can be used to identify DNA-binding sites for activators and other proteins. In organisms with small genomes, such as
yeast, all of the intergenic regions can be included in
the microarray. But with large genomes, such as the
human genome, that is now impractical. To narrow
the field, CpG islands can be used, since they are associated with gene control regions. Also, if the timing or conditions of an activator’s activity are
known, the control regions of genes known to be
activated at those times, or under those conditions,
can be used.
Tag sequencing, or ChIP-seq, in which the chromatin pieces precipitated by ChIP are repeatedly sequenced, can also be used to identify transcription
factor binding sites. Knowledge of the sequences of
multiple mammalian genomes also allows one to
narrow the search for human transcription factor
binding sites by beginning with conserved regions of
the genome. In addition, it is easier to search for
CRMs, which contain several transcription factor
binding sites. Such searches predict more than 100,000 CRMs
in the human genome. They tend to cluster in the
regions surrounding the transcription start and termination sites, but a surprising number are found in
gene deserts far from any known genes.
Locating Enhancers that Bind Unknown Proteins The
“gene-centric” strategy we have just studied is applicable
only to enhancers that bind known proteins. But there are
still many enhancers whose protein partners are unknown.
In order to identify such enhancers, Len Pennacchio and
colleagues reasoned that they needed a genomic approach,
and they described a very effective one in 2006. They
started their search for vertebrate enhancers by looking for
highly conserved noncoding DNA regions. These DNA
regions could meet their definition of “highly conserved” in
two ways: They were either conserved in distantly related
species (say, human and pufferfish), or 100% conserved
over at least 200 base pairs in more closely related species
(e.g., human and mouse).
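The second criterion (100% identity over at least 200 base pairs) is simple to express in code. The sketch below assumes the two sequences have already been aligned, with gaps written as "-", and it does not check that the region is noncoding. The function name and the test sequences are invented for illustration and are not taken from Pennacchio and colleagues' work.

```python
def perfect_runs(aligned_a, aligned_b, min_len=200):
    """Return (start, end) column ranges (end exclusive) of identical, gap-free runs."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    runs, start = [], None
    for i, (a, b) in enumerate(zip(aligned_a, aligned_b)):
        same = a == b and a != "-"
        if same and start is None:
            start = i                      # a new identical run begins
        elif not same and start is not None:
            if i - start >= min_len:       # keep the run only if it is long enough
                runs.append((start, i))
            start = None
    if start is not None and len(aligned_a) - start >= min_len:
        runs.append((start, len(aligned_a)))
    return runs

# Made-up aligned sequences: 240 identical columns, one mismatch, then 120 identical.
human = "ACGT" * 60 + "A" + "ACGT" * 30
mouse = "ACGT" * 60 + "G" + "ACGT" * 30
print(perfect_runs(human, mouse))
```

Only the 240-column run qualifies; the 120-column run after the mismatch falls below the 200-bp threshold.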
Pennacchio and colleagues found 167 such enhancer
candidates. To test these DNA sequences for enhancer
[Figure 25.14 plots, panels (a) and (b): fraction of bases included in a pCRM (vertical axis) vs. position relative to the gene (horizontal axis: flanking regions, TSS, 5′-UTR, first intron, middle introns, last intron, 3′-UTR), with scale bars of 10, 20, 50, and 100 kb.]
Figure 25.14 Distribution of pCRMs within and surrounding genes.
(a) The fraction of bases included in a pCRM is plotted vs. position
within or outside a gene. Colors in the graph, and in the gene diagram
below, represent various gene regions as follows: Dark blue, upstream
and downstream flanking regions; red, 5′-UTR; yellow, first intron;
light blue, middle introns; brown, last intron; aqua, 3′-UTR. (The fraction
of bases in a pCRM is off scale for the 3′-UTR, so no aqua line is
visible.) (b) Same as in (a), except that the horizontal scale has been
lengthened to show the individual regions more clearly.
Figure 25.15 Expression patterns driven by enhancers discovered
by transgenic mouse enhancer assay. The expression patterns are
pictured in typical X-gal-stained mouse 11.5-day embryonic whole
mounts, below the bar graph. The number of DNA elements giving rise
to each expression pattern is shown. Some enhancers produced more
than one expression pattern, which explains why the number of
elements is higher than the total number (75) of enhancers tested.
activity, they hooked them up to lacZ reporter genes under
the control of a mouse minimal promoter. Then they placed
these constructs into mouse zygotes, creating transgenic
mice. They allowed the transgenic embryos to grow to embryonic day 11.5, then stained whole embryo mounts with
X-gal to detect β-galactosidase. Strong blue staining with
X-gal indicates abundant β-galactosidase, and therefore
strong transcription stimulated by proteins binding to an
enhancer. Pennacchio and colleagues chose day 11.5 embryos for several reasons: First, they can be stained and
visualized as whole embryo mounts. Furthermore, major
organ systems are visible by this stage. Finally, highly conserved enhancers are known to be clustered near genes that
are expressed during embryonic life.
Of the 167 enhancer candidates tested in this way, Pennacchio and colleagues found that 75 (45%) were positive
in this transgenic mouse enhancer assay. Figure 25.15
shows the number of enhancers that operated in each of
several different tissues, and the pattern of staining that
demonstrates each of the tissue-specificities. The numbers
add up to more than 75, because many of the enhancers
are active in more than one tissue. It is striking that nervous tissue is by far the most common locus of enhancer
activity in this experiment, but that is not surprising, considering that a large percentage of vertebrate genes are
expressed in nervous tissue, and that the development of
the nervous system is complex and requires the function of
many genes.
Thus, this strategy has a remarkably high success rate:
45%, achieved by sampling only one stage of embryonic
development. One expects that many of the sequences that
gave negative results in this experiment would be positive
if other stages of life were sampled. Also, it is already
known that some of the negative sequences are in fact
silencers, so they are also interesting gene control elements.
Pennacchio and colleagues reported that there are 5500
more noncoding sequences in the human genome that are
conserved between humans and pufferfish, and are thus
good candidates for additional enhancers. This strategy
therefore shows great promise for locating enhancers in the
human and in other genomes.
As successful as this method may be for locating gene
control regions, it suffers from the drawback that it only
detects highly conserved sequences. And there is reason to
believe that not all important gene control regions are conserved. We have already seen examples of poorly conserved
control regions in different species of yeast earlier in this
chapter, and the same phenomenon is also found in vertebrates. In 2008, Duncan Odom and colleagues reported
their studies on gene expression in mouse cells carrying a
copy of human chromosome 21. They found that the levels
of transcription of human chromosome 21 genes in mouse
(Source: Reprinted by permission from Macmillan Publishers Ltd: Nature, 444,
499–502, 23 November 2006. Pennacchio et al., In vivo enhancer analysis of human
conserved non-coding sequences. © 2006.)
cells more closely resemble their transcription levels in
human cells than they do the levels of transcription of homologous
mouse genes in mouse cells. This implies that mouse transcription factors recognize human gene control regions and
homologous mouse gene control regions differently.
Indeed, Odom and colleagues also showed by ChIP analysis
that mouse transcription factors bind to human chromosome 21 in a more human-like than mouse-like pattern.
The most likely reason for these differences is a difference
in sequence between the human and mouse gene control
regions. Thus, one probably misses important gene control
regions if one focuses only on highly conserved sequences,
even between closely related species.
SUMMARY To find enhancers whose protein partners are unknown, one can look for noncoding
sequences that are highly conserved between distantly related species, or absolutely conserved between closely related species. These putative
enhancers can then be verified by linking them to
a reporter gene, such as lacZ, and looking for reporter gene activity in embryos, in which many
genes are active. In the case of the lacZ reporter
gene, one looks for blue tissue in the presence of
the indicator X-gal. One limitation to this kind of
study is that some important gene control regions
are not well conserved, even between closely related species.
Locating Promoters In principle, class II promoters
should be easier to locate than enhancers, as they lie at or
very near the transcription start sites of genes, which are
usually known. Nevertheless, when Bing Ren and colleagues performed a genome-wide search for human promoters, they got a surprise: Many genes have alternative
promoters that are located hundreds of base pairs away
from the primary ones.
Ren and colleagues searched for promoters in human
fibroblasts using a ChIP-chip strategy. As mentioned earlier
in this chapter, the ChIP-chip technique seeks to identify
regions in the genome that bind a particular protein. Ren
and colleagues performed ChIP using a monoclonal antibody against the TAF1 subunit of TFIID, reasoning that
preinitiation complexes forming at promoters should contain this key general transcription factor. Then they amplified the DNA precipitated by ChIP and used it to probe
DNA microarrays containing about 14.5 million 50-mers
representing all the nonrepetitive DNA in the human genome. Figure 25.16 summarizes the method and presents
some of their findings.
They found 12,150 TFIID-binding sites, of which
10,553 (87%) mapped within 2.5 kb of a known transcription start site. They had to use the fairly large window of
2.5 kb to allow for uncertainties in the mapping of transcript 5′-ends and uncertainties in the ChIP-chip mapping
of TFIID-binding sites due to noise in the microarray data.
Some TFIID-binding sites mapped to the same transcript
5′-ends; by eliminating these redundancies, Ren and colleagues settled on 9328 binding sites that mapped to unique
transcripts.
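The matching and redundancy-removal steps can be sketched in Python. The 2.5-kb window comes from the text; everything else is an assumption made for illustration: coordinates are on a single chromosome, each site is assigned to its nearest annotated 5′-end, and when several sites match one transcript the closest is kept. The transcript names and positions are invented, and this is not the procedure Ren and colleagues actually coded.

```python
def match_sites_to_tss(sites, tss_by_transcript, window=2500):
    """Map each transcript to the closest binding site within `window` bp of its 5'-end."""
    best = {}
    for site in sites:
        # Find the annotated 5'-end nearest to this binding site.
        transcript, tss = min(tss_by_transcript.items(),
                              key=lambda kv: abs(kv[1] - site))
        dist = abs(tss - site)
        if dist > window:
            continue  # no known start site close enough
        # Keep only the closest site per transcript (removes redundancy).
        if transcript not in best or dist < abs(tss - best[transcript]):
            best[transcript] = site
    return best

# Hypothetical coordinates on one chromosome (not real data).
tss_by_transcript = {"txA": 10_000, "txB": 50_000}
sites = [9_200, 10_900, 48_000, 80_000]

print(match_sites_to_tss(sites, tss_by_transcript))
```

Here three of the four sites fall within 2.5 kb of a start site, but two of them map to the same transcript, so two unique matches remain.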
They subjected these 9328 binding sites to four tests for
promoter-like character. First, they performed ChIP-chip
analysis with an anti-RNA polymerase II antibody and
found that 97% of the TFIID-binding sites also bound
polymerase II. Second, they selected 28 of these sites at
random and performed standard ChIP analysis with an
anti-RNA polymerase antibody to verify polymerase II
binding. All but one site passed this test. Third, they
searched for CpG islands and Inr, DPE, and TATA box core
promoter elements in the 9328 TFIID-binding sites. They
found enrichment for the first three but not for TATA boxes
(Figure 25.16c). Fourth, they used ChIP-chip analysis to
look for histone modifications (acetylated histone H3 and
dimethylated lysine 4 on histone H3) that are associated
with gene activity. Again, 97% of the TFIID-binding sites
were associated with these modifications. In summary, the
ChIP-chip method appears to have selected promoters very
accurately, and most of these promoters lack TATA boxes,
in accord with other data showing a paucity of TATA boxes
in yeast and Drosophila.
Ren and colleagues discovered that over 1600 of the
genes they identified had multiple promoters. In most cases,
these promoters gave rise to transcripts that differed only
in the lengths of their 5′-UTRs, or in having a distinct first
exon, but did not affect the protein products of the genes.
In other cases, they gave rise to transcripts that were
spliced, polyadenylated, or translated differently. These latter cases could provide another layer of control over gene
expression, if cells can select which promoter to use at a
given time.
SUMMARY Class II promoters can be identified using
ChIP-chip analysis with an anti-TAF1 antibody. In
one such study with human fibroblasts, over 9000 promoters were identified, and over 1600 genes had
multiple promoters.
In Situ Expression Analysis Consider the following
opportunity: As is well known, human chromosome 21
is involved in Down syndrome. To discover which
gene(s) on this chromosome are responsible for the disorder, it would be useful to know the pattern of expression during embryonic life of all the genes on this
chromosome.
Such studies are routinely done in lower organisms,
typically by performing in situ hybridization (Chapter 5)
[Figure 25.16 panels: (a) TFIID ChIP signal (log2 R) across region ENr231 (chr1:148,374,643-148,874,642), with the RefGene annotation and two replicate TFIID ChIP profiles with peaks at the TCFL1 gene; (b) histogram of counts (n = 9328) vs. position of the TFIID-binding site relative to the matched 5′-end (kb); (c) percentage of occurrence of the promoter elements CpG, Inr, TATA, and DPE in the control, IMR90, and DBTSS sets.]
Figure 25.16 Finding promoters. Ren and colleagues performed
ChIP-chip analysis using an anti-TAF1 antibody to identify TFIID-binding sites in human fibroblasts. (a) Representative results from a
relatively small region of human chromosome 1. The top panel
presents the logarithmic ratio (log2 R) of hybridization of DNA
precipitated by TAF1-ChIP to hybridization of a control DNA. Peaks
show putative TFIID-binding sites. The middle panel shows a gene
annotation of this DNA region from the RefSeq database. Note that
the peaks in the top panel generally align with the 5′-ends of the
annotated genes. The bottom panel presents a blow-up of two
replicate ChIP analyses of the TCFL1 gene. Arrows show the peak of
hybridization, determined by a peak-finding algorithm, and the
position of the gene is given below, with the 5′-end on the right.
(b) Alignment of TFIID-binding sites with 5′-ends of genes. The bulk
of the binding sites (83%) fall within 500 bp of the 5′-ends of genes.
(c) Association of CpG islands and three core promoter elements with
promoters. Red, TFIID-binding sites identified in this study; blue,
promoters from the DBTSS database; yellow, control DNA.
with cDNA probes in embryonic sections. But that presents a serious problem: Such studies are ethically problematic when performed on human embryos. Fortunately,
now that we have the sequence of the mouse genome,
there is a way around this problem. The mouse genome
harbors orthologs for 161 of the 178 confirmed genes on
human chromosome 21. So the expression of these genes
can be followed through time and space during development of mouse embryos, and we can assume a similar pattern of expression applies to the homologous genes in the
human embryo.
Two research groups applied this strategy to the mouse
orthologs of the genes on human chromosome 21. In one,
Gregor Eichele, Stylianos Antonarakis, and Andrea Ballabio and their colleagues looked at expression of 158 of the
mouse orthologs at three times during gestation by in situ
hybridization. They also checked the expression patterns of
all 161 orthologs in adults by RT-PCR (Chapter 5). They
found patterned expression (expression confined to specific
sites at specific times) of several genes. Moreover, some of
this patterned expression was in sites (central nervous system, heart, gastrointestinal tract, and limbs) that are consistent with the pathology of Down syndrome.
For example, Figure 25.17 shows the expression of the
Pcp4 gene in day 10.5 mouse embryos (by in situ hybridization to whole mount sections) and in day 14.5 embryos
(by in situ hybridization to embryonic sections). At day
10.5, the gene is expressed in the eye (black arrow), brain,
and dorsal root ganglia (white arrow). At day 14.5, the
gene is expressed in many tissues, including the cortical
plate (red arrow) in the brain, the midbrain, cerebellum,
spinal cord, intestine, heart, and dorsal root ganglia. All of
Figure 25.17 Expression of two genes in mouse embryos. Gene
expression was assayed by in situ hybridization (Chapter 5), using
either a whole mount embryo (panel a), or a sectioned embryo (panel
b). (a) Expression of Pcp4 in a whole mount of a day 10.5 embryo. The
black arrow indicates the eye, and the white arrow indicates a dorsal
root ganglion. (b) Expression of Pcp4 in a section of a day 14.5 embryo.
The red arrow indicates the cortical plate of the brain. Dark staining
denotes expression of the gene. (Source: Adapted from Nature 420: from
Reymond et al., fig. 2, p. 583, 2002.)
these are areas affected by Down syndrome, so the Pcp4
gene is a candidate for one of the genes involved in the
disorder.
Another example combines work from Eichele and colleagues and another group headed by Ariel Ruiz i Altaba,
Bernhard Herrmann, and Marie-Laure Yaspo on the
expression of the mouse SH3BGR gene in days 9.5, 10.5,
and 14.5 of gestation. These studies show that this gene is
prominently expressed in the heart at all three stages of
development. Because the heart is one of the organs affected by Down syndrome, the SH3BGR gene is another
candidate for involvement in the disorder.
SUMMARY The mouse can be used as a human surrogate in large-scale expression studies that would
be impermissible to perform on humans. For example, scientists have studied the expression of almost
all the mouse orthologs of the genes on human
chromosome 21. They have followed the expression
of these genes through various stages of embryonic
development and have catalogued the embryonic
tissues in which the genes are expressed.
Single-Nucleotide Polymorphisms:
Pharmacogenomics
Now that we have a finished draft of the human genome
sequence, we can look for differences among individuals.
So far, most of these are differences in single nucleotides,
and we classify them as single-nucleotide polymorphisms, or
SNPs (pronounced “snips”) if the minor variant is present
in at least 1% of the population. The human genome
contains at least 10 million such SNPs, and on average,
any two unrelated people differ in millions of SNPs. If we
can link these SNPs to human diseases governed by defects
in single genes, we could then screen individuals for the
tendency to develop those diseases simply by screening for
SNPs. We might also be able to find sets of SNPs that
associate with polygenic traits, such as susceptibility to
such disorders as cardiovascular disease and cancer and
thus pin down the genes responsible for these traits.
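The 1% criterion can be made concrete with a small calculation of minor allele frequency. The sketch below assumes a biallelic site and diploid genotypes written as two-letter strings. The function names and the genotype counts are invented for illustration.

```python
from collections import Counter

def minor_allele_frequency(genotypes):
    """Frequency of the second most common allele among diploid genotypes such as 'AG'."""
    counts = Counter(allele for g in genotypes for allele in g)
    if len(counts) < 2:
        return 0.0  # only one allele observed, so there is no minor variant
    total = sum(counts.values())
    return counts.most_common(2)[1][1] / total

def is_snp(genotypes, threshold=0.01):
    """Apply the 1% rule: the minor variant must reach the threshold frequency."""
    return minor_allele_frequency(genotypes) >= threshold

# Hypothetical genotype counts at two sites (not real data).
common_site = ["AA"] * 195 + ["AG"] * 4 + ["GG"] * 1   # 6 G alleles out of 400
rare_site = ["CC"] * 999 + ["CT"] * 1                  # 1 T allele out of 2000

print(minor_allele_frequency(common_site), is_snp(common_site))
print(minor_allele_frequency(rare_site), is_snp(rare_site))
```

The first site has a minor allele frequency of 1.5% and qualifies as a SNP under this definition; the second, at 0.05%, would be classed as a rare variant instead.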
We may also be able to identify SNPs that correlate
with good or poor response to certain drugs. Using this
information, physicians should be able to screen a patient
for key SNPs, then custom design a drug treatment program for that patient based on his or her predicted responses to a range of drugs. This field of study is called
pharmacogenomics.
However, these tasks will not be easy. Already, geneticists are discovering that the vast majority of SNPs are not
in genes at all, but in intergenic regions of DNA. Most of
these do not affect gene function, but a few will if they are
located in gene control regions. Even when they are found
within genes, they tend to be silent mutations that do not
alter the structure of the protein product, and thus do not
usually cause any malfunction that could lead to a disease.
(For an exception, see Chapter 18.) The reason for the preponderance of silent SNPs is clear: Polymorphisms caused
by mutations that change the products of genes are generally deleterious, and are therefore selected against. That is,
the individuals with these damaging mutations generally
die before they can reproduce and thus the mutations are
lost. Finally, if history is a guide, even knowing which SNPs
correlate with diseases may not be of immediate benefit. It
will take time to figure out how to use this information.
One can detect SNPs correlating with disease or other
traits in any given individual by a variety of genotyping
techniques. One of these is to hybridize a primer adjacent to
a SNP and then perform primer extension with fluorescent
nucleotides and observe which nucleotide is incorporated
in the SNP position. Another is to hybridize a person’s
DNA to DNA microarrays containing oligonucleotides
with the wild-type and mutated sequences. Still another
is sequencing: either shotgun sequencing, or amplifying a
region surrounding a SNP by PCR and then sequencing it.
Such knowledge can be useful in helping to prevent or treat
disease.
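The primer-extension approach can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the sequences, the primer, and the function names below are hypothetical, the template is written 5' to 3', and the primer is assumed to anneal perfectly at exactly one place, immediately 3' of the SNP on the template strand.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))

def extended_base(template: str, primer: str) -> str:
    """Return the first nucleotide added to the 3' end of the annealed primer.

    The primer pairs with the template region equal to its reverse
    complement; the polymerase then copies the template base lying just
    5' of that region, which is the SNP position in this toy setup.
    """
    site = template.find(reverse_complement(primer))
    if site < 1:
        raise ValueError("primer does not anneal next to a template base")
    return COMPLEMENT[template[site - 1]]

# Hypothetical alleles differing at a single position (index 5).
allele_1 = "GGATCATTGACCGTAC"
allele_2 = "GGATCGTTGACCGTAC"
primer = "GTACGGTCAA"  # hypothetical primer; anneals to TTGACCGTAC

# A heterozygote would show both incorporated nucleotides.
heterozygote_signal = {extended_base(allele_1, primer),
                       extended_base(allele_2, primer)}
```

In a real assay each of the four nucleotides would carry a different fluorescent label, so the color observed reports the base at the SNP position.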
How do SNPs differ from RFLPs? An RFLP is also a SNP
when the single-nucleotide difference between two individuals lies in a restriction site, as we observed in Chapter 24 in the RFLPs involving HindIII sites in Huntington
disease patients. In such a case, a single-nucleotide difference makes a difference in the pattern of restriction fragments. However, RFLPs can also result from insertion of a
chunk of DNA between two restriction sites in one
individual, but not another—VNTRs, for example. That
would not be a SNP because it involves more than just a
single-nucleotide difference.
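The distinction can be made concrete with a toy digestion. The sketch below uses the HindIII recognition sequence (AAGCTT, cut after the first A), but the three alleles are invented for illustration, only one strand of a linear molecule is searched (sufficient here because the site is palindromic), and the function name is hypothetical.

```python
HINDIII_SITE = "AAGCTT"  # HindIII recognition sequence; cut falls after the first A

def fragment_lengths(dna: str, site: str = HINDIII_SITE, cut_offset: int = 1) -> list[int]:
    """Return fragment lengths after complete digestion of a linear DNA."""
    cuts = []
    start = dna.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = dna.find(site, start + 1)
    boundaries = [0] + cuts + [len(dna)]
    return [right - left for left, right in zip(boundaries, boundaries[1:])]

LEFT, MIDDLE, RIGHT = "CCGGTACCGT", "GGTCACCGTC", "TTGGCACCGG"  # made-up flanks

# Reference allele: two intact HindIII sites.
reference = LEFT + "AAGCTT" + MIDDLE + "AAGCTT" + RIGHT
# SNP allele: one base change (T to C) destroys the second site.
snp_allele = LEFT + "AAGCTT" + MIDDLE + "AAGCTC" + RIGHT
# Insertion allele: extra repeats between the sites; no site is altered.
insertion_allele = LEFT + "AAGCTT" + MIDDLE + "GACT" * 3 + "AAGCTT" + RIGHT

for name, allele in [("reference", reference),
                     ("SNP in site", snp_allele),
                     ("insertion", insertion_allele)]:
    print(name, fragment_lengths(allele))
```

Both variant alleles give a fragment pattern different from the reference, so both are RFLPs, but only the second allele differs from the reference by a single nucleotide.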
For those who are enthusiastic about the potential of
SNPs to help identify the causes of common diseases, 2005
was a banner year. The International HapMap Consortium
published a haplotype map including over 1 million human
SNPs, discovered by genotyping 269 DNA samples from
four distinct human populations (one in Nigeria; one in
Utah, USA; one in China; and one in Japan). A haplotype
map shows the locations of haplotypes, blocks of DNA
that tend to be inherited intact, because of the low rate of
recombination within the block. We have already seen in
our discussion of the human genome that the rate of recombination varies considerably from spot to spot, and
regions of high recombination rate alternate with regions
in which recombination is rare. The latter regions are likely
to contain genetic markers that are inherited together and
therefore make up a haplotype.
By focusing on certain well-chosen SNPs (tag SNPs), the
International HapMap Consortium was able to infer the alleles at
other SNPs in the same region, thus cutting down on the
total amount of genotyping they had to do. They did this
genotyping largely by hybridizing labeled human DNA
fragments to DNA microarrays designed to detect tag SNPs.
This procedure is highly automated, allowing one worker to
scan 500,000 SNPs covering the whole genome in only
two days.
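Why a tag SNP can stand in for its neighbors can be illustrated with the squared correlation (r squared) commonly used to measure linkage disequilibrium between two SNPs. The text does not say which statistic the Consortium used, so treat this as an assumption; the haplotype samples and the function name below are likewise hypothetical.

```python
def r_squared(haplotypes, tag_allele, other_allele):
    """Squared correlation between alleles at a tag SNP and a second SNP.

    `haplotypes` is a list of (tag_base, other_base) pairs, one pair per
    sampled chromosome.
    """
    n = len(haplotypes)
    p_tag = sum(tag == tag_allele for tag, _ in haplotypes) / n
    p_other = sum(other == other_allele for _, other in haplotypes) / n
    p_both = sum((tag, other) == (tag_allele, other_allele)
                 for tag, other in haplotypes) / n
    denominator = p_tag * (1 - p_tag) * p_other * (1 - p_other)
    if denominator == 0:
        raise ValueError("both SNPs must be polymorphic in the sample")
    d = p_both - p_tag * p_other  # departure from independent assortment
    return d * d / denominator

# Hypothetical low-recombination block: A at the tag SNP always travels with G.
within_block = [("A", "G")] * 6 + [("C", "T")] * 4
# Hypothetical pair separated by frequent recombination: alleles are independent.
across_hotspot = ([("A", "G")] * 3 + [("A", "T")] * 3 +
                  [("C", "G")] * 2 + [("C", "T")] * 2)

print(r_squared(within_block, "A", "G"))
print(r_squared(across_hotspot, "A", "G"))
```

When r squared is close to 1, genotyping the tag SNP alone predicts the allele at the second SNP; when it is close to 0, the second SNP must be genotyped separately.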
One immediate payoff of the project was the identification of millions of new SNPs (only 1.7 million were
known at the beginning of the project). Another was new
insight into recombination and natural selection in human evolution.
But the potential payoff that attracts the most attention
is the identification of genes that are involved in human
diseases. This process was straightforward in the case of
HD and other diseases caused by a mutation in a single
gene, because people with particular mutations are all but
certain to have the disease. But it is vastly more difficult
when many genes contribute to a disease, because each
mutation may contribute only a little bit, and so each is
difficult to spot. Unfortunately, the diseases that kill and
disable most people (cancer, heart disease, and dementia,
for example) are of the latter kind. In principle, the HapMap
should make this job easier.
Indeed, in 2005, Josephine Hoh, Margaret Pericak-Vance, and Albert Edwards and their colleagues reported
their work on age-related macular degeneration (AMD), a
common cause of blindness in elderly people. They scanned
116,204 human SNPs looking for linkage to AMD and
found one with a high degree of correlation. That is, one
allele is found significantly more frequently in AMD patients than in normal controls. These workers traced this
SNP to a gene called CFH, which encodes complement
factor H. This protein regulates the complement cascade,
which governs inflammation. Later in 2005, Gregory
Hageman, Rando Allikmets, Bert Gold, Michael Dean, and
their colleagues confirmed the linkage between CFH and
AMD, finding a high-risk variant of the gene and also
several variants of the gene that appeared to be protective.
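What "significantly more frequently" means in such a scan can be sketched with a chi-square test on allele counts in patients and controls. The counts below are invented, not taken from the AMD studies, and the text does not say which test or correction the authors used; the Bonferroni threshold is one common, conservative way to allow for the 116,204 SNPs tested, shown here as an assumption.

```python
import math

def allele_association(case_risk, case_other, control_risk, control_other):
    """Chi-square test (1 degree of freedom) on a 2 x 2 table of allele counts.

    Returns (chi_square, p_value, odds_ratio).
    """
    a, b, c, d = case_risk, case_other, control_risk, control_other
    n = a + b + c + d
    chi_square = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom the upper-tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi_square / 2))
    odds_ratio = (a * d) / (b * c)
    return chi_square, p_value, odds_ratio

def bonferroni_threshold(alpha, n_tests):
    """Per-SNP significance level that keeps the overall error rate near alpha."""
    return alpha / n_tests

# Hypothetical allele counts at one SNP (two alleles per person).
chi_square, p_value, odds_ratio = allele_association(
    case_risk=130, case_other=70, control_risk=70, control_other=130)
threshold = bonferroni_threshold(0.05, 116_204)  # SNP count from the text

print(chi_square, p_value, odds_ratio, p_value < threshold)
```

Because so many SNPs are tested at once, a SNP must clear a far stricter threshold than the usual 0.05 before the association is taken seriously, which is one reason independent confirmation matters.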
These results led Hageman, Allikmets, Gold, Dean, and
their colleagues to look for participation of other components of the complement cascade. Sure enough, they found
a strong association between AMD and the factor B gene,
and both high-risk and protective variants. These findings
validated Hageman’s earlier hypothesis that inflammation
is central to the disease process in AMD, and suggest that
controlling inflammation may be a way to help prevent or
control the disease. But genes in the complement cascade
are not the only ones linked to AMD. Another group has
linked a gene (LOC387715), with a product of unknown
function, to AMD, and there are sure to be others.
Other workers have looked beyond SNPs in comparing
the genomes of different people, and have been surprised
by what they found. The genomes of seemingly normal
people frequently contain not just SNPs, but deletions, insertions, inversions, and other rearrangements of whole
chunks of DNA. Geneticists are now calling such differences in genomes structural variation. For example,
Michael Wigler and his colleagues examined the genomes
of 20 healthy individuals and found 221 places where these
people had different numbers of copies of particular chunks
of DNA. While these variations in copy number had no apparent effect on health in these people, it is possible that, in
combination with certain environmental factors, they
could predispose other people to disease.
On the other hand, some structural variants appear to
be beneficial. Sunil Ahuja and colleagues have shown that
extra copies of a particular immune system gene help protect people against AIDS. And a team of Icelandic scientists
has discovered a large inversion that is carried by 20% of
Europeans. Strikingly, women carrying this inversion have
more children than those who do not, suggesting that the
inversion confers some kind of reproductive advantage,
and that it is therefore probably spreading.
The complete sequences of genomes of simpler organisms can also be important in understanding and treating
human diseases. For example, as soon as the complete yeast
genome had been sequenced, molecular biologists began
systematically mutating every one of the 6000 yeast genes
to see what effects those mutations would have. They also
began systematically screening all 18 million possible protein–
protein interactions using a yeast two-hybrid screen
(Chapter 5, and see later in this chapter). The results of
such experiments can tell us much about the activities of
gene products that are still uncharacterized. And knowing
the activities of all the proteins in an organism, and the
other proteins with which they interact, should lead to
greater understanding of biochemical pathways, such as
the ones that metabolize drugs, or signal transduction
pathways that control gene expression. This understanding,