93 242 Techniques in Genomic Sequencing

by taratuta

wea25324_ch24_759-788.indd Page 765
22/12/10
9:02 AM user-f467
Volume/204/MHDQ268/wea25324_disk1of1/0073525324/wea25324_pagefile
Volume/204/MHDQ268/wea25324_disk1of1/0073525324/wea25324_pagefiles
24.2 Techniques in Genomic Sequencing
used an exon-trapping strategy and identified a handful of
exon clones. They then used these exons to probe a cDNA
library to identify the DNA copies of mRNAs transcribed
from the target region. One of the clones, called IT15, for
“interesting transcript number 15,” hybridized to cDNAs
that identified a large (10,366 nt) transcript that codes for
a large (3144 amino acid) protein called huntingtin. The
presumed protein product did not resemble any known
proteins, so that did not provide any evidence that this is
indeed HD. However, the gene had an intriguing repeat of
23 copies of the triplet CAG (one copy is actually CAA),
encoding a stretch of 23 glutamines.
Is this really HD? Gusella’s team’s comparison of the
gene in affected and unaffected individuals in 75 HD families demonstrated that it is. In all unaffected individuals,
the number of CAG repeats ranged from 11 to 34, and
98% of these unaffected people had 24 or fewer CAG repeats. In all affected individuals, the number of CAG repeats had expanded to at least 42, up to a high of about
100. Thus, we can predict whether an individual will be
affected by the disease by looking at the number of CAG
repeats in this gene.
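The repeat-counting idea above can be sketched in a few lines of Python. This is only an illustration, not a diagnostic procedure: the function names and the test sequence are hypothetical, the sketch counts only uninterrupted CAG runs (so an interrupting CAA, like the one noted above, would split the tract), and the cutoffs are simply the ranges quoted for the 75-family study.

```python
import re

def longest_cag_run(dna: str) -> int:
    """Return the number of repeat units in the longest uninterrupted CAG tract."""
    runs = re.findall(r"(?:CAG)+", dna.upper())
    return max((len(run) // 3 for run in runs), default=0)

def classify_repeat_count(n: int) -> str:
    """Bin a repeat count using the ranges quoted in the text (illustrative only)."""
    if n <= 34:
        return "within the range seen in unaffected individuals"
    if n >= 42:
        return "within the expanded range seen in affected individuals"
    return "between the two ranges reported in the study"

# Hypothetical sequence: arbitrary flanks around 23 CAG copies.
example = "ATGGCG" + "CAG" * 23 + "CCGCCA"
count = longest_cag_run(example)
print(count, classify_repeat_count(count))
# 23 within the range seen in unaffected individuals
```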
Furthermore, the severity, or age of onset of the disease
correlates at least roughly with the number of CAG repeats. People with a number of repeats at the low end of
the affected range (now known to be 36–40) generally survive well into adulthood before symptoms appear, whereas
people with a number of repeats at the high end of the
range tend to show symptoms in childhood. In one extreme
example, an individual with the highest number of repeats
detected (about 100) started showing disease symptoms at
the extraordinarily early age of 2.
Finally, two people were affected, even though their parents were not. In both cases, the affected individuals had
expanded CAG repeats, whereas their parents did not. New
mutations (expanded CAG repeats), although a rare occurrence in HD, apparently caused both these cases of disease.
Another way of demonstrating that this gene is really
HD would be to deliberately mutate it and show that the
mutation has neurological effects. Obviously, one cannot
perform such an experiment in humans, but it would be
feasible in mice, if the gene corresponding to HD is known.
Fortunately, HD is conserved in many species, including
the mouse, where the gene is known as Hdh. In 1995, a
team of geneticists led by Michael Hayden created knockout mice (Chapter 5) with a targeted disruption in exon 5
of Hdh. Mice that are homozygous for this mutation die in
utero. Heterozygotes are viable, but they show loss of neurons with corresponding lowering of intelligence. This reinforces the notion that Hdh, and therefore HD, plays an
important role in the brain—exactly what we would expect
of the gene that causes HD.
How can we put this new knowledge to work? One
obvious way is to perform accurate genetic screening to
detect people who will be affected by the disease. In fact, by
counting the CAG repeats, we may even be able to predict
the age of onset of the disease. However, that kind of information is a mixed blessing, as it can be psychologically
devastating. What we really need, of course, is a cure, but
that may be a long way off.
The Advantage of Genomic Data The positional cloning
study we have just examined took years, and much of that
time was spent sequencing DNA in the suspected regions
and trying to determine which gene in the sequence was the
most likely culprit. With the human genome now finished,
that job has become much easier. Just how much easier is
indicated by Neal Copeland, a mouse geneticist who has
been doing positional cloning in mice for years. He says, “It
took us 15 years to get 10 possible cancer genes before we
had the sequence. And it took us a few months to get 130
genes once we had the sequence.” He was talking about the
mouse sequence, of course, but the same principle applies
to humans, and mouse positional-cloning studies very often identify genes that cause similar problems in humans.
So one of the biggest anticipated payoffs of genomics research will be the acceleration of discovery of disease genes
in humans. You should not conclude from this discussion
that positional cloning is obsolete. It will be important as
long as we are curious about finding genes responsible for
traits in any organism. Sequenced genomes simply make
positional cloning much easier.
SUMMARY Using RFLPs, geneticists mapped the
Huntington disease gene (HD) to a region near the
end of chromosome 4. Then they used an exon trap
to identify the gene itself. The mutation that causes
the disease is an expansion of a CAG repeat from the
normal range of 11–34 copies, to the abnormal range
of at least 38 copies. The extra CAG repeats cause
extra glutamines to be inserted into huntingtin.
24.2 Techniques in Genomic Sequencing
The first genome to be sequenced, as you might expect, was
a very simple one: the small DNA genome of an E. coli
phage called φX174. Frederick Sanger, the inventor of the
dideoxy chain termination method of DNA sequencing,
obtained the sequence of this 5375-nt genome in 1977.
What kind of information can we glean from this sequence? First, we can locate exactly the coding regions for
all the genes. This tells us the spatial relationships among
genes and the distances between them to the exact nucleotide. How do we recognize a coding region? It contains an
ORF that is long enough to code for one of the phage
proteins. Furthermore, the ORF must start with an ATG
(or occasionally a GTG) triplet, corresponding to an AUG
(or GUG) translation initiation codon, and end with the
DNA equivalent of a stop codon (UAG, UAA, or UGA). In
other words, an ORF in a bacterium or phage is the same
as a gene’s coding region.
The base sequence of the phage DNA also tells us the
amino acid sequences of all the phage proteins. All we have
to do is use the genetic code to translate the DNA base sequence of each open reading frame into the corresponding
amino acid sequence. This may sound like a laborious process, but a personal computer can do it in a split second.
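To make the ORF rules above concrete, here is a minimal sketch of an ORF scan and translation in Python. It assumes the standard genetic code, an input containing only A, C, G, and T, and a scan of the given strand only. The names `find_orfs` and `min_codons` are invented for this illustration. An initiating GTG is reported as Met, because initiation codons specify methionine.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}
START_CODONS = {"ATG", "GTG"}

def find_orfs(dna, min_codons=2):
    """Return (start, end, protein) for each ORF in the three forward frames.

    start is 0-based, end is exclusive and includes the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon in START_CODONS:
                start = pos  # first start codon opens the ORF
            elif start is not None and CODON_TABLE[codon] == "*":
                if (pos - start) // 3 >= min_codons:
                    body = "".join(CODON_TABLE[dna[p:p + 3]]
                                   for p in range(start + 3, pos, 3))
                    orfs.append((start, pos + 3, "M" + body))
                start = None  # stop codon closes the ORF
    return orfs

print(find_orfs("CCATGAAAGGGTAACC"))  # [(2, 14, 'MKG')]
```

A real gene finder would also scan the complementary strand and apply a much longer minimum length; the tiny threshold here only keeps the example short.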
Sanger’s analysis of the open reading frames of the
φX174 DNA revealed something unexpected and fascinating: Some of the phage genes overlap. Figure 24.6a shows
that the coding region for gene B lies within gene A and the
coding region for gene E lies within gene D. Furthermore,
genes D and J overlap by 1 bp. How can two genes occupy
the same space and code for different proteins? The answer
is that the two genes are translated in different reading
frames (Figure 24.6b). Because entirely different sets of codons will be encountered in these two frames, the two protein products will also be quite different.
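The effect of shifting the reading frame can be checked directly on the 21-nt stretch GAAGGAGTGATGTAATGTCTA that appears in Figure 24.6b (bases 445–465 in that figure's numbering). The sketch below assumes the standard genetic code; `translate` is a helper written for this illustration, and `*` marks a stop codon.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

def translate(dna, frame):
    """Translate dna from 0-based offset `frame`, ignoring a trailing partial codon."""
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE[c] for c in codons)

fragment = "GAAGGAGTGATGTAATGTCTA"
labels = ["gene D frame", "gene E frame", "gene J frame"]
for frame, label in enumerate(labels):
    print(f"{label} (offset {frame}): {translate(fragment, frame)}")
# gene D frame (offset 0): EGVM*CL
# gene E frame (offset 1): KE*CNV
# gene J frame (offset 2): RSDVMS
```

Offset 0 reproduces the end of gene D (Glu Gly Val Met, then a stop), offset 1 the end of gene E (Lys Glu, then a stop), and offset 2 ends with the Met Ser that begins gene J. The residues printed after each stop are not part of those proteins.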
This was certainly an interesting finding, and it raised
the question of how common this phenomenon would be.
So far, major overlaps seem to be confined almost exclusively to viruses, which is not surprising because these simple infectious agents have small genomes in which the
premium is on efficient use of the genetic material. Moreover, viruses have prodigious power to replicate, so enormous numbers of generations have passed during which
evolution has honed the viral genomes.
With the advent of automated sequencing, geneticists
have added much larger genomes to the list of total known
sequences. In 1988, D.J. McGeoch and colleagues published
the sequence of an important human virus (herpes simplex
virus I) with a relatively large genome: 152,260 bp. In
1995, Craig Venter and Hamilton Smith and colleagues
determined the entire base sequences of the genomes of
two bacteria: Haemophilus influenzae and Mycoplasma
genitalium. The H. influenzae (strain Rd) genome contains
1,830,137 bp and it was the first genome from a free-living organism to be completely sequenced. The
M. genitalium genome, at only 580,000 bp, is the smallest
of any known free-living organism and contains only
about 470 genes.
In April 1996, the leaders of an international consortium of laboratories announced another milestone: The
12-million-bp genome of baker’s yeast (Saccharomyces
cerevisiae) had been sequenced. This was the first eukaryotic genome to be entirely sequenced. Later in 1996, the
first genome of an organism (Methanococcus jannaschii)
from the third domain of life, the archaea, was sequenced.
Then, in 1997, the long-awaited sequence of the 4.6
million-bp E. coli genome was reported. This is only about
one-third the size of the yeast genome, but the importance
of E. coli as a genetic tool made this a milestone as well.
In 1998, the sequence of the first animal genome, from
the roundworm Caenorhabditis elegans, was reported. The
first plant genome (from the mustard family member Arabidopsis thaliana) was completed in 2000. C. elegans and
A. thaliana are both model organisms chosen for study
because of their small genome size, short generation time,
and their ease of manipulation in genetic experiments.
C. elegans has the additional advantages of having fewer than
1000 cells, and being transparent, so the development of
each of its cells can be tracked visually. Two other famous
model organisms are the fruit fly Drosophila melanogaster
and the house mouse Mus musculus. The sequences of the
genomes of these two organisms were reported in 2000
[Figure 24.6 diagram not reproduced. Panel (a): map showing genes A, B, C, D, E, J, F, G, and H. Panel (b): sequence excerpts ATGAGT (bases 1–6; gene D: Met Ser), GTTTATGGTA (bases 175–184; gene D: Val Tyr Gly; gene E: Met Val), and GAAGGAGTGATGTAATGTCTA (bases 445–465; gene D: Glu Gly Val Met Stop; gene E: Lys Glu Stop; gene J: Met Ser).]
Figure 24.6 The genetic map of phage φX174. (a) Each letter
stands for a phage gene. (b) Overlapping reading frames of φX174.
Gene D (pink) begins with the base numbered 1 in this diagram and
continues through base number 459. This corresponds to amino acids
1–152 plus the stop codon TAA. Dots represent bases or amino acids
not shown. Only the nontemplate strand is shown. Gene E (blue)
begins at base number 179 and continues through base number 454,
corresponding to amino acids 1–90 plus the stop codon TGA. This
gene uses the reading frame one base to the right, relative to the
reading frame of gene D. Gene J (gray) begins at base number 459
and uses the reading frame one base to the left, relative to gene D.
and 2002, respectively. Also in 2000, the eagerly awaited
rough draft of the human genome sequence was announced.
By 2001, this “working draft” of the human genome was
published.
In 2002, several important genomes were reported,
in at least draft form. These included the genomes of the
single-celled parasite Plasmodium falciparum, which
causes malaria, and the mosquito Anopheles gambiae,
which is the major carrier of the parasite. Together, these
genomes promise to help in designing better ways of combating the terrible scourge of malaria. The year 2002 also
saw the publication of draft sequences of the genomes of
two common varieties of rice (Oryza sativa). This is the
first cereal plant genome to be sequenced, and it has enormous potential significance for human nutrition. Much of
the world’s population relies on cereals, and rice in particular, for the bulk of their food.
The genomic sequences of two more vertebrates also
appeared in 2002: The tiger pufferfish (Fugu rubripes), and
the house mouse (Mus musculus). Comparison of these sequences to that of the human genome has already shed
light on vertebrate evolution. Additional help on this evolutionary investigation has come from the sequence of the
genome of the sea squirt, Ciona intestinalis. The adult of
this species is a sessile marine organism that attaches itself
to rocks and pier pilings. It bears scant resemblance to a
vertebrate, but its larval form resembles a tadpole, complete with a dorsal column made of cartilage that bears
some resemblance to a spine. Thus, the sea squirt is a chordate, in the same phylum with the vertebrates. Comparison
of the genome of this organism with those of vertebrates
and invertebrates, such as nematodes and fruit flies, will
give us additional insight into vertebrate evolution.
Most molecular evolution studies depend on comparisons of base sequences of parts of genomes from different
organisms. The guiding principle is that there is a relationship between the divergence of the genomic sequences between any two organisms and the evolutionary distance
between those two organisms. Thus, the genomes of organisms that diverged relatively recently, such as the mouse and
human, should be more similar than the genomes of organisms that diverged longer ago, such as the sea squirt and
human. In general, this is certainly true, but genomic studies
on these and other organisms have revealed some unexpected features. For example, the rate of evolution of the
human genome is not constant throughout. Instead, there
are regions of relatively rapid change interspersed with regions that have changed relatively slowly over time. It will
be fascinating to discover the reasons for these differences.
Another lesson from the genomes sequenced so far is
that the size of an organism’s genome tends to correlate
with the organism’s complexity. (On the other hand, we
discovered in Chapter 2 when we discussed the C-value
paradox that there are many exceptions to this general
rule.) In accord with the rule, prokaryotic genomes tend
to be much smaller than eukaryotic ones. However, it is
interesting that there is some overlap. For example, the
smallest eukaryotic genome sequenced to date is that of the
obligate intracellular parasite of humans and other mammals,
Encephalitozoon cuniculi. This organism has a genome
comprising only about 2.9 Mb, and has only 1997 ORFs
that could potentially code for proteins. (Of course, a parasitic lifestyle enables an organism to survive with fewer
genes because it can rely on its host for many of its needs.)
By contrast, the largest bacterial genome, as of 2008, is that
of the social bacterium Sorangium cellulosum. It has a
genome composed of about 13 Mb, which is even larger
than the genome of budding yeast.
On April 14, 2003, the International Human Genome
Sequencing Consortium announced that it had produced a
“finished” human genome sequence—two years ahead of
schedule. That is, it had done 99% of the sequencing that
was possible with 2003 technology, the sequence was subject to an error rate of only one in 100,000, and all sequences
were in the proper order. This was a significant improvement over the rough draft announced two years earlier. Several hundred gaps remained to be filled, but they were
mostly very challenging repetitive regions and centromeres.
As of December 6, 2010, more than 1440 complete
genomes had been sequenced, of which 1372 were from
microbes, according to the NCBI website (www.ncbi.nlm.nih.gov/genome). Table 24.1 presents a time line of some of
the most important achievements in genome sequencing. In
the following sections we will discuss the lessons we have
learned from these sequences.
SUMMARY The base sequences of viruses and organisms ranging from phages to bacteria to animals
and plants have been obtained. A rough draft and
finished version of the human genome have also
been obtained. Comparison of the genomes of
closely related and more distantly related organisms
can shed light on the evolution of these species.
The Human Genome Project
In 1990, American geneticists embarked on an ambitious
quest: to map and ultimately sequence the entire human
genome. This effort, which quickly became an international program, was somewhat controversial at first, partly
because of the enormous effort and cost of carrying it
through to its ultimate goal: knowing the entire base sequence of every one of the human chromosomes. The reason for the high cost, of course, is that the human genome
is huge—more than 3 billion bp. To get an idea of the magnitude of this task, consider that if all 3 billion bases were
written down, it would take about 500,000 pages of the
journal Nature to contain all the information. If you could
stand the boredom, it would take you about 60 years,
working 8 h/day, every day, at 5 bases a second, to read it
all. Assuming a 1990 cost of about a dollar a base, the
project would consume more than $3 billion, vastly more
than we are used to devoting to a single biological project.
In the end, more efficient sequencing methods allowed the
project to be completed much sooner and at a lower cost
than originally estimated.

Table 24.1 Milestones in Genomic Sequencing
Genome (Importance) | Size (bp) | Year
Phage φX174 (first genome) | 5375 | 1977
Phage λ (large-DNA phage) | 48,513 | 1983
Herpes simplex virus I (large-DNA eukaryotic virus) | 152,260 | 1988
Haemophilus influenzae (bacterium, first organism) | 1,830,000 | 1995
Mycoplasma genitalium (smallest bacterial genome) | 580,000 | 1995
Saccharomyces cerevisiae (yeast, first eukaryote) | 12,068,000 | 1996
Methanococcus jannaschii (first archaeon) | 1,660,000 | 1996
Escherichia coli (best studied bacterium) | 4,639,221 | 1997
Caenorhabditis elegans (first animal, roundworm) | 97,000,000 | 1998
Human chromosome 22 (first human chromosome) | 53,000,000 | 1999
Arabidopsis thaliana (first plant, mustard family) | 120,000,000 | 2000
Drosophila melanogaster (a favorite genetic model) | 180,000,000 | 2000
Human (working draft of the “holy grail” of genomics) | 3,200,000,000 | 2001
Plasmodium falciparum (the malaria parasite) | 23,000,000 | 2002
Anopheles gambiae (the major mosquito malaria carrier) | 278,000,000 | 2002
Fugu rubripes (tiger pufferfish) | 365,000,000 | 2002
Mus musculus (house mouse) | 2,500,000,000 | 2002
Ciona intestinalis (sea squirt, a primitive chordate) | 117,000,000 | 2002
Canis lupus familiaris (dog, working draft) | ~2,400,000,000 | 2003
Gallus gallus (chicken, first farm animal) | 1,050,000,000 | 2004
Human (finished sequence) | 3,200,000,000 | 2004
Oryza sativa (rice, first cereal grain) | 489,000,000 | 2005
Pan troglodytes (chimpanzee, our closest relative, working draft) | ~3,000,000,000 | 2005
Three trypanosomatids (Trypanosoma cruzi, T. brucei, and Leishmania major, parasites that cause severe human illness) | 25–55,000,000 | 2005
Populus trichocarpa (black cottonwood, first tree) | ~485,000,000 | 2006
First individual humans (two Caucasians, one African, and one Han Chinese) | 3,200,000,000 | 2007 and 2008
Homo neanderthalensis (our closest evolutionary relative, working draft) | ~3,000,000,000 | 2010

The original plan for the Human Genome Project was
systematic and conservative: First, geneticists would prepare genetic and physical maps of the genome. These would
contain the markers, or signposts, that would allow DNA
sequences to be pieced together in the proper order. The
bulk of the sequencing would be done only after the mapping was complete and clones representing all points on the
map were in hand, systematically stored in freezers around
the world. The original target date for completion of the
sequence was 2005.
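The reading-time and cost estimates given earlier in this section are easy to re-derive. The short calculation below uses only the figures quoted in the text (3 billion bases, 5 bases per second, 8 hours a day, a dollar a base, 500,000 journal pages); the function and constant names are made up for this sketch.

```python
GENOME_BP = 3_000_000_000   # "more than 3 billion bp"
BASES_PER_SECOND = 5
HOURS_PER_DAY = 8
COST_PER_BASE_USD = 1.0     # assumed 1990 cost quoted in the text
JOURNAL_PAGES = 500_000

def reading_years(genome_bp, bases_per_second, hours_per_day):
    """Years needed to read the genome aloud at the given pace, every day."""
    seconds = genome_bp / bases_per_second
    days = seconds / (hours_per_day * 3600)
    return days / 365

years = reading_years(GENOME_BP, BASES_PER_SECOND, HOURS_PER_DAY)
print(f"reading time: {years:.1f} years")                        # reading time: 57.1 years
print(f"cost: ${GENOME_BP * COST_PER_BASE_USD:,.0f}")            # cost: $3,000,000,000
print(f"bases per journal page: {GENOME_BP // JOURNAL_PAGES}")   # bases per journal page: 6000
```

The result, about 57 years, is what the text rounds to "about 60 years".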
Then, in May of 1998, Craig Venter, who had established a private, for-profit company, Celera, to sequence the
human genome (and other genomes), shocked the genomics community by announcing that Celera would complete
a rough draft of the human genome by the end of 2000.
That timetable was astonishing enough, but the method by
which he proposed to do the sequencing was even more
arresting. Instead of relying on a map, with the ordered
clones used to build it, Venter proposed a shotgun sequencing approach in which the whole human genome would be
chopped up and cloned, then the clones would be sequenced at random, and finally the sequences would be
pieced together using powerful computer programs that
find overlapping sequences. It was not long before Francis
Collins, director of the publicly financed Human Genome
Project, rose to Venter’s challenge and promised that he
and his colleagues would also produce a rough draft by
the end of 2000, and a polished final draft by 2003, using
the map-then-sequence strategy.
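The core of the shotgun idea, piecing random reads together by finding overlaps, can be illustrated with a toy greedy merger. This is a teaching sketch under strong assumptions (error-free reads, a single strand, no repeated sequences), not the assembly software Celera actually used. The function names and the 16-base test "genome" are hypothetical.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b (0 if < min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing left to join; remaining contigs stay separate
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs)
                   if k not in (best_i, best_j)] + [merged]
    return contigs

# Three overlapping reads from the made-up sequence ATGCGTACCTGAAGTC, shuffled.
reads = ["CTGAAGTC", "ATGCGTAC", "GTACCTGA"]
print(greedy_assemble(reads))  # ['ATGCGTACCTGAAGTC']
```

Repetitive DNA is exactly where such a simple scheme breaks down, which is one reason mapped clones were attractive for a genome as large as the human one.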
The upshot of this race was a tie of sorts. Venter and
Collins appeared with President Clinton and other dignitaries at a ceremony in the East Room of the White House on
June 26, 2000, to announce the completion of a rough draft
of the human genome. We will examine the two approaches
to sequencing large genomes: mapping, then sequencing
(clone by clone); and shotgun sequencing. But first, let us
examine the cloning vectors that have been developed for
massive projects like the Human Genome Project.
Vectors for Large-Scale Genome Projects
No matter which sequencing strategy is used, one must first
clone fragments of the genome in appropriate vectors, and
large fragments are particularly valuable. We will describe
two of the most popular here: yeast artificial chromosomes
and bacterial artificial chromosomes. The early mapping
work relied on yeast artificial chromosomes, so we will
begin with those.
Yeast Artificial Chromosomes The main problem with
the cloning tools described in Chapter 4 is that they do not
hold enough DNA for large-scale physical mapping of the
human genome. Even the cosmids accommodate DNA inserts up to only about 50 kb, which is too small for efficient
mapping of regions spanning more than a million bases.
Vectors called yeast artificial chromosomes, or YACs,
were very useful in mapping the human genome because
they could accommodate hundreds of kilobases each. YACs containing a megabase or more are
known as “megaYACs.” A YAC contains a left and right
yeast chromosomal telomere (Chapter 21), which are both
necessary to protect the chromosome’s ends, and a yeast
centromere, which is necessary for segregation of sister
chromatids to opposite poles of the dividing yeast cell. The
centromere is placed adjacent to the left telomere, and a
huge piece of human (or any other) DNA can be placed in
between the centromere and the right telomere, as shown
in Figure 24.7. The large DNA inserts are prepared by
slightly digesting long pieces of human DNA with a restriction enzyme. The YACs, with their huge DNA inserts, can
then be introduced into yeast cells, where they will replicate just as if they were normal yeast chromosomes.
Using YACs, geneticists made great strides in the mapping phase of the Human Genome Project. They produced
a genetic map of the whole genome that provided an average resolution of 0.7 centimorgan. A centimorgan (cM) is
the distance that yields a 1% recombination frequency
between two markers and corresponds to an average of
about 1 Mb in humans. These researchers also produced
[Figure 24.7 diagram not reproduced: a left arm (L, C) and a right arm (R) are ligated to a foreign DNA insert.]
Figure 24.7 Cloning in yeast artificial chromosomes. We
begin with two tiny pieces of DNA from the two ends of a yeast
chromosome. One of these, the left arm, contains the left telomere
(yellow, labeled L) plus the centromere (red, labeled C). The right arm
contains the right telomere (yellow, labeled R). These two arms are
ligated to a large piece of foreign DNA (blue), for example several hundred
kilobases of human DNA, to form the YAC, which can
replicate in yeast cells along with the real chromosomes.
relatively high-resolution physical maps of two of the
smallest chromosomes, 21 and Y. These maps were especially useful in that they represented long stretches of overlapping DNA segments cloned in YACs. Thus, in the days
before the human genome was sequenced, if you were interested in a disease gene that mapped to one of these chromosomes, you had a much simplified task. You needed only to
discover two markers flanking the gene of interest, look on the
map to find which YAC or YACs contained these markers,
obtain the YACs, and begin your final search for the gene.
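The centimorgan definition given above (1% recombination frequency, corresponding on average to about 1 Mb in humans) translates into a very small calculation. The sketch below is a rough guide only: reading recombination frequency directly as map distance is reasonable just for closely spaced markers, the 1 Mb per cM figure is the genome-wide average quoted in the text, and the offspring counts are invented for the example.

```python
MB_PER_CM_HUMAN = 1.0  # average quoted in the text; varies along real chromosomes

def map_distance_cm(recombinants, total_offspring):
    """Recombination frequency in percent, read as centimorgans."""
    return 100.0 * recombinants / total_offspring

def approx_physical_distance_mb(cm):
    """Very rough physical distance implied by a genetic distance."""
    return cm * MB_PER_CM_HUMAN

# Hypothetical cross: 7 recombinants among 1000 offspring.
cm = map_distance_cm(7, 1000)
print(f"{cm:.1f} cM, roughly {approx_physical_distance_mb(cm):.1f} Mb")
# 0.7 cM, roughly 0.7 Mb
```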
Bacterial Artificial Chromosomes Despite all the success
they made possible in human genome mapping, YACs suffer from several serious drawbacks: They are inefficient
(not many clones are obtained per microgram of DNA);
they are hard to isolate from yeast cells; they are unstable;
and they tend to contain scrambled inserts that are really
composites of DNA fragments from more than one site.
Bacterial artificial chromosomes (BACs) solve all of these
problems and were therefore the vector of choice for
much of the sequencing phase of the Human Genome
Project.
BACs are based on a well-known natural plasmid that
inhabits E. coli cells: the F plasmid. This plasmid allows
conjugation between bacterial cells. In some conjugation
events, the F plasmid itself is transferred from a donor F+
cell to a recipient F− cell, converting the latter to an F+
cell. In other events, a small piece of host DNA is transferred as an insert in the F plasmid (which is called an F′
plasmid if it has an insert of foreign DNA). And in still
other events, the F plasmid inserts into the host chromosome and mobilizes the whole chromosome to pass from
the donor cell to the recipient cell. Thus, because the E. coli
chromosome contains over 4 million bp, the F plasmid can
obviously accommodate a large insert of DNA. In
practice, BACs usually have inserts less than 300,000 bp
(average about 150,000 bp), and these plasmids are stable
[Figure 24.8 map not reproduced. Labeled features of pBAC108L (6.9 kb): restriction sites Sal I, HindIII, Not I, BamHI, Not I, and Sal I; CmR; oriS; repE; ParA; ParB.]
Figure 24.8 Map of the BAC vector, pBAC108L. Key features include the cloning sites HindIII and BamHI, at top; the chloramphenicol resistance
gene (CmR), used as a selection tool; the origin of replication (oriS); and the genes governing partition of plasmids to daughter cells (ParA and ParB).
in vivo and in vitro. Unlike the linear YACs, which tend to
break under shearing forces, the circular, supercoiled BACs
resist breakage.
Figure 24.8 shows the map of one of the first BACs,
which was developed by Melvin Simon and colleagues in
1992. It has an origin of replication and a cloning site with two
restriction sites (for HindIII and BamHI) into which large
DNA fragments may be inserted. It also has genes (the Par
genes) that govern plasmid partition to the daughter cells
and keep the plasmid copy number at about two per cell,
which contributes to the stability of the plasmid. Finally, it has a
chloramphenicol-resistance gene to enable selection of cells
that have the plasmid.
SUMMARY Two high-capacity vectors have been
used extensively in the Human Genome Project.
Much of the mapping work was done with yeast
artificial chromosomes (YACs), which can accept
inserts of a million or more base pairs. Most of the
sequencing work was performed with bacterial artificial chromosomes (BACs) which can accept up to
about 300,000 bp. The BACs are more stable and
easier to work with than the YACs.
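Why large inserts are so valuable becomes clear from a back-of-the-envelope count of how many clones of each vector type would be needed to span the genome once. The calculation below uses the insert sizes quoted in this section and the roughly 3.2 billion bp human genome size listed in Table 24.1; it assumes perfectly tiled, nonoverlapping clones, whereas real libraries need several-fold redundancy. The names are illustrative.

```python
GENOME_BP = 3_200_000_000  # human genome size as listed in Table 24.1

INSERT_BP = {
    "cosmid (about 50 kb maximum)": 50_000,
    "BAC (about 150 kb average)": 150_000,
    "megaYAC (about 1 Mb)": 1_000_000,
}

def clones_for_one_pass(genome_bp, insert_bp):
    """Minimum number of end-to-end clones needed to cover the genome once."""
    return -(-genome_bp // insert_bp)  # ceiling division on integers

for vector, size in INSERT_BP.items():
    print(f"{vector}: {clones_for_one_pass(GENOME_BP, size):,} clones")
# cosmid (about 50 kb maximum): 64,000 clones
# BAC (about 150 kb average): 21,334 clones
# megaYAC (about 1 Mb): 3,200 clones
```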
The Clone-by-Clone Strategy
This strategy has inherent appeal because it is so systematic. First, the whole genome is mapped by finding markers
regularly spaced along each chromosome. A by-product of
the mapping is a collection of clones corresponding to the
markers. Because we already know the order of these
clones, we can sequence each one and put that sequence in
its proper place in the whole genome. Thus, this method is
commonly called the clone-by-clone sequencing strategy.
Aside from their usefulness in cloning, genetic and physical
maps have another important benefit: They give us signposts to use when searching for the genes responsible for
diseases. In the next section, we will consider some of the
most powerful methods used in mapping large genomes in
preparation for sequencing. As you read this section, bear
in mind that these techniques are designed to map markers
that are not genes but simply stretches of DNA that vary
from one individual to another. We have already seen one
example of such markers: restriction fragment length polymorphisms (RFLPs).
Variable Number of Tandem Repeats The greater the degree of polymorphism of an RFLP, the more useful it will be.
If only 1 person in 100 has one form of the RFLP (the 6-kb
fragment in Figure 24.1, for example), and the other 99
have the other form (the 4-kb and 2-kb fragments), one
must screen many individuals before finding the one rare
variant. This makes mapping very tedious. However, some
RFLPs, called variable number tandem repeats, or VNTRs,
are more useful. These derive from minisatellites (Chapter 5),
stretches of DNA that contain a short core sequence repeated over and over in tandem (head to tail). Because the
number of repeats of the core sequence in a VNTR is likely
to be different from one individual to another, VNTRs are
highly polymorphic, and therefore relatively easy to map.
However, VNTRs have a disadvantage as genetic markers:
They tend to bunch together at the ends of chromosomes,
leaving the interiors of the chromosomes relatively devoid
of markers.
Sequence-Tagged Sites Another kind of anonymous
marker, which is very useful to genome mappers, is the
sequence-tagged site (STS). STSs are short sequences,
about 60–1000 bp long, that can be detected by PCR.
Figure 24.9 illustrates how to use PCR to detect an STS.
One must first know enough about the DNA sequence in
the region being mapped to design short primers that will
[Figure 24.9 diagram not reproduced: two primers 250 bp apart, PCR, then electrophoresis showing a 250-bp product.]
Figure 24.9 Sequence-tagged sites. We start with a large cloned
piece of DNA, extending indefinitely in either direction. The sequences
of small areas of this DNA are known, so one can design primers that
will hybridize to these regions and allow PCR to produce double-stranded fragments of predictable lengths. In this example, two PCR
primers (red) spaced 250 bp apart have been used. Several cycles of
PCR generate many copies of a double-stranded PCR product that is
precisely 250 bp long. Electrophoresis of this product allows one to
measure its size exactly and confirm that it is the correct one.
hybridize a few hundred base pairs apart and cause amplification of a predictable length of DNA in between. One can
then apply PCR with these two primers to any unknown
DNA; if the proper size amplified DNA fragment appears,
then the unknown DNA has the STS of interest. Notice
that hybridization of the primers to the unknown DNA is
not enough; they must hybridize a specific number of base
pairs apart to give the right size PCR fragment. This provides a check on the specificity of hybridization. One great
advantage of STSs as a mapping tool is that no DNA must
be cloned and examined and kept in someone’s freezer.
Instead, the sequences of the primers used to generate an
STS are published and then anyone in the world can order
those same primers and find the same STS in an experiment that takes just a few hours. Another big advantage is
that it takes much less DNA to perform PCR than to do a
Southern blot.
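The logic of STS detection can be sketched in a few lines of Python. This is only an illustration of the idea, assuming exact primer matches on an idealized template; the function names (`reverse_complement`, `sts_product_length`, `has_sts`) are hypothetical and do not come from any published tool. The key point it mirrors is that both primers must be found and that they must lie the expected distance apart.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq))


def sts_product_length(template: str, fwd_primer: str, rev_primer: str):
    """Length of the product the primer pair would amplify, or None.

    Assumption: primers match the template exactly. Real primers tolerate
    some mismatch, which this sketch ignores.
    """
    start = template.find(fwd_primer)
    if start == -1:
        return None  # forward primer does not hybridize
    # The reverse primer anneals to the top strand's complement, so on the
    # top strand we look for its reverse complement downstream.
    rev_site = reverse_complement(rev_primer)
    end = template.find(rev_site, start + len(fwd_primer))
    if end == -1:
        return None  # reverse primer does not hybridize downstream
    return end + len(rev_site) - start


def has_sts(template: str, fwd_primer: str, rev_primer: str,
            expected_length: int) -> bool:
    """True only if both primers bind AND the product has the expected size."""
    return sts_product_length(template, fwd_primer, rev_primer) == expected_length
```

A template in which both primer sites are present but spaced incorrectly fails the test, which corresponds to the specificity check described above.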
Microsatellites STSs are very useful in physical mapping
or locating specific sequences in the genome. But they are
worthless as markers in traditional genetic mapping unless
they are polymorphic. Only then can we use them to determine genetic linkage. Fortunately, geneticists have discovered a class of STSs called microsatellites that are highly
polymorphic. Microsatellites are similar to minisatellites in
that they consist of a core sequence repeated over and over
many times in a row. However, whereas the core sequence
in typical minisatellites is a dozen or more base pairs long,
the core in microsatellites is much smaller—usually only
2–4 bp long. In 1992, Jean Weissenbach and his colleagues
produced a linkage map of the entire human genome based
on 814 microsatellites containing a C–A dinucleotide repeat. They isolated cloned DNAs containing these microsatellites and used their sequences to design PCR primers
that flank the repeats at each locus. A given pair of primers
yielded a PCR product whose size depended on the number
of C–A repeats in a given individual’s microsatellite at that
locus. Happily, the number of repeats varied quite a bit
from one individual to another. Besides the fact that microsatellites are highly polymorphic, they are also widespread
and relatively uniformly distributed in the human genome.
Thus, they are ideal as markers for both linkage and physical mapping.
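The dependence of PCR product size on repeat number can be made concrete with a small sketch. The flank lengths used here are invented for illustration, and the helper names (`product_length`, `ca_repeat_count`) are hypothetical; the only idea taken from the text is that the primers flank the repeat, so each extra copy of a dinucleotide core adds 2 bp to the product.

```python
import re


def product_length(left_flank: int, right_flank: int,
                   n_repeats: int, unit: str = "CA") -> int:
    """Expected PCR product size for a microsatellite with n_repeats copies.

    left_flank and right_flank are the lengths (bp) of the unique sequence,
    primers included, on either side of the repeat (assumed values).
    """
    return left_flank + n_repeats * len(unit) + right_flank


def ca_repeat_count(product: str, unit: str = "CA") -> int:
    """Number of copies in the longest tandem run of `unit` in a sequence."""
    runs = re.findall(f"(?:{re.escape(unit)})+", product)
    return max((len(run) // len(unit) for run in runs), default=0)
```

With hypothetical 24-bp flanks on each side, alleles carrying 15 and 27 C–A repeats would give products of 78 bp and 102 bp, which is the kind of size difference that electrophoresis resolves.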
Genetic (linkage) mapping with microsatellites is done
by the same technique outlined in Chapter 1 for traditional
genetic markers in fruit flies. Instead of determining the
recombination frequency between, say, wing shape and eye
color, geneticists would determine the recombination frequency between two microsatellites. For example, suppose
a man’s DNA yields a microsatellite at
one locus that is 78 bp long and a microsatellite at a nearby
locus that is 42 bp long. His wife has a microsatellite at the
first locus that is 102 bp long and a microsatellite at
the second locus that is 36 bp long. Within limits, the more
their children show nonparental combinations of these two
markers in their gametes (e.g., a grandchild with microsatellites that are 78 and 36 bp long, respectively), the more
recombination has occurred between the markers, and the
farther apart the markers are on the chromosome.
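The counting behind this reasoning can be sketched as follows. The sketch assumes each gamete is scored as a pair of product sizes (first locus, second locus) and that the parental combinations are known; the function name and the data layout are hypothetical, and the allele sizes are those of the example above.

```python
def recombination_frequency(gametes, parental_combinations):
    """Fraction of scored gametes that carry a nonparental combination.

    gametes: list of (size_at_locus_1, size_at_locus_2) tuples, in bp.
    parental_combinations: set of the combinations seen in the parents.
    """
    if not gametes:
        raise ValueError("at least one scored gamete is required")
    recombinants = sum(1 for g in gametes if g not in parental_combinations)
    return recombinants / len(gametes)


# Parental combinations from the example in the text.
PARENTAL = {(78, 42), (102, 36)}
```

A gamete scored as (78, 36) or (102, 42) counts as recombinant. A higher fraction suggests that the two microsatellites lie farther apart, within the limits noted above.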
Geneticists interested in physically mapping or sequencing a given region of a genome aim to assemble a set of
clones called a contig, which contains contiguous (actually
overlapping) DNAs spanning long distances. This is rather
like putting together a jigsaw puzzle; the bigger the pieces,
the easier the puzzle. Thus, it is essential to have vectors
like BACs and YACs that hold big chunks of DNA. Assuming we have a BAC library of the human genome, we need
some way to identify the clones that contain the region we
want to map. This can be done in several ways. We could
hybridize BAC DNA to a labeled DNA probe corresponding to the region of interest, but this is subject to some uncertainty due to possible nonspecific hybridization. A more
reliable method is to look for STSs in the BACs. It is best to
screen the BAC library for at least two STSs, spaced hundreds of kilobases apart, so BACs spanning a long distance
are selected.
After we have found a number of positive BACs, we
begin mapping by screening them for several additional
STSs, so we can line them up in an overlapping fashion as
shown in Figure 24.10. This set of overlapping BACs is our
new contig. We can now begin finer mapping, and even
sequencing, of the contig.
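Lining up BACs by their STS content can be sketched as below. This is a simplified illustration, not the procedure used by any particular mapping project: it assumes the order of the STSs along the chromosome is already known and that each BAC carries an unbroken run of them. The function name `build_contig` and the BAC names in the usage note are hypothetical.

```python
def build_contig(bac_sts, sts_order):
    """Order BACs left to right by STS content and verify that they overlap.

    bac_sts: dict mapping a BAC name to the set of STSs detected in it.
    sts_order: list of STS names in their (assumed known) chromosomal order.
    Returns the BAC names in contig order; raises ValueError on a gap.
    """
    rank = {sts: i for i, sts in enumerate(sts_order)}
    # Leftmost and rightmost STS position covered by each BAC.
    spans = {
        bac: (min(rank[s] for s in stss), max(rank[s] for s in stss))
        for bac, stss in bac_sts.items() if stss
    }
    if not spans:
        return []
    ordered = sorted(spans, key=lambda bac: spans[bac])
    reach = spans[ordered[0]][1]  # rightmost STS covered so far
    for bac in ordered[1:]:
        left, right = spans[bac]
        if left > reach:
            raise ValueError(f"gap before {bac}: no shared STS")
        reach = max(reach, right)
    return ordered
```

For example, BACs carrying {STS1, STS2}, {STS2, STS3, STS4}, and {STS4, STS5} line up in that order, because each shares at least one STS with its neighbor.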
Radiation Hybrid Mapping Mapping with BACs sounds
straightforward, but it presents difficulties. One of the most
important is that BACs are so small relative to a whole human chromosome that creating a BAC contig of a whole
chromosome would be unbearably laborious. So we need a
method to find linkage between STSs that are even farther
apart than those that could fit into a single BAC. Radiation
hybrid mapping provides a way.
Figure 24.10 Mapping with STSs. At top left, several representative
BACs are shown, with different symbols representing different STSs
placed at specific intervals. In step (a) of the mapping procedure,
screen for two or more widely spaced STSs. In this case screen for
STS1 and STS4. All those BACs with either STS1 or 4 are shown at
top right. The identified STSs are shown in color. In step (b), each of
these positive BACs is further screened for the presence of STS2,
STS3, and STS5. The colored symbols on the BACs at bottom right
denote the STSs detected in each BAC. In step (c), align the STSs in
each BAC to form the contig. Measuring the lengths of the BACs by
pulsed-field gel electrophoresis helps to pin down the spacing
between pairs of BACs.
We begin by irradiating
human cells with lethal doses of ionizing radiation, such as
x-rays or gamma rays, which break the human chromosomes into pieces. Next, we fuse these doomed human cells
with hamster cells to form hybrid cells that contain only
some of the human chromosome fragments. Then, we form
clones of identical hybrid cells by growing groups of cells—
each group deriving from a single progenitor cell. Finally,
we examine clones of hybrid cells to see which STSs tend to
be found together in the hybrid cells. The more often they
are together, the closer together they are likely to be on a
human chromosome.
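The idea that markers retained together are probably close together can be illustrated with a simple co-retention score. This is a crude proxy chosen for clarity; it is an assumption of this sketch, not the statistical model used in actual radiation hybrid mapping. Each hybrid clone is represented as the set of STSs detected in it, and the function name `co_retention` is hypothetical.

```python
def co_retention(hybrids, sts_a, sts_b):
    """Among clones retaining sts_a or sts_b, the fraction retaining both.

    hybrids: list of sets, one per hybrid clone, holding the STSs detected.
    Returns 0.0 if neither marker is retained in any clone.
    """
    either = [h for h in hybrids if sts_a in h or sts_b in h]
    if not either:
        return 0.0
    both = sum(1 for h in either if sts_a in h and sts_b in h)
    return both / len(either)
```

Two STSs that are rarely separated by a radiation-induced break score close to 1, whereas distant markers, which end up on different fragments, score much lower.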
In 1996, an international consortium of geneticists, including G.D. Schuler, published a human map based on
STSs mapped by this technique. It contained more than
16,000 STS markers, plus about a thousand genetic markers mapped by classical linkage methods (family studies),
which provided an overall framework for the map. The
STS markers used in this study were a special class called
expressed sequence tags (ESTs). These are STSs that are
generated by starting with mRNAs and using the enzyme
reverse transcriptase to make corresponding cDNAs. These
cDNAs can then be amplified by PCR and cloned. Finally,
both ends of the cDNAs are sequenced, yielding two “sequence tags” that are usually less than 500 bases long.
Thus, ESTs represent genes that are expressed in the cell
from which the mRNAs were isolated. Because the STS (or
EST) method yields the sequence of only a small part of a
gene, a given gene may be represented by many different
ESTs in an EST database. To minimize such duplications,
the mapping consortium confined their mapping to ESTs
that represented the 3′-untranslated regions (3′-UTRs) of
genes. This strategy also has the advantage of avoiding
most introns, which tend not to be found in 3′-UTRs. By
1998, the international consortium (P. Deloukas et al.) had
refined and extended the map to include over 30,000 genes.
SUMMARY Mapping the human genome requires a
set of landmarks to which we can relate the positions of genes. Some of these markers are genes, but
many more are nameless stretches of DNA, such as
RFLPs, VNTRs, and STSs (including ESTs and microsatellites). STSs of every kind are regions of DNA that can
be identified by formation of a predictable length of
amplified DNA by PCR with pairs of primers.
Shotgun Sequencing
The shotgun-sequencing strategy, first proposed by Craig
Venter, Hamilton Smith, and Leroy Hood in 1996, bypasses
the mapping stage and goes right to the sequencing stage.
The sequencing starts with a set of BAC clones containing
large DNA inserts, averaging about 150 kb. The insert in
each BAC is sequenced on both ends using an automated
sequencer that can usually read about 500 bases at a time,
so 500 bases at each end of the clone will be determined.
Assuming that 300,000 clones of human DNA are sequenced this way, that would generate 300 million bases of
sequence, or about 10% of the total human genome, and
the 500-base sequenced regions would therefore occur on
average every 5 kb in the genome. These 500-base
sequences serve as an identity tag, called a sequence-tagged
connector (STC), for each BAC clone. On average, assuming an average clone size of 150 kb, and an STC every 5 kb,
30 clones (150 kb/5 kb = 30) should share a given STC
somewhere within their span. This is the origin of the term
connector—each clone should be “connected” via its STCs
to about 30 other clones.
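The arithmetic in this paragraph can be checked with a short calculation. The genome size of 3 billion bases is an assumption implied by the statement that 300 million bases is about 10% of the genome; the function and key names are hypothetical.

```python
def stc_statistics(n_clones=300_000, read_length=500,
                   insert_size=150_000, genome_size=3_000_000_000):
    """Reproduce the sequence-tagged connector (STC) estimates in the text.

    genome_size is an assumed round figure of about 3 billion bases.
    """
    n_stcs = n_clones * 2                     # one read from each end
    total_bases = n_stcs * read_length        # bases of end sequence
    spacing = genome_size // n_stcs           # average distance between STCs
    return {
        "total_bases": total_bases,
        "genome_fraction": total_bases / genome_size,
        "stc_spacing_bp": spacing,
        "clones_per_stc": insert_size // spacing,
    }
```

With the default values this gives 300 million bases of end sequence, one STC about every 5 kb, and about 30 clones spanning any given STC, in agreement with the figures above.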
The next step is to fingerprint each clone by digesting
it with a restriction enzyme. This serves two important
purposes. First, it tells the insert size (the sum of the sizes of
all the fragments generated by the restriction enzyme).
Second, it allows one to eliminate aberrant clones whose
fragmentation patterns do not fit the consensus of the
overlapping clones. Note that this clone fingerprinting is
not the same as mapping; it is just a simple check before
sequencing begins.
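Both uses of a fingerprint can be sketched briefly. The sketch assumes a fingerprint is simply a list of restriction fragment sizes in base pairs derived from the insert, and that two fragments count as the same band if their sizes agree within a small relative tolerance; the 2% tolerance and the function names are illustrative assumptions.

```python
def insert_size(fragment_sizes):
    """Insert size estimated as the sum of the restriction fragment sizes."""
    return sum(fragment_sizes)


def shared_fragment_fraction(fingerprint_a, fingerprint_b, tolerance=0.02):
    """Fraction of fragments in fingerprint_a that have a match in fingerprint_b.

    Two fragments match if their sizes differ by no more than
    tolerance * size; each fragment in fingerprint_b is used at most once.
    """
    unused = sorted(fingerprint_b)
    matches = 0
    for size in sorted(fingerprint_a):
        for i, other in enumerate(unused):
            if abs(size - other) <= tolerance * size:
                matches += 1
                del unused[i]
                break
    return matches / len(fingerprint_a)
```

A clone that is supposed to overlap its neighbors but shares few or no fragments with them would be a candidate for elimination as aberrant.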
The next step is to obtain the entire sequence of a BAC
that looks interesting (a seed BAC). This is done by subdividing the BAC into smaller clones, frequently in a pUC-type vector with inserts averaging only about 2 kb. This
whole BAC sequence allows the identification of the 30 or
so other BACs that overlap with the seed: They are the ones
with STCs that occur somewhere in the seed BAC.
Next, one selects other BACs with minimal overlap with
the original one and proceeds to sequence them. Then this
process is repeated with other BACs with minimal overlap
with the second set, and so forth. This strategy, called BAC
walking, would in principle allow one laboratory to
sequence the whole human genome—given enough time.
But they did not have that much time, so Venter and colleagues modified the procedure by sequencing BACs at random until they had about 35 billion nt of sequence. In
principle that should cover the human genome ten times over,
giving a high degree of coverage and accuracy. Then they fed
all the sequence into a computer with a powerful program
that found areas of overlap between clones and fit their
sequences together, building the sequence of the whole genome.
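The principle of finding overlaps and fitting sequences together can be shown with a toy greedy assembler. This is a teaching sketch only and is not the program Venter and colleagues used: it assumes error-free reads from a single strand and no repeated sequence, and the function names are hypothetical.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if the best overlap is shorter than min_len.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest overlap.

    Returns the list of sequences left when no overlaps of at least
    min_len remain (a single sequence if everything joined up).
    """
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            break  # nothing left to join
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for idx, r in enumerate(reads)
                 if idx not in (best_i, best_j)] + [merged]
    return reads
```

A greedy approach like this is easily misled when the same sequence occurs at more than one place, which is one reason repeated DNA poses special problems for assembly.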
As mentioned a little earlier, the bulk of the sequencing
is done with pUC clones with relatively small inserts—only
about 2 kb each. But these small inserts would not provide
enough overlaps to piece together the whole genome. This
drawback is especially apparent in regions of repeated
DNA. A 2-kb cloned sequence from a 10-kb region of tandem DNA repeats would give no clues about where the
cloned sequence fit within the larger repeat region—one
part looks the same as another. That is one way the BAC
clones come in handy: They are large enough to cover
almost any repeated region. They also provide overlaps