...

94 243 Studying and Comparing Genomic Sequences

by taratuta

wea25324_ch24_759-788.indd Page 774
774
22/12/10
9:02 AM user-f467
Volume/204/MHDQ268/wea25324_disk1of1/0073525324/wea25324_pagefile
Volume/204/MHDQ268/wea25324_disk1of1/0073525324/wea25324_pagefiles
Chapter 24 / Introduction to Genomics: DNA Sequencing on a Genomic Scale
spanning large DNA regions, so they can help to organize
the smaller cloned fragments. This job was also facilitated
by the physical maps, especially the STS maps, that were
already available. So the shotgun strategy for sequencing
the human genome was in practice a hybrid of a pure shotgun and a map-then-sequence strategy.
Any strategy to sequence over 3 billion bp depends on
a high-volume, low-cost sequencing method. We now have
sequencing devices that perform electrophoresis of DNA
fragments in capillary tubes instead of the traditional thin
gel slabs. These instruments are fully automated and each
can handle about 1000 samples per day with only 15 min
of human attention. Another of Venter’s companies, The
Institute for Genomic Research (TIGR), had 230 such instruments; together, they could produce about 100 Mb of
DNA sequence every day, with a relatively low labor cost.
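As a rough check on these throughput figures, the sketch below works out the read length they imply and how long a single pass over the genome would take. It assumes one read per sample and a round 3-Gb genome; both are simplifications introduced here, not figures from the text.

```python
# Back-of-envelope check of the sequencing throughput quoted above.
INSTRUMENTS = 230               # capillary sequencers
SAMPLES_PER_DAY = 1_000         # samples per instrument per day
DAILY_OUTPUT_BP = 100_000_000   # about 100 Mb of sequence per day
GENOME_BP = 3_000_000_000       # assumption: round 3-Gb genome

# Assumption: every sample yields exactly one read.
reads_per_day = INSTRUMENTS * SAMPLES_PER_DAY
implied_read_length = DAILY_OUTPUT_BP / reads_per_day

# Days needed to produce one genome's worth of raw sequence (1x coverage).
days_per_1x = GENOME_BP / DAILY_OUTPUT_BP

print(f"reads per day:       {reads_per_day:,}")
print(f"implied read length: {implied_read_length:.0f} bases")
print(f"days per 1x pass:    {days_per_1x:.0f}")
```

An implied read of a few hundred bases is in the range expected for capillary sequencing, so the quoted numbers are mutually consistent. A real project needs many passes over the genome, so the time per pass is only a lower bound.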
24.3 Studying and Comparing Genomic Sequences
Once a genomic sequence is in hand, scientists can mine it
for the wealth of information it contains. They can also
compare it to the sequences of other genomes to shed light
on the evolution of these species. We will begin this section
with a discussion of the human genome, and then compare
it with the genomes of closely related, and then more distantly related organisms.
The Human Genome
At the end of 1999, we tasted the first fruit of the Human
Genome Project: The final draft of human chromosome 22.
In February 2001, the Venter group and the public consortium each published their versions of a working draft of the
whole human genome. In 2004, the international consortium announced the finished sequence of the euchromatic
part of the human genome.
In this section we will look at the lessons we learned from
the finished sequence of chromosome 22 (the first human
chromosome to be sequenced), and the working draft and
finished sequence of the whole genome. Before we begin, one
lesson worth noting is that the finished sequences came from
the more orderly clone-by-clone approach. This strategy yields
the final draft sequences of whole chromosomes as soon
as the groups sequencing each chromosome complete their
work. On the other hand, the raw sequence in the shotgun
sequencing approach is not pieced together until the very end,
when the computer finds the overlaps necessary to build contigs. Thus, this strategy may not yield the final draft sequence
of any chromosome until the whole genome is finished.
Sequencing Standards
What do we mean by “rough draft,” or “working draft,”
and “final draft” of a genome? That depends on whom you
ask. Most investigators agree that a working draft may be
only 90% complete and may have an error rate of up to
1%. Although there is less agreement about what qualifies
as a final draft, there is consensus that it should have an
error rate of less than 1/10,000 (0.01%) and should have
as few gaps as possible. Some molecular biologists insist
that a genome is not completely sequenced until every last
gap is filled, but it would be very difficult to eliminate all
gaps in the human genome. As we will see in the next section,
some regions of DNA, for mostly unexplained reasons, resist cloning. The cost of overcoming the obstacles to cloning these regions will likely be prohibitively high, so the
task of filling in the last few million bases of the human
genome may never be done. As detailed in the next section,
the consortium that sequenced human chromosome 22 decided that their sequence was “functionally complete”
when they had obtained all the sequence possible with the
cloning and sequencing tools currently available, even
though significant gaps remained.
Chromosome 22
In reality, only the long arm (22q) of the
chromosome was sequenced; the short arm (22p) is composed of pure heterochromatin and is thought to be devoid
of genes. Also, 11 gaps remained in the sequence. Ten of
these were gaps between contigs that could not be filled
with clones—presumably due to “unclonable” DNA. The
other corresponded to a 1.5-kb region of cloned DNA that
resisted sequencing. The reasons that some DNAs, sometimes called “poison regions,” are unclonable are not completely clear, but it is known that DNAs with unusual
secondary structure or repetitive sequences are frequently
lost from bacterial cells. This is one reason that heterochromatin (Chapter 13) is very poorly represented, even in the
final draft of the human genome. It is found primarily at
the centromeres and near the telomeres of chromosomes
and is rich in repetitive sequences. By failing to sequence
the heterochromatin in the genome, scientists are not missing very many, if any, genes, because genes are not thought
to reside there. But there could be other interesting aspects
of these heterochromatin regions that will be missed.
SUMMARY Massive sequencing projects can take
two forms: (1) In the map-then-sequence strategy,
one produces a physical map of the genome including
STSs, then sequences the clones (mostly BACs) used
in the mapping. This places the sequences in order so
they can be pieced together. (2) In the shotgun approach, one assembles libraries of clones with different size inserts, then sequences the inserts at random.
This method relies on a computer program to find
areas of overlap among the sequences and piece them
together. In practice, a combination of these methods
was used to sequence the human genome.
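The overlap-finding step of the shotgun approach can be illustrated with a toy example. The sketch below is a greedy assembler written for this illustration only: it repeatedly merges the two sequences with the longest suffix-to-prefix overlap. The function names, the minimum overlap, and the reads are all hypothetical, and the program is not the software used in the genome projects. Real assemblers must also cope with sequencing errors, repeats, and reads from both strands.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no overlap of at least min_len bases exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Merge reads into contigs, always joining the best-overlapping pair."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]


# Three overlapping reads from a made-up 15-base sequence, given out of order.
reads = ["ACGTTAGC", "ATGGCGTA", "GCGTACGT"]
print(greedy_assemble(reads))
```

When the reads do not overlap by at least `min_len` bases, the function returns more than one contig, which is the toy counterpart of a gap in the assembled sequence.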
What did we learn from the first completed sequence
of chromosome 22? Several findings were interesting.
First, we are going to have to learn to live with gaps in
our sequence of the human genome, although perhaps
not as many as first appeared in the sequence of this
chromosome. Already by the summer of 2000, one of the
gaps had been filled, and by December of 2010, only four
gaps remained, not counting the short arm of the chromosome. Still, the same problems encountered in spanning the gaps in chromosome 22 bedeviled investigators
sequencing the other chromosomes. Table 24.2 lists the
sequenced contigs in chromosome 22 and the gaps between them as of 1999. The contigs accounted for 33,464 kb,
or about 97% of the long arm of the chromosome, and
they were sequenced with very high accuracy—estimated
at less than one error per 50,000 bases. It is interesting
that all of the gaps occurred in the regions of the chromosome close to the centromere and telomeres. Between
gaps 4 and 5 was an enormous contig composed of
Table 24.2 Chromosome 22 Contigs and Gaps as of 1999

Contig    Gap    Size (kb)
1                234
          1      1.9
2                406
          2      ~150
3                1394
          3      ~150
4                1790
          4      ~100
5                23,006
          5      ~50
6                767
          6      ~50–100
7                1528
          7      ~150
8                2485
          8      ~50
9                190
          9      ~100
10               993
          10     ~100
11               291
          11     ~100
12               380
Total sequence length    33,464
Total length of 22q      34,491

(Source: Adapted from Dunham, I., N. Shimizu, B.A. Roe, S. Chissoe, A.R. Hunt, J.E.
Collins, et al., The DNA sequence of human chromosome 22. Nature 402:491, 1999.)
23,006 kb that covered more than two-thirds of chromosome 22q. By December of 2010, 34,894,566 bases of
chromosome 22q had been sequenced.
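The coverage figures quoted here can be checked from the contig sizes listed in Table 24.2. The short sketch below sums the twelve contigs and compares the result with the estimated length of 22q; the variable names are ours, and the numbers are those given in the table.

```python
# Contig sizes in kb, contigs 1 through 12, as listed in Table 24.2.
contig_kb = [234, 406, 1394, 1790, 23006, 767, 1528, 2485, 190, 993, 291, 380]
total_22q_kb = 34_491  # estimated total length of 22q, from the same table

sequenced_kb = sum(contig_kb)
coverage = sequenced_kb / total_22q_kb
implied_gap_kb = total_22q_kb - sequenced_kb

largest_kb = max(contig_kb)
largest_share = largest_kb / total_22q_kb

print(f"sequenced:      {sequenced_kb:,} kb ({coverage:.1%} of 22q)")
print(f"implied gaps:   {implied_gap_kb:,} kb")
print(f"largest contig: {largest_kb:,} kb ({largest_share:.1%} of 22q)")
```

The sum reproduces the 33,464 kb total, and the largest contig alone accounts for just over two-thirds of the long arm.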
The second major finding was that chromosome 22 was
estimated to contain 679 annotated genes (genes or gene-like
sequences that were at least partially identified). These can
be categorized as follows: known genes, whose sequences
are identical to known human genes or to the sequences deduced from known human proteins; related genes, whose
sequences are homologous to known genes of human or
other species or which have regions of similarity to known
genes; predicted genes, which contain sequences homologous to ESTs (so we are fairly sure they are expressed); and
pseudogenes, whose sequences are homologous to known
genes but contain defects that preclude proper expression. There were 247 known genes, 150 related genes,
148 predicted genes, and 134 pseudogenes in chromosome 22q.
Thus, not counting the pseudogenes, there were 545 annotated
genes. Computer analysis of the sequence predicted another
325 genes, but such analyses are still very inaccurate because
the algorithms depend on finding exons, and the many long
introns in human genes make exons hard to spot. As of
December 2010, 855 genes had been found in chromosome
22q, including pseudogenes.
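The gene tallies in this paragraph follow from the four category counts. The sketch below simply redoes that bookkeeping; the dictionary keys are labels chosen here for illustration.

```python
# Annotated gene categories on chromosome 22q, as counted in 1999.
annotated = {
    "known": 247,
    "related": 150,
    "predicted": 148,
    "pseudogene": 134,
}

total_annotated = sum(annotated.values())
# Pseudogenes cannot be properly expressed, so exclude them from the gene count.
annotated_genes = total_annotated - annotated["pseudogene"]
pseudogene_share = annotated["pseudogene"] / total_annotated

print(f"all annotated sequences:  {total_annotated}")
print(f"excluding pseudogenes:    {annotated_genes}")
print(f"pseudogene share:         {pseudogene_share:.1%}")
```

Roughly one annotated sequence in five is a pseudogene, which is why the two totals differ so noticeably.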
The third major finding was that the coding regions of
genes accounted for only a tiny fraction of the length of the
chromosome. Even counting introns, the annotated genes
accounted for only 39% of the total length of 22q, and the
exons accounted for only 3%. By contrast, fully 41% of 22q
is devoted to repeat sequences, especially Alu sequences and
LINEs (Chapter 23). Table 24.3 lists the interspersed repeat
elements found in chromosome 22 and their prevalences.
A fourth major finding was that the rate of recombination varied across the chromosome, with long regions in
which recombination is relatively low interspersed with
short regions of relatively high rates of recombination
(Figure 24.11). As we have seen earlier in this chapter, geneticists had already made a genetic map of the human genome, including chromosome 22, based on microsatellites.
This map was based on recombination frequencies between
microsatellites and was therefore calibrated in centimorgans. The chromosome 22 sequencing team was able to
find these microsatellites in the sequence and measure the
real physical distance between them. Figure 24.11 shows
that a plot of the genetic distance between markers versus
the physical distance between the same markers is not linear. The numbers indicate regions of high rates of recombination, and therefore high apparent genetic distance,
separated by longer regions of relatively low rates of recombination. The average ratio of genetic distance to physical distance in this chromosome is 1.87 cM/Mb. Of course,
we should remember that the y axis represents cumulative
genetic distance, that is, the sum of the distances between
closely spaced markers. The actual genetic distance between widely separated markers is not the same as the sum
Table 24.3 Repetitive DNA Content of Human Chromosome 22

Type                    Number    Total base pairs    % of chromosome
Alu                     20,188    5,621,998           16.80
HERV                    255       160,697             0.48
LINE 1                  8043      3,256,913           9.73
LINE 2                  6381      1,273,571           3.81
LTR                     848       256,412             0.77
MER                     3757      763,390             2.28
MIR                     8426      1,063,419           3.18
MLT                     2483      605,813             1.81
THE                     304       93,159              0.28
Other                   2313      625,562             1.87
Dinucleotide            1775      133,765             0.40
Trinucleotide           166       18,410              0.06
Tetranucleotide         404       47,691              0.14
Pentanucleotide         16        1612                0.0048
Other tandem repeats    305       102,245             0.31
Total                   55,664    14,024,657          41.91

(Source: Adapted from Dunham, I., N. Shimizu, B.A. Roe, S. Chissoe, A.R. Hunt, J.E. Collins, et al., The DNA sequence of
human chromosome 22. Nature 402:491, 1999.)
[Figure 24.11: plot of cumulative genetic distance (cM; y axis, 0 to 60) against physical distance (Mb; x axis, 0 to 30), with four regions of the curve numbered 1 to 4.]
Figure 24.11 Genetic distance plotted against physical distance in
chromosome 22q. The cumulative genetic distance between markers
(in cM) is graphed versus the physical distance between the same
markers (in Mb). The numbers denote four areas of relatively high rates
of recombination (as reflected in the steeply rising curves). (Source:
Adapted from Dunham, I., N. Shimizu, B.A. Roe, S. Chissoe, A.R. Hunt, J.E.
Collins, et al. (The Chromosome 22 Sequencing Consortium), The DNA sequence
of human chromosome 22. Nature 402:492, 1999.)
of the distances between intervening markers. That is
because multiple recombination events are more probable
between distant markers, which makes them appear closer
together than they really are (Chapter 1).
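The gap between cumulative and directly measured genetic distance can be made concrete with a mapping function. The sketch below assumes Haldane's model (crossovers occur independently, with no interference), which the text does not name; under that model the expected recombination fraction r for a map distance d in morgans is r = 0.5 × (1 − e^(−2d)). The marker spacing is hypothetical.

```python
import math


def haldane_recomb_fraction(map_distance_cm: float) -> float:
    """Expected recombination fraction for a map distance given in cM.

    Assumes Haldane's mapping function (no crossover interference).
    """
    d_morgans = map_distance_cm / 100.0
    return 0.5 * (1.0 - math.exp(-2.0 * d_morgans))


# Hypothetical chromosome arm: 12 adjacent marker intervals of 5 cM each.
intervals_cm = [5.0] * 12
cumulative_cm = sum(intervals_cm)  # what a cumulative plot would show

# Recombination actually observed between the two outermost markers.
end_to_end_r = haldane_recomb_fraction(cumulative_cm)
apparent_cm = 100.0 * end_to_end_r  # naive reading of % recombination as cM

print(f"cumulative distance:       {cumulative_cm:.0f} cM")
print(f"end-to-end recombination:  {end_to_end_r:.1%}")
print(f"apparent distance:         {apparent_cm:.1f} cM")
```

Double crossovers between the outermost markers restore the parental combination and go uncounted, so the recombination fraction approaches but never exceeds 50% and the markers appear closer together than the summed intervals indicate.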
A fifth major finding was that chromosome 22q had
several local and long-range duplications. The most obvious
involved the immunoglobulin λ locus. Clustered together at
this locus are 36 gene segments that are at least potentially
able to encode λ variable regions (V-λ gene segments), as
well as 56 V-λ pseudogenes and 27 partial V-λ pseudogenes
known as “relics.” Other duplications are separated by long
distances. In one striking example, a 60-kb region is duplicated with greater than 90% fidelity almost 12 Mb away.
Compared with the interspersed repeats, such as Alu sequences and LINEs, these duplications are found in few
copies, so they are known as low-copy repeats or LCRs.
Seven of the eight previously described LCR22s in the centromeric end of 22q were sequenced; the eighth (LCR22-1)
probably lies in the sequence gap closest to the centromere.
The sixth major finding was that large chunks of human chromosome 22q are conserved in several different
mouse chromosomes. The sequencing team found 113
human genes whose mouse orthologs had been mapped to
mouse chromosomes. (Orthologs are homologous genes in
different species that have evolved from a common ancestral gene. Paralogs, by contrast, are homologous genes that
have evolved by gene duplication within a species. Homologs are any kind of homologous genes—orthologs or
paralogs.) These mouse orthologs clustered into eight regions on seven different mouse chromosomes, as shown in
Figure 24.12. The mouse chromosomes represented in
human 22q are chromosomes 5, 6, 8, 10, 11, 15, and 16.
Mouse chromosome 10 is represented in two regions of
human 22q. As the two species have diverged, their chromosomes have rearranged, but linkage among many markers
[Figure 24.12: diagram of human chromosome 22, centromere near the top, aligned with its syntenic blocks in the mouse. Blocks are listed in the order given in the figure.]

Size of syntenic block (Mb)    Mouse chromosome
1.727                          6
4.064                          16
0.989                          10
2.549                          5
2.830                          11
0.061                          10
2.121                          8
15.401                         15

Figure 24.12 Regions of conservation between human and mouse
chromosomes. Human chromosome 22 is depicted on the left, with
the centromere near the top, and prominent bands in white and brown.
Seven different mouse chromosomes contain syntenic blocks
(orthologs in conserved order) and these are shown on the right.
Colors correspond to the mouse chromosomes listed at far right.
(Source: Adapted from Dunham, I., N. Shimizu, B.A. Roe, S. Chissoe, A.R. Hunt,
J.E. Collins, et al. (The Chromosome 22 Consortium), The DNA sequence of human
chromosome 22. Nature 402:494, 1999.)
has been preserved in syntenic blocks. (The preservation of
gene order between two species is known as synteny.)
Clearly, our knowledge of the sequence of the human genome has sped the sequencing of the mouse genome.
SUMMARY Human chromosome 22q has been sequenced
to high accuracy, but the sequence still has
10 gaps that cannot be filled with available methods. There are 679 annotated genes, but the great
bulk of the chromosome is made up of noncoding
DNA, over 40% of it in interspersed repeats such as
Alu sequences and LINEs. The rate of recombination varies across the chromosome, with long regions of low rates of recombination punctuated by
short regions with relatively high rates. The chromosome contains several examples of local and
long-range duplications. The human chromosome
contains large regions where linkage among genes
has been conserved with that in seven different
mouse chromosomes.
Working Draft and Finished Version of the Human Genome
In February 2001, the Venter group and the
public consortium published separately their own versions
of the working draft of the whole human genome. The
drafts of the human genome presented by the two groups
were by no means complete. They had many gaps and inaccuracies, but they also contained a wealth of information
that kept scientists busy for years analyzing and extending
it. Furthermore, the public draft continued to improve as
groups working on its separate parts completed the laborious finishing phase that eliminates gaps and corrects errors.
The most striking discovery from both groups was the
low number of genes in the genome. The Venter group found
26,588 genes for which there were at least two independent
lines of evidence, and about 12,000 more potential genes.
These potential genes were identified computationally, but
there were no other supporting data. Venter and colleagues
assumed that most of these latter sequences were false positives. The public consortium estimated that the human
genome contains 30,000–40,000 genes. As we will see later
in this section, the estimate from the finished human
genome sequence is even lower—fewer than 23,000 genes.
Thus, contrary to earlier estimates, the number of human genes seems to be scarcely larger than the number of
genes in a lowly roundworm or a fruit fly. Clearly, the complexity of an organism is not directly proportional to the
number of genes it contains. How then can we explain
human complexity? One emerging explanation is that the
expression of genes in humans is more complex than it is in
simpler organisms. For example, it is estimated that at least
40% of human transcripts experience alternative splicing
(Chapter 14). Thus, a relatively small number of gene regions encoding domains and motifs of proteins can be
shuffled in different ways to give a rich variety of proteins
with different functions. Moreover, posttranslational modification of proteins in humans seems more complex than
that in simpler organisms, and this also gives rise to a
greater variety of protein functions.
Another important finding is that about half of the human genome appears to have come from transposable elements duplicating themselves and carrying human DNA
from place to place within the genome (Chapter 23). However, even though transposons have contributed so greatly
to the genome, the vast majority of them are now inactive.
In fact, all of the non-retrotransposons are inactive, and all
of the LTR-containing retrotransposons seem to be. On the
other hand, as we learned in Chapter 23, a few L1 transposons are still active in the human genome and continue to
contribute to human disease.
Dozens of human genes appear to have come via horizontal transmission from bacteria, and some others came
from new transposons entering human cells. Thus, the
human genome has been shaped not entirely by internal
mutations and rearrangements, but also by importation
of genes from the outside world.
The total size of the human genome appears to be close
to the 3 billion bp (3 Gb) predicted for many years. The
Venter group sequenced about 2.9 Gb, and the public consortium predicted that the total size of the genome is
about 3.2 Gb.
As mentioned earlier in this chapter, the international
consortium of labs sequencing the human genome announced in the spring of 2003 that they had produced a
finished draft of the human genome, two years earlier than
originally planned. They published their work in 2004. The
major advantages of this version over the rough draft were:
1. It was more complete. Ninety-nine percent of the
sequence that was possible to obtain had been
obtained—2,851,330,913 base pairs, or about 2.85
gigabase pairs (Gb) worth.
2. It was more accurate. The inaccuracy rate was a tiny
0.001%, and all the sequences were in the proper order.
However, there were still 341 gaps, though 33 of those
were in the heterochromatic regions of the genome, which
were not a target of this project. Still, biologists generally
concede that we will have to live with many of these gaps,
perhaps forever. Also, in spite of the polish on the finished
product, annotation is still difficult, and we still do not
know the real number of genes in the human genome. The
international consortium found 22,287 protein-encoding
genes (19,438 known genes and 2188 predicted genes),
considerably fewer than estimated by both rough drafts of
the human genome. The difference appears to be largely
due to earlier double counting of apparent genes that actually map to the same true gene.
The estimated number of human genes has mostly decreased with time, at least if one includes only protein-encoding genes. In 2007, Michele Clamp reported an estimate
of only 20,488 human genes, and allowed that a hundred
or so remained undiscovered. She approached the question
from a bioinformatics angle—using only computational
tools. For example, she looked in a database called
Ensembl for human genes and then compared those with
counterparts in the dog and mouse genomes. This check of
the presumed human genes showed that 19,209 really do
code for proteins, while 3009 were on the list by mistake.
Another 1177 putative genes remained in doubt, so Clamp
analyzed them by comparing them to random DNA sequences for qualities of “geneness,” such as genelike GC
contents. All but 10 failed this test, yielding an estimate of
19,219 genes. Combining this with similar analyses of two
other databases yielded a final estimate of 20,488.
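The steps in Clamp's estimate can be retraced with simple arithmetic. The sketch below uses only the counts quoted above; the last line derives, by subtraction, how many genes the two other databases must have added, a figure the text does not state directly.

```python
# Counts quoted for the Ensembl-based analysis.
confirmed_coding = 19_209   # passed the comparison with dog and mouse
doubtful = 1_177            # could not be settled by that comparison
doubtful_passing = 10       # doubtful candidates that passed the "geneness" test
final_estimate = 20_488     # after adding analyses of two other databases

ensembl_estimate = confirmed_coding + doubtful_passing
doubtful_rejected = doubtful - doubtful_passing
# Derived here, not stated in the text: contribution of the other databases.
added_by_other_databases = final_estimate - ensembl_estimate

print(f"Ensembl-based estimate:        {ensembl_estimate:,}")
print(f"doubtful candidates rejected:  {doubtful_rejected:,}")
print(f"added by other databases:      {added_by_other_databases:,}")
```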
What else have we learned from the finished draft?
Here are a few examples: The estimated 22,287 genes appear to give rise to 34,214 transcripts, or about 1.5 per
gene. These genes are represented by 231,667 exons, or
about 10.4 per gene. The amount of DNA included in all
these exons is just 34 Mb, which is only 1.2% of the
euchromatic part of the human genome. This confirms
something we already knew: The vast majority of the
human genome does not contain protein-encoding genes.
Some of it codes for useful RNAs, such as rRNAs, tRNAs,
snRNAs, and miRNAs that are, of course, not translated.
But the bulk of it appears not to be transcribed at all, and
its functions, if any, remain a mystery.
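The per-gene ratios and the exon fraction can be recomputed from the raw counts. The sketch below uses the consortium's count of 22,287 protein-encoding genes and, as an assumption made here, takes the 2,851,330,913 bp of finished sequence as the size of the euchromatic genome.

```python
genes = 22_287                   # protein-encoding genes found by the consortium
transcripts = 34_214
exons = 231_667
exon_bp = 34_000_000             # about 34 Mb of exon sequence
euchromatic_bp = 2_851_330_913   # assumption: finished sequence length

transcripts_per_gene = transcripts / genes
exons_per_gene = exons / genes
exon_fraction = exon_bp / euchromatic_bp
mean_exon_bp = exon_bp / exons

print(f"transcripts per gene: {transcripts_per_gene:.1f}")
print(f"exons per gene:       {exons_per_gene:.1f}")
print(f"exon share of genome: {exon_fraction:.1%}")
print(f"mean exon length:     {mean_exon_bp:.0f} bp")
```

The mean exon length of roughly 150 bp is a by-product of the same numbers and helps explain why exons are hard to spot among long introns.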
The finished draft is also a great aid in the study of human evolution. First, it reveals newly duplicated genes that
provide the raw material for new genes with new functions: One gene in the pair can retain its original function,
but the other is free to collect mutations and evolve new
activities, without compromising the original activity,
which may be essential to life.
Second, the finished draft reveals newly inactivated
genes, or pseudogenes. The search for pseudogenes began
with a comparison of the rat, mouse, and human genomes
to find strings of genes that were found in all three organisms. Then the investigators looked for genes within this
string that were present in the rodents, but not in the human. Finally, they examined the region in the human genome predicted to contain these missing genes. They found
37 candidate pseudogenes that were still clearly recognizable, though they had all been inactivated. On average,
each pseudogene had 0.8 premature stop codons and 1.6
frameshifts. Either of these types of mutation would have
rendered the gene inactive. It is clear that these genes must
not be essential to human life, though they presumably
were to the common ancestor of humans, rats, and mice,
and may still be to the rodents.
To verify that these apparent pseudogenes were really
what they appeared to be, the investigators went back and
sequenced 34 of them. In 33 cases, the inactivations were
real, and in one case the apparent inactivation was due to a
sequencing error. Then they compared these 33 sequences
to the corresponding sequences in the chimpanzee genome.
Nineteen of these pseudogenes had two or more inactivating mutations, and these were all pseudogenes in the chimpanzee as well. The other 14, with just one inactivating
mutation, were more interesting. Eight of these were pseudogenes in the chimpanzee, but five were functional genes
in the chimpanzee, and one is a polymorphism (present as
a pseudogene in a fraction of the human population, but
as a functional gene in the others). Thus, we can see the
traces of gene inactivation through evolutionary time—
since the rodent and human lineages diverged, and since
the chimpanzee and human lineages diverged.
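The pseudogene counts in this analysis form a small decision tree, and the sketch below checks that the branches add up. Assigning each class to a period of inactivation is an interpretation made here for illustration: it assumes a pseudogene shared with the chimpanzee was inactivated before the two lineages split, which ignores the possibility of independent inactivation.

```python
resequenced = 34
confirmed = 33                     # inactivation verified by resequencing
sequencing_errors = resequenced - confirmed

multiple_mutations = 19            # all are also pseudogenes in the chimpanzee
single_mutation = confirmed - multiple_mutations

# Fates of the single-mutation pseudogenes, as reported in the text.
single_mutation_fates = {
    "pseudogene in chimpanzee": 8,
    "functional in chimpanzee": 5,
    "polymorphic in humans": 1,
}

# Interpretation (assumption): shared pseudogenes predate the human-chimpanzee split.
shared_with_chimp = multiple_mutations + single_mutation_fates["pseudogene in chimpanzee"]
human_specific = confirmed - shared_with_chimp

print(f"sequencing errors:        {sequencing_errors}")
print(f"single-mutation cases:    {single_mutation}")
print(f"shared with chimpanzee:   {shared_with_chimp}")
print(f"human-specific or recent: {human_specific}")
```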
SUMMARY The working draft of the human genome
reported by two separate groups allowed estimates
that the genome contains fewer genes than anticipated. About half of the genome has derived from the
action of transposons, and transposons themselves
have contributed dozens of genes to the genome. In
addition, bacteria appear to have donated at least
dozens of genes. The finished draft of the human
genome is much more accurate and complete than the
working draft, but it still contains some gaps. On the
basis of the finished draft, geneticists estimate that
the genome contains about 20,000–25,000 genes.
The finished draft also gives valuable information
about gene birth and death during human evolution.
Personal Genomics
By 2007, two groups had used traditional sequencing techniques to sequence the genomes of two major players in the
human genome project, James Watson and Craig Venter.
By 2008, two different groups used high-throughput
sequencing to sequence the genomes of two non-Caucasian
individuals, one of Nigerian descent, and one of Han Chinese descent. The addition of the genomes of these two
individuals to the two previously sequenced genomes of
individuals of European descent added more diversity to
the growing pool of human genomes. One can detect millions of SNPs, hundreds of thousands of insertions and
deletions, and thousands of structural variants among the
four genomes. By 2010, several more individual genomes
had been sequenced, including a European (French), a
Southern African (San) and a Papua New Guinean.
As the speed and economy of DNA sequencing have
improved, it has become possible to envision sequencing
the genome of anyone who wants it and who is willing to
pay the cost. The goal (with a significant cash prize) is to
sequence a whole human genome for $1000. No one has
claimed the prize yet, but high-throughput sequencing
techniques (Chapter 5) are making it seem feasible that
millions of people will one day have their whole genomic
sequence on a flash drive, or whatever data storage
medium is popular at that time. That wealth of information is bound to be valuable, but it also will create ethical
problems.
Other Vertebrate Genomes
The complete sequences of the mouse and a pufferfish (the
tiger pufferfish, Fugu rubripes) have been published. What
lessons have these genomes taught us? Here are some of the
most important:
The Fugu genome was chosen for sequencing because it
is a vertebrate with a much smaller genome than human—
only one-ninth the size. But despite the difference in size,
the two genomes have about the same number of genes
(31,059 predicted genes in Fugu). The difference lies, not in
gene content, but in the size of introns and amount of repetitive DNA. The Fugu genome has much smaller introns
than the human, and much less repetitive DNA. Comparing
the Fugu and human genomes has allowed genomics researchers to identify 1000 human genes.
Because genetic mutations that cause human diseases
are more likely to occur at important sites in genes, and
because these important sites are especially well conserved,
comparing two relatively distantly related vertebrate genomes, such as human and Fugu, should help identify these
important sites. The mouse genome is not as useful for this
purpose because it is relatively similar to the human genome. There simply has not been enough time for the
mouse and human genomes to diverge very far, and many
sites, not just important ones, have been conserved.
The mouse genome is a little smaller than the human,
about 2.5 Gb compared with about 3 Gb, but both organisms have about the same number of genes, and a high
percentage of these are the same in the two organisms:
99% of mouse genes have a counterpart we can identify
in humans. This 1% difference is obviously much too little to account for the biological differences between humans and mice, so something besides sheer DNA sequence
must be at work. Preliminary studies suggest that it is the
control of the genes, not the genes themselves, that plays
the biggest role in distinguishing humans from mice.
Knowing the great similarity in genomic structure between mice and humans, scientists can use the mouse as a
human surrogate in which to do experiments they could
not do in humans. For example, they can knock out genes
in mice and observe the effects. The results give us clues
about what the homologous genes do in humans. Molecular biologists can also examine the expression patterns of
mouse genes to learn when and where these genes are
expressed during development and in adults. Again, these
results give information about the expression of homologous genes in humans.
By the beginning of 2003, some of the best studies comparing the human and mouse genomes focused on chromosomes whose sequences were finished, including human
chromosome 21 and mouse chromosome 16. Let us consider some results from each of these studies.
A comparison of the DNA in human chromosome 21
and equivalent DNA in the mouse has revealed about 3000
conserved sequences. Surprisingly, only half of these conserved sequences contain genes. However, the fact that they
are so well conserved suggests that they are important, and
we need to find out why. Perhaps they play a role in gene
expression. Humans have 234 so-called “gene deserts” that
are poor in genes. Again, it is surprising that 178 of these
deserts are conserved in the mouse. And again, this degree
of conservation of seemingly useless DNA demands an
explanation. Accordingly, geneticists are knocking out
some of those gene deserts in the mouse to see what effect
their loss will have.
In 2002, Venter and colleagues reported a detailed comparison of the sequence of mouse chromosome 16 with
sequences in the human genome. They found many regions
[Figure 24.13: diagram of mouse chromosome 16 (left) with lines connecting its genes to syntenic regions on human chromosomes 16, 12, 8, 22, 3, and 21 (right).]
Figure 24.13 Regions of conserved synteny between mouse
chromosome 16 and the human genome. Homologous genes were
detected by analysis at the protein level. Mouse chromosome 16 is
depicted at left, with the syntenic regions on six different human
chromosomes illustrated at right (different colors indicate different
human chromosomes). Orthologous genes in mouse and human are
connected by colored lines in the middle of the diagram and indicated
by tiny horizontal lines (purple, mouse; various colors, human). Genes
homologous to mouse chromosome 16 in human chromosome 3 are
found in two distinct syntenic blocks, separated by the dotted line.
Above that line are human chromosome regions 3q27–29; below the
line are regions 3q11.1–13.3. (Source: Adapted from Mural et al., Science 296
(2002) Fig. 3, p. 1666.)
of synteny, that is, regions with conserved gene order that
appear to have derived from an ancestral mammalian
chromosome. Figure 24.13 illustrates these syntenic regions,
analyzed at the protein level. In all, mouse chromosome
16 has homologs on six human chromosomes (represented
by different colors); the homologous genes on human chromosome 3 are found in two syntenic blocks, separated by
the dotted line. Thus, all told, the genes on mouse chromosome 16 are represented in seven syntenic blocks in the
human genome.
The degree of homology between syntenic regions in
the two species is striking. Of 731 mouse genes that could
be predicted with high confidence on the mouse chromosome, 717 (98%) have homologs in the human genome.
This great homology far overshadows the fact that the
mouse chromosome is represented in six separate human
chromosomes and seven different syntenic blocks. Chromosomes frequently become scrambled during evolution
without changing much if anything about gene expression,
and without changing gene orders within large, syntenic
blocks of genes. This can happen by chromosome breakage
and translocation. For example, two closely related species
of muntjac deer have experienced so much chromosome
breakage (or joining, or both) since the two species diverged
that one has 3 pairs of chromosomes, and the other has 23
pairs! Nevertheless, the two species can interbreed to produce healthy, albeit infertile, hybrids.
The degree of similarity of mice and humans at the genomic level is clearly out of proportion to the obvious differences in appearance and behavior between these two
species. How do we explain this discrepancy? If we cannot
find the answer in the genes themselves, it must lie in the
way the genes are expressed. But some answers are already
determined. We know that human genes are subject to an
extraordinary amount of alternative splicing. In fact, it has
been estimated that about 75% of human genes are spliced
in at least two different ways in vivo (Chapter 14). This
makes the human proteome (the total complement of human proteins) much more complex than the genome suggests. We also have evidence that the pattern of expression
of human genes varies considerably from the expression of
the almost identical set of genes in our closest relative, the
chimpanzee, and varies even more from the pattern in mice.
This could derive from control by miRNAs, which, in contrast to protein-encoding genes, seem to be much different
in mice and humans.
Another source of variation in gene expression between
two closely related species could come from the interaction
between transcription factors and their binding sites on the
DNA. As we have learned, eukaryotic genes have cis-control elements known as promoters and enhancers, and
these are the targets of many transcription factors. We might
predict that closely related species with highly conserved
gene sets would also have highly conserved cis-control elements, but that seems not necessarily to be true. For
example, Michael Snyder and colleagues reported in 2007
on ChIP analysis coupled with DNA microchip assays on
the DNA targets for two transcription factors from three
closely related species of yeast, which showed that these
factors bound in the same places relative to the genes they
control only 20% of the time in all three species. (This kind
of experimentation is called ChIP-chip analysis and is described in more detail in Chapter 25.)
The great variation in transcription factor binding observed among these three yeast species is partly due to elements missing in one or two of the genomes, but it also
sometimes occurs because a factor fails to bind, even when
the element is still present. A similar phenomenon has been
observed in a comparison of factor binding in human and
mouse genomes.
How do we relate this rapid evolution of cis-regulatory
elements to changes in phenotype between organisms? At
this point, it is very difficult, because of uncertainty about
how much each element contributes to expression of a
particular gene. It is possible that most of the differences
Snyder and colleagues observed play no role in the
phenotypic differences among the three species, especially
because of the redundancy that appears to be built into
many cis-regulatory elements. On the other hand, it seems
likely that some of these differences really are important
to phenotype.
In 2005, scientists presented a working draft of the
chimpanzee genome. Because the chimpanzee is our closest
living relative, this sequence has special significance for
evolutionary studies. Everyone wants to know what sets us
apart from the chimpanzee. What genes give us the intelligence to build a city or write a symphony—or, for that
matter, to wonder what makes us human? But a comparison of the chimpanzee and human genomes shows that we
share almost all our protein-encoding genes in common,
and our genomes differ by only 1.23% at the nucleotide
level. Three hypotheses have been put forward to explain
these data: (1) The important differences are changes in
protein-encoding genes. (2) The “less is more” hypothesis,
which holds that inactivation of certain genes in the human
can explain the differences. (3) The differences are found in
changes in gene control regions.
Each hypothesis has some data to support it. Despite
the paucity of differences between chimpanzees and humans in protein-encoding genes, geneticists have noticed
some differences that could make a big difference. For example, the FOXP2 gene is highly conserved. It experienced
only one change in amino acid coding in the approximately
130 million years between the divergence of the human
and mouse lineages and the divergence of the human and
chimpanzee lineages. But in the approximately 5 million
years since the human and chimpanzee lineages diverged,
two amino acid changes occurred. Why might the FOXP2
gene be important? It encodes a forkhead class transcription factor, and mutations in this gene cause severe speech
impairment in humans. And, of course, speech is one of the
key traits that sets humans and chimpanzees apart.
The “less is more” hypothesis also has some support.
For example, it is easy to imagine that the relative lack of
hair in humans is due to the loss or inactivation of a gene
responsible for hairiness. And a comparison of the human
and chimpanzee genomes has uncovered 53 examples of
human genes that have been disrupted by insertions or deletions (indels). These genes are functional in chimpanzees,
but inactive in humans.
There is less direct experimental support for the third
hypothesis—differences in gene control—because of the
difficulty in identifying the genetic elements responsible.
But the great similarity in the protein-coding regions of the
two species suggests we look elsewhere, and genetic control
is an attractive place to look. Indeed, as we will see, the
most rapidly changing DNA sequences that distinguish the
human and chimpanzee genomes are in apparently noncoding DNA regions. The easiest way to make sense of this
finding is to say that these DNA regions are involved in
controlling the protein-encoding genes.
David Haussler and colleagues took the following
approach to finding important differences—coding or
noncoding—between the human and chimpanzee genomes.
They used computational techniques to identify genome
regions that are strongly conserved among vertebrates. Then
they looked in these regions to find regions of DNA that had
experienced a high rate of change since the divergence of
humans and chimpanzees. They found 49 such regions,
which they named HAR1–HAR49 (HAR = human accelerated region). HAR1, a 118-bp DNA region, stood out most
of all. In the 310 million years since the chicken and chimpanzee lineages diverged, only two changes occurred. However, in the 5 million years since the human and chimpanzee
lineages diverged, fully 18 changes have occurred.
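The contrast in the HAR1 figures can be made concrete with a little arithmetic. The sketch below is only a naive illustration: it divides the number of changes quoted above by the elapsed time quoted above, and makes no attempt to correct for branch lengths or multiple substitutions at one site. The function and variable names are hypothetical.

```python
# Naive rate comparison for HAR1, using only the figures quoted in the text.
HAR1_LENGTH_BP = 118


def changes_per_million_years(changes: int, million_years: float) -> float:
    """Observed changes divided by elapsed time (no branch-length correction)."""
    return changes / million_years


# 2 changes in the 310 million years separating chicken and chimpanzee
background = changes_per_million_years(2, 310)
# 18 changes in the 5 million years separating chimpanzee and human
human = changes_per_million_years(18, 5)
acceleration = human / background

print(f"background: {background:.4f} changes per million years")
print(f"human lineage: {human:.1f} changes per million years")
print(f"apparent acceleration: about {acceleration:.0f}-fold")
print(f"fraction of HAR1 changed: {18 / HAR1_LENGTH_BP:.1%}")
```

On these numbers the apparent rate of change in the human lineage is several hundred times the background rate, which is why HAR1 stood out.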
Haussler and colleagues then used in situ hybridization
on brain slices and found that one of the two RNAs
(HAR1F) that includes the HAR1 region is expressed in the
developing cerebral neocortex of humans and other primates. The neocortex is thought to be central to higher
cognitive function—perhaps the most salient difference
between chimpanzees and humans.
Thus, we know that HAR1 gives rise to two RNAs, but
these RNAs appear not to encode any proteins. However,
the base sequence of HAR1F allows a prediction of a stable
secondary structure (intramolecular base-pairing). And the
changes between the chimpanzee and human forms of
HAR1F are predicted to cause a significant difference in
secondary structure, including a strengthening of base-pairing. We do not know yet what HAR1F and HAR1R
do, but a reasonable hypothesis is that one or both of these
RNAs influence the expression of protein-encoding genes
in the developing human brain and give it some of its cognitive power.
One striking finding from the work of Haussler and
colleagues, as well as other workers in this field, is that the
most rapid changes in the genomes of humans and chimpanzees have not been in protein-encoding genes, but in
noncoding regions of the genome.
Although the chimpanzee is our closest living relative,
our closest evolutionary relative is the Neanderthal (Homo
neanderthalensis), which has been extinct for about 30,000
years. In 2010, a group led by Svante Pääbo succeeded in a
task many people assumed was impossible—they reported
a draft sequence of the Neanderthal genome. The problem
with sequencing the genome of a fossil organism is that the
DNA is badly degraded, and therefore commonly thought
to be unfit for sequencing. But Pääbo and colleagues solved
this problem by using next-generation sequencing techniques, in which DNAs are intentionally fragmented to begin with, so DNAs that are already fragmented pose less of
a problem. Another difficulty was that the bone samples
from which the Neanderthal DNA came were massively
contaminated with bacterial DNA, but Pääbo and colleagues minimized that problem by cutting the DNA with
restriction enzymes whose recognition sites include CG sequences, which are rare in mammals, but common in microbes. This reduced the size of most microbial DNA
fragments to the point that they did not interfere with the
sequencing.
One limitation of next-generation sequencing is that
the DNA fragments are frequently too short to exhibit obvious overlaps, so they cannot be pieced together to form a
whole genome. But that is not a problem if a closely related
species has already had its genome sequenced, so the fragments can be compared to that sequence and placed in the
proper order. Because the human genome was already
available, Pääbo and colleagues could use it as a framework for their Neanderthal sequence, which they obtained
from DNA extracted from well-preserved fossil remains.
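The idea of using a known genome as a framework can be sketched in a few lines. This is a toy illustration, not the method Pääbo and colleagues used: the sequences are invented, the reads are placed without gaps, and the one-mismatch tolerance is an arbitrary assumption. Real read-mapping software is far more elaborate.

```python
def best_position(read: str, reference: str) -> tuple[int, int]:
    """Return (offset, mismatches) for the best ungapped placement of read."""
    best = (0, len(read) + 1)
    for offset in range(len(reference) - len(read) + 1):
        window = reference[offset:offset + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches < best[1]:
            best = (offset, mismatches)
    return best


def place_reads(reads: list[str], reference: str,
                max_mismatches: int = 1) -> list[tuple[int, str]]:
    """Place each read at its best offset; drop reads that fit too poorly."""
    placed = []
    for read in reads:
        offset, mismatches = best_position(read, reference)
        if mismatches <= max_mismatches:
            placed.append((offset, read))
    return sorted(placed)


# Invented toy data: the reads do not overlap one another, so they could
# not be ordered by overlap alone. One read differs from the reference
# at a single base.
reference = "ACGTTGCAAGGCTTACCGATAGGCTA"
reads = ["CGATAGGC", "ACGTTGCA", "AGGCATAC"]

placed = place_reads(reads, reference)
for offset, read in placed:
    print(offset, read)
```

Although the three reads share no overlaps, each finds a unique best position on the reference, so their order along the genome is recovered.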
It is fascinating to have the Neanderthal sequence for
many reasons. For example, it appears to be able to answer
the question whether modern humans and Neanderthals
interbred. The two species coexisted for at least 10,000
years in Europe and Asia, until the Neanderthals disappeared, so interbreeding was certainly possible. If interbreeding occurred, and the offspring were fertile, it should
be possible to find traces of the Neanderthal genome in the
present human genome. Indeed, Pääbo and colleagues
found similarities between the Neanderthal genome and
the genomes of a modern European (French), a modern
East Asian (Han Chinese), and a modern Papua New
Guinean, but these similarities did not extend to the genomes of two modern sub-Saharan Africans (a San from
Southern Africa and a Yoruba from West Africa). Thus,
Neanderthals did apparently interbreed with the ancestors
of modern Eurasians, but this happened after the Eurasian
and African lineages diverged. Also, because the Neanderthal
genome resembles the Papua New Guinean, Chinese, and
European genomes equally closely, the interbreeding appears to have happened before those lineages diverged.
Pääbo and colleagues also reported the full Neanderthal mitochondrial DNA sequence, in 2008. They eliminated errors and minimized the effect of contamination by
sequencing so thoroughly that each base was represented
in at least 35 independent reads. Gaps and ambiguities
were resolved by traditional sequencing. The modern human and Neanderthal mitochondrial sequences differ in an
average of 206 bases. This contrasts with differences between modern human mitochondrial sequences that vary
between 2 and 118 bases. These data allowed Pääbo and
colleagues to estimate the time of divergence between the
modern human and Neanderthal lineages at about 660,000
years ago.
SUMMARY Comparing the human genome with
that of other vertebrates has already taught us much
about the similarities and differences among genomes. Such comparisons have also helped to identify many human genes. In the future, such
comparisons will help find the genes that are defective in human genetic diseases. One can also use
closely related species like the mouse to find when
and where their genes are expressed and therefore
to estimate when and where the corresponding human genes are expressed. Detailed comparison of
mouse and human chromosomes has revealed a
high degree of synteny between the two species.
Comparisons of the human genome with that of our
closest living relative, the chimpanzee, have identified a few DNA regions that have changed rapidly
since the two species diverged. These are good candidates for the DNA sequences that set humans and
chimps apart, yet very few of them are in protein-encoding genes. Thus, the thing that really sets us
apart may be control of genes, rather than the
genes themselves. Studies in yeasts have shown
that even closely related species have great variation in the cis-regulatory elements that control their
genes, though the genes themselves are highly conserved. Thus, cis-regulatory elements are subject to
relatively rapid evolution, and that may help to
explain differences in gene control, and therefore
in phenotype. More insight into what makes us human will come from the genome of the Neanderthal.
A working draft of this genome, as well as a finished
version of the mitochondrial DNA, has already
been published.
The Minimal Genome
By early 2002, over 50 bacterial genomes had been sequenced. The smallest of these genomes belong to intracellular parasites, such as mycoplasmas, Rickettsia (one of
whose members causes Rocky Mountain spotted fever),
and parasitic spirochetes like Borrelia burgdorferi, which
causes Lyme disease. The record for smallest bacterial
genome is held by Mycoplasma genitalium, at only 530 kb.
This kind of analysis has led some geneticists to ask, “What
is the smallest genome that is still compatible with life?”
One way to answer this question would be to compare
the genomes of bacteria and find the lowest common denominator: the set of genes they all have in common. But
that yields a set of only about 80 genes, which is clearly too
few to sustain life. Thus, different bacteria have followed
different paths to streamlining their genomes, and it is
therefore not useful simply to find where the endpoints of
these different paths overlap.
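A toy example shows how the "lowest common denominator" can come out too small. The gene names and functions below are hypothetical, and the sketch assumes one particular reason for the shortfall: that different bacteria sometimes carry out the same function with different, unshared genes.

```python
# Hypothetical gene inventories. Each toy genome covers the same three
# functions, but not always with the same gene.
genomes = {
    "bacterium_1": {"ribosomal_x", "polymerase_x", "glycolysis_a"},
    "bacterium_2": {"ribosomal_x", "polymerase_x", "glycolysis_b"},
    "bacterium_3": {"ribosomal_x", "polymerase_y", "glycolysis_a"},
}
function_of = {
    "ribosomal_x": "translation",
    "polymerase_x": "replication",
    "polymerase_y": "replication",
    "glycolysis_a": "energy",
    "glycolysis_b": "energy",
}
functions_needed = {"translation", "replication", "energy"}

# Genes present in every genome: the lowest common denominator.
shared = set.intersection(*genomes.values())
functions_covered = {function_of[gene] for gene in shared}
functions_missing = functions_needed - functions_covered

print("shared genes:", sorted(shared))
print("functions left uncovered:", sorted(functions_missing))
```

Every toy genome is complete, yet the shared set covers only one of the three functions, so the intersection alone could not sustain a cell.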
In 1999, Craig Venter and colleagues reported the results of another approach to finding the minimal genome.
They systematically mutagenized the genes in Mycoplasma
genitalium and the related species M. pneumoniae, using
transposons to interrupt the genes. Then they looked to see
which genes were essential, and which were not. They discovered that 265–350 of the 480 protein-encoding genes in
these organisms are essential. Surprisingly, 111 of these
genes had unknown functions, suggesting that we still have
a lot to learn about what it takes to sustain life.
This experiment identified the essential gene set, that is,
the set of genes whose loss is incompatible with life. But
that is not the same as the minimal genome, the collection
of genes that would sustain life in a real organism. The
distinction comes from the fact that an organism can afford to lose certain genes by themselves, but loss of two or
more of these same genes together is not compatible with
life. Thus, these genes are not part of the essential gene set,
but they are part of the minimal genome.
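The distinction between the essential gene set and the minimal genome can be illustrated with a toy model. Everything here is an assumption made for illustration: the four genes, the viability rule, and the exhaustive search, which is feasible only because the example is tiny.

```python
from itertools import combinations

GENES = frozenset({"A", "B", "C", "D"})


def is_viable(genome: frozenset) -> bool:
    """Toy rule (an assumption): gene C is always required, and at least
    one of the redundant pair A/B must be present. D is dispensable."""
    return "C" in genome and ("A" in genome or "B" in genome)


def essential_genes(genes: frozenset) -> set:
    """Genes whose single knockout from the full genome is lethal."""
    return {g for g in genes if not is_viable(genes - {g})}


def minimal_genomes(genes: frozenset) -> list:
    """All smallest viable gene subsets, found by exhaustive search."""
    for size in range(len(genes) + 1):
        viable = [set(combo) for combo in combinations(sorted(genes), size)
                  if is_viable(frozenset(combo))]
        if viable:
            return viable
    return []


essential = essential_genes(GENES)
minimal = minimal_genomes(GENES)
print("essential gene set:", sorted(essential))
print("minimal genomes:", [sorted(m) for m in minimal])
```

In this model only C is essential, because losing A alone or B alone is tolerated. But losing both is lethal, so every minimal genome must contain C plus one member of the A/B pair, and is therefore larger than the essential gene set.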
The next task was to discover which genes need to be
added to the essential gene set to produce a minimal genome. Venter and colleagues proposed to perform this task
in a spectacularly ambitious way. They aimed to synthesize
DNA from scratch, building DNA cassettes carrying several genes. Then they would place these cassettes into
Mycoplasma cells whose own genes had been disabled so
they would not confuse the issue. They would experiment
with different combinations of genes until they found the
combination with the smallest number of genes that could
still support life.
This plan faced a difficult hurdle: getting the
genes to function appropriately in a new cell without any
genes of its own. It is true that one can place one or a few
foreign genes into a normal bacterial cell and get them to
turn on very well. But what about an entirely new gene set?
There was a significant chance that the genes would not
turn on, but would just sit there. Bernhard Palsson has
stated the problem this way: “How do you boot up a new
genome?”
However, by 2007, Venter and colleagues had reported
progress that showed that booting up a genome really does
work. They transplanted the genome of Mycoplasma mycoides to another bacterium, Mycoplasma capricolum, and
the resulting cell thrived with its new genome. However,
they had to use some creative manipulations to make the
transplant work. First, they added an antibiotic resistance
gene to the donor bacteria (M. mycoides), and embedded
these cells in an agarose gel. Then they broke open the cells
and digested their proteins with proteolytic enzymes.
(Mycoplasma cells lack a cell wall, which makes it easier to
break them.) With the released circular genome protected
from physical stress by the agarose, the recipient bacteria
(M. capricolum) were added, along with the membrane-fusing agent polyethylene glycol. Apparently, some of these
recipient bacterial membranes opened up and then fused
around the naked donor genomes.
Instead of destroying the recipient cell’s genome, Venter
and colleagues played a clever trick involving the antibiotic
resistance gene they had placed in the donor cell genome.
After fusion, the recipient cell found itself with two genomes:
one it had always had, and one from the donor. With two
genomes, the cell was ready to divide, and proceeded to do
so. One daughter cell got the donor genome, and the other
got the recipient genome. But only the daughter cell with the
donor genome had the antibiotic resistance gene, so growing
the cells in the presence of the antibiotic automatically removed the cells with the recipient genome. The result was
that all of the cells that formed early in the experiment were
M. capricolum cells with an M. mycoides genome.
In 2010, Venter and colleagues used a similar technique
to introduce an entirely synthetic M. mycoides genome into
M. capricolum cells. The success of this experiment ushered
in a new era of “synthetic biology.” Of course, the engineered organisms are not truly synthetic—only their
genome is—but they represent a milestone nonetheless. A
potential ethical question might remain: Is it ethical to
create life from nonliving ingredients? Recognizing this issue, Venter and colleagues submitted their plan to a panel
of ethicists, who decided in 1999 that it presented no serious ethical problems. But they did see some safety issues,
and recommended that public officials should examine the
possibility that the artificial life forms Venter and his colleagues would create could pose an environmental hazard,
or that they might lend themselves to being modified for
use as agents of bioterrorism or biowarfare.
To at least partially address the safety issue, Venter and
colleagues have endowed their synthetic genome with a
watermark—a DNA sequence not found in nature—that
will enable the engineered organisms to be identified. Use
of these organisms by terrorists seems very unlikely because a great deal of sophistication will be needed to create
them, and there is no indication that they would be any
more dangerous than highly toxic natural organisms that
are already available. Ethical questions remain, however,
and President Obama convened an ethics panel to study
the issues and issue a report by the end of 2010.
Why build an organism with a minimal genome? On a
purely scientific level, it will be important to show that
there is such a thing as a minimal genome, and then to
investigate why these particular genes are required. But
practical applications are also possible. Indeed, Venter and
colleagues plan to supplement the minimal genome with
genes that will enable the bacteria to create fuels such as
hydrogen, or to clean up industrial waste, including CO2
from power plants.
This does not mean that traditional organisms will
lose out to synthetic ones with minimal genomes as the
microbial workhorses of the future. Frederick Blattner
and his colleagues have been trimming away the genome
of E. coli to build an organism with a reduced genome
that is hospitable to new genes. Their strategy is to identify genes that differ from one strain to another, and are
therefore probably dispensable. They have found that
these genes tend to cluster in “islands” that can be conveniently deleted. As of late 2005, they had made 43 deletions,
cutting the genome’s size down by more than 10%.
Already, this altered bacterium was ten times better at
accepting new genes than typical laboratory strains. By
late 2007, Blattner’s group had pared away 14% of the
E. coli genome without harming the ability of the cells to
grow and express foreign genes, and more trimming
remains to be done.
Finally, it is worth noting that M. genitalium and other
intracellular parasites can get away with such a small genome because of their parasitic lifestyle. They get many
of their nutrients from their hosts, so they can safely
shed the genes that produce those nutrients. In fact,
M. genitalium may already have honed its genome to
something close to the minimum required for life in its
human host. But scientists may be able to hone it even
further—to the minimum required to live under rigidly
controlled laboratory conditions.
SUMMARY It is possible to define the essential gene
set of a simple organism by mutating one gene at a
time to see which genes are required for life. In
principle, it is also possible to define the minimal
genome—the set of genes that is the minimum
required for life. It is likely that this minimal
genome is larger than the essential gene set. It is
also possible to place this minimal genome into a
cell lacking genes of its own and thereby create a
new form of life that can live and reproduce under
laboratory conditions. With selected genes added,
such a life form could be modified to perform many
useful tasks.
The Barcode of Life
Taxonomists are in the business of classifying organisms
and understanding their differences and relationships.
Traditionally, they have relied on simple appearances, or
morphological characteristics, to distinguish among different species. Now, in the era of DNA sequencing, they have
gained another tool, because different species have different DNA sequences, as well as different appearances.
Moreover, the degree of difference in the DNA sequences
between two species is a good measure of their evolutionary distance, or the time since the two diverged, assuming
a constant rate of mutation.
However, with millions of species to study, there is no
hope with present technology of sequencing the whole
genomes of even a significant fraction of all these species.
Instead, taxonomists focus on small regions of the genome that show a significant amount of variation among
the species they are studying. Now, a group of scientists
called the Consortium for the Barcode of Life (CBOL) is
proposing to obtain a relatively short DNA sequence, or
barcode, from the genome of every species on earth. In
principle, this would allow the rapid identification of any
known species, including agents of bioterrorism, and it
would help to place new species on the proper branch of
the tree of life. The work would start with the 1.7 million known
species of animals and plants and then move to the rest of
the 10 million or more unknown species (not counting
microbes).
CBOL scientists settled on a 648-bp region from the
mitochondrial cytochrome c oxidase subunit I (COI)
gene as the barcode, at least for animals. This gene is
present in all organisms. And, at least in animals, it shows
a good degree of difference between closely related species, but little difference between members of the same
species. For example, the barcodes in different human
beings differ from one another by only one or two base
pairs out of 648, while those in humans and chimpanzees, our closest living relatives in the tree of life, differ
by 60 bp. Moreover, a sequence of 648 bp is easy and
cheap to obtain in one run of a traditional automated
sequencer, and mitochondrial DNA is relatively easy to
purify because each cell contains 100–10,000 copies instead of just two copies in nuclear DNA.
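The figures above translate directly into percent divergence, and the same counting of differences suggests how a barcode could be used for identification. The sketch below is a simplification under stated assumptions: the 12-base "barcodes" and species names are invented, the sequences are assumed to be already aligned and of equal length, and the nearest reference is simply the one with the fewest differences.

```python
BARCODE_LENGTH_BP = 648


def percent_divergence(differences: int, length: int = BARCODE_LENGTH_BP) -> float:
    """Differences expressed as a percentage of the barcode length."""
    return 100 * differences / length


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def identify(query: str, references: dict[str, str]) -> tuple[str, int]:
    """Return the reference name with the fewest differences from the query."""
    scored = ((name, hamming(query, seq)) for name, seq in references.items())
    return min(scored, key=lambda pair: pair[1])


# Figures from the text: 2 of 648 bp within humans, 60 of 648 bp
# between humans and chimpanzees.
print(f"within species: {percent_divergence(2):.2f}%")
print(f"between species: {percent_divergence(60):.2f}%")

# Invented toy barcodes, far shorter than a real 648-bp barcode.
references = {"species_1": "ACGTACGTACGT", "species_2": "ACGTTCGAACCT"}
match = identify("ACGTACGTACGA", references)
print("best match:", match)
```

The gap between roughly 0.3% within a species and roughly 9% between species is what makes the barcode useful: an unknown sample should sit much closer to its own species than to any other.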
One drawback to the COI barcode is that plant mitochondrial DNA sequences show much less variation than
animal sequences do, so the COI barcode will not work
well for plants. Instead, a consortium of plant systematists,
known as the Plant Working Group of CBOL, has proposed using sequences from two chloroplast genes (matK
and rbcL) for the plant barcode. This is not a perfect solution, as this barcode works better for some plant species
than for others. But it has correctly identified 72% of all
plant species, and has a perfect record in placing plants in
the proper genus.
In Richard Preston’s novel, The Cobra Event, a
deranged man creates a very nasty virus and releases it in
New York City. But scientists in the book have an invaluable
tool for detecting such agents—a handheld device that
almost instantly identifies microbes. We are clearly not at
that point yet. However, someday it may be possible to
miniaturize DNA sequencers to the point that they could
be used as field devices for quick identification, via barcodes,
of unknown organisms.
SUMMARY A movement has begun to create a
barcode to identify any species of life on earth.
The first “barcode of life” will consist of the sequence of a 648-bp piece of the mitochondrial
COI gene from each organism. This sequence is
sufficient to uniquely identify almost any animal.
Other sequences, or barcodes, are being worked
out for plants.
SUMMARY
Several methods are available for identifying the genes
in a large, unsequenced DNA region. One of these is
the exon trap, which uses a special vector to help clone
exons only. Another is to use methylation-sensitive
restriction enzymes to search for CpG islands—DNA
regions containing unmethylated CpG sequences.
Before the genomics era, geneticists mapped the
Huntington disease gene (HD) to a region near the end
of chromosome 4. Then they used an exon trap to
identify the gene itself.
Rapid, automated DNA sequencing methods have
allowed molecular biologists to obtain the base sequences
of viruses and organisms ranging from simple phages to
bacteria to yeast, simple animals, plants, mice, and
humans. Much of the mapping work in the Human
Genome Project was done with yeast artificial
chromosomes (YACs), vectors that contain a yeast origin
of replication, a centromere, and two telomeres. Foreign
DNA up to 1 million bp long can be inserted between the
centromere and one of the telomeres. It will then replicate
along with the YAC. On the other hand, because of their
superior stability and ease of use, most of the sequencing
work in the Human Genome Project was done with
bacterial artificial chromosomes (BACs). BACs are
vectors based on the F plasmid of E. coli. They can accept
inserts up to about 300 kb, but their inserts average
about 150 kb.
Mapping the human genome, or any large genome,
requires a set of landmarks (markers) to which one
can relate the positions of genes. Genes can be used
as markers in mapping, but markers are usually
anonymous stretches of DNA such as RFLPs, VNTRs,
STSs (including ESTs), and microsatellites. RFLPs
(restriction fragment length polymorphisms) are
differences in the lengths of restriction fragments
generated by cutting the DNA of two or more different
individuals with a restriction endonuclease. RFLPs can
be caused by the presence or absence of a restriction site
in a particular place or insertions and deletions between
restriction sites. They can also be caused by a variable
number of tandem (head-to-tail) repeats (VNTRs)
between two restriction sites. STSs (sequence-tagged
sites) are regions of DNA that can be identified by
formation of a predictable length of amplified DNA by
PCR with pairs of primers. ESTs (expressed sequence
tags) are a subset of STSs generated from cDNAs, so
they represent expressed genes. Microsatellites are a
subset of STSs generated by PCR with pairs of primers
flanking tandem repeats of just a few nucleotides
(usually 2–4 nt).
Radiation hybrid mapping allows mapping of STSs
and other markers that are too far apart to fit on one
BAC. In radiation hybrid mapping, human cells are
irradiated to break chromosomes, then these dying cells
are fused with hamster cells. Each hybrid cell has a
different subset of human chromosome fragments. The
closer together two markers are, the more likely they are
to be found in the same hybrid cell.
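The logic of radiation hybrid mapping, that nearby markers tend to be retained together, can be sketched with a toy panel. The panel, the marker names, and the scoring rule are all assumptions for illustration. Real radiation hybrid maps are built with statistical models of chromosome breakage, not with this simple ratio.

```python
from itertools import combinations

# Invented panel: each set lists the markers detected in one hybrid cell line.
panel = [
    {"STS1", "STS2"},
    {"STS1", "STS2", "STS3"},
    {"STS3"},
    {"STS1", "STS2"},
    {"STS2", "STS3"},
    set(),
]


def co_retention(a: str, b: str, hybrids: list) -> float:
    """Among hybrids carrying either marker, the fraction carrying both."""
    either = [h for h in hybrids if a in h or b in h]
    if not either:
        return 0.0
    both = sum(1 for h in either if a in h and b in h)
    return both / len(either)


markers = sorted(set().union(*panel))
ranked = sorted(combinations(markers, 2),
                key=lambda pair: co_retention(pair[0], pair[1], panel),
                reverse=True)
for a, b in ranked:
    print(a, b, co_retention(a, b, panel))
```

In this toy panel STS1 and STS2 are retained together most often and STS1 and STS3 least often, which would suggest the order STS1, STS2, STS3.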
Massive sequencing projects can take two forms: The
map-then-sequence (clone-by-clone) approach or the
shotgun approach. Actually, a combination of these
methods was used to sequence the human genome. The
clone-by-clone strategy calls for production of a physical
map of the genome including STSs, then sequencing the
overlapping clones (mostly BACs) used in the mapping.
This places the sequences in order so they can be pieced
together. The shotgun strategy calls for the assembly of
libraries of clones with different size inserts, then
sequencing the inserts at random. This method relies on a
computer program to find areas of overlap among the
sequences and piece them together.
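The overlap-finding step of the shotgun strategy can be sketched as a greedy merge. This is a minimal illustration, not the program used for the human genome: the reads are invented and error-free, the minimum overlap of three bases is arbitrary, and a greedy merge like this one can be misled by repeated sequences.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that is also a prefix of b
    (0 if the overlap is shorter than min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)]
        contigs.append(merged)


# Invented reads sampled at random positions from one short sequence.
reads = ["ACGTTAGC", "ATGGCGTA", "GCGTACGTT"]
contigs = greedy_assemble(reads)
print(contigs)
```

Here the three reads merge into a single contig. With real data, sequencing errors and repeats mean the result is usually many contigs separated by gaps, which is one reason the map-based information remained valuable.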
Sequencing of human chromosome 22q has revealed:
(1) gaps that cannot be filled with available methods;
(2) 855 annotated genes; (3) the great bulk (about 97%)
of the chromosome is made up of noncoding DNA;
(4) over 40% of the chromosome is in interspersed repeats
such as Alu sequences and LINEs; (5) the rate of
recombination varies across the chromosome, with long
regions of low rates of recombination punctuated by
short regions with relatively high rates; (6) several
examples of local and long-range duplications; and (7) large
regions in which linkage among genes has been conserved
with that on seven different mouse chromosomes.
The working drafts of the human genome reported by
two separate groups suggested that the genome
probably contains fewer genes than anticipated. About
half of the genome derives from the action of
transposons, and transposons themselves have contributed
dozens of genes to the genome. In addition, bacteria
appear to have donated at least dozens of genes. The
finished draft of the human genome is much more
accurate and complete than the working drafts, but it still
contains some gaps. On the basis of the finished draft,
geneticists estimate that the genome contains about
20,000–25,000 genes. The finished draft also gives
valuable information about human evolution.
Comparing the human genome to that of other
vertebrates has already taught us much about the
similarities and differences among genomes. Such
comparisons have also helped to identify many human
genes. In the future, such comparisons will help find the
genes that are defective in human genetic diseases. One
can also use closely related species like the mouse to find
when and where their genes are expressed and therefore
to estimate when and where the corresponding human
genes are expressed. Detailed comparison of mouse and
human chromosomes has revealed a high degree of
synteny between the two species.
It is possible to define the essential gene set of a
simple organism by mutating one gene at a time to see
which genes are required for life. It is also possible to
define the minimal genome—the set of genes that is the
minimum required for life. It is likely that this minimal
genome is larger than the essential gene set. In principle,
it is also possible to place this minimal genome into
a cell lacking genes of its own and thereby create a
new form of life that can live and reproduce under
laboratory conditions. With selected genes added,
such a life form could be modified to perform many
useful tasks.
A movement has begun to create a barcode to identify
any species of life on earth. The first “barcode of life”
will consist of the sequence of a 648-bp piece of the
mitochondrial COI gene from each organism. This
sequence is sufficient to uniquely identify almost any
animal. Other sequences, or barcodes, are being worked
out for plants.
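Barcode identification amounts to comparing a query sequence with a library of reference barcodes. The sketch below assumes that all sequences are already aligned to the same length and uses simple percent identity; the 10-nt sequences and species names are invented stand-ins for real 648-bp COI barcodes, and the function names are hypothetical.

```python
def percent_identity(a, b):
    """Percentage of positions at which two aligned sequences agree."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)

def best_match(query, references):
    """Return (name, identity) for the reference most similar to the query."""
    name, seq = max(references.items(),
                    key=lambda item: percent_identity(query, item[1]))
    return name, percent_identity(query, seq)

# Invented reference library (real barcodes would be 648 bp long).
references = {
    "species_1": "ATGCATGCAT",
    "species_2": "ATGGATCCAT",
}
print(best_match("ATGCATGCAA", references))
```

A practical scheme would also need an identity threshold below which a specimen is flagged as a possible new species rather than assigned to its nearest neighbor.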
REVIEW QUESTIONS
1. What is a CpG island? Why have CpG sequences tended to
disappear from the human genome?
2. a. What kind of mutation gave rise to Huntington disease?
b. What is the evidence that the gene identified as HD is
really the gene that causes HD?
3. What is an open reading frame (ORF)? Write a DNA
sequence containing a short ORF.
4. What are the essential elements of a YAC vector?
5. On what plasmid are the BAC vectors based? What
essential elements do they contain?
6. Describe the procedure for finding an STS in a genome.
7. Describe microsatellites and minisatellites. Why are
microsatellites better tools for linkage mapping than
minisatellites?
8. Show how to use STSs in a set of BAC clones to form a
contig. Illustrate with a diagram different from the one
given in the text.
9. Describe the use of radiation hybrid mapping to
map STSs.
10. How does an expressed sequence tag (EST) differ from an
ordinary STS?
11. Compare and contrast the clone-by-clone sequencing
strategy and the shotgun sequencing strategy for large
genomes.
12. What major conclusions can we draw from the sequence of
human chromosome 22?
13. What is a pseudogene?
14. What is the difference between an ortholog and a paralog?
15. How do scientists estimate the number of genes in complex
eukaryotes like humans?
16. The tiger pufferfish (Fugu rubripes) genome is nine times
smaller than the human genome, but it contains just as
many genes. How can that be?
17. What do we mean by “syntenic regions” in the mouse and
human genomes?
18. Humans appear to have about as many protein-encoding
genes as roundworms. How do you explain the lack of
correspondence between the apparent numbers of genes and
the complexities of these two organisms?
19. What is the difference between an organism’s “essential
gene set” and its “minimal genome”?
ANALYTICAL QUESTIONS
1. Will the following DNA fragments be detected by an exon
trap? Why or why not?
a. An intron
b. Part of an exon
c. A whole exon with parts of introns on both sides
d. A whole exon with part of an intron on one side
2. The following is a physical map of a region you are
mapping by RFLP analysis.
[Figure: physical map with four restriction sites numbered 1 to 4; sites 1 and 2 are 2 kb apart, sites 2 and 3 are 3 kb apart, and sites 3 and 4 are 1 kb apart. A bar above the map marks the extent of the probe.]
The numbered vertical lines represent restriction sites recognized by SmaI. The circled sites (2 and 3) are polymorphic;
the others are not. You cut the DNA with SmaI, electrophorese the fragments, blot them to a membrane, and probe
with a DNA whose extent is shown at top. Give the sizes of
fragments you will detect in individuals homozygous for the
following haplotypes with respect to sites 2 and 3.
Haplotype   Site 2    Site 3    Fragment sizes
A           Present   Present
B           Present   Absent
C           Absent    Present
D           Absent    Absent
3. You are mapping the gene responsible for a human genetic
disease. You find that the gene is linked to a RFLP detected
with a probe called X-21. You hybridize labeled X-21 DNA
to DNAs from a panel of mouse–human hybrid cells.
The following shows the human chromosomes present
in each hybrid cell line, and whether the probe hybridized
to DNA from each. Which human chromosome carries
the disease gene?
Cell line   Human chromosome content   Hybridization to X-21
A           1, 5, 21                   +
B           6, 7                       −
C           1, 22, Y                   −
D           4, 5, 18, 21               +
E           8, 21, Y                   −
F           2, 5, 6                    +
4. You have just obtained the sequence of the genome of an
organism that has been the subject of considerable genetic
study. Describe how you would identify genomic regions
that have experienced high rates of recombination. Explain
the reasoning behind your approach.