94 243 Studying and Comparing Genomic Sequences
wea25324_ch24_759-788.indd Page 774 774 22/12/10 9:02 AM user-f467 Volume/204/MHDQ268/wea25324_disk1of1/0073525324/wea25324_pagefile Volume/204/MHDQ268/wea25324_disk1of1/0073525324/wea25324_pagefiles Chapter 24 / Introduction to Genomics: DNA Sequencing on a Genomic Scale spanning large DNA regions, so they can help to organize the smaller cloned fragments. This job was also facilitated by the physical maps, especially the STS maps, that were already available. So the shotgun strategy for sequencing the human genome was in practice a hybrid of a pure shotgun and a map-then-sequence strategy. Any strategy to sequence over 3 billion bp depends on a high-volume, low-cost sequencing method. We now have sequencing devices that perform electrophoresis of DNA fragments in capillary tubes instead of the traditional thin gel slabs. These instruments are fully automated and each can handle about 1000 samples per day with only 15 min of human attention. Another of Venter’s companies, The Institute for Genomic Research (TIGR), had 230 such instruments; together, they could produce about 100 Mb of DNA sequence every day, with a relatively low labor cost. 24.3 Studying and Comparing Genomic Sequences Once a genomic sequence is in hand, scientists can mine it for the wealth of information it contains. They can also compare it to the sequences of other genomes to shed light on the evolution of these species. We will begin this section with a discussion of the human genome, and then compare it with the genomes of closely related, and then more distantly related organisms. The Human Genome Sequencing Standards At the end of 1999, we tasted the first fruit of the Human Genome Project: The final draft of human chromosome 22. In February 2001, the Venter group and the public consortium each published their versions of a working draft of the whole human genome. In 2004, the international consortium announced the finished sequence of the euchromatic part of the human genome. 
In this section we will look at the lessons we learned from the finished sequence of chromosome 22 (the first human chromosome to be sequenced), and the working draft and finished sequence of the whole genome. Before we begin, one lesson worth noting is that the finished sequences came from the more orderly clone-by-clone approach. This strategy yields the final draft sequences of whole chromosomes as soon as the groups sequencing each chromosome complete their work. On the other hand, the raw sequence in the shotgun sequencing approach is not pieced together until the very end, when the computer finds the overlaps necessary to build contigs. Thus, this strategy may not yield the final draft sequence of any chromosome until the whole genome is finished. What do we mean by “rough draft,” or “working draft,” and “final draft” of a genome? That depends on whom you ask. Most investigators agree that a working draft may be only 90% complete and may have an error rate of up to 1%. Although there is less agreement about what qualifies as a final draft, there is consensus that it should have an error rate of less than 1/10,000 (0.01%) and should have as few gaps as possible. Some molecular biologists insist that a genome is not completely sequenced until every last gap is filled, but it would be very difficult to eliminate all gaps in the human genome. As we will see in the next section, some regions of DNA, for mostly unexplained reasons, resist cloning. The cost of overcoming the obstacles to cloning these regions will likely be prohibitively high, so the task of filling in the last few million bases of the human genome may never be done. As detailed in the next section, the consortium that sequenced human chromosome 22 decided that their sequence was “functionally complete” when they had obtained all the sequence possible with the cloning and sequencing tools currently available, even though significant gaps remained. 
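To put the draft standards described above in perspective, the following Python sketch converts the quoted error rates into an expected number of erroneous bases. It is only an illustration: the 3 billion bp genome size is the round figure used in this chapter, and the function name is hypothetical.

```python
# Rough arithmetic on the draft standards quoted above.
# Assumption: a genome of about 3 billion bp, the round figure used in the text.
GENOME_BP = 3_000_000_000


def expected_errors(length_bp, error_rate):
    """Expected number of wrong bases in a sequence of the given length."""
    return round(length_bp * error_rate)


# Working draft: perhaps 90% complete, error rate of up to 1%.
working_draft_errors = expected_errors(GENOME_BP * 0.90, 0.01)

# Final draft: error rate below 1 in 10,000 (0.01%).
final_draft_errors = expected_errors(GENOME_BP, 1 / 10_000)

print(f"Working draft: up to {working_draft_errors:,} erroneous bases")
print(f"Final draft:   fewer than {final_draft_errors:,} erroneous bases")
```

Under these assumptions, the working-draft standard tolerates tens of millions of erroneous bases, whereas the final-draft standard tolerates a few hundred thousand at most.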
Chromosome 22 In reality, only the long arm (22q) of the chromosome was sequenced; the short arm (22p) is composed of pure heterochromatin and is thought to be devoid of genes. Also, 11 gaps remained in the sequence. Ten of these were gaps between contigs that could not be filled with clones—presumably due to “unclonable” DNA. The other corresponded to a 1.5-kb region of cloned DNA that resisted sequencing. The reasons that some DNAs, sometimes called “poison regions,” are unclonable are not completely clear, but it is known that DNAs with unusual secondary structure or repetitive sequences are frequently lost from bacterial cells. This is one reason that heterochromatin (Chapter 13) is very poorly represented, even in the final draft of the human genome. It is found primarily at the centromeres and near the telomeres of chromosomes and is rich in repetitive sequences. By failing to sequence the heterochromatin in the genome, scientists are not missing very many, if any, genes, because genes are not thought to reside there. But there could be other interesting aspects of these heterochromatin regions that will be missed. SUMMARY Massive sequencing projects can take two forms: (1) In the map-then-sequence strategy, one produces a physical map of the genome including STSs, then sequences the clones (mostly BACs) used in the mapping. This places the sequences in order so they can be pieced together. (2) In the shotgun approach, one assembles libraries of clones with different size inserts, then sequences the inserts at random. This method relies on a computer program to find areas of overlap among the sequences and piece them together. In practice, a combination of these methods was used to sequence the human genome. 
What did we learn from the first completed sequence of chromosome 22? Several findings were interesting. First, we are going to have to learn to live with gaps in our sequence of the human genome, although perhaps not as many as first appeared in the sequence of this chromosome. Already by the summer of 2000, one of the gaps had been filled, and by December of 2010, only four gaps remained, not counting the short arm of the chromosome. Still, the same problems encountered in spanning the gaps in chromosome 22 bedeviled investigators sequencing the other chromosomes. Table 24.2 lists the sequenced contigs in chromosome 22 and the gaps between them as of 1999. The contigs accounted for 33,464 kb, or about 97% of the long arm of the chromosome, and they were sequenced with very high accuracy, estimated at less than one error per 50,000 bases. It is interesting that all of the gaps occurred in the regions of the chromosome close to the centromere and telomeres. Between gaps 4 and 5 was an enormous contig composed of 23,006 kb that covered more than two-thirds of chromosome 22q. By December of 2010, 34,894,566 bases of chromosome 22q had been sequenced.

Table 24.2 Chromosome 22 Contigs and Gaps as of 1999

Contig   Gap   Size (kb)
1              234
         1     1.9
2              406
         2     ~150
3              1394
         3     ~150
4              1790
         4     ~100
5              23,006
         5     ~50
6              767
         6     ~50–100
7              1528
         7     ~150
8              2485
         8     ~50
9              190
         9     ~100
10             993
         10    ~100
11             291
         11    ~100
12             380
Total sequence length    33,464
Total length of 22q      34,491

(Source: Adapted from Dunham, I., N. Shimizu, B.A. Roe, S. Chissoe, A.R. Hunt, J.E. Collins, et al., The DNA sequence of human chromosome 22. Nature 402:491, 1999.)
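The coverage figures quoted for Table 24.2 can be checked with a few lines of arithmetic. The Python sketch below uses the twelve contig sizes and the total length of 22q from the table; the variable names are illustrative only.

```python
# Contig sizes (kb) from Table 24.2, in order from contig 1 to contig 12.
contigs_kb = [234, 406, 1394, 1790, 23006, 767, 1528, 2485, 190, 993, 291, 380]

# Total length of 22q (kb), including the estimated gaps, from the same table.
total_22q_kb = 34491

sequenced_kb = sum(contigs_kb)                # total sequence length
coverage = sequenced_kb / total_22q_kb        # fraction of 22q in contigs
largest_kb = max(contigs_kb)                  # the contig between gaps 4 and 5
largest_share = largest_kb / total_22q_kb     # its share of 22q

print(f"Sequenced: {sequenced_kb:,} kb ({coverage:.1%} of 22q)")
print(f"Largest contig: {largest_kb:,} kb ({largest_share:.1%} of 22q)")
```

The contig sizes add up to the 33,464 kb given in the table, which is about 97% of the 34,491 kb long arm, and the 23,006-kb contig alone accounts for just over two-thirds of 22q.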
The second major finding was that chromosome 22 was estimated to contain 679 annotated genes (genes or gene-like sequences that were at least partially identified). These can be categorized as follows: known genes, whose sequences are identical to known human genes or to the sequences deduced from known human proteins; related genes, whose sequences are homologous to known genes of human or other species or which have regions of similarity to known genes; predicted genes, which contain sequences homologous to ESTs (so we are fairly sure they are expressed); and pseudogenes, whose sequences are homologous to known genes, but they contain defects that preclude proper expression. There were 247 known genes, 150 related genes, 148 predicted genes, and 134 pseudogenes in chromosome 22q. Thus, not counting the pseudogenes, there were 545 annotated genes. Computer analysis of the sequence predicted another 325 genes, but such analyses are still very inaccurate because the algorithms depend on finding exons, and the many long introns in human genes make exons hard to spot. As of December, 2010, 855 genes had been found in chromosome 22q, including pseudogenes. The third major finding was that the coding regions of genes accounted for only a tiny fraction of the length of the chromosome. Even counting introns, the annotated genes accounted for only 39% of the total length of 22q, and the exons accounted for only 3%. By contrast, fully 41% of 22q is devoted to repeat sequences, especially Alu sequences and LINEs (Chapter 23). Table 24.3 lists the interspersed repeat elements found in chromosome 22 and their prevalences. A fourth major finding was that the rate of recombination varied across the chromosome, with long regions in which recombination is relatively low interspersed with short regions of relatively high rates of recombination (Figure 24.11). 
As we have seen earlier in this chapter, geneticists had already made a genetic map of the human genome, including chromosome 22, based on microsatellites. This map was based on recombination frequencies between microsatellites and was therefore calibrated in centimorgans. The chromosome 22 sequencing team was able to find these microsatellites in the sequence and measure the real physical distance between them. Figure 24.11 shows that a plot of the genetic distance between markers versus the physical distance between the same markers is not linear. The numbers indicate regions of high rates of recombination, and therefore high apparent genetic distance, separated by longer regions of relatively low rates of recombination. The average ratio of genetic distance to physical distance in this chromosome is 1.87 cM/Mb. Of course, we should remember that the y axis represents cumulative genetic distance, that is, the sum of the distances between closely spaced markers. The actual genetic distance between widely separated markers is not the same as the sum of the distances between intervening markers. That is because multiple recombination events are more probable between distant markers, which makes them appear closer together than they really are (Chapter 1).

Table 24.3 Repetitive DNA Content of Human Chromosome 22

Type                    Number   Total base pairs   % of chromosome
Alu                     20,188   5,621,998          16.80
HERV                    255      160,697            0.48
LINE 1                  8043     3,256,913          9.73
LINE 2                  6381     1,273,571          3.81
LTR                     848      256,412            0.77
MER                     3757     763,390            2.28
MIR                     8426     1,063,419          3.18
MLT                     2483     605,813            1.81
THE                     304      93,159             0.28
Other                   2313     625,562            1.87
Dinucleotide            1775     133,765            0.40
Trinucleotide           166      18,410             0.06
Tetranucleotide         404      47,691             0.14
Pentanucleotide         16       1612               0.0048
Other tandem repeats    305      102,245            0.31
Total                   55,664   14,024,657         41.91

(Source: Adapted from Dunham, I., N. Shimizu, B.A. Roe, S. Chissoe, A.R. Hunt, J.E. Collins, et al., The DNA sequence of human chromosome 22. Nature 402:491, 1999.)

Figure 24.11 Genetic distance plotted against physical distance in chromosome 22q. The cumulative genetic distance between markers (in cM) is graphed versus the physical distance between the same markers (in Mb). The numbers denote four areas of relatively high rates of recombination (as reflected in the steeply rising curves). (Source: Adapted from Dunham, I., N. Shimizu, B.A. Roe, S. Chissoe, A.R. Hunt, J.E. Collins, et al. (The Chromosome 22 Sequencing Consortium), The DNA sequence of human chromosome 22. Nature 402:492, 1999.)

A fifth major finding was that chromosome 22q had several local and long-range duplications. The most obvious involved the immunoglobulin λ locus. Clustered together at this locus are 36 gene segments that are at least potentially able to encode λ variable regions (V-λ gene segments), as well as 56 V-λ pseudogenes and 27 partial V-λ pseudogenes known as “relics.” Other duplications are separated by long distances. In one striking example, a 60-kb region is duplicated with greater than 90% fidelity almost 12 Mb away. Compared with the interspersed repeats, such as Alu sequences and LINEs, these duplications are found in few copies, so they are known as low-copy repeats or LCRs. Seven of the eight previously described LCR22s in the centromeric end of 22q were sequenced; the eighth (LCR22-1) probably lies in the sequence gap closest to the centromere.

The sixth major finding was that large chunks of human chromosome 22q are conserved in several different mouse chromosomes.
The sequencing team found 113 human genes whose mouse orthologs had been mapped to mouse chromosomes. (Orthologs are homologous genes in different species that have evolved from a common ancestral gene. Paralogs, by contrast, are homologous genes that have evolved by gene duplication within a species. Homologs are any kind of homologous genes, orthologs or paralogs.) These mouse orthologs clustered into eight regions on seven different mouse chromosomes, as shown in Figure 24.12. The mouse chromosomes represented in human 22q are chromosomes 5, 6, 8, 10, 11, 15, and 16. Mouse chromosome 10 is represented in two regions of human 22q. As the two species have diverged, their chromosomes have rearranged, but linkage among many markers has been preserved in syntenic blocks. (The preservation of gene order between two species is known as synteny.) Clearly, our knowledge of the sequence of the human genome has sped the sequencing of the mouse genome.

Figure 24.12 Regions of conservation between human and mouse chromosomes. Human chromosome 22 is depicted on the left, with the centromere near the top, and prominent bands in white and brown. Seven different mouse chromosomes contain syntenic blocks (orthologs in conserved order) and these are shown on the right. Colors correspond to the mouse chromosomes listed at far right. The blocks, listed from the centromere downward, are:

Size of syntenic block (Mb)   Mouse chromosome
1.727                         6
4.064                         16
0.989                         10
2.549                         5
2.830                         11
0.061                         10
2.121                         8
15.401                        15

(Source: Adapted from Dunham, I., N. Shimizu, B.A. Roe, S. Chissoe, A.R. Hunt, J.E. Collins, et al. (The Chromosome 22 Consortium), The DNA sequence of human chromosome 22. Nature 402:494, 1999.)
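The block listing in Figure 24.12 can be tallied to confirm the counts given in the text: eight syntenic regions on seven mouse chromosomes, with chromosome 10 appearing twice. The Python sketch below is only a bookkeeping illustration; it assumes the block sizes are in megabases, as the figure's column heading indicates.

```python
from collections import Counter

# (mouse chromosome, block size) pairs from Figure 24.12, listed from the
# centromere of human 22q downward. Sizes are assumed to be in Mb.
syntenic_blocks = [
    (6, 1.727),
    (16, 4.064),
    (10, 0.989),
    (5, 2.549),
    (11, 2.830),
    (10, 0.061),
    (8, 2.121),
    (15, 15.401),
]

# How many blocks does each mouse chromosome contribute?
blocks_per_chromosome = Counter(chrom for chrom, _ in syntenic_blocks)

n_blocks = len(syntenic_blocks)
mouse_chromosomes = sorted(blocks_per_chromosome)
total_mb = round(sum(size for _, size in syntenic_blocks), 3)

print(f"{n_blocks} blocks on {len(mouse_chromosomes)} mouse chromosomes: {mouse_chromosomes}")
print(f"Chromosomes with more than one block: "
      f"{[c for c, n in blocks_per_chromosome.items() if n > 1]}")
print(f"Total length in syntenic blocks: {total_mb} Mb")
```

The blocks together span a little under 30 Mb, which is plausible for a long arm of roughly 34.5 Mb and is one reason to read the sizes as megabases.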
SUMMARY Human chromosome 22q has been sequenced to high accuracy, but the sequence still has 10 gaps that cannot be filled with available methods. There are 679 annotated genes, but the great bulk of the chromosome is made up of noncoding DNA, over 40% of it in interspersed repeats such as Alu sequences and LINEs. The rate of recombination varies across the chromosome, with long regions of low rates of recombination punctuated by short regions with relatively high rates. The chromosome contains several examples of local and long-range duplications. The human chromosome contains large regions where linkage among genes has been conserved with that in seven different mouse chromosomes.

Working Draft and Finished Version of the Human Genome

In February 2001, the Venter group and the public consortium published separately their own versions of the working draft of the whole human genome. The drafts of the human genome presented by the two groups were by no means complete. They had many gaps and inaccuracies, but they also contained a wealth of information that kept scientists busy for years analyzing and extending it. Furthermore, the public draft continued to improve as groups working on its separate parts completed the laborious finishing phase that eliminates gaps and corrects errors.

The most striking discovery from both groups was the low number of genes in the genome. The Venter group found 26,588 genes for which there were at least two independent lines of evidence, and about 12,000 more potential genes. These potential genes were identified computationally, but there were no other supporting data. Venter and colleagues assumed that most of these latter sequences were false positives. The public consortium estimated that the human genome contains 30,000–40,000 genes. As we will see later in this section, the estimate from the finished human genome sequence is even lower: fewer than 23,000 genes.
Thus, contrary to earlier estimates, the number of human genes seems to be scarcely larger than the number of genes in a lowly roundworm or a fruit fly. Clearly, the complexity of an organism is not directly proportional to the number of genes it contains. How then can we explain human complexity? One emerging explanation is that the expression of genes in humans is more complex than it is in simpler organisms. For example, it is estimated that at least 40% of human transcripts experience alternative splicing (Chapter 14). Thus, a relatively small number of gene regions encoding domains and motifs of proteins can be shuffled in different ways to give a rich variety of proteins with different functions. Moreover, posttranslational modification of proteins in humans seems more complex than that in simpler organisms, and this also gives rise to a greater variety of protein functions. Another important finding is that about half of the human genome appears to have come from transposable elements duplicating themselves and carrying human DNA from place to place within the genome (Chapter 23). However, even though transposons have contributed so greatly to the genome, the vast majority of them are now inactive. In fact, all of the non-retrotransposons are inactive, and all of the LTR-containing retrotransposons seem to be. On the other hand, as we learned in Chapter 23, a few L1 transposons are still active in the human genome and continue to contribute to human disease. Dozens of human genes appear to have come via horizontal transmission from bacteria, and some others came from new transposons entering human cells. Thus, the human genome has been shaped not entirely by internal mutations and rearrangements, but also by importation of genes from the outside world. 
The total size of the human genome appears to be close to the 3 billion bp (3 Gb) predicted for many years. The Venter group sequenced about 2.9 Gb, and the public consortium predicted that the total size of the genome is about 3.2 Gb.

As mentioned earlier in this chapter, the international consortium of labs sequencing the human genome announced in the spring of 2003 that they had produced a finished draft of the human genome, two years earlier than originally planned. They published their work in 2004. The major advantages of this version over the rough draft were:

1. It was more complete. Ninety-nine percent of the sequence that was possible to obtain had been obtained: 2,851,330,913 base pairs, or about 2.85 gigabase pairs (Gb) worth.
2. It was more accurate. The inaccuracy rate was a tiny 0.001%, and all the sequences were in the proper order.

However, there were still 341 gaps, though 33 of those were in the heterochromatic regions of the genome, which were not a target of this project. Still, biologists generally concede that we will have to live with many of these gaps, perhaps forever. Also, in spite of the polish on the finished product, annotation is still difficult, and we still do not know the real number of genes in the human genome. The international consortium found 22,287 protein-encoding genes (19,438 known genes and 2188 predicted genes), considerably fewer than estimated by both rough drafts of the human genome. The difference appears to be largely due to earlier double counting of apparent genes that actually map to the same true gene. The estimated number of human genes has mostly decreased with time, at least if one includes only protein-encoding genes.
In 2007, Michele Clamp reported an estimate of only 20,488 human genes, and allowed that a hundred or so remained undiscovered. She approached the question from a bioinformatics angle, using only computational tools. For example, she looked in a database called Ensembl for human genes and then compared those with counterparts in the dog and mouse genomes. This check of the presumed human genes showed that 19,209 really do code for proteins, while 3009 were on the list by mistake. Another 1177 putative genes remained in doubt, so Clamp analyzed them by comparing them to random DNA sequences for qualities of “geneness,” such as genelike GC contents. All but 10 failed this test, yielding an estimate of 19,219 genes. Combining this with similar analyses of two other databases yielded a final estimate of 20,488.

What else have we learned from the finished draft? Here are a few examples: The estimated 22,287 genes appear to give rise to 34,214 transcripts, or about 1.5 per gene. These genes are represented by 231,667 exons, or about 10.4 per gene. The amount of DNA included in all these exons is just 34 Mb, which is only 1.2% of the euchromatic part of the human genome. This confirms something we already knew: The vast majority of the human genome does not contain protein-encoding genes. Some of it codes for useful RNAs, such as rRNAs, tRNAs, snRNAs, and miRNAs that are, of course, not translated. But the bulk of it appears not to be transcribed at all, and its functions, if any, remain a mystery.

The finished draft is also a great aid in the study of human evolution. First, it reveals newly duplicated genes that provide the raw material for new genes with new functions: One gene in the pair can retain its original function, but the other is free to collect mutations and evolve new activities, without compromising the original activity, which may be essential to life. Second, the finished draft reveals newly inactivated genes, or pseudogenes.
The search for pseudogenes began with a comparison of the rat, mouse, and human genomes to find strings of genes that were found in all three organisms. Then the investigators looked for genes within this string that were present in the rodents, but not in the human. Finally, they examined the region in the human genome predicted to contain these missing genes. They found 37 candidate pseudogenes that were still clearly recognizable, though they had all been inactivated. On average, each pseudogene had 0.8 premature stop codons and 1.6 frameshifts. Either of these types of mutation would have rendered the gene inactive. It is clear that these genes must not be essential to human life, though they presumably were to the common ancestor of humans, rats, and mice, and may still be to the rodents. To verify that these apparent pseudogenes were really what they appeared to be, the investigators went back and sequenced 34 of them. In 33 cases, the inactivations were real, and in one case the apparent inactivation was due to a sequencing error. Then they compared these 33 sequences to the corresponding sequences in the chimpanzee genome. Nineteen of these pseudogenes had two or more inactivating mutations, and these were all pseudogenes in the chimpanzee as well. The other 14, with just one inactivating mutation, were more interesting. Eight of these were pseudogenes in the chimpanzee, but five were functional genes in the chimpanzee, and one is a polymorphism (present as a pseudogene in a fraction of the human population, but as a functional gene in the others). Thus, we can see the traces of gene inactivation through evolutionary time— since the rodent and human lineages diverged, and since the chimpanzee and human lineages diverged. SUMMARY The working draft of the human genome reported by two separate groups allowed estimates that the genome contains fewer genes than anticipated. 
About half of the genome has derived from the action of transposons, and transposons themselves have contributed dozens of genes to the genome. In addition, bacteria appear to have donated at least dozens of genes. The finished draft of the human genome is much more accurate and complete than the working draft, but it still contains some gaps. On the basis of the finished draft, geneticists estimate that the genome contains about 20,000–25,000 genes. The finished draft also gives valuable information about gene birth and death during human evolution.

Personal Genomics

By 2007, two groups had used traditional sequencing techniques to sequence the genomes of two major players in the human genome project, James Watson and Craig Venter. By 2008, two different groups used high-throughput sequencing to sequence the genomes of two non-Caucasian individuals, one of Nigerian descent, and one of Han Chinese descent. The addition of the genomes of these two individuals to the two previously sequenced genomes of individuals of European descent added more diversity to the growing pool of human genomes. One can detect millions of SNPs, hundreds of thousands of insertions and deletions, and thousands of structural variants among the four genomes. By 2010, several more individual genomes had been sequenced, including a European (French), a Southern African (San) and a Papua New Guinean.

As the speed and economy of DNA sequencing have improved, it has become possible to envision sequencing the genome of anyone who wants it and who is willing to pay the cost. The goal (with a significant cash prize) is to sequence a whole human genome for $1000.
No one has claimed the prize yet, but high-throughput sequencing techniques (Chapter 5) are making it seem feasible that millions of people will one day have their whole genomic sequence on a flash drive, or whatever data storage medium is popular at that time. That wealth of information is bound to be valuable, but it also will create ethical problems.

Other Vertebrate Genomes

The complete sequences of the mouse and a pufferfish (the tiger pufferfish, Fugu rubripes) have been published. What lessons have these genomes taught us? Here are some of the most important:

The Fugu genome was chosen for sequencing because it is a vertebrate with a much smaller genome than human, only one-ninth the size. But despite the difference in size, the two genomes have about the same number of genes (31,059 predicted genes in Fugu). The difference lies, not in gene content, but in the size of introns and amount of repetitive DNA. The Fugu genome has much smaller introns than the human, and much less repetitive DNA. Comparing the Fugu and human genomes has allowed genomics researchers to identify 1000 human genes. Because genetic mutations that cause human diseases are more likely to occur at important sites in genes, and because these important sites are especially well conserved, comparing two relatively distantly related vertebrate genomes, such as human and Fugu, should help identify these important sites. The mouse genome is not as useful for this purpose because it is relatively similar to the human genome. There simply has not been enough time for the mouse and human genomes to diverge very far, and many sites, not just important ones, have been conserved.

The mouse genome is a little smaller than the human, about 2.5 Gb compared with about 3 Gb, but both organisms have about the same number of genes, and a high percentage of these are the same in the two organisms: 99% of mouse genes have a counterpart we can identify in humans.
This 1% difference is obviously much too little to account for the biological differences between humans and mice, so something besides sheer DNA sequence must be at work. Preliminary studies suggest that it is the control of the genes, not the genes themselves, that plays the biggest role in distinguishing humans from mice. Knowing the great similarity in genomic structure between mice and humans, scientists can use the mouse as a human surrogate in which to do experiments they could not do in humans. For example, they can knock out genes in mice and observe the effects. The results give us clues about what the homologous genes do in humans. Molecular biologists can also examine the expression patterns of mouse genes to learn when and where these genes are expressed during development and in adults. Again, these results give information about the expression of homologous genes in humans. By the beginning of 2003, some of the best studies comparing the human and mouse genomes focused on chromosomes whose sequences were finished, including human chromosome 21 and mouse chromosome 16. Let us consider some results from each of these studies. A comparison of the DNA in human chromosome 21 and equivalent DNA in the mouse has revealed about 3000 conserved sequences. Surprisingly, only half of these conserved sequences contain genes. However, the fact that they are so well conserved suggests that they are important, and we need to find out why. Perhaps they play a role in gene expression. Humans have 234 so-called “gene deserts” that are poor in genes. Again, it is surprising that 178 of these deserts are conserved in the mouse. And again, this degree of conservation of seemingly useless DNA demands an explanation. Accordingly, geneticists are knocking out some of those gene deserts in the mouse to see what effect their loss will have. In 2002, Venter and colleagues reported a detailed comparison of the sequence of mouse chromosome 16 with sequences in the human genome. 
They found many regions of synteny, that is, regions with conserved gene order that appear to have derived from an ancestral mammalian chromosome. Figure 24.13 illustrates these syntenic regions, analyzed at the protein level. In all, mouse chromosome 16 has homologs on six human chromosomes (represented by different colors); the homologous genes on human chromosome 3 are found in two syntenic blocks, separated by the dotted line. Thus, all told, the genes on mouse chromosome 16 are represented in seven syntenic blocks in the human genome.

Figure 24.13 Regions of conserved synteny between mouse chromosome 16 and the human genome. Homologous genes were detected by analysis at the protein level. Mouse chromosome 16 is depicted at left, with the syntenic regions on six different human chromosomes (labeled 16, 12, 8, 22, 3, and 21 in the diagram) illustrated at right (different colors indicate different human chromosomes). Orthologous genes in mouse and human are connected by colored lines in the middle of the diagram and indicated by tiny horizontal lines (purple, mouse; various colors, human). Genes homologous to mouse chromosome 16 in human chromosome 3 are found in two distinct syntenic blocks, separated by the dotted line. Above that line are human chromosome regions 3q27–29; below the line are regions 3q11.1–13.3. (Source: Adapted from Mural et al., Science 296 (2002) Fig. 3, p. 1666.)

The degree of homology between syntenic regions in the two species is striking. Of 731 mouse genes that could be predicted with high confidence on the mouse chromosome, 717 (98%) have homologs in the human genome.
This great homology far overshadows the fact that the mouse chromosome is represented in six separate human chromosomes and seven different syntenic blocks. Chromosomes frequently become scrambled during evolution without changing much if anything about gene expression, and without changing gene orders within large, syntenic blocks of genes. This can happen by chromosome breakage and translocation. For example, two closely related species of muntjac deer have experienced so much chromosome breakage (or joining, or both) since the two species diverged that one has 3 pairs of chromosomes, and the other has 23 pairs! Nevertheless, the two species can interbreed to produce healthy, albeit infertile, hybrids. The degree of similarity of mice and humans at the genomic level is clearly out of proportion to the obvious differences in appearance and behavior between these two species. How do we explain this discrepancy? If we cannot find the answer in the genes themselves, it must lie in the way the genes are expressed. But some answers are already determined. We know that human genes are subject to an extraordinary amount of alternative splicing. In fact, it has been estimated that about 75% of human genes are spliced in at least two different ways in vivo (Chapter 14). This makes the human proteome (the total complement of human proteins) much more complex than the genome suggests. We also have evidence that the pattern of expression of human genes varies considerably from the expression of the almost identical set of genes in our closest relative, the chimpanzee, and varies even more from the pattern in mice. This could derive from control by miRNAs, which, in contrast to protein-encoding genes, seem to be much different in mice and humans. Another source of variation in gene expression between two closely related species could come from the interaction between transcription factors and their binding sites on the DNA. 
As we have learned, eukaryotic genes have cis-control elements known as promoters and enhancers, and these are the targets of many transcription factors. We might predict that closely related species with highly conserved gene sets would also have highly conserved cis-control elements, but that seems not necessarily to be true. For example, Michael Snyder and colleagues reported in 2007 on ChIP analysis coupled with DNA microchip assays on the DNA targets for two transcription factors from three closely related species of yeast, which showed that these factors bound in the same places relative to the genes they control only 20% of the time in all three species. (This kind of experimentation is called ChIP-chip analysis and is described in more detail in Chapter 25.) The great variation in transcription factor binding observed among these three yeast species is partly due to elements missing in one or two of the genomes, but it also sometimes occurs because a factor fails to bind, even when the element is still present. A similar phenomenon has been observed in a comparison of factor binding in human and mouse genomes. How do we relate this rapid evolution of cis-regulatory elements to changes in phenotype between organisms? At this point, it is very difficult, because of uncertainty about how much each element contributes to expression of a particular gene. It is possible that most of the differences Snyder and colleagues observed play no role in the phenotypic differences among the three species, especially because of the redundancy that appears to be built into many cis-regulatory elements. On the other hand, it seems likely that some of these differences really are important to phenotype.
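To make the idea of a factor binding "in the same places in all three species" concrete, here is a minimal Python sketch that computes the fraction of target genes shared by every species. The gene names and target sets are invented for illustration, and measuring conservation as a fraction of the union of all targets is an assumption made here; it is not the analysis that Snyder and colleagues actually performed.

```python
def conserved_fraction(targets_by_species):
    """Fraction of all observed target genes that are bound in every species."""
    sets = [set(t) for t in targets_by_species.values()]
    union = set().union(*sets)
    if not union:
        return 0.0
    shared = set.intersection(*sets)
    return len(shared) / len(union)


# Hypothetical target genes for one transcription factor in three species.
targets = {
    "species_A": {"g1", "g2", "g3", "g4"},
    "species_B": {"g1", "g3", "g5"},
    "species_C": {"g1", "g4", "g5"},
}

fraction = conserved_fraction(targets)
print(f"{fraction:.0%} of targets are bound in all three species")
```

With these made-up sets, only one of the five targets is bound in all three species, so the sketch reports 20%. A real comparison would also have to decide what counts as "the same place" relative to each gene.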
In 2005, scientists presented a working draft of the chimpanzee genome. Because the chimpanzee is our closest living relative, this sequence has special significance for evolutionary studies. Everyone wants to know what sets us apart from the chimpanzee. What genes give us the intelligence to build a city or write a symphony, or, for that matter, to wonder what makes us human? But a comparison of the chimpanzee and human genomes shows that we share almost all our protein-encoding genes, and our genomes differ by only 1.23% at the nucleotide level. Three hypotheses have been put forward to explain these data: (1) The important differences are changes in protein-encoding genes. (2) The “less is more” hypothesis, which holds that inactivation of certain genes in the human can explain the differences. (3) The differences are found in changes in gene control regions. Each hypothesis has some data to support it.

Despite the paucity of differences between chimpanzees and humans in protein-encoding genes, geneticists have noticed a few that could have large effects. For example, the FOXP2 gene is highly conserved. It experienced only one change in amino acid coding in the approximately 130 million years between the divergence of the human and mouse lineages and the divergence of the human and chimpanzee lineages. But in the approximately 5 million years since the human and chimpanzee lineages diverged, two amino acid changes occurred. Why might the FOXP2 gene be important? It encodes a forkhead class transcription factor, and mutations in this gene cause severe speech impairment in humans. And, of course, speech is one of the key traits that sets humans and chimpanzees apart.

The “less is more” hypothesis also has some support. For example, it is easy to imagine that the relative lack of hair in humans is due to the loss or inactivation of a gene responsible for hairiness.
And a comparison of the human and chimpanzee genomes has uncovered 53 examples of human genes that have been disrupted by insertions or deletions (indels). These genes are functional in chimpanzees, but inactive in humans.

There is less direct experimental support for the third hypothesis (differences in gene control) because of the difficulty in identifying the genetic elements responsible. But the great similarity in the protein-coding regions of the two species suggests we look elsewhere, and genetic control is an attractive place to look. Indeed, as we will see, the most rapidly changing DNA sequences that distinguish the human and chimpanzee genomes are in apparently noncoding DNA regions. The easiest way to make sense of this finding is to say that these DNA regions are involved in controlling the protein-encoding genes.

David Haussler and colleagues took the following approach to finding important differences, coding or noncoding, between the human and chimpanzee genomes. They used computational techniques to identify genome regions that are strongly conserved among vertebrates. Then they looked in these regions to find regions of DNA that had experienced a high rate of change since the divergence of humans and chimpanzees. They found 49 such regions, which they named HAR1–HAR49 (HAR = human accelerated regions). HAR1, a 118-bp DNA region, stood out most of all. In the 310 million years since the chicken and chimpanzee lineages diverged, only two changes occurred. However, in the 5 million years since the human and chimpanzee lineages diverged, fully 18 changes have occurred. Haussler and colleagues then used in situ hybridization on brain slices and found that one of the two RNAs (HAR1F) that includes the HAR1 region is expressed in the developing cerebral neocortex of humans and other primates. The neocortex is thought to be central to higher cognitive function, perhaps the most salient difference between chimpanzees and humans.
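As a rough check on how strongly HAR1 stands out, the figures quoted above can be turned into changes per million years. The Python sketch below is a deliberately naive calculation: it simply divides the number of changes by the elapsed time, ignoring how the changes are distributed between lineages and any statistical uncertainty. The function and variable names are illustrative only.

```python
def changes_per_million_years(changes, million_years):
    """Naive substitution rate: observed changes divided by elapsed time."""
    return changes / million_years


HAR1_LENGTH_BP = 118  # length of the HAR1 region, from the text

# Two changes in the 310 million years separating chicken and chimpanzee.
background_rate = changes_per_million_years(2, 310)
# Eighteen changes in the 5 million years since the human-chimpanzee split.
human_rate = changes_per_million_years(18, 5)

fold_acceleration = human_rate / background_rate
fraction_changed = 18 / HAR1_LENGTH_BP

print(f"Background rate: {background_rate:.4f} changes per million years")
print(f"Human-lineage rate: {human_rate:.1f} changes per million years")
print(f"Fold acceleration: about {fold_acceleration:.0f}")
print(f"Fraction of HAR1 positions changed: {fraction_changed:.1%}")
```

Under these simplifying assumptions the human-lineage rate comes out several hundred times higher than the background rate, which is why the region is called "accelerated."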
Thus, we know that HAR1 gives rise to two RNAs, but these RNAs appear not to encode any proteins. However, the base sequence of HAR1F allows a prediction of a stable secondary structure (intramolecular base-pairing). And the changes between the chimpanzee and human forms of HAR1F are predicted to cause a significant difference in secondary structure, including a strengthening of base-pairing. We do not know yet what HAR1F and HAR1R do, but a reasonable hypothesis is that one or both of these RNAs influence the expression of protein-encoding genes in the developing human brain and give it some of its cognitive power. One striking finding from the work of Haussler and colleagues, as well as other workers in this field, is that the most rapid changes in the genomes of humans and chimpanzees have not been in protein-encoding genes, but in noncoding regions of the genome.

Although the chimpanzee is our closest living relative, our closest evolutionary relative is the Neanderthal (Homo neanderthalensis), which has been extinct for about 30,000 years. In 2010, a group led by Svante Pääbo succeeded in a task many people assumed was impossible: they reported a draft sequence of the Neanderthal genome. The problem with sequencing the genome of a fossil organism is that the DNA is badly degraded, and therefore commonly thought to be unfit for sequencing. But Pääbo and colleagues solved this problem by using next-generation sequencing techniques, in which DNAs are intentionally fragmented to begin with, so DNAs that are already fragmented pose less of a problem.
Another difficulty was that the bone samples from which the Neanderthal DNA came were massively contaminated with bacterial DNA, but Pääbo and colleagues minimized that problem by cutting the DNA with restriction enzymes whose recognition sites include CG sequences, which are rare in mammals, but common in microbes. This reduced the size of most microbial DNA fragments to the point that they did not interfere with the sequencing. One limitation of next-generation sequencing is that the DNA fragments are frequently too short to exhibit obvious overlaps, so they cannot be pieced together to form a whole genome. But that is not a problem if a closely related species has already had its genome sequenced, so the fragments can be compared to that sequence and placed in the proper order. Because the human genome was already available, Pääbo and colleagues could use it as a framework for their Neanderthal sequence, which they obtained from DNA extracted from well-preserved fossil remains. It is fascinating to have the Neanderthal sequence for many reasons. For example, it appears to be able to answer the question whether modern humans and Neanderthals interbred. The two species coexisted for at least 10,000 years in Europe and Asia, until the Neanderthals disappeared, so interbreeding was certainly possible. If interbreeding occurred, and the offspring were fertile, it should be possible to find traces of the Neanderthal genome in the present human genome. Indeed, Pääbo and colleagues found similarities between the Neanderthal genome and the genomes of a modern European (French), a modern East Asian (Han Chinese), and a modern Papua New Guinean, but these similarities did not extend to the genomes of two modern sub-Saharan Africans (a San from Southern Africa and a Yoruba from West Africa). Thus, Neanderthals did apparently interbreed with the ancestors of modern Eurasians, but this happened after the Eurasian and African lineages diverged. 
Also, because the Neanderthal genome resembles the Papua New Guinean, Chinese, and European genomes equally closely, the interbreeding appears to have happened before those lineages diverged. Pääbo and colleagues also reported the full Neanderthal mitochondrial DNA sequence, in 2008. They eliminated errors and minimized the effect of contamination by sequencing so thoroughly that each base was represented in at least 35 independent reads. Gaps and ambiguities were resolved by traditional sequencing. The modern human and Neanderthal mitochondrial sequences differ in an average of 206 bases. This contrasts with differences between modern human mitochondrial sequences that vary between 2 and 118 bases. These data allowed Pääbo and colleagues to estimate the time of divergence between the modern human and Neanderthal lineages at about 660,000 years ago.

SUMMARY Comparing the human genome with that of other vertebrates has already taught us much about the similarities and differences among genomes. Such comparisons have also helped to identify many human genes. In the future, such comparisons will help find the genes that are defective in human genetic diseases. One can also use closely related species like the mouse to find when and where their genes are expressed and therefore to estimate when and where the corresponding human genes are expressed. Detailed comparison of mouse and human chromosomes has revealed a high degree of synteny between the two species. Comparisons of the human genome with that of our closest living relative, the chimpanzee, have identified a few DNA regions that have changed rapidly since the two species diverged. These are good candidates for the DNA sequences that set humans and chimps apart, yet very few of them are in protein-encoding genes. Thus, the thing that really sets us apart may be control of genes, rather than the genes themselves.
Studies in yeasts have shown that even closely related species have great variation in the cis-regulatory elements that control their genes, though the genes themselves are highly conserved. Thus, cis-regulatory elements are subject to relatively rapid evolution, and that may help to explain differences in gene control, and therefore in phenotype. More insight into what makes us human will come from the genome of the Neanderthal. A working draft of this genome, as well as a finished version of the mitochondrial DNA, has already been published.

The Minimal Genome

By early 2002, over 50 bacterial genomes had been sequenced. The smallest of these genomes belong to intracellular parasites, such as mycoplasmas, Rickettsia (one of whose members causes Rocky Mountain spotted fever), and parasitic spirochetes like Borrelia burgdorferi, which causes Lyme disease. The record for smallest bacterial genome is held by Mycoplasma genitalium, at only 580 kb. This kind of analysis has led some geneticists to ask, “What is the smallest genome that is still compatible with life?” One way to answer this question would be to compare the genomes of bacteria and find the lowest common denominator: the set of genes they all have in common. But that yields a set of only about 80 genes, which is clearly too few to sustain life. Thus, different bacteria have followed different paths to streamlining their genomes, and it is therefore not useful simply to find where the endpoints of these different paths overlap.

In 1999, Craig Venter and colleagues reported the results of another approach to finding the minimal genome. They systematically mutagenized the genes in Mycoplasma genitalium and the related species M. pneumoniae, using transposons to interrupt the genes. Then they looked to see which genes were essential, and which were not. They discovered that 265–350 of the 480 protein-encoding genes in these organisms are essential. Surprisingly, 111 of these genes had unknown functions, suggesting that we still have a lot to learn about what it takes to sustain life.

This experiment identified the essential gene set, that is, the set of genes whose loss is incompatible with life. But that is not the same as the minimal genome, the collection of genes that would sustain life in a real organism. The distinction comes from the fact that an organism can afford to lose certain genes by themselves, but loss of two or more of these same genes together is not compatible with life. Thus, these genes are not part of the essential gene set, but they are part of the minimal genome.

The next task was to discover which genes need to be added to the essential gene set to produce a minimal genome. Venter and colleagues proposed to perform this task in a spectacularly ambitious way. They aimed to synthesize DNA from scratch, building DNA cassettes carrying several genes. Then they would place these cassettes into Mycoplasma cells whose own genes had been disabled so they would not confuse the issue. They would experiment with different combinations of genes until they found the combination with the smallest number of genes that could still support life. This plan faced a difficult hurdle: getting the genes to function appropriately in a new cell without any genes of its own. It is true that one can place one or a few foreign genes into a normal bacterial cell and get them to turn on very well. But what about an entirely new gene set? There was a significant chance that the genes would not turn on, but would just sit there.
Bernhard Palsson has stated the problem this way: “How do you boot up a new genome?” However, by 2007, Venter and colleagues had reported progress that showed that booting up a genome really does work. They transplanted the genome of Mycoplasma mycoides to another bacterium, Mycoplasma capricolum, and the resulting cell thrived with its new genome. However, they had to use some creative manipulations to make the transplant work. First, they added an antibiotic resistance gene to the donor bacteria (M. mycoides), and embedded these cells in an agarose gel. Then they broke open the cells and digested their proteins with proteolytic enzymes. (Mycoplasma cells lack a cell wall, which makes it easier to break them.) With the released circular genome protected from physical stress by the agarose, the recipient bacteria (M. capricolum) were added, along with the membrane-fusing agent polyethylene glycol. Apparently, some of these recipient bacterial membranes opened up and then fused around the naked donor genomes. Instead of destroying the recipient cell’s genome, Venter and colleagues played a clever trick involving the antibiotic resistance gene they had placed in the donor cell genome. After fusion, the recipient cell found itself with two genomes: one it had always had, and one from the donor. With two genomes, the cell was ready to divide, and proceeded to do so. One daughter cell got the donor genome, and the other got the recipient genome. But only the daughter cell with the donor genome had the antibiotic resistance gene, so growing the cells in the presence of the antibiotic automatically removed the cells with the recipient genome. The result was that all of the cells that formed early in the experiment were M. capricolum cells with an M. mycoides genome. In 2010, Venter and colleagues used a similar technique to introduce an entirely synthetic M. mycoides genome into M. capricolum cells.
The success of this experiment ushered in a new era of “synthetic biology.” Of course, the engineered organisms are not truly synthetic (only their genome is), but they represent a milestone nonetheless. A potential ethical question might remain: Is it ethical to create life from nonliving ingredients? Recognizing this issue, Venter and colleagues submitted their plan to a panel of ethicists, who decided in 1999 that it presented no serious ethical problems. But they did see some safety issues, and recommended that public officials should examine the possibility that the artificial life forms Venter and his colleagues would create could pose an environmental hazard, or that they might lend themselves to being modified for use as agents of bioterrorism or biowarfare. To at least partially address the safety issue, Venter and colleagues have endowed their synthetic genome with a watermark, a DNA sequence not found in nature, that will enable the engineered organisms to be identified. Use of these organisms by terrorists seems very unlikely because a great deal of sophistication will be needed to create them, and there is no indication that they would be any more dangerous than highly toxic natural organisms that are already available. Ethical questions remain, however, and President Obama convened an ethics panel to study the issues and issue a report by the end of 2010.

Why build an organism with a minimal genome? On a purely scientific level, it will be important to show that there is such a thing as a minimal genome, and then to investigate why these particular genes are required. But practical applications are also possible.
Indeed, Venter and colleagues plan to supplement the minimal genome with genes that will enable the bacteria to create fuels such as hydrogen, or to clean up industrial waste, including CO2 from power plants. This does not mean that traditional organisms will lose out to synthetic ones with minimal genomes as the microbial workhorses of the future. Frederick Blattner and his colleagues have been trimming away the genome of E. coli to build an organism with a reduced genome that is hospitable to new genes. Their strategy is to identify genes that differ from one strain to another, and are therefore probably dispensable. They have found that these genes tend to cluster in “islands” that can be conveniently deleted. As of late 2005, they had made 43 deletions, cutting the genome’s size down by more than 10%. Already, this altered bacterium was ten times better at accepting new genes than typical laboratory strains. By late 2007, Blattner’s group had pared away 14% of the E. coli genome without harming the ability of the cells to grow and express foreign genes, and more trimming remains to be done. Finally, it is worth noting that M. genitalium and other intracellular parasites can get away with such a small genome because of their parasitic lifestyle. They get many of their nutrients from their hosts, so they can safely shed the genes that produce those nutrients. In fact, M. genitalium may already have honed its genome to something close to the minimum required for life in its human host. But scientists may be able to hone it even further—to the minimum required to live under rigidly controlled laboratory conditions. SUMMARY It is possible to define the essential gene set of a simple organism by mutating one gene at a time to see which genes are required for life. In principle, it is also possible to define the minimal genome—the set of genes that is the minimum required for life. It is likely that this minimal genome is larger than the essential gene set. 
It is also possible to place this minimal genome into a cell lacking genes of its own and thereby create a new form of life that can live and reproduce under laboratory conditions. With selected genes added, such a life form could be modified to perform many useful tasks.

The Barcode of Life

Taxonomists are in the business of classifying organisms and understanding their differences and relationships. Traditionally, they have relied on simple appearances, or morphological characteristics, to distinguish among different species. Now, in the era of DNA sequencing, they have gained another tool, because different species have different DNA sequences, as well as different appearances. Moreover, the degree of difference in the DNA sequences between two species is a good measure of their evolutionary distance, or the time since the two diverged, assuming a constant rate of mutation. However, with millions of species to study, there is no hope with present technology of sequencing the whole genomes of even a significant fraction of all these species. Instead, taxonomists focus on small regions of the genome that show a significant amount of variation among the species they are studying. Now, a group of scientists called the Consortium for the Barcode of Life (CBOL) is proposing to obtain a relatively short DNA sequence, or barcode, from the genome of every species on earth. In principle, this would allow the rapid identification of any known species, including agents of bioterrorism, and it would help to place new species on the proper branch of the tree of life. The work would start with the 1.7 million known species of animals and plants and then move to the rest of the 10 million or more unknown species (not counting microbes). CBOL scientists settled on a 648-bp region from the mitochondrial cytochrome c oxidase subunit I (COI) gene as the barcode, at least for animals. This gene is present in all organisms.
And, at least in animals, it shows a good degree of difference between closely related species, but little difference between members of the same species. For example, the barcodes in different human beings differ from one another by only one or two base pairs out of 648, while those in humans and chimpanzees, our closest living relatives in the tree of life, differ by 60 bp. Moreover, a sequence of 648 bp is easy and cheap to obtain in one run of a traditional automated sequencer, and mitochondrial DNA is relatively easy to purify because each cell contains 100–10,000 copies instead of just the two copies found in nuclear DNA. One drawback to the COI barcode is that plant mitochondrial DNA sequences show much less variation than animal sequences do, so the COI barcode will not work well for plants. Instead, a consortium of plant systematists, known as the Plant Working Group of CBOL, has proposed using sequences from two chloroplast genes (matK and rbcL) for the plant barcode. This is not a perfect solution, as this barcode works better for some plant species than for others. But it has correctly identified 72% of all plant species, and has a perfect record in placing plants in the proper genus.

In Richard Preston’s novel, The Cobra Event, a deranged man creates a very nasty virus and releases it in New York City. But scientists in the book have an invaluable tool for detecting such agents: a handheld device that almost instantly identifies microbes. We are clearly not at that point yet. However, someday it may be possible to miniaturize DNA sequencers to the point that they could be used as field devices for quick identification, via barcodes, of unknown organisms.

SUMMARY A movement has begun to create a barcode to identify any species of life on earth.
The first “barcode of life” will consist of the sequence of a 648-bp piece of the mitochondrial COI gene from each organism. This sequence is sufficient to uniquely identify almost any animal. Other sequences, or barcodes, are being worked out for plants.

SUMMARY

Several methods are available for identifying the genes in a large, unsequenced DNA region. One of these is the exon trap, which uses a special vector to help clone exons only. Another is to use methylation-sensitive restriction enzymes to search for CpG islands, DNA regions containing unmethylated CpG sequences. Before the genomics era, geneticists mapped the Huntington disease gene (HD) to a region near the end of chromosome 4. Then they used an exon trap to identify the gene itself.

Rapid, automated DNA sequencing methods have allowed molecular biologists to obtain the base sequences of viruses and organisms ranging from simple phages to bacteria to yeast, simple animals, plants, mice, and humans. Much of the mapping work in the Human Genome Project was done with yeast artificial chromosomes (YACs), vectors that contain a yeast origin of replication, a centromere, and two telomeres. Foreign DNA up to 1 million bp long can be inserted between the centromere and one of the telomeres. It will then replicate along with the YAC. On the other hand, because of their superior stability and ease of use, most of the sequencing work in the Human Genome Project was done with bacterial artificial chromosomes (BACs). BACs are vectors based on the F plasmid of E. coli. They can accept inserts up to about 300 kb, but their inserts average about 150 kb.

Mapping the human genome, or any large genome, requires a set of landmarks (markers) to which one can relate the positions of genes. Genes can be used as markers in mapping, but markers are usually anonymous stretches of DNA such as RFLPs, VNTRs, STSs (including ESTs), and microsatellites.
RFLPs (restriction fragment length polymorphisms) are differences in the lengths of restriction fragments generated by cutting the DNA of two or more different individuals with a restriction endonuclease. RFLPs can be caused by the presence or absence of a restriction site in a particular place or insertions and deletions between restriction sites. They can also be caused by a variable number of tandem (head-to-tail) repeats (VNTRs) between two restriction sites. STSs (sequence-tagged sites) are regions of DNA that can be identified by formation of a predictable length of amplified DNA by PCR with pairs of primers. ESTs (expressed sequence tags) are a subset of STSs generated from cDNAs, so they represent expressed genes. Microsatellites are a subset of STSs generated by PCR with pairs of primers flanking tandem repeats of just a few nucleotides (usually 2–4 nt).

Radiation hybrid mapping allows mapping of STSs and other markers that are too far apart to fit on one BAC. In radiation hybrid mapping, human cells are irradiated to break chromosomes, then these dying cells are fused with hamster cells. Each hybrid cell has a different subset of human chromosome fragments. The closer together two markers are, the more likely they are to be found in the same hybrid cell.

Massive sequencing projects can take two forms: the map-then-sequence (clone-by-clone) approach or the shotgun approach. Actually, a combination of these methods was used to sequence the human genome. The clone-by-clone strategy calls for production of a physical map of the genome including STSs, then sequencing the overlapping clones (mostly BACs) used in the mapping. This places the sequences in order so they can be pieced together. The shotgun strategy calls for the assembly of libraries of clones with different size inserts, then sequencing the inserts at random. This method relies on a computer program to find areas of overlap among the sequences and piece them together.
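The idea of finding overlaps among random reads and piecing them together can be illustrated with a toy Python sketch. It uses a simple greedy strategy (repeatedly merging the pair of reads with the longest suffix-prefix overlap), which is an assumption chosen for brevity; it is not the program used in the Human Genome Project, and real assemblers must also cope with sequencing errors, repeats, and reads from both strands. The reads and function names are invented for illustration.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Merge reads pairwise, always taking the largest available overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break  # no overlaps left; the remaining reads stay as separate contigs
        merged = reads[i] + reads[j][n:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads


# Three hypothetical overlapping reads from one short stretch of DNA.
reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]
print(greedy_assemble(reads))  # → ['ATGGCGTACCTGA']
```

When no overlaps of at least `min_len` bases remain, the function returns several contigs rather than one sequence, which loosely mirrors the gaps left in a real shotgun assembly.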
Sequencing of human chromosome 22q has revealed: (1) gaps that cannot be filled with available methods; (2) 855 annotated genes; (3) the great bulk (about 97%) of the chromosome is made up of noncoding DNA; (4) over 40% of the chromosome is in interspersed repeats such as Alu sequences and LINEs; (5) the rate of recombination varies across the chromosome, with long regions of low rates of recombination punctuated by short regions with relatively high rates; (6) several examples of local and long-range duplications; (7) large regions where linkage among genes has been conserved with that in seven different mouse chromosomes.

The working draft of the human genome reported by two separate groups allowed estimates that the genome probably contains fewer genes than anticipated. About half of the genome has derived from the action of transposons, and transposons themselves have contributed dozens of genes to the genome. In addition, bacteria appear to have donated at least dozens of genes. The finished draft of the human genome is much more accurate and complete than the working drafts, but it still contains some gaps. On the basis of the finished draft, geneticists estimate that the genome contains about 20,000–25,000 genes. The finished draft also gives valuable information about human evolution.

Comparing the human genome to that of other vertebrates has already taught us much about the similarities and differences among genomes. Such comparisons have also helped to identify many human genes. In the future, such comparisons will help find the genes that are defective in human genetic diseases.
One can also use closely related species like the mouse to find when and where their genes are expressed and therefore to estimate when and where the corresponding human genes are expressed. Detailed comparison of mouse and human chromosomes has revealed a high degree of synteny between the two species.
It is possible to define the essential gene set of a simple organism by mutating one gene at a time to see which genes are required for life. It is also possible to define the minimal genome, the set of genes that is the minimum required for life. It is likely that this minimal genome is larger than the essential gene set. In principle, it is also possible to place this minimal genome into a cell lacking genes of its own and thereby create a new form of life that can live and reproduce under laboratory conditions. With selected genes added, such a life form could be modified to perform many useful tasks.
A movement has begun to create a barcode to identify any species of life on earth. The first “barcode of life” will consist of the sequence of a 648-bp piece of the mitochondrial COI gene from each organism. This sequence is sufficient to uniquely identify almost any animal. Other sequences, or barcodes, are being worked out for plants.
REVIEW QUESTIONS
1. What is a CpG island? Why have CpG sequences tended to disappear from the human genome?
2. a. What kind of mutation gave rise to Huntington disease?
   b. What is the evidence that the gene identified as HD is really the gene that causes HD?
3. What is an open reading frame (ORF)? Write a DNA sequence containing a short ORF.
4. What are the essential elements of a YAC vector?
5. On what plasmid are the BAC vectors based? What essential elements do they contain?
6. Describe the procedure for finding an STS in a genome.
7. Describe microsatellites and minisatellites. Why are microsatellites better tools for linkage mapping than minisatellites?
8. Show how to use STSs in a set of BAC clones to form a contig. Illustrate with a diagram different from the one given in the text.
9. Describe the use of radiation hybrid mapping to map STSs.
10. How does an expressed sequence tag (EST) differ from an ordinary STS?
11. Compare and contrast the clone-by-clone sequencing strategy and the shotgun sequencing strategy for large genomes.
12. What major conclusions can we draw from the sequence of human chromosome 22?
13. What is a pseudogene?
14. What is the difference between an ortholog and a paralog?
15. How do scientists estimate the number of genes in complex eukaryotes like humans?
16. The tiger pufferfish (Fugu rubripes) genome is nine times smaller than the human genome, but it contains just as many genes. How can that be?
17. What do we mean by “syntenic regions” in the mouse and human genomes?
18. Humans appear to have about as many protein-encoding genes as roundworms. How do you explain the lack of correspondence between the apparent numbers of genes and the complexities of these two organisms?
19. What is the difference between an organism’s “essential gene set” and its “minimal genome”?
ANALYTICAL QUESTIONS
1. Will the following DNA fragments be detected by an exon trap? Why or why not?
   a. An intron
   b. Part of an exon
   c. A whole exon with parts of introns on both sides
   d. A whole exon with part of an intron on one side
2. The following is a physical map of a region you are mapping by RFLP analysis.
   [Figure: physical map with four numbered restriction sites in the order 1, 2, 3, 4, spaced 2 kb (site 1 to site 2), 3 kb (site 2 to site 3), and 1 kb (site 3 to site 4) apart; a bar labeled "Extent of probe" is drawn above the map.]
   The numbered vertical lines represent restriction sites recognized by SmaI. The circled sites (2 and 3) are polymorphic; the others are not. You cut the DNA with SmaI, electrophorese the fragments, blot them to a membrane, and probe with a DNA whose extent is shown at top.
   Give the sizes of fragments you will detect in individuals homozygous for the following haplotypes with respect to sites 2 and 3.
   Haplotype    Site 2     Site 3     Fragment sizes
   A            Present    Present
   B            Present    Absent
   C            Absent     Present
   D            Absent     Absent
3. You are mapping the gene responsible for a human genetic disease. You find that the gene is linked to an RFLP detected with a probe called X-21. You hybridize labeled X-21 DNA to DNAs from a panel of mouse–human hybrid cells. The following shows the human chromosomes present in each hybrid cell line, and whether the probe hybridized to DNA from each. Which human chromosome carries the disease gene?
   Cell line    Human chromosome content    Hybridization to X-21
   A            1, 5, 21                    +
   B            6, 7                        −
   C            1, 22, Y                    −
   D            4, 5, 18, 21                +
   E            8, 21, Y                    −
   F            2, 5, 6                     +
4. You have just obtained the sequence of the genome of an organism that has been the subject of considerable genetic study. Describe how you would identify genomic regions that have experienced high rates of recombination. Explain the reasoning behind your approach.