93 242 Techniques in Genomic Sequencing
wea25324_ch24_759-788.indd Page 765 22/12/10 9:02 AM user-f467 Volume/204/MHDQ268/wea25324_disk1of1/0073525324/wea25324_pagefile Volume/204/MHDQ268/wea25324_disk1of1/0073525324/wea25324_pagefiles 24.2 Techniques in Genomic Sequencing used an exon-trapping strategy and identified a handful of exon clones. They then used these exons to probe a cDNA library to identify the DNA copies of mRNAs transcribed from the target region. One of the clones, called IT15, for “interesting transcript number 15,” hybridized to cDNAs that identified a large (10,366 nt) transcript that codes for a large (3144 amino acid) protein called huntingtin. The presumed protein product did not resemble any known proteins, so that did not provide any evidence that this is indeed HD. However, the gene had an intriguing repeat of 23 copies of the triplet CAG (one copy is actually CAA), encoding a stretch of 23 glutamines. Is this really HD? Gusella’s team’s comparison of the gene in affected and unaffected individuals in 75 HD families demonstrated that it is. In all unaffected individuals, the number of CAG repeats ranged from 11 to 34, and 98% of these unaffected people had 24 or fewer CAG repeats. In all affected individuals, the number of CAG repeats had expanded to at least 42, up to a high of about 100. Thus, we can predict whether an individual will be affected by the disease by looking at the number of CAG repeats in this gene. Furthermore, the severity, or age of onset of the disease correlates at least roughly with the number of CAG repeats. People with a number of repeats at the low end of the affected range (now known to be 36–40) generally survive well into adulthood before symptoms appear, whereas people with a number of repeats at the high end of the range tend to show symptoms in childhood. In one extreme example, an individual with the highest number of repeats detected (about 100) started showing disease symptoms at the extraordinarily early age of 2. 
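The repeat-counting idea in this passage can be sketched in a few lines of Python. This is only an illustration, not a diagnostic tool: the function names are hypothetical, the cutoffs are simply the ranges quoted above for the 75-family study (11 to 34 repeats in unaffected people, at least 42 in affected people), and the count covers strict CAG triplets only, so an interrupting CAA ends the run.

```python
import re


def longest_cag_run(dna: str) -> int:
    """Return the number of repeat units in the longest uninterrupted CAG tract."""
    runs = re.findall(r"(?:CAG)+", dna.upper())
    return max((len(run) // 3 for run in runs), default=0)


def classify_repeat_count(repeats: int) -> str:
    """Compare a repeat count with the ranges quoted in the text (assumed cutoffs)."""
    if repeats <= 34:
        return "within the unaffected range"
    if repeats >= 42:
        return "within the affected range"
    return "between the two ranges reported in the study"


# Hypothetical toy sequence: 20 CAG repeats flanked by unrelated bases.
toy_allele = "ATGGCG" + "CAG" * 20 + "CCGCCA"
print(longest_cag_run(toy_allele), classify_repeat_count(longest_cag_run(toy_allele)))
```

A real assay would measure the repeat tract by PCR product size rather than by scanning a finished sequence, but the classification step is the same comparison against a threshold.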
Finally, two people were affected, even though their parents were not. In both cases, the affected individuals had expanded CAG repeats, whereas their parents did not. New mutations (expanded CAG repeats), although a rare occurrence in HD, apparently caused both these cases of disease.

Another way of demonstrating that this gene is really HD would be to deliberately mutate it and show that the mutation has neurological effects. Obviously, one cannot perform such an experiment in humans, but it would be feasible in mice, if the gene corresponding to HD is known. Fortunately, HD is conserved in many species, including the mouse, where the gene is known as Hdh. In 1995, a team of geneticists led by Michael Hayden created knockout mice (Chapter 5) with a targeted disruption in exon 5 of Hdh. Mice that are homozygous for this mutation die in utero. Heterozygotes are viable, but they show loss of neurons with corresponding lowering of intelligence. This reinforces the notion that Hdh, and therefore HD, plays an important role in the brain, exactly what we would expect of the gene that causes HD.

How can we put this new knowledge to work? One obvious way is to perform accurate genetic screening to detect people who will be affected by the disease. In fact, by counting the CAG repeats, we may even be able to predict the age of onset of the disease. However, that kind of information is a mixed blessing, as it can be psychologically devastating. What we really need, of course, is a cure, but that may be a long way off.

The Advantage of Genomic Data

The positional cloning study we have just examined took years, and much of that time was spent sequencing DNA in the suspected regions and trying to determine which gene in the sequence was the most likely culprit. With the human genome now finished, that job has become much easier. Just how much easier is indicated by Neal Copeland, a mouse geneticist who has been doing positional cloning in mice for years.
He says, “It took us 15 years to get 10 possible cancer genes before we had the sequence. And it took us a few months to get 130 genes once we had the sequence.” He was talking about the mouse sequence, of course, but the same principle applies to humans, and mouse positional-cloning studies very often identify genes that cause similar problems in humans. So one of the biggest anticipated payoffs of genomics research will be the acceleration of discovery of disease genes in humans.

You should not conclude from this discussion that positional cloning is obsolete. It will be important as long as we are curious about finding genes responsible for traits in any organism. Sequenced genomes simply make positional cloning much easier.

SUMMARY Using RFLPs, geneticists mapped the Huntington disease gene (HD) to a region near the end of chromosome 4. Then they used an exon trap to identify the gene itself. The mutation that causes the disease is an expansion of a CAG repeat from the normal range of 11–34 copies, to the abnormal range of at least 38 copies. The extra CAG repeats cause extra glutamines to be inserted into huntingtin.

24.2 Techniques in Genomic Sequencing

The first genome to be sequenced, as you might expect, was a very simple one: the small DNA genome of an E. coli phage called φX174. Frederick Sanger, the inventor of the dideoxy chain termination method of DNA sequencing, obtained the sequence of this 5375-nt genome in 1977. What kind of information can we glean from this sequence? First, we can locate exactly the coding regions for all the genes. This tells us the spatial relationships among genes and the distances between them to the exact nucleotide. How do we recognize a coding region?
It contains an ORF that is long enough to code for one of the phage proteins. Furthermore, the ORF must start with an ATG (or occasionally a GTG) triplet, corresponding to an AUG (or GUG) translation initiation codon, and end with the DNA equivalent of a stop codon (UAG, UAA, or UGA). In other words, an ORF in a bacterium or phage is the same as a gene’s coding region.

The base sequence of the phage DNA also tells us the amino acid sequences of all the phage proteins. All we have to do is use the genetic code to translate the DNA base sequence of each open reading frame into the corresponding amino acid sequence. This may sound like a laborious process, but a personal computer can do it in a split second.

Sanger’s analysis of the open reading frames of the φX174 DNA revealed something unexpected and fascinating: Some of the phage genes overlap. Figure 24.6a shows that the coding region for gene B lies within gene A and the coding region for gene E lies within gene D. Furthermore, genes D and J overlap by 1 bp. How can two genes occupy the same space and code for different proteins? The answer is that the two genes are translated in different reading frames (Figure 24.6b). Because entirely different sets of codons will be encountered in these two frames, the two protein products will also be quite different.

This was certainly an interesting finding, and it raised the question of how common this phenomenon would be. So far, major overlaps seem to be confined almost exclusively to viruses, which is not surprising because these simple infectious agents have small genomes in which the premium is on efficient use of the genetic material.
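To make the reading-frame argument concrete, here is a minimal Python sketch that translates one short stretch of DNA in all three forward frames. The 21-nt fragment is the one printed in Figure 24.6b at the end of gene D; the function and variable names are illustrative assumptions, and the codon table is the standard genetic code written compactly.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna: str, frame: int = 0) -> str:
    """Translate the nontemplate strand starting at offset `frame` (0, 1, or 2).

    Stop codons are written as '*'; a trailing partial codon is ignored.
    """
    dna = dna.upper()
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(frame, len(dna) - 2, 3))


# Fragment from Figure 24.6b (end of gene D, end of gene E, start of gene J).
FRAGMENT = "GAAGGAGTGATGTAATGTCTA"

for frame in range(3):
    print(frame, translate(FRAGMENT, frame))
# 0 EGVM*CL   gene D frame: ...Glu Gly Val Met Stop
# 1 KE*CNV    gene E frame (one base to the right): ...Lys Glu Stop
# 2 RSDVMS    gene J frame (one base to the left): Met Ser...
```

The same bases therefore read as the end of gene D in one frame, the end of gene E in a second, and the start of gene J in the third, which is exactly the overlap the figure describes.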
Moreover, viruses have prodigious power to replicate, so enormous numbers of generations have passed during which evolution has honed the viral genomes.

With the advent of automated sequencing, geneticists have added much larger genomes to the list of total known sequences. In 1988, D.J. McGeoch and colleagues published the sequence of an important human virus (herpes simplex virus I) with a relatively large genome: 152,260 bp. In 1995, Craig Venter and Hamilton Smith and colleagues determined the entire base sequences of the genomes of two bacteria: Haemophilus influenzae and Mycoplasma genitalium. The H. influenzae (strain Rd) genome contains 1,830,137 bp and it was the first genome from a free-living organism to be completely sequenced. The M. genitalium genome, at only 580,000 bp, is the smallest of any known free-living organism and contains only about 470 genes.

In April 1996, the leaders of an international consortium of laboratories announced another milestone: The 12-million-bp genome of baker’s yeast (Saccharomyces cerevisiae) had been sequenced. This was the first eukaryotic genome to be entirely sequenced. Later in 1996, the first genome of an organism (Methanococcus jannaschii) from the third domain of life, the archaea, was sequenced. Then, in 1997, the long-awaited sequence of the 4.6 million-bp E. coli genome was reported. This is only about one-third the size of the yeast genome, but the importance of E. coli as a genetic tool made this a milestone as well. In 1998, the sequence of the first animal genome, from the roundworm Caenorhabditis elegans, was reported. The first plant genome (from the mustard family member Arabidopsis thaliana) was completed in 2000. C. elegans and A. thaliana are both model organisms chosen for study because of their small genome size, short generation time, and their ease of manipulation in genetic experiments. C.
elegans has the additional advantages of having fewer than 1000 cells, and being transparent, so the development of each of its cells can be tracked visually. Two other famous model organisms are the fruit fly Drosophila melanogaster and the house mouse Mus musculus. The sequences of the genomes of these two organisms were reported in 2000 and 2002, respectively.

Figure 24.6 The genetic map of phage φX174. (a) Each letter stands for a phage gene. (b) Overlapping reading frames of φX174. Gene D (pink) begins with the base numbered 1 in this diagram and continues through base number 459. This corresponds to amino acids 1–152 plus the stop codon TAA. Dots represent bases or amino acids not shown. Only the nontemplate strand is shown. Gene E (blue) begins at base number 179 and continues through base number 454, corresponding to amino acids 1–90 plus the stop codon TGA. This gene uses the reading frame one base to the right, relative to the reading frame of gene D. Gene J (gray) begins at the base number 459 and uses the reading frame one base to the left, relative to gene D. Sequence fragments shown in panel (b): GTTTATGGTA (start of gene E within gene D) and GAAGGAGTGATGTAATGTCTA (ends of genes D and E, start of gene J).

Also in 2000, the eagerly awaited rough draft of the human genome sequence was announced. By 2001, this “working draft” of the human genome was published. In 2002, several important genomes were reported, in at least draft form. These included the genomes of the single-celled parasite Plasmodium falciparum, which causes malaria, and the mosquito Anopheles gambiae, which is the major carrier of the parasite.
Together, these genomes promise to help in designing better ways of combating the terrible scourge of malaria. The year 2002 also saw the publication of draft sequences of the genomes of two common varieties of rice (Oryza sativa). This is the first cereal plant genome to be sequenced, and it has enormous potential significance for human nutrition. Much of the world’s population relies on cereals, and rice in particular, for the bulk of their food. The genomic sequences of two more vertebrates also appeared in 2002: The tiger pufferfish (Fugu rubripes), and the house mouse (Mus musculus). Comparison of these sequences to that of the human genome has already shed light on vertebrate evolution. Additional help on this evolutionary investigation has come from the sequence of the genome of the sea squirt, Ciona intestinalis. The adult of this species is a sessile marine organism that attaches itself to rocks and pier pilings. It bears scant resemblance to a vertebrate, but its larval form resembles a tadpole, complete with a dorsal column made of cartilage that bears some resemblance to a spine. Thus, the sea squirt is a chordate, in the same phylum with the vertebrates. Comparison of the genome of this organism with those of vertebrates and invertebrates, such as nematodes and fruit flies, will give us additional insight into vertebrate evolution. Most molecular evolution studies depend on comparisons of base sequences of parts of genomes from different organisms. The guiding principle is that there is a relationship between the divergence of the genomic sequences between any two organisms and the evolutionary distance between those two organisms. Thus, the genomes of organisms that diverged relatively recently, such as the mouse and human, should be more similar than the genomes of organisms that diverged longer ago, such as the sea squirt and human. 
In general, this is certainly true, but genomic studies on these and other organisms have revealed some unexpected features. For example, the rate of evolution of the human genome is not constant throughout. Instead, there are regions of relatively rapid change interspersed with regions that have changed relatively slowly over time. It will be fascinating to discover the reasons for these differences.

Another lesson from the genomes sequenced so far is that the size of an organism’s genome tends to correlate with the organism’s complexity. (On the other hand, we discovered in Chapter 2 when we discussed the C-value paradox that there are many exceptions to this general rule.) In accord with the rule, prokaryotic genomes tend to be much smaller than eukaryotic ones. However, it is interesting that there is some overlap. For example, the smallest eukaryotic genome sequenced to date is that of the obligate intracellular parasite of humans and other mammals, Encephalitozoon cuniculi. This organism has a genome comprising only about 2.9 Mb, and has only 1997 ORFs that could potentially code for proteins. (Of course, a parasitic lifestyle enables an organism to survive with fewer genes because it can rely on its host for many of its needs.) By contrast, the largest bacterial genome, as of 2008, is that of the social bacterium Sorangium cellulosum. It has a genome composed of about 13 Mb, which is even larger than the genome of budding yeast.

On April 14, 2003, the International Human Genome Sequencing Consortium announced that it had produced a “finished” human genome sequence, two years ahead of schedule. That is, it had done 99% of the sequencing that was possible with 2003 technology, the sequence was subject to an error rate of only one in 100,000, and all sequences were in the proper order. This was a significant improvement over the rough draft announced two years earlier.
Several hundred gaps remained to be filled, but they were mostly very challenging repetitive regions and centromeres.

As of December 6, 2010, more than 1440 complete genomes had been sequenced, of which 1372 were from microbes, according to the NCBI website (www.ncbi.nlm.nih.gov/genome). Table 24.1 presents a time line of some of the most important achievements in genome sequencing. In the following sections we will discuss the lessons we have learned from these sequences.

SUMMARY The base sequences of viruses and organisms ranging from phages to bacteria to animals and plants have been obtained. A rough draft and finished version of the human genome have also been obtained. Comparison of the genomes of closely related and more distantly related organisms can shed light on the evolution of these species.

The Human Genome Project

In 1990, American geneticists embarked on an ambitious quest: to map and ultimately sequence the entire human genome. This effort, which quickly became an international program, was somewhat controversial at first, partly because of the enormous effort and cost of carrying it through to its ultimate goal: knowing the entire base sequence of every one of the human chromosomes. The reason for the high cost, of course, is that the human genome is huge: more than 3 billion bp. To get an idea of the magnitude of this task, consider that if all 3 billion bases were written down, it would take about 500,000 pages of the journal Nature to contain all the information.
If you could stand the boredom, it would take you about 60 years, working 8 h/day, every day, at 5 bases a second, to read it all. Assuming a 1990 cost of about a dollar a base, the project would consume more than $3 billion, vastly more than we are used to devoting to a single biological project. In the end, more efficient sequencing methods allowed the project to be completed much sooner and at a lower cost than originally estimated.

Table 24.1 Milestones in Genomic Sequencing
Genome (Importance) | Size (bp) | Year
Phage φX174 (first genome) | 5375 | 1977
Phage λ (large-DNA phage) | 48,513 | 1983
Herpes simplex virus I (large-DNA eukaryotic virus) | 152,260 | 1988
Haemophilus influenzae (bacterium, first organism) | 1,830,000 | 1995
Mycoplasma genitalium (smallest bacterial genome) | 580,000 | 1995
Saccharomyces cerevisiae (yeast, first eukaryote) | 12,068,000 | 1996
Methanococcus jannaschii (first archaeon) | 1,660,000 | 1996
Escherichia coli (best studied bacterium) | 4,639,221 | 1997
Caenorhabditis elegans (first animal, roundworm) | 97,000,000 | 1998
Human chromosome 22 (first human chromosome) | 53,000,000 | 1999
Arabidopsis thaliana (first plant, mustard family) | 120,000,000 | 2000
Drosophila melanogaster (a favorite genetic model) | 180,000,000 | 2000
Human (working draft of the “holy grail” of genomics) | 3,200,000,000 | 2001
Plasmodium falciparum (the malaria parasite) | 23,000,000 | 2002
Anopheles gambiae (the major mosquito malaria carrier) | 278,000,000 | 2002
Fugu rubripes (tiger pufferfish) | 365,000,000 | 2002
Mus musculus (house mouse) | 2,500,000,000 | 2002
Ciona intestinalis (sea squirt, a primitive chordate) | 117,000,000 | 2002
Canis lupus familiaris (dog, working draft) | ~2,400,000,000 | 2003
Gallus gallus (chicken, first farm animal) | 1,050,000,000 | 2004
Human (finished sequence) | 3,200,000,000 | 2004
Oryza sativa (rice, first cereal grain) | 489,000,000 | 2005
Pan troglodytes (chimpanzee, our closest relative, working draft) | ~3,000,000,000 | 2005
Three trypanosomatids (Trypanosoma cruzi, T. brucei, and Leishmania major, parasites that cause severe human illness) | 25–55,000,000 | 2005
Populus trichocarpa (black cottonwood, first tree) | ~485,000,000 | 2006
First individual humans (two Caucasians, one African, and one Han Chinese) | 3,200,000,000 | 2007 and 2008
Homo neanderthalensis (our closest evolutionary relative, working draft) | ~3,000,000,000 | 2010

The original plan for the Human Genome Project was systematic and conservative: First, geneticists would prepare genetic and physical maps of the genome. These would contain the markers, or signposts, that would allow DNA sequences to be pieced together in the proper order. The bulk of the sequencing would be done only after the mapping was complete and clones representing all points on the map were in hand, systematically stored in freezers around the world. The original target date for completion of the sequence was 2005.

Then, in May of 1998, Craig Venter, who had established a private, for-profit company, Celera, to sequence the human genome (and other genomes), shocked the genomics community by announcing that Celera would complete a rough draft of the human genome by the end of 2000. That timetable was astonishing enough, but the method by which he proposed to do the sequencing was even more arresting. Instead of relying on a map, with the ordered clones used to build it, Venter proposed a shotgun sequencing approach in which the whole human genome would be chopped up and cloned, then the clones would be sequenced at random, and finally the sequences would be pieced together using powerful computer programs that find overlapping sequences.
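The phrase "computer programs that find overlapping sequences" can be illustrated with a toy greedy assembler. This is a minimal sketch under strong assumptions, not the method Celera used: reads are error-free, all come from one strand, no read contains a repeat long enough to mislead the merge, and the function names and toy sequence are invented for the example.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len: int = 3):
    """Repeatedly merge the pair of reads with the longest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # no remaining overlaps: leave the pieces as separate contigs
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Hypothetical 25-nt "genome" broken into four overlapping reads, given out of order.
toy_reads = ["AGCCTAGGAT", "ATGGCGTACG", "GGATCCA", "TACGTTAGCC"]
print(greedy_assemble(toy_reads))  # ['ATGGCGTACGTTAGCCTAGGATCCA']
```

Repeated sequences are what break this simple scheme on real genomes, since two unrelated reads can share the same repeat and be merged incorrectly. That is one reason maps and large-insert clones remained valuable.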
It was not long before Francis Collins, director of the publicly financed Human Genome Project, rose to Venter’s challenge and promised that he and his colleagues would also produce a rough draft by the end of 2000, and a polished final draft by 2003, using the map-then-sequence strategy. The upshot of this race was a tie of sorts. Venter and Collins appeared with President Clinton and other dignitaries at a ceremony in the East Room of the White House on June 26, 2000, to announce the completion of a rough draft of the human genome.

We will examine the two approaches to sequencing large genomes: mapping, then sequencing (clone by clone); and shotgun sequencing. But first, let us examine the cloning vectors that have been developed for massive projects like the Human Genome Project.

Vectors for Large-Scale Genome Projects

No matter which sequencing strategy is used, one must first clone fragments of the genome in appropriate vectors, and large fragments are particularly valuable. We will describe two of the most popular here: yeast artificial chromosomes and bacterial artificial chromosomes. The early mapping work relied on yeast artificial chromosomes, so we will begin with those.

Yeast Artificial Chromosomes

The main problem with the cloning tools described in Chapter 4 is that they do not hold enough DNA for large-scale physical mapping of the human genome. Even the cosmids accommodate DNA inserts up to only about 50 kb, which is too small for efficient mapping of regions spanning more than a million bases. Vectors called yeast artificial chromosomes, or YACs, were very useful in mapping the human genome because they could accommodate hundreds of kilobases each.
YACs containing a megabase or more are known as “megaYACs.” A YAC contains a left and right yeast chromosomal telomere (Chapter 21), which are both necessary to protect the chromosome’s ends, and a yeast centromere, which is necessary for segregation of sister chromatids to opposite poles of the dividing yeast cell. The centromere is placed adjacent to the left telomere, and a huge piece of human (or any other) DNA can be placed in between the centromere and the right telomere, as shown in Figure 24.7. The large DNA inserts are prepared by slightly digesting long pieces of human DNA with a restriction enzyme. The YACs, with their huge DNA inserts, can then be introduced into yeast cells, where they will replicate just as if they were normal yeast chromosomes.

Figure 24.7 Cloning in yeast artificial chromosomes. We begin with two tiny pieces of DNA from the two ends of a yeast chromosome. One of these, the left arm, contains the left telomere (yellow, labeled L) plus the centromere (red, labeled C). The right arm contains the right telomere (yellow, labeled R). These two arms are ligated to a large piece of foreign DNA (blue), for example several hundred kilobases of human DNA, to form the YAC, which can replicate in yeast cells along with the real chromosomes.

Using YACs, geneticists made great strides in the mapping phase of the Human Genome Project. They produced a genetic map of the whole genome that provided an average resolution of 0.7 centimorgan. A centimorgan (cM) is the distance that yields a 1% recombination frequency between two markers and corresponds to an average of about 1 Mb in humans. These researchers also produced relatively high-resolution physical maps of two of the smallest chromosomes, 21 and Y. These maps were especially useful in that they represented long stretches of overlapping DNA segments cloned in YACs.
Thus, in the days before the human genome was sequenced, if you were interested in a disease gene that mapped to one of these chromosomes, you had a much simplified task. You needed only to discover two markers flanking the gene of interest, look on the map to find which YAC or YACs contained these markers, obtain the YACs, and begin your final search for the gene.

Bacterial Artificial Chromosomes

Despite all the success they made possible in human genome mapping, YACs suffer from several serious drawbacks: They are inefficient (not many clones are obtained per microgram of DNA); they are hard to isolate from yeast cells; they are unstable; and they tend to contain scrambled inserts that are really composites of DNA fragments from more than one site. Bacterial artificial chromosomes (BACs) solve all of these problems and were therefore the vector of choice for much of the sequencing phase of the Human Genome Project.

BACs are based on a well-known natural plasmid that inhabits E. coli cells: the F plasmid. This plasmid allows conjugation between bacterial cells. In some conjugation events, the F plasmid itself is transferred from a donor F+ cell to a recipient F− cell, converting the latter to an F+ cell. In other events, a small piece of host DNA is transferred as an insert in the F plasmid (which is called an F′ plasmid if it has an insert of foreign DNA). And in still other events, the F plasmid inserts into the host chromosome and mobilizes the whole chromosome to pass from the donor cell to the recipient cell. Thus, because the E. coli chromosome contains over 4 million bp, the F plasmid can obviously accommodate a large insert of DNA.
In practice, BACs usually have inserts less than 300,000 bp (average about 150,000 bp), and these plasmids are stable in vivo and in vitro. Unlike the linear YACs, which tend to break under shearing forces, the circular, supercoiled BACs resist breakage. Figure 24.8 shows the map of one of the first BACs, which was developed by Melvin Simon and colleagues in 1992. It has an origin of replication and a cloning site with two restriction sites (for HindIII and BamHI) into which large DNA fragments may be inserted. It also has genes (the Par genes) that govern plasmid partition to the daughter cells and keep the plasmid copy number at about two per cell, which contributes to the stability of the plasmid. Finally, it has a chloramphenicol-resistance gene to enable selection of cells that have the plasmid.

Figure 24.8 Map of the BAC vector, pBAC108L. Key features include the cloning sites HindIII and BamHI, at top; the chloramphenicol resistance gene (CmR), used as a selection tool; the origin of replication (oriS); and the genes governing partition of plasmids to daughter cells (ParA and ParB). Labels on the map: Sal I, HindIII, Not I, BamHI, CmR, oriS, repE, ParA, ParB; vector size 6.9 kb.

SUMMARY Two high-capacity vectors have been used extensively in the Human Genome Project. Much of the mapping work was done with yeast artificial chromosomes (YACs), which can accept inserts of a million or more base pairs. Most of the sequencing work was performed with bacterial artificial chromosomes (BACs), which can accept up to about 300,000 bp. The BACs are more stable and easier to work with than the YACs.

The Clone-by-Clone Strategy

This strategy has inherent appeal because it is so systematic.
First, the whole genome is mapped by finding markers regularly spaced along each chromosome. A by-product of the mapping is a collection of clones corresponding to the markers. Because we already know the order of these clones, we can sequence each one and put that sequence in its proper place in the whole genome. Thus, this method is commonly called the clone-by-clone sequencing strategy. Aside from their usefulness in cloning, genetic and physical maps have another important benefit: They give us signposts to use when searching for the genes responsible for diseases.

In the next section, we will consider some of the most powerful methods used in mapping large genomes in preparation for sequencing. As you read this section, bear in mind that these techniques are designed to map markers that are not genes but simply stretches of DNA that vary from one individual to another. We have already seen one example of such markers: restriction fragment length polymorphisms (RFLPs).

Variable Number of Tandem Repeats

The greater the degree of polymorphism of a RFLP, the more useful it will be. If only 1 person in 100 has one form of the RFLP (the 6-kb fragment in Figure 24.1, for example), and the other 99 have the other form (the 4-kb and 2-kb fragments), one must screen many individuals before finding the one rare variant. This makes mapping very tedious. However, some RFLPs, called variable number tandem repeats, or VNTRs, are more useful. These derive from minisatellites (Chapter 5), stretches of DNA that contain a short core sequence repeated over and over in tandem (head to tail).
Because the number of repeats of the core sequence in a VNTR is likely to be different from one individual to another, VNTRs are highly polymorphic, and therefore relatively easy to map. However, VNTRs have a disadvantage as genetic markers: They tend to bunch together at the ends of chromosomes, leaving the interiors of the chromosomes relatively devoid of markers.

Sequence-Tagged Sites

Another kind of anonymous marker, which is very useful to genome mappers, is the sequence-tagged site (STS). STSs are short sequences, about 60–1000 bp long, that can be detected by PCR. Figure 24.9 illustrates how to use PCR to detect an STS. One must first know enough about the DNA sequence in the region being mapped to design short primers that will hybridize a few hundred base pairs apart and cause amplification of a predictable length of DNA in between. One can then apply PCR with these two primers to any unknown DNA; if the proper size amplified DNA fragment appears, then the unknown DNA has the STS of interest. Notice that hybridization of the primers to the unknown DNA is not enough; they must hybridize a specific number of base pairs apart to give the right size PCR fragment. This provides a check on the specificity of hybridization.

Figure 24.9 Sequence-tagged sites. We start with a large cloned piece of DNA, extending indefinitely in either direction. The sequences of small areas of this DNA are known, so one can design primers that will hybridize to these regions and allow PCR to produce double-stranded fragments of predictable lengths. In this example, two PCR primers (red) spaced 250 bp apart have been used. Several cycles of PCR generate many copies of a double-stranded PCR product that is precisely 250 bp long. Electrophoresis of this product allows one to measure its size exactly and confirm that it is the correct one.
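The two-part test for an STS (both primers must anneal, and the product must have the expected size) can be mimicked on a known sequence. The sketch below is a simplified "in silico PCR" with assumed, hypothetical primer sequences: it requires exact primer matches, looks only at the first matching site, and ignores everything a real reaction depends on, such as annealing temperature and mismatches.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def pcr_product_size(template: str, forward: str, reverse: str):
    """Predicted product length in bp, or None if either primer site is missing.

    `forward` matches the top strand; `reverse` anneals to the top strand,
    so its reverse complement is what appears in `template`.
    """
    template = template.upper()
    start = template.find(forward.upper())
    if start == -1:
        return None
    reverse_site = reverse_complement(reverse)
    end = template.find(reverse_site, start + len(forward))
    if end == -1:
        return None
    return end + len(reverse_site) - start


def has_sts(template: str, forward: str, reverse: str, expected_bp: int) -> bool:
    """True only if both primers match and the product has the expected size."""
    return pcr_product_size(template, forward, reverse) == expected_bp


# Hypothetical primers and a synthetic template built to give a 250-bp product.
fwd, rev = "GATTACAGGC", "TTGACCGTAA"
template = "CC" + fwd + "A" * 230 + reverse_complement(rev) + "GG"
print(pcr_product_size(template, fwd, rev), has_sts(template, fwd, rev, 250))
```

A template in which both primers match but lie the wrong distance apart fails the size check, which mirrors the specificity argument made in the text.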
One great advantage of STSs as a mapping tool is that no DNA must be cloned and examined and kept in someone’s freezer. Instead, the sequences of the primers used to generate an STS are published and then anyone in the world can order those same primers and find the same STS in an experiment that takes just a few hours. Another big advantage is that it takes much less DNA to perform PCR than to do a Southern blot. Microsatellites STSs are very useful in physical mapping or locating specific sequences in the genome. But they are worthless as markers in traditional genetic mapping unless they are polymorphic. Only then can we use them to determine genetic linkage. Fortunately, geneticists have discovered a class of STSs called microsatellites that are highly polymorphic. Microsatellites are similar to minisatellites in that they consist of a core sequence repeated over and over many times in a row. However, whereas the core sequence in typical minisatellites is a dozen or more base pairs long, the core in microsatellites is much smaller—usually only 2–4 bp long. In 1992, Jean Weissenbach and his colleagues produced a linkage map of the entire human genome based on 814 microsatellites containing a C–A dinucleotide repeat. They isolated cloned DNAs containing these microsatellites and used their sequences to design PCR primers that flank the repeats at each locus. A given pair of primers yielded a PCR product whose size depended on the number of C–A repeats in a given individual’s microsatellite at that locus. Happily, the number of repeats varied quite a bit from one individual to another. Besides the fact that microsatellites are highly polymorphic, they are also widespread and relatively uniformly distributed in the human genome. Thus, they are ideal as markers for both linkage and physical mapping. Genetic (linkage) mapping with microsatellites is done by the same technique outlined in Chapter 1 for traditional genetic markers in fruit flies. 
Instead of determining the recombination frequency between, say, wing shape and eye color, geneticists would determine the recombination frequency between two microsatellites. For example, consider a case in which a man’s DNA yields a microsatellite at one locus that is 78 bp long and a microsatellite at a nearby locus that is 42 bp long. His wife has a microsatellite at the first locus that is 102 bp long and a microsatellite at the second locus that is 36 bp long. Within limits, the more their children show nonparental combinations of these two markers in their gametes (e.g., a grandchild with microsatellites that are 78 and 36 bp long, respectively), the more recombination has occurred between the markers, and the farther apart the markers are on the chromosome.

Geneticists interested in physically mapping or sequencing a given region of a genome aim to assemble a set of clones called a contig, which contains contiguous (actually overlapping) DNAs spanning long distances. This is rather like putting together a jigsaw puzzle; the bigger the pieces, the easier the puzzle. Thus, it is essential to have vectors like BACs and YACs that hold big chunks of DNA. Assuming we have a BAC library of the human genome, we need some way to identify the clones that contain the region we want to map. This can be done in several ways. We could hybridize BAC DNA to a labeled DNA probe corresponding to the region of interest, but this is subject to some uncertainty due to possible nonspecific hybridization. A more reliable method is to look for STSs in the BACs. It is best to screen the BAC library for at least two STSs, spaced hundreds of kilobases apart, so BACs spanning a long distance are selected.
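The idea of screening a BAC library by STS content can be illustrated with a small Python sketch. Everything in it is hypothetical: the BAC names, the STS names, the STS content of each clone, and the assumption that the order of the STSs along the chromosome is already known (in practice that order is itself inferred from the overlaps). The sketch screens for two anchor STSs, then orders the positive clones by the STSs they share.

```python
# Hypothetical PCR screening results: which STSs each BAC contains.
library = {
    "BAC_A": {"STS1", "STS2"},
    "BAC_B": {"STS1", "STS2", "STS3"},
    "BAC_C": {"STS3", "STS4"},
    "BAC_D": {"STS4", "STS5"},
    "BAC_E": {"STS2", "STS3"},   # carries neither anchor STS
    "BAC_F": {"STS9"},           # from an unrelated region
}

# Assumed (not derived here) order of the STSs along the chromosome.
STS_ORDER = ["STS1", "STS2", "STS3", "STS4", "STS5"]


def screen(bacs: dict[str, set[str]], anchors: set[str]) -> dict[str, set[str]]:
    """Keep only the BACs that contain at least one of the anchor STSs."""
    return {name: content for name, content in bacs.items() if content & anchors}


def order_by_sts(bacs: dict[str, set[str]], sts_order: list[str]) -> list[str]:
    """Sort BACs left to right by the positions of the STSs they carry."""
    rank = {sts: i for i, sts in enumerate(sts_order)}
    spans = {}
    for name, content in bacs.items():
        positions = sorted(rank[s] for s in content if s in rank)
        if positions:  # skip clones with no mapped STS
            spans[name] = (positions[0], positions[-1])
    return sorted(spans, key=lambda name: spans[name])


def forms_contig(ordered: list[str], bacs: dict[str, set[str]]) -> bool:
    """True if every neighbouring pair of BACs shares at least one STS."""
    return all(bacs[a] & bacs[b] for a, b in zip(ordered, ordered[1:]))


positives = screen(library, {"STS1", "STS4"})
ordered = order_by_sts(positives, STS_ORDER)

print(ordered)                           # ['BAC_A', 'BAC_B', 'BAC_C', 'BAC_D']
print(forms_contig(ordered, positives))  # True
```

Note that BAC_E is never examined because it lacks both anchor STSs, which is one reason to choose anchors that are widely spaced across the region of interest.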
After we have found a number of positive BACs, we begin mapping by screening them for several additional STSs, so we can line them up in an overlapping fashion as shown in Figure 24.10. This set of overlapping BACs is our new contig. We can now begin finer mapping, and even sequencing, of the contig.

Figure 24.10 Mapping with STSs. At top left, several representative BACs are shown, with different symbols representing different STSs placed at specific intervals. In step (a) of the mapping procedure, screen for two or more widely spaced STSs. In this case screen for STS1 and STS4. All those BACs with either STS1 or 4 are shown at top right. The identified STSs are shown in color. In step (b), each of these positive BACs is further screened for the presence of STS2, STS3, and STS5. The colored symbols on the BACs at bottom right denote the STSs detected in each BAC. In step (c), align the STSs in each BAC to form the contig. Measuring the lengths of the BACs by pulsed-field gel electrophoresis helps to pin down the spacing between pairs of BACs.

Radiation Hybrid Mapping

Mapping with BACs sounds straightforward, but it presents difficulties. One of the most important is that BACs are so small relative to a whole human chromosome that creating a BAC contig of a whole chromosome would be unbearably laborious. So we need a method to find linkage between STSs that are even farther apart than those that could fit into a single BAC. Radiation hybrid mapping provides a way. We begin by irradiating human cells with lethal doses of ionizing radiation, such as x-rays or gamma rays, which break the human chromosomes into pieces. Next, we fuse these doomed human cells with hamster cells to form hybrid cells that contain only some of the human chromosome fragments. Then, we form clones of identical hybrid cells by growing groups of cells, each group deriving from a single progenitor cell. Finally, we examine clones of hybrid cells to see which STSs tend to be found together in the hybrid cells. The more often they are together, the closer together they are likely to be on a human chromosome.

In 1996, an international consortium of geneticists, including G.D. Schuler, published a human map based on STSs mapped by this technique. It contained more than 16,000 STS markers, plus about a thousand genetic markers mapped by classical linkage methods (family studies), which provided an overall framework for the map. The STS markers used in this study were a special class called expressed sequence tags (ESTs). These are STSs that are generated by starting with mRNAs and using the enzyme reverse transcriptase to make corresponding cDNAs. These cDNAs can then be amplified by PCR and cloned. Finally, both ends of the cDNAs are sequenced, yielding two “sequence tags” that are usually less than 500 bases long. Thus, ESTs represent genes that are expressed in the cell from which the mRNAs were isolated. Because the STS (or EST) method yields the sequence of only a small part of a gene, a given gene may be represented by many different ESTs in an EST database. To minimize such duplications, the mapping consortium confined their mapping to ESTs that represented the 3′-untranslated regions (3′-UTRs) of genes.
This strategy also has the advantage of avoiding most introns, which tend not to be found in 3′-UTRs. By 1998, the international consortium (P. Deloukas et al.) had refined and extended the map to include over 30,000 genes.

SUMMARY

Mapping the human genome requires a set of landmarks to which we can relate the positions of genes. Some of these markers are genes, but many more are nameless stretches of DNA, such as RFLPs, VNTRs, and STSs (including ESTs and microsatellites). The latter two are regions of DNA that can be identified by formation of a predictable length of amplified DNA by PCR with pairs of primers.

Shotgun Sequencing

The shotgun-sequencing strategy, first proposed by Craig Venter, Hamilton Smith, and Leroy Hood in 1996, bypasses the mapping stage and goes right to the sequencing stage. The sequencing starts with a set of BAC clones containing large DNA inserts, averaging about 150 kb. The insert in each BAC is sequenced on both ends using an automated sequencer that can usually read about 500 bases at a time, so 500 bases at each end of the clone will be determined. Assuming that 300,000 clones of human DNA are sequenced this way, that would generate 300 million bases of sequence (300,000 clones × 2 ends × 500 bases), or about 10% of the total human genome, and the 500-base sequenced regions would therefore occur on average every 5 kb in the genome. These 500-base sequences serve as an identity tag, called a sequence-tagged connector (STC), for each BAC clone. On average, assuming an average clone size of 150 kb, and an STC every 5 kb, 30 clones (150 kb/5 kb = 30) should share a given STC somewhere within their span. This is the origin of the term connector: each clone should be “connected” via its STCs to about 30 other clones.

The next step is to fingerprint each clone by digesting it with a restriction enzyme. This serves two important purposes. First, it tells the insert size (the sum of the sizes of all the fragments generated by the restriction enzyme).
Second, it allows one to eliminate aberrant clones whose fragmentation patterns do not fit the consensus of the overlapping clones. Note that this clone fingerprinting is not the same as mapping; it is just a simple check before sequencing begins.

The next step is to obtain the entire sequence of a BAC that looks interesting (a seed BAC). This is done by subdividing the BAC into smaller clones, frequently in a pUC-type vector with inserts averaging only about 2 kb. This whole BAC sequence allows the identification of the 30 or so other BACs that overlap with the seed: They are the ones with STCs that occur somewhere in the seed BAC. Next, one selects other BACs with minimal overlap with the original one and proceeds to sequence them. Then this process is repeated with other BACs with minimal overlap with the second set, and so forth. This strategy, called BAC walking, would in principle allow one laboratory to sequence the whole human genome, given enough time.

But Venter and colleagues did not have that much time, so they modified the procedure by sequencing BACs at random until they had about 35 billion nt of sequence. In principle that should cover the human genome ten times over, giving a high degree of coverage and accuracy. Then they fed all the sequence into a computer with a powerful program that found areas of overlap between clones and fit their sequences together, building the sequence of the whole genome.

As mentioned a little earlier, the bulk of the sequencing is done with pUC clones with relatively small inserts, only about 2 kb each. But these small inserts would not provide enough overlaps to piece together the whole genome. This drawback is especially apparent in regions of repeated DNA. A 2-kb cloned sequence from a 10-kb region of tandem DNA repeats would give no clues about where the cloned sequence fit within the larger repeat region; one part looks the same as another.
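The repeat problem is easy to reproduce in a toy model. The Python sketch below uses random sequence as a stand-in for real DNA; the 2-kb insert and 10-kb repeat region follow the numbers in the text, whereas the 500-bp repeat unit, the 3-kb unique flanks, the random seed, and the helper names (`random_dna`, `placements`) are assumptions made only for illustration.

```python
import random

rng = random.Random(0)  # fixed seed so the toy genome is reproducible


def random_dna(length: int) -> str:
    """Random A/C/G/T string standing in for unique genomic sequence."""
    return "".join(rng.choices("ACGT", k=length))


def placements(genome: str, clone: str) -> list[int]:
    """Every position at which the clone sequence matches the genome exactly."""
    positions = []
    i = genome.find(clone)
    while i != -1:
        positions.append(i)
        i = genome.find(clone, i + 1)  # allow overlapping matches
    return positions


left_flank = random_dna(3000)
right_flank = random_dna(3000)
repeat_unit = random_dna(500)
repeat_region = repeat_unit * 20          # 10 kb of tandem repeats
genome = left_flank + repeat_region + right_flank

repeat_clone = repeat_unit * 4            # 2-kb insert from inside the repeats
unique_clone = left_flank[500:2500]       # 2-kb insert from unique sequence
spanning_clone = left_flank[-1000:] + repeat_region + right_flank[:1000]

print(len(placements(genome, repeat_clone)))    # many equally good positions
print(len(placements(genome, unique_clone)))    # a single position
print(len(placements(genome, spanning_clone)))  # a single position
```

The 2-kb repeat clone fits at every unit boundary where four units still remain (at least 17 positions here), so nothing in its sequence says which one is correct. A clone long enough to include unique sequence on both sides of the repeats has only one possible placement.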
That is one way the BAC clones come in handy: They are large enough to cover almost any repeated region. They also provide overlaps