97 253 Bioinformatics

by taratuta

on 19-01-2017

wea25324_ch25_789-826.indd Page 820
820
23/12/10
8:44 AM user-f467
/Volume/204/MHDQ268/wea25324_disk1of1/0073525324/wea25324_pagefile
Chapter 25 / Genomics II: Functional Genomics, Proteomics, and Bioinformatics
SUMMARY Most proteins work with other proteins
to perform their functions. Several techniques are
available to probe these protein–protein interactions. Traditionally, yeast two-hybrid analysis has
been done, but now other methods are available.
These include protein microarrays, immunoaffinity
chromatography followed by mass spectrometry,
and combinations of experimental methods such as
phage display with computational methods. One of
the most useful fruits of such analyses is the discovery of functions for new proteins.
25.3 Bioinformatics
As our databases swell with billions of bases of sequence
from the human and other genomes, and countless protein
structures and protein–protein interactions, one crucial
problem will be to access and manipulate all those data. Accordingly, a new specialty has arisen, known as bioinformatics. Practitioners of bioinformatics must understand both
biology and computerized data processing, so they can manage the data collected during genomic and proteomic studies and write programs that allow scientists to use the data.
For example: BLAST is a program that searches a database
for a DNA or protein sequence similar to a sequence of interest and shows how the two sequences line up; and GRAIL
is a program that identifies genes in a database.
Two types of databases are already established. First,
we have generalized databases that include DNA and protein sequences from all organisms. Two generalized databases for DNA sequences are GenBank (http://www.ncbi.nlm.nih.gov/) and EMBL (http://www.ebi.ac.uk/embl/).
Swissprot (http://www.ebi.ac.uk/swissprot/) is a generalized protein sequence database. Second, we have specialized databases that deal with a particular organism. For
example, FlyBase is a database of the genome of the fruit fly
Drosophila melanogaster. You can access it online at http://flybase.bio.indiana.edu:82, and search it for genetic maps,
genes, DNA sequences, and other information. A similar
site, WormBase, provides the same kind of data for the
nematode C. elegans.
The problem, as William Gelbart has pointed out, is
that we are functional illiterates in understanding the genomic sequence. He uses a language analogy: We know a
few of the “nouns,” or polypeptide coding regions of the
genome, but we don’t know the “verbs,” “adjectives,” and
“adverbs” that tell when and how much of each gene to
express. And we don’t know the “grammar” that tells how
polypeptides assemble into complexes to do their jobs,
such as catalyzing biochemical pathways. Bioinformatics
will supply the databases and annotation that will be
needed to understand genomic grammar fully.
SUMMARY Bioinformatics involves the building
and use of biological databases, some of which contain the DNA sequences of genomes. Bioinformatics
is essential for mining the massive amount of biological data for meaningful knowledge about gene
structure and expression.
Finding Regulatory Motifs in Mammalian Genomes
Here is an example of scientists using pure computational
tools to discover regulatory motifs in mammalian genomes.
In the discovery phase of their study, these scientists did not
use test tubes (which would be an in vitro study) or whole
cells or animals (an in vivo study). Thus, their work could
be described as in silico, in reference to the silicon-based
chips in their computers.
Earlier in this chapter, we saw an example of an experimental approach to identifying target sites for some
known transcription factors. But what about regulatory
sites that interact with molecules nobody has identified
yet? In 2005, Eric Lander and Manolis Kellis and their
colleagues reported the results of a bioinformatic approach to this question. They reasoned that regulatory
motifs (6–10 bp long) are most likely to be found in the
upstream regulatory regions of genes, where transcription
factors are likely to bind, and in the 3′-untranslated
regions (UTRs) of genes, where miRNAs and other regulatory molecules bind and regulate mRNA stability and
translatability. They further reasoned that regulatory
motifs are likely to be conserved among related organisms. So they compared the human, mouse, rat, and dog
genomes to find conserved sequences in the 5′-flanking
regions, and in the 3′-UTRs of genes.
These researchers focused on about 17,000 genes in
the four species that were well annotated, so there was
little doubt that they were real genes. They defined the
promoter region of each gene as the noncoding sequence
within a 4-kb region centered on the transcription start
site, and they defined the 3′-UTR as the region between
the translation stop codon and the polyadenylation signal
as annotated for each mRNA. As a control, they looked at
approximately 123 Mb of sequence from the last two
introns in many genes. The terminal introns are thought to
be poor in regulatory motifs and so should provide a good
negative control.
The authors defined conservation as follows: A “conserved occurrence” is a motif that is absolutely conserved in
all four species. The “conservation rate” is the ratio of conserved occurrences of a motif to total occurrences of that motif in the part of the human genome under study (promoter
regions, for example). Finally, the “motif conservation score,”
or “MCS,” is the number of standard deviations by which
the conservation rate of a motif exceeds the conservation
rate for random motifs of the same size.
To illustrate conservation, the authors chose the 8-mer
TGACCTTG, which is a binding site for the Err-α transcription factor. This motif occurs 434 times in human promoter regions, of which 162 are conserved occurrences.
Thus, the conservation rate is 162/434, or 37%. On the
other hand, a random 8-mer has a conservation rate of
only 6.8% in promoter regions. Furthermore, the conservation of the 8-mer is specific to promoter regions. In introns, the conservation of this sequence is only 6.2%.
Statistical analysis of these and other data allowed the authors to calculate a conservation score. The MCS for the
Err-α motif is 25.2 standard deviations, which reflects the
very small probability of finding a conservation rate of
37% against a background rate of 6.8%.
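The arithmetic in this example can be checked with a short sketch. The conservation rate follows directly from the counts given above. The z-score assumes a simple binomial model for the number of conserved occurrences; that model is an assumption made here for illustration and may differ in detail from the statistic the authors actually used. The function names are hypothetical.

```python
import math

def conservation_rate(conserved, total):
    """Fraction of a motif's occurrences that are conserved in all species."""
    return conserved / total

def binomial_z_score(conserved, total, background_rate):
    """Standard deviations by which the conserved count exceeds the
    count expected from the background rate (binomial assumption)."""
    expected = total * background_rate
    sd = math.sqrt(total * background_rate * (1.0 - background_rate))
    return (conserved - expected) / sd

# Values quoted in the text for TGACCTTG in human promoter regions.
rate = conservation_rate(162, 434)
z = binomial_z_score(162, 434, 0.068)
print(f"conservation rate = {rate:.1%}")
print(f"z-score under the binomial assumption = {z:.1f}")
```

Under this assumption the score comes out at about 25 standard deviations, in line with the reported MCS of 25.2.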
To get a more general idea of conservation of regulatory motifs, the authors calculated the MCSs of known
transcription factor binding sites from the TRANSFAC database. They found that 63% of these motifs had MCS > 3
and nearly 50% had MCS > 6. So they defined “highly
conserved motifs” as those having MCS > 6. The authors
listed three reasons why so many known regulatory motifs
failed to achieve an MCS as high as 3: They may be erroneously identified; they may not be conserved in the four species studied; or they may not be common enough.
The authors identified 174 highly conserved motifs in
promoter regions, of which 59 strongly matched, and 10
weakly matched, previously-identified regulatory motifs in
TRANSFAC. The other 105 motifs are likely to represent
new regulatory elements. If these new motifs are authentic
regulatory elements, the genes with which they are associated are likely to show some tissue specificity of expression. That is because genes that are controlled by a
common factor are likely to be active in the same tissues.
The authors consulted databases listing gene expression
data from 75 tissues and found that 86% of the known
motifs and 50% of the new motifs were associated with
genes whose activity was significantly enriched in one or
more tissues.
Another check on authenticity is to see if the elements
show positional bias with respect to the transcription start
site. In fact, the highly conserved elements showed a strong
tendency to cluster within 100 bp of the start site, while
random elements were randomly distributed across the
4-kb region analyzed. Taken together, these data demonstrated that most of the identified motifs are likely to be
part of authentic regulatory elements.
The authors found 106 highly conserved motifs in the
3′-UTRs of genes. However, because there was no database
of 3′-UTR elements similar to TRANSFAC, they had to use
other means to check their authenticity. Fortunately, two
characteristics stood out. First, the 3′-UTR motifs, unlike
those in the promoter region, showed a strong directional
bias—they tended to be found on one strand and not the
other. This is consistent with the hypothesis that the
3′-UTR motifs act in mRNAs, where they bind to miRNAs
and other molecules to regulate mRNA stability or translatability. That is because the motif must be on the correct
strand to be transcribed into mRNA. On the other hand,
motifs in promoter regions act at the DNA level. They
typically bind to activators, which can work in either orientation, so there is no strand bias.
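A strand bias of this kind can be measured by counting how often a motif occurs on the given strand and how often its reverse complement occurs, which corresponds to the motif lying on the opposite strand. The following is a minimal sketch; the sequence and motif are made up for illustration, and the function names are hypothetical.

```python
def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    pairs = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(pairs[base] for base in reversed(seq))

def count_overlapping(seq, motif):
    """Count occurrences of motif in seq, allowing overlaps."""
    return sum(1 for i in range(len(seq) - len(motif) + 1)
               if seq[i:i + len(motif)] == motif)

def strand_counts(seq, motif):
    """Return (same-strand count, opposite-strand count) for a motif."""
    return (count_overlapping(seq, motif),
            count_overlapping(seq, reverse_complement(motif)))

# Made-up sequence: two copies of the motif, one copy of its reverse complement.
toy_utr = "GGACATTCCATTTACATTCCAGGTGGAATGTCC"
sense, antisense = strand_counts(toy_utr, "ACATTCCA")
print(sense, antisense)
```

A motif that acts in the mRNA should show a same-strand count well above its opposite-strand count across many genes, whereas a motif that acts at the DNA level should show roughly equal counts.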
The second characteristic of the highly conserved motifs in the 3′-UTRs of genes is that they had a strong preference for an 8-base length, and for A in the last position. The
motifs in the promoter regions showed no such biases.
These characteristics are consistent with the hypothesis
that the 8-mers are sites for hybridization to miRNAs,
which tend to begin with a U, followed by seven bases that
are complementary to sites in the mRNAs they regulate.
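The complementarity described above can be made concrete. The sketch below derives the 8-mer site expected for a miRNA: the reverse complement of miRNA bases 2 through 8, followed by an A opposite the miRNA's first base. The miR-1 sequence is quoted from memory as an illustration and should be checked against the miRNA registry before being relied on; the function name is hypothetical.

```python
def expected_8mer_site(mirna):
    """Return the mRNA 8-mer (5'->3') expected to pair with a miRNA:
    the reverse complement of miRNA bases 2-8 followed by a final A."""
    pairs = {"A": "U", "U": "A", "G": "C", "C": "G"}
    seed = mirna[1:8]                      # bases 2-8 (0-based slice)
    site = "".join(pairs[b] for b in reversed(seed))
    return site + "A"

mir1 = "UGGAAUGUAAAGAAGUAUGUAU"            # illustrative miR-1 sequence
print(expected_8mer_site(mir1))
```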
The authors were interested in the apparent relationship of the highly conserved 8-mers and miRNAs, so they
searched the miRNA registry, which contained 207 different human miRNAs, for matches with the 8-mers, and
found 43.5% of the known human miRNAs matched one
of the 8-mers perfectly, while only 2% matched an equal
number of control 8-mers. The 8-mers that did not match
a known miRNA were evolving faster than those that did,
suggesting that the matching 8-mers cannot alter their
sequences without impairing hybridization to miRNAs,
which is important to gene regulation.
Finally, the authors used the conserved 8-mer motifs to
find new miRNA genes. They searched the four genomes
for conserved sequences complementary to the highly conserved 8-mer motifs. Then they examined the sequences
surrounding the conserved sequences for ability to form
the stable stem-loop structures that are characteristic of
miRNAs. They found 242 such stable stem-loop structures,
which presumably encode miRNAs. Of these, 113 encode
known miRNAs, leaving 129 more that encode predicted
miRNAs. The authors chose 12 of these at random and
checked for expression in pooled adult tissues (the only in
vitro experimental part of their work). They found that six
were expressed. Thus, many of the 129 predicted miRNA
genes probably really do encode miRNAs. This means that
many miRNA genes probably remain to be discovered, and
the control of gene expression by miRNAs is probably even
more widespread than had been believed.
SUMMARY Using computational biology techniques, Lander and Kellis have discovered highly
conserved sequence motifs in the promoter regions
and 3′-UTRs of four mammalian species, including
humans. The motifs in the promoter regions probably represent binding sites for transcription factors.
Most of the motifs in the 3′-UTRs probably represent binding sites for miRNAs.
Using the Databases Yourself
Some very useful databases are kept at the National Center
for Biotechnology Information (NCBI). In this section we will
see a few simple examples of how to access and use the
data. To see how a search works, let us imagine you are a
physician treating a patient with warts. Suspecting a viral
cause, and even having a candidate virus (papilloma virus)
in mind, you excise the warts, homogenize them in a buffer
containing a detergent to break open the virus particles,
purify the DNA, and perform PCR with primers specific
for the candidate virus. You obtain a band, which confirms
that the suspected viral DNA is present in the warts. To
investigate further the exact strain of the virus, you sequence part of the DNA that you have amplified by PCR
and obtain the sequencing gel pictured in Figure 5.19.
If you need practice reading a sequencing gel, ignore the
sequence below and write down the sequence of the first 21
bases from the gel, beginning with the C at the extreme
bottom of the gel. If you want to skip that step, here is the
sequence, written in lowercase, which is less ambiguous
than capitals because it is harder to confuse a g with a c
than a G with a C:
caaaaaacggaccgggtgtac
To begin a search, go to the NCBI home page and click
BLAST, or just start at the NCBI BLAST home page: http://
www.ncbi.nlm.nih.gov/BLAST/. To search in a nucleotide
database, look under the Nucleotide heading and click
Nucleotide-nucleotide BLAST [blastn]. The large box near the
top asks for the query sequence (the sequence you want to
compare to the database). You can type in a sequence, but it is
easier to copy and paste a sequence from another document.
When you have finished entering your sequence, you
can choose just part of it to search for using the Query
subrange boxes. For example, if you wanted to search for
residues 10–21 of your sequence, enter 10 in the “From”
window and 21 in the “To” window (but we will use all 21
nucleotides in our search). You can also select the database
in which you want to search using the top Choose Search
Set box. Because we think this is a viral sequence we ignore
the human and mouse database options and select Others.
The default database under Others is Nucleotide collection
(nr/nt), where nr stands for “nonredundant.” This includes
all the nucleotide sequences in several different databases
and is the most comprehensive of all. To start searching,
click the BLAST button near the bottom of the page. You
should receive a search status message, including a request
ID (RID) number.
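The Query subrange boxes use 1-based, inclusive positions. For readers who prepare a query in a script before pasting it into the form, the sketch below shows the equivalent selection in Python, whose slices are 0-based and exclude the end index. The helper name is hypothetical.

```python
def query_subrange(seq, start, end):
    """Return residues start..end of seq, using 1-based inclusive
    coordinates as in the BLAST 'From' and 'To' boxes."""
    if not 1 <= start <= end <= len(seq):
        raise ValueError("subrange lies outside the sequence")
    return seq[start - 1:end]

query = "caaaaaacggaccgggtgtac"
print(len(query))
print(query_subrange(query, 10, 21))
```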
If you receive your results promptly, you can proceed
with your analysis. However, you may have to wait. In that
case, remember the RID, so you can log onto the NCBI
website later and retrieve the results for that ID number.
You will receive your results in several forms. First, you
will see a colored bar graph indicating the rough extent of
match between your query sequence and various sequences
in the database. In this case, blue indicates the best match.
You can mouse-over each bar to see the identity of the
DNA sequence that matches your query sequence. If you
click on a bar you will get more details about the matching
sequence. Below the bar graph are the sequences that match
the query sequence, within certain limits, which we will
discuss later. Each of the matching sequences is identified,
with the best match given first and then the others in descending order of closeness of match.
Two scores are assigned to each sequence. The first is a
bit score (S), which is related to the number of matches between the query sequence and the sequence from the database. The larger the bit score, the better the match. For the
best match in this case, it is 42.1, which is good. The second
score is the expect value (E value). This is the number of
matches yielding the corresponding bit score that we would
expect to see by chance. Thus, the lower the E value, the
better the match. Really good matches give E values much
less than 1.0. For the best match in this case it is 0.021.
What is the identity of the database sequence with the
best match to your query sequence? The mouse-over on the
top bar says it is the human papilloma virus (HPV) type 31,
which suggests that this is the strain of virus that caused
the warts in your patient. You can get the same information from the short list of matching sequences below the
bars. The top black bar corresponds to a Mus musculus
(house mouse) gene, which also gives a fairly good match.
Moving down the page in the results, we come to the
alignments between the query sequence and the database
sequences. You can see that your query sequence matches
the HPV 31 sequence perfectly, but matches the mouse
gene sequence in only 19 out of 21 positions. However,
that apparently minor difference makes a big difference in
the E value. The mouse gene has an E value of 0.33, which
is significantly less exciting than 0.021.
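The relation between the two scores can be sketched numerically. The code below assumes the standard relation E = m·n·2^(−S), where S is the bit score and m·n is the effective search space; this formula is not given in the text above and is used here as a stated assumption. Because both hits come from the same query and database, the search space cancels, and the quoted values imply a bit score for the mouse hit. The function names are hypothetical.

```python
import math

def implied_search_space(bit_score, e_value):
    """Effective search space m*n implied by E = m*n * 2**(-S)."""
    return e_value * 2.0 ** bit_score

def bit_score_for(e_value, search_space):
    """Bit score that would give e_value in the same search space."""
    return math.log2(search_space / e_value)

# Values quoted in the text: best hit S = 42.1, E = 0.021; mouse hit E = 0.33.
space = implied_search_space(42.1, 0.021)
mouse_bits = bit_score_for(0.33, space)
print(f"implied bit score of the mouse hit = {mouse_bits:.1f}")
```

On this assumption, the roughly 16-fold difference in E value corresponds to a difference of about 4 bits, because each additional bit of score halves the number of matches expected by chance.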
The query sequence we have entered is only 21 bases
long, which is unusually short. To see the effect of increasing the length of the query sequence, either read the
sequencing gel further (to 42 bases), or enter the following
sequence into a new search:
caaaaaacggaccgggtgtacaacttttactatggcgtgaca.
The new E value is 5e-14, which is a very small number and
indicates that the perfect match in all 42 positions has a
high degree of significance.
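One way to see why length matters is to estimate how often an exact match would arise by chance. The sketch assumes a random sequence in which the four bases are equally likely and independent, which real genomes are not, and a hypothetical database size chosen only for illustration. The results are not BLAST E values, which also account for inexact matches.

```python
def expected_chance_matches(query_length, database_bases):
    """Expected number of exact matches to a query in a random sequence
    with equally likely, independent bases."""
    positions = database_bases - query_length + 1
    return positions * 0.25 ** query_length

DATABASE_BASES = 10**11   # hypothetical database size, for illustration only

for length in (21, 42):
    print(length, expected_chance_matches(length, DATABASE_BASES))
```

Each added base divides the expected number of chance matches by four, so doubling the query from 21 to 42 bases reduces it by a factor of roughly 4^21.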
What if you do not have a DNA sequence of your own
to investigate, but you want to use the NCBI database for
general information? For example, you may be interested in
finding genes that are associated with certain human diseases, such as colon cancer. To start such a search, you could
go to the NCBI website and click on Genes and Expression
in the menu on the left. Then enter “colon cancer” in
the box at top, leave the database option on the default,
“all databases,” and click “Search.” The next page asks you
to limit your search, so click on “Gene: gene-centered information.” The next page gives you a list of genes associated
with colon cancer. The fifth entry is MLH1 (at least it was
in December, 2010, but the order will change with time).
Click on the MLH1 link to receive data on this gene.
The summary reveals that MLH1 is the human homolog of
the mutL gene in E. coli, which encodes a protein involved
in mismatch repair (Chapter 20). The MLH1 gene in humans is also involved in mismatch repair, and MLH1 mutations cause mismatches to build up and, therefore,
mutations to accumulate. This presumably predisposes
people to develop cancer, and colon cancer in particular,
because genes that normally keep cancer in check (tumor
suppressor genes) can be inactivated by mutation, and
genes that predispose cells to lose control of their growth
(oncogenes) can be activated by mutation.
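The same kind of topic search can in principle be scripted. NCBI offers a programmatic interface called E-utilities, which is not described in this text; the sketch below only builds a search URL for the Gene database and does not contact the server. The parameter values are illustrative assumptions, the function name is hypothetical, and the current E-utilities documentation should be consulted before use.

```python
from urllib.parse import urlencode

ESEARCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_gene_search_url(term, max_results=5):
    """Build an E-utilities ESearch URL for the Gene database."""
    params = {"db": "gene", "term": term, "retmax": max_results}
    return ESEARCH_BASE + "?" + urlencode(params)

print(build_gene_search_url("colon cancer"))
```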
We can also use the NCBI site as a source of information about protein structure. For example, suppose we
wanted to see the structure of the p53 protein (a tumor
suppressor gene product), whose inactivation is a feature of
the majority of human cancers. Go to the NCBI website. In
the box at the top, type “p53 complexed with DNA” and
click “Go.” You will be back at the Entrez page, from which
you selected “Gene” when looking for information about
colon cancer. This time, select “Structure.” You will be
presented with a list of entries. Scroll down to the structure
named “1TUP.” In December, 2010, this was entry number 18,
but that will change with time. Click on the structure. This
will bring up a page of information about this structure.
To see it in 3-D, you will need the appropriate free software.
If you already have the Cn3D software on your computer, simply click “Structure view in Cn3D.” If not, click
“Download Cn3D.”
Once you have installed Cn3D, click the “Structure view
in Cn3D” button. You will see a structure based on an x-ray
crystallography study of p53 complexed with DNA. The
Cn3D software allows you to rotate the structure any way
you wish with your mouse. Start with the mouse pointer on
the left of the structure. Left click and hold the button down
and move the mouse to the right. The structure will rotate
from left to right. You can also rotate it from top to bottom,
or through any angle in between horizontal and vertical.
Rotate it so you can clearly see the interaction between the
zinc module and the major groove of the DNA.
You can also look at the 3D structures of some of the
proteins we have studied in previous chapters. For example, look for the structures of GAL4 and the glucocorticoid
receptors. In both cases, the rotation will make the structures even clearer than they were in this book.
SUMMARY The NCBI website contains a vast store
of biological information, including genomic and
proteomic data. You can start with a sequence and
discover the gene it belongs to, and compare that
sequence with that of similar genes. You can also
start with a topic you want to study and query the
database for information on that topic. Or you can
look up a protein of interest and view the structure
of that protein in three dimensions by rotating the
structure on your computer screen.
SUMMARY
Functional genomics is the study of the expression of
large numbers of genes. One branch of this study is
transcriptomics, which is the study of transcriptomes—all
the transcripts an organism makes at any given time. One
approach to transcriptomics is to create DNA microarrays
or DNA microchips, holding thousands of cDNAs or
oligonucleotides, then to hybridize labeled RNAs (or
corresponding cDNAs) from cells to these arrays or chips.
The intensity of hybridization to each spot reveals the
extent of expression of the corresponding gene. Such
arrays can be used to analyze the timing and location of
expression of many genes at once.
SAGE (serial analysis of gene expression) allows one
to determine which genes are expressed in a given tissue
and the extent of that expression. Short tags,
characteristic of particular genes, are generated from
cDNAs and ligated together between linkers. The ligated
tags are then sequenced to determine which genes are
expressed and how abundantly. Cap analysis of gene
expression (CAGE) gives the same information as SAGE
about which genes are expressed, and how abundantly, in
a given tissue. However, because it focuses on the 5′-ends
of mRNAs, it also allows the identification of
transcription start sites and, therefore, helps locate
promoters.
High-density whole chromosome transcriptional
mapping studies have shown that the majority of
sequences in cytoplasmic polyadenylated RNAs derive
from non-exon regions of 10 human chromosomes.
Furthermore, almost half of the transcription from these
same 10 chromosomes is nonpolyadenylated. Taken
together, these results indicate that the great majority of
stable nuclear and cytoplasmic transcripts of these
chromosomes comes from regions outside the exons. This
may help to explain the great differences between species,
such as humans and chimpanzees, whose exons are almost
identical.
Genomic functional profiling can be performed by
creating mutants in an organism by replacing genes one at
a time with an antibiotic resistance gene flanked by
oligomers that serve as a barcode to identify each mutant.
Then the whole group of mutants can be grown together
under various conditions to see which mutants disappear
most rapidly. Functional profiling can also be done by
inactivating genes via RNAi.
Tissue-specific expression profiling can be done by
examining the spectrum of mRNAs whose levels are
decreased by an exogenous miRNA, and comparing that
to the spectrum of expression of genes at the mRNA level
in various tissues. If the miRNA in question causes a
decrease in the levels of the mRNAs that are naturally low
in cells in which the miRNA is expressed, it suggests that
the miRNA is at least part of the cause of those natural
low levels. This kind of analysis has implicated miR-124
in destabilizing mRNAs in brain tissue, and miR-1 in
destabilizing mRNAs in muscle tissue.
Chromatin immunoprecipitation followed by DNA
microarray analysis (ChIP-chip analysis) can be used to
identify DNA-binding sites for activators and other
proteins. In organisms with small genomes, such as yeast,
all of the intergenic regions can be included in the
microarray. But with large genomes, such as the human
genome, that is now impractical. To narrow the field, CpG
islands can be used, since they are associated with gene
control regions. Also, if the timing or conditions of an
activator’s activity are known, the control regions of genes
known to be activated at those times, or under those
conditions, can be used.
The mouse can be used as a human surrogate in large-scale expression studies that would be ethically impossible
to perform on humans. For example, scientists have
studied the expression of almost all the mouse orthologs
of the genes on human chromosome 21. They have
followed the expression of these genes through various
stages of embryonic development and have catalogued the
embryonic tissues in which the genes are expressed.
Single-nucleotide polymorphisms can probably
account for many genetic conditions caused by single
genes, and even multiple genes. They might also be able to
predict a person’s response to drugs. A haplotype map
with over 10 million SNPs will make it easier to sort out
the important SNPs from those with no effect. Structural
variation (insertions, deletions, inversions, and other
rearrangements of chunks of DNA) is also a surprisingly
prominent source of variation in human genomes. Some
structural variation can in principle predispose certain
people to contract diseases, but some is presumably
benign, and some is demonstrably beneficial.
The sum of all proteins produced by an organism is its
proteome, and the study of these proteins, even smaller
sets of them, is called proteomics. Current research in
proteomics requires first that proteins be resolved,
sometimes on a massive scale. One of the best tools
available for separation of many proteins at once is 2-D
gel electrophoresis. After they are separated, proteins must
be identified, and the best method for doing that involves
digestion of the proteins one by one with proteases and
identifying the resulting peptides by mass spectrometry.
Someday microchips with antibodies attached may allow
analysis of proteins in complex mixtures without
separation.
Most proteins work with other proteins to perform
their functions. Several techniques are available to probe
these protein–protein interactions. Traditionally, yeast two-hybrid analysis has been done, but now other methods are
available. These include protein microarrays, immunoaffinity
chromatography followed by mass spectrometry, and
combinations of experimental methods such as phage
display and computational methods. One of the most
useful fruits of such analyses is the discovery of functions
for new proteins.
Bioinformatics involves the building and use of
biological databases, some of which contain the DNA
sequences of genomes. Bioinformatics is essential for
mining the massive amount of genomic information for
meaningful knowledge about gene structure and
expression.
Using computational biology techniques, Lander and
Kellis have discovered highly conserved sequence motifs
in the promoter regions and 3′-UTRs of four mammalian
species, including humans. The motifs in the promoter
regions probably represent binding sites for transcription
factors. Most of the motifs in the 3′-UTRs probably
represent binding sites for miRNAs.
The NCBI website contains a vast store of biological
information, including genomic and proteomic data. You
can start with a sequence and discover the gene it belongs
to, and compare that sequence with that of similar genes.
You can also start with a topic you want to study and
query the database for information on that topic. Or you
can look up a protein of interest and view the structure of
that protein in three dimensions by rotating the structure
on your computer screen.
REVIEW QUESTIONS
1. Describe the process of making a DNA microchip
(oligonucleotide array).
2. Describe a SAGE experiment to measure transcription in
cancer cells of a certain type. Show how the production of a
ditag works, with actual sequences of your own invention.
3. Explain how the cap-trapper in a CAGE experiment ensures
that only full-length cDNAs are captured.
4. Explain the roles of the MmeI, XmaJI, and XbaI restriction
sites in the CAGE procedure.
5. Describe how genomic functional profiling can be
performed by gene knockout in yeast.
6. Describe how genomic functional profiling can be
performed by RNAi in higher eukaryotes.
wea25324_ch25_789-826.indd Page 825
23/12/10
8:44 AM user-f467
/Volume/204/MHDQ268/wea25324_disk1of1/0073525324/wea25324_pagefile
Suggested Readings
825
7. Describe a tissue-specific functional profiling method that
shows the effects of miRNAs on gene expression. Give a
hypothetical example of positive results.
A N A LY T I C A L Q U E S T I O N S
8. Explain how ChIP-chip analysis works. Show how it can be
used to find DNA regions (enhancers) that bind to a
particular activator.
1. List in order the steps you would perform to make an oligonucleotide array in which two of the spots contain the dinucleotides AC and AT. You may ignore all of the other spots.
9. Explain how tag sequencing (ChIP-seq) works. What
problems in ChIP-chip analysis are solved by ChIP-seq?
2. Describe a hypothetical experiment using a DNA microarray
to measure the transcription from viral genes at two stages
of infection of cells by the virus. Present sample results.
10. What are cis-regulatory modules (CRMs)? Why are they
easier to find than single enhancers?
11. Outline a genomic strategy for finding enhancers that bind
to unknown proteins. Describe at least one drawback to
this strategy.
12. Ren and colleagues employed ChIP-chip analysis using an
anti-TAF1 antibody to locate promoters in human cells. They
found that these promoters were not enriched in TATA boxes.
Given that TAF1 is part of a transcription factor (TFIID) that
binds to TATA boxes, why is it not surprising that many of
the promoters identified lack TATA boxes? You will need
information from Chapter 11 to answer this question.
13. Describe how in situ expression analysis works. Give a
hypothetical example of a positive result.
14. What are SNPs? Why are most of them unimportant? How
can some of them be useful? How can they be abused?
15. Compare and contrast in a general way the techniques used in,
and information obtained in, transcriptomics and proteomics.
16. Describe a bioinformatic approach to identifying human
gene control motifs.
3. Perform a BLAST search on the first 20 nt of this sequence
(the sequence is divided into blocks of 10 nt): ttaagtgaaa
taaagagtga atgaaaaaat aatatcctta. What gene did you identify?
What was the best E value you obtained? Now try again
with all 40 nt. Did you still retrieve the same gene? What is
the best E value you obtained this time? Why is the E value
different this time? On what chromosome is this gene located? Is there any relationship between this gene and prostate cancer in men? If so, what is the relationship?
4. You are an MD/PhD developmental biologist (highly
trained in techniques of molecular biology) studying the
pathogenesis of Type I insulin-dependent diabetes mellitus
(IDDM). You have several patients who are predisposed to
becoming diabetic, and control subjects who have no family
history or predisposition to becoming diabetic, enrolled in a
clinical study that will involve the removal of a small section of pancreatic beta cells. You want to analyze the differences in gene expression between cells from these two
groups of subjects. Describe the experimental method(s)
you would use, and what information you hope to obtain
from this study.
17. Explain how MS/MS analysis can yield the sequence of a
protein. Present hypothetical results.
18. Explain how isotope coded affinity tags (ICATs) can enable
you to quantify the changes in protein concentration in cells
grown under two different conditions.
19. In Figure 25.21, estimate what has happened to the
concentrations of peptides 4–7 when cells are shifted from
condition 1 (no serum) to condition 2 (+serum).
20. Explain how you would use stable isotope labeling by
amino acids in cell culture (SILAC) to quantify the changes
in protein concentration in cells grown under two different
conditions. Show sample results.
21. How would you measure the absolute concentration of a
particular protein in a cell?
22. What do the data in Figure 25.22 tell us about the accuracy
of estimating a protein’s concentration from its mRNA’s
concentration?
23. What would happen to the gray data points in Figure 25.22
if there were a lower correlation between the abundances of
orthologous proteins in the two organisms? What would
happen if there were a higher correlation?
24. Describe how affinity tagging and mass spectrometry can be
used to examine an organism’s interactome.
25. Explain how a protein microarray could be used to examine
an organism’s interactome.
26. Describe an experiment in which you would use phage
display to investigate an organism’s interactome.
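As background for the BLAST exercise above (the 20-nt versus 40-nt query), the following is a minimal sketch of how an E value can depend on alignment score and query length. It assumes the standard Karlin-Altschul form E = K·m·n·e^(−λS). The function name, the scoring of +1 per matched base, the database size, and the λ and K values are illustrative assumptions, not figures taken from this chapter or from any particular BLAST run. Real BLAST reports also use effective lengths and bit scores, which this sketch ignores.

```python
import math

def expected_hits(raw_score, query_len, db_len, lam=1.28, k=0.46):
    """Karlin-Altschul estimate of chance hits: E = K * m * n * exp(-lambda * S).

    lam and k are assumed, illustrative statistical parameters.
    """
    return k * query_len * db_len * math.exp(-lam * raw_score)

# Hypothetical database size in nucleotides (an assumption for illustration).
DB_LEN = 1e10

# Assume a perfect match that scores +1 per aligned base.
e20 = expected_hits(raw_score=20, query_len=20, db_len=DB_LEN)
e40 = expected_hits(raw_score=40, query_len=40, db_len=DB_LEN)

print(f"20-nt perfect match: E = {e20:.3g}")
print(f"40-nt perfect match: E = {e40:.3g}")
```

Under these assumptions the search space (m·n) only doubles when the query doubles, while the score term shrinks exponentially, so the longer perfect match receives the much smaller E value.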
SUGGESTED READINGS
General References and Reviews
Abbott, A. 1999. A post-genomic challenge: Learning to read
patterns of protein synthesis. Nature 402:715–20.
Cheung, V.G., M. Morley, F. Aguilar, A. Massimi,
R. Kucherlapati, and G. Childs. 1999. Making and reading
microarrays. Nature Genetics Supplement 21:15–19.
Cox, J. and M. Mann. 2007. Is proteomics the new genomics?
Cell 130:395–98.
Hieter, P. and M. Boguski. 1997. Functional genomics: It’s all
how you read it. Science 278:601–02.
Kruglyak, L. and D.L. Stern. 2007. An embarrassment of
switches. Science 317:758–59.
Kumar, A. and M. Snyder. 2002. Protein complexes take the bait.
Nature 415:123–24.
Lipshutz, R.J., S.P.A. Fodor, T.R. Gingeras, and D.J. Lockhart.
1999. High density synthetic oligonucleotide arrays. Nature
Genetics Supplement 21:20–24.
Marx, J. 2006. A clearer view of macular degeneration. Science
311:1704–05.
Service, R.F. 1998. Microchip arrays put DNA on the spot.
Science 282:396–99.
Young, R.A. 2000. Biomedical discovery with DNA arrays. Cell
102:9–15.
Research Articles
Arbeitman, M.N., E.E.M. Furlong, F. Imam, E. Johnson, B.H.
Null, B.S. Baker, M.A. Krasnow, M.P. Scott, R.W. Davis, and
K.P. White. 2002. Gene expression during the life cycle of
Drosophila melanogaster. Science 297:2270–75.
Blanchette, M., A.R. Bataille, X. Chen, C. Poitras, J. Laganière,
G. Deblois, V. Giguère, V. Ferretti, D. Bergeron, B. Coulombe,
and F. Robert. 2006. Genome-wide computational prediction
of transcriptional regulatory modules reveals new insights
into human gene expression. Genome Research 16:656–68.
Cheng, J., T.R. Gingeras, et al. 2005. Transcriptional maps of
10 human chromosomes at 5-nucleotide resolution. Science
308:1149–54.
Gavin, A.-C., M. Bosche, R. Krause, P. Grandi, M. Marzioch,
A. Bauer, et al. 2002. Functional organization of the yeast
proteome by systematic analysis of protein complexes. Nature
415:141–47.
Giaever, G., A.M. Chu, L. Ni, C. Connelly, L. Riles,
S. Veronneau, et al. 2002. Functional profiling of the
Saccharomyces cerevisiae genome. Nature 418:387–91.
Gygi, S.P., B. Rist, S.A. Gerber, F. Turecek, M.H. Gelb, and
R. Aebersold. 1999. Quantitative analysis of complex
protein mixtures using isotope-coded affinity tags. Nature
Biotechnology 17:994–99.
Ho, Y., A. Gruhler, A. Heilbut, G.D. Bader, L. Moore,
S.L. Adams, D. Figeys, and many other authors. 2002.
Systematic identification of protein complexes in
Saccharomyces cerevisiae by mass spectrometry. Nature
415:180–83.
Iyer, V.R., M.B. Eisen, D.T. Ross, G. Schuler, T. Moore, J. Lee,
et al. 1999. The transcriptional program in the response of
human fibroblasts to serum. Science 283:83–87.
Lim, L.P., N.C. Lau, P. Garrett-Engele, A. Grimson, J.M. Schelter,
J. Castle, D.P. Bartel, P.S. Linsley, and J.M. Johnson. 2005.
Microarray analysis shows that some microRNAs downregulate
large numbers of target mRNAs. Nature 433:769–73.
Pennacchio, L.A., et al. 2006. In vivo enhancer analysis of human
conserved non-coding sequences. Nature 444:499–502.
Ren, B., F. Robert, J.J. Wyrick, O. Aparicio, E.G. Jennings,
I. Simon, et al. 2000. Genome-wide location and function of
DNA binding proteins. Science 290:2306–09.
Sönnichsen, B., et al. 2005. Full-genome RNAi profiling of early
embryogenesis in Caenorhabditis elegans. Nature 434:462–69.
Stelzl, U., et al. 2005. A human protein–protein interaction
network: A resource for annotating the proteome. Cell
122:957–68.
Tong, A.H., B. Drees, G. Nardelli, G.D. Bader, B. Brannetti,
L. Castagnoli, et al. 2002. A combined experimental and
computational strategy to define protein interaction networks
for peptide recognition modules. Science 295:321–24.
Velculescu, V.E., L. Zhang, B. Vogelstein, and K.W. Kinzler. 1995.
Serial analysis of gene expression. Science 270:484–87.
Wang, D.G., J.B. Fan, C.J. Siao, A. Berno, P. Young, R. Sapolsky,
et al. 1998. Large-scale identification, mapping, and
genotyping of single-nucleotide polymorphisms in the human
genome. Science 280:1077–82.
Xie, X., J. Lu, E.J. Kulbokas, T.R. Golub, V. Mootha,
K. Lindblad-Toh, E.S. Lander, and M. Kellis. 2005. Systematic
discovery of regulatory motifs in human promoters and 3′
UTRs by comparison of several mammals. Nature 434:338–45.
Zhu, H., M. Bilgin, R. Bangham, D. Hall, A. Casamayor,
P. Bertone, et al. 2001. Global analysis of protein activities
using proteome chips. Science 293:2101–05.