21-05-2012, 05:23 PM
Bioinformatics Sequence and Genome Analysis
Bioinformatics Sequence and Genome Analysis.pdf (Size: 8.55 MB / Downloads: 117)
Historical Introduction and Overview
The first sequences to be collected were those of proteins, 2
DNA sequence databases, 3
Sequence retrieval from public databases, 4
Sequence analysis programs, 5
The dot matrix or diagram method for comparing sequences, 5
Alignment of sequences by dynamic programming, 6
Finding local alignments between sequences, 8
Multiple sequence alignment, 9
Prediction of RNA secondary structure, 9
Discovery of evolutionary relationships using sequences, 10
Importance of database searches for similar sequences, 11
The FASTA and BLAST methods for database searches, 11
Predicting the sequence of a protein by translation of DNA sequences, 12
Predicting protein secondary structure, 13
The first complete genome sequence, 14
ACEDB, the first genome database, 15
REFERENCES, 15
2 CHAPTER 1
THE DEVELOPMENT OF SEQUENCE ANALYSIS METHODS has depended on the contributions of
many individuals from varied scientific backgrounds. This chapter provides a brief historical
account of the more significant advances that have taken place, as well as an overview
of the chapters of this book. Because many contributors cannot be mentioned due to space
constraints, additional references to earlier and current reference books, articles, reviews,
and journals provide a broader view of the field and are included in the reference lists to
this chapter.
THE FIRST SEQUENCES TO BE COLLECTED WERE THOSE OF PROTEINS
The development of protein-sequencing methods (Sanger and Tuppy 1951) led to the
sequencing of representatives of several of the more common protein families such as
cytochromes from a variety of organisms. Margaret Dayhoff (1972, 1978) and her collaborators
at the National Biomedical Research Foundation (NBRF), Washington, DC, were the
first to assemble databases of these sequences into a protein sequence atlas in the 1960s, and
their collection center eventually became known as the Protein Information Resource (PIR,
formerly Protein Identification Resource; http://watson.gmu.edu:8080/pirwww/index.
html). The NBRF maintained the database from 1984, and in 1988, the PIR-International
Protein Sequence Database (http://www-nbrf.georgetown.edu/pir) was established as a
collaboration of NBRF, the Munich Center for Protein Sequences (MIPS), and the Japan
International Protein Information Database (JIPID).
Dayhoff and her coworkers organized the proteins into families and superfamilies based
on the degree of sequence similarity. Tables that reflected the frequency of changes observed
in the sequences of a group of closely related proteins were then derived. Proteins that were
less than 15% different were chosen to avoid the chance that the observed amino acid
changes reflected two sequential amino acid changes instead of only one. From aligned
sequences, a phylogenetic tree was derived showing graphically which sequences were most
related and therefore shared a common branch on the tree. Once these trees were made,
they were used to score the amino acid changes that occurred during evolution of the genes
for these proteins in the various organisms from which they originated (Fig. 1.1).
Figure 1.1. Method of predicting phylogenetic relationships and probable amino acid changes during
the evolution of related protein sequences. Shown are three highly conserved sequences (A, B, and
C) of the same protein from three different organisms. The sequences are so similar that each position
should only have changed once during evolution. The proteins differ by one or two substitutions,
allowing the construction of the tree shown. Once this tree is obtained, the indicated amino
acid changes can be determined. The particular changes shown are examples of two that occur much
more often than expected by a random replacement process.
ORGANISM A
ORGANISM B
ORGANISM C
Margaret Dayhoff
HISTORICAL INTRODUCTION AND OVERVIEW 3
Subsequently, a set of matrices (tables)—the percent amino acid mutations accepted by
evolutionary selection or PAM tables—which showed the probability that one amino acid
changed into any other in these trees was constructed, thus showing which amino acids are
most conserved at the corresponding position in two sequences. These tables are still used
to measure similarity between protein sequences and in database searches to find
sequences that match a query sequence. The rule used is that the more identical and conserved
amino acids that there are in two sequences, the more likely they are to have been
derived from a common ancestor gene during evolution. If the sequences are very much
alike, the proteins probably have the same biochemical function and three-dimensional
structural folds. Thus, Dayhoff and her colleagues contributed in several ways to modern
biological sequence analysis by providing the first protein sequence database as well as
PAM tables for performing protein sequence comparisons. Amino acid substitution tables
are routinely used in performing sequence alignments and database similarity searches,
and their use for this purpose is discussed in Chapters 3 and 7.
DNA SEQUENCE DATABASES
DNA sequence databases were first assembled at Los Alamos National Laboratory (LANL),
New Mexico, by Walter Goad and colleagues in the GenBank database and at the European
Molecular Biology Laboratory (EMBL) in Heidelberg, Germany. Translated DNA
sequences were also included in the Protein Information Resource (PIR) database at the
National Biomedical Research Foundation in Washington, DC. Goad had conceived of the
GenBank prototype in 1979; LANL collected GenBank data from 1982 to 1992. GenBank
is now under the auspices of the National Center for Biotechnology Information (NCBI)
(http://www.ncbi.nlm.nih.gov). The EMBL Data Library was founded in 1980
(http://www.ebi.ac.uk). In 1984 the DNA DataBank of Japan (DDBJ), Mishima, Japan,
came into existence (http://www.ddbj.nig.ac.jp). GenBank, EMBL, and DDBJ have now
formed the International Nucleotide Sequence Database Collaboration (http://www.
ncbi.nlm.nih.gov/collab), which acts to facilitate exchange of data on a daily basis. PIR has
made similar arrangements.
Initially, a sequence entry included a computer filename and DNA or protein sequence
files. These were eventually expanded to include much more information about the
sequence, such as function, mutations, encoded proteins, regulatory sites, and references.
This information was then placed along with the sequence into a database format that
could be readily searched for many types of information. There are many such databases
and formats, which are discussed in Chapter 2.
The number of entries in the nucleic acid sequence databases GenBank and EMBL has
continued to increase enormously from the daily updates. Annotating all of these new
sequences is a time-consuming, painstaking, and sometimes error-prone process. As time
passes, the process is becoming more automated, creating additional problems of accuracy
and reliability. In December 1997, there were 1.26 109 bases in GenBank; this
number increased to 2.57 109 bases as of April 1999, and 1.0 1010 as of September
2000. Despite the exponentially increasing numbers of sequences stored, the implementation
of efficient search methods has provided ready public access to these sequences.
To decrease the number of matches to a database search, non-redundant databases that
list only a single representative of identical sequences have been prepared. However, many
sequence databases still include a large number of entries of the same gene or protein
sequences originating from sequence fragments, patents, replica entries from different
databases, and other such sequences.
Many types of sequence
databases are
described in the first
annual issue of the
journal Nucleic Acids
Research.
The growth of the
number of sequences
in GenBank can be
tracked at http://www.
ncbi.nlm.nih.gov/Gen
Bank/genebankstats.
html.
Walter Goad
4 CHAPTER 1
SEQUENCE RETRIEVAL FROM PUBLIC DATABASES
An important step in providing sequence database access was the development of Web
pages that allow queries to be made of the major sequence databases (GenBank, EMBL,
etc.). An early example of this technology at NCBI was a menu-driven program called GENINFO
developed by D. Benson, D. Lipman, and colleagues. This program searched rapidly
through previously indexed sequence databases for entries that matched a biologist’s query.
Subsequently, a derivative program called ENTREZ (http://www.ncbi.nlm.nih.gov/Entrez)
with a simple window-based interface, and eventually a Web-based interface, was developed
at NCBI. The idea behind these programs was to provide an easy-to-use interface with a
flexible search procedure to the sequence databases.
Sequence entries in the major databases have additional information about the
sequence included with the sequence entry, such as accession or index number, name and
alternative names for the sequence, names of relevant genes, types of regulatory
sequences, the source organism, references, and known mutations. ENTREZ accesses this
information, thus allowing rapid searches of entire sequence databases for matches to one
or more specified search terms. These programs also can locate similar sequences (called
“neighbors” by ENTREZ) on the basis of previous similarity comparisons. When asked to
perform a search for one or more terms in a database, simple pattern search programs will
only find exact matches to a query. In contrast, ENTREZ searches for similar or related
terms, or complex searches composed of several choices, with great ease and lists the
found items in the order of likelihood that they matched the original query. ENTREZ
originally allowed straightforward access to databases of both DNA and protein sequences
and their supporting references, and even to an index of related entries or similar
sequences in separate or the same databases. More recently, ENTREZ has provided access
to all of Medline, the full bibliographic database of the National Library of Medicine
(NLM), Washington, DC. Access to a number of other databases, such as a phylogenetic
database of organisms and a protein structure database, is also provided. This access is
provided without cost to any user—private, government, industry, or research—a decision
by the staff of NCBI that has provided a stimulus to biomedical research that cannot
be underestimated. NCBI presently handles several million independent accesses to their
system each day.
A note of caution is in order. Database query programs such as ENTREZ greatly facilitate
keeping up with the increasing number of sequences and biomedical journals.
However, as with any automated method, one should be wary that a requested database
search may not retrieve all of the relevant material, and important entries may be
missed. Bear in mind that each database entry has required manual editing at some
stage, giving rise to a low frequency of inescapable spelling errors and other problems.
On occasion, a particular reference that should be in the database is not found because
the search terms may be misspelled in the relevant database entry, the entry may not be
present in the database, or there may be some more complicated problem. If exhaustive
and careful attempts fail, reporting such problems to the program manager or system
administrator should correct the problem.
David Lipman
HISTORICAL INTRODUCTION AND OVERVIEW 5
SEQUENCE ANALYSIS PROGRAMS
Because DNA sequencing involves ordering a set of peaks (A, G, C, or T) on a sequencing
gel, the process can be quite error-prone, depending on the quality of the data.
As more DNA sequences became available in the late 1970s, interest also increased in
developing computer programs to analyze these sequences in various ways. In 1982 and
1984, Nucleic Acids Research published two special issues devoted to the application of computers
for sequence analysis, including programs for large mainframe computers down to
the then-new microcomputers. Shortly after, the Genetics Computer Group (GCG) was
started at the University of Wisconsin by J. Devereux, offering a set of programs for analysis
that ran on a VAX computer. Eventually GCG became commercial (http://www.gcg).
Other companies offering microcomputer programs for sequence analysis, including Intelligenetics,
DNAStar, and others, also appeared at approximately the same time. Laboratories
also developed and shared computer programs on a no-cost or low-cost basis. For example,
to facilitate the collection of data, the programs PHRED (Ewing and Green 1998; Ewing et
al. 1998) and PHRAP were developed by Phil Green and colleagues at the University of
Washington to assist with reading and processing sequencing data. PHRED and PHRAP are
now distributed by CodonCode Corporation (http://www.codoncode.com).
These commercial and noncommercial programs are still widely used. In addition, Web
sites are available to perform many types of sequence analyses; they are free to academic
institutions or are available at moderate cost to commercial users. Following is a brief
review of the development of methods for sequence analysis.
THE DOT MATRIX OR DIAGRAM METHOD FOR COMPARING SEQUENCES
In 1970, A.J. Gibbs and G.A. McIntyre (1970) described a new method for comparing two
amino acid and nucleotide sequences in which a graph was drawn with one sequence written
across the page and the other down the left-hand side. Whenever the same letter
appeared in both sequences, a dot was placed at the intersection of the corresponding
sequence positions on the graph (Fig. 1.2). The resulting graph was then scanned for a
series of dots that formed a diagonal, which revealed similarity, or a string of the same
characters, between the sequences. Long sequences can also be compared in this manner
on a single page by using smaller dots.
The dot matrix method quite readily reveals the presence of insertions or deletions
between sequences because they shift the diagonal horizontally or vertically by the amount
of change. Comparing a single sequence to itself can reveal the presence of a repeat of the
same sequence in the same (direct repeat) or reverse (inverted repeat or palindrome) orientation.
This method of self-comparison can reveal several features, such as similarity
between chromosomes, tandem genes, repeated domains in a protein sequence, regions of
low sequence complexity where the same characters are often repeated, or self-complementary
sequences in RNA that can potentially base-pair to give a double-stranded structure.
Because diagonals may not always be apparent on the graph due to weak similarity,
Gibbs and McIntyre counted all possible diagonals and these counts were compared to
those of random sequences to identify the most significant alignments.
Maizel and Lenk (1981) later developed various filtering and color display schemes that
greatly increased the usefulness of the dot matrix method. This dot matrix representation
of sequence comparisons continues to play an important role in analysis of DNA and protein
sequence similarity, as well as repeats in genes and very long chromosomal sequences,
as described in Chapter 3 (p. 59).
Methods for DNA
sequencing were developed
in 1977 by
Maxam and Gilbert
(1977) and Sanger et
al. (1977). They are
described in greater
detail at the beginning
of Chapter 2.
6 CHAPTER 1
ALIGNMENT OF SEQUENCES BY DYNAMIC PROGRAMMING
Although the dot matrix method can be used to detect sequence similarity, it does not
readily resolve similarity that is interrupted by regions that do not match very well or that
are present in only one of the sequences (e.g., insertions or deletions). Therefore, one
would like to devise a method that can find what might be a tortuous path through a dot
matrix, providing the very best possible alignment, called an optimal alignment, between
the two sequences. Such an alignment can be represented by writing the sequences on successive
lines across the page, with matching characters placed in the same column and
unmatched characters placed in the same column as a mismatch or next to a gap as an
insertion (or deletion in the other sequence), as shown in Figure 1.3. To find an optimal
alignment in which all possible matches, insertions, and deletions have been considered to
find the best one is computationally so difficult that for proteins of length 300, 1088 comparisons
will have to be made (Waterman 1989).
To simplify the task, Needleman and Wunsch (1970) broke the problem down into a
progressive building of an alignment by comparing two amino acids at a time. They started
at the end of each sequence and then moved ahead one amino acid pair at a time, allowing
for various combinations of matched pairs, mismatched pairs, or extra amino acids in
one sequence (insertion or deletion). In computer science, this approach is called dynamic
programming. The Needleman and Wunsch approach generated (1) every possible
alignment, each one including every possible combination of match, mismatch, and single
insertion or deletion, and (2) a scoring system to score the alignment. The object was to
determine which was the best alignment of all by determining the highest score. Thus,
every match in a trial alignment was given a score of 1, every mismatch a score of 0, and
individual gaps a penalty score. These numbers were then added across the alignment to
Figure 1.2. A simple dot matrix comparison of two DNA sequences, AGCTAGGA and GACTAGGC.
The diagonal of dots reveals a run of similar sequence CTAGG in the two sequences.
G
A
C
T
A
G
G
C
A G C T A G G A
Figure 1.3. An alignment of two sequences showing matches, mismatches, and gaps (). The best
or optimal alignment requires that all three types of changes be allowed.
SEQUENCE A
SEQUENCE B
A
A
C
C
G
G
ΛE
ΛY
EΛ
V
I
D
D
G
G
I
I
HISTORICAL INTRODUCTION AND OVERVIEW 7
obtain a total score for the alignment. The alignment with the highest possible score was
defined as the optimal alignment.
The procedure for generating all of the possible alignments is to move sequentially
through all of the matched positions within a matrix, much like the dot matrix graph (see
above), starting at those positions that correspond to the end of one of the sequences, as
shown in Figure 1.4. At each position in the matrix, the highest possible score that can be
achieved up to that point is placed in that position, allowing for all possible starting points
in either sequence and any combination of matches, mismatches, insertions, and deletions.
The best alignment is found by finding the highest-scoring position in the graph, and then
tracing back through the graph through the path that generated the highest-scoring positions.
The sequences are then aligned so that the sequence characters corresponding to this
path are matched.
Figure 1.4. Simplified example of Needleman-Wunsch alignment of sequences GATCTA and
GATCA. First, all matches in the two sequences are given a score of 1, and mismatches a score of 0
(not shown), chosen arbitrarily for this example. Second, the diagonal 1s are added sequentially, in
this case to a total score of 4. At this point the row cannot be extended by another match of 1 to a
total score of 5. However, an extension is possible if a gap is placed in GATCA to produce
GATC A, where is the gap. To add the gap, a penalty score is subtracted from the total match
score of 5 now appearing in the last row and column. The best alignment is found starting with the
sequence characters that correspond to the highest number and tracing back through the positions
that contributed to this highest score.