11-04-2012, 05:12 PM
Data mining for proteins characteristic of clades
INTRODUCTION
Biology textbooks typically use phenotypic characters to
describe clades, e.g. milk and hair for mammals. Not only
do these synapomorphies aid in phylogenetic inference, but
they also record key innovations in the history of life, as
exemplified by such famous clades as Amniota and Eutheria
(placental mammals). A number of papers have used molecular
synapomorphies to weigh in on phylogenetic debates. A
convincing molecular synapomorphy can often resolve a phylogeny
that cannot be unambiguously determined by more
continuously varying characters (1). Moreover, characteristic
proteins (2) or regulatory sequences (3)—i.e.
MATERIALS AND METHODS
Given a set of n genomes (n = 30 is typical) and a sequence
‘window’ length k, Conserv returns a list of families of
orthologous protein sequences of the desired length
(k amino acid residues). The number of families in the list
can be specified by the user with a typical value being
1000, but if Conserv determines that there are not 1000 sufficiently
conserved proteins in the set of genomes (e.g. if the
set of organisms includes reduced genomes or both eubacteria
and archaea), then Conserv will return a shorter list. The list
is initially ranked from ‘most conserved’ to ‘least conserved’
over the first m genomes in the set, where the user supplies
m < n and thus defines the in-group and out-group. In order
to find synapomorphies we further process the list as follows:
(i) we rerank the list by synaptitude;
Molecular synapomorphies
We first report on using Conserv to find molecular synapomorphies
for accepted and hypothesized bacterial clades.
For this experiment, m genomes form a putative clade, the
in-group, and n − m genomes putatively lie outside that
clade and form the out-group. Conserv finds sequences that
are highly conserved in the in-group and quite different in
the out-group. More specifically, it ranks ortholog families
by a metric we call synaptitude. We calculate the synaptitude
of an ortholog family by first computing all pairwise
similarity scores from the putative clade, sorting them, and
taking the median. Then we compute all m(n − m) pairwise
distances between orthologous in-group and out-group
sequences, and take the three-quartile score (the m
largest, where larger scores are less similar).
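The two summary statistics described above can be illustrated with a minimal Python sketch. It is not Conserv's implementation, and several pieces are assumptions made only for illustration: the scoring function (a plain fraction of mismatched residues between equal-length windows, where larger means less similar), the reading of the 'three-quartile score' as a nearest-rank 75th percentile, and the function names themselves. The excerpt does not say how the two numbers are combined into a single synaptitude value, so the sketch returns them as a pair.

```python
import math
from itertools import combinations, product
from statistics import median


def mismatch_fraction(a: str, b: str) -> float:
    """Hypothetical stand-in score: fraction of positions that differ.

    Assumption: both windows have the same length k. Larger = less similar.
    """
    if len(a) != len(b):
        raise ValueError("windows must have the same length k")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def upper_quartile(values):
    """Nearest-rank 75th percentile (one assumed reading of 'three-quartile')."""
    ordered = sorted(values)
    rank = math.ceil(0.75 * len(ordered))
    return ordered[rank - 1]


def synaptitude_components(in_group, out_group):
    """Return (median in-group score, upper-quartile in/out score).

    in_group:  the m orthologous windows from the putative clade
    out_group: the n - m orthologous windows from outside it
    """
    if len(in_group) < 2 or not out_group:
        raise ValueError("need at least two in-group and one out-group sequence")
    # All pairwise scores within the putative clade.
    within = [mismatch_fraction(a, b) for a, b in combinations(in_group, 2)]
    # All m(n - m) scores between in-group and out-group sequences.
    across = [mismatch_fraction(a, b) for a, b in product(in_group, out_group)]
    return median(within), upper_quartile(across)


if __name__ == "__main__":
    # Toy windows (k = 5), invented for illustration only.
    clade = ["MKVLA", "MKVLA", "MKVLG"]
    others = ["QRSTA", "MRSTA"]
    print(synaptitude_components(clade, others))
```

A family that is a good synapomorphy candidate under this reading has a small first component (the in-group is tightly conserved) and a large second component (most in-group/out-group pairs are dissimilar).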
DISCUSSION
Molecular synapomorphies are potentially very valuable phylogenetic
characters, because rare discontinuous events—a
large insertion or deletion, or the ‘sudden’ appearance of a
novel, highly conserved gene—are not easily erased by subsequent
point mutations. Moreover, molecular synapomorphies
are complementary to popular sequence-based
methods, such as maximum likelihood, which do not ordinarily
take into account non-ubiquitous characters, such as insertions
and deletions (‘gap columns’) and proteins unique to a
clade. To date, however, molecular synapomorphies have
been used on an ad hoc basis, with phylogenies