Data mining for proteins characteristic of clades

seminar paper · 11-04-2012, 05:12 PM

INTRODUCTION
Biology textbooks typically use phenotypic characters to
describe clades, e.g. milk and hair for mammals. Not only
do these synapomorphies aid in phylogenetic inference, but
they also record key innovations in the history of life, as
exemplified by such famous clades as Amniota and Eutheria
(placental mammals). A number of papers have used molecular
synapomorphies to weigh in on phylogenetic debates. A
convincing molecular synapomorphy can often resolve a phylogeny
that cannot be unambiguously determined by more
continuously varying characters (1). Moreover, characteristic
proteins (2) or regulatory sequences (3)—i.e.

MATERIALS AND METHODS
Given a set of n genomes (n = 30 is typical) and a sequence
‘window’ length k, Conserv returns a list of families of
orthologous protein sequences of the desired length
(k amino acid residues). The user can specify the number of
families in the list; a typical value is 1000. If Conserv finds
fewer than the requested number of sufficiently conserved
proteins in the set of genomes (e.g. if the set of organisms
includes reduced genomes or both eubacteria and archaea), it
returns a shorter list. The list
is initially ranked from ‘most conserved’ to ‘least conserved’
over the first m genomes in the set, where the user supplies
m < n and thus defines the in-group and out-group. In order
to find synapomorphies we further process the list as follows:
(i) we rerank the list by synaptitude;
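
The ranking and grouping conventions described above can be illustrated with a minimal Python sketch. It is not Conserv's actual interface: the Family container, the conservation field, the min_conservation cutoff and both function names are hypothetical, and the conservation score is assumed to have been computed already over the first m genomes.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Family:
    """One ortholog family (hypothetical container, not Conserv's own type)."""
    name: str
    windows: List[str]   # one k-residue window per genome, in-group genomes first
    conservation: float  # assumed score over the first m genomes; higher = more conserved


def split_groups(windows: List[str], m: int) -> Tuple[List[str], List[str]]:
    """Split one family's windows into in-group (first m) and out-group (remaining n - m)."""
    if not 0 < m < len(windows):
        raise ValueError("m must satisfy 0 < m < n")
    return windows[:m], windows[m:]


def initial_ranking(families: List[Family], max_families: int = 1000,
                    min_conservation: float = 0.0) -> List[Family]:
    """Rank from most to least conserved, keeping at most max_families.

    Families below min_conservation (an assumed cutoff) are dropped, so the
    result can be shorter than max_families.
    """
    kept = [f for f in families if f.conservation >= min_conservation]
    kept.sort(key=lambda f: f.conservation, reverse=True)
    return kept[:max_families]
```

Dropping weakly conserved families before truncating mirrors the behaviour described in the text, where a set of genomes with too few conserved proteins yields a shorter list than requested.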

Molecular synapomorphies
We first report on using Conserv to find molecular synapomorphies
for accepted and hypothesized bacterial clades.
For this experiment, m genomes form a putative clade, the
in-group, and n - m genomes putatively lie outside that
clade and form the out-group. Conserv finds sequences that
are highly conserved in the in-group and quite different in
the out-group. More specifically, it ranks ortholog families
by a metric we call synaptitude. We calculate the synaptitude
of an ortholog family by first computing all m(m - 1)/2 pairwise
similarity scores from the putative clade, sorting them, and
taking the median. Then we compute all m(n - m) pairwise
distances between orthologous in-group and out-group
sequences, and take the three-quartile score (the m(n - m)/4th
largest, where larger scores are less similar).
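
The two statistics that enter the synaptitude can be sketched in Python as follows. Several choices here are assumptions rather than details from the text: the scoring function (a simple fraction of mismatched residues, where larger means less similar), the function names, and the reading of the three-quartile score as the value a quarter of the way down the in-group/out-group scores sorted from largest to smallest. This excerpt does not state how the two statistics are combined into a single synaptitude value, so the sketch returns them separately.

```python
import math
from itertools import combinations, product
from statistics import median
from typing import Callable, List, Tuple


def mismatch_fraction(a: str, b: str) -> float:
    """Assumed stand-in score: fraction of differing residues (larger = less similar)."""
    if len(a) != len(b):
        raise ValueError("windows must have the same length k")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def synaptitude_statistics(
    in_group: List[str],
    out_group: List[str],
    distance: Callable[[str, str], float] = mismatch_fraction,
) -> Tuple[float, float]:
    """Return (median in-group score, three-quartile in-group/out-group score)."""
    if len(in_group) < 2 or not out_group:
        raise ValueError("need at least two in-group and one out-group sequence")

    # All m(m - 1)/2 pairwise scores within the putative clade.
    within = [distance(a, b) for a, b in combinations(in_group, 2)]

    # All m(n - m) scores between in-group and out-group sequences, largest first.
    across = sorted(
        (distance(a, b) for a, b in product(in_group, out_group)),
        reverse=True,
    )

    # Assumed reading of "three-quartile": a quarter of the scores are at least this large.
    rank = math.ceil(len(across) / 4)
    return median(within), across[rank - 1]
```

A family that is a good candidate synapomorphy would show a small first value (the clade members agree with each other) and a large second value (they differ from the out-group).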

DISCUSSION
Molecular synapomorphies are potentially very valuable phylogenetic
characters, because rare discontinuous events—a
large insertion or deletion, or the ‘sudden’ appearance of a
novel, highly conserved gene—are not easily erased by subsequent
point mutations. Moreover, molecular synapomorphies
are complementary to popular sequence-based
methods, such as maximum likelihood, which do not ordinarily
take into account non-ubiquitous characters, such as insertions
and deletions (‘gap columns’) and proteins unique to a
clade. To date, however, molecular synapomorphies have
been used on an ad hoc basis, with phylogenies
