07-09-2016, 12:52 PM
1453471923-Abstract.docx (Size: 89.73 KB / Downloads: 5)
ABSTRACT.
One of the most fundamental aspects of bioinformatics in understanding sequence evolution
and relationships is molecular phylogenetics, in which the evolutionary histories of living organisms are represented by finite directed (weighted) graphs, in particular, directed (weighted) trees. There are basically two types of phylogenetic methods, distance based methods and character based methods. Distance based methods include two clutering based algorithms, UPGMA, NJ, and two optimality based algorithms, Fitch- Margoliash and minimum evolution [7]. This paper focuses on distance based methods. The paper starts with some preliminary knowledge and definitions in the area, including finite directed graphs, directed trees and matrices. It discusses the verification of the metric property of distance matrices, including detections of errors if a distance matrix fails to satisfy the metric property, and then provides an algorithm in modifying the distance matrix to satisfy the metric property. The second part of the paper is a brief survey based on the excerpts from the references, on various frequently used distance based phylogenetic tree construction methods, both cluster-based and optimality base methods, including UPGMA, Neighbor Joining, Fitch- Margoliash, and Minimum Evolution methods. Also, it discusses the assessment of the phylogenetic trees and some analysis of the algorithms.
INTRODUCTION
Phylogenetic is the study of the evolutionary histories of living organisms, and represent the evolutionary
divergences by finite directed (weighted) graphs, or directed (weighted) trees, known as phylogeny. Based on molecular sequences, phylogenetic trees can be built to reconstruct the evolutionary tree of species involved. In particular, the representation derived from genes or protein sequences is known as gene phylogeny, while the representation of the evolutionary path of the species are often referred as species phylogeny. A gene phylogeny is, to some extend, a local desciption. It only describes the evolution of a particular gene or encoded protein, and this sequence could evolve much more or less differently than other genes in the genome, or it may have completely different evolutionary history from the rest of the genome due to horizontal gene transfers. Therefore, the evolution of a particular gene only provides a local picture, not necessarily reflect the global evolutionary picture of the species. We can only hope that we could assemble the jigsaw puzzle pieces with a wide selection of gene family to give an overall assessment of the species evolution. While in general the topology in phylogenetic trees represents the relationships between the taxa, assigning scales to edges in the trees could provide extra information on the amount of evolution divergence as well as the time of the divergence. The phylogenetic trees with the scaled edges are called phylograms, while the non-scaled phylogenetic trees are called cladograms. Purely for the sake of computers data processing, some special formats were artificially created, such as Newick format. From biological point of view, the building of phylogenetic trees can be roughly divided into the following steps .
CHOOSE MOLECULAR MARKER.
In building molecular phylogenetic trees, either nucleotide or protein sequence data can be used, but the outcomes from the choice could be quite different. The rule of thumb is to select nucleotide sequences when some very closely related organisms are studied because they tend to evolve more rapidly than proteins; and to select protein sequence (or slowly evolving nucleotide sequences) when more widely divergent groups of organisms are studied.
PERFORM SEQUENCE ALIGNMENT.
The sequence alignment establishes positional correspondence in evolution, and aligned positions are assumed to be genealogically related. Though there have been numerous so-called “state-of-art” alignment programs available and many times they do help, manual editing is oftencrucial in the quality of alignment, and yet there is no firm or clearly defining rule on how to modify a sequence alignment.
CHOOSE A MODEL OF EVOLUTION.
One quantitate measure of divergence between two sequences is the number of substitutions in an alignment. However, this measure can somehow be misleading in representing the true evolution. Not only the nucleotide may actually undergone several intermediate steps of changes behind a single mutation in the sequence, but also an observed identical nucleotide may hide parallel mutations of both sequences. This is known as homoplasy. The statistical models used to correct homoplasy are called evolutioinary models. For constructing DNA phylogenies, there have been Jukes-Cantor model and Kimural model.
DETERMINE A TREE BUILDING METHOD.
There are basically two types of phylogenetic methods, character based methods and distance based methods. Character-based methods based on discrete characters from molecular sequences from individual taxa. The theory is that characters at corresponding positions in a multiple sequence alignment are homologous (this word has different meaning in mathematics and is precisely defined for sequences with differentials.) among the sequences involved. On the other hand, distance-based methods is based on the distance, the degrees of differences between pairs of seqeuences. Such distance will be used to construct the distance matrix between individual pairs of taxa. The theory in this case is that all sequences involved are homologous and the weighted directed tree will satisfy the additive properties. There are two different algorithms in distance based method, the cluster-based and the optimality-based. The cluster-based method algorithms build a phylogenetic tree based on a distance matrix starting from the most similar sequence pairs. The algorithms of cluster-bsed include unweighted pair group method using arithmetic average (UPGMA) and neighbor joining (NJ). The optimality-based method algorithms compare numerous different tree topologies and select the one which is believed to best fit between computed distances in the trees and the desired evolutionary distances which often referred as actual evolutionary distances. Algorithms of optimality based include Fitch-Margoliash and minimum evolution.
ASSESS TREE RELIABILITY.
The assessment of constructed phylogenetic trees will be on the reliability of the inferred phylogeny. More precisely, is any particular tree more reliable than the others? Or, is it more biologically or statistically significant than the other? Bootstrapping and Jacknifing are two analytical strategies which repeatedly resample data from the original database to test for errors of a phylogenetic trees. Kishino-Hasegawa and Shimodaira-Hasegawa tests are often used to compare the significance of two phylogenetic trees. In summary, finding correct and acurate phylogenetic trees is generally an extremely difficult task. One argument is to calculate how many different such trees exist, and these numbers are indeed astronomically large when the number of taxa increases. While the argument makes some mathematical sense, it makes absolutely no biological sense because many of the mathematically existing trees can be eliminated by biological common sense. It is more due to the facts that often times one has to derive a consensus tree from multiple individual trees based majority rule, or any other rules which would make biological sense in practice, and both consensus and these rules are often very fuzzy and varies from people to people. Moreover, at each step of the tree building, there are multiple choices, criteria to make those choices are so vague that different choices often lead to some significantly different outcomes. Our focuses in this paper shall be the step 4 in the process, to determine and access tree building methods, in particular, the distance based methods. The paper starts with some preliminary knowledge and definitions in the area, including finite directed graphs, directed trees and matrices. It discusses the verification of the metric property of distance matrices, including detections of errors if a distance matrix fails to satisfy the metric property, and then provides an algorithm in modifying the distance matrix to satisfy the metric property. The second part of the paper is a brief survey based on the excerpts from the references, on various frequently used distance based phylogenetic tree construction methods, both cluster-based and optimality base methods, including UPGMA, Neighbor Joining, Fitch-Margoliash, and Minimum Evolution methods. Also, it discusses the assessment of the phylogenetic trees and some analysis
of the algorithms.
PRELIMINARIES
Definition 2.1. A finite graph G is a pair (V,E), where V is a finite set, and E is a subset of all two member subsets (non ordered pairs) of V . The members in V are referred as vertices or nodes, and the members in E are referred as edges or branches. A finite directed graph is a finite graph with edge set E be a subset of the cartesian product (ordered paris) V ×V . For any edge v = (C,B) 2 E in a directed graph,
where C,B 2 V , C is called an ancestor of B and B is called an offspring of C. It is common practice in biology that if there is a branch from one node to another node, the offspring is drawn higher (lower) than the ancestor node, instead of specifically defining the direction of the branches
OPTIMALITY-BASED METHODS
The clustering based methods produce a single tree as outcome. However, there is no criterion in judging how this tree is compared to other alternative trees. In contrast, optimality based methods have a well-defined algorithm to compare all possible tree topologies and select a tree that best fits the actual evolutionary distance matrix. Based on the differences in optimality criteria, there are two types of algorithm, Fitch-Margoliash and minimum evolution. The exhaustive search for an optimal tree necessitates a slow computation, which is a clear drawback especially when the dataset is large.
5.1. FITCH-MARGOLIASH.
The Fitch-Margoliash method selects a best tree among all possible trees based on minimal deviation between the distances calculated in the overall branches in the tree and the distances in the original dataset. It starts by randomly clustering two taxa in a node and creating three equations to describe the distances, and the solving the three algebraic equations for unknown branch lengths. The clustering of the two taxa helps to create a newly reduced matrix. This process is iterated until a tree is completely resolved. The method searches for all tree topologies and selects the one that has the lowest squared deviation of actual distances and calculated tree branch lengths. The optimality criterion is expressed in the following formula