Introduction to Phylogenetics
Portions of this lecture have been paraphrased or outright stolen from web pages created by: Dr. Brian Golding, Department of Biology, McMaster University, Hamilton, Ontario, Canada, L8S 4K1
Similarity searches and multiple alignments of sequences naturally lead to the question:
"How are these sequences related?"
and more generally:
"How are the organisms from which these sequences come related?"
After working with sequences for a while, one develops an intuitive understanding that, for a given gene, closely related organisms have similar sequences and more distantly related organisms have more dissimilar sequences.
Also, it seems logical that given a set of sequences, it should be possible to reconstruct the evolutionary relationships (ancestral relationships) among genes and among organisms. This involves creating a branching structure, termed a phylogeny or tree, that illustrates the relationships between the sequences.
The study of the relationships between groups of organisms is called taxonomy, an ancient and venerable branch of classical biology. The branch of taxonomy that deals with numerical data such as DNA sequences is known as phylogenetics. This subject also overlaps significantly with a branch of evolutionary biology known as molecular evolution.
Check out the Tree of Life project on the Web for introductory information about phylogenetics and its relationship to biodiversity.
DNA sequences have many advantages over more classical types of taxonomic characters:
Character states can be scored unambiguously
Large numbers of characters can be scored for each individual
Information on both the extent and the nature of divergence between sequences is available (nucleotide substitutions, insertion/deletions, or genome rearrangements)
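As a rough illustration of the last point, the sketch below tallies matches, substitutions, and insertion/deletion columns between two aligned sequences. The function name `summarize_divergence` and the two toy sequences are invented for this example, and gaps are assumed to be written as `-`; genome rearrangements cannot be detected by a column-by-column comparison like this one.

```python
def summarize_divergence(seq_a, seq_b):
    """Count matches, substitutions, and indel columns in a pairwise alignment."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    counts = {"match": 0, "substitution": 0, "indel": 0}
    for x, y in zip(seq_a, seq_b):
        if x == "-" and y == "-":
            continue  # column is a gap in both sequences; nothing to compare
        if x == "-" or y == "-":
            counts["indel"] += 1
        elif x == y:
            counts["match"] += 1
        else:
            counts["substitution"] += 1
    return counts


# Hypothetical aligned sequences, made up for illustration only.
summary = summarize_divergence("ACGT-ACGTA", "ACCTTAC-TA")
print(summary)
```

The counts give the extent of divergence, while the split between substitutions and indels says something about its nature.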
Disclaimers
Before describing any theoretical or practical aspects of phylogenetics, it is necessary to give some disclaimers. This area of computational biology is an intellectual minefield.
Neither the theory nor the practical applications of any algorithms are universally accepted throughout the scientific community.
Applying different software packages to the same data set is very likely to give different answers, and minor changes to a data set are also likely to change the result profoundly.
Despite all of these caveats, it is possible to calculate phylogenetic trees for data sets.
Provided the data are clean, outgroups are correctly specified, appropriate algorithms are chosen, no assumptions are violated, bootstrapping is used, etc., can the true, correct tree be found and proven to be scientifically valid?
Unfortunately, it is impossible to ever conclusively state what is the "true" tree for a group of sequences (or a group of organisms); taxonomy is constantly under revision as new data are gathered.
Relationships calculated from sequence data actually represent the relationships between genes; these are not necessarily the same as the relationships between whole organisms.
Your data (the sequence of some gene or some other form of sequence data) may not have had the same phylogenetic history as the species within which they are contained.
Different genes evolve at different speeds, and there is always the possibility of horizontal gene transfer (by hybridization, vector-mediated DNA movement, or direct uptake of DNA).
Cladistic vs. Phenetic Analysis Methods
Within the field of taxonomy there are two different methods and philosophies of building phylogenetic trees.
The phenetic approach is popular with molecular evolutionists because it relies heavily on character data - such as sequences - and requires relatively few assumptions.
In this approach, a tree is constructed by considering the phenotypic similarities of the species without trying to understand the evolutionary history that brought the species to their current phenotypes.
Since a tree constructed by this "current data only" method does not necessarily reflect evolutionary relationships, but rather is designed to represent phenotypic similarity, trees constructed via this method are called phenograms.
A phylogenetic tree per se is often termed a dendrogram (a branching order that may or may not be the correct phylogeny).
Computer algorithms based on the phenetic model generally rely on distance methods for the calculation of relationships and building of trees.
From a practical standpoint, phenetic methods will consider all sequence differences equally, so a single event that creates a large change in sequence will move two sequences far apart on the final tree.
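To make the idea of a distance method concrete, here is a minimal sketch that computes the simplest possible distance, the proportion of differing sites (often called the p-distance), for every pair of sequences in an alignment. The function names and the toy alignment are invented for this example; real packages normally apply a substitution model to correct for multiple changes at the same site, which this sketch does not attempt.

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Proportion of ungapped aligned positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # Ignore columns where either sequence has a gap.
    pairs = [(x, y) for x, y in zip(seq_a, seq_b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped positions to compare")
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_matrix(alignment):
    """Return {(name1, name2): distance} for every pair of sequences."""
    return {
        (m, n): p_distance(alignment[m], alignment[n])
        for m, n in combinations(sorted(alignment), 2)
    }


# Hypothetical toy alignment. Sequence C differs from A by one run of three
# adjacent bases, which could plausibly be the result of a single event.
alignment = {
    "A": "ACGTACGTAC",
    "B": "ACGTACGTAT",
    "C": "ACGTTTTTAC",
}
matrix = distance_matrix(alignment)
closest = min(matrix, key=matrix.get)  # the pair a clustering method would join first
for pair, dist in sorted(matrix.items()):
    print(pair, round(dist, 2))
```

Because every differing site counts equally, the three-base run in C puts it three times as far from A as B is, even if that run arose in one mutational step.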
The second approach, known as the cladistic method, relies on a knowledge of ancestral relationships as well as current data.
Via cladistic methods, a tree is reconstructed by considering the various possible pathways of evolution and choosing from amongst these the best possible tree.
Trees reconstructed via these methods are called cladograms.
Computer algorithms based on the cladistic model generally rely on parsimony or maximum likelihood methods for the calculation of relationships and building of trees.
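As a sketch of how a parsimony criterion can choose among candidate trees, the code below scores a fixed tree by the minimum number of character changes it requires, using Fitch's algorithm (a standard technique that this lecture does not itself name). The nested-tuple tree representation, the function names, and the four toy sequences are all assumptions made for this example; a real program would also have to search over many possible trees rather than compare two by hand.

```python
def fitch(tree, states):
    """Return (candidate_states, change_count) for one alignment column.

    tree   -- a leaf name (str) or a 2-tuple of subtrees
    states -- dict mapping each leaf name to its character in this column
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch(left, states)
    right_set, right_cost = fitch(right, states)
    shared = left_set & right_set
    if shared:
        # The two subtrees can agree on a state: no extra change needed.
        return shared, left_cost + right_cost
    # Otherwise at least one change must occur on a branch below this node.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Sum the minimum number of changes over all alignment columns."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(length)
    )


# Hypothetical toy alignment of four sequences, made up for illustration.
alignment = {"W": "AAGC", "X": "AAGT", "Y": "CTGC", "Z": "CTGT"}
tree_1 = (("W", "X"), ("Y", "Z"))
tree_2 = (("W", "Y"), ("X", "Z"))
score_1 = parsimony_score(tree_1, alignment)
score_2 = parsimony_score(tree_2, alignment)
print(score_1, score_2)
```

Under this criterion the tree with the lower score is preferred, so here the grouping of W with X and Y with Z would be chosen over the alternative.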
In order to use cladistic software with sequence data, certain sequences must be designated as ancestral and others as derived. As a result, changes at certain positions will have a larger effect than others on the location of each sequence in the predicted tree.
For character data (physical traits of organisms such as morphology of organs etc.) and for higher (or perhaps we should say deeper) levels of taxonomy, the cladistic approach is almost certainly superior.
However, cladistic methods are often difficult to implement, and they rest on assumptions that molecular data do not always satisfy. Phenetic approaches generally lead to faster algorithms, and they often have nicer statistical properties for molecular data.
Using Computers for Molecular Biology
Stuart M. Brown, Ph.D., RCR, NYU Medical Center Comments to: browns02@mcrcr.med.nyu.edu