![]()
![]()
![]()
![]()
Clustering Algorithms
Clustering algorithms use distances to calculate phylogenetic trees. These tress are based solely on the relative numbers of similarities and differences between a set of sequences.
First pairwise distances must be computed for all sequences that will be used to build the tree - thus creating a distance matrix.
Cluster methods construct a tree by linking the least distant pairs of taxa, followed by successively more distant taxa.
Cluster algorithms can be applied to many different types of molecular data including isozymes, restriction sites, RFLP's etc., but we will restrict this discussion to DNA and protein sequence data.
UPGMA
The simplest of the distance methods is a type of cluster algorithm that is known as UPGMA (Unweighted Pair Group Method using Arithmetic averages)
(Sneath, P.H.A. and Sokal, R.R. (1973) in Numerical Taxonomy (pp; 230-234), W.H. Freeman and Company, San Francisco, California, USA)This method has gained popularity mostly because of its simplicity and also because of its speed (though many other distance methods are as fast).
The GCG program PILEUP uses UPGMA to create its dendrogram of DNA sequences, and then uses this dendrogram to guide its multiple alignment algorithm.
The GCG program DISTANCES calculates pairwise distances between a group of sequences.
DISTANCES writes its output into a matrix file that can then be used by the program GROWTREE to draw a tree based on either UPGMA or the Neighbor-Joining method.
Absolute distance calculations can be corrected for multiple substitutions at a site with a variety of formulas.
Neighbor Joining
Another very popular distance method is the Neighbor Joining Method
(Saitou and Nei 1987, Mol. Biol. Evol. 4:406).
This method attempts to correct the UPGMA method for its (frequently invalid) assumption that the same rate of evolution applies to each branch. Hence this method yields an unrooted tree.
A modified distance matrix is constructed to adjust for these differences in the rate of evolution of each taxon.
The least distant pairs of nodes are linked and their common ancestral node is added to the tree, their terminal nodes are pruned from the tree. This continues until only two nodes remain.
Others
There are a wide variety of other distance-based clustering algorithms constructed with differing sets of assumptions. Neighbor Joining has given the best results in simulation studies and it is the most computationally efficient of the distance algorithms (N. Saitou and T. Imanishi, Mol. Biol. Evol. 6:514 (1989). However, the choice of algorithm must reflect the type of data being analyzed and the history of the sub-discipline from which the data is derived (i.e.. immunology vs. numerical plant taxonomy).
In some cases your choice of phylogenetic algorithm may be limited by the software that is accessible - which may in turn be limited by the computing hardware that is present in your lab. If this is the case, then it is just as important to know the limitations of the algorithms that you do use as it would be if you had to select from a wide variety of different algorithms
![]()
![]()
![]()
![]()
Using Computers for Molecular Biology
Stuart M. Brown, Ph.D., RCR, NYU Medical Center Comments to: browns02@mcrcr.med.nyu.edu