
Calculating Distances

*** It is often useful to measure the genetic distance between two species, between two populations, or even between two individuals. For example, if two individuals come to a hospital with the same genetic disease, you might want to know whether they are related and might therefore have inherited the same gene. If they are not related, the disease might be a manifestation of two separate mutations.

*** The entire concept of numerical taxonomy is based on computing phylogenies from a table of distances. In the case of sequence data, pairwise distances must be calculated between all sequences that will be used to build the tree - thus creating a distance matrix.
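
A minimal sketch of building such a distance matrix, assuming the sequences are already aligned to the same length and using a plain count of differing positions as the distance. The function names and the three example sequences are made up for illustration.

```python
from itertools import combinations


def count_differences(seq_a, seq_b):
    """Number of aligned positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)


def distance_matrix(seqs):
    """Return a symmetric dict-of-dicts of pairwise distances keyed by sequence name."""
    matrix = {name: {name: 0} for name in seqs}
    for name_a, name_b in combinations(seqs, 2):
        d = count_differences(seqs[name_a], seqs[name_b])
        matrix[name_a][name_b] = d
        matrix[name_b][name_a] = d
    return matrix


# Hypothetical aligned sequences, for illustration only.
example = {
    "seq1": "ACGTACGT",
    "seq2": "ACGTACGA",
    "seq3": "ACTTACGA",
}
matrix = distance_matrix(example)
for name in example:
    print(name, [matrix[name][other] for other in example])
```

Every pair of sequences is compared exactly once and the value is stored in both directions, so the table can be read by row or by column.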

*** Distance methods also give some measure of the reliability of the final tree: a variance for the tree can be computed from the variances of the individual entries in the initial distance matrix.

*** Distance methods give a single measure of the amount of evolutionary change between two genomes since divergence from a common ancestor.

*** Distances between DNA sequences are relatively simple to compute: the distance is the sum of all base differences between the two sequences. (This type of calculation only works for pairs of sequences that are similar enough to be aligned.)

*** Either all base changes are weighted equally, or a simple matrix of the frequencies of the 12 possible types of replacement (each of the four bases can be replaced by any of the three others) can be used.
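
Both options can be sketched with one function, assuming aligned, ungapped sequences. With no table every change costs 1; otherwise each of the 12 replacement types is looked up in a weight table. The weights used here (1 for a purine-purine or pyrimidine-pyrimidine change, 2 for any other change) are hypothetical placeholders, not values derived from observed frequencies.

```python
def weighted_distance(seq_a, seq_b, weights=None):
    """Sum replacement weights over aligned positions (each change costs 1 if no table is given)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    total = 0.0
    for a, b in zip(seq_a, seq_b):
        if a == b:
            continue
        total += 1.0 if weights is None else weights[(a, b)]
    return total


BASES = "ACGT"
PURINES = {"A", "G"}

# Hypothetical weight table covering the 12 possible replacements.
WEIGHTS = {}
for x in BASES:
    for y in BASES:
        if x != y:
            same_class = (x in PURINES) == (y in PURINES)
            WEIGHTS[(x, y)] = 1.0 if same_class else 2.0

print(len(WEIGHTS))                                # 12 replacement types
print(weighted_distance("ACGT", "GCTT"))           # 2.0 (two changes, equal weight)
print(weighted_distance("ACGT", "GCTT", WEIGHTS))  # 3.0 (A/G costs 1, G/T costs 2)
```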

*** Differences due to insertions/deletions (indels) are generally given a larger weight than replacements, but indels of multiple bases at one position are given less weight than multiple independent indels.
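
This weighting can be sketched by charging a high cost to open an indel and a lower cost for each additional base in the same indel. The sketch assumes the two sequences are already aligned with "-" marking gaps. The three cost values are hypothetical defaults chosen only to show the effect.

```python
def gapped_distance(seq_a, seq_b, mismatch=1.0, gap_open=3.0, gap_extend=0.5):
    """Distance over an alignment in which '-' marks an inserted or deleted base."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    total = 0.0
    in_gap = None  # which sequence ("a" or "b") carries the currently open indel
    for a, b in zip(seq_a, seq_b):
        if a == "-" and b == "-":
            continue  # a column that is a gap in both sequences carries no information
        if a == "-" or b == "-":
            side = "a" if a == "-" else "b"
            # Opening an indel is expensive; extending the same indel is cheap.
            total += gap_extend if in_gap == side else gap_open
            in_gap = side
        else:
            in_gap = None
            if a != b:
                total += mismatch
    return total


# One three-base indel versus three separate one-base indels (made-up sequences).
print(gapped_distance("ACGTACGT", "AC---CGT"))  # 4.0
print(gapped_distance("ACGTACGT", "A-G-A-GT"))  # 9.0
```

With these costs a single three-base indel contributes less than three independent one-base indels, while any indel still costs more than a single replacement.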

*** It is also possible to correct for multiple substitutions at a single site, which are more common in distant relationships and at rapidly evolving sites.
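
One widely used correction of this kind is the Jukes-Cantor one-parameter model, which is named here as an example and is not necessarily the method these notes have in mind. It assumes that all replacements are equally likely and converts the observed proportion p of differing sites into an estimated number of substitutions per site, d = -(3/4) ln(1 - (4/3) p). A minimal sketch:

```python
import math


def jukes_cantor(p):
    """Jukes-Cantor corrected distance for an observed proportion p of differing sites."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75) for the correction to be defined")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)


# The correction grows as the sequences become more divergent.
for p in (0.05, 0.30, 0.60):
    print(p, round(jukes_cantor(p), 4))
```

The corrected distance is always at least as large as the observed proportion, because some sites have changed more than once and only one difference is visible.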

*** Distances between amino acid sequences are a bit more complicated to calculate.

*** Some amino acids can replace one another with relatively little effect on the structure and function of the final protein while other replacements can be devastating.

*** From the standpoint of the genetic code, changes between some amino acids can be made by a single DNA mutation while others require two or even three changes in the DNA sequence.
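
The minimum number of base changes between two amino acids can be worked out directly from the standard genetic code. The sketch below is an illustration under that assumption: it builds the codon table from the usual TCAG ordering and compares every codon of one amino acid with every codon of the other. The function name is made up.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons in TCAG order (TTT, TTC, TTA, TTG, TCT, ...); '*' is a stop.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each one-letter amino acid code to the list of codons that encode it.
CODONS = {}
for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS):
    CODONS.setdefault(aa, []).append("".join(codon))


def min_base_changes(aa_from, aa_to):
    """Fewest single-base changes needed to turn a codon for aa_from into one for aa_to."""
    return min(
        sum(x != y for x, y in zip(c1, c2))
        for c1 in CODONS[aa_from]
        for c2 in CODONS[aa_to]
    )


print(min_base_changes("D", "E"))  # 1 (for example GAT to GAA)
print(min_base_changes("M", "W"))  # 2 (ATG to TGG)
print(min_base_changes("M", "Y"))  # 3 (ATG to TAT or TAC)
```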

*** In practice, tables of the frequencies of all amino acid replacements have been calculated from sets of related amino acid sequences in the databanks.

*** The most famous of these tables is the PAM250 matrix created by Dayhoff et al. in 1978.

*** PAM stands for "Point Accepted Mutation" (often expanded as "Percent Accepted Mutations"). The scores are derived from a mutation probability matrix, i.e. the probability that any amino acid will change to any other amino acid. A score above 0 indicates that two amino acids replace each other more often than expected by chance; that is, they are functionally equivalent and/or easily inter-mutable. Scores below 0 indicate two amino acids that are seldom interchanged.
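
The sign of a score follows from comparing an observed replacement frequency with the frequency expected by chance. The sketch below assumes a log-odds form, 10 times the base-10 logarithm of that ratio, rounded to an integer. Both the scaling and the frequencies are assumptions chosen for illustration and are not taken from the Dayhoff data.

```python
import math


def log_odds_score(observed, expected):
    """Rounded log-odds score: positive when observed exceeds the chance expectation."""
    return round(10 * math.log10(observed / expected))


# Hypothetical frequencies, for illustration only.
print(log_odds_score(0.020, 0.010))  # 3  (replaced twice as often as expected)
print(log_odds_score(0.010, 0.010))  # 0  (replaced exactly as often as expected)
print(log_odds_score(0.005, 0.010))  # -3 (replaced half as often as expected)
```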



The PAM250 scoring matrix

           A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
        A  2
        R -2  6
        N  0  0  2
        D  0 -1  2  4
        C -2 -4 -4 -5  4
        Q  0  1  1  2 -5  4
        E  0 -1  1  3 -5  2  4
        G  1 -3  0  1 -3 -1  0  5
        H -1  2  2  1 -3  3  1 -2  6
        I -1 -2 -2 -2 -2 -2 -2 -3 -2  5
        L -2 -3 -3 -4 -6 -2 -3 -4 -2  2  6
        K -1  3  1  0 -5  1  0 -2  0 -2 -3  5
        M -1  0 -2 -3 -5 -1 -2 -3 -2  2  4  0  6
        F -4 -4 -4 -6 -4 -5 -5 -5 -2  1  2 -5  0  9
        P  1  0 -1 -1 -3  0 -1 -1  0 -2 -3 -1 -2 -5  6
        S  1  0  1  0  0 -1  0  1 -1 -1 -3  0 -2 -3  1  3
        T  1 -1  0  0 -2 -1  0  0 -1  0 -2  0 -1 -2  0  1  3
        W -6  2 -4 -7 -8 -5 -7 -7 -3 -5 -2 -3 -4  0 -6 -2 -5 17
        Y -3 -4 -2 -4  0 -4 -4 -5  0 -1 -1 -4 -2  7 -5 -3 -3  0 10
        V  0 -2 -2 -2 -2 -2 -2 -1 -2  4  2 -2  2 -1 -1 -1  0 -6 -2  4
Dayhoff, M., Schwartz, R.M., Orcutt, B.C. (1978) A model of evolutionary change in proteins. In: Atlas of Protein Sequence and Structure, vol. 5, suppl. 3, pp. 345-352. M. Dayhoff, ed., National Biomedical Research Foundation, Silver Spring, MD.
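
Because the matrix is symmetric, only the lower triangle is printed, and a lookup has to work in either order. A minimal sketch of such a lookup follows, using only the entries for I, L, F, W, Y and V copied from the table above. The variable and function names are made up, and the sketch scores an ungapped alignment by simply summing the pair scores.

```python
ORDER = "ILFWYV"
# Lower-triangle entries for these six amino acids, copied from the PAM250 table.
LOWER_ROWS = [
    [5],                      # I
    [2, 6],                   # L
    [1, 2, 9],                # F
    [-5, -2, 0, 17],          # W
    [-1, -1, 7, 0, 10],       # Y
    [4, 2, -1, -6, -2, 4],    # V
]

# Expand the triangle into a lookup that works for either order of the pair.
SCORES = {}
for i, row in enumerate(LOWER_ROWS):
    for j, value in enumerate(row):
        SCORES[(ORDER[i], ORDER[j])] = value
        SCORES[(ORDER[j], ORDER[i])] = value


def alignment_score(pep_a, pep_b):
    """Sum of pair scores over an ungapped alignment of two equal-length peptides."""
    if len(pep_a) != len(pep_b):
        raise ValueError("peptides must be aligned to the same length")
    return sum(SCORES[(a, b)] for a, b in zip(pep_a, pep_b))


print(SCORES[("F", "Y")], SCORES[("Y", "F")])  # 7 7
print(alignment_score("FIVY", "YLIW"))         # 7 + 2 + 4 + 0 = 13
```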


Using Computers for Molecular Biology
Stuart M. Brown, Ph.D., RCR, NYU Medical Center
Comments to: browns02@mcrcr.med.nyu.edu