
Distances and Scoring Matrixes

*** Distance methods give a single measure of the amount of evolutionary change between two genomes since divergence from a common ancestor.

*** Distances between DNA sequences are relatively simple to compute: align the two sequences, then sum the differences between them position by position.

*** Either all base changes are considered equally, or a simple matrix of the frequencies of the 12 possible types of replacements (each base can be replaced by one of the three other bases) can be used.

*** Differences due to insertions/deletions (indels) are generally given a larger score than substitutions.
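
*** As a minimal sketch of the DNA distance described above, the function below sums the differences between two sequences that are already aligned. The function name, the gap character "-", the gap cost of 2, and the transition/transversion weights are all illustrative assumptions, not values taken from any particular program.

```python
BASES = "ACGT"

# Hypothetical weights for the 12 possible replacements:
# transitions (A<->G, C<->T) cost 1, transversions cost 2.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}
WEIGHTS = {
    (a, b): (1.0 if (a, b) in TRANSITIONS else 2.0)
    for a in BASES
    for b in BASES
    if a != b
}


def dna_distance(seq_a, seq_b, substitution_cost=None, gap_cost=2.0):
    """Sum the per-column differences between two aligned DNA sequences.

    With substitution_cost=None every base change counts as 1.
    Otherwise substitution_cost maps (base_a, base_b) to a cost.
    A column with a gap in exactly one sequence adds gap_cost.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    total = 0.0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == b:
            continue  # identical columns add nothing
        if a == "-" or b == "-":
            total += gap_cost  # indel, weighted more than a substitution
        elif substitution_cost is None:
            total += 1.0  # all base changes considered equally
        else:
            total += substitution_cost[(a, b)]
    return total


# One C/T substitution plus one indel column:
print(dna_distance("ACGTAC", "ATG-AC"))        # 3.0
# One C/A transversion, unweighted and weighted:
print(dna_distance("ACGT", "AAGT"))            # 1.0
print(dna_distance("ACGT", "AAGT", WEIGHTS))   # 2.0
```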

*** Distances between amino acid sequences are a bit more complicated to calculate.


*** From a functional standpoint, some amino acids can replace one another with relatively little effect on the structure and function of the final protein while other replacements can be devastating.

*** From the standpoint of the genetic code, some amino acid changes can be made due to the replacement of a single DNA base while others require two or even three changes in the DNA sequence.
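
*** The genetic-code point can be checked directly. The sketch below builds the standard genetic code from its usual compact listing (bases in the order T, C, A, G) and reports the smallest number of base changes that converts a codon for one amino acid into a codon for another. The function and variable names are made up for this illustration.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
# ("*" marks a stop codon).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each one-letter amino acid code to the list of codons that encode it.
CODONS_FOR = {}
for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS):
    CODONS_FOR.setdefault(aa, []).append("".join(codon))


def min_base_changes(aa_from, aa_to):
    """Smallest number of single-base changes between any pair of codons."""
    return min(
        sum(x != y for x, y in zip(c1, c2))
        for c1 in CODONS_FOR[aa_from]
        for c2 in CODONS_FOR[aa_to]
    )


print(min_base_changes("F", "L"))  # 1  (e.g. TTT -> TTA)
print(min_base_changes("M", "W"))  # 2  (ATG -> TGG)
print(min_base_changes("M", "Y"))  # 3  (ATG -> TAT or TAC)
```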

*** In practice, what has been done is to calculate tables of frequencies of all amino acid replacements (mutations) between sets of related amino acid sequences in the databanks (protein families).

*** One of the most widely used of these mutation frequency tables is the PAM250 matrix - shown below.

*** PAM stands for "point accepted mutation" (sometimes expanded as "percent accepted mutation"). The matrix is derived from a mutation probability matrix, i.e. the probability that any amino acid will change to any other amino acid.

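*** The 250 in PAM250 indicates that the data, gathered from 71 sets of aligned sequences, were extrapolated up to the level of 250 amino acid replacements per 100 residues. Because the same position can change more than once, sequences at this distance still share some identical residues (roughly one in five).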

  • A score above 0 indicates that these amino acids replace each other more often than expected by chance. That is, they are functionally equivalent and/or easily inter-mutable.

  • Scores below 0 indicate two amino acids that are seldom interchanged.
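
*** To make the extrapolation and the sign of the scores concrete, here is a toy sketch. It uses a hypothetical three-letter alphabet (X, Y, Z) and invented one-step replacement probabilities, not Dayhoff's data. The one-step matrix is multiplied by itself 250 times, and each resulting probability is compared with the chance frequency of the target residue. The 10 x log10 scaling is the one usually described for the Dayhoff matrix and should be read as an assumption here.

```python
import math

RESIDUES = ["X", "Y", "Z"]  # hypothetical alphabet

# Invented one-step probabilities: ONE_STEP[i][j] is the chance that
# residue i is replaced by residue j. X and Y interchange more readily
# than either does with Z. Each row sums to 1.
ONE_STEP = [
    [0.990, 0.008, 0.002],
    [0.008, 0.990, 0.002],
    [0.002, 0.002, 0.996],
]
BACKGROUND = [1 / 3, 1 / 3, 1 / 3]  # assumed residue frequencies


def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def mat_power(m, steps):
    """Apply the one-step matrix `steps` times (repeated multiplication)."""
    n = len(m)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(steps):
        result = mat_mul(result, m)
    return result


def log_odds(prob, background):
    """Score = 10 * log10(observed probability / probability by chance)."""
    n = len(prob)
    return [
        [10 * math.log10(prob[i][j] / background[j]) for j in range(n)]
        for i in range(n)
    ]


EXTRAPOLATED = mat_power(ONE_STEP, 250)
SCORES = log_odds(EXTRAPOLATED, BACKGROUND)

for name, row in zip(RESIDUES, SCORES):
    print(name, " ".join(f"{value:6.2f}" for value in row))
```

In this toy model the X/Z score comes out below 0 (seldom interchanged), while the X/X and Z/Z scores come out above 0.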


The PAM250 scoring matrix

           A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
        A  2
        R -2  6
        N  0  0  2
        D  0 -1  2  4
        C -2 -4 -4 -5  4
        Q  0  1  1  2 -5  4
        E  0 -1  1  3 -5  2  4
        G  1 -3  0  1 -3 -1  0  5
        H -1  2  2  1 -3  3  1 -2  6
        I -1 -2 -2 -2 -2 -2 -2 -3 -2  5
        L -2 -3 -3 -4 -6 -2 -3 -4 -2  2  6
        K -1  3  1  0 -5  1  0 -2  0 -2 -3  5
        M -1  0 -2 -3 -5 -1 -2 -3 -2  2  4  0  6
        F -4 -4 -4 -6 -4 -5 -5 -5 -2  1  2 -5  0  9
        P  1  0 -1 -1 -3  0 -1 -1  0 -2 -3 -1 -2 -5  6
        S  1  0  1  0  0 -1  0  1 -1 -1 -3  0 -2 -3  1  3
        T  1 -1  0  0 -2 -1  0  0 -1  0 -2  0 -1 -2  0  1  3
        W -6  2 -4 -7 -8 -5 -7 -7 -3 -5 -2 -3 -4  0 -6 -2 -5 17
        Y -3 -4 -2 -4  0 -4 -4 -5  0 -1 -1 -4 -2  7 -5 -3 -3  0 10
        V  0 -2 -2 -2 -2 -2 -2 -1 -2  4  2 -2  2 -1 -1 -1  0 -6 -2  4


Dayhoff, M.O., Schwartz, R.M. and Orcutt, B.C. (1978). A model of evolutionary change in proteins, matrixes for detecting distant relationships. In Dayhoff, M.O. (ed.), Atlas of Protein Sequence and Structure, Vol. 5, pp. 345-358. National Biomedical Research Foundation, Washington, DC.
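
*** A scoring matrix like the one above is used by looking up each aligned pair of residues and adding the scores. The sketch below copies the table as printed on this page into a string, fills in the symmetric upper half, and scores an ungapped alignment. The function names are made up for this illustration, and gaps are deliberately not handled.

```python
# Lower triangle of the PAM250 table, copied as printed on this page.
PAM250_TEXT = """\
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
A  2
R -2  6
N  0  0  2
D  0 -1  2  4
C -2 -4 -4 -5  4
Q  0  1  1  2 -5  4
E  0 -1  1  3 -5  2  4
G  1 -3  0  1 -3 -1  0  5
H -1  2  2  1 -3  3  1 -2  6
I -1 -2 -2 -2 -2 -2 -2 -3 -2  5
L -2 -3 -3 -4 -6 -2 -3 -4 -2  2  6
K -1  3  1  0 -5  1  0 -2  0 -2 -3  5
M -1  0 -2 -3 -5 -1 -2 -3 -2  2  4  0  6
F -4 -4 -4 -6 -4 -5 -5 -5 -2  1  2 -5  0  9
P  1  0 -1 -1 -3  0 -1 -1  0 -2 -3 -1 -2 -5  6
S  1  0  1  0  0 -1  0  1 -1 -1 -3  0 -2 -3  1  3
T  1 -1  0  0 -2 -1  0  0 -1  0 -2  0 -1 -2  0  1  3
W -6  2 -4 -7 -8 -5 -7 -7 -3 -5 -2 -3 -4  0 -6 -2 -5 17
Y -3 -4 -2 -4  0 -4 -4 -5  0 -1 -1 -4 -2  7 -5 -3 -3  0 10
V  0 -2 -2 -2 -2 -2 -2 -1 -2  4  2 -2  2 -1 -1 -1  0 -6 -2  4
"""


def parse_lower_triangle(text):
    """Return {(row, col): score} with both (a, b) and (b, a) filled in."""
    lines = text.splitlines()
    header = lines[0].split()
    scores = {}
    for line in lines[1:]:
        fields = line.split()
        row, values = fields[0], fields[1:]
        # zip stops at the diagonal because each row is only that long
        for col, value in zip(header, values):
            scores[(row, col)] = scores[(col, row)] = int(value)
    return scores


def alignment_score(seq_a, seq_b, scores):
    """Sum the pair scores over an ungapped alignment of equal length."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be the same length")
    return sum(scores[(a, b)] for a, b in zip(seq_a, seq_b))


PAM250 = parse_lower_triangle(PAM250_TEXT)
print(PAM250[("F", "Y")], PAM250[("Y", "F")])   # 7 7
print(alignment_score("LF", "IY", PAM250))      # 9  (L/I = 2, F/Y = 7)
```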


*** Many other scoring matrixes have been developed since the "classic" PAM250.

*** Both the NCBI web BLAST server and GCG use versions of the BLOSUM matrix (developed by Henikoff & Henikoff) as the default. BLOSUM uses a larger set of protein families to calculate mutation frequencies.

Henikoff, S. and Henikoff, J.G. (1992). Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA 89:10915-10919.

*** Many of these matrixes are optimized for either larger or smaller evolutionary distances than PAM250.

*** Certain matrixes work better for comparisons within certain protein families.

*** It is best to try several different scoring matrixes when comparing protein sequences.



Using Computers for Molecular Biology
Stuart M. Brown, Ph.D, RCR, NYU Medical Center
Comments to: browns02@mcrcr.med.nyu.edu