Distances and Scoring Matrixes
Distance methods give a single measure of the amount of evolutionary change between two genomes since divergence from a common ancestor.
Distances between DNA sequences are relatively simple to compute as the sum of the differences between two sequences.
Either all base changes are considered equally, or a simple matrix of the frequencies of the 12 possible types of replacements (each base can be replaced by one of the three other bases) can be used.
Differences due to insertions/deletions (indels) are generally given a larger score than substitutions.
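As a rough illustration of the DNA case, here is a minimal sketch that sums the differences between two sequences that have already been aligned. The function name `dna_distance`, the default weights (1 for a substitution, 2 for an indel), and the optional `weights` table are assumptions made for this example, not values taken from any particular program.

```python
def dna_distance(seq_a, seq_b, mismatch=1.0, gap=2.0, weights=None):
    """Sum per-column differences between two pre-aligned DNA sequences.

    Both strings must have the same length, with '-' marking an indel.
    `weights` may map an ordered (base, base) pair to a cost, standing in
    for a matrix of the 12 possible replacement types; pairs that are not
    listed fall back to the flat `mismatch` cost.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    weights = weights or {}
    total = 0.0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == b:
            continue  # identical column adds nothing
        if a == "-" or b == "-":
            total += gap  # indel column, scored higher than a substitution
        else:
            total += weights.get((a, b), mismatch)
    return total


# One substitution (T/A) and one indel (C/-): 1.0 + 2.0
print(dna_distance("ACGTTGCA", "ACGATG-A"))
```

Passing a `weights` dictionary lets some replacements cost less than others, for example `{("A", "G"): 0.5, ("G", "A"): 0.5}`; those numbers are again only illustrative.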
Distances between amino acid sequences are a bit more complicated to calculate.
From a functional standpoint, some amino acids can replace one another with relatively little effect on the structure and function of the final protein while other replacements can be devastating.
From the standpoint of the genetic code, some amino acid changes can be made due to the replacement of a single DNA base while others require two or even three changes in the DNA sequence.
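The genetic-code point can be checked directly. The sketch below builds the standard codon table and reports the smallest number of base changes needed to turn a codon for one amino acid into a codon for another. The helper names (`codons_for`, `min_base_changes`) are made up for this example.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codons_for(amino_acid):
    """Return every codon that encodes the given one-letter amino acid."""
    return [codon for codon, aa in CODON_TABLE.items() if aa == amino_acid]


def min_base_changes(aa_from, aa_to):
    """Smallest number of single-base changes between any codon pair."""
    return min(
        sum(x != y for x, y in zip(c1, c2))
        for c1 in codons_for(aa_from)
        for c2 in codons_for(aa_to)
    )


print(min_base_changes("M", "I"))  # ATG -> ATA differs at one base
print(min_base_changes("W", "D"))  # TGG vs GAT/GAC differ at all three bases
```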
In practice, what has been done is to calculate tables of frequencies of all amino acid replacements (mutations) between sets of related amino acid sequences in the databanks (protein families).
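The counting step can be sketched as follows. This is a simplified illustration that tallies replacements directly between pairs of aligned sequences; the published tables were built with more care (for example from many sequences per family), and the two sequence pairs below are invented purely for the example.

```python
from collections import Counter


def count_replacements(aligned_pairs):
    """Tally unordered amino acid replacements across aligned sequence pairs.

    Columns that are identical or contain a gap ('-') are skipped.
    """
    counts = Counter()
    for seq_a, seq_b in aligned_pairs:
        if len(seq_a) != len(seq_b):
            raise ValueError("each pair must be aligned to the same length")
        for a, b in zip(seq_a, seq_b):
            if a == b or a == "-" or b == "-":
                continue
            counts[tuple(sorted((a, b)))] += 1  # D->E and E->D share a key
    return counts


# Hypothetical aligned pairs, not real database entries
toy_family = [("ACDEK", "ACEEK"), ("LIVKR", "LLVRR"), ("MDKE-", "MEREA")]
for pair, n in sorted(count_replacements(toy_family).items()):
    print(pair, n)
```

Dividing each count by the total number of compared positions would turn these tallies into replacement frequencies.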
One of the most widely used of these mutation frequency tables is the PAM250 matrix - shown below.
PAM stands for "percent accepted mutations". The table is also known as a mutation probability matrix, i.e. it reflects the probability that any amino acid will change to any other amino acid.
The 250 in PAM250 indicates that the data, gathered from 71 sets of aligned sequences, were extrapolated up to the level of 250 amino acid replacements per 100 residues.
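The extrapolation is usually described as repeated multiplication: a probability matrix for 1 replacement per 100 residues is multiplied by itself 250 times. The sketch below shows the idea on a hypothetical two-letter alphabet, since the real 20 x 20 probabilities are not given here; `TOY_PAM1` and the function names are assumptions for this example.

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [
        [sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def extrapolate(pam1, steps):
    """Multiply pam1 by itself `steps` times, starting from the identity."""
    n = len(pam1)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(steps):
        result = mat_mul(result, pam1)
    return result


# Hypothetical two-letter alphabet: each residue has a 1% chance of being
# replaced per step, i.e. 1 replacement per 100 residues.
TOY_PAM1 = [[0.99, 0.01],
            [0.01, 0.99]]

toy_pam250 = extrapolate(TOY_PAM1, 250)
for row in toy_pam250:
    print([round(p, 4) for p in row])
```

Each row still sums to 1 after extrapolation. Some positions change more than once and some change back, so 250 replacements per 100 residues does not mean that every residue ends up different.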
- A score above 0 indicates that these amino acids replace each other more often than expected by chance. That is, they are functionally equivalent and/or easily inter-mutable.
- Scores below 0 indicate two amino acids that are seldom interchanged.
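The sign convention follows from how such scores are commonly derived: as a scaled logarithm of the ratio between how often a pair is observed and how often it would be expected by chance. The sketch below assumes the 10 x log10 scaling usually attributed to the Dayhoff matrixes; the frequencies in the example calls are invented.

```python
import math


def log_odds_score(observed_pair_freq, freq_a, freq_b, scale=10.0):
    """Rounded log-odds score for one amino acid pair.

    observed_pair_freq: how often the two residues are seen aligned.
    freq_a, freq_b: background frequencies of the two residues, so that
    freq_a * freq_b is the pairing frequency expected by chance.
    """
    expected = freq_a * freq_b
    return round(scale * math.log10(observed_pair_freq / expected))


# Hypothetical frequencies, chosen only to show the sign of the result
print(log_odds_score(0.020, 0.1, 0.1))  # seen twice as often as chance: positive
print(log_odds_score(0.005, 0.1, 0.1))  # seen half as often as chance: negative
print(log_odds_score(0.010, 0.1, 0.1))  # seen exactly as often as chance: zero
```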
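The PAM250 scoring matrix

|   | A | R | N | D | C | Q | E | G | H | I | L | K | M | F | P | S | T | W | Y | V |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A | 2 |
| R | -2 | 6 |
| N | 0 | 0 | 2 |
| D | 0 | -1 | 2 | 4 |
| C | -2 | -4 | -4 | -5 | 4 |
| Q | 0 | 1 | 1 | 2 | -5 | 4 |
| E | 0 | -1 | 1 | 3 | -5 | 2 | 4 |
| G | 1 | -3 | 0 | 1 | -3 | -1 | 0 | 5 |
| H | -1 | 2 | 2 | 1 | -3 | 3 | 1 | -2 | 6 |
| I | -1 | -2 | -2 | -2 | -2 | -2 | -2 | -3 | -2 | 5 |
| L | -2 | -3 | -3 | -4 | -6 | -2 | -3 | -4 | -2 | 2 | 6 |
| K | -1 | 3 | 1 | 0 | -5 | 1 | 0 | -2 | 0 | -2 | -3 | 5 |
| M | -1 | 0 | -2 | -3 | -5 | -1 | -2 | -3 | -2 | 2 | 4 | 0 | 6 |
| F | -4 | -4 | -4 | -6 | -4 | -5 | -5 | -5 | -2 | 1 | 2 | -5 | 0 | 9 |
| P | 1 | 0 | -1 | -1 | -3 | 0 | -1 | -1 | 0 | -2 | -3 | -1 | -2 | -5 | 6 |
| S | 1 | 0 | 1 | 0 | 0 | -1 | 0 | 1 | -1 | -1 | -3 | 0 | -2 | -3 | 1 | 3 |
| T | 1 | -1 | 0 | 0 | -2 | -1 | 0 | 0 | -1 | 0 | -2 | 0 | -1 | -2 | 0 | 1 | 3 |
| W | -6 | 2 | -4 | -7 | -8 | -5 | -7 | -7 | -3 | -5 | -2 | -3 | -4 | 0 | -6 | -2 | -5 | 17 |
| Y | -3 | -4 | -2 | -4 | 0 | -4 | -4 | -5 | 0 | -1 | -1 | -4 | -2 | 7 | -5 | -3 | -3 | 0 | 10 |
| V | 0 | -2 | -2 | -2 | -2 | -2 | -2 | -1 | -2 | 4 | 2 | -2 | 2 | -1 | -1 | -1 | 0 | -6 | -2 | 4 |

Dayhoff, MO, Schwartz, RM, Orcutt, BC (1978) A model of evolutionary change in proteins, matrixes for detecting distant relationships. In Dayhoff, MO (ed.), Atlas of protein sequence and structure, Vol 5, pp. 345-358. National Biomedical Research Foundation, Washington, DC.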
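To show how such a table is used, here is a minimal sketch that scores an ungapped alignment by summing the matrix entry for each aligned pair. Only a handful of entries are copied in, taken from the matrix printed on this page, to keep the example short; the names `PAM250_SUBSET` and `alignment_score` are assumptions for this illustration.

```python
# A few entries copied from the PAM250 table on this page. Keys are
# alphabetically sorted residue pairs, because the matrix is symmetric.
PAM250_SUBSET = {
    ("A", "A"): 2, ("D", "D"): 4, ("R", "R"): 6, ("W", "W"): 17,
    ("D", "N"): 2, ("D", "E"): 3, ("K", "R"): 3, ("I", "L"): 2,
    ("F", "Y"): 7, ("C", "W"): -8,
}


def pair_score(a, b, matrix=PAM250_SUBSET):
    """Look up the score for one aligned pair, in either order."""
    return matrix[tuple(sorted((a, b)))]


def alignment_score(seq_a, seq_b, matrix=PAM250_SUBSET):
    """Sum pair scores over an ungapped alignment of equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(pair_score(a, b, matrix) for a, b in zip(seq_a, seq_b))


# K/R, L/I, F/Y and D/E are all commonly interchanged pairs
print(alignment_score("KLFD", "RIYE"))
# W/W scores highly, but C/W is strongly penalized
print(alignment_score("WC", "WW"))
```

A pair missing from the subset raises a `KeyError`; a full implementation would load all 210 entries and also handle gaps.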
Many other scoring matrixes have been developed since the "classic" PAM250.
Both the NCBI web BLAST server and GCG use versions of the BLOSUM matrix (developed by Henikoff & Henikoff) as the default. The BLOSUM mutation frequencies are calculated from a larger set of protein families.
Henikoff, S. and Henikoff, J.G. (1992). Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA 89:10915-19.
Many of these matrixes are optimized for either larger or smaller evolutionary distances than PAM250.
Certain matrixes work better for comparisons within certain protein families.
It is best to try several different scoring matrixes when comparing protein sequences.
Using Computers for Molecular Biology
Stuart M. Brown, Ph.D., RCR, NYU Medical Center
Comments to: browns02@mcrcr.med.nyu.edu