Introduction to Similarity Searching

*** If you have just determined the sequence of an interesting bit of DNA, one of the first questions you are likely to ask yourself is:
"Has anybody else seen anything like this?"
*** Fortunately, there has been a very successful international effort to collect all the sequences (both DNA and protein) that researchers have determined into databases so they can be searched.

*** However, these databases are HUGE, so any search must compare your sequence against a vast number of other sequences.

*** A number of computer programs have been written to rapidly search a database for similarity to a query sequence.

*** The techniques these programs use to make searching rapid result in some loss of rigor in the comparison. [Much slower, but more rigorous, programs are available.]

*** It is possible (although, as it turns out, unlikely) that a weak but important similarity could be missed by these programs (a "false negative" score).

*** In addition, these programs will often flag a sequence as being similar to your query sequence when the similarity is not significant (a "false positive" score).

*** Thus, these programs should be seen as tools for identifying a set of sequences from the database for retrieval and further analysis, rather than a complete analysis in themselves.


Similarity vs. Homology

*** A database search is frequently, but incorrectly, referred to as homology searching.

*** The term homology implies a common evolutionary relationship between two traits, whether they are DNA sequences or bristle patterns on a fly's nose.

*** Just because two sequences share a stretch of nearly identical nucleotides (or amino acids) does not mean that they are directly descended from a common ancestor. That is a question for you, the biologist, to answer, not a computer program.

*** So, from now on, let's all use correct terminology and call database searching by sequence matching "similarity searching".

*** Of course, a very high level of similarity is a strong indication of homology. As a general rule, 25% identity over a stretch of 100 amino acids can be considered to be good evidence of common ancestry for two sequences.
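
*** As a minimal sketch of how the rule of thumb above could be checked, the Python function below computes percent identity from two sequences that have already been aligned (gaps written as "-"). The function name and the choice to skip gapped columns are assumptions made for illustration; different programs define the denominator differently.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of gap-free aligned columns that hold identical residues."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            continue  # assumption: columns containing a gap are not counted
        compared += 1
        if x == y:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


# Six of the eight aligned residues are identical.
print(percent_identity("MKTAYIAK", "MKSAYLAK"))  # 75.0
```

Remember that the rule of thumb also depends on the length of the aligned stretch: 25% identity over 100 amino acids means far more than 25% identity over 12.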


Which program should you use to search a database?

*** There is no correct answer to this question. Many different similarity searching programs are available. I will give you some information about several popular programs and enough theory for you to make your own choices.

*** The molecular biology community has come to rely on shortcut algorithms that use a heuristic approach (based on a process of successive approximations).
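
*** To give a feel for what such a shortcut looks like: FASTA and BLAST both begin by locating short "words" shared between the query and a database sequence, and only then examine the surrounding region more carefully. The Python sketch below shows just that first word-lookup step. The function names and the word length are hypothetical, and this is not the code used by either program.

```python
from collections import defaultdict


def index_words(seq: str, k: int) -> dict:
    """Map every word of length k in seq to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def shared_word_hits(query: str, subject: str, k: int = 4) -> list:
    """Return (query_pos, subject_pos) pairs where the same k-letter word occurs."""
    index = index_words(subject, k)
    hits = []
    for q in range(len(query) - k + 1):
        for s in index.get(query[q:q + k], []):
            hits.append((q, s))
    return hits


print(shared_word_hits("GATTACA", "TTGATTAC"))  # [(0, 2), (1, 3), (2, 4)]
```

All three hits lie on the same diagonal (subject position minus query position equals 2), which marks a region worth a closer look. A sequence with no shared words is skipped entirely, and that is where both the speed and the risk of a false negative come from.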

*** There is a tradeoff between sensitivity for detecting distantly related sequences vs. the number of unrelated "false positives" that are found in a search.

*** It is also important to know that protein similarity searches are more sensitive than comparisons of DNA sequences.

  1. The DNA "alphabet" contains only 4 letters, while the amino acid alphabet has 20 letters, so the probability of chance matches is much greater with DNA-DNA comparisons.

  2. A pair of DNA bases is generally scored as a match or a mismatch, while two amino acids can share varying degrees of similarity based on their physical and chemical properties, similarity of DNA codons, and natural inter-mutation rates.

  3. The protein databanks are much smaller than the DNA databanks, so searches can be more sensitive without incurring too many false positives.
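
*** Point 1 above can be made concrete with a little arithmetic. The Python sketch below assumes every letter of the alphabet is equally likely, which is a simplification (real sequences have biased composition), and the function name is hypothetical.

```python
def chance_match_probability(alphabet_size: int, run_length: int = 1) -> float:
    """Probability that run_length consecutive positions match purely by chance,
    assuming all letters are equally frequent (a simplifying assumption)."""
    return (1.0 / alphabet_size) ** run_length


for run in (1, 5, 10):
    dna = chance_match_probability(4, run)
    protein = chance_match_probability(20, run)
    print(f"run of {run:2d}: DNA {dna:.2e}   protein {protein:.2e}")
```

Under this assumption a single position matches by chance 1 time in 4 for DNA but only 1 time in 20 for protein, and the gap widens rapidly as the matching run gets longer.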


Global vs. Local similarity

*** Early similarity tools developed by Needleman & Wunsch (J Mol Biol 48:443-453, 1970) and Sellers (SIAM J Appl Math 26:787-793, 1974) calculated a "global" similarity score over the entire lengths of the sequences being compared.
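
*** The following Python sketch computes a global score by dynamic programming, in the spirit of the Needleman-Wunsch approach. The scoring values (match +1, mismatch -1, gap -1) and the function name are assumptions chosen for illustration, not values taken from the original papers.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -1) -> int:
    """Best score for aligning ALL of a against ALL of b (linear gap penalty)."""
    # prev[j] holds the best score for the first i-1 letters of a vs. the first j of b.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(diag,               # align a[i-1] with b[j-1]
                          prev[j] + gap,      # gap in b
                          curr[j - 1] + gap)  # gap in a
        prev = curr
    return prev[len(b)]


print(global_alignment_score("ACGT", "AGT"))  # 2 (three matches, one gap)
```

Because every letter of both sequences must be accounted for, unrelated flanking regions drag the score down even when a well-conserved region lies in the middle.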

*** Global algorithms are often not very sensitive for highly diverged sequences; a better (and faster) method focuses on short regions of "local" similarity.

*** We will discuss the three most widely used local similarity algorithms:

Smith-Waterman (Smith and Waterman, J Mol Biol 147:195-197, 1981)
BLAST (Altschul et al., J Mol Biol 215:403-410, 1990)
FASTA (Pearson and Lipman, Proc Natl Acad Sci USA 85:2444-2448, 1988)

*** The Smith-Waterman algorithm is a rigorous "dynamic programming" approach that does not make use of heuristic shortcuts. It is not generally used for routine database searching because, although it is more sensitive than BLAST and FASTA, it runs roughly 100 times slower.
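
*** A minimal Python sketch of the local dynamic programming idea is shown below. The scoring values (match +2, mismatch -1, gap -1) and the function name are assumptions for illustration. The key feature is that a cell score is never allowed to fall below zero, so an alignment can start and end anywhere in either sequence.

```python
def local_alignment_score(a: str, b: str, match: int = 2,
                          mismatch: int = -1, gap: int = -1) -> int:
    """Best score over all pairs of substrings of a and b (linear gap penalty)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 lets a new local alignment start here instead of
            # carrying forward a negative score.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# The shared GATTACA core is found despite unrelated flanking sequence.
print(local_alignment_score("TTTTGATTACATTTT", "CCCCGATTACACCCC"))  # 14
```

Every cell of the table is filled in, so the work grows with the product of the two sequence lengths. Repeating that for each entry in a large database is what makes the rigorous method so slow.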

*** For some DNA-DNA searches, FASTA may be more sensitive than BLAST, but in most cases both programs should be tried to ensure the most comprehensive search.

*** On the RCR Alpha, FASTA searches run locally on our processors with our local copy of the GenBank database. BLAST searches are sent over the Internet to NCBI using the netblast command.


Using Computers for Molecular Biology
Stuart M. Brown, Ph.D, RCR, NYU Medical Center
Comments to: browns02@mcrcr.med.nyu.edu