Similarity searching with BLAST

Similarity searching is the best way to answer the question:
"Has anybody ever seen a sequence like this one before?"

"Sequence alignments provide a powerful way to compare novel sequences with previously characterized genes. Both functional and evolutionary information can be inferred from well designed queries and alignments."

GenBank is a database holding the largest collection of gene sequences (both DNA and protein). It is run by a US government group called the National Center for Biotechnology Information (NCBI), which is part of the National Library of Medicine at the National Institutes of Health. The web address is www.ncbi.nlm.nih.gov, perhaps not the most intuitive, but well worth memorizing.

The NCBI maintains a GenBank search tool called BLAST (Basic Local Alignment Search Tool). It is extremely powerful and usually quite fast.

The NCBI has a BLAST Tutorial that is quite informative and fairly easy to follow.

If the NCBI tutorial proves to be tough going, Sandra Porter at GeoSpiza Inc. has created a truly excellent BLAST for Beginners tutorial.

OK, now let's try some BLAST searches. Let's start with an easy one. Go to the NCBI website and get the protein sequence for the mouse leptin gene (NP_032519). Now go to the BLAST page, choose protein-protein BLAST and paste the protein sequence into the "Search" box. Now choose a database to search. For this exercise, use the "swissprot" database. This is a well-annotated set of known proteins. It is far from complete, but all of the sequences in this database have useful annotation information.
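If you would rather fetch the sequence with a script than by clicking through the website, the first step above can be sketched in Python. This is a minimal sketch, assuming the NCBI E-utilities "efetch" service is reachable at the URL shown. The function names (efetch_url, parse_fasta, fetch_protein) are made up for this example and are not part of any NCBI library.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the public NCBI E-utilities efetch endpoint.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession):
    """Build a URL that asks for one protein record in FASTA format."""
    params = {"db": "protein", "id": accession, "rettype": "fasta", "retmode": "text"}
    return EFETCH + "?" + urlencode(params)


def parse_fasta(text):
    """Split a single-record FASTA string into (header, sequence)."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("not a FASTA record")
    # Drop the leading ">" from the header and join the sequence lines.
    return lines[0][1:], "".join(lines[1:])


def fetch_protein(accession):
    """Download and parse one record (needs network access; not called here)."""
    with urlopen(efetch_url(accession)) as response:
        return parse_fasta(response.read().decode("utf-8"))
```

Calling fetch_protein("NP_032519") should hand back the header line and the protein sequence as one string, ready to paste into the BLAST "Search" box.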

Look at the results. There is a graphic at the top of the page that shows where each hit matches your query sequence, with a color code for the quality of the match. In this case there are only two kinds of hits: very good, full-length matches and short, low-quality matches. Below the graphic is a list of the names of the matching sequences and the e-value of each match. There is a very sharp drop-off from the highly significant e-values in the e-50 range to the non-significant e-values close to 1. Leptin is an unusual gene. It has orthologs in a number of species, but it is not a member of a gene family: there is only one copy of the gene in each species and nothing else at all that is similar to it.
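The two-group pattern described above can be made concrete with a few lines of Python. This is only a sketch: the hit names and e-values are invented toy numbers, not real BLAST output, and the cutoff of 1e-5 is an assumed rule of thumb rather than a value taken from the NCBI pages.

```python
def split_hits(hits, threshold=1e-5):
    """Sort (name, e_value) pairs and split them at an e-value cutoff.

    Smaller e-values mean the match is less likely to be due to chance.
    The default threshold is an assumption chosen for illustration.
    """
    ordered = sorted(hits, key=lambda hit: hit[1])
    strong = [hit for hit in ordered if hit[1] <= threshold]
    weak = [hit for hit in ordered if hit[1] > threshold]
    return strong, weak


# Toy data (invented): two excellent matches and two chance-level matches.
toy_hits = [("hit_A", 2e-52), ("hit_B", 0.8), ("hit_C", 5e-50), ("hit_D", 1.3)]

strong, weak = split_hits(toy_hits)
print("significant:", [name for name, _ in strong])
print("not significant:", [name for name, _ in weak])
```

With a gene like leptin, nothing lands in between the two groups, so the exact choice of cutoff hardly matters. For a gene that belongs to a large family, the list would fade gradually and the cutoff would be a real judgment call.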

If you have some time, try this search again using other databases. Try searching the really huge DNA databases (genome sequences, EST sequences) using the translated BLAST search tool (Protein query - Translated db). I don't think you will find any intermediate-quality matches. Why aren't there any other proteins similar to leptin? Biology is fascinating.
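A translated search compares your protein query against DNA sequences that have been translated in all six reading frames (three on each strand). The sketch below shows that idea on a short DNA string using the standard genetic code. It is an illustration of the concept only, not how BLAST itself is implemented, and the function names are made up for this example.

```python
from itertools import product

# Standard genetic code, with codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frames(dna):
    """Three forward-strand frames followed by three reverse-strand frames."""
    strands = (dna, reverse_complement(dna))
    return [translate(strand[offset:]) for strand in strands for offset in range(3)]


for frame in six_frames("ATGGCCTAA"):
    print(frame)
```

Only one of the six frames usually makes biological sense for a given gene, which is why a translated search has to try all of them.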


Stuart Brown - RCR
Last modified: Wed May 15 13:47:30 EDT 2002