
Strategies and Options for Similarity Searching

There are a variety of options to consider in a similarity search.

*** First, you should decide whether you want to work within GCG, on the Web, or in a custom program on your personal computer. GCG will give you the most flexibility to do further work with sequences that are found by the search, but may not be as simple as a Web server.

*** As a general strategy, it is best to start with the fastest tools. Initial searches should be done with both the BLAST and FASTA programs.

*** You also need to decide whether to search protein or nucleotide databases.

*** Generally, if your query sequence is protein you will search protein databases, and with a DNA query you will search nucleotide databases (this is the default for both BLAST and FASTA).

*** However, it is possible to automatically translate a DNA sequence into amino acids in all six reading frames (BLASTX) and compare it to protein databases, or to compare a protein sequence to the six reading frame translation of all DNA database sequences (TFASTA and TBLASTN).

*** Searching translated databases uses a LOT more computer power, so use this option sparingly. However, it is probably the best way to search the EST databases.
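To make the "six reading frames" idea concrete, here is a minimal Python sketch of six-frame translation using the standard genetic code. It is an illustration only, not the code used by BLASTX, TFASTA, or TBLASTN; the function names and the short test sequence are made up for this example.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G
# (third position varies fastest). '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from the start of dna; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")
        for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Return a dict of frame label ('+1'..'+3', '-1'..'-3') to protein string."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    # Hypothetical toy sequence, chosen only to keep the output short.
    for label, protein in six_frame_translations("ATGGCCTAA").items():
        print(label, protein)
```

Every database sequence has to be translated six ways like this before it can be compared to a protein query, which is one reason translated searches cost so much more computer time.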

Next, you need to decide what databases and what sections of those databases to search.


*** Some people want to find every match to a given sequence; others may want to limit the search to humans, to mammals, or to the animal kingdom.

*** The more you can restrict your search, the faster it will run and the fewer uninteresting hits you will have to sort through.

*** Another important consideration is whether to search the EST, STS, and the newer GSS (genome survey sequence) and HTG (high-throughput genomic) databases.

*** These "genome project" mass sequencing databases are very large, often unannotated, and contain a lot of low quality "single pass" sequence data.

*** By leaving them out, your search will go significantly faster, and leave you with many fewer false matches to sort through.

*** If you are starting with an essentially unknown sequence, then a match against an unannotated EST or genome fragment will probably not contribute much useful information.

*** Once you have completed a search of the well-documented sections of the databases, you might run a separate search against the EST and GSS sections.




FRAMESEARCH

*** One other search method that you should know about is called FRAMESEARCH. It is specifically designed for dealing with bad data. That is, it can find an alignment between a protein query and a nucleotide test sequence even if the latter contains frame-shifting gaps.

*** There are a lot of sequences in the databases, primarily in the EST divisions, that contain many frameshifts in the sequence.

*** In a really junky sequence there might be multiple frameshifts, so that no in-frame segment is long enough to show up in BLAST or FASTA.

*** FRAMESEARCH, unfortunately, is a very slow program, as it is an extended version of the Smith-Waterman method.

*** One thing that you might want to do, which is much less time consuming, is to run FRAMESEARCH directly against any EST hit you find with either FASTA or BLAST.
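The effect of a frameshift on a translated search can be seen in a small Python sketch. This is not the FRAMESEARCH algorithm, only an illustration of the problem it is designed to handle; the DNA sequence, the inserted base, and the function names are all made up for this example. A single extra base splits the protein match between two reading frames, so each frame on its own shows only a short fragment.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna, offset=0):
    """Translate complete codons of dna starting at the given offset (0, 1, or 2)."""
    seq = dna.upper()[offset:]
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")
        for i in range(0, len(seq) - 2, 3)
    )


def longest_common_substring(a, b):
    """Return the longest run of characters shared by a and b (simple DP)."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


# Hypothetical coding sequence and its protein.
original = "ATGGCTAAAGGTCTGGAACCGTGGTTC"
protein = translate(original)

# Simulate a sequencing error: one extra base inserted after the 4th codon.
shifted = original[:12] + "C" + original[12:]

if __name__ == "__main__":
    print("query protein:", protein)
    for offset in range(3):
        frame = translate(shifted, offset)
        shared = longest_common_substring(protein, frame)
        print(f"frame {offset + 1}: {frame}  longest shared run: {shared}")
```

In this toy case the start of the protein matches in frame 1 and the rest matches in frame 2. A method that can jump between frames within one alignment can join the two pieces, which is what lets it recover matches that look too short in any single frame.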



Using Computers for Molecular Biology
Stuart M. Brown, Ph.D, RCR, NYU Medical Center
Comments to: browns02@mcrcr.med.nyu.edu