FASTA
FASTA is a powerful tool for scanning databases to find sequences that are similar to a query sequence. It is generally best to make protein-protein comparisons, but FASTA can also compare a DNA query sequence to DNA databanks.
The related program TFASTA allows a protein query sequence to be compared to DNA databanks. Each DNA sequence in the databank is translated in all six reading frames, then six protein-protein comparisons are made with FASTA. This is generally the best way to scan the EST databases.
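The six-frame translation step can be illustrated with a short Python sketch. This is not TFASTA's own code; it assumes the standard genetic code, and the function names (`six_frame_translations`, `translate`, `reverse_complement`) are hypothetical names chosen for this example.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order (assumption:
# the standard code is used; '*' marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons; unknown codons (e.g. containing N) become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to protein strings."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


frames = six_frame_translations("ATGGCCTAA")
print(frames["+1"])  # forward strand, first reading frame
```

Each of the six resulting protein strings would then be compared to the protein query in the same way as an ordinary protein-protein comparison.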
In GCG 10, an improved version of TFASTA known as TFASTX has been added. It compares protein query sequences to DNA databanks while taking possible frameshifts into account. This makes it an ideal tool for searching ESTs, which often contain sequencing errors. There is also a FASTX tool for comparing DNA query sequences (with frameshifts) to protein databanks.
FASTA starts by making a generalization from the idea of dotplots.
In a dotplot, regions of similarity between two sequences show up as diagonals.
FASTA essentially calculates the sum of the dots along each diagonal.
The "FAST" in FASTA comes from the method of calculating those diagonal sums.
If it were necessary to actually construct a dotplot matrix and then add along the diagonals for every sequence in the database, then FASTA wouldn't be any quicker than Smith-Waterman.
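To make the cost of the naive approach concrete, here is a minimal Python sketch of summing dots along dotplot diagonals by comparing every residue pair. The function name `dotplot_diagonal_sums` and the convention of labelling a diagonal by `i - j` are assumptions made for this illustration.

```python
from collections import Counter


def dotplot_diagonal_sums(query, target):
    """Count identical residue pairs ("dots") on each diagonal.

    A dot at query position i and target position j lies on diagonal i - j.
    Every pair of positions is compared, so the work grows with
    len(query) * len(target).
    """
    sums = Counter()
    for i, a in enumerate(query):
        for j, b in enumerate(target):
            if a == b:
                sums[i - j] += 1
    return sums


sums = dotplot_diagonal_sums("GATTACA", "TTAC")
print(sums.most_common(1))  # the diagonal holding the most dots
```

Repeating this full comparison for every sequence in a databank is the cost that the word-based method avoids.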
Instead, FASTA uses a "word" based method.
- It makes a list of all words (1 or 2 amino acids, or 5 or 6 nucleotides) in each sequence.
- It matches identical words from each list, and then creates diagonals by joining adjacent matches, but it only counts non-overlapping words.
- It then re-scores the highest scoring regions using a replacement matrix such as PAM250. The best of these scores is called "init1".
- It then tries to join together the high scoring diagonals, allowing for gaps. The best score from that is called "initn".
- Finally, it makes an optimal local alignment around the regions it has discovered using the Smith-Waterman algorithm.
- This last alignment step is only applied to a small number of "hit" sequences, which had high "initn" values after the database search.
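The word-matching stage in the list above can be sketched in Python as follows. This is a simplified illustration, not the FASTA source: it counts every shared word per diagonal, and it leaves out the non-overlapping-word rule, the re-scoring with a replacement matrix, and the joining of diagonals. The names `index_words`, `word_diagonal_counts`, and `best_diagonals` are hypothetical.

```python
from collections import Counter, defaultdict


def index_words(seq, ktup):
    """Map each word of length ktup to the list of positions where it starts."""
    positions = defaultdict(list)
    for i in range(len(seq) - ktup + 1):
        positions[seq[i:i + ktup]].append(i)
    return positions


def word_diagonal_counts(query, target, ktup=2):
    """Count shared words on each diagonal (diagonal = query pos - target pos)."""
    query_index = index_words(query, ktup)
    counts = Counter()
    for j in range(len(target) - ktup + 1):
        # Look up the target word directly instead of scanning the whole query.
        for i in query_index.get(target[j:j + ktup], ()):
            counts[i - j] += 1
    return counts


def best_diagonals(query, target, ktup=2, top=3):
    """Return up to `top` (diagonal, word count) pairs, highest count first."""
    return word_diagonal_counts(query, target, ktup).most_common(top)


print(best_diagonals("GATTACA", "TTAC", ktup=2))
```

Because the query words are indexed once, each databank sequence only needs a single pass with dictionary lookups, rather than a full residue-by-residue comparison.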
A couple of things to keep in mind with FASTA:
For amino acid sequences, it is most sensitive with a word size (ktup) of 1, but the default is 2.
The trade-off is speed: a search with a word size of 1 takes much longer to run than one with a word size of 2.
Here is another explanation of how FASTA works:
FASTA uses four steps to calculate similarity scores between a pair of sequences:
- Identify regions shared by the two sequences that have the highest density of single-residue identities (ktup=1) or two consecutive identities (ktup=2).
- Re-scan the best regions identified in step 1 using the PAM250 matrix. The single best score is stored as init1 for reporting later.
- Determine if gaps can be used to join the regions identified in step 2. If so, determine a similarity score for the gapped alignment, which is reported as initn.
- Construct an optimal alignment of the query sequence and the library sequence (Smith-Waterman algorithm). This score is reported as the optimized score (opt).
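The final step relies on Smith-Waterman local alignment. The Python sketch below computes only the best local alignment score, and it makes simplifying assumptions that differ from FASTA: a flat match/mismatch score stands in for the PAM250 matrix, gaps carry a simple linear penalty, and the whole matrix is filled rather than only the region around the best initial hits. The function name and default scores are hypothetical.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    Uses the Smith-Waterman recurrence with a linear gap penalty and keeps
    only two rows of the scoring matrix in memory.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                   # start a new local alignment here
                prev[j - 1] + pair,  # align a[i-1] with b[j-1]
                prev[j] + gap,       # gap in b
                curr[j - 1] + gap,   # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("GATTACA", "TTAC"))
```

Because a cell score never drops below zero, the alignment is free to start and end anywhere, which is what makes it local rather than global.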
FASTA Output
FASTA calculates an E() value (the number of library sequences expected to score this well by chance). According to Dr. Pearson, if it is below 0.02, the similarity bears further examination; if not, statistical significance is simply not there.
The final output plots the initial scores of each library sequence in a histogram ranked by z-score, which is derived from the opt score corrected for differences in sequence length. The graph shows a roughly normal curve of z-scores and Expect values, so you can compare the typical values of these statistics for random matches with the more significant matches at the very bottom of the graph.
A list of the most significant scores follows the histogram, and then the optimal alignments are displayed for these matches (the cutoff can be set by the user). The list of matches also contains, for each database sequence, the beginning and end positions of the region of significant similarity to your query sequence. This list can be used by other GCG programs, such as PILEUP, to extract all of the similar sequence fragments.
It is also possible to force FASTA to show global alignments between the best hits and your query sequence rather than the local alignments used in its similarity calculations [use the "/SHOWALL" option on the FASTA command line].
Using FASTA at the RCR
On the RCR's Alpha server, FASTA searches are done within the GCG suite of programs using databases that are maintained locally.
A local FASTA search (for similarity of a single sequence to the entire GenBank non-redundant database) typically takes about 4 minutes and rarely more than 10 minutes.
We recommend that you run FASTA in batch mode, either by using the EZFASTA program to submit your job automatically or by using the "/BATCH" option on the FASTA command line.
The output of a FASTA search can be used as input for other GCG programs that use lists of sequences, such as PILEUP.
The original FASTA manual (written by Dr. W.R. Pearson) is available on the RCR web site at: http://rcr-www.med.nyu.edu/rcr/fastaman.html
and the GCG Program Manual chapter on the GCG version of FASTA is at: http://mcrcr0.med.nyu.edu/gcg/fasta.html
Using Computers for Molecular Biology
Stuart M. Brown, Ph.D, RCR, NYU Medical Center Comments to: browns02@mcrcr.med.nyu.edu