COPYRIGHT NOTICE

Copyright 1988, 1991, 1992, 1994, 1995 by William R. Pearson and the University of Virginia. All rights reserved. The FASTA program and documentation may not be sold or incorporated into a commercial product, in whole or in part, without written consent of William R. Pearson and the University of Virginia. For further information regarding permission for use or reproduction, please contact: David Hudson, Assistant Provost for Research, University of Virginia, P.O. Box 9025, Charlottesville, VA 22906-9025, (804) 924-6853

The FASTA program package

Introduction

This documentation describes version 2.0x of the FASTA program package (see W. R. Pearson and D. J. Lipman (1988), "Improved Tools for Biological Sequence Analysis", PNAS 85:2444-2448, and W. R. Pearson (1990) "Rapid and Sensitive Sequence Comparison with FASTP and FASTA" Methods in Enzymology 183:63-98). Version 2.0 modifies version 1.8 to include explicit statistical estimates for similarity scores based on the extreme value distribution. In addition, FASTA alignments now use the Smith-Waterman algorithm with no limitation on gap size. FASTA and SSEARCH now use the BLOSUM50 matrix by default, with options to change gap penalties on the command line. Version 1.7 replaces rdf2 and rss with prdf and prss, which use the extreme-value distribution to calculate accurate probability estimates. This 2.0x version is an "experimental" version. It includes a number of improvements that have been tested for only a few weeks.

Although there are a large number of programs in this package, they belong to four groups:

  1. Library search programs: FASTA, TFASTA, SSEARCH
  2. Local homology programs: LFASTA, PLFASTA, LALIGN, PLALIGN
  3. Statistical significance: PRDF, RELATE, PRSS
  4. Global alignment: ALIGN
In addition, I have included several programs for protein sequence analysis, including a Kyte-Doolittle hydropathicity plotting program (GREASE, TGREASE), and a secondary structure prediction package (GARNIER).

The FASTA sequence comparison programs on this disk are improved versions of the FASTP program, originally described in Science (Lipman and Pearson, (1985) Science 227:1435-1441). We have made several improvements. First, the library search programs use a more sensitive method for the initial comparison of two sequences which allows the scores of several similar regions to be combined. As a result, the results of a library search are now given with three scores, initn (the new initial score which may include several similar regions), init1 (the old fastp initial score from the best initial region), and opt (the old fastp optimized score allowing gaps in a 32 residue wide band).

These programs have also been modified to become "universal" (hence FAST-A, for FASTA-All, as opposed to FAST-P (protein) or FAST-N (nucleotides)); by changing the environment variable SMATRIX, the programs can be used to search protein sequences, DNA sequences, or whatever you like. By default, FASTA, LFASTA, and the PRDF programs automatically recognize protein and DNA sequences. Sequences are first read as amino acids, and then converted to nucleotides if the sequence is greater than 85% A,C,G,T (the '-n' option can be used to indicate DNA sequences). TFASTA compares protein sequences to a translated DNA sequence. Alternative scoring matrices can also be used. In addition to the BLOSUM50 matrix for proteins, the PAM250 matrix or matrices based on simple identities or the genetic code can also be used for sequence comparisons or evaluation of significance. Several different protein sequence matrices have been included; instructions for constructing your own scoring matrix are included in the file FORMAT.DOC.
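The automatic recognition rule described above (a sequence is treated as DNA when more than 85% of its residues are A, C, G, or T) can be sketched in a few lines of Python. This is only an illustration of the rule, not the package's own C code; the function name looks_like_dna and the handling of case, non-letter characters, and empty input are assumptions.

```python
def looks_like_dna(sequence, threshold=0.85):
    """Return True if more than `threshold` of the letters are A, C, G or T.

    Assumption: case is ignored and non-letter characters are skipped;
    the real programs may treat these details differently.
    """
    residues = [c for c in sequence.upper() if c.isalpha()]
    if not residues:
        return False  # nothing to classify; assumed to default to protein
    acgt = sum(1 for c in residues if c in "ACGT")
    return acgt / len(residues) > threshold
```

A protein such as MWRTCGPPYT contains only four A/C/G/T letters out of ten and is therefore left as protein, which is why short or unusual DNA queries may still need the '-n' option.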

The remainder of this document is divided into three sections:

  1. A guide to using the FASTA programs.
  2. A guide to installing the programs and databases.
  3. A brief history of the changes to the FASTA package.
The programs are very easy to use, so if you are using them on a machine that is administered by someone else, you may want to learn how to use the programs, and then read section (3) to look at some of the more recent changes. If you are installing the programs on your own machine, you will need to read section (2) carefully.


1. Using the FASTA Package

1.1. Overview

The FASTA sequence comparison programs all require similar information: the name of a query sequence file, a library file, and the ktup parameter. All of the programs can accept arguments on the command line, or they will prompt for the file names and ktup value. To use FASTA, simply type:

    FASTA

and you will be prompted for:

    the name of the test sequence file
    the name of the library file
    and whether you want ktup = 1 or 2 (or 1 to 6 for DNA sequences)

ktup = 2 is about 5 times faster than ktup = 1. For a 200 aa sequence against a 10,000,000 aa library, the program takes about 30 min with ktup = 2, 150 min with ktup = 1, on a 12 MHz 286 IBM-PC.


The program can also be run by typing:

    FASTA test.aa /lib/bigfile.lib ktup (1 or 2)

Included with the package are the test files MUSPLFM.AA, LCBO.AA, MCHU.AA and BOVPRL.SEQ. To make certain that everything is working, you can try:

    fasta musplfm.aa lcbo.aa

and

    tfasta musplfm.aa bovprl.seq

To test the local similarity programs LFASTA and PLFASTA, try:

    lfasta mchu.aa mchu.aa

and

    plfasta mchu.aa mchu.aa

(use PLFASTA only on an IBM-PC with graphics or on a Tektronix terminal under UNIX or VMS). MCHU (calmodulin) has four duplicated calcium binding sites that are clearly detected by LFASTA. For a more complicated example, try MWRTC1.aa, myosin heavy chain.

1.2. Sequence files

The FASTA programs know about three kinds of sequence files (four under VMS): (1) plain sequence files that can only be used as query sequences or for LFASTA, PRDF, and ALIGN; (2) standard library files, which are the same as plain sequence files except that each sequence is preceded by a comment line with a '>' in the first column; and (3) distributed sequence libraries (this is a broad class that includes the NBRF/PIR VMS and blocked ascii formats, Genbank flat-file format, EMBL flat-file format, and Intelligenetics format). All of the files that you create should be of type (1) or (2). Type (2) files (ones with a '>' comment line) can be used as query or library sequence files by all of the programs.

I have included several sample test files, *.AA. The first line may begin with a '>' or ';' followed by a comment. The text after ';' in other lines will be ignored. Spaces and tabs (and anything else that is not an amino-acid code) are ignored.

Library files should have the form:

    >Sequence name and identifier
    A F A S Y T .... actual sequence.
    F S S       .... second line of sequence

    >Next sequence name and identifier

This is often referred to as "FASTA" or "Pearson" format. You can build your own library by concatenating several sequence files. Just be sure that each sequence is preceded by a line beginning with a '>' with a sequence name.
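As a rough illustration of the library layout above, the following Python sketch splits a "FASTA"/"Pearson" format text into (name, sequence) pairs. It is not the package's reader: the function name read_fasta_library is hypothetical, and treating every letter as a residue code (while dropping spaces, tabs, and text after ';') is a simplifying assumption.

```python
def read_fasta_library(text):
    """Split Pearson/FASTA-format text into a list of (name, sequence) pairs.

    Assumptions: each record starts with a '>' line; on sequence lines,
    anything after ';' is a comment, and only letters count as residues.
    """
    records = []
    name, residues = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(residues)))
            name, residues = line[1:].strip(), []
        elif name is not None:
            body = line.split(";", 1)[0]  # drop trailing ';' comment
            residues.append("".join(c for c in body.upper() if c.isalpha()))
    if name is not None:
        records.append((name, "".join(residues)))
    return records
```

Because records are delimited only by their '>' lines, concatenating several such files produces another valid library, as described above.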

The test file should not have lines longer than 120 characters, and sequences entered with word processors should use a document mode, with normal carriage returns at the end of lines.

1.3. Program Summary

1.3.1. Sequence search programs

1.3.2. Local similarity programs

1.4. Statistical Significance

With version 2.0 of the FASTA program distribution, FASTA, TFASTA, and SSEARCH now provide estimates of statistical significance for library searches. Work by Altschul, Arratia, Karlin, Mott, Waterman, and others (see Altschul et al. (1994) Nature Genetics 6:119 for an excellent review) suggests that local sequence similarity scores follow the extreme value distribution, so that P(s > x) = 1 - exp(-exp(-lambda(x-u))), where u = ln(Kmn)/lambda and m,n are the lengths of the query and library sequence. This formula can be rewritten as: 1 - exp(-Kmn exp(-lambda x)), which shows that the average score for an unrelated library sequence increases with the logarithm of the length of the library sequence. FASTA and SSEARCH use simple linear regression against the log of the library sequence length to calculate a normalized "z-score" with mean 50, regardless of library sequence length, and variance 10. These z-scores can then be used with the extreme value distribution and the poisson distribution (to account for the fact that each library sequence comparison is an independent test) to calculate the number of library sequences expected to obtain a score greater than or equal to the score obtained in the search. The original idea and routines to do the linear regression on library sequence length were provided by Phil Green, U. Washington. This version of FASTA and SSEARCH uses a slightly different strategy for fitting the data than those originally provided by Dr. Green.
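The two forms of the extreme value formula can be checked numerically with a short Python sketch. The values of lambda and K used in the usage note are purely hypothetical, the function names are illustrative, and the last function uses a simple "library size times probability" expectation as an assumed stand-in for the program's z-score based calculation, which is not reproduced here.

```python
import math

def evd_p_location_form(x, lam, K, m, n):
    """P(s > x) = 1 - exp(-exp(-lambda*(x - u))), with u = ln(K*m*n)/lambda."""
    u = math.log(K * m * n) / lam
    return 1.0 - math.exp(-math.exp(-lam * (x - u)))

def evd_p_product_form(x, lam, K, m, n):
    """The same probability written as 1 - exp(-K*m*n*exp(-lambda*x))."""
    return 1.0 - math.exp(-K * m * n * math.exp(-lam * x))

def expected_hits(p, library_size):
    """Assumed simplification: expected number of library sequences
    reaching the score is the library size times the per-sequence P."""
    return library_size * p
```

For example, with the hypothetical values lam=0.25 and K=0.1, a query of length 200, and a library sequence of length 300, both functions return the same probability for any score x, and that probability falls as x rises.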

The expected number of sequences is plotted in the histogram using an "*". Since the parameters for the extreme value distribution are not calculated directly from the distribution of similarity scores, the pattern of "*'s" in the histogram gives a qualitative view of how well the statistical theory fits the similarity scores calculated by FASTA and SSEARCH. For FASTA, if optimized scores are calculated for each sequence in the database (-o option), the agreement between the actual distribution of "z-scores" and the expected distribution based on the length dependence of the score and the extreme value distribution is usually very good. Likewise, the distribution of SSEARCH Smith-Waterman scores typically agrees closely with the actual distribution of "z-scores." The agreement with unoptimized scores, ktup=2, is often not very good, with too many high scoring sequences and too few low scoring sequences compared with the predicted relationship between sequence length and similarity score. In those cases, the expectation values may be overestimates.

The statistical routines assume that the library contains a large sample of unrelated sequences. If this is not the case, then the expectation values are meaningless. Likewise, if there are fewer than 20 sequences in the library, the statistical calculations are not done.

For protein searches, library sequences with E() values < 0.01 for searches of a 10,000 entry protein database are almost always homologous. Frequently sequences with E()-values from 1 - 10 are related as well. Remember, however, that these E() values also reflect differences between the amino acid composition of the query sequence and that of the "average" library sequence. Thus, when searches are done with query sequences with "biased" amino-acid composition, unrelated sequences may have "significant" scores because of sequence bias. The programs below, PRDF and PRSS, can address this problem by calculating similarity scores for random sequences with the same length and amino acid composition.

Unless optimization ("-o") is used, E()-values for DNA sequences overestimate the significance of the scores that are obtained, and unrelated sequences frequently have E()-values < 0.0005. With optimization, the accuracy of the E()-values compares favorably with that for protein sequence comparison. This is in part due to the use of more stringent gap penalties for DNA sequence comparison, -16, -4 rather than -12, -2. With the latter penalties, many unrelated sequences appear to have significant similarity. Nevertheless, since protein sequence comparison is much more sensitive, DNA sequence comparison should not be used to identify sequences that encode protein. Even with ktup=6, optimization rarely increases run-times more than 50% with mRNA-size query sequences. Optimization should be used whenever possible.

Similar comments apply to TFASTA, where higher gap penalties (-16, -4) are required for accurate statistical estimates. Because TFASTA produces so many artificial "coding" sequences with atypical amino acid compositions, the statistical estimates with TFASTA are often overestimates. With optimized scores, ktup=1, and gap penalties of -16, -4, unrelated sequences will sometimes have E() values of 0.1. If initn scores are used, unrelated sequences may have E() values < 0.01.

1.5. Other analysis programs

1.6. Options

These programs have a number of output options, which are invoked by the environment variables LINLEN, SHOWALL, and MARKX. Alternatively, these values can be controlled by command line options. The number of sequence residues per output line is adjustable by setting the environment variable LINLEN or the command line option -w. LINLEN is normally 60; to change it, set LINLEN=80 before running the program or add -w 80 to the command line. LINLEN can be set up to 200. SHOWALL (-a) determines whether all, or just a portion, of the aligned sequences are displayed. Previously, FASTP would show the entire length of both sequences in an alignment while FASTN would only show the portions of the two sequences that overlapped. Now the default is to show only the overlap between the two sequences; to show complete sequences, set SHOWALL=1 or use the -a option on the command line.

The differences between the two aligned sequences can be highlighted in several different ways by changing the environment variable MARKX or the -m option. Normally (MARKX=0) the program uses ':' to denote identities and '.' to denote conservative replacements. If MARKX=1, the program will not mark identities; instead conservative replacements are denoted by a 'x' and non-conservative substitutions by a 'X'. If MARKX=2, the residues in the second sequence are only shown if they are different from the first. MARKX=3 displays the aligned library sequences without the query sequence; these can be used to build a primitive multiple alignment. MARKX=4 provides a graphical display of the boundaries of the alignments. Thus the five options are:


     MARKX=0      MARKX=1       MARKX=2       MARKX=3      MARKX=4

    MWRTCGPPYT   MWRTCGPPYT    MWRTCGPPYT                 MWRTCGPPYT
    ::..:: :::     xx  X       ..KS..Y...    MWKSCGYPYT   ----------
    MWKSCGYPYT   MWKSCGYPYT
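The first three marking styles in the table can be reproduced with a small Python sketch. The real programs decide what counts as a "conservative" replacement from the scoring matrix; here that decision is replaced by a hypothetical hard-coded set of pairs (R/K and T/S) chosen only to match the example, and the function name markx_line is likewise an assumption.

```python
# Hypothetical stand-in for the scoring matrix: pairs treated as conservative.
CONSERVATIVE = {frozenset("RK"), frozenset("TS")}

def markx_line(query, library, mode):
    """Return the line printed under the query for MARKX modes 0, 1 and 2.

    Modes 0 and 1 give a marker line; mode 2 gives the second sequence
    with residues identical to the query replaced by '.'.
    """
    out = []
    for q, l in zip(query, library):
        same = q == l
        conservative = frozenset((q, l)) in CONSERVATIVE
        if mode == 0:
            out.append(":" if same else "." if conservative else " ")
        elif mode == 1:
            out.append(" " if same else "x" if conservative else "X")
        elif mode == 2:
            out.append("." if same else l)
        else:
            raise ValueError("only MARKX 0, 1 and 2 are sketched here")
    return "".join(out)
```

Called with the query MWRTCGPPYT and the library sequence MWKSCGYPYT, the three modes give the middle lines shown in the first three columns of the table.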

1.7. Command line options

It is now possible to specify several options on the command line, instead of using environment variables. The command line options are preceded by a dash; the following options are available:

Not all of these options are appropriate for all of the programs. The options above are used by FASTA and TFASTA. RELATE uses the -s option, ALIGN uses the -w, -m, and -s options, and the PRDF program uses -c, -f, -k, and -s.

1.8. Environment variable summary

Environment variables allow you to set search parameters that will be used frequently when you run a program; for example, if you prefer to use the PAM250 scoring matrix, you might "set SMATRIX=250." Command line parameters, if used, always override environment variable settings. The following environment variables are used by this program:


2. Installing the FASTA package

2.1. Installing the programs

2.1.1. Unix version

The FASTA distribution comes with several makefiles that can be used to compile the FASTA programs. Over the years, as ATT Unix System 5 and BSD unix have converged, these files have become very similar. To begin with, I recommend using the standard Makefile. There are two values in the makefile that should be checked against the values used on your system: the HZ value, which is the frequency in ticks per second used by the times() system call; this value can usually be found by running:

    grep HZ /usr/include/sys/*
and the functions available to return random numbers. If you have a rand48() function that returns a 32-bit random number, use it and use the lines:
    NRAND=nrand48
    RANFLG= -DRAND32
If not, you will need to use the rand() function call and determine whether it returns a 16-bit or a 32-bit value. These functions are used by PRDF and PRSS. If you have problems compiling the programs, you may want to examine the makefile.unx and makefile.sun files, to look for differences. I have tried to use very standard unix functions in these programs, and they have been successfully compiled, with very small changes to the Makefile, on Sun's (Sun OS 4.1), IBM RS/6000's (AIX), and MIPS machines (under the BSD environment).

2.1.2. IBM-PC/DOS version

For the IBM-PC/DOS version, the FASTA source code disk contains the complete source code to all of the programs on the other disks. The programs were compiled with Borland's Turbo 'C++', using Borland's MAKE utility. The graphics programs (PLFASTA, TGREASE) use the graphics device drivers supplied with the Turbo 'C' V2.0 package. Also included are the documentation files PROGRAMS.DOC and FORMAT.DOC. You do not need any of the files on the source code disk to run the programs. The files on this disk are identical to the UNIX and VMS versions that run on larger machines. Also included is the code to compile ALIGN0.EXE. ALIGN0 is the same as ALIGN, but does not penalize for end-gaps.

If you have the DOS or Macintosh version of the FASTA package, to install the programs you should:

  1. Make a new directory (folder) for the FASTA programs. This need not be the same as the directory for your sequence databases.
  2. Copy the files from the FASTA source disk to the new directory.
  3. (DOS only) Edit your AUTOEXEC.BAT file to (a) modify your PATH command to include the FASTA directory and (b) add the line:
               set FASTLIBS=c:\yourfastadirectory\fastgbs
          
    On the Macintosh, you may need to edit the "environment" file and change the line that reads:
               FASTLIBS=fastgbs
          
    to indicate the full directory path for the fastgbs file, for example:
               FASTLIBS=Q105:FASTA:fastgbs
    
  4. Finally, you will need to edit the fastgbs file. This is usually the most confusing part of the installation. An example of this file is shown below; to customize this file for your machine, you will need to change the file names from those provided in the fastgbs file to ones that reflect the directory names and file names you use on your machine. This is explained in more detail below. In addition, some entries in the fastgbs file refer to other files of file names. These files of file names (as opposed to actual database files) may also need to be edited.

2.2. Installing the libraries

2.2.1. The NBRF protein sequence library

The FASTA program package does not include any protein or DNA sequence libraries. You can obtain the PIR protein sequence database from:

National Biomedical Research Foundation
Georgetown University Medical Center
3900 Reservoir Rd, N.W.
Washington, D.C. 20007
In addition, this database is available via anonymous ftp from the host "ftp.bchs.uh.edu". It is available in two formats, VMS and CODATA format. The "VMS" format (library type 5 below) can be searched much faster, can be easily reformatted for use by the "BLAST" rapid searching program, and is compatible with the Genetics Computer Group package of programs. The CODATA format is used by the EUGENE/MBIR computing package from Baylor (library type 2).

2.2.2. The GENBANK DNA sequence library

FASTA and TFASTA search sequences from the GENBANK "flatfile" (not ASN.1) DNA sequence library in the flat-file format distributed by the National Center for Biotechnology Information and in the PIR format used by EBI/EMBL. CD-ROMs can be obtained from:

Genbank
National Center for Biotechnology Information
National Library of Medicine
National Institutes of Health
8600 Rockville Pike
Bethesda, MD 20894
The GenBank DNA sequence library is also available via anonymous FTP from ncbi.nlm.nih.gov.

2.2.3. The EBI/EMBL CD-ROM libraries

The European Bioinformatics Institute (EBI) is now distributing the EMBL CD-ROM that contains both the complete EMBL DNA sequence database (which should be essentially identical to the GenBank DNA sequence database) and the SWISS-PROT protein sequence database. SWISS-PROT is derived from the NBRF Protein sequence database with additions from the EBI/EMBL DNA sequence database. This CD-ROM is a "best-buy," since it provides both DNA and protein sequence libraries. It is available from:

European Bioinformatics Institute
Hinxton Genome Campus, Hinxton Hall
Hinxton, Cambridge CB10 1RQ,
United Kingdom
Tel: +44 1223 4944
Fax: +44 1223 494468
Email: DATALIB@ebi.ac.uk
In addition, the SWISS-PROT protein sequence database is available via anonymous FTP from ncbi.nlm.nih.gov.

2.3. Finding the libraries: FASTLIBS

FASTA and TFASTA use the environment variable FASTLIBS to find the protein and DNA sequence libraries. The FASTLIBS variable contains the name of a file that has the actual filenames of the libraries. The FASTGBS file is an example of a file that can be referred to by FASTLIBS. To use the FASTGBS file, type:

    setenv FASTLIBS /usr/lib/fasta/fastgbs (BSD UNIX/csh)
    or
    export FASTLIBS=/usr/lib/fasta/fastgbs (SysV UNIX/ksh)

Then edit the FASTGBS file to indicate where the protein and DNA sequence libraries can be found. If you have a hard disk and your protein sequence library is kept in the file /usr/lib/seq/aabank.lib and your Genbank DNA sequence library is kept in the directory /usr/lib/genbank, then fastgbs might contain:

    NBRF Protein$0P/usr/lib/seq/aabank.lib 0
    SWISS PROT 10$0S/usr/lib/vmspir/swiss.seq 5
    GB Primate$1P@/usr/lib/genbank/gpri.nam
    GB Rodent$1R@/usr/lib/genbank/grod.nam
    GB Mammal$1M@/usr/lib/genbank/gmammal.nam

The first line of this file says that a copy of the NBRF protein sequence database (which is a protein database) is kept in the file /usr/lib/seq/aabank.lib and can be selected by typing "P" on the command line or when the database menu is presented.

Note that there are 4 or 5 fields in the lines in fastgbs. The first field is the description of the library which will be displayed by FASTA; it ends with a '$'. The second field (1 character), is a 0 if the library is a protein library and 1 if it is a DNA library. The third field (1 character) is the character to be typed to select the library.

The fourth field is the name of the library file. In the example above, the /usr/lib/seq/aabank.lib file contains the entire protein sequence library. However, the DNA library file names are preceded by a '@', because these files (gpri.nam, grod.nam, gmammal.nam) do not contain the sequences; instead they contain the names of the files which contain the sequences. This is done because the GENBANK DNA database is broken down into a large number of smaller files. In order to search the entire primate database, you must search more than a dozen files.

In addition, an optional fifth field can be used to specify the format of the library file. Alternatively, you can specify the library format in a file of file names (a file preceded by an '@'). This field must be separated from the file name by a space character (' '). In the example above, the aabank.lib file is in Pearson/FASTA format and the swiss.seq file is in PIR/VMS format (from the EMBL CD-ROM), while the DNA sequences are in compressed GenBank format. No file type number is included for the Genbank files, because it is included in the file of filenames (see below). Currently, FASTA can read the following formats:

    0 Pearson/FASTA (>SEQID - comment/sequence)
    1 Uncompressed Genbank (LOCUS/DEFINITION/ORIGIN)
    2 NBRF CODATA (ENTRY/SEQUENCE)
    3 EMBL/SWISS-PROT (ID/DE/SQ)
    4 Intelligenetics (;comment/SEQID/sequence)
    5 NBRF/PIR VMS (>P1;SEQID/comment/sequence)
    11 NCBI Blast1.3.2 format  (unix only)

In particular, this version will work with the EMBL and PIR VMS formats that are distributed on the EMBL CD-ROM. The latter format (PIR VMS) is much faster to search than EMBL format. This release also works with the protein and DNA database formats created for the BLASTP and BLASTN programs by SETDB and PRESSDB and with the new NCBI search format. If a library format is not specified, for example, because you are just comparing two sequences, Pearson/FASTA (format 0) is used by default. To change this default, you may set the LIBTYPE environment variable to a number. For example,

    setenv LIBTYPE 1
would cause the program to use the GenBank LOCUS format by default for libraries (or the second sequence file), but the Pearson/FASTA format would still be used for the query sequence.
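The field layout of a fastgbs line described above (description ending in '$', one type character, one selection character, the file name with an optional '@', and an optional format number) can be sketched as a small Python parser. This is an illustration only; the names LibraryEntry and parse_fastgbs_line are hypothetical and the package's own parsing may differ in its details.

```python
from typing import NamedTuple, Optional

class LibraryEntry(NamedTuple):
    description: str           # text shown in the library menu
    is_dna: bool               # second field: '0' protein, '1' DNA
    letter: str                # third field: character typed to select it
    path: str                  # fourth field, without any leading '@'
    is_name_list: bool         # True if the path is a file of file names
    lib_format: Optional[int]  # fifth field, or None when not given

def parse_fastgbs_line(line):
    """Parse one fastgbs line into its four or five fields."""
    description, rest = line.rstrip("\n").split("$", 1)
    type_flag, letter, target = rest[0], rest[1], rest[2:]
    parts = target.split()
    path = parts[0]
    lib_format = int(parts[1]) if len(parts) > 1 else None
    is_name_list = path.startswith("@")
    if is_name_list:
        path = path[1:]
    return LibraryEntry(description.strip(), type_flag == "1", letter,
                        path, is_name_list, lib_format)
```

A lib_format of None corresponds to the cases described above: either the format is given inside the file of file names, or the default (format 0, or the LIBTYPE setting) applies.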

You can specify a group of library files by putting a '@' symbol before a file that contains a list of file names to be searched. For example, if @gmam.nam is in the fastgbs file, the file "gmam.nam" might contain the lines:

    </usr/lib/genbank
    gbpri.seq 1
    gbrod.seq 1
    gbmam.seq 1
In this case, the line beginning with a '<' indicates the directory the files will be found in. The remaining lines name the actual sequence files. So the first sequence file to be searched would be:

    /usr/lib/genbank/gbpri.seq

The notation "/PIRNAQ:" might be used under the VAX/VMS operating system. Under UNIX, the trailing '/' is left off, so the library directory might be written as "/usr/seqlib".

With version 1.4 of the FASTA package, the FASTA and TFASTA programs can search a library composed of different files in different sequence formats. For example, you may wish to search the Genbank files (in GenBank flat file format) and the EMBL DNA sequence database on CD-ROM. To do this, you simply list the names and filetypes of the files to be searched in a file of filenames. For example, to search the mammalian portion of Genbank, the unannotated portion of Genbank, and the unannotated portion of the EMBL library, you could use the file:

  
    </usr/lib/DNA
    gbpri.seq 1
    #  (this '#' causes the program to display the size of the library)
    gbrod.seq 1
    ...
    gbmam.seq 1
    ...
    gbuna.seq 1
    ...
    unanno.seq 5
    #
You do not need to include library format numbers if you only use the Pearson/FASTA version of the PIR protein sequence library. If no library type is specified, the program assumes that type 0 is being used (unless you have set LIBTYPE).
Support for the old compressed GenBank files, which have not been supported for more than three years, is being removed from programs in the FASTA package.
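The file of file names convention described above (a directory line beginning with '<', then one file name per line with an optional format number) can be sketched in Python as follows. The function name expand_name_file is hypothetical; joining with '/' assumes a UNIX-style directory written without a trailing '/', and lines beginning with '#' are simply skipped here rather than used to report the library size.

```python
def expand_name_file(text, default_format=0):
    """Turn the contents of a file of file names into (path, format) pairs."""
    directory = ""
    files = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line, or a '#' size-display marker (ignored here)
        if line.startswith("<"):
            directory = line[1:]  # directory for the file names that follow
            continue
        parts = line.split()
        fmt = int(parts[1]) if len(parts) > 1 else default_format
        path = directory + "/" + parts[0] if directory else parts[0]
        files.append((path, fmt))
    return files
```

Keeping the format number with each path is what allows one list to mix GenBank flat files (type 1) with files in other formats, as in the example above.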

Test the setup by running FASTA. Enter the sequence file 'MUSPLFM.AA' when the program requests it (this file is included with the programs). The program should then ask you to select a protein sequence library. Alternatively, if you run the TFASTA program and use the MUSPLFM.AA query sequence, the program should show you a selection of DNA sequence libraries. Once the fastgbs file has been set up correctly, you can set FASTLIBS=fastgbs in your AUTOEXEC.BAT file, and you will not need to remember where the libraries are kept or how they are named.

FASTA and TFASTA must open a large number of files when searching and reporting the results of a GENBANK floppy disk format library search. You may have problems with the large number of files under DOS on IBM-PC's (Unix and VMS users will not have these problems). If you are going to search the GENBANK floppy disk format DNA sequence library under DOS, you should add the line:

    FILES=16
to your CONFIG.SYS file. (Typically this is already done for programs like Windows or WordPerfect.)


3. FASTA Revision History

3.1. Changes with version 2.0x

Version 2.0x provides several major improvements over previous versions of FASTA (and SSEARCH). The most important is the incorporation of explicit statistical estimates and appropriate normalization of similarity scores. This improvement is discussed in more detail in section 1.4, Statistical Significance. In addition, all of the protein comparison programs now use the BLOSUM50 matrix, with gap penalties of -12, -2, by default. BLOSUM50 performs significantly better than the older PAM250 matrix. PAM250 can still be used with the command line option: -s 250. (DNA sequence comparisons use a more stringent gap penalty of -16, -4, which produces excellent statistical estimates when optimized scores are used. TFASTA uses -16, -4 as well.)

Finally, the algorithm used to produce the final alignment is now a full Smith-Waterman, with unlimited gaps. Both the optimized and Smith-Waterman scores are reported; if the Smith-Waterman score is higher, then additional gaps allowed a better alignment and similarity score to be calculated. As a result of this change, the program takes much longer to calculate alignments, particularly for long DNA sequences. However, DNA alignments tend to benefit substantially, because they often have large gaps, such as short introns, that can be spanned with the rigorous Smith-Waterman alignment.

FASTA searches should use the "-o" option whenever possible (this slows searches about 2-fold (worst case) for ktup=2). The "-o" (optimize) option significantly improves the sensitivity of FASTA, so that it almost matches Smith-Waterman. With version 2.0, the default band width used for optimized calculations can be varied with the "-y" option. For proteins with ktup=2, a width of 16 (-y 16) is used; 16 is also used for DNA sequences. For proteins and ktup=1, a width of 32 is used. Searches without the "-o" option will work fine for sequences that share 25% or more identity in general, but to detect evolutionary relationships with 20% - 25% identity, the more sensitive "-o" option is often required. The "-o" option is required for accurate statistical estimates with either protein or DNA sequences.

With explicit expectation calculations, the program now shows all scores and alignments with expectations less than 10.0 (with optimized scores, 2.0 without optimization) when the "-Q" (quiet) mode is used.

3.2. Changes with version 1.7

Version 1.7 has been released to provide the PRDF and PRSS programs for shuffling sequences and estimating accurately the probabilities of the unshuffled-sequence scores.

3.3. Changes with version 1.6

FASTA version 1.6 uses a new method for calculating optimal scores in a band (the optimization or last step in the FASTA algorithm). In addition, it uses a linear-space method for calculating the actual alignments. FASTA v1.6 package includes several new programs:

The LALIGN/PLALIGN programs incorporate the "sim" algorithm described by Huang and Miller (1991) Adv. Appl. Math. 12:337-357. The SSEARCH and PRSS programs incorporate algorithms described by Huang, Hardison, and Miller (1990) CABIOS 6:373-381.

LFASTA and PLFASTA now calculate a different number of local similarities; they now behave more like LALIGN/PLALIGN. Since local alignments of identical sequences produce "mirror-image" alignments, lalign and lfasta consider only one-half of the potential alignments between sequences from identical file names. Thus

    lfasta mchu.aa mchu.aa
displays only two alignments; with earlier versions of the program, it would have displayed five, including the identity alignment. PLFASTA does display five alignments; when two identical filenames are given, it draws the identity alignment, calculates the two unique local alignments, draws them, and draws their mirror images. LFASTA/PLFASTA and LALIGN/PLALIGN use the filenames, rather than the actual sequences, to determine whether sequences are identical; you can "trick" the programs into behaving the old way by putting the same sequence in two different files.

3.4. Changes with version 1.5

FASTA version 1.5 includes a number of substantial revisions to improve the performance and sensitivity of the program. It is now possible to tell the program to optimize all of the initn scores greater than a threshold. The threshold is set at the same value as the old FASTA cutoff score. Alternatively, you can tell FASTA to sort the results by the init1, rather than the initn, score by using the -1 option. FASTA -1 ... will report the results the way the older FASTP program did.

A new method has been provided for selecting libraries. In the past, one could enter the name of a sequence file to be searched or a single letter that would specify a library from the list included in the $FASTLIBS file. Now, you can specify a set of library files with a string of letters preceded by a '%'. Thus, if the FASTLIBS file has the lines:


    Genbank 70 primates$1P/seqlib/gbpri.seq 1
    Genbank 70 rodents$1R/seqlib/gbrod.seq 1
    Genbank 70 other mammals$1M/seqlib/gbmam.seq 1
    Genbank 70 vertebrates $1B/seqlib/gbvrt.seq 1
Then the string: "%PRMB" would tell FASTA to search the four libraries listed above. The %PRMB string can be entered either on the command line or when the program asks for a filename or library letter.
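The selection rule above can be illustrated with a short Python sketch that maps a '%' string onto the files named in a FASTLIBS-style listing. The function names are hypothetical, and the sketch assumes that any other single character found in the listing is a library letter and that everything else is a file name; it ignores the protein/DNA flag.

```python
def library_letters(fastlibs_text):
    """Map each selection letter in a FASTLIBS-style listing to its file name."""
    by_letter = {}
    for line in fastlibs_text.splitlines():
        if "$" not in line:
            continue
        rest = line.split("$", 1)[1]      # type flag, letter, file name ...
        by_letter[rest[1]] = rest[2:].split()[0]
    return by_letter

def expand_selection(selection, by_letter):
    """Return the list of files chosen by '%PRMB', a single letter, or a name."""
    if selection.startswith("%"):
        return [by_letter[c] for c in selection[1:]]
    if len(selection) == 1 and selection in by_letter:
        return [by_letter[selection]]
    return [selection]                     # assumed to be a sequence file name
```

With the four Genbank lines above, "%PRMB" expands to the primate, rodent, other mammal, and vertebrate files, in that order.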

FASTA1.5 also provides additional flexibility for specifying the number of results and alignments to be displayed with the -Q (quiet) option. The -b number option allows you to specify the number of sequence scores to show when the search is finished. Thus

    FASTA -b 100 ...
tells the program to display the top 100 sequence scores. In the past, if you displayed 100 scores (in -Q mode), you would also have to store 100 alignments. The -d option allows you to limit the number of alignments shown. FASTA -b 100 -d 20 would show 100 scores and 20 alignments.

The old CUTOFF parameter is no longer used. The program stores the best 2000 (IBM-PC, MAC) or 6000 (Unix, VMS) scores and then throws out the lowest 25%, stores the next 500 (1500) scores that are better than the threshold determined when the first scores were discarded, and repeats the process as the library is scanned. As a result, the best 1500 - 2000 (4500 - 6000) scores are saved. The old cut-off parameter was also used to set the joining threshold for the calculation of the initn score from initial regions. This joining threshold can now be set with the -k option or with the GAPCUT parameter.
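The score-saving scheme above can be sketched in Python as a bounded buffer that is pruned whenever it fills. This is a reading of the description, not the program's code: the function name keep_best_scores is hypothetical, and using the highest discarded score as the new threshold is an assumption.

```python
def keep_best_scores(scores, capacity=2000, drop_fraction=0.25):
    """Scan scores once, keeping roughly the best `capacity` of them.

    When the buffer fills, the lowest `drop_fraction` of it is discarded
    and only scores above the highest discarded score are stored afterwards.
    """
    n_keep = capacity - int(capacity * drop_fraction)
    kept = []
    threshold = float("-inf")
    for s in scores:
        if s <= threshold:
            continue                    # not better than the current threshold
        kept.append(s)
        if len(kept) == capacity:
            kept.sort(reverse=True)
            threshold = kept[n_keep]    # highest score about to be discarded
            del kept[n_keep:]
    return sorted(kept, reverse=True)
```

With distinct scores, anything discarded or skipped has at least 1500 (for a capacity of 2000) better scores already stored, so the top 75% of the buffer size is always retained, which matches the "best 1500 - 2000" range quoted above.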

Finally, FASTA can provide a complete list of all of the sequences and scores calculated to a file with the -r (results) option. FASTA -r results.out ... creates a file with a list of scores for every sequence in the library. The list is not sorted, and only includes those scores calculated during the initial scan of the library (the optimized score is not calculated unless the -o option is used).


As always, please inform me of bugs as soon as possible.

William R. Pearson
Department of Biochemistry
Box 440, Jordan Hall
U. of Virginia
Charlottesville, VA

wrp@virginia.EDU