
FASTA Format

FASTA format is a compact and simple way of storing DNA and protein sequences as text files. It can be read by virtually all molecular biology programs, including (finally!) GCG version 9.


A sequence in FASTA format begins with a single-line description (or header), followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column. It is recommended that all lines of text be shorter than 80 characters in length.
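As a minimal sketch of the layout just described, the Python function below writes one record: a single description line starting with ">", then the sequence broken into short lines. The function name format_fasta and the default width of 60 characters are assumptions made for illustration; any width under 80 satisfies the recommendation.

```python
def format_fasta(description, sequence, width=60):
    """Return one FASTA record as text: a '>' description line, then wrapped sequence lines."""
    if "\n" in description or "\r" in description:
        raise ValueError("the description must fit on a single line")
    if not 0 < width < 80:
        raise ValueError("width should be between 1 and 79 characters")
    # Cut the sequence into fixed-width pieces so every data line stays under 80 characters.
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([">" + description] + lines) + "\n"


if __name__ == "__main__":
    # Hypothetical record, used only to show the call.
    print(format_fasta("example_1 made-up test sequence", "ACGT" * 40), end="")
```

The description line itself is not wrapped, so a very long description can still exceed 80 characters.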

An example sequence in FASTA format is:

>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC
HFPSNWKGAWKEVKEEIVNLPKERYRGTNDPKRIFFQRQWGDPETANLWFNCHGEFFYCK
MDWFLNYLNNLTVDADHNECKNTSGTKSGNKRAPGPCVQRTYVACHIRSVIIWLETISKK
TYAPPREGHLECTSTVTGMTVELNYIPKNRTNVTLSPQIESIWAAELDRYKLVEITPIGF
APTEVRRYTGGHERQKRVPFVXXXXXXXXXXXXXXXXXXXXXXVQSQHLLAGILQQQKNL
LAAVEAQQQMLKLTIWGVK
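
A file laid out like the example above can be read back with a short parser. The following Python sketch is one possible approach, not the method used by any particular program: the name parse_fasta is hypothetical, and the handling of several records in one file and of blank lines are assumptions that go beyond the single-sequence description given here.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (description, sequence) tuples."""
    records = []
    description = None
    chunks = []
    for raw_line in text.splitlines():
        line = raw_line.rstrip()
        if not line.strip():
            continue  # assumption: blank lines carry no information
        if line.startswith(">"):
            # A '>' in the first column starts a new record; store the previous one.
            if description is not None:
                records.append((description, "".join(chunks)))
            description = line[1:].strip()
            chunks = []
        elif description is None:
            raise ValueError("sequence data found before any '>' description line")
        else:
            chunks.append(line.strip())
    if description is not None:
        records.append((description, "".join(chunks)))
    return records


if __name__ == "__main__":
    # Hypothetical two-record input, used only to show the call.
    sample = ">first made-up record\nACGTACGT\nacgt\n>second\nMKV\n"
    for desc, seq in parse_fasta(sample):
        print(desc, len(seq))
```

The sequence lines of each record are joined into one string, so the line length used in the file does not affect the result.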

Sequences are expected to be represented in the standard IUB/IUPAC amino acid and nucleic acid codes; lower-case letters are equivalent to upper-case. Some, but not all, programs that accept "FASTA format" recognize a hyphen or dash (-) to represent a gap of indeterminate length and an asterisk (*) to represent an unknown or ambiguous character (many programs instead read an asterisk in a protein sequence as a translation stop).
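
Because programs differ in which characters they accept, it can help to check a sequence before passing it on. The Python sketch below folds a sequence to upper case and lists any characters outside a chosen alphabet. The letter sets are one reading of the IUB/IUPAC codes and should be checked against the program in use; the names NUCLEIC_CODES, AMINO_CODES, and check_sequence are hypothetical.

```python
# Assumed IUB/IUPAC single-letter codes; verify against the target program.
NUCLEIC_CODES = set("ACGTURYKMSWBDHVN")
AMINO_CODES = set("ABCDEFGHIKLMNPQRSTUVWYZX")


def check_sequence(sequence, alphabet, allow_extras=True):
    """Return (upper-cased sequence, sorted list of characters not in the alphabet)."""
    allowed = set(alphabet)
    if allow_extras:
        # '-' and '*' are accepted by some programs only, so they are optional here.
        allowed |= {"-", "*"}
    upper = sequence.upper()  # lower-case letters are equivalent to upper-case
    invalid = sorted(set(upper) - allowed)
    return upper, invalid


if __name__ == "__main__":
    # Made-up input, used only to show the call.
    print(check_sequence("acgtn-", NUCLEIC_CODES))
```

Setting allow_extras to False gives the stricter check suited to programs that reject the hyphen and asterisk.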



Using Computers for Molecular Biology
Stuart M. Brown, Ph.D, RCR, NYU Medical Center
Comments to: browns02@mcrcr.med.nyu.edu