SEQUENCE INPUT: all sequences must be in 1 file, one after another. 6 formats are automatically recognised: NBRF/PIR, EMBL/SWISSPROT, Pearson (Fasta), Clustal (*.aln), GCG/MSF (Pileup) and GDE flat file. All non-alphabetic characters (spaces, digits, punctuation marks) are ignored except "-" which is used to indicate a GAP ("." in GCG/MSF).
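As a rough illustration of the character filtering described above, the following Python sketch keeps letters and the gap character and drops everything else. The function name clean_sequence and its gap_char argument are hypothetical and are not part of the program.

```python
def clean_sequence(raw, gap_char="-"):
    """Keep alphabetic characters and the gap character; drop the rest.

    gap_char is "-" for most formats; pass "." for GCG/MSF input.
    """
    return "".join(ch for ch in raw if ch.isalpha() or ch == gap_char)


print(clean_sequence("VLSP 10 ADK-T, NV"))  # prints VLSPADK-TNV
```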
To do a MULTIPLE ALIGNMENT on a set of sequences, use item 1 from this menu to INPUT them; go to menu item 2 to do the multiple alignment.
PROFILE ALIGNMENTS (menu item 3) are used to align 2 alignments. Use this to add a new sequence to an old alignment, or to use secondary structure to guide the alignment process. GAPS in the old alignments are indicated using the "-" character. PROFILES can be input in ANY of the allowed formats; just use "-" (or "." for MSF) for each gap position.
PHYLOGENETIC TREES (menu item 4) can be calculated from old alignments (read in with "-" characters to indicate gaps) OR after a multiple alignment while the alignment is still in memory.
The program tries to automatically recognise the different file formats used and to guess whether the sequences are amino acid or nucleotide. This is not always foolproof.
FASTA and NBRF/PIR formats are recognised by having a ">" as the first character in the file.
EMBL/Swiss Prot formats are recognised by the letters ID at the start of the file (the token for the entry name field).
CLUSTAL format is recognised by the word CLUSTAL at the beginning of the file.
GCG/MSF format is recognised by the word PileUp at the start of the file. If your msf files do not contain this word first, edit it in at the start of the first line.
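The recognition rules above could be sketched in Python roughly as follows. This is only an illustration of the stated rules, not the program's own code: the function name guess_format is hypothetical, the matching is assumed to be case sensitive, and no rule is given here for GDE flat files, so they fall through to "unknown".

```python
def guess_format(text):
    """Guess the input format from the first characters of the file text."""
    if text.startswith(">"):
        # FASTA and NBRF/PIR both start with ">"; telling them apart
        # needs a closer look at the header line (not attempted here).
        return "FASTA or NBRF/PIR"
    if text.startswith("ID"):
        return "EMBL/SWISS-PROT"
    if text.startswith("CLUSTAL"):
        return "CLUSTAL"
    if text.startswith("PileUp"):
        return "GCG/MSF"
    return "unknown"
```

For example, guess_format("PileUp\n\n MSF: 60 ...") returns "GCG/MSF", which is why an MSF file lacking the word PileUp has to be edited before it can be read.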
If 85% or more of the characters in the sequence are from A,C,G,T,U or N, the sequence will be assumed to be nucleotide. This works in 97.3% of cases but watch out!
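A minimal Python sketch of the 85% rule is shown below. The function name looks_like_nucleotide is hypothetical, and leaving gap characters out of the count is an assumption; the text above does not say how gaps are treated.

```python
def looks_like_nucleotide(seq, threshold=0.85):
    """Return True if at least `threshold` of the letters are A, C, G, T, U or N."""
    # Assumption: only alphabetic characters are counted, so gaps are ignored.
    residues = [ch.upper() for ch in seq if ch.isalpha()]
    if not residues:
        return False
    hits = sum(1 for ch in residues if ch in "ACGTUN")
    return hits / len(residues) >= threshold
```

A short peptide made up mostly of A, C, G, T and N residues would be misclassified by this rule, which is the kind of case to watch out for.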
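Multiple alignments are carried out in 3 stages (automatically done from menu item 1 ...Do complete multiple alignments now): pairwise alignment of all the sequences, construction of a dendrogram (guide tree) from the pairwise scores, and the final multiple alignment, which follows the branching order of the dendrogram.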
MULTIPLE ALIGNMENT parameters control the gaps in the final multiple alignments.
RESET GAPS (menu item 6) will remove any new gaps introduced into the sequences during multiple alignment if you wish to change the parameters and try again. This only takes effect just before you do a second multiple alignment. You can make phylogenetic trees after alignment whether or not this is ON. If you turn this OFF, the new gaps are kept even if you do a second multiple alignment. This allows you to iterate the alignment gradually. Sometimes, the alignment is improved by a second or third pass.
SCREEN DISPLAY can be used to send the output alignments to the screen as well as to the output file.
You can skip the first stages (pairwise alignments; dendrogram) by using an old dendrogram file (menu item 3); or you can just produce the dendrogram with no final multiple alignment (menu item 2).
OUTPUT FORMAT: Menu item 8 (format options) allows you to choose from 5 different alignment formats (CLUSTAL, GCG, NBRF/PIR, PHYLIP and GDE).
You can choose between the 2 alignment methods using menu option 8. The slow/accurate method is fine for short sequences but will be VERY SLOW for many (e.g. >20) long (e.g. >1000 residue) sequences.
SLOW/ACCURATE alignment parameters: These parameters do not have any effect on the speed of the alignments. They are used to give initial alignments which are then rescored to give percent identity scores. These % scores are the ones which are displayed on the screen. The scores are converted to distances for the trees.
GAP PENALTY: This is a penalty for each gap in the fast alignments. It has little effect on the speed or sensitivity except for extreme values.
TOP DIAGONALS: The number of k-tuple matches on each diagonal (in an imaginary dot-matrix plot) is calculated. Only the best ones (with most matches) are used in the alignment. This parameter specifies how many. Decrease for speed; increase for sensitivity.
WINDOW SIZE: This is the number of diagonals around each of the 'best' diagonals that will be used. Decrease for speed; increase for sensitivity.
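The idea behind TOP DIAGONALS and WINDOW SIZE can be sketched in Python as below. This is a simplified illustration, not the program's algorithm: the function names are hypothetical, a diagonal is identified by the offset i - j of a matching k-tuple, and the window is assumed to extend the same number of diagonals on either side of each best diagonal.

```python
from collections import Counter, defaultdict


def best_diagonals(seq_a, seq_b, ktup=2, top_diags=3):
    """Count k-tuple matches on each diagonal and return the best ones."""
    # Index every k-tuple in seq_b by its start positions.
    positions = defaultdict(list)
    for j in range(len(seq_b) - ktup + 1):
        positions[seq_b[j:j + ktup]].append(j)

    # Each shared k-tuple adds one match to the diagonal i - j.
    counts = Counter()
    for i in range(len(seq_a) - ktup + 1):
        for j in positions.get(seq_a[i:i + ktup], ()):
            counts[i - j] += 1

    return [diag for diag, _ in counts.most_common(top_diags)]


def diagonals_in_window(best, window=1):
    """Expand each best diagonal by `window` diagonals on either side."""
    return sorted({d + w for d in best for w in range(-window, window + 1)})
```

Lower values of top_diags and window leave fewer diagonals to examine, which is why decreasing them trades sensitivity for speed.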
Each step in the final multiple alignment consists of aligning two alignments or sequences. This is done progressively, following the branching order in the GUIDE TREE. The basic parameters to control this are two gap penalties and the scores for various identical/non-identical residues.
CLUSTAL format output is a self-explanatory alignment format. It shows the sequences aligned in blocks. It can be read in again at a later date to (for example) calculate a phylogenetic tree or add a new sequence with a profile alignment.
GCG output can be used by any of the GCG programs that can work on multiple alignments (e.g. PRETTY, PROFILEMAKE, PLOTALIGN). It is the same as the GCG .msf format files (multiple sequence file); new in version 7 of GCG.
PHYLIP format output can be used for input to the PHYLIP package of Joe Felsenstein. This is an extremely widely used package for doing every imaginable form of phylogenetic analysis (MUCH more than the modest introduction offered by this program).
NBRF/PIR: this is the same as the standard PIR format with ONE ADDITION. Gap characters "-" are used to indicate the positions of gaps in the multiple alignment. These files can be re-used as input in any part of clustal that allows sequences (or alignments or profiles) to be read in.
GDE: this format is used by the GDE package of Steven Smith.
OUTPUT ORDER is used to control the order of the sequences in the output alignments. By default, it is the same as the input order. This switch can be used to make the order correspond to the order in which the sequences were aligned (from the guide tree/dendrogram), thus automatically grouping closely related sequences.
The profiles can be in any of the allowed input formats with "-" characters used to specify gaps (except for GCG/MSF where "." is used).
You have to specify the 2 profiles by choosing menu items 1 and 2 and giving 2 file names. Then Menu item 3 will align the 2 profiles to each other. Secondary structure masks in either profile can be used to guide the alignment.
Menu item 4 will take the sequences in the second profile and align them to the first profile, 1 at a time. This is useful to add some new sequences to an existing alignment, or to align a set of sequences to a known structure. In this case, the second profile need not be pre-aligned.
The alignment parameters can be set using menu items 5, 6 and 7. These are EXACTLY the same parameters as used by the general, automatic multiple alignment procedure. The general multiple alignment procedure is simply a series of profile alignments. Carrying out a series of profile alignments on larger and larger groups of sequences, allows you to manually build up a complete alignment, if necessary editing intermediate alignments.
SECONDARY STRUCTURE OPTIONS. Menu Option 0 allows you to set secondary structure parameters. If a solved structure is available, it can be used to guide the alignment by raising gap penalties within secondary structure elements, so that gaps will preferentially be inserted into unstructured surface loop regions. Alternatively, a user-specified gap penalty mask can be supplied for a similar purpose.
A gap penalty mask is a series of numbers between 1 and 9, one per position in the alignment. Each number specifies how much the gap opening penalty is to be raised at that position: the basic gap opening penalty is multiplied by the number. For example, a mask figure of 1 at a position means no change in gap opening penalty; a figure of 4 means that the gap opening penalty is four times greater at that position, making gaps 4 times harder to open.
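The multiplication can be sketched in Python as follows. The function name apply_gap_penalty_mask is hypothetical, and the base penalty of 10.0 used in the usage note is only an illustrative value.

```python
def apply_gap_penalty_mask(base_gap_open, mask):
    """Return the per-position gap opening penalty for a mask string.

    Each mask character is a digit from 1 to 9 that multiplies the basic
    gap opening penalty at that alignment position.
    """
    penalties = []
    for ch in mask:
        if ch not in "123456789":
            raise ValueError(f"invalid mask character: {ch!r}")
        penalties.append(base_gap_open * int(ch))
    return penalties
```

With a base penalty of 10.0, the mask "1124" gives [10.0, 10.0, 20.0, 40.0].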
The format for gap penalty masks and secondary structure masks is explained in the help under option 0 (secondary structure options).
Options 1 and 2 control whether the input secondary structure information or gap penalty masks will be used.
Option 3 controls whether the secondary structure and gap penalty masks should be included in the output alignment.
Options 4 and 5 provide the value for raising the gap penalty at core Alpha Helical (A) and Beta Strand (B) residues. In CLUSTAL format, capital residues denote the A and B core structure notation. Basic gap penalties are multiplied by the amount specified.
Option 6 provides the value for the gap penalty in Loops. By default this penalty is not raised. In CLUSTAL format, loops are specified by "." in the secondary structure notation.
Option 7 provides the value for setting the gap penalty at the ends of secondary structures. Ends of secondary structures are observed to grow and/or shrink in related structures. Therefore by default these are given intermediate values, lower than the core penalties. All secondary structure read in as lower case in CLUSTAL format gets the reduced terminal penalty.
Options 8 and 9 specify the range of structure termini for the intermediate penalties. In the alignment output, these are indicated as lower case. For Alpha Helices, by default, the range spans the end helical turn. For Beta Strands, the default range spans the end residue and the adjacent loop residue, since sequence conservation often extends beyond the actual H-bonded Beta Strand.
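To make the relationship between structure notation and penalties concrete, here is a small Python sketch that turns a structure string into a string of multipliers. The function name structure_to_mask is hypothetical, and the multiplier values (4 for core, 2 for termini, 1 for loops) are illustrative assumptions, not the program's defaults.

```python
def structure_to_mask(structure, helix_core=4, strand_core=4,
                      terminal=2, loop=1):
    """Map CLUSTAL-style structure notation to gap penalty multipliers.

    "A"/"B" are core helix/strand residues, lower case letters are the
    structure termini, and "." is a loop.
    """
    table = {
        "A": helix_core,
        "B": strand_core,
        "a": terminal,
        "b": terminal,
        ".": loop,
    }
    return "".join(str(table[ch]) for ch in structure)


print(structure_to_mask("..aaaAAAAaaa."))  # prints 1122244442221
```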
CLUSTAL W can read the masks from SWISS-PROT, CLUSTAL or GDE format input files. For many 3-D protein structures, secondary structure information is recorded in the feature tables of SWISS-PROT database entries. You should always check that the assignments are correct - some are quite inaccurate. CLUSTAL W looks for SWISS-PROT HELIX and STRAND assignments e.g.
    FT   HELIX       100    115
    FT   STRAND      118    119
The structure and penalty masks can also be read from CLUSTAL alignment format as comment lines beginning "!SS_" or "!GM_" e.g.
    !SS_HBA_HUMA    ..aaaAAAAAAAAAAaaa.aaaAAAAAAAAAAaaaaaaAaaa.........aaaAAAAAA
    !GM_HBA_HUMA    112224444444444222122244444444442222224222111111111222444444
    HBA_HUMA        VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK
Note that the mask itself is a set of numbers between 1 and 9 each of which is assigned to the residue(s) in the same column below.
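A minimal Python sketch of how such comment lines might be collected from a CLUSTAL-format file is given below. The function name read_masks is hypothetical, and joining the masks from successive alignment blocks into one string per sequence is an assumption.

```python
def read_masks(lines):
    """Collect !SS_ and !GM_ comment lines from CLUSTAL-format text lines.

    Returns two dicts keyed by sequence name: structure masks and gap
    penalty masks.
    """
    structure, penalty = {}, {}
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue
        tag, mask = fields
        name = tag[4:]
        if tag.startswith("!SS_"):
            # Assumption: masks from later blocks are appended in order.
            structure[name] = structure.get(name, "") + mask
        elif tag.startswith("!GM_"):
            penalty[name] = penalty.get(name, "") + mask
    return structure, penalty
```

Ordinary sequence lines such as the HBA_HUMA line are skipped because their first field does not start with "!SS_" or "!GM_".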
In GDE flat file format, the masks are specified as text and the names must begin with SS_ or GM_.
Either a structure or penalty mask or both may be used. If both are included in an alignment, the user will be asked which is to be used.
For VERY divergent sequences, the distances cannot be reliably corrected. You will be warned if this happens. Even if none of the distances in a data set exceed the reliable threshold, if you bootstrap the data, some of the bootstrap distances may randomly exceed the safe limit.
There are three 'in-built' series of weight matrices offered. Each consists of several matrices which work differently at different evolutionary distances. To see the exact details, read the documentation. Crudely, we store several matrices in memory, spanning the full range of amino acid distance (from almost identical sequences to highly divergent ones). For very similar sequences, it is best to use a strict weight matrix which only gives a high score to identities and the most favoured conservative substitutions. For more divergent sequences, it is appropriate to use "softer" matrices which give a high score to many other frequent substitutions.
A new matrix can be read from a file on disk, if the filename consists only of lower case characters. The values in the new weight matrix must be integers and the scores should be similarities. You can use negative as well as positive values if you wish, although the matrix will be automatically adjusted to all positive scores.
INPUT FORMAT The format used for a new matrix is the same as the BLAST program. Any lines beginning with a # character are assumed to be comments. The first non-comment line should contain a list of amino acids in any order, using the 1 letter code, followed by a * character. This should be followed by a square matrix of integer scores, with one row and one column for each amino acid. The last row and column of the matrix (corresponding to the * character) contain the minimum score over the whole matrix.
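The format described above can be read with a short Python sketch like the one below. The function name read_matrix is hypothetical, the toy two-residue matrix is invented for illustration, and the sketch assumes that each matrix row begins with its residue letter, as in BLAST matrix files.

```python
def read_matrix(text):
    """Parse a BLAST-style weight matrix into {(row, col): score}."""
    # Drop blank lines and comment lines beginning with "#".
    lines = [ln for ln in text.splitlines()
             if ln.strip() and not ln.startswith("#")]
    columns = lines[0].split()  # residue letters followed by "*"
    scores = {}
    for line in lines[1:]:
        fields = line.split()
        row, values = fields[0], [int(v) for v in fields[1:]]
        if len(values) != len(columns):
            raise ValueError(f"row {row!r} has {len(values)} scores, "
                             f"expected {len(columns)}")
        for col, value in zip(columns, values):
            scores[(row, col)] = value
    return scores


# Invented toy matrix; the "*" row and column hold the minimum score.
TOY_MATRIX = """\
# toy matrix for illustration only
   A  C  *
A  2 -1 -1
C -1  3 -1
* -1 -1 -1
"""

matrix = read_matrix(TOY_MATRIX)
print(matrix[("A", "A")], matrix[("C", "*")])  # prints 2 -1
```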
-INFILE=file.ext :input sequences.
-PROFILE1=file.ext and -PROFILE2=file.ext :profiles (old alignment).
VERBS (do things)
-OPTIONS :list the command line parameters
-HELP or -CHECK :outline the command line params.
-ALIGN :do full multiple alignment
-PROFILE :merge two alignments (PROFILE1 and 2) by profile alignment
-SEQUENCES :sequentially add PROFILE2 sequences to PROFILE1 alignment
-TREE :calculate NJ tree.
-BOOTSTRAP(=n) :bootstrap a NJ tree (n= number of bootstraps; def. = 1000).
PARAMETERS (set things)
***General settings:***
-INTERACTIVE :read command line, then enter normal interactive menus
-QUICKTREE :use FAST algorithm for the alignment guide tree
-NEWTREE= :file for new guide tree
-USETREE= :file for old guide tree
-NEGATIVE :protein alignment with negative values in matrix
-OUTFILE= :sequence alignment file name
-OUTPUT= :GCG, GDE, PHYLIP or PIR
-OUTORDER= :INPUT or ALIGNED
-CASE :LOWER or UPPER (for GDE output only)
***Fast Pairwise Alignments:***
-KTUP=n :word size
-TOPDIAGS=n :number of best diags.
-WINDOW=n :window around best diags.
-PAIRGAP=n :gap penalty
-SCORE :PERCENT or ABSOLUTE
***Slow Pairwise Alignments:***
-PWMATRIX= :BLOSUM, PAM, MD, ID or filename
-PWGAPOPEN=f :gap opening penalty
-PWGAPEXT=f :gap extension penalty
***Multiple Alignments:***
-MATRIX= :BLOSUM, PAM, MD, ID or filename
-GAPOPEN=f :gap opening penalty
-GAPEXT=f :gap extension penalty
-ENDGAPS :end gap separation pen.
-GAPDIST=n :gap separation pen. range
-NOPGAP :Pascarella gaps off
-NOHGAP :hydrophilic gaps off
-HGAPRESIDUES= :list hydrophilic res.
-MAXDIV=n :% ident. for delay
-TYPE= :PROTEIN or DNA
-TRANSITIONS :transitions NOT weighted.
***Trees:***
-SEED=n :seed number for bootstraps.
-KIMURA :use Kimura's correction.
-TOSSGAPS :ignore positions with gaps.
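As an illustration of how the verbs and parameters above combine on one command line, the Python sketch below assembles an argument list. It only builds the list and does not run anything. The function name build_command, the executable name "clustalw" and the file name globins.pir are assumptions made for the example; the switches themselves are taken from the list above.

```python
def build_command(program, **options):
    """Build an argument list from keyword options.

    A value of True yields a bare switch such as -ALIGN; any other value
    yields -NAME=value.
    """
    args = [program]
    for name, value in options.items():
        flag = "-" + name.upper()
        args.append(flag if value is True else f"{flag}={value}")
    return args


# Hypothetical executable and input file names.
cmd = build_command("clustalw", infile="globins.pir", align=True,
                    output="PHYLIP", outorder="ALIGNED")
print(" ".join(cmd))
# prints clustalw -INFILE=globins.pir -ALIGN -OUTPUT=PHYLIP -OUTORDER=ALIGNED
```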