Algorithm for Alignment of Sequencing Fragments

This discussion is taken directly from the GCG documentation for the GelMerge program, but it applies equally to all other fragment assembly systems (Staden, Sequencher, MacVector, GeneWorks, LaserGene, etc).

The essence of building contigs is identifying and aligning fragments that overlap. This is done with the same basic algorithms that we have already discussed for similarity searching and pairwise/multiple alignment. The only difference is that alignment should be found in short regions at the ends of fragments.

General Method

Each fragment in the sequencing project database is part of a contig. A contig contains either a group of aligned sequences associated together or a single, unaligned fragment.

GelMerge finds the two contigs with the longest overlap and then aligns them to assemble a single contig. The program then finds the next two contigs with the longest overlap and aligns them to assemble a single contig.

GelMerge repeats this process of overlap determination and contig assembly until there are no remaining overlaps among the contigs in the project database. The result of GelMerge may either be a single contig or several contigs (if none of the remaining contigs share significant overlap). As you add new fragments to the sequencing project database, they may connect separate contigs to form larger assemblies.

Finding Overlaps

In finding overlaps among the contigs, GelMerge represents each contig by its consensus sequence. GelMerge uses a modification of the approximate alignment procedure of Wilbur and Lipman (SIAM J. Appl. Math. 44; 557-567 (1984)) to determine the amount of overlap between any two consensus sequences.

In GelMerge, each alignment must contain at least one long block of contiguous sequence identities. You can set the minimum length for each long block with the /MINIdentity command-line parameter (default length is 14). The requirement for at least one long block of identities allows the program to exclude trivial overlaps from consideration. This requirement also limits the extent of gapping permitted in the approximate alignment. The alignment cannot get out of phase by more than 10 registers of comparison from diagonals containing long blocks of sequence identities. (Use the /MAXGap command-line parameter to adjust this default of 10 registers of comparison.) This effectively prevents the alignment from wandering too far away from registers of comparison with significant sequence identity.

GelMerge determines all of the possible distinct alignments between the two contig consensus sequences being compared. Each alignment corresponds to a different overlap between the same two contig consensus sequences. After finding all of the distinct alignments between a pair of contigs, GelMerge counts the number of exact nucleotide matches in the best alignment. If this overlap does not meet the identity criterion (80% by default), the next best alignment is checked. If no alignment of two contig consensus sequences meets the identity criterion, then the two contigs do not overlap.


Aligning Contigs

Once GelMerge determines the pair of contigs with the longest overlap (see above), it aligns them using the method of Needleman and Wunsch (J. Mol. Biol. 48; 443-453 (1970)). This method, originally used to align individual sequences, has been extended for use with contigs of aligned sequences. When the program inserts gaps into a contig to produce an alignment, they are inserted at the same position in all of the sequences of the contig.

Recognizing (and Optionally Removing) Vector Sequences

GelMerge searches for vector sequences in single-fragment contigs using a two-step approach. First, GelMerge finds approximate alignments between vector and contig sequences using a modification of the method of Wilbur and Lipman (described above in "Finding Overlaps"). You can fine tune the vector searching by adjusting some parameters of the approximate alignment procedure. For vector searching, the minimum length for each short block of sequence identities is the same as the length used to find overlaps among the contigs; you can set the minimum length in response to the What word size program prompt (default length is 7). The minimum length of each long block of sequence identities in vector searching can be set with the /VECTORMINIdentity command-line parameter (default length of 12). The alignment cannot get out of phase by more than 5 registers of comparison from diagonals containing long blocks of sequence identities. (This default of 5 registers of comparison can be adjusted with the /VECTORMAXGap command-line parameter.)

Each of the approximate alignments indicates the position of possible vector sequences in the contig. In the second step of vector recognition, GelMerge refines these alignments using the method of Smith and Waterman (Advances in Applied Mathematics 2; 482-489 (1981)). By default, if the aligned portions of the contig and vector sequences share greater than 80% sequence identity, the contig bases in the alignment become candidates for excision. (You can adjust this value using the /VECTORSTringency command-line parameter.) Additionally, by default, the vector sequences must begin within 12 bases of either end of the contig in order to be excised. (This value is the same as the minimum length for each long block; you can adjust it with the /VECTORMINIdentity command-line parameter).