Pairwise and Multiple Alignment Exercise

A. Pairwise Alignment: GAP and BESTFIT

Lets start with a hypothetical situation: you have found a cDNA and identified a similar gene in GenBank by a FASTA search - but it has a low significance score.. Now you want to know if the cDNA is really related to this gene, or just a chance match. The obvious starting point is a pairwise alignment.

1) FETCH the RATNRDC sequence that we used in the LOOKUP exercise and the cDNA sequence RNU83985. Use both GAP and BESTFIT to align these two sequences and compare the output.

2) Now lets try something a bit more challenging. FETCH the sequence RNSCA1. First, try to align this whole sequences to RATNRDC with GAP and look at the output - get used to the look of this totally bogus, non-significant alignment, you will see it often when you use GAP.

3) Now use BESTFIT to find the regions of the two sequences that have significant similarity. Look at the output from BESTFIT - you should see a tiny region of near identity between the two sequences. Can this region be extended? Let's try using the starting point in each sequence identified by BESTFIT and make a new alignment of the rest of the sequences with GAP

GAP of what sequence 1 ?  Ratnrdc.Gb_Ro

               Begin (* 1 *) ?  1436
               End (*  3581 *) ?
               Reverse (* No *) ?

 to what sequence 2 (* Ratnrdc.Gb_Ro *) ?  Rnsca1.Gb_Ro

               Begin (* 1 *) ?  2483
               End (*  5588 *) ?
               Reverse (* No *) ?

4) How does the output look? GAP has stubbornly ignored the region of similarity in order to produce a better overall (but totally non-significant) alignment. Let's try again, but this time limit the region of alignment to 500 base pairs.

GAP of what sequence 1 ?  Ratnrdc.Gb_Ro

               Begin (* 1 *) ?  1436
               End (*  3581 *) ?  1936
               Reverse (* No *) ?

 to what sequence 2 (* Ratnrdc.Gb_Ro *) ?  Rnsca1.Gb_Ro

               Begin (* 1 *) ?  2483
               End (*  5588 *) ?  2983
               Reverse (* No *) ?

5) Now we get the aligned region identified by BESTFIT and also an answer to the real question: Are these sequences really related, or is the similarity limited to just an isolated region? It is possible to explore this more carefully by using GAP to align a region that includes the short section of similarity in the middle. Try different values for the Begin and End sequences to look for similarity upstream from #1436 in Ratnrdc.

6) There is one more trick you can try. Do the protein sequences of these two genes align any better than the DNA sequences? You decide, is there significant similarity here?




B. Multiple alignment with PILEUP

The essence of working with PILEUP is building a list of sequences that you wish to align. The most important considerations are that sequences must be of approximately the same length to avoid adding a lot of gaps at the ends and all of the sequences must be fairly closely related. Do not use PILEUP to try to align sequences unless you have already found significant similarity between them!

Start with a NETBLAST search of all GenBank protein sequences (option #1) using the NRDC gene as the query. Multiple alignments of proteins are generally much more informative and can span much more distantly related genes. I hope you can figure out how to find the protein sequence of the human NRDC gene (hint: use ENTREZ). Limit your search to hits expected to occur by chance no more than 0.1 times and limit the number of sequences in your output file to 20 (PILEUP can't handle huge numbers of sequences).

Now you can retrieve all of the matching sequences that were in the BLAST results file into your directory with the NETFETCH command. [This may seem rather convoluted, but remember, NETBLAST searches the databases at the NCBI and they may use different accession numbers and filenames than we do here.]

 
     > netfetch nrdc_human.netblastp

NetFetch creates a file in .RSF format, so to use this as input for PILEUP you will need to add {*} after the filename, like this:


	>  pileup  nrdc_human.rsf{*}

Watchout for the filename of the output file. Oh no! PILEUP fails due to too many gap insertions! That means there are too many poorly related sequences that either refuse to align, or align with too many gaps. Go back to the nrdc_human.netblastp file and remove some of the least similar sequences (you can just comment them out by adding a ! character before their name in the list of matches:
pir||E83400 pyrroloquinoline quinone biosynthesis protein F PA19...  94  7e-18
sp|P55174|PQQF_PSEFL COENZYME PQQ SYNTHESIS PROTEIN F >gi|212066...  92  3e-17
!pir||T27351 hypothetical protein Y70C5C.1 - Caenorhabditis elega...  89  4e-16
!pir||T46182 hypothetical protein T8H10.60 - Arabidopsis thaliana... 73  2e-11
pir||A70410 processing proteinase (EC 3.4.-.-) - Aquifex aeolicu... 69  3e-10
!dbj|BAB06124.1| (AP001515) processing protease [Bacillus halodur... 68  6e-10
pir||B72264 hypothetical protein TM1346 - Thermotoga maritima (s... 68  7e-10
!pir||S74378 processing proteinase sll0055 - Synechocystis sp(s...   67  1e-09
!dbj|BAB39420.1| (AP002901) putative mitochondrial processing pep..  65  5e-09
!sp|Q04805|YMXG_BACSU HYPOTHETICAL ZINC PROTEASE YMXG (ORFP) >gi|..  63  2e-08 
Now repeat the NetFetch with your reduced list of genes, and then repeat the Pileup to see if it works better. You may have to repeat this a few times until you eliminate all of the genes that don't align well enough to meet the standards of Pileup. You can also look at the pairwise alignment scores produced by Pileup as it works to see which sequences don't play well with the others, then identify these sequences by name in the .RSF file.

Once you get Pileup to work, you can look at the PILEUP output. You can see that some sequences extend beyond the rest at the 5' or 3' end. If you were building a larger alignment, or making a phylogenetic tree these extra bases would need to be chopped off. Look more carefully at this alignment - some sequences match up well, but others do not. In fact, the alignment can be divided into several sub-groups of sequences that align well. If we could view the dendrogram created by PILEUP (called PILEUP.FIGURE in your directory) then you would get a clue what is going on here - but there is no simple way to do this in the Library classroom. This would be a good time to experiment with this same project in SeqWeb.

PileUp of DNA sequences

You can also use Pileup to align DNA sequences as well as protein, but their are some pitfalls to watch out for. The biggest problem is the length of the sequences. Try this - grab the human mRNA sequence for the leptin (ob gene) - its accesion # is U43653, and do a FASTA search against the GenBank section for mamals (not primate or rodent) gb_om:* Once that is done, try to use the results file as input for a Pileup.

It works surprisingly well, but some sequences stick out at the 5' and/or the 3' end. You can go back to the Fasta output file and edit each sequence by changing the Begin and End values in the list (the values that you see were created by Fasta to match the region of alginment with U43653 - isn't that at nice GCG feature! Try to trim them all neatly.

Send your final alignment to me (browns02@mcrcr.med.nyu.edu) by e-mail.




C. Use ClustalW for Multiple Alignment

Now lets try the same alignment with ClustalW. ClustalW is another multiple alingment program, but it is not part of the GCG package - so it does not use the same file formats. ClustalW wants its input files (the sequences to be aligned), in a single Fasta format file.

Fortunately, GCG has a simple way to convert any list of sequences (like your nrdc_human.rsf file) into a Fasta file. The command is just tofasta, so the whole thing looks like this


tofasta nrdc_human.rsf{*}
(remember the stupid curly bracket and star)

And remember to give your output file a sensible name.

Now look at the output file. Isn't that a useful way to grab and store sequence data?

Now use the clustalwprogram.
It has its own interface. You have to first load your file of sequence data (Sequence Input From Disc), then do a Multiple Alignment. You can also set parameters and output file format. It is usually wise to choose GCG/MSF format for compatibility with other GCG programs.

Anyway, ClustalW works fine on this data set. You should try to run it on the original ~20 sequences that were found in your NetBlast search before you commented out some of them and see if it is more tolerant of mismatches than PileUp.


C. ClustalX on a desktop computer

There is a version of Clustal that can run on your desktop computer. Download and install it now.

Mac version

Windows version

You can use the same Fasta file that you created for ClustalW as an input file for ClustalX on your own computer. Give it a try.

That's all for now