Lets start with a hypothetical situation: you have found a cDNA and identified a similar gene in GenBank by a FASTA search - but it has a low significance score.. Now you want to know if the cDNA is really related to this gene, or just a chance match. The obvious starting point is a pairwise alignment.
1) FETCH the RATNRDC sequence that we used in the LOOKUP exercise and the cDNA sequence RNU83985. Use both GAP and BESTFIT to align these two sequences and compare the output.
2) Now lets try something a bit more challenging. FETCH the sequence RNSCA1. First, try to align this whole sequences to RATNRDC with GAP and look at the output - get used to the look of this totally bogus, non-significant alignment, you will see it often when you use GAP.
3) Now use BESTFIT to find the regions of the two sequences that have significant similarity. Look at the output from BESTFIT - you should see a tiny region of near identity between the two sequences. Can this region be extended? Let's try using the starting point in each sequence identified by BESTFIT and make a new alignment of the rest of the sequences with GAP
GAP of what sequence 1 ? Ratnrdc.Gb_Ro
Begin (* 1 *) ? 1436
End (* 3581 *) ?
Reverse (* No *) ?
to what sequence 2 (* Ratnrdc.Gb_Ro *) ? Rnsca1.Gb_Ro
Begin (* 1 *) ? 2483
End (* 5588 *) ?
Reverse (* No *) ?
4) How does the output look? GAP has stubbornly ignored the region of similarity in order to produce a better overall (but totally non-significant) alignment. Let's try again, but this time limit the region of alignment to 500 base pairs.
GAP of what sequence 1 ? Ratnrdc.Gb_Ro
Begin (* 1 *) ? 1436
End (* 3581 *) ? 1936
Reverse (* No *) ?
to what sequence 2 (* Ratnrdc.Gb_Ro *) ? Rnsca1.Gb_Ro
Begin (* 1 *) ? 2483
End (* 5588 *) ? 2983
Reverse (* No *) ?
5) Now we get the aligned region identified by BESTFIT and also an answer to the real question: Are these sequences really related, or is the similarity limited to just an isolated region? It is possible to explore this more carefully by using GAP to align a region that includes the short section of similarity in the middle. Try different values for the Begin and End sequences to look for similarity upstream from #1436 in Ratnrdc.
6) There is one more trick you can try. Do the protein sequences of these two genes align any better than the DNA sequences? You decide, is there significant similarity here?
The essence of working with PILEUP is building a list of sequences that you wish to align. The most important considerations are that sequences must be of approximately the same length to avoid adding a lot of gaps at the ends and all of the sequences must be fairly closely related. Do not use PILEUP to try to align sequences unless you have already found significant similarity between them!
Start with a NETBLAST search of all GenBank protein sequences (option #1) using the NRDC gene as the query. Multiple alignments of proteins are generally much more informative and can span much more distantly related genes. I hope you can figure out how to find the protein sequence of the human NRDC gene (hint: use ENTREZ). Limit your search to hits expected to occur by chance no more than 0.1 times and limit the number of sequences in your output file to 20 (PILEUP can't handle huge numbers of sequences).
Now you can retrieve all of the matching sequences that were in the BLAST results file into your directory with the NETFETCH command. [This may seem rather convoluted, but remember, NETBLAST searches the databases at the NCBI and they may use different accession numbers and filenames than we do here.]
> netfetch nrdc_human.netblastp
NetFetch creates a file in .RSF format, so to use this as input for PILEUP you will need to add {*} after the filename, like this:
> pileup nrdc_human.rsf{*}
Watchout for the filename of the output file.
Oh no! PILEUP fails due to too many gap insertions! That means there are
too many poorly related sequences that either refuse to align, or align with
too many gaps. Go back to the nrdc_human.netblastp file and remove some of
the least similar sequences (you can just comment them out by adding a
! character before their name in the list of matches:
pir||E83400 pyrroloquinoline quinone biosynthesis protein F PA19... 94 7e-18 sp|P55174|PQQF_PSEFL COENZYME PQQ SYNTHESIS PROTEIN F >gi|212066... 92 3e-17 !pir||T27351 hypothetical protein Y70C5C.1 - Caenorhabditis elega... 89 4e-16 !pir||T46182 hypothetical protein T8H10.60 - Arabidopsis thaliana... 73 2e-11 pir||A70410 processing proteinase (EC 3.4.-.-) - Aquifex aeolicu... 69 3e-10 !dbj|BAB06124.1| (AP001515) processing protease [Bacillus halodur... 68 6e-10 pir||B72264 hypothetical protein TM1346 - Thermotoga maritima (s... 68 7e-10 !pir||S74378 processing proteinase sll0055 - Synechocystis sp(s... 67 1e-09 !dbj|BAB39420.1| (AP002901) putative mitochondrial processing pep.. 65 5e-09 !sp|Q04805|YMXG_BACSU HYPOTHETICAL ZINC PROTEASE YMXG (ORFP) >gi|.. 63 2e-08Now repeat the NetFetch with your reduced list of genes, and then repeat the Pileup to see if it works better. You may have to repeat this a few times until you eliminate all of the genes that don't align well enough to meet the standards of Pileup. You can also look at the pairwise alignment scores produced by Pileup as it works to see which sequences don't play well with the others, then identify these sequences by name in the .RSF file.
Once you get Pileup to work, you can look at the PILEUP output. You can see that some sequences extend beyond the rest at the 5' or 3' end. If you were building a larger alignment, or making a phylogenetic tree these extra bases would need to be chopped off. Look more carefully at this alignment - some sequences match up well, but others do not. In fact, the alignment can be divided into several sub-groups of sequences that align well. If we could view the dendrogram created by PILEUP (called PILEUP.FIGURE in your directory) then you would get a clue what is going on here - but there is no simple way to do this in the Library classroom. This would be a good time to experiment with this same project in SeqWeb.
You can also use Pileup to align DNA sequences as well as protein, but their are some pitfalls to watch out for. The biggest problem is the length of the sequences. Try this - grab the human mRNA sequence for the leptin (ob gene) - its accesion # is U43653, and do a FASTA search against the GenBank section for mamals (not primate or rodent) gb_om:* Once that is done, try to use the results file as input for a Pileup.
It works surprisingly well, but some sequences stick out at the 5' and/or the 3' end. You can go back to the Fasta output file and edit each sequence by changing the Begin and End values in the list (the values that you see were created by Fasta to match the region of alginment with U43653 - isn't that at nice GCG feature! Try to trim them all neatly.
Send your final alignment to me (browns02@mcrcr.med.nyu.edu) by e-mail.
Now lets try the same alignment with ClustalW. ClustalW is another multiple alingment program, but it is not part of the GCG package - so it does not use the same file formats. ClustalW wants its input files (the sequences to be aligned), in a single Fasta format file.
Fortunately, GCG has a simple way to convert any list of sequences (like
your nrdc_human.rsf file) into a Fasta file. The command is just
tofasta, so the whole thing looks like this
tofasta nrdc_human.rsf{*}
(remember the stupid curly bracket and star)