First download and install ClustalX on your computer.
CLUSTAL
Most of you will want the Windows version:
clustalx1.81.msw.zip
We are going to collect a group of related proteins from GenBank by first using a keyword query to locate one sequence, then making a BLAST search to find a bunch of related sequences. We will collect some of these and make a multiple alignment with Clustal. Clustal also produces a decent phylogenetic tree.
Start at the home page of the NCBI (National Center for Biotechnology Information):
Now set the Display to FASTA format and copy the amino acid sequence.
Go back to the main NCBI page and go to BLAST (use the tiny
navigation bar at the top of the page).
Make a protein-protein BLAST search (blastp) of new GenBank (month) sequences
using the protein sequence of the Rat NRDC gene as the query.
This "month" data set contains only those sequences submitted to GenBank in the last month, so it is much smaller than the whole of GenBank. The default database is called "nr" (non-redundant), which is what you will usually want to search. I chose "month" for this exercise because it is a faster search and it will give us a more manageable number of hits with an interesting distribution of scores. Also, since we are planning on making a multiple alignment, let's limit our search to only the most similar sequences: lower the "E" value cutoff from the default of 10 to 0.1, and limit the output to the top 50 hits.
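To see what the two settings do, here is a minimal Python sketch that applies the same E value cutoff and hit limit to a list of hits. The hit names and E values are invented for illustration, and filter_hits is a hypothetical helper, not part of BLAST.

```python
def filter_hits(hits, max_evalue=0.1, max_hits=50):
    """Keep hits with E value <= max_evalue, best (lowest E) first,
    and return at most max_hits of them."""
    kept = [hit for hit in hits if hit[1] <= max_evalue]
    kept.sort(key=lambda hit: hit[1])
    return kept[:max_hits]


# Invented (name, E value) pairs, only to exercise the function.
example_hits = [("hitA", 1e-50), ("hitB", 0.05), ("hitC", 3.0), ("hitD", 1e-10)]
selected = filter_hits(example_hits)
for name, evalue in selected:
    print(name, evalue)
```

With the default cutoff of 10, hitC would also be reported; at 0.1 it is dropped.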
There are quite a few similar sequences. Now retrieve all of the matching sequences that have HSPs longer than 50 residues (this is a protein search, so lengths are counted in amino acids). These each need to be converted into FASTA format (this can be done right on the Entrez web page) and saved into one long text file (you can use Notepad or WordPad for this).
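If you would rather not paste the records together by hand, the same "one long text file" can be assembled with a few lines of Python. This is only a sketch: the two records are made-up sequences, and combine_fasta is a hypothetical helper name.

```python
def combine_fasta(records):
    """Join separately copied FASTA records into one multi-FASTA string.

    Blank lines and stray whitespace are removed, and every record
    must begin with a '>' header line.
    """
    cleaned = []
    for record in records:
        lines = [line.strip() for line in record.splitlines() if line.strip()]
        if not lines or not lines[0].startswith(">"):
            raise ValueError("each record must start with a '>' header line")
        cleaned.append("\n".join(lines))
    return "\n".join(cleaned) + "\n"


# Made-up records standing in for sequences copied from Entrez.
records = [
    ">seq1 hypothetical example\nMKTAYIAK\nQRQISFVK\n",
    "\n>seq2 hypothetical example\nMKSAYLAR\n",
]
combined = combine_fasta(records)
print(combined)
```

The resulting string can be written to a plain text file and opened in ClustalX.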
The essence of working with CLUSTAL is building a list of sequences that you wish to align. The most important considerations are that sequences must be of approximately the same length to avoid adding a lot of gaps at the ends and all of the sequences must be fairly closely related. Do not use CLUSTAL to try to align sequences unless you have already found significant similarity between them!
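A quick way to check the "approximately the same length" rule before aligning is to compare each sequence with the median length. The sketch below assumes a 25% tolerance, which is an arbitrary choice for illustration and not a CLUSTAL setting; the sequences are placeholders.

```python
from statistics import median


def length_outliers(seqs, tolerance=0.25):
    """Return names of sequences whose length differs from the median
    length by more than the given fraction of the median."""
    lengths = {name: len(seq) for name, seq in seqs.items()}
    mid = median(lengths.values())
    return sorted(
        name for name, n in lengths.items() if abs(n - mid) > tolerance * mid
    )


# Placeholder sequences of different lengths (content is irrelevant here).
seqs = {"a": "M" * 100, "b": "M" * 110, "c": "M" * 95, "d": "M" * 300}
print(length_outliers(seqs))
```

Any sequence flagged here is a candidate for trimming or for leaving out of the alignment.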
Then load the entire set of sequences into CLUSTALX on your PC and do the alignment.
http://www.ebi.ac.uk/clustalw/
http://dot.imgen.bcm.tmc.edu:9331/multi-align/
By looking at the CLUSTAL output, you can see that some sequences extend beyond the rest at the N-terminal and C-terminal ends. If you were building an alignment for publication or to calculate phylogenetic relationships, these extra residues would need to be chopped off. Look more carefully at this alignment - some sequences match up well, but others do not. In fact, the alignment can be divided into several sub-groups of sequences that align well with each other. You can look at the dendrogram created by CLUSTAL to get a clue what is going on here. The next step would be a careful phylogenetic study of these sequences.
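Chopping off the overhanging ends can be sketched in Python as follows. The idea is to keep only the alignment columns between the latest-starting and the earliest-ending sequence. The aligned sequences are invented, gaps are assumed to be written as "-", and trim_overhangs is a hypothetical helper, not a CLUSTAL function.

```python
def trim_overhangs(aligned):
    """Trim an alignment to the columns where every sequence has started
    and none has ended. Internal gaps are left untouched."""
    if len({len(seq) for seq in aligned.values()}) != 1:
        raise ValueError("aligned sequences must all have the same length")

    def first_residue(seq):
        # Index of the first non-gap character.
        return len(seq) - len(seq.lstrip("-"))

    def last_residue(seq):
        # Index just past the last non-gap character.
        return len(seq.rstrip("-"))

    start = max(first_residue(seq) for seq in aligned.values())
    end = min(last_residue(seq) for seq in aligned.values())
    if start >= end:
        raise ValueError("no column is covered by every sequence")
    return {name: seq[start:end] for name, seq in aligned.items()}


# Invented aligned sequences with ragged ends.
aligned = {
    "seq1": "MKTAYIAKQR--",
    "seq2": "--TAYLAKQRIS",
    "seq3": "-KTAY-AKQR--",
}
trimmed = trim_overhangs(aligned)
for name, seq in trimmed.items():
    print(name, seq)
```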
Go back to the BLAST output and look where each sequence aligns with Ratnrdc. These sequences are matching to different parts of Ratnrdc - in some cases the regions of alignment do not even overlap. In addition, the region of alignment is limited to a small part of some sequences, so it is not clear whether the rest of those sequences are also similar to Ratnrdc. This problem of multiple domains frustrates attempts to use similarity searches like BLAST to annotate new genes as they are discovered in genome sequencing.
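The "do not even overlap" observation can be checked mechanically once you note the start and end positions on the query for each hit. The Python sketch below lists every pair of hits whose matched regions on the query share no position. The coordinates are invented for illustration and are not taken from the actual Ratnrdc search.

```python
from itertools import combinations


def non_overlapping_pairs(query_ranges):
    """Return pairs of hit names whose (start, end) ranges on the query
    share no position. Coordinates are inclusive."""
    pairs = []
    for (a, (a_start, a_end)), (b, (b_start, b_end)) in combinations(
        sorted(query_ranges.items()), 2
    ):
        if a_end < b_start or b_end < a_start:
            pairs.append((a, b))
    return pairs


# Invented query coordinates for three hits.
query_ranges = {"hitA": (1, 200), "hitB": (150, 400), "hitC": (500, 700)}
print(non_overlapping_pairs(query_ranges))
```

Hits that never overlap on the query are matching different domains and should not be expected to align well with each other.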