Multiple Alignment Exercise
We are going to make some multiple alignments of DNA and protein sequences and examine them with some alignment editors and display tools.
There are several ways to build a set of sequences for a multiple alignment project. You can acquire them experimentally, by some form of cloning and sequencing, perhaps across various species and strains, or from environmental or medical samples. Alternatively, you can search a database by keywords and/or accession numbers to find existing data for known genes, proteins, or loci. Or you can use sequence similarity searching to scan a database for all sequences that match a query sequence, but take care in your choice of database so that you acquire an appropriate number and range of sequences.
1) Use the command line tools built into EMBOSS
a. log into your account on genetraffic.med.nyu.edu
b. make a new sub-directory for this exercise and cd into it
c. use seqret to retrieve opsin-2 proteins from the Swissprot database
type: showdb to get a list of available databases and the names to use to access them
type: seqret
then, when it prompts you for sequences, type swall:ops2_*
("swall" is a combined Swissprot+TREMBL database; "ops2_*" is a wildcard for all genes with ops2 in their locus names)
give a filename for the FASTA file that will contain your set of opsin sequences
d. view your opsin sequences with more or emacs
e. now use emma to run CLUSTALW for the multiple alignment, giving your FASTA file as input
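Before you align, it can help to confirm how many sequences seqret actually returned and how long each one is. The following is a minimal Python sketch for that check, assuming the file is plain FASTA. The function names read_fasta and summarize are hypothetical helpers written for this exercise; they are not part of EMBOSS.

```python
def read_fasta(lines):
    """Parse FASTA text (an iterable of lines) into (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def summarize(records):
    """Return (identifier, length) per record; identifier = first word of the header."""
    summary = []
    for header, sequence in records:
        words = header.split()
        summary.append((words[0] if words else "", len(sequence)))
    return summary
```

To use it on your own file, open the file you named in step c and pass the file handle to read_fasta, then print the result of summarize. A very short or very long sequence in the list is worth a second look before you run emma.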
2) Use plotcon to identify the regions of high and low similarity across the alignment
http://emboss.bioinformatics.nl/cgi-bin/emboss/plotcon
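To get a feel for what a windowed similarity plot measures, here is a rough Python sketch. It is not plotcon's method: plotcon scores columns with a comparison matrix, while this sketch swaps in a simpler identity-based score (the fraction of sequences that share the most common residue in each column) and then averages that score over a sliding window. The function names and the default window size are assumptions chosen for illustration.

```python
from collections import Counter


def column_conservation(aligned):
    """Per column: fraction of all sequences carrying the most common non-gap residue."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for i in range(length):
        residues = [seq[i] for seq in aligned if seq[i] != "-"]
        if not residues:
            scores.append(0.0)  # a column of gaps only counts as unconserved
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        # Dividing by the number of sequences means gaps lower the score.
        scores.append(top_count / len(aligned))
    return scores


def windowed_average(scores, winsize=4):
    """Average the scores over each full window of winsize consecutive columns."""
    return [
        sum(scores[i:i + winsize]) / winsize
        for i in range(len(scores) - winsize + 1)
    ]
```

A larger window gives a smoother curve but blurs short conserved motifs, which is the same trade-off you will see when you change the window size in plotcon.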
3) Use showalign to create a consensus sequence, and only print those amino acids that differ from the consensus.
http://emboss.bioinformatics.nl/cgi-bin/emboss/showalign
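The idea behind this display can be sketched in a few lines of Python. This is an illustration only, assuming a simple majority-rule consensus and sequences of equal aligned length; showalign has its own rules and options for building the consensus. The function names consensus and show_differences are hypothetical.

```python
from collections import Counter


def consensus(aligned):
    """Majority-rule consensus; gaps are ignored unless the whole column is gaps."""
    cons = []
    for column in zip(*aligned):
        counts = Counter(c for c in column if c != "-")
        if not counts:
            cons.append("-")
            continue
        # Highest count wins; ties are broken alphabetically so the result is stable.
        best = min(counts.items(), key=lambda item: (-item[1], item[0]))[0]
        cons.append(best)
    return "".join(cons)


def show_differences(aligned, cons):
    """Replace every residue that matches the consensus with a dot."""
    return [
        "".join("." if residue == c else residue for residue, c in zip(seq, cons))
        for seq in aligned
    ]
```

In the dotted view, a row that is almost all dots is close to the consensus, and columns where many rows show letters are the variable positions.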
4) Let's repeat this process using a new set of sequences selected by a BLAST search of GenBank.
Have a look at this paper:
A multigene family on human chromosome 12 encodes natural killer-cell lectins.
Yabe T, McSherry C, Bach FH, Fisch P, Schall RP, Sondel PM, Houchins JP.
Immunogenetics. 1993;37(6):455-60.
In the article (from 1993), the authors refer to four genes on human chromosome 12: NKG2-A, B, C, and D. You will quickly find that GenBank does not follow this naming scheme. Instead, you will discover a much larger number of KLR genes, A-G, with additional C1, C2, C3, and C4 isoforms, as well as a number of CLEC genes. Study this region of chromosome 12 on the UCSC Genome Browser. How many genes are in this multi-gene family?
[I can imagine a really nice phylogenetics project using all of the genes in this locus from several mammals and other vertebrates to try to determine when the various gene duplications occurred.]
Let's start with NKG2-D (NP_031386). Use this as the query for a BLAST search against the RefSeq protein database, and limit the search to human. Choose the top 10 matching sequences, Display them in FASTA format, and Send them to a text file on your desktop.
Paste this file into the EBI ClustalW server and save the alignment as a text file.
http://www.ebi.ac.uk/Tools/es/cgi-bin/clustalw2/
This alignment shows a very nice distribution of conserved and variable regions.
5) Now take your aligned sequences and paste them into the Boxshade server to highlight the conserved and variable regions.
http://www.ch.embnet.org/cgi-bin/BOX_form_parser
or
http://mobyle.pasteur.fr/cgi-bin/MobylePortal/portal.py?form=boxshade
6) Grab one highly conserved region from the original Clustal alignment (about 20 amino acids wide), and paste it into the WebLogo server (the result should be a thing of beauty).
http://weblogo.berkeley.edu/logo.cgi
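The height of each stack in a sequence logo reflects the information content of that column, measured in bits. A minimal Python sketch of the basic calculation follows, assuming protein sequences (an alphabet of 20 residues) and leaving out the small-sample correction that logo programs typically apply, so the numbers will not match the server's output exactly. The function name column_information is hypothetical.

```python
import math
from collections import Counter


def column_information(aligned, alphabet_size=20):
    """Information content per column in bits: log2(alphabet_size) minus Shannon entropy."""
    max_bits = math.log2(alphabet_size)
    info = []
    for column in zip(*aligned):
        residues = [c for c in column if c != "-"]  # gaps are ignored
        if not residues:
            info.append(0.0)  # nothing observed in this column
            continue
        total = len(residues)
        entropy = -sum(
            (n / total) * math.log2(n / total)
            for n in Counter(residues).values()
        )
        info.append(max_bits - entropy)
    return info
```

A perfectly conserved protein column scores log2(20), a little over 4.3 bits, and the score falls as more residues share the column. That is why the conserved block you picked should produce tall stacks.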
Send me your showalign result for the opsin sequences and the WebLogo for the lectins.