Multiple Alignment Exercise

 

 

We are going to make some multiple alignments of DNA and protein sequences and examine them with some alignment editors and display tools.

 

There are several ways to build a set of sequences for a multiple alignment project. You can acquire them experimentally, by some form of cloning and sequencing, perhaps across various species and strains, or from environmental or medical samples. Alternatively, you can search a database by keyword or accession number to find existing data for known genes, proteins, or loci. Or you can use sequence similarity searching to scan a database for all sequences that match a query sequence, but choose the database carefully so that you retrieve an appropriate number and range of sequences.

 

1)   Use the command line tools built into EMBOSS

a.     log into your account on genetraffic.med.nyu.edu

b.     make a new sub-directory for this exercise and cd into it

c.      use   seqret   to retrieve opsin-2 proteins from the Swissprot database
   type:   showdb    to get a list of available databases and the names used to access them
   type:   seqret
             then when it prompts you for sequences, type   swall:ops2_*
   ("swall" is a combined Swissprot+TrEMBL database; "ops2_*" is a wildcard that matches all entries whose names begin with ops2_)
   when prompted, give a filename for the FASTA file that will contain your set of opsin sequences

d.     view your opsin sequences with more or emacs

e.     now use emma to run CLUSTALW for multiple alignment, giving it your FASTA file as input
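
If you would like a quick sanity check of the FASTA file before aligning it, a short script can count the sequences and report their lengths. This is only a minimal Python sketch: the function name parse_fasta and the toy records below are made up for illustration and are not part of EMBOSS.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines may wrap over several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Toy input for illustration only (not real opsin data)
example = """>seqA made-up entry
MNGTEGPN
FYVPF
>seqB made-up entry
MNGTEGLN
"""

records = parse_fasta(example)
for header, seq in records:
    # Print the ID (first word of the header) and the sequence length
    print(header.split()[0], len(seq))
```

To use it on your own file, read the file contents into a string and pass that string to parse_fasta.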

 

2)   Use plotcon to define the regions of high and low similarity across the alignment

http://emboss.bioinformatics.nl/cgi-bin/emboss/plotcon
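
To get a feel for what a similarity plot is showing, here is a rough Python sketch of a windowed conservation score. It is a simplification and not plotcon's actual method (plotcon scores residues with a comparison matrix). Here each column is scored by the fraction of sequences that share its most common residue, and the scores are then averaged over a sliding window. The function names, the window size, and the toy alignment are assumptions for illustration.

```python
from collections import Counter


def column_conservation(alignment):
    """Score each column by the fraction of sequences sharing its most common residue.

    Gap characters ('-') are never counted as the most common residue,
    but they still count toward the number of sequences.
    """
    scores = []
    for column in zip(*alignment):
        residues = [c for c in column if c != "-"]
        if not residues:
            scores.append(0.0)  # all-gap column
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores


def windowed_average(scores, winsize):
    """Average the scores over every window of `winsize` consecutive columns."""
    return [
        sum(scores[i:i + winsize]) / winsize
        for i in range(len(scores) - winsize + 1)
    ]


# Toy alignment for illustration only (all rows must have the same length)
alignment = ["MKTAYL", "MKTGYL", "MRT-FL"]
scores = column_conservation(alignment)
smoothed = windowed_average(scores, 3)
for position, value in enumerate(smoothed, start=1):
    print(position, round(value, 3))
```

A larger window gives a smoother curve but blurs the boundaries between conserved and variable regions.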

 

3)   Use showalign to create a consensus sequence and print only those amino acids that differ from the consensus.

http://emboss.bioinformatics.nl/cgi-bin/emboss/showalign
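
The idea behind this display can be sketched in a few lines of Python: take the most common character in each column as the consensus, then print a dot wherever a sequence agrees with it. This is a simplified illustration, not showalign's own consensus calculation, and the function names and toy alignment are made up for the example.

```python
from collections import Counter


def consensus(alignment):
    """Build a consensus from the most common character in each column."""
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*alignment)
    )


def show_differences(alignment, cons):
    """Replace every character that matches the consensus with a dot."""
    return [
        "".join("." if c == k else c for c, k in zip(seq, cons))
        for seq in alignment
    ]


# Toy alignment for illustration only (all rows must have the same length)
alignment = ["MKTAYL", "MKTGYL", "MRTAFL"]
cons = consensus(alignment)
diffs = show_differences(alignment, cons)

print(cons)        # MKTAYL
for row in diffs:
    print(row)     # ......  then  ...G..  then  .R..F.
```

Printing only the differences makes variable positions stand out, which is why this view is handy for closely related sequences.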

 

4)   Let's repeat this process using a new set of sequences selected by a BLAST search of GenBank.

 

Have a look at this paper:
A multigene family on human chromosome 12 encodes natural killer-cell lectins.

Yabe T, McSherry C, Bach FH, Fisch P, Schall RP, Sondel PM, Houchins JP.
Immunogenetics. 1993;37(6):455-60.

 

The article (from 1993) refers to four genes on human chromosome 12: NKG2-A, B, C, and D. You will quickly find that GenBank does not follow this naming scheme. Instead, you will discover a much larger number of KLR genes (A-G), with additional C1, C2, C3, and C4 isoforms, as well as a number of CLEC genes. Study this region of chr12 on the UCSC Genome Browser. How many genes are in this multigene family?

 

[I can imagine a really nice phylogenetics project using all of the genes in this locus from several mammals and other vertebrates to try to determine when the various gene duplications occurred.]

 

Let's start with NKG2-D (NP_031386). Use this as the query for a BLAST search against the RefSeq protein database, limited to human. Select the top 10 matching sequences, Display them in FASTA format, and Send them to a text file on your desktop.

 

Paste the contents of this file into the EBI ClustalW server and save the alignment as a text file.

http://www.ebi.ac.uk/Tools/es/cgi-bin/clustalw2/

 

This alignment shows a very nice distribution of conserved and variable regions.

 

5)   Now take your aligned sequences and paste them into the Boxshade server to highlight the conserved and variable regions.

http://www.ch.embnet.org/cgi-bin/BOX_form_parser      or

http://mobyle.pasteur.fr/cgi-bin/MobylePortal/portal.py?form=boxshade

 

6)   Grab one highly conserved region from the original Clustal alignment (about 20 amino acids wide), and paste it into the WebLogo server (the result should be a thing of beauty).

http://weblogo.berkeley.edu/logo.cgi
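
The height of each stack in a sequence logo reflects the information content of that column, measured in bits. Here is a minimal Python sketch of that calculation, assuming a 20-letter amino acid alphabet and leaving out the small-sample correction that logo programs normally apply. The function name and the toy alignment block are made up for illustration.

```python
import math
from collections import Counter


def information_content(column, alphabet_size=20):
    """Information content of one alignment column in bits.

    Computed as log2(alphabet_size) minus the Shannon entropy of the
    observed residue frequencies. Gaps ('-') are ignored.
    """
    counts = Counter(c for c in column if c != "-")
    total = sum(counts.values())
    if total == 0:
        return 0.0  # all-gap column carries no information
    entropy = -sum(
        (n / total) * math.log2(n / total) for n in counts.values()
    )
    return math.log2(alphabet_size) - entropy


# Toy block for illustration only (not real lectin data)
block = ["CWYG", "CWFG", "CWYA", "CWFS"]
bits = [information_content(column) for column in zip(*block)]
for position, value in enumerate(bits, start=1):
    print(position, round(value, 2))
```

A fully conserved column reaches the maximum of log2(20), about 4.32 bits, while a column with many different residues scores close to zero.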

 

Send me your showalign result for the opsin sequences and the WebLogo for the lectins.