DNA and Protein Patterns Exercise

This will be a short exercise since all of you have projects to work on - and pattern finding should be a part of all projects.

Start with the sequence cecos.seq that I have linked to this page.

If this is a new or unknown sequence, where do you start an analysis?

OK, so most of you know by now that a BLAST or FASTA search is the best place to start any analysis of an unknown sequence, but let's skip that for now and think about pattern recognition. Does this gene match any known patterns?

The first problem is that this is an unknown chunk of DNA. Does it have any genes in it? Let's do a quick check for open reading frames. The best tool for this is FRAMES; we can also use MAP if no graphics viewing options are available.

The output from MAP is a bit confusing, but if you choose just one enzyme and the option to translate only open reading frames, it can be deciphered. The command line option "-open=20" will limit the display to open reading frames longer than 20 amino acids.
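To see what an ORF check is doing under the hood, here is a minimal Python sketch. It is not the algorithm used by FRAMES or MAP. It assumes an ORF runs from an ATG to the next in-frame stop codon, and the function names and the min_codons parameter are made up for this illustration (min_codons=20 mirrors the "-open=20" idea).

```python
# Illustrative six-frame ORF scan; NOT the GCG FRAMES/MAP implementation.
# Assumption: an ORF is an ATG followed in-frame by TAA, TAG, or TGA.
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOPS = {"TAA", "TAG", "TGA"}


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def find_orfs(dna, min_codons=20):
    """Return (strand, frame, start, end, codons) for ORFs longer than min_codons.

    start/end are 0-based offsets on the scanned strand; end includes the stop codon.
    """
    dna = dna.upper()
    results = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            start = None
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first ATG opens the reading frame
                elif start is not None and codon in STOPS:
                    codons = (i - start) // 3  # amino acids before the stop
                    if codons > min_codons:
                        results.append((strand, frame, start, i + 3, codons))
                    start = None
    return results


if __name__ == "__main__":
    # Toy sequence: ATG, 25 alanine codons, then a stop.
    demo = "ATG" + "GCT" * 25 + "TAA"
    for orf in find_orfs(demo):
        print(orf)
```

Real ORF finders differ in how they treat alternative start codons and frames that run off the end of the sequence, so expect the program output to differ in detail from this sketch.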

Try searching this sequence for ORFs using the GRAIL online service http://compbio.ornl.gov/grailexp

OK, now try a BLAST search against the Swissprot database (the GCG NETBLAST program will automatically use BLASTX to translate your DNA query sequence in all six reading frames for comparison to a protein database).

Isn't that an easier way to find the coding sequence in a stretch of DNA? Too bad this doesn't work for all new sequences that you find in the lab.
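The six-frame translation that BLASTX performs on a DNA query can be sketched in a few lines of Python. This is only an illustration of the idea, assuming the standard genetic code; the helper names (translate, six_frames) are invented here and are not part of BLAST or GCG.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G (NCBI table 1 layout).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(seq):
    """Translate complete codons only; unknown codons (e.g. with N) become X."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frames(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to protein strings."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


if __name__ == "__main__":
    for label, protein in six_frames("ATGGCTTAA").items():
        print(label, protein)
```

Each of the six protein strings is then compared to the protein database, which is why a strong hit also tells you the strand and frame of the coding region.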

Take the protein sequence generated by BLASTX and use it for a MOTIFS search. In this case it is simple and it works. Use this same protein sequence to search the ProDom database and look at a multiple alignment of related genes.
[ http://protein.toulouse.inra.fr/prodom.html ]

If you were new to the study of this gene, this information would probably be valuable.
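A MOTIFS-style search boils down to matching short patterns written in PROSITE syntax against your protein. As a rough sketch of how such a pattern works, the Python below converts a subset of that syntax (x for any residue, [..] for alternatives, {..} for exclusions, (n) or (n,m) for repeats) into a regular expression. The pattern and the sequence in the demo are made up for illustration and are not real PROSITE entries, and the function name is hypothetical.

```python
import re


def prosite_to_regex(pattern):
    """Convert a simple PROSITE-style pattern into a Python regular expression.

    Handles x, [ABC], {ABC}, repeat counts, and the < and > anchors.
    This is a simplified reading of the syntax, not a full parser.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        at_start = element.startswith("<")
        at_end = element.endswith(">")
        element = element.strip("<>")
        core, low, high = re.fullmatch(
            r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element).groups()
        if core == "x":
            core = "."                       # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # any residue except these
        if low and high:
            core += "{%s,%s}" % (low, high)
        elif low:
            core += "{%s}" % low
        parts.append(("^" if at_start else "") + core + ("$" if at_end else ""))
    return "".join(parts)


if __name__ == "__main__":
    # Hypothetical pattern and sequence, for illustration only.
    regex = prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]")
    print(regex)
    match = re.search(regex, "AACAAACGGGWAA")
    if match:
        print(match.start(), match.group())
```

Short patterns like this match by chance fairly often, which is one reason to look at the ProDom alignment as well before believing a hit.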




Read the essay on "Profile Analysis" by John Devereux (founder of the GCG company) [http://mcrcr0.med.nyu.edu/gcg/profileanalysis.html]

Create a multiple alignment with PILEUP using the following list file, then create a profile with PROFILEMAKE. Alternatively, you may use a multiple alignment that you have created in your project for this course, in your lab work, or from a protein family that you are studying.

!!SEQUENCE_LIST 1.0
..
swissprot:XYLB_PSEPU  begin:41 end:90
swissprot:Adh1_Pethy begin:50 end:100
swissprot:P32771 begin:50 end:100
SPTREMBL:P93629 begin:50 end:100
swissprot:P08319 begin:47 end:97

[copy and paste these lines into a new text file created with EMACS, or FETCH them into your directory and make your own list file]
Note the begin and end values for each sequence. Restricting each entry to the conserved functional domain makes a much better pattern, but defining that domain for a new set of sequences requires a fair bit of work.
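If you want to double-check the ranges in a list file before running PILEUP, a small script can read them back. The Python below is a minimal sketch that assumes the layout shown above (entries after the ".." line, each with begin: and end: attributes); the function name is invented for this example and is not a GCG tool.

```python
# The list file from this exercise, embedded so the sketch is self-contained.
LIST_FILE = """\
!!SEQUENCE_LIST 1.0
..
swissprot:XYLB_PSEPU  begin:41 end:90
swissprot:Adh1_Pethy begin:50 end:100
swissprot:P32771 begin:50 end:100
SPTREMBL:P93629 begin:50 end:100
swissprot:P08319 begin:47 end:97
"""


def parse_list_file(text):
    """Return (name, begin, end) for every entry after the '..' divider.

    Assumes every entry carries both begin: and end: attributes.
    """
    entries = []
    in_body = False
    for line in text.splitlines():
        line = line.strip()
        if line == "..":
            in_body = True      # everything before this line is header text
            continue
        if not in_body or not line or line.startswith("!"):
            continue
        name, *attributes = line.split()
        values = dict(attr.split(":", 1) for attr in attributes)
        entries.append((name, int(values["begin"]), int(values["end"])))
    return entries


if __name__ == "__main__":
    for name, begin, end in parse_list_file(LIST_FILE):
        # Ranges are inclusive, so the segment length is end - begin + 1.
        print(name, begin, end, end - begin + 1)
```

Segments of roughly equal length are a quick sanity check that each range covers the same domain in every sequence.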

Use that profile to search the newest entries in GenPept (gp_upd:*) with PROFILESEARCH. Then use PROFILESEGMENTS to view your list of matches.
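To get a feel for what a profile search does, here is a deliberately simplified Python sketch. It is not PROFILEMAKE or PROFILESEARCH: it swaps in plain column frequencies with pseudocounts and a uniform background, and it slides the profile along the sequence without gaps. The toy alignment, the sequence, and the function names are all made up for illustration.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column.

    Assumes equal-length aligned sequences and a uniform background frequency.
    """
    background = 1.0 / len(ALPHABET)
    profile = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res in ALPHABET)
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        profile.append({
            aa: math.log(((counts[aa] + pseudocount) / total) / background)
            for aa in ALPHABET
        })
    return profile


def best_window(profile, sequence):
    """Score every ungapped window; return (score, offset) of the best, or None."""
    width = len(profile)
    best = None
    for offset in range(len(sequence) - width + 1):
        window = sequence[offset:offset + width]
        score = sum(col.get(res, 0.0) for col, res in zip(profile, window))
        if best is None or score > best[0]:
            best = (score, offset)
    return best


if __name__ == "__main__":
    toy_alignment = ["GHEAAG", "GHEGVG", "GHESAG"]  # hypothetical aligned segments
    profile = build_profile(toy_alignment)
    print(best_window(profile, "MKTGHEAAGLL"))
```

Conserved columns contribute large scores and variable columns contribute little, so a high-scoring window is one that matches the positions the family agrees on.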

Send me the first two pattern match alignments in your PROFILESEGMENTS file. Do these alignments make biological sense? Did you find a real functional set of proteins in the database?