Similarity Searching Exercise

FASTA

I want you to keep sending me results from each of these weekly exercises so I can track your progress. Collect some results from each part into a text file and then e-mail it to me at the end.

Lets start with FASTA and a sequence that I know something about: rat NRDC (L27124).

This is a DNA sequence for a N-Arginine dibasic convertase from rat brain.

However, as we have learned, it is better to start with a protein sequence. If you use NETFETCH L27124 and look at it with more you will see that the GenBank accession actually contains the protein translation - however GCG cannot read this protein sequence out of the DNA sequence file (an obvious feature missing here). So FETCH the protein NRDC_RAT from a local copy of the SwissProt database. Now you should see the file nrdc_rat.swissprot in your directory.

We will start with FASTA because it runs directly on the local machine and we can control it the best. Type:  fasta

FASTA will ask you for a sequence - I like to just copy the name (nrdc_rat.swissprot) and paste it in here to save retyping and possible mistakes. You are asked if you want to start at the beginning and run until the end - answer yes (just hit return to take the defaults).

Now FASTA will ask what databank do you want to search (don't use the default "uniprot" - which is broken today...), use "sw" which stands for SwissProt. A good protein database which has detailed annotation, and is much smaller and less redundant than GenBank.

Search for query in what sequence(s) (* uniprot:* *) ?  sw:* 

SwissProt has about 200,000 sequences. This search should take about a minute to run. Have a look at the output file (if you used the default filename then type: more nrdc_rat.fasta ). One really nice thing about FASTA in GCG is that this output file can be used as input for any program that can read files that list sequence names.

Save the best 5 matches from the results summary (not the full alignments) to a text file for your e-mail to me for this exercise.


BLAST

OK, now lets do the same search using BLAST. BLAST searches can be done several different ways. A lot of people use the NCBI's WWW BLAST server, so lets try this way first. Launch a Web browser and type in the URL for NCBI : http://www.ncbi.nlm.nih.gov/

Go to BLAST, and select the "Standard protein-protein BLAST" - You can either paste in your query sequence in the little box, or give it an accession number. Either way, you need more information than just the sequence name NRDC_RAT. There are two ways to get the sequence and/or accesion number. You can go back to GCG, FETCH SWISSPROT:NRDC_RAT , and read the annotation with a text editor, or use the TOFASTA program to spit out the sequence in FASTA format,

tofasta  SWISSPROT:NRDC_RAT 

Or you can use ENTREZ to do the search (also at NCBI), and grab the NRDC sequence directly, then copy and paste into the BLAST window.

Choose the month from the Database pulldown menu to search just the sequences that have been added to GenBank in the past month. Then click the big blue BLAST! button. Your search will take an amount of time that is proportional to the amount of cpu time required and the number of other people running searches at the same time - weekday afternoons are the busiest time.

Once the search is done, it comes back in a new WWW window. The nice thing about this window is that all the sequences mentioned in the list of "hits" are linked directly to the GenBank sequences - the bad thing is that this data is only in a browser window, not stored on your computer in any way. You can save text files of the search results, or of each sequence, but in order to do further work with them, you will need to get the sequences (or at least their accession numbers) into your RCR account. I usually save the "Sequences producing High-scoring Segment Pairs:" list as a text file, and then FETCH the sequences based on the accession numbers - or for PILEUP, just use the text file as an input list file - and remember to add your query sequence to the list.

gb|AA184065|AA184065 mo96g04.r1 Stratagene mouse testi... +1   867  3.3e-113  1
gb|AA171061|AA171061 ms50e07.r1 Life Tech mouse embryo... +3   215  1.2e-23   2
gb|AA180898|AA180898 zp41g07.s1 Stratagene muscle 9372... -2    93  0.00014   1
dbj|D90791|D90791    E.coli genomic DNA, Kohara clone ... -2    79  0.058     3
gb|AA170383|AA170383 ms60h06.r1 Stratagene mouse embry... +2    76  0.25      1
gb|U75941|HSU75941   Human mitochondrial control regio... +3    40  0.97      2

As you look at the actual alignments, something should strike your eye. There seem to be a lot of D and E amino acids, and these are the ones that are matching. In fact regions rich in Aspartic acid and Glutamic acid are fairly common among many different classes of proteins. In a larger sense, this phenomenon of "skewed sequence composition" frequently causes false matches with similarity programs.

OK, lets try the same BLAST search again, but from within GCG using the "NETBLAST+" program.
Type:

netblast+ nrdc_rat.swissprot 

Again choose to search the "month" database and take the default parameters. BLAST searches in GCG are actually sent to the same place at NCBI that the WWW searches go, but seem to wait in a different que. This is probably faster than the WWW server.

Trying cruncher.nlm.nih.gov (130.14.25.175)

Connected to cruncher.nlm.nih.gov
Waiting for 14 of 15 other BLAST requests on the system to finish.
Still waiting
Still waiting ...

Finally your result comes back as a standard GCG text output file: nrdc_rat.blastp
This looks very similar to the FASTA output, so it makes sense to compare the two. First of all, you will find some different sequences, and for the ones that are in both lists, they will be ranked differently. This is a good thing - indicating that these two programs perform the database searching task differently, so that by using both of them, you are likely to have a more complete search.

Now you can use NETFETCH to grab all of these sequences (or the top 10 or whatever) into your RCR account.

OK, now lets try something a bit more challenging. Use the BLAST website (and any other tools that you can think of) to analyse a truly unknown sequence. Here it is:


  1  AAATAATTTG GTTATATCTT TTTTTTTATC GCATAAATAT AATTGTATTT
51 ATATATGCAT TTTGTGTATT CAAAATAATT CCCATATATA TACAATTTTT
101 TATATTTGTA AACATGCCTA CATAATATGC ATGTTATTAA TACAAATTTC
151 AATTACAACA TAAAATTTAT AAAAATCAAA TTGTCAATTT AAAAAAAAGA
201 AAAGAAACTT TATGTTTTAC GAAATATAGC GATCAATAGG AAAATATAAA
251 TAAAAAATTA TTTAGTGTTA AAAAAAACTT TAAAATAGAT AAAATTGTAA
301 TAAGATGGAA CATATTTTGT TCACTATAAT ATTTATAGGA AACTCATTGC
351 AAAATATATA ATTTTTTATA TATACGTCTG TGAAGCACGC TGCATTATTC
401 AAATnAAAAA AAAAAAAAAA AAAAAAAAAA AAAAAGTCCT CTGCGTTGTT
451 ACCACTGCTT AAGGGCGAAT TCGTTTAAAC CTGCAGGACT AG

Yes, it is a real sequence - an mRNA.

Send me your best match by e-mail (along with the lists of 5 matches from the earlier FASTA searches: browns02@med.nyu.edu