Assignment #4: Finding Sequences

I. LOOKUP



A. Search for "Voltage Gated Calcium Channel Alpha Subunit"

  1. Launch the lookup program and type "Voltage Gated Calcium Channel Alpha Subunit" into the "All Text" field.
  2. Now try the keyword combination calcium channel & voltage & alpha, and then calcium channel & alpha.
  3. Do you get different answers using the entire string vs. selected keywords?
  4. Which combination of search terms worked best to get a complete listing of relevant database accessions (with a minimum of irrelevant entries in your list)?
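
To see why a whole phrase and a set of &-joined keywords can return different lists, here is a minimal shell sketch that uses grep on a made-up file of definition lines. The file name toy_definitions.txt and its five entries are invented for illustration, and grep only imitates the idea of phrase versus keyword matching; it is not how LOOKUP searches its indexes.

```shell
# Toy definition lines (invented for illustration; not real database entries).
cat > toy_definitions.txt <<'EOF'
voltage gated calcium channel alpha subunit, cardiac
calcium channel, voltage-dependent, alpha 1 subunit
calcium channel alpha 2/delta subunit
calcium channel beta subunit
alpha tubulin
EOF

# Whole phrase: the words must appear together, in this order and spelling.
phrase_hits=$(grep -ic "voltage gated calcium channel alpha subunit" toy_definitions.txt)

# Three keywords joined with AND: each term may appear anywhere in the line.
keyword_hits=$(grep -i "calcium channel" toy_definitions.txt | grep -i "voltage" | grep -ic "alpha")

# Two keywords joined with AND: a broader search again.
broad_hits=$(grep -i "calcium channel" toy_definitions.txt | grep -ic "alpha")

echo "phrase: $phrase_hits  three keywords: $keyword_hits  two keywords: $broad_hits"
rm toy_definitions.txt
```

Each keyword you drop widens the list, and each one you add narrows it, so the best combination is the one that keeps the relevant entries while excluding lines such as the tubulin entry above.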

B. Search by Author for a particular sequence mentioned in an article

  1. Let's find the sequence of the TFE3 transcription enhancer mentioned in the following article: TFE3: A helix-loop-helix protein that activates transcription through the immunoglobulin enhancer uE3 motif. Beckmann, H., Su, L.-K., Kadesch, T. 1990. Genes and Development 4:167-179.

    1. try searching "TFE3" under "All Text"
    2. now try searching "Beckmann & Su & Kadesch" under "Author" (move down to "Author" with the Tab key)
    3. now try searching for TFE3 under "All Text" and Beckmann under "Author"


  2. Which is the simplest way to find the correct sequence?

  3. Now let's do a search that finds a lot of sequences.

    Enter "histone" in the All Text field and "homo" in the Organism field (no quotes)

  4. That is too many to be useful, so narrow the search down: save these entries to a file named "histone.long", then use that file as the input for another lookup search.
  5. Type "1" to save the file and give it the name "histone.long". Now re-launch lookup with the histone.long file as its data library, using the following command:
    > lookup -INFILE=@histone.long
  6. You will not see the database selection window. Now let's reduce our list of histone sequences to just human histone H2As: in the "All Text" field type "H2A". Save this reduced list to a file named histone.h2a, and remember to delete histone.long.
  7. Now you are ready to FETCH all of these sequences into your working directory.

II. FETCH

FETCH is a GCG program that has several different uses. At its most basic, it is a retrieval tool that allows you to copy any sequence from the RCR's database to your personal directory.

Many people FETCH sequences in order to read the annotation and have a look at the sequence, but you do not actually need to have the sequence in your working directory in order to use it for manipulations with GCG programs.

To FETCH a sequence, you need to know either its accession number or its exact sequence name. (These are most easily obtained from LOOKUP or from a published paper.)

  1. FETCH can also be used as a rough-and-ready tool for grabbing groups of sequences with similar names, such as:
    > fetch *bov*
  2. This will copy most bovine sequences and a whole lot of other nonsense.
    Since sequence names are by no means standardized, this approach is essentially futile: you can't get everything that you want and you can't eliminate everything else.
  3. Another use of FETCH is to grab all of the files named in a list file, such as the list generated by LOOKUP.
  4. Try this out by FETCHing the sequences listed in the histone.h2a file.
    But wait: you are about to load a huge number of files into your working directory. How will you get rid of them later? Let's create a sub-directory for this exercise:
    > mkdir Test
  5. Copy the histone.h2a file into the new Test directory:
    > cp histone.h2a Test
  6. Now move into the directory you just created (directory names are case-sensitive):
    > cd Test
  7. Now make a quick check to be sure that worked:
    > ls
  8. OK now use the command:
    > fetch @histone.h2a

    (the little "@" sign is required whenever you use a list file as input for any GCG program that can read list files).

  9. And look at the results:
    > ls

    Plenty of sequence files, eh?

  10. Browse one or two with more (or emacs), then delete them all.
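
The sub-directory steps above, together with the clean-up at the end, can be sketched as a single shell session. This is only a sketch: the fetch line is the GCG command from the exercise and is left commented out so that the remaining lines run on any Unix shell, and the stand-in histone.h2a file is created only if you have not already saved one from LOOKUP. The variable names start_dir and files_in_test are made up for this illustration.

```shell
# Remember where we started so we can come back for the clean-up.
start_dir=$(pwd)

# Stand-in list file so the sketch is self-contained; in the exercise this
# file was written by LOOKUP and already exists.
[ -f histone.h2a ] || : > histone.h2a

mkdir -p Test            # scratch sub-directory for the fetched sequences
cp histone.h2a Test/     # the list file travels with us
cd Test

# fetch @histone.h2a     # GCG command: copies every sequence named in the list

files_in_test=$(ls | wc -l | tr -d ' ')
echo "files in Test: $files_in_test"

# Clean-up: step back out and remove the whole sub-directory in one go.
cd "$start_dir"
rm -r Test
```

Because everything FETCH wrote lives inside Test, one rm -r removes it all without touching the other files in your working directory.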

III. NETFETCH

Sometimes (i.e. often) there will be inconsistencies in the naming or indexing of sequences. This is particularly problematic for "hypothetical" protein sequences that are derived from cDNAs. The accession numbers and locus names reported by BLAST or in journal articles may not work with FETCH on our system. As an alternative, the NETFETCH program retrieves sequences directly from GenBank over the Internet.

  1. Try to fetch q9beg9

  2. Now try netfetch q9beg9

  3. Look at the q9beg9.rsf file with more

  4. Try to use the q9beg9.rsf file as the query sequence for a FASTA search

  5. You have to reformat the stupid thing! Like this:

    > reformat q9beg9.rsf{*}

  6. Now you can use the reformatted file for a FASTA search, but pay careful attention to the new filename created by reformat.

IV. ENTREZ

Let's move away from GCG and look at sequence finding over the Web. The neatest tool is ENTREZ, run by NCBI at the National Library of Medicine in Maryland. ENTREZ is a smart sequence retrieval tool that can do far more than just locate sequences by name, accession number, author, etc. To start ENTREZ, point a Web browser at http://www3.ncbi.nlm.nih.gov/Entrez/ and choose which database to start searching (nucleotide, protein, or MEDLINE references).

As an exercise, we will repeat our test search for Voltage Gated Calcium Channel Alpha Subunit.

  1. Type your query terms, "calcium channel", into the search field under the "NCBI" banner and click the "Go" button.
  2. Too many accessions are found, so narrow the search with another keyword. In the search field, type "alpha" and click "Go" again.
  3. It is useful to use the "History" function to combine queries.
  4. The top of the window now shows your current query, followed by some grey fields that allow you to customize the display.

    This is where the real power of ENTREZ becomes evident. Each protein sequence is linked to all corresponding DNA sequences, as well as to all similar protein sequences (pre-computed with BLAST). Proteins are also linked to all MEDLINE references that mention that sequence. These linked sequences and references have their own links, so from virtually any starting point, you can expand your search horizontally to learn about entire families of related database sequences.

  5. Follow some of these links to find the abstract of the following article, and save it as a text file:
    De Jongh et al. Subunits of purified calcium channels. Alpha 2 and delta are encoded by the same gene. J Biol Chem 265, 14738-41 (1990) [90368635]