Assignment #4: Finding Sequences
I. LOOKUP
A. Search for "Voltage Gated Calcium Channel Alpha Subunit"
- Launch the lookup program and type in "Voltage Gated Calcium Channel Alpha Subunit"
in the "All Text" field.
- Try: calcium channel & voltage & alpha
and then: calcium channel & alpha
- Do you get different answers using the entire string vs. selected
keywords?
- Which combination of search terms worked best to get a
complete listing of relevant database accessions (with a
minimum of irrelevant entries in your list)?
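Why a full phrase and "&"-joined keywords can give different answers may be easier to see with ordinary Unix tools. The following is only a sketch: the description lines are invented for illustration (they are not real database entries), grep stands in for LOOKUP, and treating "calcium channel" as a single unit is an assumption. LOOKUP's real indexing is not modeled here.

```shell
# Hypothetical description lines (NOT real database entries).
descriptions='Voltage Gated Calcium Channel Alpha Subunit
Voltage-dependent calcium channel alpha-1 subunit
calcium channel, voltage-gated, alpha 2/delta subunit
L-type calcium channel alpha-1C subunit
calcium channel beta subunit
sodium channel alpha subunit'

# Exact phrase: all words must appear together, in this order.
phrase_hits=$(printf '%s\n' "$descriptions" |
  grep -ci 'voltage gated calcium channel alpha subunit')

# Three keywords joined with "&": each must appear somewhere on the line.
three_hits=$(printf '%s\n' "$descriptions" |
  grep -i 'calcium channel' | grep -i 'voltage' | grep -ci 'alpha')

# Two keywords: a looser search that matches more lines.
two_hits=$(printf '%s\n' "$descriptions" |
  grep -i 'calcium channel' | grep -ci 'alpha')

echo "phrase=$phrase_hits three=$three_hits two=$two_hits"
# prints: phrase=1 three=3 two=4
```

In this toy list the exact phrase misses entries that word the same idea differently, while dropping "voltage" picks up an entry that never mentions it. Whether the real databases behave the same way is what the exercise asks you to find out.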
B. Search by Author for a particular sequence mentioned in an article
- Let's find the sequence of the TFE3 Transcription enhancer mentioned in the
following article: TFE3: A helix-loop-helix protein that activates transcription
through the immunoglobulin enhancer uE3 motif. Beckmann, H., Su, L-K., Kadesch, T. 1990.
Genes and Development 4:167-179.
- try searching "TFE3" under "All Text"
- now try searching "Beckmann & Su & Kadesch" under "Author"
(move down to "Author" with the Tab key)
- now try searching for TFE3 under "All Text" and Beckmann under "Author"
- Which is the simplest way to find the correct sequence?
- Now let's do a search that finds a lot of sequences.
Enter "histone" in the All Text field and
"homo" in the Organism field (no quotes)
- That is too many, so narrow down the search: save these entries to a file
named "histone.long", and use that file as input for another
lookup search.
- Type "1" to save the file and give it the name "histone.long".
- Now re-launch lookup with the histone.long file as its data library,
using the following command:
> lookup -INFILE=@histone.long
- You will not see the database selection window. Now let's reduce
our list of histone sequences to just human histone H2As:
in the "All Text" field type "H2A". Save this reduced list to a file named
histone.h2a and remember to delete histone.long.
- Now you are ready to FETCH all of these sequences into your working directory
II. FETCH
FETCH is a GCG program that has several different uses. At its most basic, it is a retrieval tool
that allows you to copy any sequence from the RCR's database to your personal directory.
Many people FETCH sequences in order to read the annotation and have a look at the sequence, but
you do not actually need to have the sequence in your working directory in order to use it for
manipulations with GCG programs.
To FETCH a sequence requires that you know either an accession number or an exact sequence name.
(This can most easily be obtained from LOOKUP, or from a published paper.)
- FETCH can also be used as a rough-and-ready tool for grabbing groups of
sequences with similar names such as
> fetch *bov*
- This will copy most bovine sequences and a whole lot of other nonsense.
Since sequence names are by no means standardized, this approach is essentially futile:
you can't get everything that you want and you can't eliminate everything else.
- Another use of FETCH is to grab all of the files in a list file such as the list generated
by LOOKUP.
- Try this out by FETCHing the sequences listed in the histone.h2a file.
But wait - you are about to load a huge number of files into your
working directory; how will you get rid of them later? Let's create
a sub-directory for this exercise:
> mkdir Test
- copy the histone.h2a file into the new Test directory
> cp histone.h2a Test
- Now move into the directory you just created
> cd Test
- Now make a quick check to be sure that worked:
> ls
- OK now use the command:
> fetch @histone.h2a
(the little "@" sign is required whenever you use a list file as input for
any GCG program that can read list files).
- And look at the results:
> ls
Plenty of sequence files, eh?
- Browse one or two with more (or emacs), then delete them all.
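The point of the sub-directory is that the clean-up becomes a single command. Here is a rough sketch of the whole round trip. Empty placeholder files stand in for the FETCH output (their names are invented), and a temporary directory is used so the sketch can be run without touching real work.

```shell
# Work in a throwaway location so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Stand-in for the list file saved from LOOKUP.
touch histone.h2a

# Same steps as in the exercise.
mkdir Test
cp histone.h2a Test
cd Test || exit 1

# Placeholders for the sequence files FETCH would write here
# (hypothetical names; real names depend on the database entries).
touch example1.seq example2.seq example3.seq
ls

# Clean-up: step back out and remove the sub-directory with its contents.
cd ..
rm -r Test
ls
```

The original histone.h2a in the parent directory survives, because only the copy inside Test was removed.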
III. NETFETCH
Sometimes (i.e. often) there will be inconsistencies in the naming or
indexing of sequences. This is particularly problematic for
"hypothetical" protein sequences that are derived from cDNAs. The
accession #s and locusnames reported by BLAST or in journal articles
may not work with FETCH on our system.
As an alternative, the NETFETCH program
retrieves sequences directly from GenBank over the Internet.
- Try:
> fetch q9beg9
- Now try:
> netfetch q9beg9
- Look at the q9beg9.rsf file with more
- Try to use the q9beg9.rsf file as the query sequence for a FASTA search
- You have to reformat the stupid thing! Like this:
> reformat q9beg9.rsf{*}
- Now you can use the reformatted file for a FASTA search, but pay
careful attention to the new filename created by reformat.
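The "try FETCH first, then NETFETCH" routine could be wrapped in a small shell function. This is a sketch that rests on one big assumption: that both programs exit with a non-zero status when they cannot find the sequence, which should be checked on your system before relying on it. The function name first_success and the variable used are made up for this example.

```shell
# first_success NAME CMD...: run each CMD with NAME as its argument and
# stop at the first one that exits with status 0.
# Assumption: the programs signal "not found" with a non-zero exit status.
first_success() {
  name=$1
  shift
  for cmd in "$@"; do
    if "$cmd" "$name"; then
      used=$cmd   # remember which program worked
      return 0
    fi
  done
  used=
  return 1
}

# Demonstration with stand-ins: "false" always fails, "true" always succeeds.
first_success q9beg9 false true
echo "retrieved with: $used"
# prints: retrieved with: true
```

On the GCG system the call would read `first_success q9beg9 fetch netfetch`, again assuming the exit statuses behave as described.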
IV. ENTREZ
Let's move away from GCG and look at sequence finding over the Web. The neatest tool is
ENTREZ run by NCBI at the National Library of Medicine in Maryland. ENTREZ is a smart
sequence retrieval tool which can do far more than just locate sequences by name, accession
number, author etc. To start ENTREZ, point a Web browser at http://www3.ncbi.nlm.nih.gov/Entrez/
and choose what database to start searching (nucleotide, protein, or MEDLINE references).
As an exercise, we will repeat our test search for Voltage Gated Calcium Channel Alpha Subunit
- Type your query terms into the search field under the "NCBI"
banner: "calcium channel" and click the "Go" button.
- Too many accessions are found, so narrow the search with another search
keyword. In the search field, type "alpha" and click "Go" again.
- It is useful to use the "History" function to combine queries.
- At the top of the window it now shows your current query,
followed by some grey fields that allow you to customize the display.
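The same combined query can also be written as a single URL. NCBI offers a scriptable interface to the Entrez databases called E-utilities; it is not needed for this assignment. The sketch below only builds and prints the search URL, leaving the network call commented out. The choice of the protein database and the hand-rolled URL encoding (quotes and spaces only) are simplifying assumptions for this one query.

```shell
# NCBI E-utilities "esearch" endpoint.
base='https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'

# Combine the two search terms with AND, as done by hand in the browser.
term='"calcium channel" AND alpha'

# Minimal URL encoding, enough for this query only: quotes and spaces.
encoded=$(printf '%s' "$term" | sed -e 's/"/%22/g' -e 's/ /+/g')

url="${base}?db=protein&term=${encoded}"
echo "$url"

# To actually run the search (requires network access), one could use:
# curl -s "$url"
```

Adding a further keyword to narrow the search means appending another "AND keyword" to the term before encoding.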
This is where the real power of ENTREZ becomes evident. Each protein sequence is linked
to all corresponding DNA sequences, as well as to all similar protein sequences (pre-computed
with BLAST). Proteins are also linked to all MEDLINE references that mention that sequence.
These linked sequences and references have their own links, so from virtually any starting point,
you can expand your search horizontally to learn about entire families of related database sequences.
- Follow some of these links to find and save as a text file the abstract of the following article:
De Jongh, 1990 Subunits of purified calcium channels. Alpha 2 and delta are encoded by the
same gene. J Biol Chem 265, 14738-41 (1990) [90368635]