LOOKUP
If you do not know an exact accession number, but just want to find all sequences that relate to a specific biological function, such as "G proteins", then it is necessary to use a program that can perform text-based searches of the annotation portion of database sequence entries.
GCG provides the program LOOKUP for this function.
The LOOKUP interface differs considerably from that of most GCG programs. This reflects the program's origin as a stand-alone database tool called SRS (Sequence Retrieval System), which was only incorporated into GCG in version 8.
GCG's documentation provides a concise description of LOOKUP:
LookUp uses the Sequence Retrieval System (SRS) created by Dr. Thure Etzold to identify sequences in sequence databases (CABIOS 9(1); 49-57 (1993)). For example, you can find all of the protein sequences published by a particular author or all of the sequences whose annotation contains a particular word.
The expressions you use to find sequences in a database are known as queries. LookUp presents a form on your screen that lets you enter the elements of your query. Then LookUp finds all the sequences that contain those elements. The output of LookUp is a list file of sequences (formerly known as a file of sequence names) suitable for input to any GCG program that accepts database input.
After typing the command LOOKUP, you will be presented with the following screen:

    LookUp identifies sequences by name, accession number, author, organism,
    keyword, title, reference, feature, definition, length, or date. The
    output is a list of sequences.

    The LookUp program is experimental in this release--please look
    carefully at your results.

    LOOKUP in what sequence libraries:

      a) swissprot
      b) sptrembl
      c) pir
      d) embl
      e) genbank
      f) em_tags
      g) gb_tags
      h) gb_tags2
      i) gb_tags3
      j) All libraries
      q) quit

    Please choose one or more (* f *):

In general, choose "All libraries", which is the default. This brings up the text entry (or "query") screen:
    Complete the query form below:

    All text:
    Definition:
    Author:
    Keyword:
    Sequence name:
    Accession number:
    Organism:
    Reference:
    Title:
    Feature:
    On or after (dd-mmm-yy):
    On or before (dd-mmm-yy):
    Shortest sequence length:
    Longest sequence length:

    Inter-field operator: AND
    Form of output list: Whole Entries

    Press [Ctrl-D] to continue.

Enter your search words in the "All text" field unless you know for certain that you are looking for a sequence name, organism, or other subset of the annotation data. Then type Control-Z.
LOOKUP searches are very fast, so you will soon see the output screen:
    Searching swissprot pir genbank gb_tags gb_updates

    4936 entries were found. Do you wish to:

      1) write out this list to a file
      2) preview the results
      3) refine the query
      4) choose different libraries
      q) quit

    Please choose one (* 1 *):

For a quick look at the results of your search, type "2".
Often the search finds too many (or too few) sequences, so the "refine query" option (#3) returns you to the query form where you can change the query terms.
The Boolean search terms "AND" and "OR" can be used to control searches more precisely.
By default, entries in different fields are combined with the AND operator, so you find only those sequence entries that match the terms in every field you filled in.
LookUp accepts question marks (?) or asterisks (*) as wildcards anywhere within a value.
- A question mark means any single character. A value of s?ith includes Smith, Slith, Sjith, etc., but does not include Sith.
- The asterisk "*" means anything or nothing. The value *smith* includes Smith, Hocsmith, Smithies, Hocsmithels, etc.
- Leading wildcards like *Smith significantly reduce the speed of LookUp.
- By default, LookUp treats every value in your query as if it ended with an asterisk wildcard.
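The wildcard rules above can be illustrated with a short Python sketch. This is not LookUp's own code: the function name `lookup_style_match` is hypothetical, and the sketch assumes case-insensitive matching (as the Smith examples suggest) and borrows the `?` and `*` semantics of Python's standard `fnmatch` module, which also treats square brackets specially.

```python
from fnmatch import fnmatchcase


def lookup_style_match(value: str, word: str) -> bool:
    """Return True if `word` matches the query `value` under the rules above.

    Hypothetical helper for illustration only; not part of GCG or SRS.
    """
    pattern = value.lower()
    # Every value is treated as if it ended with an asterisk wildcard.
    if not pattern.endswith("*"):
        pattern += "*"
    # Compare in lower case (assumption: matching ignores case).
    return fnmatchcase(word.lower(), pattern)


for value, word in [
    ("s?ith", "Smith"),          # "?" stands for exactly one character
    ("s?ith", "Sith"),           # no character for "?" to match
    ("*smith*", "Hocsmithels"),  # "*" stands for anything or nothing
    ("smith", "Smithies"),       # implicit trailing asterisk
]:
    print(f"{value!r} vs {word!r}: {lookup_style_match(value, word)}")
```

Only the second pair fails to match, because `?` requires exactly one character between the "s" and "ith".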
LookUp also accepts the logical operators "AND", "OR", and "NOT".
- Use the keyboard symbol "&" for "AND".
- Use "|" for "OR" (located below the "delete" key on most keyboards).
- Use "!" for "NOT".
- You can also use parentheses to group logical operators.
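The effect of the logical operators is easiest to see as set operations on the entries that each search word finds. The following Python sketch is only an analogy, not LookUp's query syntax or implementation: the entry names and annotation words are made up, `entries_with` is a hypothetical helper, and "NOT" is modeled as removing one set of entries from another.

```python
# Hypothetical entry names mapped to words found in their annotation.
annotations = {
    "ENTRY_A": {"g", "protein", "alpha", "subunit", "human"},
    "ENTRY_B": {"g", "protein", "beta", "subunit", "mouse"},
    "ENTRY_C": {"kinase", "human"},
}


def entries_with(word):
    """Return the set of entry names whose annotation contains `word`."""
    return {name for name, words in annotations.items() if word in words}


# AND: entries that contain both words.
both = entries_with("protein") & entries_with("human")

# OR: entries that contain either word.
either = entries_with("kinase") | entries_with("mouse")

# NOT (modeled here as set difference): human entries that are not kinases.
excluded = entries_with("human") - entries_with("kinase")

# Parentheses group operators: (alpha OR beta) AND subunit.
grouped = (entries_with("alpha") | entries_with("beta")) & entries_with("subunit")

print(sorted(both), sorted(either), sorted(excluded), sorted(grouped))
```

Adding a term with AND can only shrink the result, while OR can only enlarge it, which is a useful rule of thumb when a search returns too many or too few sequences.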
Once you are satisfied with the results of a LOOKUP search, save the search as a list file in your working directory by typing 1 (write out this list to a file).
Be sure to give the list a sensible name different from the default name of "lookup.list"; otherwise you will very quickly fill your directory with files that all have this same name.
This list file does not contain entire database entries, only sequence names and one or two lines of description.
To retrieve the entire database entries you must use FETCH. FETCH can use the list file as input and retrieve all of the entries at once if you treat the file created by LOOKUP as a "file of sequence names" and type
"FETCH @lookup.list"
In general it is not recommended to use the list files created by LOOKUP as input for PILEUP or other programs that can work with lists of sequence names since the lists generally contain both DNA and protein sequences.
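One way to work around the mixed lists is to split a list file into protein and nucleotide entries before passing it on. The Python sketch below is a rough illustration under stated assumptions, not a GCG tool: it assumes that the entries follow a line containing only "..", that each entry line starts with a specification of the form library:name, and that "!" introduces a comment. Check these assumptions against a real list file before relying on it. The function name `split_list_file` and the sample entries are hypothetical; the protein library names are taken from the library menu shown earlier.

```python
from collections import defaultdict

# Protein libraries from the LOOKUP library menu; all others are
# treated as nucleotide libraries in this sketch.
PROTEIN_LIBRARIES = {"swissprot", "sptrembl", "pir"}


def split_list_file(lines, protein_libraries=PROTEIN_LIBRARIES):
    """Group sequence specifications by type (hypothetical helper).

    Assumed layout: free text, a line containing only "..", then one
    "library:name" specification per line, optionally followed by text.
    """
    groups = defaultdict(list)
    in_body = False
    for raw in lines:
        line = raw.strip()
        if not in_body:
            # Skip the heading text up to and including the ".." divider.
            in_body = line == ".."
            continue
        if not line or line.startswith("!"):
            continue  # blank line or comment
        spec = line.split()[0]
        library = spec.split(":", 1)[0].lower()
        kind = "protein" if library in protein_libraries else "nucleotide"
        groups[kind].append(spec)
    return dict(groups)


# Made-up sample in the assumed layout.
sample = [
    "Hypothetical LOOKUP output",
    "..",
    "swissprot:entry_one   ! description text",
    "genbank:entry_two     ! description text",
    "",
    "pir:entry_three",
]
result = split_list_file(sample)
print(result)
```

Each group could then be written to its own list file, so that a program such as PILEUP receives only one type of sequence.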
Using Computers for Molecular Biology
Stuart M. Brown, Ph.D., RCR, NYU Medical Center Comments to: browns02@mcrcr.med.nyu.edu