
Research Computing News
Volume 6, Number 1
May 1998
Contents
Searching for Similarity to ESTs
RCR Installs Pearson's FASTA 3.0
News from Academic Computing
Computer Security
RCR Upgrades to CLUSTAL W 1.7
Searching for Similarity to ESTs
A top priority in the multi-disciplinary effort to sequence the human genome is the identification of all of the genes. Since only 2-3% of human genomic DNA actually encodes proteins, it makes sense to focus sequencing efforts on these "expressed" regions first. This is the basis for a number of Expressed Sequence Tag (EST) projects. ESTs are short (200-500 base pair) "single pass" sequencing reads taken from both the 3' and 5' ends of cDNA clones. The sequence data is then fed directly from automated sequencers into the data banks with no error correction and only automatic annotation. The EST sequences are neither complete genes nor complete cDNAs, but are merely gene "tags." GenBank contains over 1.4 million of these EST sequences. The EST section (dbEST) now constitutes more than half of GenBank and is growing much faster than the other sections.
It is likely that the vast amount of EST data contains at least part of virtually all of the expressed human genes. Exceptions most likely include genes expressed in tiny quantities, in rare specialized cell types, or at restricted developmental periods. Various EST sequencing projects have addressed this issue by sequencing ESTs from a variety of tissues and developmental stages, but it is impossible to be absolutely comprehensive. Comparisons of EST data from various tissues can be used as a measure of tissue specific expression patterns. There are some genes such as histones and non-coding RNA genes that are underrepresented in the EST databanks because their RNA transcripts lack poly-A tails.
EST sequences are short and contain many errors, particularly frameshifts, which make the identification of open reading frames (protein coding regions) very difficult. The dataset is huge and highly redundant, containing thousands of copies of many common genes. Yet the elusive value of ESTs is clear: every gene is probably in there somewhere. Searches of EST databanks are an essential component of the search for new disease genes.
The BLAST (Basic Local Alignment Search Tool) similarity search engine at the NCBI (National Center for Biotechnology Information), http://www.ncbi.nlm.nih.gov/BLAST/, can be used to search for EST sequences that are similar to a specific query sequence. Since EST sequences are likely to contain many incorrect bases and insertions/deletions, it is best to use the new Gapped BLAST (BLAST 2.0) tool:
http://www.ncbi.nlm.nih.gov/cgi-bin/BLAST/nph-newblast?Jform=0
Also, since protein-protein comparisons are much more sensitive than DNA-DNA comparisons, use TBLASTN for protein query sequences (compares your protein to six-frame translated ESTs) and TBLASTX for DNA query sequences (translates your query in six frames and compares it to six-frame translated ESTs). In addition to BLAST searches of the entire dbEST database, NCBI offers subsets of mouse ESTs, human ESTs and "other ESTs" for searching. Do not be confused by other database options such as dbSTS (Sequence Tagged Sites - short marker sequences for genome mapping projects), gss (Genome Survey Sequences) or htgs (High Throughput Genomic Sequences); the data in these sections are not derived from cDNAs and are not directly comparable to ESTs.
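The six-frame translation that TBLASTN and TBLASTX rely on can be illustrated with a short Python sketch. This is only an illustration under stated assumptions: it uses the standard genetic code, writes any incomplete or ambiguous codon as "X", and the function names (reverse_complement, translate, six_frame_translations) are made up for this example and are not part of BLAST.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption: no
# alternative codon tables are needed for this illustration).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from the start of dna; unknown codons become X."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to protein strings."""
    frames = {}
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for label, protein in six_frame_translations("ATGGCCTAA").items():
        print(label, protein)
```

A single-base insertion or deletion in an EST shifts the correct protein from one of these frames to another, which is why frameshift errors make reading frames in ESTs hard to follow.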
When might a typical molecular biologist benefit from a search of EST data?
- If similarity searches of other databases have not yielded any significant matches, then the ESTs can be used as a databank of last resort. If a sequence matches nothing in the EST databank, then the chances are very good that it is not an expressed sequence.
- The EST databanks contain new members of known protein families. Researchers who suspect that they have identified a new protein by function may discover candidates among ESTs that are similar, but not identical to other known proteins.
- EST sequences include many mutations and allelic variations of known genes. Investigators interested in gene diversity or seeking polymorphisms as genetic markers can obtain useful data from the EST databanks.
If an interesting EST sequence is identified in the databanks, the original cDNA clone from which that sequence was derived can be obtained from the I.M.A.G.E. Consortium for $45 for the first clone and $24 for each additional clone:
http://www.atcc.org/hilights/tasc2.html
The redundancy of the EST databanks can be used to partially compensate for the high error rate and the fragmentary nature of the individual sequences. The reverse transcription reactions that are used to create cDNA libraries do not all run to completion, so the pool of cDNA clones contains many partial sequences of varying lengths. EST databases will therefore contain partially overlapping sequences from the same gene. Several groups are working on procedures to cluster all ESTs that are derived from a single gene. In this way, partial and fragmentary sequences can be assembled into contigs which correspond to whole gene transcripts. Thus, the vast, low-quality EST databanks can be collapsed into a single consensus set of non-redundant transcripts from all expressed human genes.
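As a rough illustration of how redundancy can offset sequencing errors, the Python sketch below builds a consensus by majority vote over overlapping reads. It assumes the reads have already been placed at known offsets on a common coordinate system and differ only by substitutions; real clustering procedures must also find the overlaps and handle insertions and deletions. The function name consensus and the example reads are hypothetical.

```python
from collections import Counter


def consensus(placed_reads):
    """Majority-vote consensus from (offset, sequence) pairs.

    Assumption: each read is already positioned at its offset, so column i
    of the consensus is voted on by every read covering position i.
    Positions covered by no read are reported as N.
    """
    length = max(offset + len(seq) for offset, seq in placed_reads)
    columns = [Counter() for _ in range(length)]
    for offset, seq in placed_reads:
        for i, base in enumerate(seq.upper()):
            columns[offset + i][base] += 1
    return "".join(
        col.most_common(1)[0][0] if col else "N" for col in columns
    )


if __name__ == "__main__":
    reads = [
        (0, "ACGTAC"),
        (2, "GAACGG"),     # contains one substitution error (A instead of T)
        (1, "CGTACGGTA"),
        (4, "ACGGTA"),
    ]
    print(consensus(reads))
```

In this example the erroneous base in the second read is outvoted by the two other reads that cover the same position.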
Investigators can conduct routine similarity searches against the EST clusters, which require only a tiny fraction of the disk storage space used by the complete EST databank; and these searches can be computed much more quickly. Similarity matches to the EST clusters will be much more informative than matches to the current set of minimally annotated ESTs. Those wishing to examine tissue specific gene frequencies or allelic sequence variations can extract the original EST sequences used to create the consensus from a central repository (i.e. dbEST at the NCBI website).
There are three different sets of clustered human ESTs available on the web. Each set has been created with different software using different assumptions as to data quality, minimum overlap, etc. The three sets of consensus sequences are available for download, so investigators can use their own searching software against these databases.
UNIGENE
The National Center for Biotechnology Information, the home of GenBank and BLAST, has created UniGene, a collection of human EST sequences from GenBank that have been grouped into clusters by sequence similarity. Some of these clusters correspond to genes that have been identified by other means (e.g. functional cloning) while others are genes of unknown function. As of January 1998, UniGene contains 39,377 clusters. This is a much more manageable set of sequences to search than the 1,409,359 sequences in the complete EST section of GenBank (January 1998). The UniGene collection is also indexed by tissue of origin, so it is possible to search for genes that are expressed in a particular tissue.
http://www.ncbi.nlm.nih.gov/UniGene/index.html
TIGR Human Gene Index
http://www.tigr.org/tdb/hgi/hgi.html
The Institute for Genomic Research has created a Human Gene Index (HGI) for their own collection of ESTs. As of July 1997 (HGI release 3.2) TIGR has identified 60,388 "tentative human consensus" sequences (THCs) by assembling over 600,000 ESTs into complete "virtual transcripts." THCs are consensus sequences based on two or more ESTs that overlap for at least 40 bases with at least 95% sequence identity. THCs retain information about the source library and abundance of the ESTs from which they are composed. TIGR also provides a free web-based similarity search service that can be used to compare query sequences against the HGI. Similarity results are returned by e-mail in less than 5 minutes.
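The THC overlap criterion (at least 40 bases at 95% identity or better) can be sketched in Python as follows. This is not TIGR's assembly software, which is not described here; it is a simplified, ungapped check of whether the end of one read overlaps the start of another, and the function name best_overlap and its parameters are assumptions made for this example.

```python
def best_overlap(left, right, min_len=40, min_identity=0.95):
    """Find the longest suffix(left)/prefix(right) overlap meeting both thresholds.

    Returns (overlap_length, identity), or None if no overlap qualifies.
    Assumption: the overlap is ungapped, so only substitutions are tolerated.
    """
    for length in range(min(len(left), len(right)), min_len - 1, -1):
        matches = sum(a == b for a, b in zip(left[-length:], right[:length]))
        identity = matches / length
        if identity >= min_identity:
            return length, identity
    return None


if __name__ == "__main__":
    shared = "ACGTTGCA" * 6 + "GG"       # 50 bases present in both reads
    read_a = "N" * 20 + shared           # N marks unrelated flanking sequence
    read_b = shared + "N" * 20
    print(best_overlap(read_a, read_b))
```

Because ESTs frequently contain insertions and deletions, a production clustering tool would use a gapped alignment in place of the position-by-position comparison shown here.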
STACK
http://ziggy.sanbi.ac.za/stack/
The South African National Bioinformatics Institute (SANBI) at the University of the Western Cape has created an EST consensus database known as STACK (Sequence Tag Alignment and Consensus Knowledgebase). It consists of all available human EST sequences in GenBank, clustered, aligned, and made into joined consensus records. STACK consensus sequences are often longer than TIGR or UniGene clusters. STACK creates separate consensus clusters for ESTs from each of 17 different tissue types.
SANBI provides a free web-based BLAST search utility that can be used to compare query sequences against the STACK clusters. Similarity searches can be calculated as DNA-DNA (BLASTN), protein query against 6 frame translations of the STACK database (TBLASTN), or 6 frame translation of a DNA query against 6 frame translations of the STACK database (TBLASTX). There is also the option of returning search results by e-mail.
The EST databanks give researchers access to basic data from the genome project almost as soon as it is collected in the lab. This allows those investigators who are willing to sift through mountains of low quality data to discover some hidden jewels. For most others, the pre-analyzed clustered EST databases will be much more useful. The promise of EST data remains that "Your gene is out there." You just have to have the patience and persistence to look for it, and be smart enough to know when you have found it.
The EST section of GenBank is directly searchable on the NCBI website:
http://www.ncbi.nlm.nih.gov/dbEST/index.html
The dbEST database can be searched by any of the following keywords:
DBID  EST id number
IDS   EST name, GenBank accession, gi number
CLIN  Clone information, Source information
COM   Comments
LIB   Library name and organism
LIBX  Library description
SUB   Submitter info
CIT   Citation info
MAP   Mapping data
NBR   Homology (neighboring) information
Finding gems in the EST databanks:
Adams et al. Initial assessment of human gene diversity and expression patterns based upon 83 million nucleotides of cDNA sequence. Nature. 1995; 377(Suppl): 3-174.
Capone MC, et al. Identification through bioinformatics of cDNAs encoding human thymic shared Ag-1/stem cell Ag-2. A new member of the human Ly-6 family. J Immunol. 1996 Aug 1; 157(3): 969-973.
Krizman DB. Gene identification by 3' terminal exon trapping. Genet Eng (N Y). 1996; 18: 49-56.
Wells TN, Peitsch MC. The chemokine information source: identification and characterization of novel chemokines using the WorldWideWeb and expressed sequence tag databases. J Leukoc Biol. 1997 May; 61(5): 545-550.
RCR Installs Pearson's FASTA 3.0
The RCR has installed FASTA release 3.0, the package of sequence similarity programs developed by Bill Pearson. FASTA 3.0 functions as a stand-alone group of programs running on the ALPHA and is not a part of the GCG molecular biology package (GCG uses the older FASTA version 2.0). FASTA 3.0 includes four programs: FASTA3, TFASTA3, FASTX3, and TFASTX3. The FASTA3 and TFASTA3 programs are very similar to the FASTA and TFASTA programs currently available in GCG. However, FASTX3 and TFASTX3 are new programs that were previously not available to RCR users. FASTX3 translates a DNA query sequence in all six reading frames and compares it to a protein database (similar to BLASTX). TFASTX3 compares a protein query sequence to a six-frame translated DNA database, calculating similarities with frameshifts in the forward and reverse orientations. This frameshift capability of TFASTX3 will be particularly useful for investigators interested in making similarity searches of ESTs. Previously the only way to make this type of search was to use the painfully slow GCG program FRAMESEARCH.
The new FASTA 3.0 programs can be accessed, whether or not GCG is activated, by typing FASTA3, TFASTA3, FASTX3, or TFASTX3. The program will then prompt the user for a query sequence. The query sequence may be in either FASTA or GCG format. FASTA3 will accept either a protein or a DNA query sequence, FASTX3 will only accept a DNA query sequence, and TFASTA3 and TFASTX3 will only accept a protein query sequence. Once a query sequence is specified, the user will be prompted to choose a sequence database (library) to be searched. Personal data sets (in FASTA or GCG format) can also be specified by filename. All of the FASTA 3.0 programs will accept UNIX style command line options (e.g. "-b", "-H", etc.). A list of these options is available in the FASTA 3.0 documentation file on the RCR website:
mcrcr0.med.nyu.edu/rcr/fasta3.html
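The query-type rules described above (FASTA3 takes protein or DNA, FASTX3 takes DNA only, TFASTA3 and TFASTX3 take protein only) can be summarized in a small Python sketch that reads a FASTA-format query and lists the programs that would accept it. The alphabet-based guess of sequence type is an assumption made for this illustration, not the method the FASTA programs themselves use, and the function names are hypothetical.

```python
def read_fasta(text):
    """Return (header, sequence) for the first record of FASTA-format text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                break  # stop at the second record
            header = line[1:]
        elif header is not None:
            chunks.append(line)
    return header, "".join(chunks).upper()


def guess_type(sequence, threshold=0.9):
    """Guess "DNA" or "protein" from letter composition (illustrative heuristic)."""
    letters = [c for c in sequence.upper() if c.isalpha()]
    if not letters:
        raise ValueError("empty sequence")
    dna_like = sum(c in "ACGTUN" for c in letters)
    return "DNA" if dna_like / len(letters) >= threshold else "protein"


# Program/query compatibility as stated in the article text.
PROGRAMS_BY_QUERY = {
    "DNA": ["FASTA3", "FASTX3"],
    "protein": ["FASTA3", "TFASTA3", "TFASTX3"],
}


def programs_for(sequence):
    """List the FASTA 3.0 programs that accept this kind of query."""
    return PROGRAMS_BY_QUERY[guess_type(sequence)]


if __name__ == "__main__":
    header, seq = read_fasta(">query1 example\nATGGCC\nTAA\n")
    print(header, seq, programs_for(seq))
```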
The following databases are available for searching with FASTA 3.0 on the RCR system:
Protein databases
W: SwissProt
D: PIR
X: GenPept
L: TREMBL
O: Owl non-redundant
DNA databases (GenBank sub-divisions)
G: GB All
R: GB Rodent
N: GB New
S: GB RNA
E: GB EST
Y: GB Synthetic
H: GB Primate
O: GB Organelle
B: GB Bacterial
P: GB Plant
I: GB Invertebrate
A: GB Phage
M: GB Mammals
T: GB Patent
V: GB Vertebrates
U: GB Unannotated
L: GB Viral
News from Academic Computing
New Search Engine Available: Infoseek's Ultraseek
Academic Computing announces the availability of InfoSeek's internet search engine UltraSeek. Users can search the entire NYU Health Systems Center from http://mcrcr2.med.nyu.edu:8765/
Currently version 1.2 is supported. Look for 2.0, with its advanced search interface and "find similar documents" feature, in May. UltraSeek has a fast and simple interface (just enter words of interest) as well as more advanced features such as boolean searches (using + and - prefixes on words) and the ability to restrict searches by meta tags or URL. To add your URL or to create a separate collection, private or public, send mail to: webmaster@mcrcr2.med.nyu.edu.
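For readers unfamiliar with prefix-style boolean searches, the Python sketch below shows one common interpretation: a word prefixed with + must appear, a word prefixed with - must not appear, and unprefixed words are optional. This interpretation is an assumption for illustration only; it is not taken from the UltraSeek documentation, and the functions parse_query and matches are hypothetical.

```python
import re


def parse_query(query):
    """Split a query into required (+word), excluded (-word) and optional words."""
    required, excluded, optional = set(), set(), set()
    for term in query.lower().split():
        if term.startswith("+") and len(term) > 1:
            required.add(term[1:])
        elif term.startswith("-") and len(term) > 1:
            excluded.add(term[1:])
        else:
            optional.add(term)
    return required, excluded, optional


def matches(document, query):
    """Return True if the document satisfies the query under the assumed rules."""
    words = set(re.findall(r"[a-z0-9]+", document.lower()))
    required, excluded, optional = parse_query(query)
    if not required <= words:
        return False          # a +word is missing
    if excluded & words:
        return False          # a -word is present
    # With no +words, at least one optional word must be present (if any given).
    return bool(required) or not optional or bool(optional & words)


if __name__ == "__main__":
    print(matches("FASTA similarity search on the Alpha", "+fasta -gcg"))
    print(matches("GCG uses the older FASTA version", "+fasta -gcg"))
```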
New Email Server
After two months of testing, Academic Computing has migrated all POPMAIL users from the old DEC Ultrix server mchip00.med.nyu.edu to the new DEC Alpha Digital UNIX server endeavor.med.nyu.edu. The endeavor server now responds to all requests to send mail and read mail for the popmail.med.nyu.edu server.
The new email server uses the Cyrus package from Carnegie Mellon University. It supports user mail folders on the server and both POP3 (Eudora, Outlook, Netscape) and IMAP4 (Simeon, Netscape, Mulberry, Messenger, Maildrop) clients. The Alpha server has more power than the Ultrix server, and the Cyrus mail server is tuned for IMAP clients and large mailboxes.
Users can also read their email using Telnet and Pine. Previously popmail users had to use an email client on the PC or Mac such as Eudora or Netscape to read email. Now users can read email on the server using Pine by making a Telnet connection to popmail.med.nyu.edu. Additional configuration is required so users must request this option.
Computer Security
The loss of your files or serious damage to the computer you use could be a disaster for you (kiss that grant proposal goodbye, forget your address books, and data, schmata, it is all gone!). Yet many people increase the risk of damage and loss through neglectful, risky and sometimes downright irresponsible behavior. This ranges from failing to keep backup copies of key documents to giving others the account names and passwords for public machines. The recent hacker attack on an MC machine, which took several weeks of work to repair and disrupted an entire department by depriving it of email and research tools, highlights just one of the terrible consequences of a machine failure, and the incident was far less damaging than it might have been.
HOW CAN I HELP?
- If you have a PC or Mac: Keep backups of your data and key files. Install and keep anti-virus software up-to-date. Don't leave your machine open for others to use when you are away from it. Invest in a maintenance contract.
- If you have an account on a multi-user machine (e.g. the RCR Cluster): NEVER give your password to anyone for any reason (NO, your supervisor is NOT entitled to know your password). Don't leave an open session (telnet, email, anything) that a passerby can use.
- Don't rely on the daily system backup to protect your key documents; maintain your own separate copy.
- If you own a multi-user computer (e.g. a SUN or SGI workstation): Employ a trustworthy, knowledgeable system manager for a minimum of 1 day a week. Keep the software up to date. Install all security patches. Have Network Services survey your machine for security issues. (If none of this makes sense to you, you should consider GETTING RID of the machine or, at a minimum, removing it from NYU-Net.)
In all cases: As a computer user and owner, you have full responsibility for your own material and for contributing to an environment where others can be secure. Stay on top of basic security by understanding your machine, and ASK QUESTIONS to be sure you're not the individual hurt when the next serious problem arises.
RCR Upgrades to CLUSTAL W 1.7
The RCR has upgraded to version 1.7 of the CLUSTAL W multiple sequence alignment program.
[Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994) CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research, 22:4673-4680.]
CLUSTAL W is a stand-alone program, not part of the GCG sequence analysis package, but it can work together with GCG. CLUSTAL W can use GCG formatted sequences as input and it can create multiple alignments in GCG style MSF files as output.
CLUSTAL is superior to the GCG multiple alignment program PILEUP in several ways. Both programs use a progressive pairwise method to build multiple alignments but PILEUP uses a global alignment algorithm (similar to