
Research Computing News
Volume 5, Number 3
October 1997
Contents
Free Medline Searches on the Web
Academic Computing Group Formed
Course Update: Using Computers for Molecular Biology
TREMBL: new protein database
Year 2000: Issues for RCR Researchers
Beware of MS Word Macro viruses
GenBank passes the One Billion base mark
Free Medline Searches on the Web
- The NCBI's ENTREZ web service has expanded to encompass all of MEDLINE and has been renamed PubMed. This service was previously limited to just a subset of MEDLINE articles that directly referenced sequences in GenBank. Now you can do your complete MEDLINE searches for free on the Web at: http://www4.ncbi.nlm.nih.gov/PubMed/. The protein and nucleotide search functions of ENTREZ remain unchanged.
The PubMed interface is quite attractive (but undergoing almost constant revisions), and relatively simple to use.
- Some advantages of PubMed over all other forms of MEDLINE access:
- 1) It's free. Not only does this save you money, it also saves you the hassle of remembering a password, and it does not require that you access the database from a computer on any specific network.
- 2) It provides universal access: you can now do MEDLINE searches from any computer anywhere in the world with one consistent interface, and colleagues at different institutions have exactly the same access to the same information at the same time.
- 3) The database uses the familiar graphical interface of the web: searches are typed into a simple text entry box, article titles are linked to abstracts, and on-line help is available via hotlinks from any unfamiliar terms.
- 4) All of the articles in MEDLINE have been compared with each other. Every citation identified in a PubMed search is linked to a pre-computed set of "Related Articles" that share a significant number of keywords.
- 5) PubMed works much like other familiar Internet search engines (AltaVista, Infoseek, etc.). Complex searches can be made with the Boolean terms "and", "or", and "not". PubMed will search for all words that begin with a given set of letters if an asterisk is placed at the end; for instance, bacter* will find all terms that begin with the letters bacter, e.g. bacteria, bacterium, bacteriophage, etc. Phrases composed of multiple words can be enclosed in quotes to ensure that the words occur together.

- 6) It does a fair job of understanding search terms stated in plain English. For example, the search request "complete E. coli genome map" was translated by PubMed into "complete[All] AND E.coli[All] AND genome[All] AND map[All]", which produced 11 relevant citations.
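The translated query in point 6 has a regular shape: each term is tagged with a field and the terms are joined with AND. The following is a minimal Python sketch of that shape only. It is not PubMed's actual parser; the function name `build_query` is hypothetical, and quoting multi-word phrases before tagging them is an assumption based on point 5.

```python
def build_query(terms, field="All"):
    """Join search terms with AND, tagging each one with a field name.

    Hypothetical helper that only mimics the translated form quoted above;
    it does not reproduce how PubMed itself interprets plain English.
    """
    tagged = []
    for term in terms:
        # Assumption: a multi-word phrase is quoted so its words stay together.
        if " " in term:
            term = f'"{term}"'
        tagged.append(f"{term}[{field}]")
    return " AND ".join(tagged)


# The example from the text, with "E. coli" already collapsed to "E.coli".
print(build_query(["complete", "E.coli", "genome", "map"]))
# complete[All] AND E.coli[All] AND genome[All] AND map[All]
```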
DON'T FORGET! If you use the RCR's systems, remember to acknowledge support for the computing through grant BIR-9318128 from the NSF. Send us a reprint too!
Academic Computing Group Formed
- With the 1997-98 academic year, a new division was formed as part of the Dean's Office. "Academic Computing" was created by joining the resources of "Educational Computing" (formerly the Hippocrates Project) with those of the "Research Computing Resource". This new division will be co-directed by Drs. Marty Nachbar, M.D., and Ross Smith, M.D., Ph.D. The mission of the new division is to provide and assist with the central computing needs of the School of Medicine, its faculty, students and staff, and to assist with the responsibility for the provision of Medical Center-wide central services.
- David S. Scotch, M.D., Associate Dean
Course Update: Using Computers for Molecular Biology
- The RCR's bioinformatics course, "Using Computers for Molecular Biology", was successfully completed by 14 students in the Spring 1997 semester. These students are now roaming the hallways and laboratories of the Medical Center (and beyond) providing expert advice on sequence analysis to all comers.
- In addition to the registered students, up to 50 additional students, faculty and staff were welcomed as auditors. Some lectures were also attended by auditors from Mount Sinai, Cornell Medical Center and other "sister" institutions in New York City. These outside auditors provided useful alternate perspectives and made valuable contributions to group discussions.
- Course notes from all of the 1997 lectures are on the RCR website: http://mcrcr0.med.nyu.edu/rcr/course/index.html
- The course will next be offered in February of 1998.
TREMBL: new protein database
- The RCR has added a new protein database called TREMBL (Translated EMBL) to our GCG system. The RCR's TREMBL update section (trembl_upd) will be automatically updated each week. TREMBL is similar to the GenPept database (peptide translations of GenBank DNA sequences) also maintained by the RCR, but it has several advantages:
- TREMBL contains fewer redundant sequences than GenPept, making searches faster.
- TREMBL contains annotations, allowing it to be indexed by the SRS system, and is therefore searchable with the LOOKUP program.
- TREMBL contains real accession numbers, making sequence retrieval with FETCH more straightforward.
- We have created a joint Swiss-Prot & TREMBL database (known by the logical names SPTREMBL or TP) that is almost exactly equivalent to the GenPept database. This will facilitate a comprehensive search of all protein sequences. For example, using the FASTA program, at the prompt, you can type:
TREMBL:* (to search just TREMBL), or
SPTREMBL:* (to search SwissProt and TREMBL), or
TREMBL_HUM:* (to search just human sequences in TREMBL)
Year 2000: Issues for RCR Researchers
- In a little more than 2 years we will be entering the year 2000. Unfortunately, this creates a problem for some older computer systems and software that store years as two digits: "00" is interpreted as 1900, not 2000, causing errors in software and hardware that require date information. This problem is significant and pervasive, since date information is used in a wide variety of electronic equipment besides computers, such as the digital controls for elevators, clinical metering pumps, and aircraft navigation equipment. This is known in the computer industry as "The Year 2000 Problem", "Y2K" or the "Millennium Bug".
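To make the two-digit year problem concrete, here is a small Python sketch. It is an illustration under stated assumptions, not a description of any particular product: the function names are hypothetical, and the pivot value of 50 used in the "windowing" repair is an arbitrary choice for the example.

```python
def naive_year(two_digit):
    """Read a two-digit year the way affected software does: always 19xx."""
    return 1900 + two_digit


def windowed_year(two_digit, pivot=50):
    """One possible repair ('windowing'): values below the pivot become 20xx.

    The pivot of 50 is an assumption; a real repair must choose a pivot
    that suits the dates actually stored in the file.
    """
    century = 2000 if two_digit < pivot else 1900
    return century + two_digit


# Elapsed years for a record entered in '97 and checked in '00.
print(naive_year(0) - naive_year(97))        # -97
print(windowed_year(0) - windowed_year(97))  # 3
```

Windowing only postpones the ambiguity; storing four-digit years removes it.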
- For the RCR itself, the problem is small, since VMS and Digital UNIX, the operating systems for the Alphas, are intrinsically "Year 2000 Compliant". However, researchers with their own desktop computers do need to be concerned. While the Macintosh and Windows95 operating systems do not have a problem with the year 2000, older PCs and applications, and the files created by them, will require repair. This is particularly problematic for "home made" spreadsheets and databases that may contain shortcut references to two-digit years.
- Even some common applications such as Microsoft Access95 cannot deal with Y2K. The initial release of Lotus 1-2-3 didn't account for the turn of the millennium leap year. The program doesn't know there's a February 29 in the year 2000. To make matters worse, competing spreadsheets Excel and Quattro Pro duplicated the problem, presumably for compatibility. Paradox treats the year 2000 as 1900 when sorting tables. There will be upgrades or patches for such programs, but it will be up to each computer owner to identify ALL of the software components on their computers that need upgrading, down to the little shareware gadgets that add a fun or useful little function but still have the capacity to crash the whole system.
- MIS will be conducting a survey in the next few months, which will aim to establish the scope of the problem Medical Center-wide. However, it is already known that many areas outside the core clinical and administrative areas will be responsible for dealing with the Year 2000 issues on their own.
- To help research groups affiliated with the RCR manage the Year 2000 problem, we have established a web site, http://mcrcr0.med.nyu.edu/rcr/Year2000, that will contain information to help you deal with the problem. We also encourage you to cooperate as fully as possible with MIS in their survey, since the information they collect will be very helpful in dealing with your own lab machines.
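The leap-year issue comes down to the Gregorian rule: a year divisible by 4 is a leap year, except century years, which are leap years only if divisible by 400. The year 2000 therefore has a February 29, while 1900 did not. The Python sketch below shows the full rule next to one way a program could get 2000 wrong. The buggy variant is purely illustrative; nothing here claims that any of the products named above used it.

```python
import calendar


def is_leap(year):
    """Full Gregorian rule: every 4th year, except centuries not divisible by 400."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


def century_rule_only(year):
    """Illustrative bug: applies the century exception but omits the 400-year rule."""
    return year % 4 == 0 and year % 100 != 0


print(is_leap(1900), is_leap(2000))   # False True
print(century_rule_only(2000))        # False (February 29, 2000 would be missed)
print(calendar.isleap(2000))          # True (standard library agrees with is_leap)
```

A quick check for a home-made spreadsheet or database is to enter February 29, 2000 and see whether it is accepted as a valid date.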
Beware of MS Word Macro viruses
- The Medical Center has been the site of an epidemic of computer viruses that affect Microsoft Word documents. These viruses, known as Word macro viruses, are written in Microsoft's Word Basic macro language and are embedded in Word documents. When an infected document is opened, the macro virus can copy itself into your global template file, and from there into other Word documents on your computer.
- These viruses can infect both Macintosh and Windows PC type computers that run Microsoft Word version 6.0 or later, and can move across platforms. On a Macintosh, Word viruses are pretty much limited to infecting and possibly corrupting other Word documents, but on a PC they can be much more damaging, up to and including wiping all of the data off of hard drives.
- The viruses can be transmitted from one computer to another as part of Word files on floppy disks, sent by AppleTalk over the local network, or as Word document attachments to e-mail messages. Note that e-mail messages themselves do not carry the viruses; the viruses are in the attached Word documents. Many Word documents on local file servers (departmental machines, etc.) are also infected. It is also quite likely that infected files are present on backup tapes and CD-ROMs, so even after all infected files have been cleaned off of a computer, the risk of re-infection is high.
- To protect yourself against Word macro viruses, there are several alternatives. It is simply not practical to avoid all transferring of files on floppies, by file sharing, and via e-mail. The simplest strategy is to remove Microsoft Word from your computer and use other word processing programs. On a Macintosh, MS Word version 5.1 does not support the Word Basic macro language, so it is immune to macro viruses. If you must use Word 6.0 (or newer versions), then the only alternative is to rely on commercial anti-virus software.
- There are a number of virus protection programs available, but the NYU Medical Center has purchased a site license for Virex (for Macs) and VET (for PCs) from Datawatch Inc. Get information and download VIREX/VET at this web page: http://mcnet02.med.nyu.edu/virex/virus.htm
- These programs will protect your computer from MS Word macro viruses and all other types of viruses. However, over 200 new macro viruses appear each month, so it is necessary to constantly update the virus information used by Virex/VET to detect viruses. Datawatch maintains a web site (http://www.datawatch.com/) that contains a continuously updated virus listing file. Each computer user needs to update their virus ID information on a monthly or even weekly basis to maintain a virus-free computer.
GenBank passes the One Billion base mark
- With the August 15, 1997 publication of release 102, GenBank has officially exceeded one billion base pairs of DNA sequence information from over 1.6 million individual accessions. The database is continuing to grow with a doubling rate of approximately 14 months. The RCR is committed to maintaining a local copy of the full GenBank database with daily updates, as well as a full complement of protein and other specialized databases.
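A doubling time of about 14 months implies exponential growth: size(t) = size(0) × 2^(t/14), with t in months. The Python sketch below applies that formula, starting from the release 102 base count reported in this issue. It is an extrapolation only, assuming the doubling rate stays constant, and the function name is hypothetical.

```python
def projected_bases(months_ahead, start=1_053_474_516, doubling_months=14):
    """Extrapolate database size, assuming a constant doubling time.

    start is the GenBank release 102 total (August 1997) from this issue;
    the constant 14-month doubling time is an assumption, not a forecast.
    """
    return start * 2 ** (months_ahead / doubling_months)


# One and two doubling periods after release 102.
print(f"{projected_bases(14):,.0f}")  # 2,106,949,032
print(f"{projected_bases(28):,.0f}")  # 4,213,898,064
```

Figures like these are useful mainly for planning local disk space for the daily-updated copy of the database.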
- The EST sections now comprise more than half of the total data in GenBank, and are the fastest growing part of the database. There is a good chance that in 1998, a new system of organizing EST data will be able to drastically shrink the storage size of the EST database sections by building consensus sequences and removing redundancy.
Recently Updated RCR Databases
(as of Sept. 1, 1997)
| Database | Release | Updates | Accessions | Size |
| --- | --- | --- | --- | --- |
| GenBank | 102, Aug 15, 1997 | daily | 1,610,848 | 1,053,474,516 total bases |
| GenPept | 102, Aug 21, 1997 | daily | 253,209 | 78,095,171 amino acids |
| SwissProt | 34, Oct. 1996 | weekly | 59,021 | 21,210,389 amino acids |
| TREMBL | 4.1, August 1997 | weekly | 137,255 | 34,256,737 amino acids |
| PIR | 53, June 30 1997 | | 95,051 | 30,409,580 amino acids |