Research Computing Newsletter


Research Computing News

Volume 5, Number 4
December 1997





Course Reminder


The RCR's BioInformatics course: Using Computers for Molecular Biology (G16.2604) will be offered again as 12 weekly lectures in 1998. Tentative lecture dates are Tuesday afternoons from February 17 to May 5. Check the web page http://mcrcr0.med.nyu.edu/rcr/molbio/syllabus-98.html for the course syllabus and up-to-date scheduling information.

Registered students will attend twelve 1.5-hour lectures, each followed by a two-hour computer lab. Students will complete an independent bioinformatics research project, which will be presented both as a research paper and in a public poster session at the end of the course. All other NYU faculty, students, and staff are welcome to audit the lectures; there will be ample room in the hall for drop-ins.



Evaluate New Software


The RCR needs your help to evaluate some of the newest Molecular Biology software. There are some new programs that offer innovative approaches to laboratory data management - and the only way to decide whether they are worthwhile is to test them in an active research lab.

In order to make it easy for you to try out these programs for yourselves, the RCR has installed demo versions and tutorials for each of these programs on our public Macintosh computers in room 174-MSB and we have put copies of the demos in our public software archive. This archive can be accessed with a web browser at , or on the Med. Center's AppleShare network by connecting as a "Guest" to the file server endeavor in the NYU MC-Hippocrates AppleTalk Zone.

The three programs that we are currently trying to evaluate are the Gene Construction Kit and the Gene Inspector from Textco Inc. and Vector NTI from InforMax Inc. Each of these programs has unique features designed to appeal to the molecular biologist, but they are outside of the core mission of the RCR, which has traditionally focused on providing high-performance computing power for database searching and sequence analysis. None of these programs can replace all, or even a significant number, of the functions provided by GCG running on our Alpha server, but they can be a tremendous asset in the laboratory as an aid in managing data and preparing figures.

The Gene Construction Kit


The Gene Construction Kit is the most straightforward of these new products. It is a plasmid drawing/restriction mapping tool. It can make some really beautiful drawings suitable for publications, slide presentations etc.. Each step of complex plasmid cloning projects can be simulated using laboratory-like operations such as restriction digestion, ligation, adding linkers, etc. Gene Construction Kit is a Macintosh only program. More information about this program can be found at the Textco web site:
http://www.textco.com/


Vector NTI


Vector NTI is also a plasmid drawing/restriction mapping program, but it is far more ambitious than Gene Construction Kit. Vector NTI also incorporates PCR primer design and database management tools that can store all of the molecules, oligos and enzymes used by an entire laboratory group. These databases can be accessed from multiple computers over a local network and there are tools that allow interfaces between Vector NTI databases, ORACLE and other SQL-based databases on mainframe computers, and web servers. The interface for managing cloning projects is extremely sophisticated; Vector NTI can create a plan for constructing your desired molecule that takes advantage of optimal genetic engineering techniques.

Vector NTI has some built-in Internet capabilities: it can connect directly to Molecular Biology Web servers, and it can create output in HTML format that can be posted directly onto your own web server. The program is designed to allow new functions to be added easily - the user can create custom scripts or add new modules created by other programmers. InforMax will release a Motif searching tool and a multiple alignment tool (similar to ClustalW) in the near future. This program is clearly poised to be a major player in the network-based computing model that is developing in bioinformatics. Vector NTI is available for Macintosh, Windows, and Windows NT operating systems. More information about Vector NTI can be found on the InforMax website: http://www.informaxinc.com/vectornti.html

Gene Inspector


Gene Inspector is the most ambitious of these new Molecular Biology programs. It is a combination of a versatile electronic laboratory notebook, a sequence analysis package (with a multiple sequence editor), and an illustration tool. This program is actually designed to replace the traditional lab notebook with a computer software tool that lets you write, draw, calculate, analyse sequences, and even paste in (scanned) photos and X-ray films. Notebook pages can be formatted for use as posters, slides, or figures for publication.

Gene Inspector provides over 60 different nucleic acid and protein analysis tools, including sequence alignment, base composition/distribution, ORF determination and evaluation, restriction mapping, dot matrix comparisons, antigenicity, hydropathy, transmembrane helices, helical wheel, Prosite motif searching, physical characteristics, signal sequence, and protein secondary structure prediction. The program also allows automated scripts to be developed that perform an entire series of sequence analyses using pre-defined parameters and create custom formatted output. More information about this program can be found at the Textco web site: http://www.textco.com/



GCG version 9.1 and beyond...
The Genetics Computer Group has produced an upgrade from version 9.0 to 9.1 for the GCG suite of molecular biology programs. The RCR installed GCG version 9.1 in September 1997. Version 9.1 provides several new programs, including the PAUP phylogenetics package and several programs for analysis of protein structure. Version 9.1 also includes several bug fixes and many minor enhancements to SeqLab, the X-Windows-based graphical interface to GCG, which was introduced in version 9.0 in January of 1997. Aside from the availability of the new programs, users should not notice any changes in the commands or the performance of any GCG programs.

The most significant addition to GCG 9.1 is the inclusion of PAUP (Phylogenetic Analysis Using Parsimony), written by Dr. David Swofford and copyrighted by the Smithsonian Institution. PAUP is a complex molecular phylogenetics program that offers several different methods of computing evolutionary trees from groups of aligned sequences. All of these methods use a cladistic approach to evolution rather than the phenetic (distance only) approach used by the DISTANCES program that has been available in GCG for several years. For more information about phylogenetic methods, the GCG documentation (GenHelp:PaupSearch) contains an extensive explanation of the algorithms used by PAUP, and the RCR's "Using Computers" lecture notes have a general summary of molecular phylogenetics: http://mcrcr0.med.nyu.edu/rcr/course/phylo-intro.html

The GCG program PAUPSearch can be used to analyze sequence alignments. It searches for optimal trees using one of three optimality criteria: maximum parsimony, minimum evolution distance, or maximum likelihood (nucleotide sequences only). The PAUP functions supported by PAUPSearch include searching for optimal trees, neighbor-joining reconstruction of a tree, bootstrap analysis (a method of assigning confidence levels to groupings in the tree), and length analysis. The program PAUPDisplay creates graphical output of the trees calculated by PAUPSearch. PAUPDisplay can also plot trees calculated by DISTANCES and GROWTREES.
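The idea behind the bootstrap analysis mentioned above can be illustrated with a short Python sketch. This is not PAUP's implementation and the function name and toy alignment are hypothetical; it shows only the resampling step, in which alignment columns are drawn with replacement to build a pseudo-replicate alignment. In a full analysis a tree would be computed from each replicate, and the fraction of replicates that contain a given grouping would be reported as its support.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Return one pseudo-replicate: alignment columns resampled with replacement."""
    length = len(next(iter(alignment.values())))
    # Draw as many column indices as the alignment has columns, with replacement.
    cols = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in cols) for name, seq in alignment.items()}

# Toy aligned sequences of equal length (hypothetical data).
alignment = {
    "seqA": "ACGTACGTAC",
    "seqB": "ACGTACGTTC",
    "seqC": "ACGAACGTTC",
}

rng = random.Random(1997)  # fixed seed so the replicate is reproducible
replicate = bootstrap_alignment(alignment, rng)
```

Each replicate has the same sequences and the same length as the original alignment, but some columns appear more than once and others not at all.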

The three new protein analysis programs introduced in GCG 9.1 are all related to the prediction of protein structure: CoilScan locates segments of coiled-coil structure within protein sequences. SPScan identifies secretory signal peptides. HTHScan detects the presence of helix-turn-helix motifs, which are characteristic of sequence-specific DNA-binding proteins and are often associated with gene regulation.

In other news, the Genetics Computer Group has officially announced that it will produce a Web interface to the GCG programs in mid-1998. The initial version of the Web interface may not offer access to all GCG programs and options, and it will not provide the interactive multiple sequence editor that is the heart of the SeqLab interface. Nevertheless, the Web interface has been eagerly awaited by many users and system administrators, since it will greatly simplify use of the software for new users and encourage experienced users to try out new programs and new options once they are available with just a mouse click in a web browser. The move of GCG to the Web adds momentum to the trend for all bioinformatics applications and databases to adopt the Web as a universal interface. This will clearly benefit the individual scientist, both by providing more unified access to research data and computer applications and by reducing or eliminating the need to purchase ever more powerful standalone desktop computers and copies of expensive local software.


RCR Account Renewal


A reminder to all Principal Investigators who are RCR members: January is RCR account renewal month. We will be sending account renewal forms to all RCR account holders in December. The RCR membership fee will remain $500 for 1998. Remember, an RCR membership gets you unlimited computing time, an unlimited number of accounts for the people working in your lab, mainframe and Macintosh molecular biology software, disk storage space for your projects, news, e-mail, and bioinformatics consulting services.

In addition to paying the RCR's annual membership fee, each member is asked to review all of the accounts in their group. It is important, both for our internal accounting procedures and for the security of your data, to delete the accounts of people who are no longer working in your lab. Please contact Ross Smith or Stuart Brown with any questions about RCR account renewal procedures.


E-Mail Troubleshooting


People who use Eudora or Netscape Mail (using the POP3 mail protocol) to collect mail from the VMS cluster may (very rarely) notice problems collecting their mail. These problems can be annoying, but they are usually easy to fix.

The problem is often due to an overloaded, or even damaged, mailbox on the RCR machines. To fix the problem, log into MCRCR with TELNET (or VersaTerm) and run MAIL at the "$" prompt. If you use POP exclusively to read your mail, your mailbox and all folders should be empty. Typing DIR at the MAIL> prompt should show no messages: if you find any, read them, then delete them. Repeat this process for the MAIL and NEWMAIL folders (type SELECT MAIL at the MAIL> prompt to move to the MAIL folder). Exit from MAIL, then re-start MAIL. If you get an announcement that you have new messages, type READ/NEW at the MAIL> prompt twice: this resets the newmail counter to zero. Exit and re-start MAIL to be sure that the mailbox is indeed empty and that you don't get an erroneous message saying you have new mail.

Finally, type COMPRESS at the MAIL> prompt. A new, compressed, and more efficient mailbox will be created for you. The old mailbox (called MAIL.OLD) can be deleted. All RCR users should do this (relatively) simple clean-up procedure, regardless of whether they think they have any problems with mail. This will make our entire system work more efficiently and with less downtime for system maintenance.

If you get large volumes of mail, this 'clean-up' of your VMS mailbox should be done every couple of months. If, when you type MAIL, you get an announcement that you have, for example, 6 new mail messages, but DIR only shows 4, your mailbox has been damaged. It is important to fix the newmail count so that it correctly reports the number of new messages: if you don't, POP won't be able to transfer mail to your PC or Mac, and we have even had system-wide problems traced to damaged mailboxes. We're happy to help deal with mailbox problems. Call if you feel you need to do so!

- Ross Smith
M.D., Ph.D., Director of RCR



BLAST Gets Better


The NCBI (National Center for Biotechnology Information), home of GenBank, ENTREZ, and BLAST, has introduced some major improvements in the BLAST program used to search protein and DNA databases for sequence similarities.

The BLAST Website now offers two new options, Gapped BLAST and PSI-BLAST. These new options represent significant changes in the BLAST algorithm, but the upshot for the researcher is more sensitivity to weak but biologically meaningful sequence similarities.

Gapped BLAST allows the introduction of gaps (deletions and insertions) into alignments. Gapped BLAST produces longer continuous alignments rather than the multiple short segments that researchers are used to seeing in BLAST output. Also, the scoring of gapped results tends to be more biologically meaningful than that of ungapped results. The inability of BLAST to utilize gaps in alignments has long been its major weakness, and this new algorithm promises to be a major improvement. In addition to improving sensitivity, generating longer aligned regions, and improving the predictive power of the statistical scores, gapped BLAST is about three times faster than the traditional BLAST algorithm. This is very important, since the rate of growth in database sizes currently exceeds the rate of improvement in computer processor speeds.
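The effect of allowing gaps can be seen in a small Python sketch. This is a textbook-style local alignment score, not the BLAST algorithm itself, and the scoring values (match +2, mismatch -1, gap -2) and the toy sequences are assumptions chosen for illustration. With gaps forbidden, the best alignment is a single diagonal segment; with gaps allowed, the two matching regions on either side of a deletion are joined into one longer, higher-scoring alignment.

```python
def local_score(a, b, match=2, mismatch=-1, gap=None):
    """Best local alignment score of a and b.

    gap=None forbids gaps (ungapped segments only); otherwise `gap` is the
    (negative) score charged for each inserted or deleted position.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score = prev[j - 1] + s              # extend along the diagonal
            if gap is not None:
                score = max(score, prev[j] + gap, cur[j - 1] + gap)
            cur[j] = max(0, score)               # local alignment: never below zero
            best = max(best, cur[j])
        prev = cur
    return best

# Toy sequences: the second lacks two of the C's found in the first.
seq1 = "AAAACCCCGGGG"
seq2 = "AAAACCGGGG"

ungapped = local_score(seq1, seq2)           # 14
gapped = local_score(seq1, seq2, gap=-2)     # 16
```

Here the gapped alignment matches all ten letters of the shorter sequence at the cost of two gap positions, which scores better than any single ungapped segment.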

The basic BLAST algorithm compares two sequences by breaking them into sets of short words, and then looks for close matches between pairs of words from the two sequences. Wherever statistically significant matches are found between word pairs, BLAST tries to create an alignment by extending the match in both directions. The new gapped BLAST algorithm requires matches between two different pairs of words located near each other on the two sequences before it tries to create an alignment. Since the chance of randomly finding two different matching words between two unrelated sequences is much lower than the chance for a single random match, the gapped BLAST algorithm is able to use a less stringent cutoff score for considering two similar words to be a match. The net result is a more sensitive search, with less computer time wasted trying to extend uninformative random word matches, and longer, more meaningful alignments.
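The two-word requirement described above can be sketched in Python. This is a simplified illustration, not NCBI's code: it uses exact word matches instead of scored "close" matches, and the function names, the word length of 3, and the window of 10 positions are all assumptions. Word hits are grouped by diagonal (query position minus subject position), and an extension would only be attempted where two non-overlapping hits fall on the same diagonal within the window.

```python
from collections import defaultdict

def word_hits(query, subject, w=3):
    """Yield (query_pos, subject_pos) for every exact shared word of length w."""
    index = defaultdict(list)
    for j in range(len(subject) - w + 1):
        index[subject[j:j + w]].append(j)
    for i in range(len(query) - w + 1):
        for j in index.get(query[i:i + w], []):
            yield i, j

def two_hit_seeds(query, subject, w=3, window=10):
    """Return (diagonal, first_pos, second_pos) for each hit that has an
    earlier, non-overlapping hit on the same diagonal within `window`."""
    by_diag = defaultdict(list)
    for i, j in word_hits(query, subject, w):
        by_diag[i - j].append(i)
    seeds = []
    for diag, positions in by_diag.items():
        positions.sort()
        for k, second in enumerate(positions):
            for first in positions[:k]:
                if w <= second - first <= window:
                    seeds.append((diag, first, second))
                    break
    return seeds

# Toy protein-like strings (hypothetical data).
query = "ACDEFGHIKLMN"
related = "WWACDEFYYIKLMNWW"    # shares two separate regions with the query
unrelated = "WWACDWWWWWWW"      # shares only a single word

seeds = two_hit_seeds(query, related)
no_seeds = two_hit_seeds(query, unrelated)
```

The related sequence yields seeds because it has word hits in two separate regions of the same diagonal, while the single chance word match in the unrelated sequence triggers nothing.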

PSI-BLAST (Position-Specific Iterated Basic Local Alignment Search Tool) uses a technique similar to GCG's PROFILESEARCH. PSI-BLAST first performs an initial gapped BLAST search of the database. Then, it makes a multiple alignment from the significant hits. The multiple alignment is then used to construct a position-specific scoring matrix - a table of amino acid frequencies at each position in the sequence. This matrix is then used instead of a query sequence for another BLAST search. New alignments found in the second search are incorporated into the scoring matrix and the process is repeated (iterated) until no more significant hits are found. Remarkably, this entire process uses only as much computer time as a series of BLAST searches equal to the number of times the matrix search is repeated.
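The table of amino acid frequencies described above can be sketched in Python. This is a minimal illustration under stated assumptions, not PSI-BLAST's actual matrix construction, which derives log-odds scores and applies sequence weighting and pseudocounts; the function names, the toy scoring rule, and the example alignment are hypothetical.

```python
from collections import Counter

def position_frequencies(aligned):
    """Per-column residue frequencies, ignoring gap characters ('-')."""
    table = []
    for column in zip(*aligned):
        residues = [r for r in column if r != "-"]
        counts = Counter(residues)
        total = len(residues)
        table.append({r: n / total for r, n in counts.items()} if total else {})
    return table

def score(table, segment):
    """Toy score: sum of the column frequencies of the residues in `segment`."""
    return sum(col.get(r, 0.0) for col, r in zip(table, segment))

# Toy multiple alignment of four short sequences (hypothetical data).
aligned = ["MKVL", "MRVL", "MKIL", "MK-L"]
table = position_frequencies(aligned)

typical = score(table, "MKVL")    # uses the most common residue in every column
atypical = score(table, "MRIL")   # uses rarer residues in columns 2 and 3
```

A segment built from the most common residue at each position scores higher than one using rarer residues, which is how a matrix derived from many family members can recognize sequences that a single query would miss.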

PSI-BLAST is able to find highly diverged members of protein families. This is quite similar to (but MUCH faster and easier than) the procedure that most investigators usually follow: collecting meaningful hits from a BLAST (or FASTA) search, building a multiple alignment with PILEUP, then using PROFILEMAKE and PROFILESEARCH to search for additional protein sequences that are similar to the consensus of the multiple alignment. PSI-BLAST also makes obsolete another "quick and dirty" technique that many researchers use: taking each of the hits from one database search as a query sequence for additional searches, and then compiling a list of all sequences that are found in multiple searches. This remarkable PSI-BLAST program is very new and still in its evaluation phase, but it could have a major impact on the process of identifying the function of new genes as they are sequenced by the Genome Projects.

You can use the new BLAST programs on the NCBI's web server at:
http://www.ncbi.nlm.nih.gov/BLAST/


GeneWorks is discontinued, MacVector upgraded


Oxford Molecular Group, the software company that produces GeneWorks, has officially discontinued the program.

"GeneWorks has been discontinued because it was written in an old programming language that is no longer supported by Apple's development tools, and had become impossible to maintain or to continue to develop."

There will be no immediate impact of this announcement. All labs may continue to use their existing copies of GeneWorks (or upgrade to version 2.5.1 that is available from Stuart Brown at the RCR office in 183 MSB) and the RCR will continue to operate the GeneWorks key server for the NYU Med Center's site license. However, there will be no further upgrades or bug fixes for GeneWorks.

Oxford Molecular urges all GeneWorks users to switch to their new product, MacVector, which has virtually all of the capabilities of GeneWorks plus many new features, including the ability to search and upload sequences from ENTREZ and perform BLAST searches, all within the MacVector application. The RCR currently has a 3-user site license for MacVector, and we will add additional users as demand increases. All RCR members can obtain a free copy of MacVector version 6.0 from the RCR office, 183 MSB. MacVector 6.0 offers only minor changes from version 5.0, but should perform better on newer Macintosh computers.


Year 2000 Update


The survey of the Medical Center's computers and systems for "Y2K compliance", commissioned by the MIS department, is essentially complete. However, the survey was limited almost exclusively to administrative and hospital systems: no survey of computers in the School of Medicine is planned. Academic Departments and individual faculty will be responsible for identifying and fixing problem machines and software on their own.

Academic Computing is acutely aware of the difficulties that faculty may experience. We are working to build a worthwhile set of documents addressing these problems at the URL: http://rcr-www.med.nyu.edu/rcr/Year2000

Both AC and MIS will be contributing to the collection. In addition, AC will offer workshops on the practicalities of the Y2K problem in 1998. Please note that Academic Computing is not able to provide hands-on services to faculty to deal with this problem: we don't have the resources. If a lab finds that it requires a great deal of work, AC will help in the selection of an outside contractor to provide these services.