Genome Projects and Genomics
Everyone in the biological research community has been hearing a lot about the Human Genome Project and how it will revolutionize biology and medicine.
What is it all about, how will it affect us as researchers, and what are the direct implications for bioinformatics?
- The human genome contains about 3 billion base pairs.
- GenBank currently holds about 650 million base pairs from all organisms.
- This is not an unfathomable jump - less than an order of magnitude.
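The arithmetic behind that claim can be checked in a few lines. This is only a sketch using the two round figures quoted above; the constant and function names are made up for illustration.

```python
import math

# Round figures quoted in the list above (approximate, not exact counts).
HUMAN_GENOME_BP = 3_000_000_000
GENBANK_BP = 650_000_000

def scale_up(target_bp, current_bp):
    """Return (fold increase, orders of magnitude) going from current to target."""
    fold = target_bp / current_bp
    return fold, math.log10(fold)

fold, orders = scale_up(HUMAN_GENOME_BP, GENBANK_BP)
print(f"{fold:.1f}-fold increase, {orders:.2f} orders of magnitude")
# prints: 4.6-fold increase, 0.66 orders of magnitude
```

A single human genome is therefore roughly four to five times the size of all of GenBank today, which is less than one order of magnitude.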
The entire human genome sequence will be completed by 2005, possibly even by 2000.
However, this milestone is merely a starting point; the real goal of the Genome Project is to understand the function of the estimated 100,000 genes present in the human genome.
The human is a very poor organism for genetic research:
- gene knockouts are impossible
- controlled breeding is impossible
- there is not even a comprehensive collection of mutants
Therefore, the complete genomic sequences of many model organisms such as mouse, fruit fly, and worm are also needed.
Collecting complete genome sequences from multiple species, and from multiple individuals within those species, is within the capacity of the sequencing labs over the next few years.
But the analysis of data at this level is beyond the scope of our current computational tools.
- What computational hardware and software would be necessary to perform whole genome comparisons between species to identify all common genes?
- How about comparing the complete genomic sequences of two individuals of one species to identify genes responsible for various phenotypic differences?
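To make the scale of these questions concrete, here is a toy sketch of one very crude way to compare two sequences: count the short exact substrings (k-mers) they share. It is an illustration under strong assumptions, not a method proposed in these notes. Exact k-mer matching is a stand-in for real alignment, the function names are hypothetical, and the sequences are invented. Holding every k-mer of a 3 billion base pair genome in memory this way would be a very different proposition.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings (k-mers) of a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def shared_kmer_fraction(seq_a, seq_b, k=8):
    """Jaccard similarity of the two k-mer sets: shared k-mers / all k-mers."""
    a, b = kmers(seq_a, k), kmers(seq_b, k)
    if not a or not b:
        return 0.0  # a sequence shorter than k has no k-mers to compare
    return len(a & b) / len(a | b)

# Invented toy sequences that differ at a single position.
species_a = "ATGGCGTACGTTAGCCTGA"
species_b = "ATGGCGTACGATAGCCTGA"
print(shared_kmer_fraction(species_a, species_b, k=4))
```

Even in this toy form, the cost grows with the length of both sequences, which is the point of the questions above: genome-scale comparison needs algorithms and hardware designed for it.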
Dr. Eric Lander (director of the MIT Genome Lab) has discussed the new kind of biology that will emerge after the human genome is completely sequenced. [Lander, E.S. (1996) The new genomics: Global views of biology, Science 274:536-539]
He compared the human genome sequence to the periodic table of chemistry, suggesting that once the genome is sequenced, general rules of information flow in biology can begin to be formulated.
Informative regions can be re-sequenced much faster and more easily - both for evolutionary and population genetics studies as well as for medical diagnostics.
Routine whole-genome diagnostics was imagined in the "Star Trek" universe of the 24th century, but this type of data will be readily available within one or two decades.
As biologists, we need to imagine how our work will be different in this "Post Genome Project" era.
Once all of the genes are known, new technologies will allow complete genome-wide expression maps to be created.
Early forms of these technologies exist today - such as the DNA chips created by Affymetrix, or the glass slides created by David Botstein that can measure the differential hybridization of thousands of different DNA molecules.
Imagine that data are available for mRNA levels for all 100,000 genes from a cancerous cell, and from that same cell type in a normal and pre-cancerous state.
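As a rough sketch of what working with such data might look like, the snippet below compares mRNA levels between two states and reports the genes whose expression changes at least two-fold. Everything here is an assumption made for illustration: the gene names and measurements are invented, the log2 fold change with a pseudocount is just one common convention, and only two of the three cell states are compared.

```python
import math

def log2_fold_changes(normal, tumor, pseudocount=1.0):
    """Map each gene measured in both states to log2(tumor / normal).

    The pseudocount avoids division by zero for unexpressed genes.
    """
    return {
        gene: math.log2((tumor[gene] + pseudocount) / (normal[gene] + pseudocount))
        for gene in normal.keys() & tumor.keys()
    }

def changed_genes(normal, tumor, threshold=1.0):
    """Return the sorted genes whose |log2 fold change| meets the threshold."""
    changes = log2_fold_changes(normal, tumor)
    return sorted(g for g, lfc in changes.items() if abs(lfc) >= threshold)

# Invented mRNA levels (arbitrary units) for three hypothetical genes.
normal_cell = {"geneA": 100, "geneB": 50, "geneC": 10}
cancer_cell = {"geneA": 105, "geneB": 400, "geneC": 1}

print(changed_genes(normal_cell, cancer_cell))
# prints: ['geneB', 'geneC']
```

With 100,000 genes instead of three, the same comparison is still simple to compute; the hard part is deciding which of the many changes are biologically meaningful.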
Imagine comparing complete genomic sequences between a dozen family members with an uncharacterized disease and a dozen who are unaffected.
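The family comparison can be sketched as a set operation: keep the variants carried by every affected relative and by none of the unaffected ones. This is a deliberately naive illustration. It assumes each genome has already been reduced to a set of variants, that a single fully penetrant variant is shared by all affected members, and that there are no sequencing errors. The function name and the variant tuples are hypothetical.

```python
def candidate_variants(affected, unaffected):
    """Return variants present in every affected genome and in no unaffected one.

    Each genome is a set of (chromosome, position, allele) tuples.
    """
    if not affected:
        return set()
    shared_by_affected = set.intersection(*affected)
    seen_in_unaffected = set().union(*unaffected)
    return shared_by_affected - seen_in_unaffected

# Invented variant sets for two affected and two unaffected relatives.
affected = [
    {("chr1", 100, "A"), ("chr2", 50, "T")},
    {("chr1", 100, "A"), ("chr3", 7, "G")},
]
unaffected = [
    {("chr3", 7, "G")},
    {("chr2", 50, "T")},
]

print(candidate_variants(affected, unaffected))
# prints: {('chr1', 100, 'A')}
```

With a dozen relatives in each group and millions of variants per genome, the filter itself remains cheap; producing and storing the complete sequences is where the real cost lies.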
The Genome Project and the Internet
So how does the Genome Project relate to the Internet?
Clearly the Genome Project involves a huge amount of data that is stored on computers all over the world.
This data can be sorted, annotated and organized in many different ways using different types of database software, different analysis algorithms, and different forms of interfaces.
In many cases it is this "value added" processed data that is most useful to the researcher.
Access to the data is via the Internet.
Genome scientists depend on the Internet to an unprecedented degree in order to do their work.
In many cases genome project data centers are leading the world in developing new Internet tools for accessing data.
Using Computers for Molecular Biology
Stuart M. Brown, Ph.D., RCR, NYU Medical Center
Comments to: browns02@mcrcr.med.nyu.edu