
Practical and theoretical problems in sequencing and assembly of contigs

*** Sequence assembly programs face a number of computational problems.

*** First, the 500 bp reads of sequence data produced in the lab contain errors: incorrectly determined bases as well as insertions and deletions.

*** Second, the error rate is highest at the beginnings and ends of the reads, precisely the regions that must be overlapped.

*** Third, DNA fragments are frequently cloned into plasmids or other vectors, and some sequence from these vectors is often included at the ends of sequence reads.
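
The third problem, vector sequence at the ends of reads, can be illustrated with a minimal Python sketch. It is not the method of any particular assembly package. The function name `trim_vector`, the parameters `left_flank` and `right_flank` (the vector sequence on either side of the cloning site), and the `min_match` threshold are all assumptions made for this example. It uses exact matching only; because real reads are most error-prone at their ends, a practical tool would need approximate matching.

```python
def trim_vector(read, left_flank, right_flank, min_match=6):
    """Remove exact-match vector sequence from both ends of a read.

    left_flank  : vector sequence immediately upstream of the cloning site
    right_flank : vector sequence immediately downstream of the cloning site
    min_match   : shortest vector match that is trusted (hypothetical default)
    """
    # 5' end: the read may begin with a suffix of the upstream vector flank.
    start = 0
    for k in range(min(len(left_flank), len(read)), min_match - 1, -1):
        if read.startswith(left_flank[-k:]):
            start = k
            break

    # 3' end: the read may finish with a prefix of the downstream vector flank.
    end = len(read)
    for k in range(min(len(right_flank), len(read) - start), min_match - 1, -1):
        if read.endswith(right_flank[:k]):
            end = len(read) - k
            break

    return read[start:end]


# Made-up sequences for illustration only.
LEFT = "GGATCCTCTAGAGTCGAC"
RIGHT = "CTGCAGGCATGCAAGCTT"
INSERT = "ATGGCGTACGTTAGC"
raw_read = LEFT[-8:] + INSERT + RIGHT[:7]
trimmed = trim_vector(raw_read, LEFT, RIGHT)
```

Here `trimmed` contains only the cloned insert; a read with no recognizable vector sequence is returned unchanged.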

*** Depending on their faith in the speed and reliability of sequence analysis and assembly software, researchers have generally taken one of three different approaches to planning sequencing projects.

*** People who don't trust the software generally put a lot of time into preparing DNA fragments before sequencing, dividing large pieces of DNA into small, ordered, overlapping fragments.
This strategy requires much more initial cloning work in the laboratory, but it usually minimizes the number of sequencing reads required to complete a project. It also makes minimal demands on the software that organizes the reads, since it is already known how they should fit together to form the final contig.
*** A second strategy, known as "primer walking," requires very fast and accurate analysis of sequence reads, since each sequencing reaction uses information from the previous read.
Again, assembly problems are minimized since both the order and the amount of overlap of the reads are known.
*** A third strategy, known as "shotgun sequencing," takes maximum advantage of the speed and low cost of automated sequencing, but relies totally on software to assemble a jumble of essentially random sequence reads into a coherent and accurate contig.
This approach requires many more individual sequencing reactions, but much less meticulous cloning and record keeping for a large project. The Institute for Genomic Research (TIGR) has demonstrated the power and utility of this shotgun approach by determining the complete genomic sequences of Haemophilus influenzae, Methanococcus jannaschii, and Mycoplasma genitalium.
*** Many sequencing projects use approaches that involve a mixture of these three basic strategies.
Typically, large sections of genomic DNA are first carefully sub-cloned into overlapping megabase-sized fragments (YACs), which are then sub-cloned into overlapping 20-40 kb fragments (cosmids or lambda clones), and these smaller fragments are then shotgun sequenced. Gaps in the assembled sequences are filled by primer walking.
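
To make concrete what "relying on software to assemble a jumble of reads" involves, here is a minimal Python sketch of greedy overlap merging. It is a toy illustration, not the algorithm used by TIGR or any specific assembler. The function names `overlap` and `greedy_assemble` and the `min_len` threshold are hypothetical. The sketch assumes error-free reads that all come from the same strand and contain no repeats, assumptions that real shotgun data violate.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that exactly matches a prefix of `b`.

    Returns 0 if the best match is shorter than `min_len`.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the longest exact overlap."""
    # Drop exact duplicates (keeping order) and reads contained in another read.
    contigs = list(dict.fromkeys(reads))
    contigs = [r for r in contigs if not any(r != s and r in s for s in contigs)]

    while len(contigs) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing left to join; the remaining contigs are separated by gaps
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Three made-up reads, given in scrambled order.
shotgun_reads = ["TTAGCCTGAAC", "ATGGCGTACG", "GTACGTTAGC"]
assembled = greedy_assemble(shotgun_reads)
```

In this example the three reads collapse into a single contig. When no overlap of at least `min_len` bases remains, the function returns several contigs, which corresponds to the gaps that are then closed by primer walking.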



Using Computers for Molecular Biology
Stuart M. Brown, Ph.D., RCR, NYU Medical Center
Comments to: browns02@mcrcr.med.nyu.edu