
Finding Genes in DNA Sequences

One aspect of the analysis of an unknown DNA sequence is the identification of protein-coding regions, which show up in the sequence as open reading frames (ORFs).




A. Finding coding regions by similarity searching

*** A simple way to approach this problem is to translate the sequence in all six reading frames (three forward and three reverse) and run a similarity search against the protein databanks.

*** There is a variant of the BLAST program (BLASTX) that automatically translates a DNA query sequence and performs a similarity search against protein databanks.

*** If a protein sequence matches, get its DNA sequence and align it with your unknown sequence. The start and stop codons should line up nicely. If your query sequence is genomic, then the introns should also be obvious.
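The six-frame translation that BLASTX performs before searching can be sketched in a few lines of Python. This is only an illustration of the idea, not BLASTX's own code: the function names are made up for this example, the standard genetic code is assumed, and any codon containing an ambiguous base is written as "X".

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order ("*" marks a stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons from the start of seq; unknown codons become 'X'."""
    seq = seq.upper()
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frame_translations(seq):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to protein strings."""
    frames = {}
    for sign, strand in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{sign}{offset + 1}"] = translate(strand[offset:])
    return frames


if __name__ == "__main__":
    for label, protein in six_frame_translations("ATGGCCTAA").items():
        print(label, protein)
```

Each of the six protein strings could then be used as a query against a protein databank, which is the step BLASTX automates.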




B. Scanning DNA for ORFs

*** If you cannot find a handy template sequence in the databank, then you must rely on a knowledge of signal sequences.

*** The translation initiation site is almost always an ATG codon, and in many eukaryotic genes a TATA box (consensus TATAAA) lies about 30 base pairs upstream of the transcription start site.

*** This is enough information to specify a pattern for the GCG program FINDPATTERNS.

*** It may be even easier to just produce a map of ORFs in all 6 reading frames and look for a long one.

*** Simple software that maps an ORF starting at every ATG and stops it at every stop codon is available in a wide variety of forms.

*** GCG provides the FRAMES program. The MAP program can also be used to identify open reading frames. GeneWorks, MacVector, and Sequencher all handle this function quite elegantly.

*** Introns can often be identified as breaks in ORFs and with moderate reliability by the occurrence of consensus splice signal sequences.

*** However, the only way to truly prove the existence of an intron is experimental: compare the RNA (cDNA) sequence to the genomic sequence.
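The simple ORF-mapping idea described above (start at an ATG, stop at the next in-frame stop codon, repeat for all six frames) can be sketched as follows. This is a minimal illustration, not the algorithm used by FRAMES, MAP, or any commercial package. The function name and the min_codons parameter are assumptions made for this example. It reports only the first ATG after each stop, so every ORF is the longest one ending at that stop, and it ignores ORFs that run off the end of the sequence.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=1):
    """Scan all six frames for ATG...stop ORFs.

    Returns tuples (strand, frame, start, end, codons), longest first.
    start and end are 0-based positions on the strand being read
    (for "-" that is the reverse complement); end is exclusive and
    includes the stop codon. codons counts ATG but not the stop.
    """
    orfs = []
    for strand, s in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # open a new ORF
                elif start is not None and codon in STOP_CODONS:
                    codons = (i - start) // 3
                    if codons >= min_codons:
                        orfs.append((strand, frame + 1, start, i + 3, codons))
                    start = None                   # close it at the stop
    return sorted(orfs, key=lambda orf: orf[-1], reverse=True)


if __name__ == "__main__":
    for orf in find_orfs("CCATGAAATTTTAGGG"):
        print(orf)
```

Raising min_codons is a crude way to "look for a long one": short ORFs occur by chance in any frame, so a threshold of a hundred codons or so filters out most of the noise in long sequences, at the cost of missing short genes and short exons.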




C. Statistical tools to predict coding regions

*** There are several other methods for identifying ORFs in DNA sequences.

*** GCG provides several tools that help to identify protein coding sequences by statistics that measure codon usage (CODONPREFERENCE) and the non-random use of particular nucleotides in the third position of each codon (TESTCODE).

*** These statistical methods are imprecise, but can help identify possible coding regions in large chunks of genomic DNA sequence.
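To give a feel for what a position-specific statistic looks like, the sketch below tabulates nucleotide frequencies at each of the three codon positions for a chosen reading frame. It is not the TESTCODE or CODONPREFERENCE algorithm, only a simplified illustration of the kind of quantity such methods work with; the function name is made up for this example.

```python
from collections import Counter


def codon_position_composition(seq, frame=0):
    """Return nucleotide frequencies at codon positions 1, 2 and 3.

    seq is read in the given frame (0, 1 or 2); trailing bases that do
    not fill a complete codon are ignored. The result is a list of three
    dicts mapping each of A, C, G, T to its frequency at that position.
    """
    seq = seq.upper()[frame:]
    usable = len(seq) - len(seq) % 3           # keep complete codons only
    composition = []
    for pos in range(3):
        counts = Counter(seq[pos:usable:3])
        total = sum(counts.values())
        composition.append(
            {base: (counts[base] / total if total else 0.0) for base in "ACGT"}
        )
    return composition


if __name__ == "__main__":
    first, second, third = codon_position_composition("ATGGCCGCGGCCTAA")
    print("third position:", third)
```

In a window of random or non-coding sequence the three positions tend to look alike, whereas in a coding window read in the correct frame the third position often shows a noticeably skewed composition. How strong that skew is depends on the organism, so any threshold would have to be chosen empirically.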




D. Custom Gene Finding Servers on the Web

*** Other custom software has been designed to identify ORFs. The validity of these packages is not universally accepted, but if nothing else is working, why not take a shot?

  • GRAIL is the most widely used ORF identification tool.
    GRAIL provides analysis of protein coding potential of a DNA sequence. GRAIL uses variable-length windows tailored to each potential exon candidate, defined as an open reading frame bounded by a pair of start/donor, acceptor/donor or acceptor/stop sites. This scheme facilitates the use of more genomic context information (splice junctions, translation starts, non-coding scores of 60-base regions on either side of a putative exon) in the exon recognition process. GRAIL finds about 91% of all coding regions with an apparent false positive rate of 8.6%.
  • GenLang
    GenLang is a syntactic pattern recognition system, which uses the tools and techniques of computational linguistics to find genes and other higher-order features in biological sequence data. Patterns are specified by means of rule sets called grammars, and a general purpose parser, implemented in the logic programming language Prolog, then performs the search.
  • BCM GeneFinder
    GeneFinder offers some unusual custom algorithms: "The algorithm first predicts all possible potential internal exons, and potential 5' and 3'-exon for each internal by linear discriminant functions combining characteristics describing various contextual features of these exons. Then by method of dynamic programming it searches for optimal combination of these exons and construct gene model."
  • ORFfinder at NCBI.
    This is a simple open reading frame detector, but it allows automatic BLAST searches with each putative ORF.
  • DNA translation at the Univ. of Minnesota Med. School
    This is another version of an ORF detector and automatic translation tool.



Using Computers for Molecular Biology
Stuart M. Brown, Ph.D., RCR, NYU Medical Center
Comments to: browns02@mcrcr.med.nyu.edu