Nature Genetics has published an excellent supplement issue on genome sequence analysis called "A user's guide to the human genome."
Do Exercise #7
The rest of this issue is also worth a look. Check it out in your own time.
User's Guide to the Human Genome
A member of our School faculty is trying to design a DNA chip with mouse promoter sequences on it. He has chosen about 2,000 genes that are relevant to developmental processes. For many of these genes it is far from trivial to determine the "true" transcription start site, or even to make a best guess at it.
Here are GenBank accession numbers for 3 genes that have proven to be difficult.
AK017771
Can you find these genes on the UCSC Genome Browser (for mouse)? It may require getting the DNA sequence from GenBank and using the BLAT alignment tool.
We have also been using the RIKEN database of full-length mouse cDNA clones. If there is a longer RIKEN sequence, then where is the true promoter?
Choose a "gene model" that makes the most sense to you for each of these genes and extract 500 bases of genomic sequence directly upstream from your chosen transcription start sites (the first exons).
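The extraction step is easy to get wrong for genes on the minus strand, where "upstream" means higher genomic coordinates and the sequence must be reverse-complemented. The following is a minimal sketch, not the Genome Browser's own method. It assumes the chromosome sequence is already in memory as a string, that `tss` is a 0-based index of the first transcribed base, and that the function names (`reverse_complement`, `upstream_region`) are purely illustrative.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T, N)."""
    complement = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(complement)[::-1]


def upstream_region(chrom_seq, tss, strand, length=500):
    """Return up to `length` bases directly upstream of the TSS.

    Assumptions (not from the exercise text):
      - chrom_seq is the plus-strand sequence of the chromosome
      - tss is the 0-based index of the first transcribed base
      - strand is "+" or "-"
    The result is always written 5'->3' on the gene's own strand.
    """
    if strand == "+":
        # Upstream lies at lower coordinates; clip at the sequence start.
        start = max(0, tss - length)
        return chrom_seq[start:tss]
    if strand == "-":
        # Upstream lies at higher coordinates; clip at the sequence end,
        # then flip to the gene's strand.
        end = min(len(chrom_seq), tss + 1 + length)
        return reverse_complement(chrom_seq[tss + 1:end])
    raise ValueError("strand must be '+' or '-'")


if __name__ == "__main__":
    toy = "AAAACCCCGGGGTTTT"
    print(upstream_region(toy, 8, "+", length=4))
    print(upstream_region(toy, 7, "-", length=4))
```

Note that genome browsers usually display 1-based coordinates, so a position read from the browser would need to be shifted by one before being passed to a function like this.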
Once you have chosen a promoter region for each of these three mouse genes, go to the human genome, find the orthologs of these genes, and again extract your best guess as to the promoter region (500 bp upstream of the first exon).
Now for each pair of orthologs, try to align the promoter regions. Can they be aligned?
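Because orthologous promoters often share only short conserved stretches, a local alignment is usually a more forgiving first test than a global one. The sketch below computes a Smith-Waterman local alignment score in plain Python. The scoring values (match 2, mismatch -1, linear gap -2) are arbitrary assumptions chosen for illustration, and the function name `local_alignment_score` is hypothetical. A dedicated alignment program would be the better choice for real analysis.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score between strings a and b.

    Uses a linear gap penalty and keeps only two rows of the
    dynamic-programming matrix, so memory is O(len(b)).
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            step = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                   # start a new local alignment
                prev[j - 1] + step,  # align a[i-1] with b[j-1]
                prev[j] + gap,       # gap in b
                curr[j - 1] + gap,   # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    mouse = "GGTATAAAGG"  # toy sequences, not real promoters
    human = "CCTATAAACC"
    print(local_alignment_score(mouse, human))
```

A score close to what two unrelated sequences of the same length and composition would give (for example, shuffled copies of the promoters) suggests the regions cannot be meaningfully aligned.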
Search for transcription factor binding sites (TESS).
Search TESS
Search for consensus eukaryotic promoter elements.
TFSiteScan
Neural Network Promoter Prediction
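To get a feel for what a consensus search does, here is a minimal sketch that scans a sequence for a TATA-box-like motif. It assumes the simplified textbook consensus TATA(A/T)A(A/T) and searches the given strand only. The function name `find_tata_boxes` is hypothetical, and the tools listed above are expected to use richer models than a single exact pattern.

```python
import re

# Simplified TATA-box consensus: TATA, then A or T, then A, then A or T.
# The lookahead lets overlapping matches be reported as well.
TATA_PATTERN = re.compile(r"(?=(TATA[AT]A[AT]))")


def find_tata_boxes(seq):
    """Return (0-based start, matched bases) for each consensus hit."""
    return [(m.start(), m.group(1)) for m in TATA_PATTERN.finditer(seq.upper())]


if __name__ == "__main__":
    toy_promoter = "GGCTATAAAAGGC"  # toy sequence, not a real promoter
    for start, site in find_tata_boxes(toy_promoter):
        print(start, site)
```

Many promoters lack a recognizable TATA box, so the absence of a hit says little about whether the chosen region is a real promoter.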
Can you think of a better way to find promoters for large sets of genes?
Lecture Notes by Shifra Ben-Dor