The midterm exam is a Perl program. 

 

This year, we are going to try to find novel genes in bacterial genomic DNA. I just happen to have some unknown bacterial genomic DNA sequences sitting on my hard drive. I will share them with you as a FASTA file. You will need a mini-program to split them into separate sequences at the header lines and to strip out non-DNA characters (watch out for dashes).
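The splitting and cleaning step can be sketched roughly as follows. This is only a minimal illustration, not a required design: the helper name `parse_fasta` and the record layout are hypothetical, and keeping `N` as a valid character is an assumption you may want to change.

```perl
use strict;
use warnings;

# Hypothetical helper: split FASTA text into records at the ">" header lines.
# Each record is a hash ref with the header text and the cleaned sequence.
sub parse_fasta {
    my ($text) = @_;
    my @records;
    for my $line (split /\r?\n|\r/, $text) {      # copes with Unix, DOS and old Mac line ends
        if ($line =~ /^>(.*)/) {
            push @records, { header => $1, seq => '' };
        }
        elsif (@records) {
            my $clean = uc $line;
            $clean =~ s/[^ACGTN]//g;              # assumption: keep only A, C, G, T, N (drops dashes, digits, spaces)
            $records[-1]{seq} .= $clean;
        }
    }
    return @records;
}

# Small made-up input to show the behaviour.
my $demo = ">seq1 demo\nACG-T acgt\n12 TTnA\n>seq2\nGG--CC\n";
for my $rec (parse_fasta($demo)) {
    print "$rec->{header}\t$rec->{seq}\n";
}
```

Lines that appear before the first header are ignored in this sketch; whether that is the right choice depends on how clean the input file is.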

 

There are a number of ways to identify genes in bacterial genomic DNA, but translation initiation sites are highly conserved and less prone to random noise than ORFs. Bacterial translation signals are pretty simple. There is an 8-letter consensus known as the Shine-Dalgarno sequence (TAAGGAGG), followed 4-10 bases downstream by the initiation codon (ATG in coding-strand DNA). However, there are many variants of this sequence that still yield perfectly functional proteins, so we have a case of inexact pattern matching. 

 

Here are the most common variants: [TA][AC]AGGA[GA][GA]

(You are welcome to do additional research on variant Shine-Dalgarno patterns.)

 

The 4-10 base space is best represented as the following Perl regexp:   .{4,10}

 

You can match this set of sequence variants using a regular expression.
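As a rough sketch, the variant pattern and the spacer given above can be joined into a single regular expression. The start codon is written here as ATG on the coding strand, which is an assumption of this example, and the test sequence is made up.

```perl
use strict;
use warnings;

# Variant Shine-Dalgarno pattern, a 4-10 base spacer, then the start codon.
# Assumption: the start codon is ATG as read on the coding strand.
my $sd_pattern = qr/[TA][AC]AGGA[GA][GA].{4,10}ATG/i;

my $seq = 'ccgtTAAGGAGGttcacaaATGaaacgc';   # made-up example sequence
if ($seq =~ $sd_pattern) {
    # @- and @+ hold the 0-based start and end offsets of the last match.
    my $hit = substr($seq, $-[0], $+[0] - $-[0]);
    print "Matched $hit starting at base ", $-[0] + 1, "\n";
}
```

Compiling the pattern once with `qr//` keeps it in one place, so adding further variants later only means editing that line.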

 

Write a Perl program that takes an input sequence in FASTA format and searches for a match to the Shine-Dalgarno consensus plus the start codon 4-10 bases downstream. Make sure that your program is able to find more than one match, since bacterial genomic DNA fragments may have more than one gene.
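One possible way to collect every match is a `while` loop over a `/g` match. The sketch below is illustrative only: the helper name `find_sites` is hypothetical, ATG is assumed as the start codon, and the non-greedy spacer `.{4,10}?` is a design choice that reports the nearest start codon rather than the farthest.

```perl
use strict;
use warnings;

# Hypothetical helper: return the 1-based start and matched text of every site.
sub find_sites {
    my ($seq) = @_;
    my @hits;
    while ($seq =~ /([TA][AC]AGGA[GA][GA].{4,10}?ATG)/gi) {
        my $start = $-[1];                        # 0-based offset of this match
        push @hits, { start => $start + 1, match => $1 };
        pos($seq) = $start + 1;                   # resume one base later so overlapping sites are not skipped
    }
    return @hits;
}

# Made-up sequence containing two sites.
my $seq = 'ccTAAGGAGGacgtacATGccccAAAGGAAGttttATGgg';
printf "%d\t%s\n", $_->{start}, $_->{match} for find_sites($seq);
```

Without the `pos()` reset, the search would resume after the end of each match, which is usually fine but can miss a second site that begins inside the first one.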

 

 

To make a really good program:

a.      Take the sequence file to be searched as a command-line parameter rather than hard-coding it in the program or prompting the user to type it in while the program is running (this enables shell scripting)

b.      Strip out carriage returns and whitespace

c.      Don’t include the header line in the sequence

d.      Make your search pattern case insensitive

e.      Make sure that your program is able to find more than one match since bacterial genomic DNA fragments may have more than one gene

f.        Find a way to use the wildcard “.” character in your pattern to allow for a mismatch at any single position

g.      Scan the opposite strand (reverse/complement sequence or search with rev/comp pattern)

h.      Print the location of each match (include + or – strand)
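
Items g and h can be approached in more than one way. The sketch below takes the reverse/complement route and reports every hit in coordinates of the original (+) strand. The helper names `revcomp` and `scan_both_strands` are hypothetical, ATG is assumed as the start codon, and the demo sequence is made up.

```perl
use strict;
use warnings;

# Reverse-complement a DNA string: flip the order, then swap the bases.
sub revcomp {
    my ($seq) = @_;
    my $rc = scalar reverse $seq;
    $rc =~ tr/ACGTacgt/TGCAtgca/;
    return $rc;
}

# Hypothetical helper: scan both strands and return [strand, from, to]
# for each hit, with from/to as 1-based positions on the + strand.
sub scan_both_strands {
    my ($seq) = @_;
    my $len     = length $seq;
    my $pattern = qr/[TA][AC]AGGA[GA][GA].{4,10}?ATG/i;   # assumption: ATG start codon
    my @hits;
    for my $strand ('+', '-') {
        my $target = $strand eq '+' ? $seq : revcomp($seq);
        while ($target =~ /$pattern/g) {
            my ($s, $e) = ($-[0], $+[0]);          # 0-based start, end (exclusive)
            my ($from, $to) = $strand eq '+'
                ? ($s + 1, $e)                     # same coordinates on the + strand
                : ($len - $e + 1, $len - $s);      # mirror back onto the + strand
            push @hits, [$strand, $from, $to];
            pos($target) = $s + 1;                 # allow overlapping sites
        }
    }
    return @hits;
}

my $demo = 'ccTAAGGAGGacgtacATGccaaa';             # made-up example
print join("\t", @$_), "\n" for scan_both_strands($demo);
```

Searching the reverse/complement sequence keeps a single pattern for both strands; the alternative is to write a reverse/complement pattern and search the original sequence once more.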