Unix Homework (more challenging)
1. Log in to your Mendel.med.nyu.edu account. Set up your account so that the EMBOSS seqret tool can access remote databases. To do this, copy this file from my directory to yours:
cp /home/newstu/emboss.default .
2. Retrieve a DNA sequence from the EMBL database:
seqret EMBL:FJ788172
3. Look at the new sequence file that was created by seqret:
more fj788172.fasta
4. Count the number of 'a' bases (hint: use grep and wc)
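One possible approach, sketched on a made-up file so you can check the logic before running it on your own data. The file name demo.fasta and its contents are hypothetical. The first grep drops the header line (which may itself contain the letter 'a'), the second grep prints each 'a' on its own line, and wc counts those lines. The -o option is assumed to be available, as it is in GNU and BSD grep.

```shell
# Hypothetical small FASTA file standing in for fj788172.fasta
printf '>demo sequence with a few a characters\nacgtaacc\nggatta\n' > demo.fasta

# Skip the header line, print every 'a' on its own line, then count the lines
a_count=$(grep -v '^>' demo.fasta | grep -o 'a' | wc -l)
echo "$a_count"
```

If your sequence file uses uppercase letters, adding -i to the second grep makes the match case-insensitive.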
5. Now for some bigger data. Copy this chunk from a raw Illumina sequence.txt file:
cp /home/newstu/s_1_sequence.txt .
6. Have a look at this file - messy! Illumina calls this FASTQ format, but it is not a standard file format.
more s_1_sequence.txt
Most other software requires FASTA format, in which each record has a single header line that starts with the ">" character followed by a title, then a newline, then a line of sequence. The next record follows immediately, like this:
>HWI-EAS305:1:1:3:1240#0/1
AGGAGGGGGAGAGAGGAGGGAAGGCAAGAGGGGA
>HWI-EAS305:1:1:3:1330#0/1
AGTTCACGCTAAAACATTGTATTTCAGCTGTAAA
How can we convert the FASTQ file to proper FASTA format?
(hint: I can do this in 2 commands, one a 'grep' and one a 'sed')
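Here is a minimal sketch of one grep-plus-sed solution, not necessarily the one intended. It assumes each FASTQ record is exactly four lines (header, sequence, a "+" line, quality string) and that every header starts with "@HWI". The file demo_fastq.txt and its quality strings are made up for illustration. Matching on "@HWI" rather than just "@" matters because a quality string can also begin with "@", as in the second record below.

```shell
# Hypothetical two-record FASTQ file; the quality lines are invented
cat > demo_fastq.txt <<'EOF'
@HWI-EAS305:1:1:3:1240#0/1
AGGAGGGGGAGAGAGGAGGGAAGGCAAGAGGGGA
+HWI-EAS305:1:1:3:1240#0/1
abbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
@HWI-EAS305:1:1:3:1330#0/1
AGTTCACGCTAAAACATTGTATTTCAGCTGTAAA
+HWI-EAS305:1:1:3:1330#0/1
@bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
EOF

# grep keeps each header plus the one line after it (the sequence);
# sed deletes the "--" separators grep prints between groups,
# then turns the leading "@" of each header into ">"
grep -A 1 '^@HWI' demo_fastq.txt | sed -e '/^--$/d' -e 's/^@/>/' > demo.fasta

cat demo.fasta
```

The -A option is not part of POSIX grep, but both GNU and BSD grep support it.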
7. Here is another file created by Illumina sequencing software. This time everything for each read is on one line.
cp /home/newstu/s2_sorted.txt .
Cut out just the sequence and the genome location (chromosome and position).
(hint: use 'cut -f'; you figure out the column numbers)
I am worried about duplicate sequences in this file. Use the sort -u command to remove duplicates. How many duplicates were removed?
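The cut and sort steps can be sketched as follows on a made-up tab-separated file. The file demo_sorted.txt, its four-column layout, and the column numbers 2-4 are all hypothetical; the real s2_sorted.txt has its own layout, so you still need to find the right columns yourself. The number of duplicates removed is the line count before sort -u minus the line count after.

```shell
# Hypothetical file: read name, sequence, chromosome, position (tab-separated)
printf 'r1\tACGT\tchr1\t100\nr2\tACGT\tchr1\t100\nr3\tTTGA\tchr2\t250\nr4\tACGT\tchr1\t100\n' > demo_sorted.txt

# Keep only the sequence and location columns (column numbers are an assumption)
cut -f 2-4 demo_sorted.txt > demo_cut.txt

# Sort and keep one copy of each distinct line
sort -u demo_cut.txt > demo_unique.txt

# Duplicates removed = lines before minus lines after
total=$(wc -l < demo_cut.txt)
unique=$(wc -l < demo_unique.txt)
removed=$((total - unique))
echo "$removed"
```

Note that sort -u compares whole lines, so two reads count as duplicates here only when the sequence and the location both match.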