Protein structure and family exercises

This is a set of URLs for online bioinformatics data and tools that will be useful for these exercises:

Genome Data  http://www.ncbi.nlm.nih.gov/genome/seq/
Cold Spring Harbor  http://nucleus.cshl.org/humchr18web/
MIT Genome Center  http://www-seq.wi.mit.edu/public_release/byprogress.shtml
Sanger Center  http://www.sanger.ac.uk/HGP/sequence/
UniGene  http://www.ncbi.nlm.nih.gov/UniGene/
ORF Finder  http://www.ncbi.nlm.nih.gov/gorf/gorf.html
GRAIL  http://compbio.ornl.gov/Grail-1.3/
Other gene recognition programs  http://linkage.rockefeller.edu/wli/gene/programs.html
BLAST  http://www.ncbi.nlm.nih.gov/BLAST/
Psi-BLAST  http://www.ncbi.nlm.nih.gov/cgi-bin/BLAST/nph-psi_blast
FASTA3  http://www2.ebi.ac.uk/fasta3/
        http://vega.crbm.cnrs-mop.fr/bin/fasta-guess.cgi
PROSITE  http://www.expasy.ch/prosite/
PFam  http://pfam.wustl.edu/hmmsearch.shtml
ProDom  http://protein.toulouse.inra.fr/prodom.html
CLUSTAL  http://www2.ebi.ac.uk/clustalw/
         http://pbil.ibcp.fr/cgi-bin/npsa_automat.pl?page=npsa_clustalw.html
         http://mbshortcuts.com/mbsalign/
MEME  http://www.sdsc.edu/MEME

1) Here is a protein sequence; see if you can identify any known motifs in it by searching online protein family databases.

CAA74572
mdpaeavlqe kalkfmnsse redcnngepp rkiipeknsl rqtynscarl clnqetvcla stamktencv aktklangts smivpkqrkl sasyekekel cvkyfeqwse sdqvefvehl isqmchyqhg hinsylkpml qrdfitalpa rgldhiaeni lsyldakslc aaelvckewy rvtsdgmlwk kliermvrtd slwrglaerr gwgqylfknk ppdgnappns fyralypkii qdietiesnw rcgrhslqri hcrsetskgv yclqyddqki vsglrdntik iwdkntleck riltghtgsv lclqydervi itgssdstvr vwdvntgeml ntlihhceav lhlrfnngmm vtcskdrsia vwdmasptdi tlrrvlvghr aavnvvdfdd kyivsasgdr tikvwntstc efvrtlnghk rgiaclqyrd rlvvsgssdn tirlwdiecg aclrvleghe elvrcirfdn krivsgaydg kikvwdlvaa ldprapagtl clrtlvehsg rvfrlqfdef qivssshddt iliwdflndp aaqaepprsp srtytyisr

Build a list of other proteins related to this one. Try limiting your search to human proteins.
• use motifs, domains, or alignments to locate additional members of this family
• find the corresponding DNA sequence and use PSI-BLAST to search DNA databases (ESTs, genomes)

2) This one is a bit harder. Again, look for known motifs in online databases, collect a list of related proteins, and try to locate additional members of this family.

U74586
masvlgsgrg sgglssqlkc kskrrrrrrs krkdkvsils tflapfkyls pgttnteded nlstssaevk enrnvsnlgt rplppgdwar ggstpsvkrk rpleegnggh fcklqliwkk lswsmtpkna lvqlhelkpg lqyrmvsqtg pvhapvfava vevngltfeg tgptkkkakm raaemalksf vqfpnafqah lamgsstspc tdftsdqadf pdtlfkefep ssrnedfpgc cpvdteflss ayrrgrllyh tldlmgqalp drsrlapgal gernpvvvln elrsglryvc lsetaekprv ksfvmavcvd grtfegsgrs kklakgqaaq aalqalfdir lpghipsrsk snllpqdfad svsqlvtqkf reltvgltsv yarhktlagi vmtkgldtkq aqvivlssgt kcisgehisd qglvvndcha eivarraflh flytqlelhl skhqedpers ifirvkeggy rlrenilfhl yvstspcgda rlnspyeiti dlnsskhivr kfrghlrtki esgegtvpvr gpsavqtwdg illgeqlvtm sctdkiaswn vlglqgallc hfiepvylhs iivgslhhtg hlarvmshrm egigqlpasy rqnrpllsgv snaearqpgk sphfsanwvv gsadleiina ttgkrscggs srlckhvfsa rwarlhgrls tripghgdtp smyceakrga htyqsvkqql fkafqkaglg twvrkppeqd qfllsl

3) A huge number of protein sequences in the databanks, in fact the vast majority, have no useful annotation. Working with a few of these is similar to working with your own experimental results from a yeast two-hybrid screen or some other procedure that generates a lot of cDNA sequences, except that you are guaranteed no easy matches to known sequences. This is where I often start, since people tend to bring me their harder sequence analysis problems (they are generally able to figure out the actins and ribosomal proteins on their own). In my opinion this is the most interesting and important bioinformatics work.
The automated annotation programs can take care of the easy ones (95% matches to known sequences), but it takes careful work from trained biologists to discover entirely new protein families.

Grab any old "hypothetical protein" from GenBank and try to find out something useful about it. To start, you should realize that GenBank annotations are almost never updated, so a protein with no useful annotation may be closely related to a more recently discovered gene. Do your own BLAST search to check. If nothing interesting turns up, you can skip PROSITE (which contains only well-known motifs), but do search Pfam and ProDom to see whether the protein has any less well-known motifs. Next, try to build your own family by collecting similar sequences from other "hypothetical proteins," ESTs, and genomic sequences (HTGs). If you do have a group of similar sequences, check for a UniGene cluster. Then build your own multiple alignment and study it to see whether there are conserved regions. If you find anything interesting, be sure to send it to NCBI.
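Scanning a multiple alignment for conserved regions can be roughed out in a few lines of Python. This is only a sketch under stated assumptions: the sequences are already aligned to equal length (for example by CLUSTAL), gaps are written as "-" or ".", and the function names, the 0.8 threshold, and the minimum run length of 5 are illustrative choices rather than part of the exercise. The toy alignment is made up for illustration.

```python
from collections import Counter


def column_conservation(aligned):
    """For each alignment column, return the fraction of sequences that
    carry the most common residue. Gaps count against conservation."""
    if not aligned:
        return []
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for column in zip(*aligned):
        residues = [c.upper() for c in column if c not in "-."]
        if not residues:
            scores.append(0.0)  # column made entirely of gaps
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(aligned))
    return scores


def conserved_regions(scores, threshold=0.8, min_len=5):
    """Return half-open (start, end) column ranges in which every column
    scores at least `threshold` and the run is at least `min_len` long."""
    regions = []
    start = None
    # A sentinel below any real score closes a run that reaches the end.
    for i, score in enumerate(list(scores) + [float("-inf")]):
        if score >= threshold and start is None:
            start = i
        elif score < threshold and start is not None:
            if i - start >= min_len:
                regions.append((start, i))
            start = None
    return regions


# Hypothetical toy alignment, not taken from any database.
toy = [
    "MKT-LWDKNTAA",
    "MRTALWDKNTG-",
    "MKS-LWDKNTAV",
]
print(conserved_regions(column_conservation(toy)))
```

Column indices are 0-based, so a reported range maps directly onto Python slices of each aligned sequence. A real analysis would more likely use a substitution-matrix or entropy-based score; plain majority identity is used here only because it is easy to check by eye.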
Here is one to get you started:

BAA20773
vpkvkrgrgr ppkvkitell nktdnrplkk leaqetlnee dkakiakskk kmrqkvqrge cqttiqgqar nkrkqetksl kqkeakkksk aekekgktkq eklkekvkre kkekvkmkek eevtkakpac kadktlatqr rleerqrqqm ileemkkpte dmcltdhqpl pdfsrvpglt lpsgafsdcl tiveflhsfg kvlgfdpakd vpslgvlqeg llcqgdslge vqdllvrllk aalhdpgfps ycqslkilge kvseipltrd nvseilrcfl maygvepalc drlrtqpfqa qppqqkaavl aflvhelngs tliineidkt lesmssyrkn kwivegrlrr lktvlakrtg rsevemegpe eclgrrrssr imeetsgmee eeeeesiaav pgrrgrrdge vdatassipe lerqieklsk rqlffrkkll hssqmlravs lgqdryrrry wvlpylagif vegtegnlvp eevikketds lkvaahasln palfsmkmel agsnttassp arargrprkt kpgsmqprhl kspvrgqdse qpqaqlqpea qlhapaqpqp qlqlqlqshk gfleqegspl slgqsqhdls qsaflswlsq tqshssllss svltpdsspg kldpapsqpp eepepdeaes spdpqalwfn isaqmpcnaa ptpppavsed qptpspqqla sskpmnrpsa anpcspvqfs stplaglapk rragdpgemp qsptglgqpk rrgrppskff kqmeqryltq ltaqpvppem csgwwwirdp emldamlkal hprgirekal hkhlnkhrdf lqevclrpsa dpifeprqlp afqegimsws pkektyetdl avlqwveele qrvimsdlqi rgwtcpspds tredlayceh lsdsqeditw rgrgreglap qrkttnpldl avmrlaaleq nverrylrep lwpthevvle kallstpnga pegttteisy eitprirvwr qtlercrsaa qvclclgqle rsiaweksvn kvtclvcrkg dndeflllcd gcdrgchiyc hrpkmeavpe gdwfctvcla qqvegeftqk pgfpkrgqkr ksgyslnfse gdgrrrrvll rgrespaagp ryseeglsps krrrlsmrnh hsdltfceii lmemeshdaa wpflepvnpr lvsgyrriik npmdfstmre rllrggytss eefaadallv fdncqtfned dsevgkaghi mrrffesrwe efyqgkqanl

4) An even greater challenge is to look for interesting genes in the new genomic sequence data from the various labs working on the genome projects. In this case you have to first identify coding sequences, then compare these to databases.
Motif searching tools can be helpful here in several ways: you can verify potential coding regions by searching for flanking promoters and regulatory binding domains, you can search for functional domains in proteins that are not highly similar to anything well annotated in GenBank, or you can use motif tools to help build a group of related sequences with unknown function. Another approach is to first search for similarity to known proteins by doing a six-frame translation and a BLAST/FASTA search against SwissProt or PIR. [You will have to break up the genome sequence into smaller chunks.]

Here is one fresh off the sequencer:

atcagcaagaacactgatctacgcagacgaacagatgcaactgtggcgtggtggacacga agacgcgtcacacgacctcgaaacgatagagcaggaaacagcaacggatatcgtccaaga accgcagagagaaacagtgtagtaacagtcgatccaatcaccgaagatgaaagagtgcac gccaccaagacgtcgtaactagagagtgaagcggtgcgagcaagacccacatggcgtact acacacggtaaacacttgactcggcggaaatcacaagaacatgcgccaccgtcatgtttc gcaacaaacgacagcttcttgcacccccaaggcaaagtggggtctccctaggcctcaaaa aacattgtggggaactttctcgcggaaactttcaggcacgcgtacatacaaattgtttca acgcggattcttgggcacttggaggaacatcgccacacctgggatctctggacagaagtt ctccccgcaaaaccatctgtgttagaaatagtttggcaaacaactttgtggggccaagca taccaacgctcccttgtccttccgaaaagagtcaaacccttcaccaaagtaaaacgcccc caaatgaatagtgccacaaaaagatcgcaacaaaaagtgctaaatccaagataatcacac aacataatcaaccgagttaacgtatcccaaatcccaaacaaccaacatcactcggtattg gctattccgttcaacaaaacaaaccaagccacaatcaaacaataatcaactcagcaagca gcgttcccccacaacacacacttccacagaaaacccccttacacaatctggcccacaacc ctgcctaatcgtcaagatgtcccattcggctgcgcccttaggattactcctcgcatcccc acccctactctaattttcctcctcctccccgtttcacctccccttattcctctctcctgt tcaatttattcttcttcttactattttccatctctaccgccaaatccccatctctcgttc ggctttccccgcgcgcgcgtcgtccgtgcggtaatcttgccccatcagtgtgcctactcc ttcacccatctatatacagagctggcccctcgctcgtattatatagtattactcgggcgc ggtatgacgagcgaaggggacactggtggatgctgcaacgggggggcaaccgtgcaggcc ggccggtcgacgctagagtgggcccgcagggagagggacaggcacaggggaaagaaggag gagggcacagcggcgaacgaaccccgaagagccaggcgacccggcgggacgaagaaaacg
cgggaggggcagcgggaacacaacggcccagaaaaagaagacaggggcgagccgcccgag ggggcagggggacagcacgagggggagcccggccggcggcggcacgacaagcgggacaag cggagcccaggaaggagaacccgaggcggaaaaggcgagaagaggaagcgaggagagagc gaggagagaggacgaggagaggacgagaagacgagcccggaggacgggaagacgacggga gcagcacgggaccggcaagagggggaaagacagaaagaaagggagcggggaagcgaaccg aacagggaaggcgggcgaggccccacagaaaagccgaaggccaacgaaaaggccaggaaa cagagacgcaggggaggaaaaagaggggggaaaggaacaaggggaagaacgccaaaagag agagaaaagaagcaagcagggaaggccgcggaacctgagcgaaaccaaacacaagaaaca aaacaaccaaggaccagtagaagggagaaaggcaacaacaaaaaccccacaccaaacgga aagccgaaggaaacaacgaaaacgcaaaacgaacgacggacccaaaaaagaacaggccac caggaaggcaaagccccggaccacgggaaccaaacaaggaggcaaagcccaaagggacac aaaggagggaaaaaaggcaccaaacggtaaacacgaaaaaaggcacacaaaaaggggcaa agaacgggggggcgcgcgaacacccccgaaccaaaaaaaaaaccggggagtgtcaaggtg gccaagccgtgggtgggggccagaaaaagggttggcaccgctgcaaagggcagagaaggg

Answers:

1) Prosite profile: PS50181
Pfam family: PF00646 (F-box domain)
Medline citation: F-box proteins are receptors that recruit phosphorylated substrates to the SCF ubiquitin-ligase complex. Skowyra D, Craig KL, Tyers M, Elledge SJ, Harper JW; Cell 1997;91:209-219

2) Pfam family: PF00035 (Double-stranded RNA binding motif)
Prosite profile: PS50141 (Adenosine deaminase)
BLOCKS: BP03961 - p99.1.3961 (RNA deaminase editing adenos)
Domo: DM04852 (double-stranded RNA binding domain)

3) Easy one!

gi|2183083 (AF000422) TTF-I interacting peptide 5 [Homo sapiens]
Length = 407

Score = 131 bits (327), Expect = 3e-29
Identities = 65/75 (86%), Positives = 65/75 (86%)

Query: 1   VPKVKRGRGRPPKVKITELLNKTDNRPLKKLEAQETLNEEDXXXXXXXXXXMRQKVQRGE 60
           VPKVKRGRGRPPKVKITELLNKTDNRPLKKLEAQETLNEED          MRQKVQRGE
Sbjct: 308 VPKVKRGRGRPPKVKITELLNKTDNRPLKKLEAQETLNEEDKAKIAKSKKKMRQKVQRGE 367

Query: 61  CQTTIQGQARNKRKQ 75
           CQTTIQGQARNKRKQ
Sbjct: 368 CQTTIQGQARNKRKQ 382

It also matches the PHD domain in Pfam fairly well (e-value=7.1e-10):
The PHD finger: implications for chromatin-mediated transcriptional regulation.
Aasland R, Gibson TJ, Stewart AF; Trends Biochem Sci 1995;20:56-59.
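
For exercise 4, the six-frame translation step can be sketched in Python before the translated frames are sent to BLAST or FASTA. This is a minimal sketch, not a gene finder: it assumes the standard genetic code, treats any codon containing a non-ACGT character as "X", and ignores introns, so it only highlights long stop-free stretches. The function names and the 50-residue cutoff are illustrative assumptions.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string (case-insensitive)."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate from the first base; stops are '*', unknown codons 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to translations."""
    dna = "".join(dna.split())  # drop the spaces between 60-base blocks
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


def open_stretches(protein, min_len=50):
    """Return stop-free stretches of at least `min_len` residues."""
    return [seg for seg in protein.split("*") if len(seg) >= min_len]


if __name__ == "__main__":
    example = "atggcctaa"  # tiny made-up example, not the exercise sequence
    for label, protein in sorted(six_frames(example).items()):
        print(label, protein)
```

To use it on the exercise, paste the genomic sequence into a string, call `six_frames`, and pass each frame to `open_stretches`; the stretches that survive the length cutoff are the candidates worth searching against SwissProt or PIR.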