Searching
So, given an unknown protein sequence, what are your computational options for guessing its function? Beyond predicting hydrophobic regions and getting some indication that certain regions might form a helix or a sheet, direct analysis of the sequence alone does not reveal much.
However, you are not working in a vacuum. A great deal of structural and functional information has been determined for other proteins. First, check for overall sequence similarity using standard methods (BLAST or FASTA); you might have something that is very similar to a known gene.
If no obvious similarities are found, do not despair. Proteins are built in sections, generally known as domains. For example, a great many enzymes with different functions bind ATP; the ATP binding domain is a conserved sequence across many different genes. Many of these conserved domains have been collected in databases.
The best starting point for motif search is in the PROSITE database:
The "Dictionary of Protein Sites and Patterns" maintained by Amos Bairoch at the University of Geneva, Switzerland.
PROSITE contains a comprehensive list of documented protein domains constructed by expert molecular biologists.
Several tools have been developed to compare an amino acid sequence to the patterns in PROSITE.
GCG provides the programs MOTIFS and FINDPATTERNS. More about these tools below.
Stand-alone programs are available for Macintosh, IBM compatibles and most other forms of computers. A number of WWW servers are also available for on-line similarity searching of PROSITE with your query sequence:
- The ExPASy server at the University of Geneva, Switzerland
- The EMBL server at Hinxton Hall in the United Kingdom
In addition, each of the motifs in PROSITE has been exhaustively searched against the SwissProt database (which contains all well-annotated protein sequences) and the results compiled into several derived databases:
- BLOCKS from the Henikoff group at the Fred Hutchinson Cancer Research Center in Seattle, WA (USA)
- ProDom : Protein Domain Database from Daniel Kahn at the INRA in Toulouse (France)
Some other databases of protein domains are constructed strictly by the application of computational algorithms to protein databases. These computed databases may contain more domains than PROSITE, but lack the extensive annotation about the function of the domains.
- Prints : Protein Motif Fingerprint Database from Terri Attwood at the UCL in London (UK) contains 614 entries, encoding 3280 individual motifs. Of these entries, 302 have some sort of equivalent pattern in PROSITE.
- Pfam : Protein Families Database is a collection of protein family alignments, some of which are constructed semi-automatically using hidden Markov models (HMMs) and some fully automatically (contains 175 HMM-based families and 11929 other families).
- Swiss-Model: Automated Protein Modeling Server at the GLAXO Institute, Geneva, Switzerland
- SBASE: a collection of annotated protein domain sequences. Entries are clustered using the BLAST score as similarity measure (contains 7231 entries)
- SCOP: Structural Classification of Proteins (based on the Brookhaven Protein Data Bank). As of May 1996 (Release 1.32), SCOP contains 4432 PDB entries classified into 8330 domains.
Any study of a new protein, or of a family of known proteins, would do well to begin by thoroughly investigating the contents of these protein structure/domain databases. Much work (database searching, multiple alignment, and the building of phylogenetic trees) has already been done; be sure that your project builds on this knowledge rather than repeating the same analyses.
Patterns vs. Profiles
In some cases, you may wish to reverse the pattern searching process. Rather than search your sequence for the known patterns found in a database, you might wish to search a database with a new pattern of your own creation.
GCG has two different types of pattern searching tools: FINDPATTERNS and MOTIFS work with simple text patterns, while PROFILESEARCH and PROFILESCAN use mathematical profiles that are created from multiple alignments of a number of sequences that share a conserved domain.
Patterns
FINDPATTERNS uses a text string of DNA or amino acids, such as TAATAATG, as a pattern.
FINDPATTERNS allows for ambiguity in pattern matching by enclosing different choices in parentheses and separating the choices with commas. For instance, RGF(Q,A)S means RGF followed by either Q or A, followed by S. Variable numbers of N characters (which match any base) or X characters (which match any amino acid) may also be included.
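To make the matching rule concrete, here is a minimal Python sketch that translates a pattern of the kind just described into a regular expression and reports where it matches. It is not FINDPATTERNS itself: the function names `findpatterns_to_regex` and `find_matches` and the test sequences are invented for illustration, and only the two features described above (comma-separated choices in parentheses and a single wildcard letter) are handled.

```python
import re


def findpatterns_to_regex(pattern, wildcard="X"):
    """Translate a simple FINDPATTERNS-style pattern into a Python regex.

    Assumption: only (A,B) choice groups and one wildcard letter are
    supported (X for protein patterns, N for DNA patterns).
    """
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "(":
            # Everything up to the closing parenthesis is a list of choices.
            close = pattern.index(")", i)
            choices = [c.strip().upper() for c in pattern[i + 1:close].split(",")]
            parts.append("(?:" + "|".join(re.escape(c) for c in choices) + ")")
            i = close + 1
        elif ch.upper() == wildcard:
            parts.append(".")  # wildcard matches any single symbol
            i += 1
        else:
            parts.append(re.escape(ch.upper()))
            i += 1
    return "".join(parts)


def find_matches(pattern, sequence, wildcard="X"):
    """Return (1-based start, matched text) for each non-overlapping match."""
    regex = re.compile(findpatterns_to_regex(pattern, wildcard))
    return [(m.start() + 1, m.group()) for m in regex.finditer(sequence.upper())]


if __name__ == "__main__":
    protein = "MKRGFQSLLARGFASTTRGFKS"  # made-up sequence
    print(findpatterns_to_regex("RGF(Q,A)S"))
    print(find_matches("RGF(Q,A)S", protein))
    # For DNA, N is the wildcard instead of X.
    print(find_matches("TAANAATG", "GGTAATAATGCC", wildcard="N"))
```

In the made-up protein, RGFQS and RGFAS are reported but RGFKS is not, because K is not one of the listed choices.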
FINDPATTERNS works best with short patterns (less than about 50 characters).
A number of patterns can be searched simultaneously by creating a pattern list file (see the GCG documentation for the format of this file).
The GCG program MOTIFS is essentially an implementation of FINDPATTERNS that uses the entire PROSITE database as a large pattern file:
    PROSITETOGCG of: Prosite.Doc and Prosite.Dat  December 18, 1995 11:18
    Release 13.0 (11/95)

    Name               Offset  Pattern ..                                     PDoc_Name

    11s_Seed_Storage   1       NGx(D,E)2x(L,I,V,M,F)C(S,T)x{11,12}(P,A,G)D    0284.pdoc
    1433_1             1       RNL(L,I)SV(G,A)YKN(I,V)                        0633.pdoc
    1433_2             1       YK(D,E)STLIMQLL(R,H)DNLTLW(T,A)(S,A)           0633.pdoc
    25a_Synth_1        1       GGSx(A,G)(K,R)xTxL(K,R)(G,S,T)xSD(A,G)         0653.pdoc
    25a_Synth_2        1       RPVILDPx(D,E)PT                                0653.pdoc
    ////////////////////////////////////////////////////////////////////////////
    Zinc_Finger_C2h2   1       Cx{2,4}Cx3(L,I,V,M,F,Y,W,C)x8Hx{3,5}H          0028.pdoc
    Zinc_Finger_C3hc4  1       CxHx(L,I,V,M,F,Y)Cx2C(L,I,V,M,Y,A)             0449.pdoc
    Zinc_Protease      1       (G,S,T,A,L,I,V,N)x2HE(L,I,V,M,F,Y,W)~(D,E,H,R,K,P ...
    Zn2_Cy6_Fungal     1       (G,A,S,T,P,V)Cx2C(R,K,H,S,T,A,C,W)x2(R,K,H)x2Cx{5 ...
    Zp_Domain          1       (L,I,V,M,F,Y,W)x7(S,T,A,P,D,N)x3(L,I,V,M,F,Y,W)x( ...

MOTIFS displays an abstract from the PROSITE Dictionary below each pattern that is found.
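The idea of scanning one sequence against a whole file of named patterns can be sketched in Python as follows. This is an illustration, not the MOTIFS program. The reading of the notation is an assumption based on the listing: a lowercase x is taken to mean any residue, a trailing number or {min,max} a repeat count, and ~ in front of a choice list "none of these". The function names and the test sequence are made up; the four patterns are copied from the complete entries in the listing, and truncated entries are deliberately left out rather than guessed.

```python
import re

# One pattern element: optional "~", then a choice list or a single letter,
# then an optional repeat written as {min,max} or as a bare number.
TOKEN = re.compile(
    r"(?P<neg>~)?"
    r"(?:\((?P<choices>[A-Za-z,]+)\)|(?P<single>[A-Za-z]))"
    r"(?:\{(?P<lo>\d+),(?P<hi>\d+)\}|(?P<count>\d+))?"
)


def listing_pattern_to_regex(pattern):
    """Translate one pattern from the listing into a Python regex."""
    pieces = []
    pos = 0
    while pos < len(pattern):
        m = TOKEN.match(pattern, pos)
        if m is None:
            # Truncated entries (ending in " ...") land here on purpose.
            raise ValueError(f"cannot parse pattern at {pattern[pos:]!r}")
        if m.group("choices"):
            letters = m.group("choices").upper().replace(",", "")
            element = ("[^" if m.group("neg") else "[") + letters + "]"
        elif m.group("single") in "xX":
            element = "."  # assumption: x (or X) matches any residue
        else:
            letter = m.group("single").upper()
            element = "[^" + letter + "]" if m.group("neg") else letter
        if m.group("lo"):
            element += "{" + m.group("lo") + "," + m.group("hi") + "}"
        elif m.group("count"):
            element += "{" + m.group("count") + "}"
        pieces.append(element)
        pos = m.end()
    return "".join(pieces)


def scan_sequence(sequence, patterns):
    """Return sorted (1-based start, pattern name, matched text) tuples."""
    hits = []
    for name, pattern in patterns.items():
        regex = re.compile(listing_pattern_to_regex(pattern))
        for m in regex.finditer(sequence.upper()):
            hits.append((m.start() + 1, name, m.group()))
    return sorted(hits)


# Complete (untruncated) entries copied from the listing.
PATTERNS = {
    "1433_1": "RNL(L,I)SV(G,A)YKN(I,V)",
    "1433_2": "YK(D,E)STLIMQLL(R,H)DNLTLW(T,A)(S,A)",
    "25a_Synth_2": "RPVILDPx(D,E)PT",
    "Zinc_Finger_C2h2": "Cx{2,4}Cx3(L,I,V,M,F,Y,W,C)x8Hx{3,5}H",
}

if __name__ == "__main__":
    query = "MAFQCRICMRNFSRSDHLTTHIRTHLRQKRNLLSVAYKNVAE"  # made-up sequence
    for start, name, text in scan_sequence(query, PATTERNS):
        print(start, name, text)
```

A real run of MOTIFS would also print the PROSITE documentation for each hit; the sketch reports only the position and the matched residues.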
Profiles
Read the essay about Profile Analysis entitled "Associating Distantly Related Proteins and Finding Structural Motifs" by John Devereux (principal author of the GCG package).
Profiles are mathematical representations of a conserved sequence region that are built from a multiple alignment.
There is no specific limitation on the size of the multiple alignment region or on the number of sequences in the alignment. However, an alignment confined to a highly conserved region with only similar sequences will make a better profile. Searches with motifs that contain too much ambiguity will match too many sequences in a database.
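As a rough picture of what "a mathematical representation built from a multiple alignment" means, the Python sketch below tabulates residue frequencies column by column. It is a deliberate simplification and not what PROFILEMAKE computes; the alignment, the pseudocount value, and the function names `column_frequencies` and `consensus` are all invented for illustration.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_frequencies(alignment, pseudocount=1.0):
    """Return one {residue: frequency} table per alignment column.

    Gap characters are ignored. The pseudocount (an assumed value) keeps
    residues that were never observed from getting a frequency of zero.
    """
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for col in range(length):
        counts = Counter(
            row[col].upper() for row in alignment if row[col].upper() in AMINO_ACIDS
        )
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append(
            {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}
        )
    return profile


def consensus(profile):
    """Most frequent residue in each column."""
    return "".join(max(column, key=column.get) for column in profile)


if __name__ == "__main__":
    # Made-up alignment of a short conserved region; "-" marks a gap.
    alignment = ["GAGKSTL", "GAGKSTV", "GTGKTTI", "G-GKSSL"]
    profile = column_frequencies(alignment)
    print(consensus(profile))
    print(round(profile[0]["G"], 3), round(profile[6]["L"], 3))
```

The fully conserved first column gives G a much higher frequency than L gets in the variable last column, which is why an alignment confined to a well-conserved region yields a more selective profile.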
Profile searching has four steps:
- assembly of a family of related sequences into a multiple sequence alignment with PILEUP
- construction of a profile from the alignment with the program PROFILEMAKE
- comparison of the profile to a database of sequences with PROFILESEARCH
- display of the best similarities found with PROFILEGAP or PROFILESEGMENTS
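The four steps can be imitated in a few lines of Python to show how the pieces fit together. This is only a toy stand-in under stated assumptions: the alignment is typed in by hand rather than produced by PILEUP, the profile is a simple log-odds table against a uniform background rather than PROFILEMAKE output, and the search slides the profile along each sequence without gaps, whereas the GCG programs are more elaborate (for example, they allow gaps). All names and sequences are invented.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background


def make_profile(alignment, pseudocount=1.0):
    """Step 2 stand-in: log-odds score for every residue in every column."""
    profile = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / BACKGROUND)
            for aa in AMINO_ACIDS
        })
    return profile


def best_segment(profile, sequence):
    """Best ungapped window: (score, 1-based start, segment), or None."""
    width = len(profile)
    best = None
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(col.get(res, 0.0) for col, res in zip(profile, window))
        if best is None or score > best[0]:
            best = (score, start + 1, window)
    return best


def profile_search(profile, database):
    """Step 3 stand-in: rank database sequences by their best window."""
    hits = []
    for name, seq in database.items():
        segment = best_segment(profile, seq.upper())
        if segment is not None:  # sequences shorter than the profile are skipped
            hits.append((name,) + segment)
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


if __name__ == "__main__":
    # Step 1 stand-in: a hand-made alignment of a conserved region.
    alignment = ["GAGKSTL", "GAGKSTV", "GTGKTTI", "G-GKSSL"]
    database = {  # made-up sequences
        "seqA": "MLLGAGKSTLAAR",
        "seqB": "MKVLAAGWDEPRT",
        "seqC": "PPGTGKTTIQQ",
    }
    profile = make_profile(alignment)
    # Step 4 stand-in: display the best similarities found.
    for name, score, start, segment in profile_search(profile, database):
        print(f"{name}  score={score:6.2f}  start={start}  {segment}")
```

The two sequences that contain a copy of the conserved region score well above the unrelated one, which is the behaviour a profile search is meant to provide.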
A single sequence can be searched with a library of different profiles using the PROFILESCAN program.
Using Computers for Molecular Biology
Stuart M. Brown, Ph.D., RCR, NYU Medical Center Comments to: browns02@mcrcr.med.nyu.edu