
Databases

The databases available for BLAST searching (at NCBI) are:

Peptide (protein) Sequence Databases

  • nr = All non-redundant GenBank CDS translations+PDB+SwissProt+PIR
  • month = All new or revised GenBank CDS translations+PDB+SwissProt+PIR released in the last 30 days.
  • swissprot = The SWISS-PROT protein sequence database
  • yeast = Yeast (Saccharomyces cerevisiae) protein sequences.
  • pdb = Sequences derived from the 3-dimensional structures in the Brookhaven Protein Data Bank
  • kabat = Kabat's database of sequences of immunological interest
  • alu = Translations of select Alu repeats

Nucleotide Sequence Databases

  • nr = All non-redundant GenBank+EMBL+DDBJ+PDB sequences (but no ESTs or STSs)
  • month = All new or revised GenBank+EMBL+DDBJ+PDB sequences released in the last 30 days.
  • dbest = Expressed Sequence Tags
  • dbsts = Sequence Tagged Sites
  • yeast = Yeast (Saccharomyces cerevisiae) genomic nucleotide sequences
  • pdb = Nucleotide sequences derived from 3-dimensional protein structures in the Brookhaven Protein Data Bank
  • kabat = Kabat's database of sequences of immunological interest
  • vector = Vector subset of GenBank
  • mito = Database of mitochondrial sequences, Rel. 1.0, July 1995
  • alu = Select Alu repeats
  • epd = Eukaryotic Promoter Database
  • gss = Genome Survey Sequences; includes single-pass genomic data, exon-trapped sequences, and Alu PCR sequences.
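
Because several names (nr, month, yeast, pdb, kabat, alu) appear in both lists while others exist for only one sequence type, it can help to check a name against the right list before submitting a search. The following is a minimal sketch in Python that simply transcribes the two lists above; the function names `blast_databases` and `is_available` are hypothetical helpers for illustration and are not part of BLAST or any NCBI tool.

```python
# Database names transcribed from the two BLAST lists above.
PEPTIDE_DBS = {"nr", "month", "swissprot", "yeast", "pdb", "kabat", "alu"}
NUCLEOTIDE_DBS = {
    "nr", "month", "dbest", "dbsts", "yeast", "pdb",
    "kabat", "vector", "mito", "alu", "epd", "gss",
}


def blast_databases(sequence_type):
    """Return the set of database names listed for a sequence type.

    sequence_type is "peptide" or "nucleotide" (labels chosen for this sketch).
    """
    if sequence_type == "peptide":
        return PEPTIDE_DBS
    if sequence_type == "nucleotide":
        return NUCLEOTIDE_DBS
    raise ValueError("sequence_type must be 'peptide' or 'nucleotide'")


def is_available(db_name, sequence_type):
    """True if db_name appears in the list for the given sequence type."""
    return db_name.lower() in blast_databases(sequence_type)


if __name__ == "__main__":
    print(is_available("vector", "nucleotide"))
    print(is_available("vector", "peptide"))
```

The check only reflects the lists as printed on this page; the databases actually offered by the server may differ.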

The databases available for FastA searching (at the RCR) are:

Protein

  • sp:* = SwissProt - Amos Bairoch's protein sequence database (extremely well organized and annotated)
  • gp:* = GenPept - Translations of all GenBank DNA seqs (according to the exons in their feature tables)
  • pir:* - Protein Information Resource
    • pir1:* - Annotated PIR entries
    • pir2:* - New PIR entries
    • pir3:* - Unverified PIR entries (your guess is as good as mine what they mean by "unverified")
    • pir4:* - Unencoded or untranslated
    • nrl_3d:* - Sequences from 3-dimensional structures in the Brookhaven Protein Data Bank
  • Prosite - consensus seqs of conserved protein domains
  • TFD - Transcription Factor database

DNA

  • VECTOR - vector sequences
  • MALARIA - malaria genomic sequences
  • gb:* - all of GenBank (includes EMBL, DDBJ, PDB), updated daily

GenBank Subdivisions

  • gb_ba:* - Bacterial
  • gb_in:* - Invertebrate
  • gb_om:* - Other Mammalian (non-rodent, non-primate)
  • gb_ov:* - Other Vertebrate (non-mammalian vertebrates)
  • gb_or:* - Organelle
  • gb_pat:* - Patents
  • gb_ph:* - Phage
  • gb_pl:* - Plant
  • gb_pr:* - Primate
  • gb_ro:* - Rodent
  • gb_st:* - Structural RNA
  • gb_sy:* - Synthetic sequences (recombinant constructs, etc.)
  • gb_un:* - Unannotated
  • gb_vi:* - Viral
  • gb_est*:* - Expressed Sequence Tags (short cDNAs) - now has sections est1 to est9, with more added each quarter.
  • gb_sts:* - Sequence Tagged Sites
  • gb_gss:* - Genomic Survey Sequences (large genomic contigs)
  • gb_htg:* - High Throughput Genomic sequences (single pass sequences churned out by the genome projects, unannotated and filled with errors)
  • gb_tag:* - ESTs + STS + GSS + HTG
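
The subdivision names above all follow the same pattern: the prefix gb_, a short division code, and the wildcard :* to select every entry. As a rough illustration, the Python sketch below builds such a specification from a division code. The table is transcribed from the list above (the EST sections are left out because they use the separate gb_est*:* form), and `search_set` is a hypothetical helper, not a command of the FastA software.

```python
# Division codes transcribed from the GenBank subdivision list above.
# The EST sections (gb_est*:*) are omitted because they do not follow
# the simple gb_<code>:* pattern.
GENBANK_DIVISIONS = {
    "ba": "Bacterial",
    "in": "Invertebrate",
    "om": "Other Mammalian",
    "ov": "Other Vertebrate",
    "or": "Organelle",
    "pat": "Patents",
    "ph": "Phage",
    "pl": "Plant",
    "pr": "Primate",
    "ro": "Rodent",
    "st": "Structural RNA",
    "sy": "Synthetic sequences",
    "un": "Unannotated",
    "vi": "Viral",
    "sts": "Sequence Tagged Sites",
    "gss": "Genomic Survey Sequences",
    "htg": "High Throughput Genomic sequences",
    "tag": "ESTs + STS + GSS + HTG",
}


def search_set(code):
    """Return the specification for one subdivision, e.g. 'gb_pr:*'."""
    code = code.lower()
    if code not in GENBANK_DIVISIONS:
        raise KeyError(f"unknown GenBank division code: {code}")
    return f"gb_{code}:*"


if __name__ == "__main__":
    for code in ("pr", "ro"):
        print(search_set(code), "-", GENBANK_DIVISIONS[code])
```

Restricting a search to one subdivision, such as gb_pr:* for primate sequences, searches far fewer entries than gb:*, which covers all of GenBank.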


Using Computers for Molecular Biology
Stuart M. Brown, Ph.D., RCR, NYU Medical Center
Comments to: browns02@mcrcr.med.nyu.edu