This document currently includes a more detailed description of the
fields used in the Pfam database. The format of Pfam entries has become
stricter, and we now enforce some ordering of the fields. Pfam entries
are composed of four sections, shown in the figure below.

          __________________________________
          |                                |
          |         Header Section         |
          |                                |
          |________________________________|
          |                                |
          |       Reference Section        |
          |                                |
          |________________________________|
          |                                |
          |        Comment Section         |
          |                                |
          |________________________________|
          |                                |
          |       Alignment Section        |
          |                                |
          |________________________________|


Header Section:
---------------

The header section mainly contains compulsory fields. These include
Pfam-specific information such as accession numbers and identifiers, as
well as a short description of the family. The only non-compulsory
field in the header section is the PI field. All the fields in this
section are described below.

AC   Accession number:   One word in the form PFxxxxx or PBxxxxxx

     The Pfam-A accession numbers PFxxxxx are the stable identifiers
     for Pfam families. The Pfam-B accession numbers PBxxxxxx are not
     stable between releases of Pfam.

     PFxxxxx  for Pfam-A entries
     PBxxxxxx for Pfam-B entries

ID   Identification:   One word less than 16 characters

     This field is designed to be a meaningful identifier for the
     family. Capitalisation of the first letter is preferred.
     Underscores are used in place of spaces, and hyphens are only
     used to mean hyphens.

DE   Definition:   80 characters or less.

     This must be a one-line description of the Pfam family.

AU   Author:   Author of the entry.

     The format for this record is shown below. It is a
     comma-separated list on a single line. The most authoritative
     author is at the left of the author list.

     AU   Bloggs JJ, Bloggs JE

AL   Alignment method of seed:   The method used to align the seed members.
     This field has a restricted vocabulary. The currently approved AL
     lines are shown below. Note that this field gives only a guide to
     the method used for alignment construction. You may find, for
     example, that ClustalW does not give an alignment identical to
     that found in Pfam even if the AL line shows Clustalw as the
     method.

     AL   Clustalv
     AL   Clustalw
     AL   Clustalw_mask_xxxx
     AL   Domainer
     AL   HMM_built_from_alignment
     AL   HMM_simulated_annealing
     AL   Manual
     AL   Prosite_pattern
     AL   Prodom
     AL   Structure_superposition
     AL   pftools
     AL   Unknown

     Any method can have _manual appended to it to indicate minor
     changes, e.g.

     AL   Clustalw_manual

     Manual alignments are those from any method which have been
     altered by hand.

BM   HMM building command lines.

     See the HMMER 2 user's manual for full instructions on building
     HMMs. Also see URL: http://hmmer.wustl.edu/

     An example of the BM lines from a single entry:

     BM   hmmbuild HMM SEED
     BM   hmmcalibrate --seed 0 HMM

     All models are calibrated using a seed of zero to allow exact
     replication of HMM construction.

SE   Source of seed:   The source suggesting seed members belong to a family.

GA   Gathering method:   Search threshold to build the full alignment.

     GA lines are the thresholds used in the hmmsearch command line.
     An example GA line is shown below with the corresponding
     hmmsearch command line.

     GA   25 15

     hmmsearch -T 25 --domT 15 HMM DB

     The -T option specifies the whole sequence score in bits, and
     the --domT option specifies the per-domain threshold in bits.

NC   Noise cutoff:   Two numbers

     The formerly optional NC field is now compulsory. It now refers
     to the bit scores of the highest scoring match not in the full
     alignment. An example NC line is shown below.

     NC   19.50 18.10

     The first number is the highest whole sequence score in bits of
     a match not in the full alignment, and the second number is the
     highest per-domain score in bits of a match not in the full
     alignment. These two scores do not necessarily refer to the same
     sequence.
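The relationship between a GA line and its hmmsearch command line, and the two-number layout shared by the GA and NC lines, can be sketched in Python as follows. This is only an illustration: the function names parse_two_scores and ga_to_hmmsearch are hypothetical, and HMM and DB are the placeholder file names used in the example above, not real paths.

```python
def parse_two_scores(line):
    """Split a two-number line such as 'NC 19.50 18.10' into
    (tag, whole_sequence_bits, per_domain_bits)."""
    parts = line.split()
    if len(parts) != 3:
        raise ValueError("expected a tag followed by two numbers: %r" % line)
    tag, seq_bits, dom_bits = parts
    return tag, float(seq_bits), float(dom_bits)


def ga_to_hmmsearch(ga_line, hmm="HMM", db="DB"):
    """Build the hmmsearch argument list that corresponds to a GA line.

    'hmm' and 'db' default to the placeholder names from the text.
    """
    parts = ga_line.split()
    if len(parts) != 3 or parts[0] != "GA":
        raise ValueError("not a GA line: %r" % ga_line)
    _, seq_bits, dom_bits = parts
    # Validate that both thresholds are numeric, but keep the original
    # text so the command line shows the numbers exactly as written.
    float(seq_bits)
    float(dom_bits)
    return ["hmmsearch", "-T", seq_bits, "--domT", dom_bits, hmm, db]


if __name__ == "__main__":
    print(" ".join(ga_to_hmmsearch("GA 25 15")))
    print(parse_two_scores("NC 19.50 18.10"))
```

Returning an argument list rather than a single string keeps the sketch usable with subprocess-style interfaces without any shell quoting.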

TC   Trusted cutoff:   Two numbers

     The formerly optional TC field is now compulsory. It now refers
     to the bit scores of the lowest scoring match in the full
     alignment. An example TC line is shown below.

     TC   23.00 6.10

     The first number is the lowest whole sequence score in bits of a
     match in the full alignment, and the second number is the lowest
     per-domain score in bits of a match in the full alignment. These
     two scores do not necessarily refer to the same sequence.

PI   Previous IDs:   A single line, with semicolon-separated old identifiers

     The most recent names are stored on the left. This field is
     non-compulsory.


Reference Section:
------------------

The reference section mainly contains cross-links to other databases
and literature references. All the fields in this section are
described below.

DC   Database Comment:   Comment for database reference.

DR   Database Reference:   Reference to external database.

     All DR lines end in a semicolon. Pfam carries links to a variety
     of databases; this information is found in DR lines. The format
     is

     DR   Database; Primary-id;

     For SCOP links a third field is added, indicating the level of
     placement in the SCOP hierarchy. Examples of each database link
     are shown below. It is expected that the format of the SCOP
     links will be changed to include chain and region information.

     DR   EXPERT; jeisen@leland.stanford.edu;
     DR   MIM; 236200;
     DR   PFAMB; PB000001;
     DR   PRINTS; PR00012;
     DR   PROSITE; PDOC00017;
     DR   PROSITE_PROFILE; PS50225;
     DR   SCOP; 7rxn; sf;
     DR   SCOP; 1pii; fa;
     DR   SMART; CBS;
     DR   URL; http://www.gcrdb.uthscsa.edu/;

     Links to PDBSUM are also derived from the SCOP DR lines.

RC   Reference Comment:   Comment for literature reference.
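
The DR line format described above (semicolon-terminated fields, with an optional third field for SCOP links) lends itself to a small parser. The following Python sketch is one possible reading of that format, not an official Pfam tool; the function name parse_dr is hypothetical.

```python
def parse_dr(line):
    """Parse a DR line into (database, primary_id, extra).

    'extra' is the optional third field (used for SCOP links in the
    examples above) or None when the line has only two fields.
    """
    if not line.startswith("DR"):
        raise ValueError("not a DR line: %r" % line)
    body = line[2:].strip()
    if not body.endswith(";"):
        raise ValueError("DR lines must end in a semicolon: %r" % line)
    # Drop the trailing semicolon, then split the remaining fields.
    fields = [field.strip() for field in body[:-1].split(";")]
    if len(fields) not in (2, 3) or not all(fields):
        raise ValueError("expected two or three fields: %r" % line)
    extra = fields[2] if len(fields) == 3 else None
    return fields[0], fields[1], extra


if __name__ == "__main__":
    for example in ("DR   PRINTS; PR00012;",
                    "DR   SCOP; 7rxn; sf;",
                    "DR   URL; http://www.gcrdb.uthscsa.edu/;"):
        print(parse_dr(example))
```

Splitting on the semicolon rather than on whitespace keeps URLs and e-mail addresses intact as a single primary identifier.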

RN   Reference Number:   Digit in square brackets

     A reference number precedes each literature reference, which
     spans multiple lines.

     RN   [1]

RM   Reference Medline:   Eight-digit number

     An example RM line is shown below.

     RM   91006031

     The number can be found as the UI number in PubMed:
     http://www.ncbi.nlm.nih.gov/PubMed/

RT   Reference Title:   Title of paper.

RA   Reference Author:

     All RA lines use the following format:

     RA   Bateman A, Eddy SR, Mesyanzhinov VV;

RL   Reference Location:

     The reference line is in the format below.

     RL   Journal abbreviation year;volume:page-page.

     RL   Virus Genes 1997;14:163-165.
     RL   J Mol Biol 1994;242:309-320.

     Journal abbreviations can be checked at
     http://expasy.hcuge.ch/cgi-bin/jourlist?jourlist.txt

     Journal abbreviations have no full stops, and page numbers are
     not abbreviated.


Comment Section:
----------------

The comment section contains functional information about the Pfam
family. The only field in the comment section is the CC field.

CC   Comment:   Comment lines provide annotation and other information.

     Annotation in CC lines does not have a strict format. Links to
     Pfam families can be provided with the syntax Pfam:PFxxxxx.
     Links to SWISS-PROT and SP-TrEMBL sequences can be provided with
     the syntax Swiss:Accession.


Alignment Section:
------------------

SQ   Sequence:   Number of sequences, start of alignment.

//   End of alignment

The alignment is in Stockholm format. This includes mark-ups of four
types:

     #=GF
     #=GC
     #=GS
     #=GR

Recommended placements:

     #=GF   Above the alignment
     #=GC   Below the alignment
     #=GS   Above the alignment or just below the corresponding sequence
     #=GR   Just below the corresponding sequence

The alignment formats have the following size limits:

     : max 4096 characters.
     : max 50 characters.
     max 50 characters.

These details can also be found on the web, in the belvu alignment
viewer documentation. See URL:
http://www.cgr.ki.se/cgr/groups/sonnhammer/Belvu.html
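
The four mark-up types and their recommended placements listed above can be recognised by their line prefix alone. The Python sketch below shows one way to do this; the function name classify_line and the sample lines are made up for illustration and are not taken from a real Pfam entry.

```python
# Recommended placement for each mark-up prefix, as listed in the text.
PLACEMENT = {
    "#=GF": "above the alignment",
    "#=GC": "below the alignment",
    "#=GS": "above the alignment or just below the corresponding sequence",
    "#=GR": "just below the corresponding sequence",
}


def classify_line(line):
    """Return (kind, placement) for one line of the alignment section.

    kind is a mark-up prefix such as '#=GF', 'end' for the '//'
    terminator, or 'other' for anything else (for example an aligned
    sequence). placement is None unless the line is a mark-up line.
    """
    stripped = line.strip()
    if stripped.startswith("//"):
        return "end", None
    prefix = stripped[:4]
    if prefix in PLACEMENT:
        return prefix, PLACEMENT[prefix]
    return "other", None


if __name__ == "__main__":
    # Hypothetical sample lines, invented for this sketch.
    sample = [
        "#=GF ID example_family",
        "seq1/1-5   ACDEF",
        "#=GR seq1/1-5 SS HHHHH",
        "#=GC SS_cons  HHHHH",
        "//",
    ]
    for entry in sample:
        print(classify_line(entry))
```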