        PFAM : Multiple alignments and profile HMMs of protein domains

                                RELEASE 4.0
                   --------------------------------------

1. INTRODUCTION

Pfam is a collection of protein family alignments which were constructed
semi-automatically using hidden Markov models (HMMs). Sequences that were
not covered by Pfam were clustered and aligned automatically, and are
released as Pfam-B. Pfam families have permanent accession numbers and
contain functional annotation and cross-references to other databases,
while Pfam-B families are re-generated at each release and are unannotated.

See

    http://www.sanger.ac.uk/Software/Pfam/
    http://pfam.wustl.edu/
    http://www.cgr.ki.se/Pfam/

2. STATISTICS

                             Pfam                            Pfam-B
                ------------------------------   ------------------------------
Release  Date   families  sequences   residues   families  sequences   residues  Source
-------  -----  --------  ---------  ---------   --------  ---------  ---------  ----------------------
0.2      01/96       100      10431    2246421      11763      32081    9200334  Swiss 32
1.0      04/96       175      15610    3560959      11929      31931    8957230  Swiss 33
2.0      03/97       527      28170    6770529      13289      31349    8224614  Swiss 34
2.1      10/97       527      28205    6790960      13289      31349    8224614  Swiss 34
3.0      06/98       806      99043   22766133      33550      79544   20648530  Swiss 35 + SP-TrEMBL 5
3.1      09/98      1313     114750   27573470      33550      79544   20648530  Swiss 35 + SP-TrEMBL 5
3.2      10/98      1344     115155   27689081      33550      79544   20648530  Swiss 35 + SP-TrEMBL 5
3.3      12/98      1390     119420   28085438      33550      79544   20648530  Swiss 35 + SP-TrEMBL 5
3.4      01/99      1407     119963   28343136      33550      79544   20648530  Swiss 35 + SP-TrEMBL 5
4.0      05/99      1465     147347   34476183     128689     123610   33470292  Swiss 37 + SP-TrEMBL 9

3. DESCRIPTION OF CHANGES MADE SINCE RELEASE 3.4

Release 4.0 contains 58 new families since the last release.

Pfam 4.0 is now based on Swiss-Prot 37 and SP-TrEMBL 9 sequences. These
databases can be accessed from

    ftp://ftp.ebi.ac.uk/pub/databases/swissprot/release/
    ftp://ftp.ebi.ac.uk/pub/databases/trembl/

The flatfile formats have changed since release 3.4.
To allow users time to rewrite any parsers, we have provided a script to
convert the new format into the old format. This script will not be kept
up to date.

The flatfiles are now in Stockholm format, a marked-up alignment format.

Several changes have been made to the database fields since release 3.4.
Two new fields have been added. Comments attached to database links are
now called database comments and are given in the DC field. Similarly,
comments attached to literature references are given in the reference
comment field, RC. Please read the file userman.txt for a full description
of the fields in Pfam.

Pfam-B, the automatic clustering of sequences not in Pfam-A (the curated
portion of Pfam), is now made differently. We now use ProDom as the basis
of the clustering. ProDom clusters are filtered to remove subsequences
that contain Pfam-A domains. The ProDom alignments are then split into
domain alignments based on the matches to Pfam-A domains. For example, a
Pfam-A domain that splits a ProDom alignment into two halves will create
two Pfam-B families.

In the case where a ProDom family contains extra members relative to the
Pfam-A family, we have removed the Pfam-A subsequences, so the Pfam-B
entry contains only the members of the family not found in Pfam-A. There
are links between Pfam-A and Pfam-B to indicate where this has happened.
These may represent cases where Pfam-A does not find all true members of
the family.

We are grateful to the many people who contributed data: Rob Finn, Chris
Ponting, Peer Bork, Joerg Schultz, Richard Copley and Tim Dudgeon.

6. DESCRIPTION OF RELEASE FILES

relnotes.txt    - This file.
userman.txt     - A fuller description of Pfam fields.
Pfam-A.full     - Annotation and full alignments in Pfam format of all
                  Pfam-A families.
Pfam-A.seed     - Annotation and seed alignments in Pfam format of all
                  Pfam-A families.
Pfam-B          - All Pfam-B families (generated by Domainer).
swissPfam       - Pfam domain organisation of all Swiss-Prot proteins.
Pfam            - All Pfam-A HMMs in an HMM library searchable with the
                  hmmpfam program.
PfamFrag        - All Pfam-A HMMs in fs (fragment search) mode in an HMM
                  library searchable with the hmmpfam program.
diff            - A list of files for each family that have changed since
                  the last release.

All files are compressed with standard Unix compression.

7. DESCRIPTION OF FIELDS

Compulsory fields:
------------------

AC   Accession number:           Accession number in form PFxxxxx or PBxxxxxx.
ID   Identification:             One word name for family.
DE   Definition:                 Short description of family.
AU   Author:                     Authors of the entry.
AL   Alignment method of seed:   The method used to align the seed members.
SE   Source of seed:             The source suggesting the seed members belong
                                 to one family.
GA   Gathering method:           Search threshold to build the full alignment.
TC   Trusted Cutoff:             Lowest sequence score and domain score of
                                 match in the full alignment.
NC   Noise Cutoff:               Highest sequence score and domain score of
                                 match not in full alignment.
SQ   Sequence:                   Number of sequences in alignment.
//                               End of alignment.

Optional fields:
----------------

DC   Database Comment:           Comment about database reference.
DR   Database Reference:         Reference to external database.
RC   Reference Comment:          Comment about literature reference.
RN   Reference Number:           Reference Number.
RM   Reference Medline:          Eight-digit Medline UI number.
RT   Reference Title:            Reference Title.
RA   Reference Author:           Reference Author.
RL   Reference Location:         Journal location.
PI   Previous identifier:        Record of all previous ID lines.
KW   Keywords:                   Keywords.
CC   Comment:                    Comments.

8. REFERENCES

Papers on Pfam are listed below:

i)   Sonnhammer ELL, Eddy SR, Durbin R.
     Proteins: Structure, Function and Genetics 28:405-420 (1997).

ii)  Sonnhammer ELL, Eddy SR, Birney E, Bateman A, Durbin R.
     Nucleic Acids Research 26:320-322 (1998).

iii) Bateman A, Birney E, Durbin R, Eddy SR, Finn RD, Sonnhammer ELL.
     Nucleic Acids Research 27:260-262 (1999).

We suggest that you reference the most recent paper.

9. COPYRIGHT NOTICE

Pfam - A database of protein domain family alignments and HMMs
Copyright (C) 1996-1999 The Pfam consortium.

This database is free; you can redistribute it and/or modify it under the
terms of the GNU Library General Public License as published by the Free
Software Foundation; either version 2 of the License, or (at your option)
any later version.

In summary, you are free to redistribute *verbatim* copies of Pfam or any
Pfam files in any way you like, including packaging Pfam in proprietary
software, so long as your copy of Pfam retains our copyright notice and
the GNU license. You may also make *modified* copies of Pfam and
distribute them, but your derivative database must be freely distributed
under the GNU LGPL.

Many academic freeware licenses prohibit any form of commercial use. In
contrast, the intent of our license is that Pfam should be freely
available to both industrial and academic researchers, including the use
of the Pfam database in commercial software; however, proprietary
modifications of the Pfam database itself are prohibited. Proprietary
modification of the Pfam database is possible only by a separate formal
licensing agreement from the Pfam consortium and our host institutions.

See the file GNULICENSE for the full text of the GNU Library General
Public License.

This database is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Library General Public
License for more details.

You may also obtain a copy of the GNU LGPL by writing to the Free Software
Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.

Pfam is maintained by a consortium of researchers.
You can contact the Pfam consortium at:

    pfam-admin@sanger.ac.uk

The current members of the Pfam consortium are:

Alex Bateman, Ewan Birney, Kevin Howe, Richard Durbin:
        The Sanger Centre, UK
Erik Sonnhammer, Christian Storm:
        Karolinska Institute, Sweden
Matt Barnhart, Linda Lutfiyya, Sean Eddy:
        Washington University, St Louis, USA

___________________
The Pfam Consortium
May 1999
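
APPENDIX: EXAMPLE OF READING THE FIELDS

For users rewriting parsers for the new flatfiles, the two-letter field
codes listed in section 7 could be read along the following lines. This is
only a minimal sketch, not the consortium's own parser or the conversion
script mentioned in section 3. It assumes that annotation lines take the
form "#=GF <code> <text>", which these notes do not state; userman.txt
remains the authoritative description. The function names and the sample
entry are hypothetical and were invented for illustration.

```python
from collections import defaultdict

# Two-letter codes listed as compulsory in section 7. The "//" terminator
# is not a field, so it is handled separately below.
COMPULSORY = ("AC", "ID", "DE", "AU", "AL", "SE", "GA", "TC", "NC", "SQ")


def parse_annotation(lines):
    """Collect field values from one entry, stopping at the '//' terminator.

    Assumption (not stated in these notes): annotation lines look like
    '#=GF <code> <text>'. Every other line is ignored.
    """
    fields = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if line.strip() == "//":
            break
        if line.startswith("#=GF "):
            parts = line.split(None, 2)
            if len(parts) >= 2:
                code = parts[1]
                text = parts[2] if len(parts) == 3 else ""
                # A code may repeat (e.g. several CC lines), so keep a list.
                fields[code].append(text)
    return dict(fields)


def missing_compulsory(fields):
    """Return the compulsory codes the entry lacks, in section 7 order."""
    return [code for code in COMPULSORY if code not in fields]


# Hypothetical entry, invented for illustration only.
example = [
    "#=GF ID   example_fam",
    "#=GF AC   PF99999",
    "#=GF DE   Made-up family used only in this sketch",
    "#=GF CC   First comment line",
    "#=GF CC   second comment line",
    "//",
]

entry = parse_annotation(example)
print(entry["AC"])
print(missing_compulsory(entry))
```

Values are kept as lists because a field such as CC may occur on several
lines of one entry. The check against the compulsory list is a convenience
for spotting incomplete entries, not a validation of field contents.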