PFAM : Multiple alignments and profile HMMs of protein domains

RELEASE 15.0
--------------------------------------

1. INTRODUCTION

Pfam is a collection of protein family alignments which were constructed
semi-automatically using hidden Markov models (HMMs). Sequences that are
not covered by Pfam are clustered and aligned automatically, and released
as Pfam-B. Pfam families have permanent accession numbers and contain
functional annotation and cross-references to other databases, while
Pfam-B families are re-generated at each release and are unannotated.

2. LOCATIONS

Pfam is available on the web at:

    http://www.sanger.ac.uk/Software/Pfam/
    http://pfam.cgb.ki.se
    http://pfam.wustl.edu/
    http://pfam.jouy.inra.fr/

3. STATISTICS

                             Pfam                          Pfam-B
                ------------------------------  -----------------------------
Release  Date   families  sequences   residues  families  sequences  residues  Source
-------  -----  --------  ---------  ---------  --------  ---------  --------  ------
0.2      01/96       100      10431    2246421     11763      32081   9200334  Swiss 32
1.0      04/96       175      15610    3560959     11929      31931   8957230  Swiss 33
2.0      03/97       527      28170    6770529     13289      31349   8224614  Swiss 34
2.1      10/97       527      28205    6790960     13289      31349   8224614  Swiss 34
3.0      06/98       806      99043   22766133     33550      79544  20648530  Swiss 35 + SP-TrEMBL 5
3.1      09/98      1313     114750   27573470     33550      79544  20648530  Swiss 35 + SP-TrEMBL 5
3.2      10/98      1344     115155   27689081     33550      79544  20648530  Swiss 35 + SP-TrEMBL 5
3.3      12/98      1390     119420   28085438     33550      79544  20648530  Swiss 35 + SP-TrEMBL 5
3.4      01/99      1407     119963   28343136     33550      79544  20648530  Swiss 35 + SP-TrEMBL 5
4.0      05/99      1465     147347   34476183    128689     123610  33470292  Swiss 37 + SP-TrEMBL 9
4.1      07/99      1488     148195   34692597     36739      89640  22510097  Swiss 37 + SP-TrEMBL 9
4.2      08/99      1664     155979   36683193     40017      99587  24062200  Swiss 37 + SP-TrEMBL 9
4.3      09/99      1815     161833   37803491     39506      97492  23115975  Swiss 37 + SP-TrEMBL 9
4.4      11/99      2000     164412   38411490     39200      96055  22552453  Swiss 37 + SP-TrEMBL 9
5.0      01/00      2008     178110   41516321     39228      96077  22506088  Swiss 38 + SP-TrEMBL 11
5.1      02/00      2015     179782   41704446     42357     103709  24762358  Swiss 38 + SP-TrEMBL 11
5.2      03/00      2128     181068   42018555     42163     102843  24471000  Swiss 38 + SP-TrEMBL 11
5.3      05/00      2216     183695   42512479     41974     102024  23952537  Swiss 38 + SP-TrEMBL 11
5.4      06/00      2290     185251   42659663     41885     101728  23774015  Swiss 38 + SP-TrEMBL 11
5.5      09/00      2478     190302   43837632     41232      99302  22716640  Swiss 38 + SP-TrEMBL 11
6.0      01/01      2697     258321   59332756     40681      96571  21789591  Swiss 39 + SP-TrEMBL 14
6.1      03/01      2727     260202   59586847     40230      96128  21545422  Swiss 39 + SP-TrEMBL 14
6.2      04/01      2773     260570   59749821     58924     131100  30096444  Swiss 39 + SP-TrEMBL 14
6.3      05/01      2847     261546   60210817     58539     130194  29602298  Swiss 39 + SP-TrEMBL 14
6.4      05/01      2866     262071   60508404     58297     129257  29370968  Swiss 39 + SP-TrEMBL 14
6.5      06/01      2929     264276   61192015     57891     127861  28892585  Swiss 39 + SP-TrEMBL 14
6.6      08/01      3071     267598   61976627     57477     126378  28143196  Swiss 39 + SP-TrEMBL 14
7.0      01/02      3360     409136   93681071     78233     179966  39684197  Swiss 40 + SP-TrEMBL 18
7.1      03/02      3621     413112   94740523     77733     178097  38784346  Swiss 40 + SP-TrEMBL 18
7.2      04/02      3735     417711   95253221     77408     177141  38445765  Swiss 40 + SP-TrEMBL 18
7.3      05/02      3849     419102   95460308     83108     193509  41854966  Swiss 40 + SP-TrEMBL 18
7.4      07/02      3882     419360   95369593     83166     193462  41907032  Swiss 40 + SP-TrEMBL 18
7.5      08/02      4176     424307   96575650     82343     190360  40849796  Swiss 40 + SP-TrEMBL 18
7.6      09/02      4463     428237   97335169     81669     187725  40073352  Swiss 40 + SP-TrEMBL 18
7.7      10/02      4832     435353   99654093     80001     182610  38092326  Swiss 40 + SP-TrEMBL 18
7.8      11/02      5049     437970  100096164     79427     181013  37469930  Swiss 40 + SP-TrEMBL 18
8.0      02/03      5193     626452  142233861     76370     174607  35686841  Swiss 40.31 + SP-TrEMBL 22.0
9.0      05/03      5722     705698  160653012     98158     232641  46847522  Swiss 41.0 + SP-TrEMBL 23.0
10.0     07/03      6190     733829  167492698     96550     227325  45381141  Swiss 41.10 + SP-TrEMBL 23.15
11.0     10/03      7255     805978  184093744     94757     220445  43457201  Swiss 41.25 + SP-TrEMBL 24.14
12.0     01/04      7316     898590  205356517    108951     262041  53933483  Swiss 42.5 + SP-TrEMBL 25.6
13.0     04/04      7426     899152  205367676    108119     260253  53251778  Swiss 42.12 + SP-TrEMBL 25.12
14.0     05/04      7459     903115  206720690    107460     259106  52378429  Swiss 43.2 + SP-TrEMBL 26.2
15.0     08/04      7503    1105589  251318620    140216     357340  70848731  Swiss 44.0 + SP-TrEMBL 27.0

4. CONSTRUCTION OF PFAM

Pfam is based on a sequence database called Pfamseq. Pfamseq 15 is built
from Swiss-Prot 44.0 and SP-TrEMBL 27.0. These databases can be accessed
at:

    ftp://ftp.ebi.ac.uk/pub/databases/swissprot/release/
    ftp://ftp.ebi.ac.uk/pub/databases/trembl/

Pfam-B has been constructed using PRODOM 2004.1.

5. DESCRIPTION OF CHANGES FROM RELEASE 14.0 TO 15.0

Release 15.0 contains a total of 7503 families, with 70 new families and
26 families deleted since the last release. 74.38% of all proteins in
Pfamseq contain a match to at least one Pfam domain. 53.50% of all
residues in the sequence database fall within Pfam domains.

6. FUTURE FORMAT CHANGES

No major changes are planned for the next release.

7. DESCRIPTION OF RELEASE FILES

relnotes.txt   - This file.
userman.txt    - A fuller description of Pfam fields.
Pfam_ls        - All global (ls mode) Pfam-A HMMs in an HMM library
                 searchable with the hmmpfam program.
Pfam_fs        - All local (fs mode) Pfam-A HMMs in an HMM library
                 searchable with the hmmpfam program.
Pfam-A.seed    - Annotation and seed alignments of all Pfam-A families in
                 Pfam format.
Pfam-A.full    - Annotation and full alignments of all Pfam-A families in
                 Pfam format.
Pfam-A.fasta   - A list of sequences in each Pfam-A family in fasta format.
Pfam-A.dead    - All Pfam-A families that have been removed from the
                 database.
Pfam-B         - All Pfam-B families.
Pfam-C         - A list of all the clans, containing annotation and lists
                 of Pfam-A entries in the clan.
diff           - A list of files for each family that have changed since
                 the last release.
pfamseq        - The underlying sequence database in fasta format.
swisspfam      - Pfam domain organisation of all proteins in Pfamseq.
prior.tar      - A collection of PRIOR files used to build the HMMs of
                 specific families.

8. DESCRIPTION OF FIELDS

Compulsory fields:
------------------

AC   Accession number:          Accession number in form PFxxxxx.version or PBxxxxxx.
ID   Identification:            One word name for family.
DE   Definition:                Short description of family.
AU   Author:                    Authors of the entry.
SE   Source of seed:            The source suggesting the seed members belong to one family.
GA   Gathering method:          Search threshold to build the full alignment.
TC   Trusted Cutoff:            Lowest sequence score and domain score of match in the full alignment.
NC   Noise Cutoff:              Highest sequence score and domain score of match not in the full alignment.
TP   Type:                      Type of family -- presently Family, Domain, Motif or Repeat.
SQ   Sequence:                  Number of sequences in alignment.
AM   Alignment Method:          The order ls and fs hits are aligned to the model to build the full alignment.
//                              End of alignment.

Optional fields:
----------------

DC   Database Comment:          Comment about database reference.
DR   Database Reference:        Reference to external database.
RC   Reference Comment:         Comment about literature reference.
RN   Reference Number:          Reference Number.
RM   Reference Medline:         Eight-digit Medline UI number.
RT   Reference Title:           Reference Title.
RA   Reference Author:          Reference Author.
RL   Reference Location:        Journal location.
PI   Previous identifier:       Record of all previous ID lines.
KW   Keywords:                  Keywords.
CC   Comment:                   Comments.
NE   Pfam accession:            Indicates a nested domain.
NL   Location:                  Location of nested domains - sequence ID, start and end of insert.

Obsolete fields:
----------------

AL   Alignment method of seed:  The method used to align the seed members.

9. ACKNOWLEDGEMENTS

We are grateful to the many people who contributed data: L. Aravind,
Laurence Etwiller, Matthew Bashton, Peer Bork, Richard Copley, Tim
Dudgeon, Anton Enright, Nicola Kerrison, Nina Mian, William Mifsud, Chris
Ponting, Joerg Schultz, Val Wood, David Waterfield, Simon Moxon, Dan Haft,
Owen White and Matthew Fenech, as well as many others.

10. REFERENCES

Papers on Pfam are listed below:

i)   Sonnhammer ELL, Eddy SR, Durbin R.
     Proteins: Structure, Function and Genetics 28:405-420 (1997).

ii)  Sonnhammer ELL, Eddy SR, Birney E, Bateman A, Durbin R.
     Nucleic Acids Research 26:320-322 (1998).

iii) Bateman A, Birney E, Durbin R, Eddy SR, Finn RD, Sonnhammer ELL.
     Nucleic Acids Research 27:260-262 (1999).

iv)  Bateman A, Birney E, Durbin R, Eddy SR, Howe KL, Sonnhammer ELL.
     Nucleic Acids Research 28:263-266 (2000).

v)   Bateman A, Birney E, Cerruti L, Durbin R, Etwiller L, Eddy SR,
     Griffiths-Jones S, Howe KL, Marshall M, Sonnhammer ELL.
     Nucleic Acids Res. 30:276-280 (2002).

vi)  Bateman A, Coin L, Durbin R, Finn RD, Hollich V, Griffiths-Jones S,
     Khanna A, Marshall M, Moxon S, Sonnhammer EL, Studholme DJ, Yeats C,
     Eddy SR.
     Nucleic Acids Res. 32(1):D138-41 (2004).

Please reference the most recent paper.

11. THE PFAM CONSORTIUM

Pfam is maintained by a consortium of researchers. You can contact the
Pfam consortium at:

    pfam-admin@sanger.ac.uk

The current members of the Pfam consortium are:

Alex Bateman, Lachlan Coin, Richard Durbin, Robert Finn,
Sam Griffiths-Jones, Kevin Howe, Mhairi Marshall, Corin Yeats,
Simon Moxon:
    The Wellcome Trust Sanger Institute, UK.

Ewan Birney, Laurence Etwiller:
    The European Bioinformatics Institute, UK.

Lorenzo Cerrutti:
    ISREC, Switzerland.

Erik Sonnhammer, Volker Hollich:
    Karolinska Institute, Sweden.

Sean Eddy, Ajay Khanna, Christian Zmasek:
    Washington University, St Louis, USA.

12. COPYRIGHT NOTICE

Pfam - A database of protein domain family alignments and HMMs
Copyright (C) 1996-2004 The Pfam consortium.

This database is free; you can redistribute it and/or modify it under the
terms of the GNU Library General Public License as published by the Free
Software Foundation; either version 2 of the License, or (at your option)
any later version.

In summary, you are free to redistribute *verbatim* copies of Pfam or any
Pfam files in any way you like, including packaging Pfam in proprietary
software, so long as your copy of Pfam retains our copyright notice and
the GNU license. You may also make *modified* copies of Pfam and
distribute them, but your derivative database must be freely distributed
under the GNU LGPL.

Many academic freeware licenses prohibit any form of commercial use. In
contrast, the intent of our license is that Pfam should be freely
available to both industrial and academic researchers, including the use
of the Pfam database in commercial software; however, proprietary
modifications of the Pfam database itself are prohibited. Proprietary
modification of the Pfam database is possible only by a separate formal
licensing agreement from the Pfam consortium and our host institutions.

See the file GNULICENSE for the full text of the GNU Library General
Public License.

This database is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Library General Public
License for more details.

You may also obtain a copy of the GNU LGPL by writing to the Free Software
Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.

___________________
The Pfam Consortium
August 2004
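
NOTE ON READING THE FIELDS IN SECTION 8

As an illustration of the two-letter field codes listed in section 8, the
following Python sketch collects the fields of a single entry and reports
which compulsory codes are absent. It is a minimal example under stated
assumptions, not the consortium's own tooling: the names parse_entry,
missing_compulsory, COMPULSORY and ACCESSION are hypothetical, the sample
entry is invented, each "x" in PFxxxxx / PBxxxxxx is assumed to stand for
one decimal digit, and annotation lines are assumed to optionally carry a
Stockholm-style "#=GF " prefix.

```python
import re

# Two-letter codes listed as compulsory in section 8 ("//" ends an entry).
COMPULSORY = ("AC", "ID", "DE", "AU", "SE", "GA", "TC", "NC", "TP", "SQ", "AM")

# Assumption: each "x" in PFxxxxx / PBxxxxxx stands for one decimal digit.
ACCESSION = re.compile(r"^(PF\d{5}(\.\d+)?|PB\d{6})$")


def parse_entry(text):
    """Collect field values by two-letter code until the '//' terminator."""
    fields = {}
    for raw in text.splitlines():
        line = raw.strip()
        if line == "//":
            break
        # Assumption: annotation lines may start with a '#=GF ' prefix.
        if line.startswith("#=GF "):
            line = line[len("#=GF "):]
        parts = line.split(None, 1)
        # Keep only lines whose first token looks like a two-letter code.
        if len(parts) == 2 and len(parts[0]) == 2 and parts[0].isupper():
            fields.setdefault(parts[0], []).append(parts[1].strip())
    return fields


def missing_compulsory(fields):
    """Return the compulsory codes that do not occur in the entry."""
    return [code for code in COMPULSORY if code not in fields]


# Invented sample entry, for illustration only.
SAMPLE = """\
AC   PF00000.1
ID   Example_fam
DE   Hypothetical example family
AU   Anonymous
TP   Domain
SQ   2
//
"""

entry = parse_entry(SAMPLE)
print(entry["ID"][0])
print(bool(ACCESSION.match(entry["AC"][0])))
print(missing_compulsory(entry))
```

Values are stored as lists because codes such as CC or DR can occur on
several lines within one entry.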