Protein structure predictions for Pfam families
===============================================

This directory provides structural models and contact maps for 6,370 Pfam families (Pfam 33.1).
Models were generated by the Baker group (https://www.bakerlab.org/) using trRosetta (1) from multiple sequence alignments.

Overview
--------

Download the archive and verify its integrity:

    $ wget http://ftp.ebi.ac.uk/pub/databases/Pfam/baker/pfam33.1_baker.tar.gz
    $ wget http://ftp.ebi.ac.uk/pub/databases/Pfam/baker/pfam33.1_baker.tar.gz.md5
    $ md5sum -c pfam33.1_baker.tar.gz.md5

Untar the archive:

    $ tar -xzf pfam33.1_baker.tar.gz
    $ ls -l
    drwxr-xr-x 5 interpro interpro        4096 Feb 25 22:49 pfam33.1_baker
    -rw-r--r-- 1 interpro interpro 86271431608 Feb 25 22:49 pfam33.1_baker.tar.gz
    -rw-r--r-- 1 interpro interpro          57 Feb 25 22:49 pfam33.1_baker.tar.gz.md5

The pfam33.1_baker directory has the following structure:

    .
    |-- a3m
    |   |-- PF00242.a3m
    |   |-- PF00257.a3m
    |   |-- ...
    |   `-- PF19222.a3m
    |-- dist
    |   |-- PF00242.npz
    |   |-- PF00257.npz
    |   |-- ...
    |   `-- PF19222.npz
    |-- lddt
    |   |-- PF00242.npz
    |   |-- PF00257.npz
    |   |-- ...
    |   `-- PF19222.npz
    `-- pdb
        |-- PF00242.pdb
        |-- PF00257.pdb
        |-- ...
        `-- PF19222.pdb

* a3m: contains the UniProt alignments for Pfam families, in the A3M file format.
* dist: contains predicted protein inter-residue distances and orientation, in the NumPy NPZ file format.
* lddt: contains predicted residue-wise local Distance Difference Test (lDDT) scores (2), in the NumPy NPZ file format.
* pdb: directory of predicted structures, in the PDB file format.

Contact maps
------------

Contact maps can be retrieved from the files in the /dist/ directory, using the Python code below (requires the NumPy package).

    $ python
    Python 3.6.9 (default, Oct 8 2020, 12:12:24)
    [GCC 8.4.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import numpy as np
    >>> data = np.load("pfam33.1_baker/dist/PF00242.npz")
    >>> dist = data["dist"]
    >>> matrix = np.sum(dist[:,:,1:13], axis=-1)

The dist array holds the predicted residue-residue distances and has shape [L, L, 37], where L is the length of the first sequence in the alignment. The third dimension holds the probabilities that the Cb-Cb distance falls in each bin: index 0 is the no-contact bin (>20A), and indices 1 to 36 cover 2-20A in 0.5A steps (2-2.5A, 2.5-3A, ..., 19.5-20A), following the trRosetta convention (1).

matrix is the contact map: for each residue pair, the probability that the Cb-Cb distance is <8A, obtained by summing indices 1 to 12.

lDDT scores
-----------

The local Distance Difference Test (lDDT) scores can be retrieved from the files in the /lddt/ directory.

    $ python
    Python 3.6.9 (default, Oct 8 2020, 12:12:24)
    [GCC 8.4.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import numpy as np
    >>> data = np.load("pfam33.1_baker/lddt/PF00242.npz")
    >>> lddt = data["lddt"]

lddt is an array with one predicted score per residue; its size is the length of the sequence.

References
----------

(1) Yang J, et al. Improved protein structure prediction using predicted interresidue orientations. Proc Natl Acad Sci U S A. 2020
(2) Mariani V, et al. lDDT: a local superposition-free score for comparing protein structures and models using distance difference tests. Bioinformatics. 2013
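
Example: listing confident contacts
-----------------------------------

The contact-map and lDDT steps described above can be combined in a short script. The following is a minimal sketch, not part of the distributed data or of trRosetta: the helper names (load_family, contact_map, top_contacts) and the default thresholds (sequence separation of 6, probability of 0.5) are illustrative assumptions. Only the "dist" and "lddt" keys and the 1:13 slice come from this README.

```python
import os

import numpy as np


def load_family(root, accession):
    """Load the dist and lddt arrays for one family.

    root is the unpacked archive directory (e.g. "pfam33.1_baker") and
    accession a Pfam accession (e.g. "PF00242").
    """
    dist = np.load(os.path.join(root, "dist", accession + ".npz"))["dist"]
    lddt = np.load(os.path.join(root, "lddt", accession + ".npz"))["lddt"]
    return dist, lddt


def contact_map(dist):
    """Sum indices 1 to 12 of the last axis of an [L, L, 37] array."""
    return np.sum(dist[:, :, 1:13], axis=-1)


def top_contacts(matrix, min_separation=6, min_probability=0.5):
    """Return (i, j, probability) tuples with j - i >= min_separation.

    Indices are 0-based positions in the first sequence of the alignment.
    Both thresholds are hypothetical defaults chosen for illustration.
    """
    length = matrix.shape[0]
    # Upper triangle only, skipping pairs that are close in sequence.
    i_idx, j_idx = np.triu_indices(length, k=min_separation)
    probs = matrix[i_idx, j_idx]
    keep = probs >= min_probability
    i_idx, j_idx, probs = i_idx[keep], j_idx[keep], probs[keep]
    order = np.argsort(-probs)  # highest probability first
    return [(int(i_idx[n]), int(j_idx[n]), float(probs[n])) for n in order]
```

With the archive unpacked in the current directory, the helpers could be used as follows:

    >>> dist, lddt = load_family("pfam33.1_baker", "PF00242")
    >>> contacts = top_contacts(contact_map(dist))
    >>> mean_lddt = float(np.mean(lddt))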