Protein structure predictions for Pfam families
===============================================

This directory provides structural models and contact maps for 6,370 Pfam families (Pfam 33.1).
Models were generated by the Baker group (https://www.bakerlab.org/) using trRosetta (1) from multiple sequence alignments.

Overview
--------

There are two archives available for download:

  * trRosetta.full.tar.gz - the complete data, which includes contact maps, pdb files, lDDT scores and alignments
  * trRosetta.pdb.tar.gz - a smaller archive containing only the pdb files

Download the archive and verify its integrity:

Full archive:

    $ wget http://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam33.1/structure_models/trRosetta.full.tar.gz
    $ wget http://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam33.1/structure_models/trRosetta.full.tar.gz.md5
    $ md5sum -c trRosetta.full.tar.gz.md5

Smaller pdb archive:

    $ wget http://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam33.1/structure_models/trRosetta.pdb.tar.gz
    $ wget http://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam33.1/structure_models/trRosetta.pdb.tar.gz.md5
    $ md5sum -c trRosetta.pdb.tar.gz.md5

Untar the archive:

Full archive:

    $ tar -xzf trRosetta.full.tar.gz
    $ ls -l
    drwxr-xr-x 5 interpro interpro        4096 Feb 25 22:49 trRosetta_full
    -rw-r--r-- 1 interpro interpro 87255064553 Feb 25 22:49 trRosetta.full.tar.gz
    -rw-r--r-- 1 interpro interpro          56 Feb 25 22:49 trRosetta.full.tar.gz.md5

Smaller pdb archive:

    $ tar -xzf trRosetta.pdb.tar.gz
    $ ls -l
    drwxr-xr-x 5 interpro interpro    512000 Feb 25 22:49 pdb
    -rw-r--r-- 1 interpro interpro 321154409 Feb 25 22:49 trRosetta.pdb.tar.gz
    -rw-r--r-- 1 interpro interpro        55 Feb 25 22:49 trRosetta.pdb.tar.gz.md5

If you downloaded the full archive, the trRosetta_full directory has the following structure:

    .
    |-- a3m
    |   |-- PF00242.a3m
    |   |-- PF00257.a3m
    |   |-- ...
    |   `-- PF19222.a3m
    |-- dist
    |   |-- PF00242.npz
    |   |-- PF00257.npz
    |   |-- ...
    |   `-- PF19222.npz
    |-- lddt
    |   |-- PF00242.npz
    |   |-- PF00257.npz
    |   |-- ...
    |   `-- PF19222.npz
    `-- pdb
        |-- PF00242.pdb
        |-- PF00257.pdb
        |-- ...
        `-- PF19222.pdb

  * a3m: contains the UniProt alignments for Pfam families, in the A3M file format.
  * dist: contains predicted protein inter-residue distances and orientations, in the NumPy NPZ file format.
  * lddt: contains predicted residue-wise local Distance Difference Test (lDDT) scores (2), in the NumPy NPZ file format.
  * pdb: contains the predicted structures, in the PDB file format.

If you downloaded the smaller trRosetta.pdb.tar.gz archive, you will only have the pdb directory containing the PDB files.

Contact maps
------------

Contact maps can be retrieved from the files in the dist directory, using the Python code below (requires the NumPy package).

    $ python
    Python 3.6.9 (default, Oct  8 2020, 12:12:24)
    [GCC 8.4.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import numpy as np
    >>> data = np.load("trRosetta_full/dist/PF00242.npz")
    >>> dist = data["dist"]
    >>> matrix = np.sum(dist[:,:,1:13], axis=-1)

The dist array holds the predicted residue-residue distances and has shape [L, L, 37], where L is the length of the first sequence of the alignment.
The third dimension holds the probabilities for the Cb-Cb distance to be within the ranges: 0-2.5A, 2.5-3A, ..., 19.5-20A, >20A.
matrix is the contact map: an [L, L] array with the probability that each Cb-Cb distance is <8A.

lDDT scores
-----------

The local Distance Difference Test (lDDT) scores can be retrieved from the files in the lddt directory.

    $ python
    Python 3.6.9 (default, Oct  8 2020, 12:12:24)
    [GCC 8.4.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import numpy as np
    >>> data = np.load("trRosetta_full/lddt/PF00242.npz")
    >>> lddt = data["lddt"]

lddt is a one-dimensional array with one predicted score per residue of the sequence.

References
----------

(1) Yang J, et al. Improved protein structure prediction using predicted interresidue orientations. Proc Natl Acad Sci U S A. 2020
(2) Mariani V, et al. lDDT: a local superposition-free score for comparing protein structures and models using distance difference tests. Bioinformatics. 2013
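

Example: ranking predicted contacts
-----------------------------------

The contact map computation described under "Contact maps" can be wrapped in a small script, sketched below. This is an illustration only and is not shipped with the archives: the helper names contact_map and top_contacts, the min_separation and k parameters, and the synthetic demo array are all assumptions made for the example. The sketch also assumes the contact map is symmetric, so only pairs with i < j are ranked.

```python
import numpy as np


def contact_map(dist, start=1, stop=13):
    """Sum the bin probabilities dist[:, :, start:stop] into an L x L map.

    The default slice 1:13 is the one used in this README for Cb-Cb < 8A.
    """
    return np.sum(dist[:, :, start:stop], axis=-1)


def top_contacts(matrix, min_separation=6, k=5):
    """Return the k most probable pairs (i, j, p) with j - i >= min_separation.

    min_separation and k are hypothetical knobs chosen for this example.
    Indices are 0-based positions in the first sequence of the alignment.
    """
    length = matrix.shape[0]
    # Upper-triangle index pairs, skipping residues close in sequence.
    i, j = np.triu_indices(length, k=min_separation)
    # Sort ascending, reverse to get descending, keep the first k.
    order = np.argsort(matrix[i, j])[::-1][:k]
    return [(int(i[n]), int(j[n]), float(matrix[i[n], j[n]])) for n in order]


# Synthetic stand-in for data["dist"], so the sketch runs without the archive.
# With the real data one would instead write:
#   data = np.load("trRosetta_full/dist/PF00242.npz")
#   demo_dist = data["dist"]
rng = np.random.default_rng(0)
L = 20
demo_dist = rng.random((L, L, 37))
demo_dist /= demo_dist.sum(axis=-1, keepdims=True)  # each 37-bin vector sums to 1

demo_map = contact_map(demo_dist)
demo_top = top_contacts(demo_map, min_separation=6, k=3)

for i, j, p in demo_top:
    print(f"residues {i}-{j}: contact probability {p:.3f}")
```

Skipping pairs that are close in sequence is a common way to keep trivially adjacent residues out of the ranking; setting min_separation=1 ranks every distinct pair.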