Protein structure predictions for Pfam families
===============================================

This directory provides structural models and contact maps for 9648 Pfam families (Pfam 35.0).
Models were generated by the Baker group (https://www.bakerlab.org/) using RoseTTAfold (1) from multiple sequence alignments.

Overview
--------

There are two archives available for download:

 * RoseTTAfold.full.tar.gz - the complete data, which includes contact maps, pdb files, lDDT scores and alignments
 * RoseTTAfold.pdb.tar.gz - a smaller archive containing only the pdb files

Note that these RoseTTAfold archives contain some additional low quality models that are not shown on the InterPro and Pfam websites.

Download the archive and verify its integrity:

Full archive:

    $ wget http://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam35.0/structure_models/RoseTTAfold.full.tar.gz
    $ wget http://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam35.0/structure_models/RoseTTAfold.full.tar.gz.md5
    $ md5sum -c RoseTTAfold.full.tar.gz.md5

Smaller pdb archive:

    $ wget http://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam35.0/structure_models/RoseTTAfold.pdb.tar.gz
    $ wget http://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam35.0/structure_models/RoseTTAfold.pdb.tar.gz.md5
    $ md5sum -c RoseTTAfold.pdb.tar.gz.md5

Untar the archive:

Full archive:

    $ tar -xzf RoseTTAfold.full.tar.gz
    $ ls -l
    drwxr-xr-x 5 interpro interpro        4096 Feb 25 22:49 RoseTTAfold_full
    -rw-r--r-- 1 interpro interpro 87255064553 Feb 25 22:49 RoseTTAfold.full.tar.gz
    -rw-r--r-- 1 interpro interpro          56 Feb 25 22:49 RoseTTAfold.full.tar.gz.md5

Smaller pdb archive:

    $ tar -xzf RoseTTAfold.pdb.tar.gz
    $ ls -l
    drwxr-xr-x 5 interpro interpro    512000 Feb 25 22:49 pdb
    -rw-r--r-- 1 interpro interpro 321154409 Feb 25 22:49 RoseTTAfold.pdb.tar.gz
    -rw-r--r-- 1 interpro interpro        55 Feb 25 22:49 RoseTTAfold.pdb.tar.gz.md5

If you downloaded the full archive, the RoseTTAfold_full directory has the following structure:

    .
    |-- a3m
    |   |-- PF00242.a3m
    |   |-- PF00257.a3m
    |   |-- ...
    |   `-- PF20139.a3m
    |-- npz
    |   |-- PF00242.npz
    |   |-- PF00257.npz
    |   |-- ...
    |   `-- PF20139.npz
    `-- pdb
        |-- PF00242.pdb
        |-- PF00257.pdb
        |-- ...
        `-- PF20139.pdb

 * a3m: contains the UniProt alignments for Pfam families, in the A3M file format.
 * npz: contains the contact maps with a pre-set distance threshold of 15 Å, the predicted residue-wise local Distance Difference Test (lDDT) scores (2) and the distance error distributions, all in the NumPy NPZ file format.
 * pdb: contains the predicted structures, in the PDB file format.

If you downloaded the smaller RoseTTAfold.pdb.tar.gz archive, you will only have the pdb directory containing the PDB files.

Contact maps
------------

Contact maps can be retrieved from the files in the /npz/ directory, using the Python code below (requires the NumPy package).

    $ python
    Python 3.6.9 (default, Oct 8 2020, 12:12:24)
    [GCC 8.4.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import numpy as np
    >>> data = np.load("RoseTTAfold_full/npz/PF00242.npz")
    >>> dist = data["mask"]

dist is a matrix of shape [L, L], where L is the length of the sequence.

lDDT scores
-----------

The local Distance Difference Test (lDDT) scores can be retrieved from the files in the /npz/ directory.

    $ python
    Python 3.6.9 (default, Oct 8 2020, 12:12:24)
    [GCC 8.4.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import numpy as np
    >>> data = np.load("RoseTTAfold_full/npz/PF00242.npz")
    >>> lddt = data["lddt"]

lddt is an array of length L, where L is the length of the sequence.

Distance error distribution
---------------------------

The distance error distribution can be retrieved from the files in the /npz/ directory.

    $ python
    Python 3.6.9 (default, Oct 8 2020, 12:12:24)
    [GCC 8.4.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import numpy as np
    >>> data = np.load("RoseTTAfold_full/npz/PF00242.npz")
    >>> distance_error = data["estogram"]

distance_error is a matrix of shape [L, L], where L is the length of the sequence.

References
----------

(1) Baek M, et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science. 2021.
(2) Mariani V, et al. lDDT: a local superposition-free score for comparing protein structures and models using distance difference tests. Bioinformatics. 2013.
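
Appendix: checking the MD5 checksum from Python
-----------------------------------------------

The integrity check in the Overview uses `md5sum -c`, which is not available on every platform. The sketch below shows one way to perform a comparable check with the Python standard library only. The helper names `md5_of_file` and `verify_md5` are illustrative and not part of any Pfam tooling. It assumes that the .md5 file holds a single entry of the form "<hex digest> <file name>" and that the archive sits in the same directory as the .md5 file.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks.

    Chunked reading keeps memory use low for very large archives.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5(md5_path):
    """Check the file named in an md5sum-style checksum file.

    Assumption: the checksum file contains one entry,
    "<hex digest> <file name>", and the named file is located
    next to the checksum file.
    """
    md5_path = Path(md5_path)
    expected, name = md5_path.read_text().split(maxsplit=1)
    # md5sum may prefix the file name with "*" (binary mode marker).
    name = name.strip().lstrip("*")
    actual = md5_of_file(md5_path.parent / name)
    return actual == expected.lower()
```

For example, after downloading both files into the current directory, `verify_md5("RoseTTAfold.pdb.tar.gz.md5")` returns True when the archive matches its checksum and False otherwise.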