********************************************************************************
RefSeq-release71.txt
ftp://ftp.ncbi.nlm.nih.gov/refseq/release/release-notes/

NCBI Reference Sequence (RefSeq) Database

Release 71
July 6, 2015

Distribution Release Notes

Release Size:
  55267 organisms, 669786114584 nucleotide bases,
  19394398061 amino acids, 77730891 records
******************************************************************************

This document describes the format and content of the flat files that
comprise releases of the NCBI Reference Sequence (RefSeq) database.

Additional information about RefSeq is available at:

1. NCBI Bookshelf:
   a) NCBI Handbook:
      http://www.ncbi.nlm.nih.gov/books/NBK21091/
   b) RefSeq Help (FAQ)
      http://www.ncbi.nlm.nih.gov/books/NBK50680/

2. RefSeq Web Sites:
   RefSeq Home:     http://www.ncbi.nlm.nih.gov/RefSeq/
   RefSeqGene Home: http://www.ncbi.nlm.nih.gov/refseq/rsg/

If you have any questions or comments about RefSeq, the RefSeq release files
or this document, please contact NCBI by email at: info@ncbi.nlm.nih.gov.

To receive announcements of future RefSeq releases and large updates please
subscribe to NCBI's refseq-announce mail list:

  send email to refseq-announce-subscribe@ncbi.nlm.nih.gov
  with "subscribe" in the subject line (without quotes)
  and nothing in the email body

OR

  subscribe using the web interface at:
  http://www.ncbi.nlm.nih.gov/mailman/listinfo/refseq-announce

=============================================================================
TABLE OF CONTENTS
=============================================================================
1. INTRODUCTION
    1.1 Release 71
    1.2 Cutoff date
    1.3 RefSeq Project Background
        1.3.1 Sequence accessions, validation, and annotations
        1.3.2 Data assembly, curation, and collaboration
        1.3.3 Biologically non-redundant data set
        1.3.4 RefSeq and DDBJ/EMBL/GenBank comparison
    1.4 Uses and applications of the RefSeq database
2. CONTENT
    2.1 Organisms included
    2.2 Molecule Types included
    2.3 Known Problems, Redundancies, and Inconsistencies
    2.4 Release Catalog
    2.5 Changes since the previous release
3. ORGANIZATION OF DATA FILES
    3.1 FTP Site Organization
    3.2 Release Contents
    3.3 File Names and Formats
    3.4 File Sizes
    3.5 Statistics
    3.6 Release Catalog
    3.7 Removed Records
    3.8 Accession Format
    3.9 Growth of RefSeq
4. FLAT FILE ANNOTATION
    4.1 Main features of RefSeq Flat File
        4.1.1 LOCUS, DEFLINE, ACCESSION, KEYWORDS, SOURCE, ORGANISM
        4.1.2 REFERENCE, DIRECT SUBMISSION, COMMENT
        4.1.3 FEATURE ANNOTATION (Gene, mRNA, CDS, Variation, Protein)
    4.2 Tracking Identifiers
        4.2.1 GeneID
        4.2.2 Transcript ID
        4.2.3 Protein ID
        4.2.4 Conserved Domain Database (CDD) ID
5. REFSEQ ADMINISTRATION
    5.1 Citing RefSeq
    5.2 RefSeq Distribution Formats
    5.3 Other Methods of Accessing RefSeq Data
    5.4 Request for Corrections and Comments
    5.5 Credits and Acknowledgements
    5.6 Disclaimer

=============================================================================
1. INTRODUCTION
=============================================================================
The NCBI Reference Sequence Project (RefSeq) is an effort to provide the
best single collection of naturally occurring biomolecules, representative
of the central dogma, for each major organism. Ideally this would include
one sequence record for each chromosome, organelle, or plasmid linked on a
residue by residue basis to the expressed transcripts, to the translated
proteins, and to each mature peptide product. Depending on the organism, we
may have some, but not all, of this information at any given time. We
pragmatically include the best view we can from available data.

Additional information about the RefSeq project is available from:

a) RefSeq Web site
   http://www.ncbi.nlm.nih.gov/refseq/
b) Entrez Books, NCBI Handbook, RefSeq chapter
   http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=handbook.chapter.ch18

1.1 Release 71
--------------
The National Center for Biotechnology Information (NCBI) at the National
Library of Medicine (NLM), National Institutes of Health (NIH) is
responsible for producing and distributing the RefSeq Sequence Database.
Records are provided through a combination of collaboration and in-house
processing, including some curation by expert biologists on the NCBI staff.

This is a full release of all NCBI RefSeq records. The RefSeq project is an
ongoing effort to provide a curated, non-redundant collection of sequences.
This release includes all of the sequence data that we have collected at
this time. Although the RefSeq collection is not yet complete, its value as
a non-redundant dataset has reached a level that justifies providing full
releases.

1.2 Cutoff date
---------------
This full release, release 71, incorporates data available as of
July 6, 2015.

For more recent data, users are advised to:

1. Download the RefSeq daily update files from the RefSeq FTP site
   ftp://ftp.ncbi.nlm.nih.gov/refseq/daily/

2. Use NCBI's Entrez Programming Utilities to download records based on
   queries or lists of accessions
   http://www.ncbi.nlm.nih.gov/books/NBK25500/

3. Use the interactive web query system to query based on date.
   http://www.ncbi.nlm.nih.gov/nucleotide/
   http://www.ncbi.nlm.nih.gov/protein/

1.3 RefSeq Project Background
-----------------------------

1.3.1 Sequence accessions, validation, and annotation
-----------------------------------------------------
Every sequence is assigned a stable accession, version, and gi and all
older versions remain available over time. RefSeq accessions have a
distinct format (see section 3.8); the underscore ("_") is the primary
distinguishing feature of a RefSeq accession.
DDBJ/EMBL/GenBank accessions never include an underscore.

Sequences are validated in several ways. For example, validation confirms
that the genomic sequence in the region of an mRNA feature matches the mRNA
sequence itself, and that the annotated coding region features can be
translated into the protein sequences they refer to. Validation also checks
for valid ASN.1 format and ensures that descriptive information (symbols,
gene and protein names) is consistent between RefSeq and Gene records.

Each molecule is annotated as accurately as possible with the correct
organism name, the correct gene symbol for that organism, and reasonable
names for proteins where possible. When available, nomenclature provided by
official nomenclature groups is used. Note that gene symbols are not
required or expected to be unique either across species or within a
species.

1.3.2 Data assembly, curation, and collaboration
------------------------------------------------
We welcome collaborations with authoritative groups outside NCBI who are
willing to provide the sequences, annotations, or links to phenotypic or
organism specific resources. Where such collaborations have not yet
developed, NCBI staff have assembled the best view of the organism that we
can put together ourselves. In some cases, as with the human genome, NCBI
is an active participant in generating the genome assembly and in providing
reference sequences to represent the annotated genome. For other genomes,
we may compile the data ourselves from DDBJ/EMBL/GenBank or other public
sources.

For instance, we may simply select the "best" DDBJ/EMBL/GenBank record by
automatic means, validate the data format (and correct if needed), and add
an essentially unchanged copy to the RefSeq collection, attributed to the
original DDBJ/EMBL/GenBank record.
In other cases we may provide a record that is very similar to the
DDBJ/EMBL/GenBank record, but to which experts at NCBI have added corrected
or additional annotation. This latter process can range from minor
technical repairs to a manually curated re-annotation of the sequence,
often in collaboration with experts outside NCBI.

Each record that has been curated, or that is in the pool for future
curation, is labeled with the level of curation it has received. Curation
status information is provided primarily for transcript and protein
records. Curation is carried out on the whole genome level for some smaller
genomes such as viral, organelle, and some microbial genomes. Curation
status codes are defined in section 3.2 below.

1.3.3 Biologically non-redundant data set
-----------------------------------------
RefSeq provides a biologically non-redundant set of sequences for database
searching and gene characterization. It has the advantage of providing an
objective and experimentally verifiable definition of "non-redundant" in
supplying one example of each natural biomolecule per organism or sample.

The small amount of sequence redundancy introduced from close paralogs,
alternate splicing products, and genome assembly intermediates is
compensated for by the clarity of the model. RefSeq provides the substrate
for a variety of conclusions about non-redundancy based on clustering
identical sequences, or families of related sequences, without confounding
the database itself with these more subjective assessments.

1.3.4 RefSeq and DDBJ/EMBL/GenBank comparison
---------------------------------------------
RefSeq is unique in providing a large curated database across many
organisms, which precisely and explicitly links genetic (chromosome),
expression (mRNA), and functional (protein) sequence data into an
integrated whole.
DDBJ/EMBL/GenBank also integrates DNA and protein information, and RefSeq
is substantially based on sequence records contributed to
DDBJ/EMBL/GenBank. However, RefSeq is similar to a review article in that
it represents a synthesis and summary of information by a particular group
(NCBI or other RefSeq contributors) that is based on the primary data
gathered by many others and made part of the scientific record. Also, like
a review article, it has the advantage of organizing a large body of
diverse data into a single consistent framework with a uniform set of
conventions and standards.

Note that while based on DDBJ/EMBL/GenBank, RefSeq is distinct from
DDBJ/EMBL/GenBank. DDBJ/EMBL/GenBank represents the sequence and
annotations supplied by the original authors and is never changed by NCBI
or RefSeq staff. DDBJ/EMBL/GenBank remains the primary sequence archive
while RefSeq is a summary and synthesis based on that essential primary
data.

1.4 Uses and applications of the RefSeq database
------------------------------------------------
A stable, consistent, comprehensive, non-redundant database of genomes and
their products provides a valuable sequence resource for similarity
searching, gene identification, protein classification, comparative
genomics, and selection of probes for gene expression. It also acts as
molecular "white pages" by providing a single, uniform point of access for
searching at the sequence level, and by connecting the results with a
diversity of organism-specific databases or resources unique to that
organism or field.

=============================================================================
2. CONTENT
=============================================================================

2.1 Organisms included
----------------------
The number of organisms reported for the release (section 3.5 below) is
determined by counting the number of distinct tax_ids included in the
release. Tax_ids are provided by the NCBI Taxonomy group.
Tax_ids were historically provided for all species and strains having any
amount of sequence data. In 2014 NCBI stopped assigning strain-level
tax_ids. Strains are now being tracked by the BioSample database.

The release includes species ranging from viral to microbial to eukaryotic
and includes organisms for which complete and incomplete genomic sequence
data is available. The release does not include all species for which some
sequence data is available in DDBJ/EMBL/GenBank. The decision to generate
RefSeq data for a species or strain depends in part on the amount of
sequence data available. Additional species will be represented in the
RefSeq collection as more sequence data becomes available.

2.2 Molecule Types Included
---------------------------
The RefSeq release includes genomic, transcript, and protein sequence data;
however, these molecule types are not provided for all organisms and the
sequences provided may not be complete or comprehensive for some species.

Transcript RefSeq records may represent protein-coding transcripts or
non-coding RNA products; these records are currently only provided for
eukaryotic species.

Genomic RefSeq records are provided when a sufficient quantity of genomic
sequence data is available in DDBJ/EMBL/GenBank. Transcript and protein
records may be provided for a species before genomic sequence data is
available.

2.3 Known Problems, Redundancies, and Inconsistencies
------------------------------------------------------

Known Problems with RefSeq release 71:
======================================

[1] We failed to process some bacterial genome accessions for this release
    as errors were identified during the release QA step:

    WGS master for metagenome assembly (out of scope for RefSeq):
      NZ_ADHP00000000
      NZ_AFSJ00000000
      NZ_AFSK00000000
      NZ_AGGA00000000
      NZ_AHBG00000000
      NZ_AHBH00000000
      NZ_AIJM00000000
    These assemblies have now been suppressed.

    Missing organism source descriptor.
      NZ_AFSD01000007
      NZ_AFSD01000008
      NZ_AHJK01000001
      NZ_AHJO01000001
      NZ_AHJQ01000001
      NZ_AHJR01000001
      NZ_AHJT01000001
    These scaffolds have now been resubmitted to correct the data error.

[2] Metagenome assemblies and genomes from environmental samples:
    We are continuing to suppress environmental sample assemblies, due to
    concerns about possible sequence contamination and/or inaccuracies in
    organism labels.

Known Redundancies and Inconsistencies:
======================================
The RefSeq collection is an ongoing project that is expected to grow in
scope and content over time. Thus it is important to recognize that it is
not complete in that some genomes are not yet completely sequenced, some
incompletely sequenced genomes may not be included, or some gene products
may not yet be represented. RefSeq records may be added, removed, or
updated in future releases as new information becomes available and as a
result of curation.

Known Data inconsistencies:

[1] RefSeq status codes are not consistently provided for some species.

    The goal is to consistently provide a status code for all RefSeq
    records. The release catalog indicates "UNKNOWN" if a status code was
    expected but not detected and "na" if a status code is not expected
    based on the original project plan for provision of this type of
    information. Status codes will be more consistently applied to all
    records in the future.

[2] The genomic, transcript, and protein collection is known to be
    incomplete for many species.

    This is particularly true for those genomes for which a complete genome
    assembly is not yet available, such as Sus scrofa (pig). As additional
    sequence data becomes available, the RefSeq representation for this,
    and other, organisms will increase.

[3] Whole genome shotgun (WGS) assemblies of organelle, plastid, or viral
    genomes are included in the complete node and in the taxonomic group
    that the whole genome WGS project is reported in (e.g., fungi etc.).
    Our process flow for WGS data provides a data extraction per WGS
    project with no distinction by molecule (such as mitochondrial).
    Therefore, some nodes do not include WGS data or may include WGS data
    for different taxa.

    For instance, NZ_ACSJ01000000 includes contigs representing two
    tax_ids - a bacterium and a phage. The entire WGS project has been
    processed for the complete node and the microbial node in this release.
    Therefore, the microbial node includes a small amount of viral sequence
    and the viral node omits this data.

      NZ_ACSJ01000001 to NZ_ACSJ01000011   microbial contigs
      NZ_ACSJ01000012 to NZ_ACSJ01000019   viral contigs

[4] Although the goal is to provide a non-redundant collection, some
    redundancy is included in this release as follows:

    Redundant Protein records:

      Alternate Splicing
        When additional transcripts are provided to represent alternate
        splicing products, and the alternate splice site occurs in the UTR,
        then the protein is redundantly provided.

      Paralogs
        The goal is to provide a RefSeq record for each naturally occurring
        molecule. Therefore, records are provided for all genes identified
        including those produced by more recent gene duplication events in
        which the genes are nearly identical.

    Redundant Genomic records:

      Intermediate records
        For some species, intermediate genomic records are provided to
        support the assembly and/or annotation of the genome. For example,
        for human, a chromosome may be represented by a chromosome RefSeq
        record with a NC_ accession prefix. The chromosome record may
        consist of many contigs, each represented as a separate record with
        a NT_ accession prefix. In addition, some curated gene region
        records, with NG_ accession prefix, may also be provided to support
        annotation of complex regions.

      Alternate assemblies
        Genomic records are provided to represent alternate assemblies of
        genomic sequence derived from different populations.
        These records will have varying levels of redundancy and represent
        polymorphic and haplotype differences in terms of the sequence and
        annotation. For example, alternate assemblies are provided for
        different mouse strains and for regions of the human major
        histocompatibility complex (MHC). The MHC is a highly variable
        region of chromosome 6 which exhibits variation at the level of
        both sequence polymorphism and gene content. The alternate
        assemblies make it possible to represent this alternate gene
        content.

      Microbial strains
        Microbial genome sequence data derived from different strains may
        be represented as additional RefSeq records. This introduces
        redundancy but may also add representation for some proteins that
        are unique to a strain. RefSeq records for a specific strain can be
        identified by the unique taxonomic ID for that strain.

[5] Note that for some organisms, most notably vertebrates, processing to
    update individual transcript and protein records may occur on a daily
    basis. Transcript and protein updates may include changes to
    descriptive information such as publications, names, or feature
    annotations. Updates can also include changes to the sequence or the
    addition of new sequence records. Thus information available on
    transcript and protein records may be more current than the annotated
    genome.

2.4 Release Catalog
-------------------
The Release Catalog documents the full contents of the RefSeq Release. The
catalog can be used to identify data of interest. See the format
description in section 3.6 for additional information.

The release catalog is available at:
  ftp://ftp.ncbi.nlm.nih.gov/refseq/release/release-catalog/RefSeq-release#.catalog

The catalog for previous releases is available in the archive directory:
  ftp://ftp.ncbi.nlm.nih.gov/refseq/release/release-catalog/archive/

2.5 Changes since the previous release
--------------------------------------

[1] A list of updated organisms and dbSNP annotation summary is available
    here:
    ftp://ftp.ncbi.nih.gov/snp/release-notes/RefSeq/refseq71.snp.rpt

    The report summarizes SNP Build 144 updates for human, soybean,
    chicken, and horse.

[2] The Caenorhabditis elegans annotation was updated to correct an
    identified problem with missing gene symbols and incorrectly labeled
    non-coding RNAs.

Previous announcement:
----------------------

[1] A list of updated organisms and dbSNP annotation summary is available
    here:
    ftp://ftp.ncbi.nih.gov/snp/release-notes/RefSeq/refseq70.snp.rpt

[2] Eukaryotic genome updates

    This release includes updated annotation for the human reference genome
    (GRCh38.p2), the mouse reference genome (GRCm38) and the Caenorhabditis
    elegans reference genome corresponding to WormBase release WS245.

[3] Prokaryotic RefSeq data

    This release reflects a large update of complete bacterial RefSeq
    genomes, proteins, and Genes.

    NCBI decided to re-annotate all RefSeq prokaryotic genomes using NCBI’s
    genome annotation pipeline in order to make genome annotation
    comparable across genomes and species, instead of representing
    submitted annotation that was provided using different methods
    reflecting different states of technology development over time.
    Previously, it was possible that the same gene, in the same species,
    with an identical sequence for the gene's genomic region might be
    annotated with a different protein simply because it was annotated
    using different methods. Because of the re-annotation, the same gene in
    the same species with the same sequence will now be annotated with
    exactly the same protein in RefSeq.
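
    As a rough illustration of the principle that an identical sequence
    should always resolve to the same protein record, the sketch below keys
    protein identifiers on the sequence itself, so a sequence seen on a
    second genome reuses the identifier it already has. The `PROT_`
    identifiers, the genome labels, and the function name are hypothetical;
    this is not NCBI's accession scheme or pipeline code.

```python
from typing import Dict, List, Tuple


def assign_nonredundant_ids(
    annotations: List[Tuple[str, str]],
) -> Tuple[Dict[str, str], Dict[str, List[str]]]:
    """Give each distinct protein sequence exactly one identifier.

    annotations: (genome_label, protein_sequence) pairs.
    Returns (sequence -> identifier, identifier -> genomes carrying it).
    """
    seq_to_id: Dict[str, str] = {}
    id_to_genomes: Dict[str, List[str]] = {}
    for genome, seq in annotations:
        if seq not in seq_to_id:
            # Hypothetical identifier format, for illustration only.
            seq_to_id[seq] = f"PROT_{len(seq_to_id) + 1:04d}"
        # An identical sequence reuses the identifier assigned earlier.
        id_to_genomes.setdefault(seq_to_id[seq], []).append(genome)
    return seq_to_id, id_to_genomes


example = [
    ("GENOME_A", "MKTAYIAKQR"),
    ("GENOME_B", "MKTAYIAKQR"),  # same sequence on a second genome
    ("GENOME_B", "MSLNQ"),
]
seq_to_id, id_to_genomes = assign_nonredundant_ids(example)
for identifier, genomes in id_to_genomes.items():
    print(identifier, genomes)
```

    Keying on the exact sequence string means that two proteins share an
    identifier only when they match in both content and length.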
    If you’d like to learn more about the re-annotation project and what
    NCBI is doing to help you transition to using this new data, please see
    the RefSeq Re-annotation Project page at:
    http://www.ncbi.nlm.nih.gov/refseq/about/prokaryotes/reannotation/.

    Previously, each annotated CDS was tracked with a distinct RefSeq
    protein accession number. However, because identical protein sequences
    were found on multiple re-annotated RefSeq genomes, and because
    bacterial genomes are being sequenced extensively (often the same
    strain but different isolates), the RefSeq prokaryotic protein dataset
    was rapidly becoming very redundant. Therefore, rather than flood the
    protein database with thousands of completely identical proteins, NCBI
    has adopted the use of non-redundant (WP_) proteins for RefSeq
    prokaryotic genomes that are annotated using the NCBI pipeline. If the
    identical protein sequence (exactly the same protein sequence and
    length) appears on more than one RefSeq genome, NCBI simply re-uses the
    existing WP accession number instead of creating a new accession for
    each new occurrence and genome. For conserved proteins the same WP
    accession may appear on thousands of genomes. This is a first step
    toward dealing with a world in which genomes are sequenced just for
    assays, rather than to discover novel proteins.

    We appreciate that this is new and a major change for RefSeq
    prokaryotic genomes, and that there are some issues still to be worked
    out to use these data smoothly, but we felt we needed to start making
    this change as the volume of disease-outbreak and other isolate
    sequencing continues to increase rapidly.

    Advantages of comprehensive re-annotation and non-redundant proteins:

    - More consistent annotation across RefSeq bacterial genomes.
    - Significant reduction in protein redundancy. This is most notable for
      heavily sequenced species.
      For more information please see:
      http://www.ncbi.nlm.nih.gov/refseq/about/prokaryotes/reannotation/#reducedredundancy
    - Significant improvement in protein name management.

    This release:

    The long term plan to re-annotate all RefSeq bacterial genomes using
    NCBI’s prokaryotic genome annotation pipeline is now nearly complete,
    and the results are included in this release. We anticipate that the
    remaining very small number of re-annotated bacterial genomes will be
    released by the end of summer 2015. We also plan to re-annotate the
    archaeal genomes.

    As RefSeq bacterial genomes were re-annotated, the proteins were
    replaced with non-redundant RefSeq proteins (having the WP_ accession
    prefix). This data type was first announced in June 2013:
    http://www.ncbi.nlm.nih.gov/news/06-11-2013-wp-refseqs/. Thus >7
    million YP/NP protein accessions were removed since January, resulting
    in a decrease in the total number of protein accessions and a
    significant reduction in protein redundancy for the prokaryotic
    dataset. Removed accessions are reported here:
    ftp://ftp.ncbi.nlm.nih.gov/refseq/release/release-catalog/release70.removed-records.gz
    and a data mapping report is available in the release-catalog directory
    (release70.bacterial-reannotation-report.txt.gz).

    Protein records:

    In all bacterial genomes, except reference genomes and a small number
    which have yet to be re-annotated, protein accessions NP/YP have been
    replaced with non-redundant protein accession numbers (WP_).

    - > 7 million bacterial YP_ and NP_ RefSeq proteins were suppressed as
      complete bacterial genomes were re-annotated to conform to the new
      data model
    - Nearly 1 million non-redundant protein records were updated in March
      and April 2015 to improve the protein name. These updates affected
      CDS “/product=” annotation details for all (>31,000) of the RefSeq
      bacterial genomes and included typographical corrections, name format
      standardization, and improved functional information.
    - We have initiated a long-term project to validate and improve protein
      names for non-redundant protein records. In March and April we
      validated names for approximately 2 million records using multiple
      support lines from Swiss-Prot, HMM analysis, domain architecture
      analysis, and NCBI scientific staff curation.

    Nucleotide records:

    - >6,400 new or re-annotated RefSeq bacterial genomes were released
      since January 2, 2015.
    - All new complete or draft RefSeq prokaryote genomes now use the
      accession format rule NZ_. Complete genomes that were already
      accessioned using the ‘NC_’ prefix will continue to use that
      accession number. Thus, the accession prefix is no longer an
      indicator of a complete bacterial genome. Information about genome
      completeness is provided in the record DEFINITION line, the Assembly
      resource, and FTP reports provided by Assembly and Genome resources.

    Quality control:

    - Over 450 RefSeq bacterial genomes that do not meet updated quality
      criteria were suppressed; some of these may be reinstated in the
      future after further improvements are made to NCBI’s prokaryotic
      genome annotation pipeline.
    - A supplemental file in the release-catalog directory
      (release70.addedQA-suppressedAssemblies.txt) reports details for a
      subset of bacterial genomes that were suppressed in March 2015
      following an expansion of QA metrics and subsequent to curatorial
      review. This report illustrates some of the reasons for suppression.

    Locus_tag format:

    Re-annotated RefSeq genome records have new locus_tags in the format of
    _RS. The original locus tag is provided in the “old_locus_tag”
    qualifier. A bacterial genomes mapping report available in the
    release-catalog directory
    (release70.bacterial-reannotation-report.txt.gz) includes information
    about old and new locus_tags.
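
    The accession prefixes mentioned so far in these notes can be gathered
    into a small lookup. The sketch below is illustrative only: the
    descriptions paraphrase this document, the table is not a complete list
    of RefSeq prefixes, and the function name is hypothetical. It also
    reflects the point made above that a prefix alone does not indicate
    whether a bacterial genome is complete.

```python
# Descriptions paraphrase these release notes; the table is not exhaustive.
REFSEQ_PREFIXES = {
    "NC_": "chromosome, or complete genome accessioned under the older rule",
    "NT_": "genomic contig",
    "NG_": "curated gene region",
    "NZ_": "prokaryotic genome (complete or draft) or WGS record",
    "NP_": "protein",
    "YP_": "protein",
    "WP_": "non-redundant protein",
}


def describe_accession(accession: str) -> str:
    """Return a short description based on the accession prefix."""
    # DDBJ/EMBL/GenBank accessions never include an underscore.
    if "_" not in accession:
        return "not a RefSeq accession (no underscore)"
    prefix = accession.split("_", 1)[0] + "_"
    return REFSEQ_PREFIXES.get(prefix, "RefSeq accession, prefix not listed here")


for acc in ("NC_002695.1", "NZ_ACSJ01000001", "U49845"):
    print(acc, "->", describe_accession(acc))
```

    Genome completeness should still be read from the DEFINITION line or
    the Assembly resource rather than inferred from the prefix.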

    Available Reports and Documentation:

    a) Supplemental data mapping file:
       An FTP file in the release-catalog directory
       (release70.bacterial-reannotation-report.txt.gz) has been prepared
       for re-annotated genomes that have recently transitioned to using
       the new non-redundant proteins. This file reports the old protein
       accession and gi, the annotated CDS coordinates, the old locus_tag
       and NCBI GeneID values, and maps those to the current non-redundant
       protein accession and gi, the new locus_tag and NCBI GeneID (if
       available), and the current CDS annotation coordinates. It also
       indicates whether the original protein identically matches the
       replacement non-redundant protein, is similar to it, or was dropped
       from the annotation.

    b) Supplemental report of suppressed assemblies:
       An FTP file in the release-catalog directory
       (release70.addedQA-SuppressedAssemblies.txt) reports details for a
       subset of bacterial genomes that were suppressed in March 2015
       following an expansion of QA metrics and subsequent to curatorial
       review. This report illustrates some of the reasons for suppression.

    c) NCBI has created online documentation to explain these changes in
       detail:
       - Re-annotation project:
         http://www.ncbi.nlm.nih.gov/refseq/about/prokaryotes/reannotation/
       - RefSeq Prokaryotic Genome Policy:
         http://www.ncbi.nlm.nih.gov/refseq/about/prokaryotes/
       - RefSeq non-redundant proteins:
         http://www.ncbi.nlm.nih.gov/refseq/about/nonredundantproteins/
       - Prokaryotic annotation pipeline:
         http://www.ncbi.nlm.nih.gov/genome/annotation_prok/process/
       - Prokaryotic RefSeq FAQ:
         http://www.ncbi.nlm.nih.gov/refseq/about/prokaryotes/faq/

    Impact to NCBI Gene:

    Together with this re-annotation effort, the scope of bacterial genomes
    included in Gene has been changed to include only genomes designated as
    a 'reference genome,' or 'representative genome' where there is a
    cluster of related assemblies to indicate that the chosen
    representative assembly will be stable.
    Individual gene features on each assembly are identified with a
    locus_tag that can be used as a unique identifier for the gene in
    publications, even if the assembly is out of scope for Gene.

    Ongoing work:

    - Organism classification and QA: work continues to identify
      misclassified genomes and those with contamination. Depending on the
      specific details of identified issues, additional RefSeq bacterial
      genomes may be suppressed or updated.
    - Re-annotation of complete genomes: A small number of bacterial
      genomes have not yet been re-annotated; they will be re-annotated in
      the near future. We also plan to re-annotate the archaeal RefSeq
      genomes in 2015.
    - Protein names: we are continuing to work on providing improved names
      for the non-redundant (WP_ accessioned) bacterial protein dataset. We
      are leveraging multiple sources of information including curated
      UniProtKB/Swiss-Prot records, HMMs, Domain and domain architecture,
      publications and manual curation.
    - Partial proteins: we are re-examining the prokaryotic genome
      annotation pipeline logic with regard to providing a non-redundant
      protein record for partial coding sequences.

    Using this data:

    Please refer to the RefSeq bacterial genomes FAQ for information that
    will facilitate access to these data.
    http://www.ncbi.nlm.nih.gov/refseq/about/prokaryotes/faq/

    a) Strain-specific protein datasets for individual RefSeq genomes can
       be obtained online, by FTP, and through NCBI’s programming
       utilities. To access data online, navigate to the annotated genome
       record(s) in NCBI’s Nucleotide database, use the right-column option
       to ‘Find related data’ in the Protein database, then download the
       protein records using the upper-right ‘Send to’ wizard. To access
       proteins for specific species or strains by FTP, navigate to NCBI’s
       Assembly record then follow the right-column link to the RefSeq FTP
       site.
       RefSeq genomes include a link to the Assembly resource in the DBLINK
       section of the record or in the right-column Related information
       section of the Nucleotide record. To access data using NCBI
       programming utilities one must provide the genomic accessions and
       use the eLink function to access the linked protein data (see
       documentation http://www.ncbi.nlm.nih.gov/books/NBK25501/).

    b) A graphical display of an annotated gene or protein can be accessed
       from the Nucleotide resource. From a RefSeq genome record of
       interest, such as NC_002695.1, follow the link to ‘Graphics’, and
       search for the locus_tag or protein name of interest.

    c) Conversely, if starting from an individual non-redundant protein
       record, information about the annotated genomic location and genome
       taxonomy is available by following the (page top) link to the
       Identical Protein report. When a non-redundant protein record has
       been annotated on multiple RefSeq genomes, this report page lists
       the set of genomes that contain that identical protein, the genomic
       coordinates of the annotated CDS, and the specific organism
       information of the annotated genomic record. Thus this report page
       can be used to identify the taxonomic range in which that identical
       protein has been found. The protein report can be downloaded in
       tabular format using the ‘Send to’ link, and can be accessed using
       NCBI’s programming utilities.

    Measurable reduction in protein redundancy:

    Here are some measures for four species that illustrate the significant
    reduction in protein record redundancy resulting from the use of
    non-redundant RefSeq proteins (WP_ accessions).

Counts:

Species                     Genomes  Total Proteins  Total Unique WPs  Total Singleton WPs
--------------------------  -------  --------------  ----------------  -------------------
Staphylococcus aureus          4194      11,764,898           222,588              138,284
Escherichia coli               2685      13,637,370         1,033,617              649,100
Mycobacterium tuberculosis     1790       7,245,836           139,800              101,255
Salmonella enterica             918       4,099,013           294,106              194,982

Percents:

Species                     Genomes  Percent Reduction (WPs)  Percent Singleton WPs
--------------------------  -------  -----------------------  ---------------------
Staphylococcus aureus          4194                      98%                    62%
Escherichia coli               2685                      94%                    63%
Mycobacterium tuberculosis     1790                      98%                    72%
Salmonella enterica             918                      93%                    66%

Singletons Per Genome:

Species                     Average Protein Count  Singleton WPs per Genome  Percent Singleton Per Genome
--------------------------  ---------------------  ------------------------  ----------------------------
Staphylococcus aureus                        2814                        33                         1.17%
Escherichia coli                             5088                       241                         4.74%
Mycobacterium tuberculosis                   4046                        56                         1.38%
Salmonella enterica                          4485                       212                         4.72%

Definitions:

- "Total Proteins" counts the number of times non-redundant protein
  accessions are annotated on the set of genomes for the species.
- "Total Unique WPs" counts the distinct number of non-redundant proteins
  used across all genomes. This is the truly non-redundant set of proteins
  for the species.
- "Total Singleton WPs" counts the number of non-redundant proteins used
  only once in the set of genomes for the species.
- "Percent Reduction" measures the compression in protein identifier space
  gained by using non-redundant protein accessions (WP_ prefix).
- "Percent Singleton WPs" measures the percent of all non-redundant
  proteins for that species that are used only once in that species.

Announcing Future Changes:
--------------------------
Prokaryotic genomes:
We plan to comprehensively re-annotate bacterial and archaeal genomes for
RefSeq release 72 (September 2015).
This re-annotation is being carried out to reflect improvements in a) management of partial, very short, and fragmented genes and proteins; and b) protein name management. This re-annotation will also increase consistency of some textual information that is applied to RefSeq records. Note that re-annotation will not be done for the small set of bacterial reference genomes for which annotation changes are manually maintained.

=============================================================================
3. ORGANIZATION OF DATA FILES
=============================================================================

3.1 FTP Site Organization
-------------------------

RefSeq releases are available on the NCBI FTP site at:

   ftp://ftp.ncbi.nlm.nih.gov/refseq/release/

Documentation Directories and Files:
------------------------------------

release-catalog/
   archive/                   --subdirectory, archive of previous catalogs
   RefSeq-release#.catalog    --file, comprehensive list of sequence records included in the current release
   release#.files.installed   --file, list of sequence data files installed
   release#.removed-records   --file, list of removed records that were included in the previous release
   release#.taxon.new         --file, list of organisms that have been added to the release since the previous release
   release#.taxon.update      --file, list of organisms for which there has been a change in either the NCBI Tax ID or the organism name

release-notes/
   archive/                   --subdirectory, archive of previous documentation
   RefSeq-release#.txt        --file, this release notes document

release-statistics/
   archive/                   --subdirectory, archive of previous documentation
   RefSeq-release#.MMDDYYYY.stats.txt   --file, release statistics

Sequence Data Directories and Files:
------------------------------------

The RefSeq collection is provided in a redundant fashion to best meet the needs of those who want the full collection as well as those who want a specific sub-set of the collection.
Therefore the collection is provided as: 1) the complete collection, and 2) sections as defined by major taxonomic or other logical groupings. A subdirectory exists for each sub-section as follows: archaea bacteria fungi invertebrate mitochondrion plant plasmid plastid protozoa vertebrate_mammalian vertebrate_other viral In addition, the complete collection is available without these sub-groupings in the subdirectory: complete Note that this directory structure intentionally provides the release data in a redundant fashion. We gave considerable thought to how to package the release to meet the needs of different user groups. For instance, some groups may be interested in retrieving the complete protein set, while other groups may be interested in retrieving data for a more limited number of organisms. We decided to provide logical groupings based on general taxonomic node (viral, mammalian etc.) as well as logical molecule type compartmentalization (e.g., plastid). Thus, all records are provided at least twice, once in the "complete" directory, and a second time in one of the other directories. Some sequences may be provided three times when it is logical to include the record in more than one additional directory. For example, a sequence may be provided in the "complete", "mitochondrion", and "vertebrate_mammalian" directories. We are interested in hearing if you find this structure useful or if you would like information grouped in a different manner. 
Send suggestions or comments to the NCBI Help Desk at: info@ncbi.nlm.nih.gov

3.2 Release Contents
--------------------

A comprehensive list of sequence files provided for the current release is available in:

   ftp://ftp.ncbi.nlm.nih.gov/refseq/release/release-catalog/release#.files.installed

A comprehensive list of sequence records included in the current release is available in:

   ftp://ftp.ncbi.nlm.nih.gov/refseq/release/release-catalog/RefSeq-release#.catalog

File name format indicates the directory node, molecule type, and format type.

Name format:

    complete.10.1.bna.gz
   |--------|--|-|---|--|
        1    2  3  4  5

   1. directory location
   2. numerical increment - to provide a set of unique file names
   3. optional: sub-part number - to provide a unique file name for genomic FASTA files, which may be split based on size
   4. format type
   5. compression

Multiple files may be provided for any given molecule and format type, and file names include a numerical increment. Files with the same numerical increment are related by content; they are all derived from the same initial ASN.1 file.

For example:

   complete.1.bna.gz   ---this file represents all of the content found in the following files.
   complete.1.1.genomic.fna.gz
   complete.1.protein.faa.gz
   complete.1.rna.fna.gz
   complete.1.genomic.gbff.gz
   complete.1.protein.gbff.gz
   complete.1.rna.gbff.gz

Note that for some molecule and format types, a number increment is skipped. This is not an error. It is also not an error if a filename provided with one release is not provided with a different release.

For example:

   complete.281.genomic.gbff.gz
   complete.282.genomic.gbff.gz
   complete.284.genomic.gbff.gz
   complete.285.genomic.gbff.gz
   complete.287.genomic.gbff.gz

   --release 70 did not include files named as complete.283.genomic or complete.286.genomic because complete.283.bna & complete.286.bna did not include genomic data.

The RefSeq release processing first produces a comprehensive set of ASN.1 files, ordered by tax_id, and limited by a size constraint.
These initial files are further processed to export the records by molecule and format type. If the initial ASN.1 file does not include any records for a given molecule type, such as genomic sequence data, then the corresponding 'genomic' fasta and flatfile records will not be found. The installed release includes a comprehensive report of all files installed for a given release. Please refer to /release-catalog/release#.files.installed (where # is the release number). 3.3 File Names and Formats -------------------------- File names are informative, and indicate the content, molecule type, and file format of each RefSeq release data file. Most filenames utilize this structure: directory.filenumber.subpart.molecule.format.gz 1 2 3 4 5 File Name Key: 1. directory directory level the file is provided in (e.g.,complete, viral etc) 2. file number: large data sets are provided as incrementally numbered files 3. sub-part number: large genomic fasta files may be split to facilitate transfer 4. molecule type of molecule (genomic, rna, or protein); not relevant for ASN.1 format files provided in the "complete" sub-directory 5. format the data format provided in the file; see below For example: complete1.genomic.bna.gz vertebrate_mammalian2.protein.gpff.gz RefSeq Whole Genome Shotgun (WGS) data are provided in files provided per WGS project. Their filenames use a slightly different structure: directoryWGSproject.molecule.format.gz For example: completeNZ_AAAU.bna.gz microbialNZ_AAAV.genomic.fna.gz All RefSeq release files have been compressed with the gzip utility; therefore, an invariant ".gz" suffix is present for all release files. 
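The dot-delimited naming structure described above can be sketched in Python as follows. This is only an illustrative sketch: the function name `parse_release_filename` and the `ReleaseFile` container are hypothetical, it assumes names of the form seen in the release listing (such as complete.1.1.genomic.fna.gz), and it deliberately rejects the per-project WGS file names, which follow a different structure.

```python
from typing import NamedTuple, Optional

# Format and molecule codes as they appear in release file names.
FORMATS = {"bna", "gbff", "gpff", "fna", "faa"}
MOLECULES = {"genomic", "rna", "protein"}


class ReleaseFile(NamedTuple):
    """Hypothetical container for the parsed name components."""
    directory: str
    file_number: int
    sub_part: Optional[int]   # only present for split genomic FASTA files
    molecule: Optional[str]   # absent for ASN.1 (bna) files
    fmt: str


def parse_release_filename(name: str) -> ReleaseFile:
    """Split directory.filenumber[.subpart][.molecule].format.gz into parts."""
    parts = name.split(".")
    if len(parts) < 4 or parts[-1] != "gz":
        raise ValueError(f"not a dot-delimited release file name: {name!r}")
    fmt = parts[-2]
    if fmt not in FORMATS:
        raise ValueError(f"unknown format code {fmt!r} in {name!r}")
    directory, number, *rest = parts[:-2]
    if not number.isdigit():
        raise ValueError(f"file number is not numeric in {name!r}")
    sub_part = None
    molecule = None
    # An optional numeric sub-part comes before the optional molecule type.
    if rest and rest[0].isdigit():
        sub_part = int(rest.pop(0))
    if rest and rest[0] in MOLECULES:
        molecule = rest.pop(0)
    if rest:
        raise ValueError(f"unexpected components {rest!r} in {name!r}")
    return ReleaseFile(directory, int(number), sub_part, molecule, fmt)


if __name__ == "__main__":
    for example in ("complete.1.bna.gz", "complete.1.1.genomic.fna.gz"):
        print(example, "->", parse_release_filename(example))
```

Grouping parsed names by `file_number` is one way to collect the FASTA and flat files that derive from the same initial ASN.1 file.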
The data that comprise a RefSeq release are available in several file formats, as indicated by the format component in the file name:

   bna    binary ASN.1 format; includes nucleotide and protein
   gbff   GenBank flat file format; nucleotide records
   gpff   GenPept flat file format; protein records
   fna    FASTA format; nucleotide records
   faa    FASTA format; protein records

The comprehensive full release is deposited in the "complete" directory and is available in all file types. Binary ASN.1 format is only provided in the complete directory. The remaining directories include all of the remaining file types.

The DDBJ/EMBL/GenBank and GenPept flat file format provided in this release matches that seen when accessing the records using the NCBI web site. Notably, some RefSeq records are in the CON division and do not instantiate the sequence on the flat file display; instead a 'join' statement is provided to indicate the assembly instructions. The FASTA files do include the assembled sequences for these CON division RefSeq records. For example, see NC_000022.

Suggestions regarding the structure of the RefSeq release product and the available formats may be sent to the NCBI Help Desk: info@ncbi.nlm.nih.gov

3.4 File Sizes
--------------

RefSeq release files are provided in a range of sizes. Most are limited to several hundred megabytes (MB) and uncompressed ASN.1 file size will not exceed 500 MB. Nucleotide FASTA files are split when they reach 1 gigabyte (GB). Files are compressed to reduce file size and facilitate FTP retrieval.

The total size of release 71 is as follows:

Extension   Size (GB)   Type
-----------------------------------------------------------
bna            594.91   ASN.1
gbff           836.48   GenBank flat file
gpff           275.78   GenPept flat file
fna           1359.56   FASTA, nucleotide
faa             48.82   FASTA, protein

Notes:
[A] The complete directory provides all file types. The ASN.1 format is only available in the complete directory; the file sizes reported for the remaining file formats represent the redundant total found in the complete plus other directories.

3.5 Statistics
---------------

RefSeq release 71 includes sequences from 55267 different organisms.

The number of species represented in each Release sub-directory, determined by counting distinct tax IDs, is as follows:

   archaea                  952
   bacteria               39660
   complete               55267
   fungi                   3367
   invertebrate            1786
   mitochondrion           5732
   plant                    847
   plasmid                 2139
   plastid                  843
   protozoa                 273
   vertebrate_mammalian     776
   vertebrate_other        2755
   viral                   4850

Counts of accessions and basepairs/residues per molecule type:

                Accessions   Basepairs/Residues
   Genomic:       13403331         644351166512
   RNA:           11803354          25434948072
   Protein:       52494032          19394398061
   Wgs master:       30174                    0

Complete RefSeq release statistics for each directory are provided in a separate document. Please see:

   ftp://ftp.ncbi.nlm.nih.gov/refseq/release/release-statistics/

   file: RefSeq-release#.MMDDYYYY.stats.txt
      #:        indicates release number
      MMDDYYYY: indicates release date as month, day, year

Statistics for previous releases are available in the archive subdirectory:

   ftp://ftp.ncbi.nlm.nih.gov/refseq/release/release-statistics/archive/

3.6 Release Catalog Format
--------------------------

The full non-redundant contents of the release are documented in the release catalog.

Available at: ftp://ftp.ncbi.nlm.nih.gov/refseq/release/release-catalog/

The catalog includes the following columns:

   1. tax_id
   2. Taxon name
   3. RefSeq accession.version
   4. gi
   5. FTP directories data is provided in, '|' separated
   6. RefSeq status code
   7. sequence length

Note: the molecule type for each catalog entry can be inferred from the accession prefix (see below).

RefSeq Status Codes are documented on the RefSeq web site.
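As an illustration of the seven-column layout listed above, a catalog line could be read with a short Python sketch such as the one below. Two points are assumptions not stated in this document: that the columns are tab-delimited, and that the catalog has already been decompressed to text. The names `CatalogEntry` and `parse_catalog_line`, and all values in the sample line, are made up for the example.

```python
from typing import List, NamedTuple


class CatalogEntry(NamedTuple):
    """Hypothetical record type mirroring the seven catalog columns."""
    tax_id: int
    taxon_name: str
    accession_version: str
    gi: int
    directories: List[str]
    status: str
    length: int


def parse_catalog_line(line: str) -> CatalogEntry:
    """Parse one catalog line, ASSUMING tab-delimited columns."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 7:
        raise ValueError(f"expected 7 columns, found {len(fields)}")
    tax_id, name, acc, gi, dirs, status, length = fields
    return CatalogEntry(
        tax_id=int(tax_id),
        taxon_name=name,
        accession_version=acc,
        gi=int(gi),
        directories=dirs.split("|"),  # column 5 is '|' separated
        status=status,
        length=int(length),
    )


if __name__ == "__main__":
    # Illustrative line only; the values are invented, not real catalog data.
    sample = "9606\tHomo sapiens\tNM_000000.1\t12345\tcomplete|vertebrate_mammalian\tREVIEWED\t2480\n"
    entry = parse_catalog_line(sample)
    print(entry.accession_version, entry.directories, entry.status)
```

Because every record is listed once in the catalog with all of its directories, filtering on `directories` avoids double counting records that are distributed redundantly.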
The catalog includes the following terms:

   na           Not Applicable; status codes are not provided for some records.

   UNKNOWN      The status code has not yet been applied or status is not applicable to the type of record.

   REVIEWED     The RefSeq record has been reviewed by NCBI staff or by a collaborator. Some RefSeq records may incorporate expanded sequence and annotation information including additional publications and features. This indicates a curated record.

   VALIDATED    The RefSeq record has undergone an initial review to provide the preferred sequence standard. The record has not yet been subject to final review at which time additional functional information may be provided. This indicates a curated record.

   PROVISIONAL  The RefSeq record has not yet been subject to individual review and is thought to be well supported and to represent a valid transcript and protein. This record is not curated.

   PREDICTED    The RefSeq transcript may represent an ab initio prediction or may be weakly supported by transcripts or protein homology. This record is not curated.

   INFERRED     The RefSeq record is inferred by genome sequence analysis. This record is not curated.

   MODEL        RefSeq records are provided via automated processing and are not subject to individual review or revision between builds. This record is not curated.

3.7 Removed Records
-------------------

This is a report of accessions that were included in the previous release but are no longer included in the current release.

Available at:
   ftp://ftp.ncbi.nlm.nih.gov/refseq/release/release-catalog/
   release#.removed-records

File format: the file includes the following columns:

   1. tax_id
   2. species name
   3. RefSeq accession.version
   4. gi
   5. FTP directories data was provided in, in last release
   6. RefSeq status code
   7. sequence length
   8. type of removal

Type options include:

   dead protein
   replaced by accession    [original accession is not secondary]
   permanently suppressed
   temporarily suppressed   [record may become available again in the future]

3.8 RefSeq Accession Format
---------------------------

RefSeq accessions are formatted as a two letter prefix, followed by an underscore, followed by six or nine digits or 4 letters plus eight digits. For example, NM_020236, NP_001107345, and NZ_AABC02000001.

The underscore ("_") is the primary distinguishing feature of a RefSeq accession; DDBJ/EMBL/GenBank accessions never include an underscore.

The RefSeq accession prefix indicates the molecule type.

Molecule Type   Accession Prefix
----------------------------------------------
protein         *P_          including: NP_; XP_; AP_; YP_; WP_
rna             *R_ and *M_  including: NM_; NR_; XM_; XR_
genomic         NC_; NG_; NT_; NW_; NZ_; NS_; AC_

Additional information is available on the RefSeq Web site:
http://www.ncbi.nlm.nih.gov/RefSeq/key.html#accessions

The prefix of a transcript or protein accession is followed by either 6 or 9 digits.

For example: NP_123456 -or- NP_123456789

As other accession series need to be expanded, they will also be expanded by adding 3 digits with existing accessions remaining stable.
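The accession pattern and prefix table above can be expressed as a small Python sketch. It follows only the description in this section (two-letter prefix, underscore, then six or nine digits or four letters plus eight digits, with an optional ".version" suffix); the names `ACCESSION_RE` and `classify_accession` are hypothetical, and accessions in forms not described here would be reported as unrecognized.

```python
import re
from typing import Optional, Tuple

# Prefix-to-molecule mapping taken from the table in this section.
PREFIX_TO_MOLECULE = {
    **dict.fromkeys(("NP", "XP", "AP", "YP", "WP"), "protein"),
    **dict.fromkeys(("NM", "NR", "XM", "XR"), "rna"),
    **dict.fromkeys(("NC", "NG", "NT", "NW", "NZ", "NS", "AC"), "genomic"),
}

# Two-letter prefix, underscore, then 6 digits, 9 digits, or 4 letters plus
# 8 digits; an optional ".version" suffix is also accepted.
ACCESSION_RE = re.compile(r"^([A-Z]{2})_(\d{6}|\d{9}|[A-Z]{4}\d{8})(?:\.(\d+))?$")


def classify_accession(accession: str) -> Optional[Tuple[str, str, Optional[int]]]:
    """Return (prefix, molecule type, version) or None if not recognized."""
    match = ACCESSION_RE.match(accession)
    if match is None:
        return None
    prefix, _body, version = match.groups()
    molecule = PREFIX_TO_MOLECULE.get(prefix)
    if molecule is None:
        return None
    return prefix, molecule, int(version) if version is not None else None


if __name__ == "__main__":
    for acc in ("NM_020236", "NP_001107345", "NZ_AABC02000001", "NC_004916.2"):
        print(acc, "->", classify_accession(acc))
```

Since DDBJ/EMBL/GenBank accessions never include an underscore, such accessions fall through to `None` in this sketch.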
3.9 Growth of RefSeq
--------------------

Release  Date           Taxons   Nucleotides  Amino Acids   Records
1        Jun 30, 2003     2005    4672871949    263588685   1061675
2        Oct 21, 2003     2124    7745398573    286957682   1097404
3        Jan 13, 2004     2218    7992741222    294647847   1101244
4        Mar 24, 2004     2358    8175128887    318253841   1193457
5        May 3, 2004      2395    8325515623    337229387   1255613
6        July 5, 2004     2467    8696371716    365446682   1367206
7        Sep 10, 2004     2558   21072808460    405233619   1579579
8        Oct 31, 2004     2645   26814386658    430300369   1709723
9        Jan 9, 2005      2780   36786975473    470534907   1843944
10       Mar 6, 2005      2827   36893741150    482862858   1893478
11       May 8, 2005      2928   39731702362    507980644   2477893
12       Jul 10, 2005     2969   43043256058    608493108   2869675
13       Sep 11, 2005     3060   44727484853    686768902   3400773
14       Nov 20, 2005     3198   47364955367    763761075   3272776
15       Jan 1, 2006      3244   52645441913    810009733   3436263
16       Mar 11, 2006     3397   56175443059    887509001   3715260
17       May 1, 2006      3497   62130037371    927587669   3999859
18       July 11, 2006    3695   70474041999    974374765   4186692
19       Sep 10, 2006     3774   70694879544   1012985077   4311543
20       Nov 5, 2006      3919   72679681505   1061797276   4567569
21       Jan 6, 2007      4079   73864990566   1144795927   4742335
22       Mar 5, 2007      4187   82441128546   1215085694   5207865
23       May 8, 2007      4300   83148327110   1291050995   5503385
24       July 10, 2007    4511   89856995521   1365916222   6073814
25       Sep 11, 2007     4646   91265840843   1470475398   6515132
26       Nov 4, 2007      4737   99105705485   1495032507   6698250
27       Jan 6, 2008      4926  101059552113   1556356987   7025715
28       Mar 9, 2008      5059  102051350525   1770627427   7914560
29       May 4, 2008      5168  104671101150   1870214220   8376141
30       July 7, 2008     5395  105074486709   1913447691   8572852
31       Aug 30, 2008     5513  109214348591   2026768719   9145702
32       Nov 10, 2008     5726  111122203221   2089596746   9501764
33       Jan 16, 2009     7773  116001583818   2204073443  10325282
34       Mar 6, 2009      8054  111792574830   2299682138  10021870
35       May 4, 2009      8393  113210655336   2565199170  10993891
36       July 2, 2009     8665  117013741530   2756884219  12141825
37       Sept 3, 2009     9005  119151229820   2965450333  12941750
38       Nov 7, 2009      9166  119196622435   3115246540  13436447
39       Jan 23, 2010    10171  118502856500   3221054793  13656433
40       Mar 7, 2010     10291  118645985035   3280528951  13853798
41       May 9, 2010     10567  125500880884   3427514220  14472060
42       July 13, 2010   10728  143311839055   3553178673  15038858
43       Sep 5, 2010     10854  148706971456   3761205880  15934055
44       Nov 7, 2010     11354  152241490865   3899827321  16421261
45       Jan 7, 2011     11536  152787094873   3989526325  16748646
46       Mar 8, 2011     11734  153220856222   4064052954  16998463
47       May 7, 2011     12000  162001966044   4226432170  17631876
48       July 10, 2011   12235  163771272903   4381572480  18162534
49       Sep 7, 2011     16248  162286146420   4401462131  18236994
50       Nov 8, 2011     16392  168702162406   4529303978  18815153
51       Jan 9, 2012     16609  172751347778   4727472575  19580946
52       Mar 5, 2012     16923  173705194347   4929467422  20235247
53       May 7, 2012     17339  175345433862   5247723883  21286080
54       July 9, 2012    17605  176492228688   5456992181  21889466
55       Sep 17, 2012    17994  194971374545   5803694332  23207572
56       Nov 8, 2012     18512  207200464965   6003283860  23892460
57       Jan 8, 2013     21415  227639108990   8895153979  34158511
58       Mar 11, 2013    22460  233247214400   9699076220  36938203
59       Apr 29, 2013    24656  256547643663  10081118607  39040745
60       Jul 19, 2013    28560  304686151670  10968281809  40913699
61       Sep 9, 2013     29414  319551394177  11248966865  41958567
62       Nov 10, 2013    31646  361097812819  12364402476  45971929
63       Jan 12, 2014    33485  380736496721  12898823816  48358066
64       Mar 10, 2014    33693  407131829420  13126329523  49538213
65       May 12, 2014    36335  430613954268  13544443640  51770174
66       July 7, 2014    41263  464958653006  15380643722  58334707
67       Sep 8, 2014     41913  490800792583  15984799771  61277203
68       Nov 3, 2014     49312  551290496427  16790850066  66078114
69       Jan 2, 2015     51661  594452675642  18690872100  74127019
70       Apr 30, 2015    54118  643051675415  18556381492  74720563
71       Jul 6, 2015     55267  669786114584  19394398061  77730891

Note: Date refers to the data cut-off date.

=============================================================================
4. FLAT FILE ANNOTATION
=============================================================================

4.1 Main features of RefSeq Flat File
-------------------------------------

Also see the RefSeq web site and the NCBI Handbook, RefSeq chapter.

   http://www.ncbi.nlm.nih.gov/refseq/
   http://www.ncbi.nlm.nih.gov/books/NBK21091/

4.1.1 LOCUS, DEFLINE, ACCESSION, KEYWORDS, SOURCE, ORGANISM
--------------------------------------------------------------------

The beginning of each RefSeq record provides information about the accession, length, molecule type, division, and last update date. This is followed by the descriptive DEFINITION line, then by the Accession, version, and GI data, followed by detailed information about the organism and taxonomic lineage.

//
LOCUS       NC_004916             384502 bp    DNA     linear   INV 05-JUN-2012
DEFINITION  Leishmania major strain Friedlin complete genome, chromosome 3.
ACCESSION   NC_004916
VERSION     NC_004916.2  GI:389592668
DBLINK      Project: 15564
            BioProject: PRJNA15564
KEYWORDS    RefSeq; complete genome.
SOURCE      Leishmania major strain Friedlin
  ORGANISM  Leishmania major strain Friedlin
            Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae;
            Leishmaniinae; Leishmania.
//

Note: Both the GI and VERSION number increment when a sequence is updated, while the ACCESSION remains the same. The GI and "ACCESSION.VERSION" identifiers provide the finest resolution reference to a sequence.

4.1.2 REFERENCE, DIRECT SUBMISSION, COMMENT
-------------------------------------------

REFERENCE:
While the majority of RefSeq records do include REFERENCE data, this data is not required and some records do not include any citations. Publications are propagated from the GenBank record(s) from which the RefSeq is derived, provided by collaborating groups and NCBI staff during the curation process, and provided by the National Library of Medicine (NLM) PubMed MeSH indexing staff as they add new articles to PubMed.
Functionally relevant citations are added by individual scientists using the Entrez Gene GeneRIF submission form, and a significant volume of citation connections are supplied by the NLM MeSH indexing staff for human, mouse, rat, zebrafish,and cow. This functionality is expected to increase in the future to treat all organisms represented in the RefSeq collection. Citations supplied by the MeSH indexers and individual scientists can be identified by the presence of a REMARK beginning with the text string "GeneRIF". This represents a significant method to keep sequence connections to the literature up-to-date; GeneRIFs add considerable value to the RefSeq collection. For more information on GeneRIFs please see: http://www.ncbi.nlm.nih.gov/gene/about-generif For example, several GeneRIFs have been added to NM_000173.1 including: // REFERENCE 13 (bases 1 to 2480) AUTHORS Poujol,C., Ware,J., Nieswandt,B., Nurden,A.T. and Nurden,P. TITLE Absence of GPIbalpha is responsible for aberrant membrane development during megakaryocyte maturation: ultrastructural study using a transgenic model JOURNAL Exp. Hematol. 30 (4), 352-360 (2002) MEDLINE 21935100 PUBMED 11937271 REMARK GeneRIF: Absence of GPIbalpha is responsible for aberrant membrane development during megakaryocyte maturation; leads to abnormal partitioning of the membrane systems and abnormal proplatelet production. // DIRECT SUBMISSION: A Direct Submission field is provided on some RefSeq records but not all. It is propagated from the underlying GenBank record from which the RefSeq is derived or provided on submissions from collaborating groups. Transcript and protein RefSeqs for human, mouse, rat, zebrafish, and cow do not provide this field as records often include additional data and are not necessarily direct copies of the GenBank submission. COMMENT: A COMMENT identifying the RefSeq Status is provided for the majority of the RefSeq records. 
This comment may include information about the RefSeq status, collaborating groups, and the GenBank records(s) from which the RefSeq is derived. The RefSeq COMMENT is not provided comprehensively in this release. We are working to supply this COMMENT more comprehensively in the future. Additional COMMENTS are provided for some records to provide information about the sequence function, notes about the aspects of curation, or comments describing transcript variants. A COMMENT is always provided if the GI has changed. For example (from NM_133490): // COMMENT REVIEWED REFSEQ: This record has been curated by NCBI staff. The reference sequence was derived from BC008969.1. On Dec 31, 2002 this sequence version replaced gi:19424123. Summary: Voltage-gated potassium (Kv) channels represent the most complex class of voltage-gated ion channels from both functional and structural standpoints. Their diverse functions include regulating neurotransmitter release, heart rate, insulin secretion, neuronal excitability, epithelial electrolyte transport, smooth muscle contraction, and cell volume. This gene encodes a member of the potassium channel, voltage-gated, subfamily G. This member functions as a modulatory subunit. The gene has strong expression in brain. Alternative splicing results in two transcript variants encoding distinct isoforms. Transcript Variant: This variant (2) has an alternate 3' sequence, as compared to variant 1. It encodes isoform 2 that is shorter and has a distinct C-terminus as compared to isoform 1. // 4.1.3 NUCLEOTIDE FEATURE ANNOTATION ----------------------------------- Gene, mRNA, CDS: Every effort is made to consistently provide the Gene and coding sequence (CDS) feature (when relevant). If a RefSeq is based on a GenBank record that is only annotated with the CDS, then a Gene feature is created. mRNA features are provided for most eukaryotic records; this is not yet comprehensively provided and will improve in future releases. 
Gene Names:
Gene symbols and names are provided by external official nomenclature groups for some organisms. If official nomenclature is not available we may use a systematic name provided by the data submitter or apply a more functional name during curation. When official nomenclature is available we may provide additional alternate names for some organisms.

Variation:
Variation is computed by the dbSNP database staff and added via post-processing to RefSeq records.

Miscellaneous:
For some records, additional annotation may be provided when identified by the curation staff or provided by a collaborating group. For example, the location of polyA signal and sites may be included.

4.1.4 PROTEIN FEATURE ANNOTATION
--------------------------------

Protein Names:
Protein names may be provided by a collaborating group, may be based on the Gene Name, or for some records, the curation process may identify the preferred protein name based on that associated with a specific EC number or based on the literature.

Protein Products:
Signal peptide and mature peptide annotation is provided by propagation from the GenBank submission that the RefSeq is based on, when provided by a collaborating group, or when determined by the curation process.

Domains:
Domains are computed by alignment to the NCBI Conserved Domain Database for human, mouse, rat, zebrafish, nematode, and cow. The best hits are annotated on the RefSeq. For some records, additional functionally significant regions of the protein may be annotated by the curation staff. Domain annotation is not provided comprehensively at this time.

4.2 Tracking Identifiers
------------------------

Several identifiers are provided on RefSeq records that can be used to track relationships between annotated features, relationships between RefSeq records, and changes to RefSeq records over time. The GeneID identifies the related Gene, mRNA, and CDS features.
Transcript IDs (RefSeq accessions) provide an explicit connection between a transcript feature annotated on a genomic RefSeq record, and the RefSeq transcript record itself. Likewise, the Protein ID (RefSeq accession) provides the association between the annotated CDS feature on a genomic or transcript RefSeq record, and the protein record itself. Changes to a RefSeq sequence over time can be identified by changes to the GI and version number.

4.2.1 GeneID
------------

A gene feature database cross-reference qualifier (dbxref), the GeneID, is provided on many RefSeq records to support access to the Entrez Gene database. Entrez Gene provides gene-oriented information for a sub-set of the RefSeq collection. Gene includes data for all Eukaryotic genomes, viral genomes, and representative Prokaryotic genomes.

The GeneID provides a distinct tracking identifier for a gene or locus and is provided on the gene, mRNA, and CDS features. The GeneID can be used to identify a set of related features; this is especially useful when multiple products are provided to represent alternate splicing events.

For example:

//
     gene            19683..104490
                     /gene="DLEC1"
                     /db_xref="GeneID:9940"   <<<--- GeneID
                     /db_xref="MIM:604050"
//

When viewing RefSeq records via the internet, the GeneID is hot-linked to Entrez Gene.

4.2.2 Transcript ID
-------------------

The transcript_id qualifier found on a mRNA or other RNA feature annotation provides an explicit correspondence between a feature annotation on a genomic record and the RefSeq transcript record.

For example:

NT_011523.9 Homo sapiens chromosome 22 genomic contig.

//
     mRNA            complement(231444..239103)
                     /gene="PKDREJ"
                     /product="polycystic kidney disease (polycystin) and REJ
                     (sperm receptor for egg jelly homolog, sea urchin)-like"
                     /note="Derived by automated computational analysis using
                     gene prediction method: BestRefseq,BLAST.
                     Supporting evidence includes similarity to: 3 mRNAs"
                     /transcript_id="NM_006071.1"   <<<--- linked RefSeq transcript
                     /db_xref="GI:5174632"
                     /db_xref="GeneID:10343"
                     /db_xref="MIM:604670"
//

4.2.3 Protein ID
----------------

The protein_id qualifier found on a coding region (CDS) feature provides an explicit correspondence between feature annotation on a genomic or transcript RefSeq record and the RefSeq protein record.

For example:

NC_001144.4 Saccharomyces cerevisiae chromosome XII, complete sequence.

//
     CDS             complement(16639..17613)
                     /gene="MHT1"
                     /locus_tag="YLL062C"
                     /experiment="experimental evidence, no additional details
                     recorded"
                     /note="S-methylmethionine-homocysteine methyltransferase,
                     functions along with Sam4p in the conversion of
                     S-adenosylmethionine (AdoMet) to methionine to control the
                     methionine/AdoMet ratio"
                     /codon_start=1
                     /product="Mht1p"
                     /protein_id="NP_013038.1"   <<<--- linked RefSeq protein
                     /db_xref="GI:6322966"
                     /db_xref="SGD:S000003985"
                     /db_xref="GeneID:850664"
//

4.2.4 Conserved Domain Database (CDD) ID
----------------------------------------

Protein domain annotation is calculated by the Conserved Domain Database and is included in RefSeq protein records processed for the FTP site. Domain annotation appears as a Region feature on protein records and is propagated to associated transcript features (if available) as a misc_feature.

The feature annotation includes a dbxref cross-reference to the CDD database that is the equivalent of a gi identifier in that it may change over time. The dbxref retrieves a domain model as calculated at a point in time; recalculation of domains by the CDD group may result in a new CDD identifier value. The CDD dbxref values that are available in the RefSeq release, although not stable, will continue to retrieve data from the CDD database where a newer identifier value may be found.

For example:

VERSION     NP_000550.2  GI:28302131
DEFINITION  A-gamma globin [Homo sapiens].
//
     Region          5..142
                     /region_name="globin"
                     /note="Globins are heme proteins, which bind and transport
                     oxygen; cd01040"
                     /db_xref="CDD:29979"   <<--- CDD identifier
//

=============================================================================
5. REFSEQ ADMINISTRATION
=============================================================================

The National Center for Biotechnology Information (NCBI), National Library of Medicine, National Institutes of Health, is responsible for the production and distribution of the NIH RefSeq Sequence Database. NCBI distributes RefSeq sequence data by anonymous FTP. For more information, you may contact NCBI by email at info@ncbi.nlm.nih.gov or by phone at 301-496-2475.

5.1 Citing RefSeq
-----------------

When citing data in RefSeq, it is appropriate to give the sequence name, and primary accession and version number (or GI). Note that the most accurate citation of the sequence is provided by including the combined accession plus version number or the GI number.

It is also appropriate to list a reference for the RefSeq project. Please refer to the RefSeq web site for the most recent publication.

   http://www.ncbi.nlm.nih.gov/refseq/publications/

5.2 RefSeq Distribution Formats
-------------------------------

Complete flat file releases of the RefSeq database are available via NCBI's anonymous ftp server:

   ftp://ftp.ncbi.nlm.nih.gov/refseq/release/

Each release is cumulative, incorporating previous data plus new data. Records that have been suppressed are not included in the release.
Incremental updates that become available between RefSeq releases are available at: ftp://ftp.ncbi.nlm.nih.gov/refseq/daily ftp://ftp.ncbi.nlm.nih.gov/refseq/cumulative Please refer to the README for additional information: ftp://ftp.ncbi.nlm.nih.gov/refseq/README 5.3 Other Methods of Accessing RefSeq Data ------------------------------------------ Entrez is a molecular biology database system that presents an integrated view of DNA and protein sequence data, structure data, genome data, publications, and other data fields. The Entrez query and retrieval system is produced by the National Center for Biotechnology Information (NCBI) and is available only via the internet. Entrez is accessed at: http://www.ncbi.nlm.nih.gov/Entrez/ RefSeq entries are indexed for retrieval in the Entrez system. The web-based filter restrictions can be used to restrict your query to RefSeq data or to specific subsets of the RefSeq database. Additional specific property restrictions are provided to support querying for RefSeq records with specific STATUS codes. Queries are defined on the RefSeq web site at: http://www.ncbi.nlm.nih.gov/RefSeq/ 5.4 Request for Corrections and Comments ---------------------------------------- We welcome your suggestions to improve the RefSeq collection; we invite groups interested in contributing toward the collection and curation of the RefSeq database to improve the representation of single genes, gene families, or complete genomes to contact us. Please refer to RefSeq accession and version numbers (or GI) and the RefSeq Release number to which your comments apply; it is useful if you indicate the source of data that you found to be problematic (e.g., data on the FTP site, data retrieved on the web site), the entry DEFLINE, and the specific annotation field for which you are suggesting a change. 
Suggestions and corrections can be sent to: info@ncbi.nlm.nih.gov 5.5 Credits and Acknowledgements -------------------------------- This RefSeq release would not be possible without the support of numerous collaborators and the primary sequence data that is submitted by thousands of laboratories and available in GenBank. The RefSeq project is ambitious in scope and we actively welcome opportunities to work with other groups to provide this collection. We value all of our collaborators; they contribute information with a large range in scope and volume such as completely annotated genomes, advice to improve the sequence or annotation of individual RefSeq records, information about official nomenclature, and information about function. In addition to the significant information collected by collaboration, numerous NCBI staff are involved in infrastructure support, programmatic support, and curation. RefSeq is supported by 3 primary work groups that are associated with Entrez Gene, Entrez Genomes, and the Genome Annotation Pipeline. 5.6 Disclaimer -------------- The United States Government makes no representations or warranties regarding the content or accuracy of the information. The United States Government also makes no representations or warranties of merchantability or fitness for a particular purpose or that the use of the sequences will not infringe any patent, copyright, trademark, or other rights. The United States Government accepts no responsibility for any consequence of the receipt or use of the information. For additional information about RefSeq releases, please contact NCBI by e-mail at info@ncbi.nlm.nih.gov or by phone at (301) 496-2475.