******************************************************************************
RefSeq-release3.txt     ftp://ftp.ncbi.nih.gov/refseq/release/release-notes/

            NCBI Reference Sequence (RefSeq) Database

                          Release 3
                      January 13, 2004

                  Distribution Release Notes

Release Size: 2218 organisms, 7992741222 nucleotide bases,
              294647847 amino acids, 1101244 records
******************************************************************************

This document describes the format and content of the flat files that
comprise releases of the NCBI Reference Sequence (RefSeq) database.

Additional information about RefSeq is available at:

1. NCBI Handbook:
   http://www.ncbi.nlm.nih.gov/books/bv.fcgi?call=bv.View..ShowTOC&rid=handbook.TOC&depth=2

2. RefSeq Web Site:
   http://www.ncbi.nih.gov/RefSeq/

If you have any questions or comments about RefSeq, the RefSeq release
files or this document, please contact NCBI by email at:
info@ncbi.nlm.nih.gov.

To receive announcements of future RefSeq releases and large updates
please subscribe to NCBI's refseq-announce mail list:

  send email to refseq-announce-subscribe@ncbi.nlm.nih.gov
  with "subscribe" in the subject line (without quotes)

  OR

  subscribe using the web interface at:
  http://www.ncbi.nlm.nih.gov/mailman/listinfo/refseq-announce

=============================================================================
TABLE OF CONTENTS
=============================================================================

1. INTRODUCTION
   1.1 Release 3
   1.2 Cutoff date
   1.3 RefSeq Project Background
       1.3.1 Sequence accessions, validation, and annotations
       1.3.2 Data assembly, curation, and collaboration
       1.3.3 Biologically non-redundant data set
       1.3.4 RefSeq and DDBJ/EMBL/GenBank comparison
   1.4 Uses and applications of the RefSeq database
2. CONTENT
   2.1 Organisms included
   2.2 Molecule Types included
   2.3 Known Problems, Redundancies, and Inconsistencies
   2.4 Last genome update for select major organisms
   2.5 Release Catalog
   2.6 Changes since the previous release
3. ORGANIZATION OF DATA FILES
   3.1 FTP Site Organization
   3.2 File Names and Formats
   3.3 File Sizes
   3.4 Statistics
   3.5 Release Catalog
   3.6 Accession Format
   3.7 Growth of RefSeq
4. FLAT FILE ANNOTATION
   4.1 Main features of RefSeq Flat File
       4.1.1 LOCUS, DEFLINE, ACCESSION, KEYWORDS, SOURCE, ORGANISM
       4.1.2 REFERENCE, DIRECT SUBMISSION, COMMENT
       4.1.3 FEATURE ANNOTATION (Gene, mRNA, CDS, Variation, Protein)
   4.2 Tracking Identifiers
       4.2.1 GeneID and LocusID
       4.2.2 Transcript ID
       4.2.3 Protein ID
       4.2.4 Conserved Domain Database (CDD) ID
5. REFSEQ ADMINISTRATION
   5.1 Citing RefSeq
   5.2 RefSeq Distribution Formats
   5.3 Other Methods of Accessing RefSeq Data
   5.4 Request for Corrections and Comments
   5.5 Credits and Acknowledgements
   5.6 Disclaimer

=============================================================================
1. INTRODUCTION
=============================================================================

The NCBI Reference Sequence Project (RefSeq) is an effort to provide the
best single collection of naturally occurring biomolecules, representative
of the central dogma, for each major organism. Ideally this would include
one sequence record for each chromosome, organelle, or plasmid linked on a
residue by residue basis to the expressed transcripts, to the translated
proteins, and to each mature peptide product. Depending on the organism,
we may have some, but not all, of this information at any given time. We
pragmatically include the best view we can from available data.

1.1 Release 3
-------------
The National Center for Biotechnology Information (NCBI) at the National
Library of Medicine (NLM), National Institutes of Health (NIH) is
responsible for producing and distributing the RefSeq Sequence Database.
Records are provided through a combination of collaboration and in-house
processing including some curation by in-house staff comprised of expert
biologists.

RefSeq Release 3 is a full release of all NCBI RefSeq records.
The RefSeq project is an ongoing effort to provide a curated, non-redundant
collection of sequences. This release includes all of the sequence data
that we have collected at this time. Although the RefSeq collection is not
yet complete, its value as a non-redundant dataset has reached a level that
justifies providing full releases.

1.2 Cutoff date
---------------
This full release, Release 3, incorporates data available as of
January 13, 2004.

For more recent data, users are advised to:

 . Download the RefSeq daily update files from the RefSeq FTP site
     ftp://ftp.ncbi.nih.gov/refseq/daily/new/

 . Use the interactive web Entrez Query systems to query based on date
     http://www.ncbi.nih.gov/Entrez/

Notice of Change: a new directory has been created (daily/new/) to provide
daily updates in the same file formats as are made available with the
release. The original file formats provided in the 'daily' directory will
be retained until January 2004. At that time, the daily updates will only
be provided in the file formats consistent with the release and the 'new'
directory will be removed.

1.3 RefSeq Project Background
-----------------------------

1.3.1 Sequence accessions, validation, and annotation
-----------------------------------------------------
Every sequence is assigned a stable accession, version, and gi, and all
older versions remain available over time. RefSeq accessions have a
distinct format (see section 3.6); the underscore ("_") is the primary
distinguishing feature of a RefSeq accession. DDBJ/EMBL/GenBank accessions
never include an underscore.

Sequences are validated in several ways. For example, validation confirms
that the genomic sequence from the region of an mRNA feature matches the
mRNA sequence itself, and that the annotated coding region features can be
translated into the protein sequences they refer to. Validation also checks
for valid ASN.1 format.
For genomes included in the LocusLink database, validation ensures
consistency is maintained for descriptive information (symbols, gene and
protein names) between RefSeq and LocusLink records.

Each molecule is annotated as accurately as possible with the correct
organism name, the correct gene symbol for that organism, and reasonable
names for proteins where possible. When available, nomenclature provided by
official nomenclature groups is used. Note that gene symbols are not
required or expected to be unique either across species or within a
species.

1.3.2 Data assembly, curation, and collaboration
------------------------------------------------
We welcome collaborations with authoritative groups outside NCBI who are
willing to provide the sequences, annotations, or links to phenotypic or
organism specific resources. Where such collaborations have not yet
developed, NCBI staff have assembled the best view of the organism that we
can put together ourselves. In some cases, as with the human genome, NCBI
is an active participant in generating the genome assembly and in providing
reference sequences to represent the annotated genome. For other genomes,
we may compile the data ourselves from DDBJ/EMBL/GenBank or other public
sources.

For instance, we may simply select the "best" DDBJ/EMBL/GenBank record by
automatic means, validate the data format (and correct if needed), and add
an essentially unchanged copy to the RefSeq collection, attributed to the
original DDBJ/EMBL/GenBank record. In other cases we may provide a record
that is very similar to the DDBJ/EMBL/GenBank record, but to which experts
at NCBI have added corrected or additional annotation. This latter process
can range from minor technical repairs to a manually curated re-annotation
of the sequence, often in collaboration with experts outside NCBI.

Each record that has been curated, or that is in the pool for future
curation, is labeled with the level of curation it has received.
Curation status information is provided primarily for transcript and
protein records. Curation is carried out on the whole genome level for some
smaller genomes such as viral, organelle, and some microbial genomes.
Curation status codes are defined in section 3.5 below.

1.3.3 Biologically non-redundant data set
-----------------------------------------
RefSeq provides a biologically non-redundant set of sequences for database
searching and gene characterization. It has the advantage of providing an
objective and experimentally verifiable definition of "non-redundant" in
supplying one example of each natural biomolecule per organism. The small
amount of sequence redundancy introduced from close paralogs, alternate
splicing products, and genome assembly intermediates is compensated for by
the clarity of the model. RefSeq provides the substrate for a variety of
conclusions about non-redundancy based on clustering identical sequences,
or families of related sequences, without confounding the database itself
with these more subjective assessments.

1.3.4 RefSeq and DDBJ/EMBL/GenBank comparison
---------------------------------------------
RefSeq is unique in providing a large curated database across many
organisms, which precisely and explicitly links genetic (chromosome),
expression (mRNA), and functional (protein) sequence data into an
integrated whole. DDBJ/EMBL/GenBank also integrates DNA and protein
information, and RefSeq is substantially based on sequence records
contributed to DDBJ/EMBL/GenBank. However, RefSeq is similar to a review
article in that it represents a synthesis and summary of information by a
particular group (NCBI or other RefSeq contributors) that is based on the
primary data gathered by many others and made part of the scientific
record. Also, like a review article, it has the advantage of organizing a
large body of diverse data into a single consistent framework with a
uniform set of conventions and standards.
Note that while based on DDBJ/EMBL/GenBank, RefSeq is distinct from
DDBJ/EMBL/GenBank. DDBJ/EMBL/GenBank represents the sequence and
annotations supplied by the original authors and is never changed by NCBI
or RefSeq staff. DDBJ/EMBL/GenBank remains the primary sequence archive
while RefSeq is a summary and synthesis based on that essential primary
data.

1.4 Uses and applications of the RefSeq database
------------------------------------------------
A stable, consistent, comprehensive, non-redundant database of genomes and
their products provides a valuable sequence resource for similarity
searching, gene identification, protein classification, comparative
genomics, and selection of probes for gene expression. It also acts as
molecular "white pages" by providing a single, uniform point of access for
searching at the sequence level, and by connecting the results with a
diversity of organism-specific databases or resources unique to that
organism or field.

=============================================================================
2. CONTENT
=============================================================================

2.1 Organisms included
----------------------
This release includes records representing 2218 distinct taxonomic
categories, as measured by counting the number of distinct tax_ids included
in the release. Tax_ids are provided, for all species having any amount of
sequence data, by the NCBI Taxonomy group.

The release includes species ranging from viral to microbial to eukaryotic
and includes organisms for which complete and incomplete genomic sequence
data is available.

The release does not include all species for which some sequence data is
available in DDBJ/EMBL/GenBank. The decision to generate RefSeq data for a
species depends in part on the amount of sequence data available.
Additional species will be represented in the RefSeq collection as more
sequence data becomes available.
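
The organism count described above can be reproduced from the release
catalog. The following is a minimal sketch only. It assumes that the
catalog is a tab-delimited text file (the delimiter is not stated in these
notes) and that tax_id is the first column, following the column order
given in section 3.5. The function name and the sample rows are
hypothetical.

```python
from typing import Iterable


def count_distinct_tax_ids(lines: Iterable[str], delimiter: str = "\t") -> int:
    """Count distinct tax_ids, read from the first column of each row.

    ASSUMPTION: rows are delimiter-separated (tab by default) and tax_id
    is the first column, as listed in section 3.5 of these notes.
    """
    tax_ids = set()
    for line in lines:
        fields = line.rstrip("\r\n").split(delimiter)
        tax_id = fields[0].strip()
        if tax_id:  # ignore blank lines
            tax_ids.add(tax_id)
    return len(tax_ids)


if __name__ == "__main__":
    # Made-up rows for illustration only; not taken from the real catalog.
    sample = [
        "1111\tExample species A\tNC_999991.1\t100\tcomplete\tna\t5000\n",
        "1111\tExample species A\tNP_999992.1\t101\tcomplete\tUNKNOWN\t300\n",
        "2222\tExample species B\tNM_999993.1\t102\tcomplete\tREVIEWED\t1200\n",
    ]
    print(count_distinct_tax_ids(sample))  # prints 2
```

To run this against a downloaded catalog, pass an open file handle, since
a text file object yields its lines when iterated.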

2.2 Molecule Types Included
---------------------------
The RefSeq release includes genomic, transcript, and protein sequence data;
however, these molecule types are not provided for all organisms and the
sequences provided may not be complete or comprehensive for some species.

Transcript RefSeq records may represent protein-coding transcripts or
non-coding RNA products; these records are currently only provided for
eukaryotic species.

Genomic RefSeq records are provided when a sufficient quantity of genomic
sequence data is available in DDBJ/EMBL/GenBank. Transcript and protein
records may be provided for a species before genomic sequence data is
available, as is the case with Danio rerio (zebrafish).

2.3 Known Problems, Redundancies, and Inconsistencies
------------------------------------------------------
The RefSeq collection is an ongoing project that is expected to grow in
scope and content over time. Thus it is important to recognize that it is
not complete in that some genomes are not yet completely sequenced, some
incompletely sequenced genomes may not be included, or some gene products
may not yet be represented. RefSeq records may be added, removed, or
updated in future releases as new information becomes available and as a
result of curation.

Genomes with pending updates:

  Homo sapiens:
    An annotation update for the assembled human genome was running
    concurrently with processing for this release. Release 3 includes
    model RefSeqs (with accession prefix XM_ and XP_) from human genome
    build 34.1; the updated annotation represented in human genome build
    34.2 will be available in the Map Viewer FTP site:

      ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/RNA/
      ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/protein/

  An update for the following large genomes is anticipated to occur before
  the next RefSeq release:

      Drosophila melanogaster
      Arabidopsis thaliana

Known Data inconsistencies:

[1] RefSeq status codes are not consistently provided for some species.
    The goal is to consistently provide a status code for all RefSeq
    records. The release catalog indicates "UNKNOWN" if a status code was
    expected but not detected and "na" if a status code is not expected
    based on the original project plan for provision of this type of
    information. Status codes will be more consistently applied to all
    records in the future.

[2] The genomic, transcript, and protein collection is known to be
    incomplete for many species. This is particularly true for those
    genomes for which a complete genome assembly is not yet available, such
    as Danio rerio (zebrafish), Bos taurus (cow), and Leishmania major. As
    additional sequence data becomes available, the RefSeq representation
    for these, and other, organisms will increase.

[3] Although the goal is to provide a non-redundant collection, some
    redundancy is included in this release as follows:

    Redundant Protein records:

      Alternate Splicing
        When additional transcripts are provided to represent alternate
        splicing products, and the alternate splice site occurs in the
        UTR, then the protein is redundantly provided.

      Paralogs
        The goal is to provide a RefSeq record for each naturally
        occurring molecule. Therefore, records are provided for all genes
        identified including those produced by more recent gene
        duplication events in which the genes are nearly identical.

    Redundant Genomic records:

      Intermediate records
        For some species, intermediate genomic records are provided to
        support the assembly and/or annotation of the genome. For example,
        for human, a chromosome may be represented by a chromosome RefSeq
        record with an NC_ accession prefix. The chromosome record may
        consist of many contigs, each represented as a separate record
        with an NT_ accession prefix. In addition, some curated gene
        region records, with an NG_ accession prefix, may also be provided
        to support annotation of complex regions.

      Alternate assemblies
        Genomic records are provided to represent alternate assemblies of
        genomic sequence derived from different populations. These records
        will have varying levels of redundancy and represent polymorphic
        and haplotype differences in terms of the sequence and annotation.
        For example, alternate assemblies are provided for different mouse
        strains and for regions of the human major histocompatibility
        complex (MHC). The MHC is a highly variable region of chromosome 6
        which exhibits variation at the level of both sequence
        polymorphism and gene content. The alternate assemblies make it
        possible to represent this alternate gene content.

[4] Wrong molecule type: NC_004958 is erroneously annotated as an RNA
    molecule and thus provided in the file plasmid1.rna.gbff, whereas it
    should have been provided in the file plasmid.genomic1.gbff. This
    record will be updated for the next release.

[5] Plasmid data: Due to a processing error, release 2 did not include the
    complete RefSeq collection for plasmids; this has been corrected for
    release 3.

2.4 Notes on select major organisms
-----------------------------------

Anopheles gambiae
  Genomic sequence data is available as whole genome shotgun (WGS).

Arabidopsis thaliana
  Release 4.0 of the annotated genome was provided by TIGR in July, 2003.
  The RefSeq release includes chromosomes, transcripts, and proteins. An
  update to the Arabidopsis RefSeq collection is in progress.

Caenorhabditis elegans
  The RefSeq release includes an annotation update that was released on
  October 16, 2003, for the genome version available on March 7, 2003
  (genome release 2.0). The release includes chromosome, transcript, and
  protein records.

Drosophila melanogaster
  Release 3.1 of the assembled, annotated genome was provided by FlyBase in
  March 2003. An update to the Drosophila RefSeq collection is in progress.

Homo sapiens
  NCBI provides the human genome assembly in close collaboration with the
  sequencing centers.
  RefSeq release 3 includes human genome build 34.1, which is based on data
  available on July 30, 2003. An annotation update, 34.2, was released
  during the RefSeq release processing. The release includes RefSeq
  chromosomes, contigs, known transcripts and proteins (as defined by
  having a Locus ID), and derived model transcripts and proteins predicted
  by the Genome Annotation pipeline.
  See: http://www.ncbi.nlm.nih.gov/genome/guide/build.html

Magnaporthe grisea
  The assembled annotated WGS genome was provided by the Whitehead
  Institute and RefSeqs were available on December 29, 2003. The release
  includes genomic, transcript, and protein records.

Mus musculus
  NCBI provides the mouse genome assembly in close collaboration with the
  sequencing centers. This RefSeq release includes mouse genome build 32,
  which is based on data available in September, 2003. Release 3 includes
  RefSeq contigs, known transcripts and proteins, and derived model
  transcripts and proteins predicted by the Genome Annotation pipeline. A
  mouse genome update is imminent at the time of this release processing
  and will be included in the next release.

Neurospora crassa
  The annotated genome data was supplied by the Whitehead Institute.
  RefSeqs were released on July 2, 2003 and include WGS genomic contigs,
  predicted transcripts, and predicted proteins. The RefSeq data does not
  represent the subset of small WGS contigs that were not mapped to a
  chromosome position or do not include annotation.

Oryza sativa
  The genome is being sequenced by the International Rice Genome Sequencing
  Project. RefSeqs are provided by NCBI processing to generate the
  annotated genomic contigs; annotation is propagated from the submitted
  BAC clones. The rice RefSeq set does not represent annotation from BAC
  clones that did not fall into supercontigs, even though they are mapped
  to a chromosome position. RefSeqs were first released on October 6, 2003
  and include genomic contigs, transcripts, and proteins.

Rattus norvegicus
  NCBI uses the rat whole genome shotgun (WGS) genome assembly provided by
  Baylor sequencing center. RefSeq release 3 includes rat genome build 2,
  which is based on the RGSC v3.1 assembly, provided by the Rat Genome
  Sequencing Consortium (RGSC). RefSeqs include contigs, known transcripts
  and proteins, and derived model transcripts and proteins predicted by the
  Genome Annotation pipeline.

Saccharomyces cerevisiae
  Provided by Saccharomyces Genome Database (SGD); this release includes
  the chromosome and protein records updated on December 24, 2003.

Schizosaccharomyces pombe
  Provided by Saccharomyces Genome Database (SGD); this release includes
  the chromosome and protein records updated on January 12, 2004.

Microbial
  The RefSeq collection includes incomplete WGS microbial genomes for which
  an accession is provided for each contig; thus, the number of accessions
  for this category is significantly greater than the number of organisms
  represented. This RefSeq release includes 148 complete microbial genomes.
  Microbial genomes are annotated by a collaborative automatic computation
  method, followed by curation by NCBI staff.

  Three microbial genomes have become available in RefSeq since the last
  release:

    Geobacter sulfurreducens PCA
    Rhodopseudomonas palustris strain CGA009
    Onion Yellows phytoplasma

  Nineteen microbial genomes have been curated:

    Aeropyrum pernix
    Archaeoglobus fulgidus
    Buchnera sp. APS
    Buchnera aphidicola Sg
    Corynebacterium glutamicum
    Escherichia coli K-12
    Escherichia coli O157:H7
    Haemophilus influenzae
    Lactococcus lactis subsp. lactis
    Mycoplasma genitalium
    Mycoplasma pneumoniae
    Oceanobacillus iheyensis
    Salmonella typhimurium LT2
    Shewanella oneidensis
    Pyrococcus abyssi
    Pyrococcus furiosus
    Pyrococcus horikoshii
    Thermoplasma volcanium
    Vibrio vulnificus CMCP6

Viruses
  This RefSeq release includes over 1269 distinct viral records which have
  been curated via an extensive collaboration between the international
  virologist community and NCBI staff virologists. A panel of viral genome
  advisors has been established.

For more information please see:

  RefSeq Collaborations:
    http://www.ncbi.nih.gov/RefSeq/collaborators.html
  Viral Genome Advisors:
    http://www.ncbi.nih.gov/PMGifs/Genomes/viradvisors.html
  Microbial Contributors:
    http://www.ncbi.nih.gov/RefSeq/microbialcontrib.html

2.5 Release Catalog
-------------------
The Release Catalog documents the full contents of the RefSeq Release. The
catalog can be used to identify data of interest. See the format
description in section 3.5 for additional information.

The release catalog is available at:
  ftp://ftp.ncbi.nih.gov/refseq/release/release-catalog/RefSeq-release3.catalog

The catalog for previous releases is available in the archive directory:
  ftp://ftp.ncbi.nih.gov/refseq/release/release-catalog/archive/

2.6 Changes since the last release
----------------------------------

New Files:
  Accessions that were included in the previous release but not in the
  current release are now being reported in:
    ftp://ftp.ncbi.nih.gov/refseq/release/release-catalog/release3.removed-records
  See section 3.6 for a description of the file format.

  A comprehensive list of sequence files provided for the current release
  is available in:
    ftp://ftp.ncbi.nih.gov/refseq/release/release-catalog/release3.files.installed

New large genomes:
  Magnaporthe grisea

GeneID:
  The GeneID dbxref is being added to the RefSeq collection incrementally.
  This release includes a greater number of GeneID dbxrefs but it is not
  yet comprehensively supplied. See section 4.2.1 for more information.

Plasmid data:
  Due to a processing error, release 2 did not include the complete RefSeq
  collection for plasmids; this has been corrected for release 3.

=============================================================================
3. ORGANIZATION OF DATA FILES
=============================================================================

3.1 FTP Site Organization
-------------------------
RefSeq releases are available on the NCBI FTP site at:
  ftp://ftp.ncbi.nih.gov/refseq/release/

The RefSeq collection is provided in a redundant fashion to best meet the
needs of those who want the full collection as well as those who want a
specific sub-set of the collection. Therefore the collection is provided
as:

  1) the complete collection, and
  2) sections as defined by major taxonomic or other logical groupings.

A subdirectory exists for each sub-section as follows:

  fungi
  invertebrate
  microbial
  mitochondrion
  plant
  plasmid
  plastid
  protozoa
  vertebrate_mammalian
  vertebrate_other
  viral

In addition, the complete collection is available without these
sub-groupings in the subdirectory:

  complete

Note that this directory structure intentionally provides the release data
in a redundant fashion. We gave considerable thought to how to package the
release to meet the needs of different user groups. For instance, some
groups may be interested in retrieving the complete protein set, while
other groups may be interested in retrieving data for a more limited number
of organisms.

We decided to provide logical groupings based on general taxonomic node
(viral, mammalian etc) as well as logical molecule type
compartmentalization (e.g., plastid). Thus, all records are provided at
least twice, once in the "complete" directory, and a second time in one of
the other directories. Some sequences may be provided three times when it
is logical to include the record in more than one additional directory.
For example, a sequence may be provided in the "complete",
"mitochondrion", and "vertebrate_mammalian" directories.

We are interested in hearing if you find this structure useful or if you
would like information grouped in a different manner. Send suggestions or
comments to the NCBI Help Desk at: info@ncbi.nlm.nih.gov

3.2 File Names and Formats
--------------------------
File names are informative, and indicate the content, molecule type, and
file format of each RefSeq release data file. Most filenames utilize this
structure:

   directoryfilenumber.molecule.format.gz
       1        2         3       4

File Name Key:

  1. directory    directory level the file is provided in
                  (e.g., complete, viral, etc.)
  2. file number  large data sets are provided as incrementally numbered
                  files
  3. molecule     type of molecule (genomic, rna, or protein); not relevant
                  for ASN.1 format files provided in the "complete"
                  sub-directory
  4. format       the data format provided in the file; see below

For example:

   complete1.genomic.bna.gz
   vertebrate_mammalian2.protein.gpff.gz

RefSeq Whole Genome Shotgun (WGS) data are provided in separate files for
each WGS project. Their filenames use a slightly different structure:

   directoryWGSproject.molecule.format.gz

For example:

   completeNZ_AAAU.bna.gz
   microbialNZ_AAAV.genomic.fna.gz

All RefSeq release files have been compressed with the gzip utility;
therefore, an invariant ".gz" suffix is present for all release files.

The data that comprise a RefSeq release are available in several file
formats, as indicated by the format component in the file name:

  bna    binary ASN.1 format; includes nucleotide and protein
  gbff   GenBank flat file format; nucleotide records
  gpff   GenPept flat file format; protein records
  fna    FASTA format; nucleotide records
  faa    FASTA format; protein records

The comprehensive full release is deposited in the "complete" directory and
is available in all file types. Binary ASN.1 format is only provided in the
complete directory.
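
The file name structure described in this section can be split into its
components with a regular expression. This is a sketch under stated
assumptions, not an official parser: the function name is hypothetical, the
WGS project token is assumed to be "NZ_" followed by four uppercase letters
(inferred only from the two examples above), and the molecule component is
treated as optional because one of the WGS examples omits it. Names that do
not follow the structure used by "most" files return None.

```python
import re
from typing import Dict, Optional

# Directory and format names are taken from sections 3.1 and 3.2.
DIRECTORIES = (
    "complete", "fungi", "invertebrate", "microbial", "mitochondrion",
    "plant", "plasmid", "plastid", "protozoa", "vertebrate_mammalian",
    "vertebrate_other", "viral",
)
FORMATS = ("bna", "gbff", "gpff", "fna", "faa")

FILE_NAME_RE = re.compile(
    r"^(?P<directory>" + "|".join(DIRECTORIES) + r")"
    # Either an incrementing file number or a WGS project token.
    # ASSUMPTION: WGS projects look like NZ_ plus four uppercase letters.
    r"(?:(?P<file_number>\d+)|(?P<wgs_project>NZ_[A-Z]{4}))"
    # ASSUMPTION: the molecule component may be absent.
    r"(?:\.(?P<molecule>genomic|rna|protein))?"
    r"\.(?P<format>" + "|".join(FORMATS) + r")\.gz$"
)


def parse_release_file_name(name: str) -> Optional[Dict[str, Optional[str]]]:
    """Return the named components of a release file name, or None."""
    match = FILE_NAME_RE.match(name)
    return match.groupdict() if match else None


if __name__ == "__main__":
    for example in (
        "complete1.genomic.bna.gz",
        "vertebrate_mammalian2.protein.gpff.gz",
        "completeNZ_AAAU.bna.gz",
        "microbialNZ_AAAV.genomic.fna.gz",
    ):
        print(example, parse_release_file_name(example))
```

Components that are absent from a given name (for example, the file number
in a WGS file name) are reported as None in the returned dictionary.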
The remaining directories include all of the remaining file types.

The GenBank and GenPept flat file formats provided in this release match
those seen when accessing the records using the NCBI web site. Notably,
some RefSeq records are in the CON division and do not instantiate the
sequence on the flat file display; instead, a 'join' statement is provided
to indicate the assembly instructions. The FASTA files do include the
assembled sequences for these CON division RefSeq records. For example, see
NC_000022.

Suggestions regarding the structure of the RefSeq release product and the
available formats may be sent to the NCBI Help Desk: info@ncbi.nlm.nih.gov

3.3 File Sizes
--------------
RefSeq release files are provided in a range of sizes. Most are limited to
several hundred megabytes. However, some of the genomic FASTA files can
exceed 2 GB. Files are compressed to reduce file size and facilitate FTP
retrieval.

The total size of release 3 (includes all directories) is as follows:

  Extension   Size (GB)   Type
  -----------------------------------------------------------
  bna            5.04     ASN.1
  gbff           8.58     GenBank flat file
  gpff           5.28     GenPept flat file
  fna           33.50     FASTA, nucleotide
  faa            0.71     FASTA, protein

Note: for release 3, the complete directory provides all file types. The
ASN.1 format is only available in the complete directory; the file sizes
reported for the remaining file formats represent the redundant total found
in the complete plus other directories.

3.4 Statistics
---------------
RefSeq release 3 includes sequences from 2218 different organisms.

The number of species represented in each Release sub-directory, determined
by counting distinct tax IDs, is as follows:

  complete               2218
  fungi                    37
  invertebrate             87
  microbial               395
  mitochondrion           475
  plant                    34
  plasmid                 317
  plastid                  34
  protozoa                 40
  vertebrate_mammalian     98
  vertebrate_other        221
  viral                  1269

Total Number of Accessions and Length (number of nucleotides or amino
acids), per type of molecule:

              Accessions   Basepairs/Residues
  Genomic:         58793           7648922017
  RNA:            198043            343819205
  Protein:        844408            294647847

Complete RefSeq release statistics for each directory are provided in a
separate document. Please see:
  ftp://ftp.ncbi.nih.gov/refseq/release/release-statistics/
  file: RefSeq-release3.1132004.stats.txt

Statistics for previous releases are available in the archive subdirectory:
  ftp://ftp.ncbi.nih.gov/refseq/release/release-statistics/archive/

3.5 Release Catalog Format
--------------------------
The full non-redundant contents of the release are documented in the
release catalog. The catalog includes the following columns:

  1. tax_id
  2. species name
  3. RefSeq accession.version
  4. gi
  5. FTP directories data is provided in
  6. RefSeq status code
  7. sequence length

Note: the molecule type for each catalog entry can be inferred from the
accession prefix (see below).

RefSeq Status Codes are documented on the RefSeq web site. The catalog
includes the following terms:

  na
      Not Applicable; status codes are not provided for some genomic
      records.

  UNKNOWN
      The status code has not yet been applied.

  REVIEWED
      The RefSeq record has been reviewed by NCBI staff or by a
      collaborator. Some RefSeq records may incorporate expanded sequence
      and annotation information including additional publications and
      features.

  VALIDATED
      The RefSeq record has undergone an initial review to provide the
      preferred sequence standard. The record has not yet been subject to
      final review, at which time additional functional information may be
      provided.

  PROVISIONAL
      The RefSeq record has not yet been subject to individual review and
      is thought to be well supported and to represent a valid transcript
      and protein.

  PREDICTED
      The RefSeq transcript may represent an ab initio prediction or may be
      partially supported by other transcript data; the protein is
      predicted.

  INFERRED
      The RefSeq record is inferred by genome sequence analysis.

  MODEL
      RefSeq records are provided via automated processing and are not
      subject to individual review or revision between builds.

3.6 release#.removed-records file format
----------------------------------------
This is a report of accessions that were included in the previous release
but are no longer included in the current release. The file includes the
following columns:

  1. tax_id
  2. species name
  3. RefSeq accession.version
  4. gi
  5. FTP directories data was provided in, in last release
  6. RefSeq status code
  7. sequence length
  8. type of removal

     type options include:
       dead protein
       replaced by accession   [original accession is not secondary]
       permanently suppressed
       temporarily suppressed  [record may become available again in the
                                future]

3.6 RefSeq Accession Format
---------------------------
RefSeq accessions are formatted as a two letter prefix, followed by an
underscore, followed by either six digits or four letters plus eight
digits. For example, NM_020236 and NZ_AABC02000001.

The underscore ("_") is the primary distinguishing feature of a RefSeq
accession; DDBJ/EMBL/GenBank accessions never include an underscore.

The RefSeq accession prefix indicates the molecule type.

  Molecule Type    Accession Prefix
  ----------------------------------------------
  protein          NP_; XP_; ZP_
  rna              NM_; NR_; XM_; XR_
  genomic          NC_; NG_; NT_; NW_; NZ_

Additional information is available on the RefSeq Web site:
  http://www.ncbi.nih.gov/RefSeq/key.html#accessions

NOTICE OF CHANGE: The NP_ accession space will need to be expanded in the
near future; the new format will be NP_12345678. Existing accessions will
remain unchanged.
That is, existing accessions, such as NP_013474, will not be modified to
8 digits (NP_013474 and NP_00013474 will be distinct accessions identifying
different protein records). As other accession series need to be expanded,
they will also be expanded by adding 2 digits, with existing accessions
remaining stable.

3.7 Growth of RefSeq
--------------------

   Release  Date          Species  Nucleotides  Amino Acids  Records
      1     Jun 30, 2003     2005   4672871949    263588685  1061675
      2     Oct 21, 2003     2124   7745398573    286957682  1097404
      3     Jan 13, 2004     2218   7992741222    294647847  1101244

=============================================================================
4. FLAT FILE ANNOTATION
=============================================================================

4.1 Main features of RefSeq Flat File
-------------------------------------
Also see the RefSeq web site and the NCBI Handbook, RefSeq chapter.

   http://www.ncbi.nih.gov/RefSeq/
   http://www.ncbi.nlm.nih.gov/books/bv.fcgi?call=bv.View..ShowTOC&rid=handbook.TOC&depth=2

4.1.1 LOCUS, DEFLINE, ACCESSION, KEYWORDS, SOURCE, ORGANISM
--------------------------------------------------------------------
The beginning of each RefSeq record provides information about the
accession, length, molecule type, division, and last update date. This is
followed by the descriptive DEFINITION line, then by the accession,
version, and GI data, followed by detailed information about the organism
and taxonomic lineage.

//
LOCUS       NC_004916     384518 bp    DNA     linear   INV 26-JUN-2003
DEFINITION  Leishmania major chromosome 3, complete sequence.
ACCESSION   NC_004916
VERSION     NC_004916.1  GI:32189699
KEYWORDS    .
SOURCE      Leishmania major
  ORGANISM  Leishmania major
            Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae;
            Leishmania.
//

Note: Both the GI and VERSION number increment when a sequence is updated,
while the ACCESSION remains the same. The GI and "ACCESSION.VERSION"
identifiers provide the finest resolution reference to a sequence.
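
The VERSION line shown above, together with the accession prefix table in
section 3.6, can be read programmatically. The following Python sketch is
illustrative only and is not an NCBI tool: the function names
parse_version_line and molecule_type are hypothetical, and the regular
expression assumes the VERSION line layout of the example record
("VERSION", the accession.version, then "GI:" and a number).

```python
import re

# Prefix-to-molecule-type table, copied from section 3.6 of these notes.
PREFIX_TYPES = {
    "NP": "protein", "XP": "protein", "ZP": "protein",
    "NM": "rna", "NR": "rna", "XM": "rna", "XR": "rna",
    "NC": "genomic", "NG": "genomic", "NT": "genomic",
    "NW": "genomic", "NZ": "genomic",
}

# Assumed layout, based on the example: VERSION  NC_004916.1  GI:32189699
VERSION_RE = re.compile(
    r"^VERSION\s+([A-Z]{2}_[A-Z0-9]+)\.(\d+)\s+GI:(\d+)\s*$"
)


def parse_version_line(line):
    """Split a VERSION line into accession, version number, and GI."""
    match = VERSION_RE.match(line)
    if match is None:
        raise ValueError("not a recognized VERSION line: %r" % line)
    return {
        "accession": match.group(1),
        "version": int(match.group(2)),
        "gi": int(match.group(3)),
    }


def molecule_type(accession):
    """Return the molecule type for a RefSeq accession, or None.

    An accession without an underscore is not a RefSeq accession, and an
    unlisted prefix is reported as None rather than guessed.
    """
    prefix, underscore, _ = accession.partition("_")
    if not underscore:
        return None
    return PREFIX_TYPES.get(prefix)


if __name__ == "__main__":
    record = parse_version_line("VERSION     NC_004916.1  GI:32189699")
    print(record["accession"], record["version"], record["gi"])
    print(molecule_type(record["accession"]))
```

Because the GI and version both change when a sequence is updated while
the accession does not, a script that tracks records over time would
normally key on the accession and compare the version or GI between
releases.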

4.1.2 REFERENCE, DIRECT SUBMISSION, COMMENT
-------------------------------------------
REFERENCE:
While the majority of RefSeq records do include REFERENCE data, this data
is not required and some records do not include any citations.
Publications are propagated from the GenBank record(s) from which the
RefSeq is derived, provided by collaborating groups and NCBI staff during
the curation process, and provided by the National Library of Medicine
(NLM) PubMed MeSH indexing staff as they add new articles to PubMed.

Functionally relevant citations are added by individual scientists using
the LocusLink GeneRIF submission form, and a significant volume of citation
connections is supplied by the NLM MeSH indexing staff for human, mouse,
rat, zebrafish, and cow. This functionality is expected to expand in the
future to cover all organisms represented in the RefSeq collection.

Citations supplied by the MeSH indexers and individual scientists can be
identified by the presence of a REMARK beginning with the text string
"GeneRIF". This is a significant method for keeping sequence connections
to the literature up-to-date; GeneRIFs add considerable value to the RefSeq
collection. For more information on GeneRIFs please see:

   http://www.ncbi.nlm.nih.gov/LocusLink/GeneRIFhelp.html

For example, several GeneRIFs have been added to NM_000173.1 including:

//
REFERENCE   13 (bases 1 to 2480)
  AUTHORS   Poujol,C., Ware,J., Nieswandt,B., Nurden,A.T. and Nurden,P.
  TITLE     Absence of GPIbalpha is responsible for aberrant membrane
            development during megakaryocyte maturation: ultrastructural
            study using a transgenic model
  JOURNAL   Exp. Hematol. 30 (4), 352-360 (2002)
  MEDLINE   21935100
   PUBMED   11937271
  REMARK    GeneRIF: Absence of GPIbalpha is responsible for aberrant
            membrane development during megakaryocyte maturation; leads
            to abnormal partitioning of the membrane systems and abnormal
            proplatelet production.
//

DIRECT SUBMISSION:
A Direct Submission field is provided on some RefSeq records but not all.
It is propagated from the underlying GenBank record from which the RefSeq
is derived or provided on submissions from collaborating groups. Transcript
and protein RefSeqs for human, mouse, rat, zebrafish, and cow do not
provide this field as records often include additional data and are not
necessarily direct copies of the GenBank submission.

COMMENT:
A COMMENT identifying the RefSeq Status is provided for the majority of
the RefSeq records. This comment may include information about the RefSeq
status, collaborating groups, and the GenBank record(s) from which the
RefSeq is derived. The RefSeq COMMENT is not provided comprehensively in
this release. We are working to supply this COMMENT more comprehensively
in the future.

Additional COMMENTS are provided for some records to provide information
about the sequence function, notes about the aspects of curation, or
comments describing transcript variants.

A COMMENT is always provided if the GI has changed.

For example (from NM_133490):

//
COMMENT     REVIEWED REFSEQ: This record has been curated by NCBI staff. The
            reference sequence was derived from BC008969.1.
            On Dec 31, 2002 this sequence version replaced gi:19424123.
            Summary: Voltage-gated potassium (Kv) channels represent the
            most complex class of voltage-gated ion channels from both
            functional and structural standpoints. Their diverse functions
            include regulating neurotransmitter release, heart rate,
            insulin secretion, neuronal excitability, epithelial
            electrolyte transport, smooth muscle contraction, and cell
            volume. This gene encodes a member of the potassium channel,
            voltage-gated, subfamily G. This member functions as a
            modulatory subunit. The gene has strong expression in brain.
            Alternative splicing results in two transcript variants
            encoding distinct isoforms.
            Transcript Variant: This variant (2) has an alternate 3'
            sequence, as compared to variant 1.
            It encodes isoform 2 that is shorter and has a distinct
            C-terminus as compared to isoform 1.
//

4.1.3 NUCLEOTIDE FEATURE ANNOTATION
-----------------------------------
Gene, mRNA, CDS:
Every effort is made to consistently provide the Gene and coding sequence
(CDS) features (when relevant). If a RefSeq is based on a GenBank record
that is only annotated with the CDS, then a Gene feature is created. mRNA
features are provided for most eukaryotic records; coverage is not yet
comprehensive and will improve in future releases.

Gene Names:
Gene symbols and names are provided by external official nomenclature
groups for some organisms. If official nomenclature is not available we
may use a systematic name provided by the data submitter or apply a more
functional name during curation. When official nomenclature is available
we may provide additional alternate names for some organisms.

Variation:
Variation is computed by the dbSNP database staff and added via
post-processing to RefSeq records.

Miscellaneous:
For some records, additional annotation may be provided when identified by
the curation staff or provided by a collaborating group. For example, the
location of polyA signal and sites may be included.

4.1.4 PROTEIN FEATURE ANNOTATION
--------------------------------
Protein Names:
Protein names may be provided by a collaborating group, may be based on
the Gene Name, or for some records, the curation process may identify the
preferred protein name based on that associated with a specific EC number
or based on the literature.

Protein Products:
Signal peptide and mature peptide annotation is provided by propagation
from the GenBank submission that the RefSeq is based on, when provided by
a collaborating group, or when determined by the curation process.

Domains:
Domains are computed by alignment to the NCBI Conserved Domain Database
(CDD) for human, mouse, rat, zebrafish, nematode, and cow. The best hits
are annotated on the RefSeq.
For some records, additional functionally significant regions of the
protein may be annotated by the curation staff. Domain annotation is not
provided comprehensively at this time.

4.2 Tracking Identifiers
------------------------
Several identifiers are provided on RefSeq records that can be used to
track relationships between annotated features, relationships between
RefSeq records, and changes to RefSeq records over time.

The GeneID (and LocusID) identifies the related Gene, mRNA, and CDS
features. Transcript IDs (RefSeq accessions) provide an explicit connection
between a transcript feature annotated on a genomic RefSeq record, and the
RefSeq transcript record itself. Likewise, the Protein ID (RefSeq
accessions) provides the association between the annotated CDS feature on
a genomic or transcript RefSeq record, and the protein record itself.

Changes to a RefSeq sequence over time can be identified by changes to the
GI and version number.

4.2.1 GeneID and LocusID
------------------------
A gene feature database cross-reference qualifier (dbxref), the GeneID, is
being added to RefSeq records to support access to the new Entrez database,
Gene. The inclusion of this new dbxref is not yet comprehensively provided
for all RefSeq records. RefSeq updates will continue incrementally to apply
this dbxref to all records.

Entrez Gene provides gene-oriented information for the entire RefSeq
collection. It represents a significant expansion of the LocusLink database
concept. The GeneID is initially set to be equivalent to the LocusID; at
some point in the future, these IDs may diverge. GeneIDs will be available
for all RefSeq records; whereas the LocusID is available only for those
genomes included in the LocusLink resource.

The GeneID (or LocusID) provides a distinct tracking identifier for a gene
or locus and is provided on the gene, mRNA, and CDS features.
The GeneID can be used to identify a set of related features; this is
especially useful when multiple products are provided to represent
alternate splicing events.

For example:

//
     gene            19683..104490
                     /gene="DLEC1"
                     /db_xref="GeneID:9940"       <<<--- GeneID
                     /db_xref="LocusID:9940"      <<<--- LocusID
                     /db_xref="MIM:604050"
//

When viewing RefSeq records via the internet, the GeneID is hot-linked to
Entrez Gene and the LocusID is hot-linked to the LocusLink Gene Report
page. Both resources provide additional descriptive information for genes,
as available.

4.2.2 Transcript ID
-------------------
The transcript_id qualifier found on a mRNA or other RNA feature annotation
provides an explicit correspondence between a feature annotation on a
genomic record and the RefSeq transcript record.

For example:

NT_011523.9 Homo sapiens chromosome 22 genomic contig.
//
     mRNA            complement(231444..239103)
                     /gene="PKDREJ"
                     /product="polycystic kidney disease (polycystin) and
                     REJ (sperm receptor for egg jelly homolog, sea
                     urchin)-like"
                     /note="Derived by automated computational analysis
                     using gene prediction method: BestRefseq,BLAST.
                     Supporting evidence includes similarity to: 3 mRNAs"
                     /transcript_id="NM_006071.1"  <<<--- linked RefSeq transcript
                     /db_xref="GI:5174632"
                     /db_xref="GeneID:10343"
                     /db_xref="LocusID:10343"
                     /db_xref="MIM:604670"
//

4.2.3 Protein ID
----------------
The protein_id qualifier found on a coding region (CDS) feature provides an
explicit correspondence between feature annotation on a genomic or
transcript RefSeq record and the RefSeq protein record.

For example:

NC_001144.2 Saccharomyces cerevisiae chromosome XII, complete chromosome
sequence.
//
     CDS             complement(16639..17613)
                     /gene="MHT1"
                     /locus_tag="YLL062C"
                     /note="Mht1p; go_component: cellular_component unknown
                     [goid 8372] [evidence ND];
                     go_function: homocysteine S-methyltransferase activity
                     [goid 8898] [evidence IDA] [pmid 11013242];
                     go_process: sulfur amino acid metabolism [goid 96]
                     [evidence IMP] [pmid 11013242]"
                     /codon_start=1
                     /evidence=experimental
                     /product="S-Methylmethionine Homocysteine
                     methylTransferase"
                     /protein_id="NP_013038.1"  <<<--- linked RefSeq protein
                     /db_xref="GI:6322966"
                     /db_xref="SGD:S0003985"
                     /db_xref="GeneID:850664"
                     /translation="MKRIPIKELIVEHPGKVLILDGGQGTELENRGININSPVWSAAP
                     FTSESFWEPSSQERKVVEEMYRDFMIAGANILMTITYQANFQSISENTSIKTLAAYKR
                     FLDKIVSFTREFIGEERYLIGSIGPWAAHVSCEYTGDYGPHPENIDYYGFFKPQLENF
                     NQNRDIDLIGFETIPNFHELKAILSWDEDIISKPFYIGLSVDDNSLLRDGTTLEEISV
                     HIKGLGNKINKNLLLMGVNCVSFNQSALILKMLHEHLPGMPLLVYPNSGEIYNPKEKT
                     WHRPTNKLDDWETTVKKFVDNGARIIGGCCRTSPKDIAEIASAVDKYS"
//

4.2.4 Conserved Domain Database (CDD) ID
----------------------------------------
The CDD identifier found on protein records, and mapped to associated
nucleotide records as a misc_feature, identifies protein domains that are
found on the record. CDD annotation is applied computationally. Initially
this annotation was provided for a subset of RefSeq; it will be applied to
the entire collection in the near future.

For example:

NP_000550.2 A-gamma globin
//
     Region          5..147
                     /region_name="Globin"
                     /note="globin"
                     /db_xref="CDD:pfam00042"   <<<--- conserved domain database
//

=============================================================================
5. REFSEQ ADMINISTRATION
=============================================================================

The National Center for Biotechnology Information (NCBI), National Library
of Medicine, National Institutes of Health, is responsible for the
production and distribution of the NIH RefSeq Sequence Database. NCBI
distributes RefSeq sequence data by anonymous FTP.
For more information, you may contact NCBI by email at
info@ncbi.nlm.nih.gov or by phone at 301-496-2475.

5.1 Citing RefSeq
-----------------
When citing data in RefSeq, it is appropriate to give the sequence name,
and primary accession and version number (or GI). Note that the most
accurate citation of the sequence is provided by including the combined
accession plus version number or the GI number.

It is also appropriate to list a reference for the RefSeq project. The
following on-line publication provides the most complete description and
should be cited when possible:

   The NCBI handbook [Internet]. Bethesda (MD): National Library of
   Medicine (US), National Center for Biotechnology Information; 2002 Oct.
   Chapter 17, The Reference Sequence (RefSeq) Project.
   Available from http://www.ncbi.nih.gov/entrez/query.fcgi?db=Books

If on-line citations are not accepted by a journal, please use the
following citation:

   NCBI Reference Sequence Project: update and current status
   Pruitt KD, Tatusova T, Maglott DR
   Nucleic Acids Res 2003 Jan 1;31(1):34-37

5.2 RefSeq Distribution Formats
-------------------------------
Complete flat file releases of the RefSeq database are available via NCBI's
anonymous ftp server:

   ftp://ftp.ncbi.nih.gov/refseq/release/

Each release is cumulative, incorporating previous data plus new data.
Records that have been suppressed are not included in the release.

Incremental updates that become available between RefSeq releases are
available at:

   ftp://ftp.ncbi.nih.gov/refseq/daily/new
   ftp://ftp.ncbi.nih.gov/refseq/cumulative

Please refer to the README for additional information:

   ftp://ftp.ncbi.nih.gov/refseq/README
   ftp://ftp.ncbi.nih.gov/refseq/CHANGE_NOTICE

5.3 Other Methods of Accessing RefSeq Data
------------------------------------------
Entrez is a molecular biology database system that presents an integrated
view of DNA and protein sequence data, structure data, genome data,
publications, and other data fields.
The Entrez query and retrieval system is produced by the National Center
for Biotechnology Information (NCBI) and is available only via the
internet.

Entrez is accessed at: http://www.ncbi.nih.gov/Entrez/

RefSeq entries are indexed for retrieval in the Entrez system. The
web-based filter restrictions can be used to restrict your query to RefSeq
data or to specific subsets of the RefSeq database. Additional specific
property restrictions are provided to support querying for RefSeq records
with specific STATUS codes. Queries are defined on the RefSeq web site at:

   http://www.ncbi.nih.gov/RefSeq/

5.4 Request for Corrections and Comments
----------------------------------------
We welcome your suggestions to improve the RefSeq collection; we invite
groups interested in contributing toward the collection and curation of
the RefSeq database to improve the representation of single genes, gene
families, or complete genomes to contact us. Please refer to RefSeq
accession and version numbers (or GI) and the RefSeq Release number to
which your comments apply; it is useful if you indicate the source of data
that you found to be problematic (e.g., data on the FTP site, data
retrieved on the web site), the entry DEFLINE, and the specific annotation
field for which you are suggesting a change.

Suggestions and corrections can be sent to: info@ncbi.nlm.nih.gov

5.5 Credits and Acknowledgements
--------------------------------
This RefSeq release would not be possible without the support of numerous
collaborators and the primary sequence data that is submitted by thousands
of laboratories and available in GenBank. The RefSeq project is ambitious
in scope and we actively welcome opportunities to work with other groups
to provide this collection.
We value all of our collaborators; they contribute information with a
large range in scope and volume such as completely annotated genomes,
advice to improve the sequence or annotation of individual RefSeq records,
information about official nomenclature, and information about function.

In addition to the significant information collected by collaboration,
numerous NCBI staff are involved in infrastructure support, programmatic
support, and curation. RefSeq is supported by 3 primary work groups that
are associated with LocusLink, Entrez Genomes, and the Genome Annotation
Pipeline.

See the RefSeq web site for a list of collaborating groups and in-house
staff.

5.6 Disclaimer
--------------
The United States Government makes no representations or warranties
regarding the content or accuracy of the information. The United States
Government also makes no representations or warranties of merchantability
or fitness for a particular purpose or that the use of the sequences will
not infringe any patent, copyright, trademark, or other rights. The United
States Government accepts no responsibility for any consequence of the
receipt or use of the information.

For additional information about RefSeq releases, please contact NCBI by
e-mail at info@ncbi.nlm.nih.gov or by phone at (301) 496-2475.