Extended bacteria RefSeq annotation comparison report README

Filename: bacterial-reannotation-report.06-12-2015.tsv.gz
Preparation date: 6-12-2015

Description:

This report includes results of annotation comparisons for RefSeq genomes
originally annotated with NP or YP protein accessions and re-annotated with
WP protein accessions with annotation generated by NCBI's prokaryotic genome
annotation pipeline. The new annotations correspond to the annotations
included as part of RefSeq Release 70.

Matches are reported for all genes included in the original annotation. The
report does not include new gene annotations that do not have a match in the
original annotation.

7 assemblies were excluded because of some pending work on GeneID
assignments. Those are:
    GCF_000010125.1
    GCF_000013125.1
    GCF_000019725.1
    GCF_000026605.1
    GCF_000023205.1
    GCF_000267545.1
    GCF_000400635.2

Columns:

[1-3] nuc-taxid, nuc-acc, nuc-gi
    Nucleotide tax-id, accession, and GI
[4-6] old-gene-start, old-gene-stop, old-gene-strand
    Location of original gene feature. Start and Stop are left/right
    positions. Features spanning the origin have start > stop
[7-9] old-cds-start, old-cds-stop, old-cds_strand
    Location of original CDS feature (for protein-coding genes)
[10] old-locus-tag
    locus_tag for the original gene feature
[11] old-gene-id
    NCBI GeneID for the original gene feature
[12] old-pseudo
    Value 'pseudo' or '-': presence/absence of /pseudo attribute on
    original gene feature
[13-14] old-prot-acc, old-prot-gi
    Protein accession and GI of the original RefSeq protein product
[15-17] new-gene-start, new-gene-stop, new-gene-strand
    Location of new gene feature
[18-20] new-cds-start, new-cds-stop, new-cds-strand
    Location of new CDS feature (for protein-coding genes)
[21] new-locus-tag
    locus_tag for the new gene feature
[22] new-gene-id
    NCBI GeneID for the new gene feature ('-' if not retained in Gene)
[23] new-pseudo
    presence/absence of /pseudo attribute on new gene feature
[24-25] new-prot-acc, new-prot-gi
    Protein accession and GI of the new RefSeq protein product
[26-29] old-gene-coverage, new-gene-coverage, old-cds-coverage, new-cds-coverage
    percent overlap of the old/new gene/cds feature by its corresponding
    match.
[30] gene-comparison
    description of comparison of the old and new gene features.
    Categories are:
      'identical': same range
      'similar': at least 65% coverage of old and new gene features, and
          CDSes are in the same reading frame
      'dissimilar': lower coverage but still in the same reading frame
      'no match': no overlapping gene feature in the same frame in the
          new annotation
[31] pseudo-change
    'gene type change': gene-comparison is classified as identical,
    similar or dissimilar, and the /pseudo attribute changed (i.e. coding
    to pseudo, or pseudo to coding)
[32] cds-comparison
    description of comparison of the old and new CDS features.
    Categories are the same as for gene-comparison
[33] cds_change_category
    sub-categorization of the CDS comparison. Value is the first of:
      'span_origin': at least one of the CDSes spans the origin.
          Note coverage percentages are correct for these features
      'identical': same range
      'match_stop': share the same 3' CDS coordinate
      'match_start': share the same 5' CDS coordinate (but different
          3' CDS)
      'in-frame start': differ in 5' and 3' CDS coordinates, but 5'
          difference is divisible by three
      'BLAST': none of the above, but proteins align well by BLAST.
          These are a mix of annotated frameshifts/slippage and partial
          CDSes annotated with a different initial phase
      'out-of-frame': features overlap, but are in different reading
          frames
[34] cds_length_change
    CDS length difference, in bp, calculated as 'new - old'. A few are
    not divisible by three because of features annotated with
    micro-introns or ribosomal slippage, or usage of alternate phase.
    CDSes spanning the origin were excluded from the calculation
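
Example (illustrative only):

The column layout above can be read with a short Python sketch such as the
one below. It is not part of the report or of NCBI's pipeline. It assumes the
file is gzip-compressed, tab-separated text with 34 fields per row, that any
header line starts with the first column name (optionally prefixed by '#'),
and that strands are written as '+' and '-'. None of these details are
stated in this README, so check them against the file. The helper names
(open_report, read_rows, five_three, simple_cds_category) are hypothetical.
simple_cds_category is a simplified reading of the coordinate-based
categories listed for column 33; it does not attempt the 'BLAST' or
'out-of-frame' categories, which need more than coordinates.

```python
import csv
import gzip

# Column names in the order given in the "Columns" list (positions 1..34).
COLUMNS = [
    "nuc-taxid", "nuc-acc", "nuc-gi",
    "old-gene-start", "old-gene-stop", "old-gene-strand",
    "old-cds-start", "old-cds-stop", "old-cds_strand",
    "old-locus-tag", "old-gene-id", "old-pseudo",
    "old-prot-acc", "old-prot-gi",
    "new-gene-start", "new-gene-stop", "new-gene-strand",
    "new-cds-start", "new-cds-stop", "new-cds-strand",
    "new-locus-tag", "new-gene-id", "new-pseudo",
    "new-prot-acc", "new-prot-gi",
    "old-gene-coverage", "new-gene-coverage",
    "old-cds-coverage", "new-cds-coverage",
    "gene-comparison", "pseudo-change", "cds-comparison",
    "cds_change_category", "cds_length_change",
]


def open_report(path):
    """Open the gzip-compressed report as text (assumed tab-separated)."""
    return gzip.open(path, "rt", newline="")


def read_rows(handle):
    """Yield one dict per data row, keyed by the names in COLUMNS."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if len(fields) != len(COLUMNS):
            continue  # blank or malformed line
        if fields[0].lstrip("#").strip() == COLUMNS[0]:
            continue  # header line (assumed form)
        yield dict(zip(COLUMNS, fields))


def five_three(start, stop, strand):
    """Map left/right positions to (5', 3') coordinates; assumes '+'/'-'."""
    return (start, stop) if strand == "+" else (stop, start)


def simple_cds_category(old, new):
    """Return the first coordinate-based category that applies, else None.

    old and new are (start, stop, strand) tuples with integer positions.
    None means the pair would need the 'BLAST' or 'out-of-frame' checks.
    """
    if old[0] > old[1] or new[0] > new[1]:
        return "span_origin"  # start > stop marks an origin-spanning feature
    if old == new:
        return "identical"
    if old[2] != new[2]:
        return None  # different strands: not covered by this sketch
    old5, old3 = five_three(*old)
    new5, new3 = five_three(*new)
    if old3 == new3:
        return "match_stop"
    if old5 == new5:
        return "match_start"
    if (new5 - old5) % 3 == 0:
        return "in-frame start"
    return None
```

A typical use would be to call open_report() on the file named above, pass
the handle to read_rows(), and filter on fields such as
row["gene-comparison"] or row["cds_change_category"]. Rows for genes without
a CDS may hold a placeholder instead of coordinates, so convert positions to
int only after checking the value.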