Why do VectorBase gene IDs change? (finding missing gene IDs)


Change is inevitable. As gene predictions are improved based on new evidence, some will have a modified exon structure, new predictions will be created, old ones will be deleted, some will be split into multiple separate predictions, and some will be merged into a single new prediction. During this process the annotation identifiers can change. So how can you find out about these changes, and how do you find the new IDs for your genes?

First, let's look at the kinds of identifiers that may be used for a given locus:

  • A gene symbol, e.g. CPLCW5. These are assigned by the community and used as the common name for a locus. This name should be stable and maintained between releases. There are times when the name can change, but that will not be initiated by us. We will usually keep the old name as a synonym in cases where genes are renumbered or renamed.
  • An annotation ID, e.g. AGAP028126. These are defined during the annotation process and form the basic identifier for the locus, with which all other information is associated. External database references and citations are therefore associated with the AGAP identifier and not with the symbol.
  • A Community Annotation (CA) ID, e.g. AGAP820007. Updates to the annotation submitted by the community via Community Annotation submissions are given a separate identifier using the same species/strain prefix but a different numeric range (AGAP8nnnnn in the case of Anopheles gambiae). These identifiers will be subsumed into the canonical set of AGAPxxxxxx identifiers when a gene set update happens at release. Note: a locus can have multiple CA submissions from the community, each of which will show up with its own CA gene ID.
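
The naming scheme described above can be checked mechanically. The following is a minimal sketch, not a VectorBase tool: it assumes Anopheles gambiae identifiers consist of the prefix AGAP followed by exactly six digits, and that any such identifier whose digits start with 8 is a Community Annotation (CA) ID. The function name classify_agap_id is hypothetical, and other species use other prefixes and ranges that are not covered here.

```python
import re

# Assumption: Anopheles gambiae IDs are "AGAP" plus exactly six digits,
# and community annotation (CA) IDs occupy the AGAP8nnnnn range.
_AGAP_ID = re.compile(r"AGAP(\d{6})")


def classify_agap_id(identifier: str) -> str:
    """Return 'community', 'canonical', or 'unknown' based on format alone."""
    match = _AGAP_ID.fullmatch(identifier.strip().upper())
    if match is None:
        # Gene symbols such as CPLCW5, or IDs from other species, end up here.
        return "unknown"
    return "community" if match.group(1).startswith("8") else "canonical"


if __name__ == "__main__":
    for gene_id in ("AGAP028126", "AGAP820007", "AGAP008464", "CPLCW5"):
        print(gene_id, classify_agap_id(gene_id))
```

Note that the check looks only at the format of the identifier. It cannot tell whether a canonical-style ID is still current or has been retired in a later gene set.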

Here is an example.

"Why, when I search for CPLCW5, do I get two identifiers: AGAP028126 and AGAP820007? This gene used to be called AGAP008464."

This result shows the annotation ID AGAP028126 and the Community Annotation (CA) ID AGAP820007. When the updated gene prediction was integrated into the latest gene set, the ID was changed from AGAP008464 to AGAP028126 as a consequence of the modification to the predicted exon structure. The gene symbol CPLCW5 has been maintained.

Notifications of updated gene sets for your species can be found in news items and in the release notes.

To visualize both the current canonical and the pending new gene models, go to the genome browser and activate the track called "Community Models" under the category "mRNA and protein alignments". For instructions, follow this link to the tutorial called "Browsing genomes 2: Visualizing and adding tracks", and follow the slides using gene CPLCW5. Zoom out to 1.25 kb for a better perspective of the differences.

How do I find a gene that has changed?

You have four options. We recommend trying the first one and then working through the others in order:

  • Type your "old" gene ID into VectorBase Search. The results page should give you the "current/new" gene ID.
  • Type the genomic coordinates, i.e. the scaffold (or supercontig) or chromosome and the base pair range, into the genome browser.
  • Use the nucleotide or protein sequence for a BLAST search.
  • Use the lists of identifier changes presented in some gene set pages; follow this link for the latest one for Culex quinquefasciatus. Note: these lists may not be available for all organisms' gene sets.

How can I cite a specific gene set in my thesis/paper?

Citation information is available here in the FAQ called "How to cite VectorBase".