Why do VectorBase gene IDs change? (finding missing gene IDs)


Change is inevitable. As gene predictions are improved based on new evidence, some will have a modified exon structure, new predictions will be created, old ones will be deleted, some will be split into multiple separate predictions, and some will be merged into a single new prediction. During this process the annotation identifiers can change. So how can you find out about these changes, and how do you find the new IDs for your genes?

First, let's look at the kinds of identifiers that may be used for a given locus:

  • A gene symbol, e.g. CPLCW5. These are assigned by the community and used as the common name for a locus. This name should be stable and maintained between releases. There are times when the name can change, but that will not be initiated by us. We will usually keep the old name as a synonym in cases where genes are renumbered or renamed.
  • An annotation ID, e.g. AGAP028126. These are defined during the annotation process and form the basic identifier for the locus, with which all other information is associated. External database references and citations are therefore associated with the AGAP identifier and not with the symbol.
  • Manual annotations from the community via Apollo are given an automatic gene ID based on the gene used as the template to modify the annotation. These identifiers will be updated when a gene set update happens at release. Note: A locus can have multiple Apollo submissions from the community, which will show as multiple gene model annotations. That is why we advise naming your models with your last name in the Apollo Information Editor.

Here is an example:

"Why, when I search for CPLCW5, do I get AGAP028126? This gene used to be called AGAP008464."

This result shows the annotation ID AGAP028126. When the updated gene prediction was integrated into the latest gene set, the ID was changed from AGAP008464 to AGAP028126 as a consequence of the modification to the predicted exon structure. The gene symbol CPLCW5 has been maintained.

Notifications of an updated gene set for your species can be found in news items and in the release notes.

To view both the current canonical and the pending new gene models, go to the Genome Browser. On the location tab, select the page 'Region in detail' and click on 'Configure this page'. Under the category "Genes and transcripts", activate the track called "Current Apollo annotation".

How do I find a gene whose ID has changed?

You have four options. We recommend trying the first one and, if it does not work, following the others in order:

  • If you type your "old" gene ID into VectorBase Search, the results page should give you the "current/new" gene ID.
  • Type the genomic coordinates, i.e., the scaffold (or supercontig) or chromosome and the base pair range, into the genome browser.
  • Use the nucleotide or protein sequence for a BLAST search.
  • Use the lists of identifier changes presented on some gene set pages; follow this link for the latest one for Culex quinquefasciatus. Note: These lists may not be available for every organism's gene set.

How can I cite a specific gene set in my thesis/paper? Follow the link to this FAQ.
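
If you have many old IDs to update, a list of identifier changes like the ones mentioned above can be applied in bulk with a short script. The following is only a minimal sketch. It assumes the list has been saved as a tab-separated file with two columns (old ID, new ID), one change per line, with an empty second column meaning the gene was deleted; the actual layout of the lists on the gene set pages may differ. The function names are hypothetical, and apart from AGAP008464 and AGAP028126 the IDs in the sample are made up for illustration.

```python
import csv
import io


def load_id_changes(handle):
    """Read (old_id, new_id) pairs from a tab-separated file handle.

    Assumed layout (not confirmed by VectorBase): two columns, no header,
    lines starting with '#' are comments, empty new_id = gene deleted.
    An old ID may appear on several lines if the gene was split.
    """
    changes = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        old_id = row[0].strip()
        new_id = row[1].strip() if len(row) > 1 else ""
        targets = changes.setdefault(old_id, [])
        if new_id and new_id not in targets:
            targets.append(new_id)
    return changes


def current_ids(changes, gene_id):
    """Return the list of new IDs for gene_id.

    An empty list means the gene was deleted; None means the ID is not in
    the change list (it is unchanged, or not covered by this list).
    """
    return changes.get(gene_id)


# Sample data: the first pair is the example from this page;
# the OLD*/NEW* identifiers are invented placeholders.
sample = (
    "# old_id\tnew_id\n"
    "AGAP008464\tAGAP028126\n"
    "OLD000001\tNEW000001\n"
    "OLD000001\tNEW000002\n"
    "OLD000002\t\n"
)
changes = load_id_changes(io.StringIO(sample))

for gene in ["AGAP008464", "OLD000001", "OLD000002", "AGAP028126"]:
    print(gene, "->", current_ids(changes, gene))
```

Returning a list rather than a single ID keeps split genes (one old ID, several new ones) visible. Merged genes simply appear as several old IDs pointing at the same new ID. For IDs that the list does not cover, fall back to the search, coordinate, or BLAST options above.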