Gene identifiers or gene IDs (e.g., AGAP + 6 digit number)

Answer: 

Gene nomenclature for VectorBase organisms

Gene identifiers (gene IDs) in VectorBase follow a format similar to the FlyBase Drosophila gene nomenclature. We use a four-character species designation and a six-digit ordinal number to uniquely identify a gene locus, followed by a '-RA' or '-PA' suffix for the associated transcript or peptide, respectively.
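
As a rough illustration of this format, the sketch below splits an identifier into its parts with a regular expression. The pattern is inferred only from the description above, and the names ID_PATTERN and parse_identifier are hypothetical; this is not an official VectorBase validator, and real identifiers may include cases it does not cover.

```python
import re

# Assumed pattern: 4 uppercase letters, 6 digits, then an optional
# '-R' (transcript) or '-P' (peptide) suffix with a single isoform letter.
ID_PATTERN = re.compile(
    r"(?P<prefix>[A-Z]{4})(?P<number>[0-9]{6})"
    r"(?:-(?P<kind>[RP])(?P<isoform>[A-Z]))?"
)

KINDS = {None: "gene", "R": "transcript", "P": "peptide"}


def parse_identifier(identifier):
    """Return the parts of an identifier, or None if it does not fit the pattern."""
    match = ID_PATTERN.fullmatch(identifier)
    if match is None:
        return None
    return {
        "prefix": match.group("prefix"),
        "number": int(match.group("number")),
        "type": KINDS[match.group("kind")],
        "isoform": match.group("isoform"),  # None for a bare gene ID
        "gene_id": match.group("prefix") + match.group("number"),
    }
```

For example, parse_identifier("AGAP004053-RA") reports a transcript of gene AGAP004053 with isoform letter A, while a string with the wrong number of digits yields None.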

Species identification

The species prefix consists of four characters: three for the species and one for strain information where appropriate. Because the field length is fixed at four, all four characters must be used; for projects that do not have an appropriate strain or isolate, we assign the fourth letter arbitrarily.

Project examples:

  • AAEL Aedes aegypti Liverpool
  • AGAP Anopheles gambiae PEST
  • CPIJ Culex pipiens quinquefasciatus JHB
  • GMOY Glossina morsitans Yale
  • ISCW Ixodes scapularis Wikel
  • MDOA Musca domestica Aabys
  • PHUM Pediculus humanus USDA
  • RPRC Rhodnius prolixus CDC
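
The project list above can be expressed as a simple lookup table. The sketch below is illustrative only: the dictionary holds just the eight examples listed here, and the names PROJECTS and project_for are hypothetical.

```python
# Prefix -> (species, strain), copied from the project examples above.
PROJECTS = {
    "AAEL": ("Aedes aegypti", "Liverpool"),
    "AGAP": ("Anopheles gambiae", "PEST"),
    "CPIJ": ("Culex pipiens quinquefasciatus", "JHB"),
    "GMOY": ("Glossina morsitans", "Yale"),
    "ISCW": ("Ixodes scapularis", "Wikel"),
    "MDOA": ("Musca domestica", "Aabys"),
    "PHUM": ("Pediculus humanus", "USDA"),
    "RPRC": ("Rhodnius prolixus", "CDC"),
}


def project_for(identifier):
    """Look up (species, strain) from the first four characters, or None if unknown."""
    return PROJECTS.get(identifier[:4])
```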

Gene number

A six-digit ordinal number is assigned to each locus. The assignment is arbitrary and the name carries no location information (i.e. AGAP004053 is not necessarily 5' of AGAP004054). Subsequent re-annotations of a genome will utilise the next available ordinal number for that species. Once a number has been used, it will not be re-used.
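
The "next available number, never re-used" rule can be sketched as a counter that only moves forward. This is a minimal illustration under the assumption that each species keeps a single record of the highest ordinal issued so far; the class name OrdinalAllocator is hypothetical and does not describe how VectorBase actually allocates numbers.

```python
class OrdinalAllocator:
    """Hands out gene IDs for one species prefix, never re-using a number."""

    MAX_ORDINAL = 999999  # largest value that fits in six digits

    def __init__(self, prefix, last_used=0):
        self.prefix = prefix
        self.last_used = last_used  # highest ordinal issued so far

    def next_id(self):
        if self.last_used >= self.MAX_ORDINAL:
            raise ValueError("six-digit ordinal space exhausted")
        # The counter only increases, so retired numbers are never issued again.
        self.last_used += 1
        return f"{self.prefix}{self.last_used:06d}"
```

Starting from OrdinalAllocator("AGAP", last_used=4054), successive calls to next_id() return AGAP004055 and then AGAP004056, regardless of where those loci lie in the genome.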

Transcript and Peptide suffixes

We have used the FlyBase-style nomenclature ('-RA' and '-PA') for describing the transcripts and protein products of the gene. One important consideration here was to make explicit links between the gene and its transcripts/peptides. In this nomenclature these all share the same root, which makes it more intuitive for the user than the old-style Ensembl identifiers used in Anopheles.

Alternative transcripts from the same gene are described within the suffix using the letters of the alphabet, i.e. the first transcript is labelled as the 'A' form, the next as the 'B' form, and so forth. This system has the caveat that it handles only 26 isoforms neatly.
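
The lettering scheme can be sketched as follows. The function name isoform_ids is hypothetical, and since the text does not say what happens beyond 26 isoforms, this sketch simply refuses that case rather than guessing.

```python
import string


def isoform_ids(gene_id, count):
    """Return (transcript, peptide) ID pairs for the first `count` isoforms."""
    if not 0 <= count <= len(string.ascii_uppercase):
        # Behaviour past 'Z' is not defined in the text, so fail loudly.
        raise ValueError("only 0 to 26 isoforms can be lettered A-Z")
    return [
        (f"{gene_id}-R{letter}", f"{gene_id}-P{letter}")
        for letter in string.ascii_uppercase[:count]
    ]
```

Calling isoform_ids("AAEL100007", 2) gives the pairs AAEL100007-RA/AAEL100007-PA and AAEL100007-RB/AAEL100007-PB.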

Example

The following is a worked example for the Aedes aegypti Liverpool strain:

Gene/Locus name: AAEL100007

Isoform #1 Transcript: AAEL100007-RA Translation: AAEL100007-PA
Isoform #2 Transcript: AAEL100007-RB Translation: AAEL100007-PB

These identifiers will be used as the systematic name for a gene/locus, submitted to GenBank/EMBL as the /locus_tag qualifier, and used as the name in the VectorBase search database.