Data (faqs)

FAQ category.


Are the BLAST databases masked?


The BLAST databases are NOT masked, but there is an option to turn masking on or off, as shown in this screenshot of the tool's page.

Are the (genomes) gene sets experimentally validated?


No genes have been experimentally validated by VectorBase, but automatic predictions of the initial gene sets are done using two approaches:

  • Computational: ab initio and similarity methods
  • Experimental: RNA and protein sequencing methods

The data available differ for each of the 40 genomes we currently host, and may include ESTs, RNAseq, protein alignments, and data from more distant species. For details about each genome, we strongly encourage you to consult the genome paper for your species of interest.

For most of our Anopheles gambiae genes and a large number of Aedes aegypti genes, a VectorBase curator manually checked the gene models. Since 2013, we have relied mainly on community manual annotation. Any such community annotation not yet incorporated into an updated gene set can be viewed in the Genome Browser (Location tab/Region in detail/Configure this page/Genes and transcripts/Current Apollo annotation) and in Apollo (track/user-created annotations).

In the VectorBase Genome Browser, you can also add tracks to display protein alignments from other species, and experimental evidence from microarrays, RNAseq and mass spectrometry for the species of interest. You can use these tracks as experimental evidence to (indirectly) validate gene models.

Contigs and singletons in the Ixodes assembly



The contigs file at VectorBase contains just the 570,637 contigs that make up the official annotated genome assembly (IscaW1). All of these are represented in the 369,492 supercontigs (note that many of these 'supercontigs' are just a single contig).

The file that describes how the contigs make up the supercontigs is also downloadable here.

There are 580,638 additional contigs deposited at GenBank, unfortunately not tagged differently, to give a total of 1,141,595. The ones not used in IscaW1 are all short and are considered to "represent degenerate contigs". We don't have these at VectorBase.


The singletons file available from VectorBase consists of trace reads that could not be combined into contigs. We make it available because it represents a large proportion (38%) of the total trace reads (unusually large for a WGS sequencing project). It isn't entirely clear why such a large proportion of the reads were left as singletons. A high level of polymorphism within the sequenced population of ticks is suspected to be a large part of the problem.

The advice given by the JCVI folk who did the assembly is that there may be interesting sequence, different to the sequence in IscaW1, in the singletons; but you are unlikely to find anything interesting in the short degenerate contigs omitted from IscaW1.

So if you can't find your gene of interest in the IscaW1 assembly, you may want to download and search the singletons to see if there is anything extra there. To do this, you will need plenty of storage space, and a local installation of BLAST or another search program.

Link to downloads page
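As a rough sketch of the local search described above, the following Python snippet assembles the two NCBI BLAST+ command lines you would typically run: one to build a nucleotide database from the downloaded singletons, and one to search it. It assumes BLAST+ is installed locally; the file names (singletons.fa, my_gene.fa, hits.tsv), the database name and the E-value cut-off are placeholders chosen for illustration, not values provided by VectorBase.

```python
import shlex

def build_blast_commands(singletons_fasta, query_fasta,
                         db_name="ixodes_singletons", evalue=1e-10):
    """Return the BLAST+ command lines (as argument lists) for a local search.

    All file and database names are hypothetical placeholders.
    """
    # Step 1: index the singleton reads as a nucleotide BLAST database.
    make_db = ["makeblastdb", "-in", singletons_fasta,
               "-dbtype", "nucl", "-out", db_name]
    # Step 2: search a nucleotide query against that database,
    # writing tabular output (format 6) to a file.
    search = ["blastn", "-query", query_fasta, "-db", db_name,
              "-evalue", str(evalue), "-outfmt", "6", "-out", "hits.tsv"]
    return make_db, search

make_db, search = build_blast_commands("singletons.fa", "my_gene.fa")
print(shlex.join(make_db))
print(shlex.join(search))
```

The printed lines can be pasted into a shell, or each list can be passed to subprocess.run. For a protein query, tblastn would take the place of blastn.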

Alternatively you can search the entire raw output of the WGS project at VectorBase, by searching the traces with BLAST:

WIKEL Strain, June 2007 Trace Reads

Link to BLAST page

These are the 19.3 million unassembled sequencing reads. For sequences that are present once in the IscaW1 assembly, you should expect to find several hits in the traces (sequences in the assembly are covered by an average of about 4 trace reads).

Do genome sequences on VectorBase contain mitochondrial DNA?


The genome sequences on VectorBase represent the nuclear genome. However, some scaffolds contain mitochondrial genome sequence; these represent either nuclear integrations of the mitochondrial genome or regions that may contain some misassembly.

Does VectorBase host mitochondrial genomes?


VectorBase hosts mitochondrial genomes for Anopheles, Aedes, Culex and ticks. To access this information, please go to the Data menu and click on Mitochondrial Genomes, or follow this link.

Do you have any interaction data?


We are often asked if we store interaction data, either interactions between proteins or, more often, interactions between a vector and a pathogen (e.g. Anopheles gambiae and Plasmodium falciparum). The answer is NO in both cases, mainly because there is not much data in these domains and we decided to focus on areas with more data available (gene annotation, gene expression).

Protein interaction and pathways

Protein interaction data can be obtained via an interaction archive database, such as IntAct or DIP, or even better, the IMEx Consortium.

Pathway data can be obtained at Reactome

If you have any biological data and you would like to submit them to one of the interaction or pathway databases, please do so.
They would be happy to curate them and integrate them into their resources.

Vector/pathogen interaction

Orthologs between Anopheles gambiae and Plasmodium falciparum can be obtained at EnsemblGenomes, in the Pan Compara.

You need to start with an Anopheles or a Plasmodium gene and search for its orthologs in the other genome.

Do you store mosquito transposon data?


No, we don't store mosquito transposon data.

We mask the genome sequence before the annotation step, using the TEfam and GenBank transposon-related sequences.
In the genome browser you can visualize the masked regions, but there is little detail about the origin of the repeat/TE sequences.
Only Aedes has an additional track showing the TEs from TEfam.

  • Go to this location: supercont1.139:706001-906000
  • Click on 'configure this page' on the left-hand side menu. This will open a new pop-up window
  • Click on 'repeats' and select the 'TEfam' track
  • Close the pop-up window using the button at its top right
  • The TEfam track should appear in the location view

How are gene descriptions propagated between species?


Some of the species in VectorBase have been annotated more extensively than others, and it is useful to propagate gene descriptions to closely related species. Gene descriptions (but not gene names) are propagated based on orthology.

If there is one-to-one or one-to-many orthology between a gene in a source (i.e. well annotated) species and a target species, then the description is propagated from the source to the target if the following conditions are met:

  • a description in the source gene exists, and does not contain the words 'hypothetical' or 'putative'
  • no existing name or description in the target gene
  • >30% amino acid sequence identity
  • an alignment that covers >66% of both genes' lengths

When the description is propagated to the target gene it retains the source description's provenance, and information is added about the source species and gene stable ID. If the description ends in a digit, this usually indicates a species-specific element of the annotation, and is removed during propagation.
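The propagation rules above can be sketched as a single function. This is only an illustration of the stated conditions, not VectorBase's actual pipeline: the Gene class, the function name, the "[Source: ...]" provenance format and the gene IDs in the example are all hypothetical, and removing the trailing run of digits is one possible reading of "ends in a digit".

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class Gene:
    stable_id: str
    species: str
    name: Optional[str] = None
    description: Optional[str] = None

def propagated_description(source, target, orthology_type,
                           percent_identity, source_coverage, target_coverage):
    """Return the description to give the target gene, or None if no rule applies."""
    if orthology_type not in ("one-to-one", "one-to-many"):
        return None
    if not source.description:
        return None
    lowered = source.description.lower()
    if "hypothetical" in lowered or "putative" in lowered:
        return None
    if target.name or target.description:
        return None
    # Thresholds from the text: >30% identity, >66% coverage of both genes.
    if percent_identity <= 30:
        return None
    if source_coverage <= 66 or target_coverage <= 66:
        return None
    # Assumption: a trailing number is the species-specific element to drop.
    text = re.sub(r"\s*\d+$", "", source.description).strip()
    # Hypothetical provenance format naming the source species and stable ID.
    return f"{text} [Source: {source.species} {source.stable_id}]"

# Made-up genes for illustration only.
src = Gene("AAEL000001", "Aedes aegypti", description="odorant receptor 7")
tgt = Gene("AALF000001", "Aedes albopictus")
result = propagated_description(src, tgt, "one-to-one", 85.0, 95.0, 92.0)
print(result)
```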

Descriptions are propagated between the following species:

  • Aedes aegypti to Aedes albopictus
  • Anopheles gambiae to the other Anophelines
  • Glossina morsitans to the other Glossinidae, Musca domestica, Stomoxys calcitrans
  • Drosophila melanogaster to Glossinidae, Musca domestica, Stomoxys calcitrans

How are genes projected between assembly versions?


The projection is based on an alignment of the assembly versions, using ATAC, which provides a mapping between regions of the old and new assemblies. This mapping is used to project each exon separately; these are then combined to produce projected transcripts, which in turn generate projected genes.

Sometimes, UTRs are truncated when projected to the new assembly; in these cases, only the CDS regions are projected, and the projected transcript has no UTRs (since it is not necessarily the case that a truncated UTR will be valid).

Projection can fail by:

  • Generating a translation with internal stop codons. This is most likely due to one or more nucleotide changes in the underlying assembly.
  • Mapping from one scaffold in the old assembly to multiple scaffolds (or strands) in the new assembly.
  • Mapping partially (with either truncated CDS regions or missing exons).
  • Not mapping at all.
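The failure modes listed above can be sketched as a small classifier over the per-exon projection results. The data structure, the outcome labels and the order in which the checks are applied are assumptions made for this example; they are not taken from the actual projection code.

```python
from typing import List, NamedTuple, Optional

class ProjectedExon(NamedTuple):
    scaffold: str    # scaffold in the new assembly
    strand: int      # +1 or -1
    truncated: bool  # True if only part of the exon's CDS region mapped

def classify_projection(exons: List[Optional[ProjectedExon]],
                        translation: str) -> str:
    """Classify a transcript projection.

    `exons` holds one entry per original exon; None means the exon did not map.
    `translation` is the protein sequence of the projected transcript,
    with '*' marking stop codons.
    """
    mapped = [e for e in exons if e is not None]
    if not mapped:
        return "not mapped"
    if len({(e.scaffold, e.strand) for e in mapped}) > 1:
        return "multiple scaffolds or strands"
    if len(mapped) < len(exons) or any(e.truncated for e in mapped):
        return "partial mapping"
    # A terminal stop is expected; any other '*' is an internal stop codon.
    if "*" in translation.rstrip("*"):
        return "internal stop codon"
    return "projected"

ok = ProjectedExon("scaffold_1", 1, False)
print(classify_projection([ok, ok], "MKT*"))
print(classify_projection([ok, None], "MKT*"))
```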

In many cases there are good reasons for a transcript failing to project, but some transcripts with good evidence can be lost; our automated procedures try to minimise this latter set, but it is inevitable that some will need a small amount of manual correction. To facilitate this, we calculate statistics to show the quality and quantity of evidence for unprojected transcripts. Further, transcripts which we consider to have good evidence in the old assembly are documented.

For unprojected transcripts that map at least partially, the mappings are available in GFF3 format, and are also presented as a track in WebApollo. If transcripts map to multiple scaffolds (or strands), the GFF3 file has separate genes for each scaffold, using the original ID with a numeric suffix. For all unprojected transcripts there is a range of FASTA files that have the original transcript's sequence, for BLASTing or otherwise searching on the new assembly. For completeness, we provide the GFF3 and FASTA files for projected transcripts as well, but these are probably not as useful as those for unprojected transcripts.

All of the files associated with projection are available from the Downloads section.

How are "high confidence" orthologs defined?


Ortholog metrics are calculated for some groups of species, and these are used to classify a "high confidence" set of orthologs. The methodology uses two orthogonal sources of information, gene order conservation (GOC) and whole genome alignments (WGA).

The "GOC score" metric for a pair of orthologs measures whether the two genes up- and downstream of each gene in the ortholog pair are also orthologous, and allows for inversions and gene insertions. The "WGA coverage" metric determines the extent to which the orthologous regions have been aligned by pairwise LASTz alignments, primarily based on exonic coverage, with a small contribution from intronic coverage. Both metrics have a value between 0 and 100.

There is only an expectation for gene order conservation between species that are evolutionarily close; thus the GOC score is only calculated within Diptera, Chelicerata, and Hemiptera. Similarly, pairwise WGAs, and thus the related metric, are only available for a subset of fairly closely-related species.

To classify orthologs as "high confidence", thresholds are applied to the ortholog metrics, according to the evolutionary distance between the species. Within Anophelinae and Glossinidae, the GOC threshold is 50 and the WGA threshold is 50; within Brachycera, Culicinae, Hemiptera, and Phlebotominae the GOC threshold is 25 and the WGA threshold is 25; no thresholds are applied beyond these clades. In cases where GOC and WGA metrics are not available, a "tree-compliance" metric is used to identify (and therefore exclude from the "high confidence" set) orthologs inferred from dubious tree topologies. Finally, to be included in the "high confidence" set, both orthologous proteins must have percentage identity above a certain threshold, currently set at 25% for all species.
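The classification just described can be sketched as follows. The thresholds come from the text, but several details are assumptions made for this example: the function and parameter names are hypothetical, comparisons are inclusive (>=), passing either the GOC or the WGA threshold is treated as sufficient (the text does not say whether both are required), and tree compliance is supplied as a ready-made boolean because its cut-off is not given. The caller is expected to pass the most specific listed clade containing both species, or None if there is none.

```python
from typing import Optional

# The GOC and WGA thresholds are equal within each clade, so one number is stored.
CLADE_THRESHOLDS = {
    "Anophelinae": 50, "Glossinidae": 50,
    "Brachycera": 25, "Culicinae": 25, "Hemiptera": 25, "Phlebotominae": 25,
}
MIN_PERCENT_IDENTITY = 25  # currently applied to all species

def is_high_confidence(clade: Optional[str],
                       goc: Optional[float],
                       wga: Optional[float],
                       identity_a: float,
                       identity_b: float,
                       tree_compliant: bool = True) -> bool:
    """Decide whether an ortholog pair belongs to the high confidence set."""
    # Both proteins must reach the identity threshold.
    if min(identity_a, identity_b) < MIN_PERCENT_IDENTITY:
        return False
    # Without GOC and WGA metrics, fall back to tree compliance.
    if goc is None and wga is None:
        return tree_compliant
    threshold = CLADE_THRESHOLDS.get(clade)
    if threshold is None:
        return True  # no metric thresholds are applied beyond the listed clades
    # Assumption: passing either available metric is enough.
    return any(score is not None and score >= threshold for score in (goc, wga))

print(is_high_confidence("Anophelinae", 75, 40, 60, 55))
print(is_high_confidence("Anophelinae", 25, 40, 60, 55))
```

If both metrics should be required, the final any(...) can be replaced with a check that every available score passes.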

The metrics are displayed in the genome browser in the ortholog table, and are available in BioMart.

The following plots show, respectively, the percentage of orthologs with some degree of gene order conservation; the mean WGA coverage; and the mean percentage identity between orthologous protein sequences:

Gene Order Conservation: Score Metric
Whole Genome Alignment: Coverage Metric
Percentage Identity

