Data (faqs)

FAQ category.

Are the BLAST databases masked?

Answer: 

The

Are the (genomes) gene sets experimentally validated?

Answer: 

No genes have been experimentally validated by VectorBase, but automatic predictions of the initial gene sets are done using two approaches:

Contigs and singletons in the Ixodes assembly

Answer: 

Contigs

Do genome sequences on VectorBase contain mitochondrial DNA?

Answer: 

The genome sequences on VectorBase represent the nuclear genome. However, some scaffolds contain mitochondrial genome sequence; these represent either nuclear integrations of the mitochondrial genome or regions that may be misassembled.

Does VectorBase host mitochondrial genomes?

Answer: 

VectorBase hosts mitochondrial genomes for Anopheles, Aedes, Culex and ticks. To access this information, go to the Data menu and click on Mitochondrial Genomes, or follow this link.

Do you have any interaction data?

Answer: 

We are often asked whether we store interaction data, either interactions between proteins or, more often, interactions between a vector and a pathogen (e.g. Anopheles gambiae and Plasmodium falciparum). The answer is no in both cases, mainly because few data are available in these domains and we decided to focus on areas with more data (gene annotation, gene expression).

Do you store mosquito transposon data?

Answer: 

No, we don't store mosquito transposon data.

We mask the genome sequence before the annotation step, using the TEfam and GenBank transposon-related sequences.
In the genome browser you can visualize the masked regions, but there is little detail about the origin of the repeat/TE sequences.
Only Aedes has an additional track showing the TEs from TEfam.

How are gene descriptions propagated between species?

Answer: 

Some of the species in VectorBase have been annotated more extensively than others, and it is useful to propagate gene descriptions to closely related species. Gene descriptions (but not gene names) are propagated based on orthology.

If there is one-to-one or one-to-many orthology between a gene in a source (i.e. well annotated) species and a target species, then the description is propagated from the source to the target if the following conditions are met:

How are genes projected between assembly versions?

Answer: 

The projection is based on an alignment of the assembly versions, using ATAC, which provides a mapping between regions of the old and new assemblies. This mapping is used to project each exon separately; these are then combined to produce projected transcripts, which in turn generate projected genes.
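As a rough illustration of the exon-by-exon idea, the Python sketch below projects exons through a list of aligned blocks and then collects the projected exons into a transcript. The `Block` structure, the coordinates, the forward-strand-only handling, and the rule that an exon must lie entirely within one block are all assumptions made for this example; they do not describe ATAC's output format or the actual VectorBase pipeline.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Exon = Tuple[int, int]  # (start, end), 1-based inclusive, forward strand only


@dataclass(frozen=True)
class Block:
    """One ungapped region aligned between old and new assembly (hypothetical format)."""
    old_start: int
    old_end: int
    new_start: int  # position in the new assembly that old_start maps to


def project_exon(exon: Exon, blocks: List[Block]) -> Optional[Exon]:
    """Shift an exon into new-assembly coordinates if one block fully contains it."""
    start, end = exon
    for block in blocks:
        if block.old_start <= start and end <= block.old_end:
            offset = block.new_start - block.old_start
            return (start + offset, end + offset)
    return None  # exon falls outside the mapping or straddles two blocks


def project_transcript(exons: List[Exon], blocks: List[Block]) -> Optional[List[Exon]]:
    """Project every exon separately; give up if any exon cannot be projected."""
    projected = [project_exon(exon, blocks) for exon in exons]
    if any(p is None for p in projected):
        return None
    return projected
```

In this simplified version a transcript is dropped when any of its exons cannot be placed; a real pipeline may well apply more tolerant rules.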

How are "high confidence" orthologs defined?

Answer: 

Ortholog metrics are calculated for some groups of species, and these are used to classify a "high confidence" set of orthologs.

How are RNA genes annotated?

Answer: 

A large proportion of the RNA genes in VectorBase are annotated in a completely automated way. There are some exceptions to this rule, where we import genes from other sources, or have been supplied with manual annotations for particular classes of RNA.

VectorBase automated annotation

VectorBase uses three sources for RNA gene annotation:

How can I find RNAi gene pathway members?

Answer: 

There are two ways to find RNAi pathway genes:

1. VectorBase

If you know the name of the RNAi pathway member you're looking for, try typing it in the SEARCH box above (e.g. Argonaute*).
You may get results from one or more genomes. Select one or more genes and retrieve their orthologs in your species of interest (on the Genome Browser: Gene Tab page -> orthologs section).

2. BioMart

https://biomart.vectorbase.org/biomart/martview/

How can I retrieve Interpro domains for a given species?

Answer: 

You can retrieve the

How does VectorBase annotate repeats and mask sequences?

Answer: 

It is standard practice to annotate repetitive regions of a genome as repeat features; this may be interesting in its own right, but is also a prerequisite for subsequent analyses, such as gene prediction or whole genome alignment.
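To make the idea of masking concrete, here is a minimal Python sketch that applies repeat features to a sequence. It shows the two conventions commonly used in genomics: soft-masking (repeat bases in lowercase) and hard-masking (repeat bases replaced by N). Which convention applies to a given VectorBase file is not stated here, and the function name and the 0-based half-open interval format are assumptions for the example.

```python
from typing import List, Tuple


def mask_sequence(seq: str, repeats: List[Tuple[int, int]], soft: bool = True) -> str:
    """Mask repeat intervals given as 0-based, half-open (start, end) pairs.

    soft=True lowercases the repeat bases; soft=False replaces them with 'N'.
    """
    chars = list(seq)
    for start, end in repeats:
        # Clamp each interval to the sequence so stray coordinates do not raise.
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = chars[i].lower() if soft else "N"
    return "".join(chars)
```

Soft-masking keeps the underlying bases available to tools that can use them, whereas hard-masking hides them entirely.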

How do I connect with the Ensembl Perl API?

Answer: 

VectorBase gene builds are constructed by VectorBase staff at Ensembl Genomes, and for convenience we also use their public MySQL server to store a recent release of our data (see data access for how to reach the MySQL server):

mysql -hmysql.ebi.ac.uk -P4157 -uanonymous

Owing to production cycle constraints, the data hosted by Ensembl Genomes correspond exactly to the previous release of VectorBase gene builds, but the database is refreshed every two months.
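If you want to script against the same server without the Perl API, one option is to wrap the command-line client shown above. The Python sketch below only assembles the `mysql` invocation from the host, port and user given in this answer and, optionally, runs a query with the client's `-e` option. It assumes the `mysql` client is installed and that the server is still reachable; the function names are hypothetical.

```python
import subprocess
from typing import List, Optional

HOST = "mysql.ebi.ac.uk"
PORT = 4157
USER = "anonymous"


def mysql_command(query: Optional[str] = None) -> List[str]:
    """Build the argument list for the mysql client, optionally with a query."""
    cmd = ["mysql", f"-h{HOST}", f"-P{PORT}", f"-u{USER}"]
    if query is not None:
        cmd += ["-e", query]  # -e executes the statement and exits
    return cmd


def run_query(query: str) -> str:
    """Run a query through the mysql client and return its text output."""
    result = subprocess.run(
        mysql_command(query), capture_output=True, text=True, check=True
    )
    return result.stdout
```

For example, `run_query("SHOW DATABASES")` would list the databases visible to the anonymous user, from which the database for your species and release can be chosen.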

How do I retrieve GO annotations?

Answer: 

The easiest way to get all the GO mappings for any species is to use BioMart. For example, follow these steps for Aedes aegypti genes.

How do you name alternative transcripts?

Answer: 

All transcripts are named after the gene, with the addition of the suffix -R (transcripts) or -P (translations), plus a letter indicating the isoform. So a locus with 3 alternative isoforms will have three transcript identifiers: -RA, -RB and -RC. This is based on the FlyBase notation.




E.g., given the AGP000123 gene in Anopheles,
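The naming rule can be sketched in a few lines of Python. The function below derives transcript and translation identifiers from a gene identifier and a number of isoforms, following the -R/-P plus letter convention described above. The function name is hypothetical, and the limit of 26 isoforms (A to Z) is an assumption of this sketch, since the text does not say what happens beyond Z.

```python
from string import ascii_uppercase
from typing import List, Tuple


def isoform_ids(gene_id: str, n_isoforms: int) -> Tuple[List[str], List[str]]:
    """Return (transcript IDs, translation IDs) for a gene with n isoforms."""
    if not 1 <= n_isoforms <= len(ascii_uppercase):
        # Assumption: only single-letter isoform suffixes are handled here.
        raise ValueError("this sketch supports between 1 and 26 isoforms")
    letters = ascii_uppercase[:n_isoforms]
    transcripts = [f"{gene_id}-R{letter}" for letter in letters]
    translations = [f"{gene_id}-P{letter}" for letter in letters]
    return transcripts, translations
```

For the AGP000123 gene with three isoforms, this yields the transcripts AGP000123-RA, AGP000123-RB and AGP000123-RC, and the translations AGP000123-PA, AGP000123-PB and AGP000123-PC.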

How to find a specific gene?

Answer: 

Four solutions:

I find unexpected duplications in the Anopheles assembly: are they real?

Answer: 

We are aware that the Anopheles gambiae genome

Is it possible to query for gene pathway information (e.g., KEGG)?

Answer: 

KEGG

KEGG is a database for understanding high-level functions and utilities of biological systems from molecular-level information. One of its data-oriented entry points is KEGG pathway, a collection of manually drawn pathway maps representing knowledge on the molecular interaction and reaction networks among proteins and other molecules. These curated pathways cover six of our hosted genomes: Aedes aegypti, Anopheles gambiae, Culex quinquefasciatus, Ixodes scapularis, Musca domestica and Pediculus humanus.

Is there any plan to map the species supercontigs to chromosomes?

Answer: 
Currently all VectorBase genomes, with the exception of Anopheles gambiae PEST, are available only as supercontigs. For some species, the genome papers and other publications have managed to place a subset of the supercontigs on chromosomes.

Retrieving genes with the same domain

Answer: 

> I am trying to retrieve the carboxylesterases (COEs) for Anopheles gambiae dataset but am having some problems. Previous research has identified roughly 50 COEs however I have only been able to retrieve 12. For example, there is an accession for COEAE1A but not for COEAE2A, which was used in the A. gambiae detox chip (David et al. 2005).

What is the -PA suffix after gene names?

Answer: 

This notation represents translations. All translations are named after the gene, with the addition of the suffix -Px, 'x' being a letter. Transcripts have a "-Rx" suffix. This is based on the FlyBase notation.


E.g., given the AGP000123 gene in A. gambiae,
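Going the other way, an identifier can be split back into its gene, feature type and isoform letter. The Python sketch below does this with a regular expression; the function name and the returned tuple layout are choices made for this example, not part of any VectorBase tool.

```python
import re
from typing import Optional, Tuple

# Gene ID, then "-R" (transcript) or "-P" (translation), then the isoform letter(s).
_SUFFIX = re.compile(r"^(?P<gene>.+)-(?P<kind>[RP])(?P<isoform>[A-Z]+)$")


def parse_feature_id(identifier: str) -> Optional[Tuple[str, str, str]]:
    """Return (gene ID, 'transcript' or 'translation', isoform), or None if no suffix."""
    match = _SUFFIX.match(identifier)
    if match is None:
        return None
    kind = "transcript" if match.group("kind") == "R" else "translation"
    return match.group("gene"), kind, match.group("isoform")
```

So AGP000123-PA is read as the translation of isoform A of gene AGP000123, while a bare gene identifier such as AGP000123 has no suffix to parse.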

What data is available for download?

Answer: 

The data in VectorBase is available for download in a variety of formats.

What do I do with downloaded files ending with “.gz”?

Answer: 

Most of the “Data files” in the “Downloads” navigation tab (https://www.vectorbase.org/downloads), and the Generic Feature Format Version 3 (GFF3) files that you download from the Genome Browser end with “.gz”.
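Files ending in ".gz" are gzip-compressed. As one possible way of handling them, the Python sketch below uses the standard `gzip` module either to decompress a file to disk or to read it directly without decompressing it first. The function names and any file paths you pass in are hypothetical.

```python
import gzip
import shutil


def gunzip(src_path: str, dest_path: str) -> None:
    """Decompress a .gz file to dest_path, streaming so large files fit in memory."""
    with gzip.open(src_path, "rb") as src, open(dest_path, "wb") as dest:
        shutil.copyfileobj(src, dest)


def count_gff3_features(path: str) -> int:
    """Count lines in a gzipped GFF3 file that are neither blank nor start with '#'."""
    count = 0
    with gzip.open(path, "rt") as handle:  # "rt" reads the compressed file as text
        for line in handle:
            if line.strip() and not line.startswith("#"):
                count += 1
    return count
```

Reading the compressed file directly is often more convenient for large genome files, since no uncompressed copy needs to be stored.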

What is the origin of the Aedes DNA?

Answer: 

The genome sequence was obtained from an inbred sub-strain (LVPib12) of the Liverpool strain of A. aegypti.

For more information, please refer to Nene et al. 2007 (PMID: 17510324) or the species page.

What is the origin of the Anopheles M and S form DNA?

Answer: 

M (Mali-NIH)
Collected in a village near Niono, Mali in June 2005 by Tovi Lehmann; single ovipositions were set up in the insectary at NIH. Established by combining ca. 80 isofemale families that were molecularly identified as A. gambiae M molecular form. Karyotyping by Olga Grushko at the University of Notre Dame revealed this colony to be homokaryotypic 2Rbc/bc and 2La/a. This colony is the source of DNA for the A. gambiae M form genome sequencing project supported by NHGRI.


What is this -RA suffix after gene names?

Answer: 

This notation represents transcripts. All transcripts are named after the gene, with the addition of the suffix -Rx, 'x' being a letter. Translations have a "-Px" suffix. This is based on the FlyBase notation.




E.g., given the AGP000123 gene in Anopheles,

What does 'Residue overlap splice site' mean?

Answer: 

Sometimes, on the peptide page of a gene, some residues are shown in red; these are residues that overlap a splice site.

This means that the three bases encoding the amino acid lie on both sides of a splice site: 1 or 2 bases are in one exon and the remaining base(s) are in the following exon.
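The arithmetic behind this can be sketched in Python. Given the lengths of the coding portion of each exon, in transcript order, a residue overlaps a splice site whenever an exon ends partway through a codon. The function name and input format are assumptions for this example; it is not the code used by the genome browser.

```python
from itertools import accumulate
from typing import List


def residues_overlapping_splice_sites(cds_exon_lengths: List[int]) -> List[int]:
    """Return 1-based residue numbers whose codon is split across an exon boundary."""
    # Cumulative coding length at the end of every exon except the last one.
    boundaries = list(accumulate(cds_exon_lengths))[:-1]
    residues: List[int] = []
    for position in boundaries:
        if position % 3 != 0:  # the exon ends 1 or 2 bases into a codon
            residue = position // 3 + 1
            # A very short exon can put two boundaries inside the same codon.
            if not residues or residues[-1] != residue:
                residues.append(residue)
    return residues
```

For example, with coding exon lengths of 4, 5 and 3 bases, the first exon ends after the first base of codon 2, so residue 2 is flagged; the second exon ends exactly at the end of codon 3, so no residue is flagged there.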

What do the gene set counts mean?

Answer: 

This is a gene set table:

Gene set table

Where are the Culex EST libraries from?

Answer: 

The Culex EST libraries have been generated by several laboratories:

Where has Culex pipiens gone?

Answer: 

Culex pipiens quinquefasciatus, previously a sub-species of Culex pipiens, has been granted the status of species, changing its name to Culex quinquefasciatus. This change was supported by the NCBI.

To reflect the taxonomic change, VectorBase has renamed all Culex pipiens entries to Culex quinquefasciatus, without changing any of the data.

Where to submit proteomic data?

Answer: 

We recommend you submit your proteomic data to PRIDE. This resource is a centralized, standards-compliant, public data repository for proteomics data. It has been developed to provide the proteomics community with a public repository for protein and peptide identifications together with the evidence supporting these identifications.

In the near future we will be linking our genes to proteomic data, and we are very likely to do so via PRIDE.

What are the gene biotype definitions?

Answer: 

This is a Search query for all genes in VectorBase:

Which external resources does VectorBase link to?

Answer: 

VectorBase links genes, transcripts and translations to various external references; some are common to all organisms, others are organism-specific. The list is updated regularly, with new references being added.

Here is a list of the resources we link to, as of April 2011 (release VB-2011-04).

Why are there miRNA genes on both strands at the same location?

Answer: 

Most of the VectorBase miRNA genes are predicted computationally, by aligning Rfam covariance models against the genome. Some miRNA genes, as a consequence of the miRNA's secondary structure, have symmetrical properties that generate credible alignments on both strands.

Why are there protein sequences with Xs and gene sequences with Ns?

Answer: 

A protein sequence or a gene sequence may contain Xs or Ns, respectively, because the gene spans a gap between two contigs.
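The link between the two can be illustrated with a small Python sketch: when a coding sequence crosses a gap, the gap is written as Ns, and any codon containing an N cannot be resolved and is reported as X. The code uses the standard genetic code; treating every codon that contains an N as X is a simplification made for this example, and the function name is hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
_BASES = "TCAG"
_AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(_BASES, repeat=3), _AMINO_ACIDS)
}


def translate(cds: str) -> str:
    """Translate a coding sequence; codons not in the table (e.g. with N) become 'X'."""
    cds = cds.upper()
    usable_length = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    protein = []
    for i in range(0, usable_length, 3):
        protein.append(CODON_TABLE.get(cds[i:i + 3], "X"))
    return "".join(protein)
```

A coding sequence such as ATGGCCNNNNNNAAATAA, where six Ns stand for a gap, therefore translates to MAXXK*, with one X for each codon that falls in the gap.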

Why are there three Anopheles gambiae (PEST, M and S)?

Answer: 

As of October 2015 (release VB-2015-10), VectorBase is showing two genome browsers for Anopheles gambiae: PEST and Pimperena (S). The former M
