Downloads of comparative data


The comparative data in VectorBase is available for download in bulk; because the data is often spread across many files, the downloads are provided as compressed 'tar' archives. There are two types of comparative data: gene trees and whole-genome alignments.
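As a rough illustration of working with such an archive, the sketch below lists and unpacks a compressed tar file using Python's standard library. The function name and any archive or directory names passed to it are hypothetical; the actual download file names are not specified here.

```python
import tarfile
from pathlib import Path


def list_and_extract(archive_path, dest_dir):
    """List the members of a compressed tar archive, then unpack it."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # "r:*" lets tarfile detect the compression (gzip, bzip2, xz) itself.
    with tarfile.open(archive_path, "r:*") as archive:
        names = archive.getnames()
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions offer a safer extraction filter.
            archive.extractall(dest, filter="data")
        else:
            archive.extractall(dest)
    return names
```

Calling list_and_extract("downloaded_archive.tar.gz", "comparative_data") would return the member paths and leave the unpacked files under the destination directory.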

Gene trees and homologs

Gene trees are provided in two formats, Newick (a.k.a. New Hampshire) and PhyloXML. The Newick-format trees have branch lengths and labelled leaf nodes. The PhyloXML files include additional metadata, such as bootstrap values and internal node labels.
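For a quick look at which sequences a Newick tree contains, the leaf labels can be pulled out without a full parser. The sketch below is a minimal approach that assumes unquoted labels and a tree with at least one pair of parentheses; the IDs in the example string are made up for illustration. For anything beyond a quick inspection, a dedicated phylogenetics library is a better choice.

```python
import re

# A leaf label directly follows "(" or ","; internal node labels follow ")"
# and are therefore not matched. Quoted labels are not handled.
LEAF_PATTERN = re.compile(r"[(,]\s*([^(),:;\s]+)")


def leaf_labels(newick):
    """Return the leaf labels of a Newick string, in order of appearance."""
    return LEAF_PATTERN.findall(newick)


# Hypothetical tree with three leaves and branch lengths.
example = "((AGAP000001-PA:0.12,AAEL000002-PA:0.30):0.05,CPIJ000003-PA:0.41);"
print(leaf_labels(example))
```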

The amino acid and cDNA alignments from which the gene trees were inferred are available in FASTA format. The ID field in these FASTA files is the VectorBase protein or transcript ID, and the header contains three further fields of metadata, separated by '|': species, genomic location and gene ID. (These alignments are also embedded within the PhyloXML files.)
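The header layout described above can be split into its four parts with a few lines of code. This is a sketch only: it assumes the ID is separated from the metadata by either whitespace or '|', which may not match the files exactly, and the class and function names are hypothetical.

```python
import re
from typing import NamedTuple


class AlignmentHeader(NamedTuple):
    sequence_id: str  # VectorBase protein or transcript ID
    species: str
    location: str
    gene_id: str


# Assumption: the ID ends at the first whitespace or '|' character.
HEADER_PATTERN = re.compile(r"^>(\S+?)[\s|]+(.*)$")


def parse_header(line):
    """Split one header line into the ID and its three metadata fields."""
    match = HEADER_PATTERN.match(line.strip())
    if match is None:
        raise ValueError(f"not a recognisable header line: {line!r}")
    sequence_id, rest = match.groups()
    fields = [field.strip() for field in rest.split("|")]
    if len(fields) != 3:
        raise ValueError(f"expected 3 metadata fields, got {len(fields)}")
    return AlignmentHeader(sequence_id, *fields)
```

With a made-up header such as ">AGAP000001-PA anopheles_gambiae|2L:1000-2000|AGAP000001", parse_header returns a record whose gene_id is "AGAP000001".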

The homologs that are derived from the gene trees are provided in OrthoXML format.

For all of this gene tree-related data there is one file per gene tree, named for the VectorBase gene tree ID (e.g. VBGT00190000009607). To avoid having many thousands of files within one directory, gene trees are ordered sequentially (by ID), then grouped into sets of 500 and placed in sub-directories whose names indicate the range of IDs within.
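The grouping scheme can be sketched as follows. The "first-last" directory naming used here is an assumption made for illustration; the archives may express the ID range differently. Because the IDs are zero-padded to a fixed width, plain string sorting gives the sequential order.

```python
def group_tree_ids(tree_ids, group_size=500):
    """Sort gene tree IDs and split them into consecutive groups.

    Returns a dict mapping a hypothetical directory name ("first-last")
    to the list of IDs that would be stored in that directory.
    """
    ordered = sorted(tree_ids)
    groups = {}
    for start in range(0, len(ordered), group_size):
        chunk = ordered[start:start + group_size]
        groups[f"{chunk[0]}-{chunk[-1]}"] = chunk
    return groups
```

Given 1200 IDs, this produces three groups holding 500, 500 and 200 trees respectively.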

Gene tree IDs in VectorBase are "stable" in that they track an evolutionary hypothesis, but won't necessarily represent exactly the same tree from release to release. For example, in VB release 1512, let's say we have a tree with 32 nodes. In release 1602, we add Aedes albopictus to our set of species, and the tree is updated to include one of its genes. That gene tree would have the same ID across the two releases; at a fundamental level the tree is the same, even though its Newick representation will be different.

Pairwise whole genome alignments

Pairwise alignments, calculated with either LASTZ or tBLAT, are available in MAF format. If the reference species has chromosomes (e.g. Anopheles gambiae), there is one MAF file per chromosome; otherwise, all scaffolds are in a single file.
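MAF is a line-oriented text format in which each alignment block starts with an 'a' line and is followed by one 's' line per aligned sequence (source, start, size, strand, source length, aligned text). The sketch below reads those blocks from an iterable of lines; it ignores other line types, and the class and function names are hypothetical.

```python
from typing import NamedTuple


class MafSequence(NamedTuple):
    src: str        # source name, e.g. species and sequence region
    start: int      # zero-based start of the aligned region
    size: int       # number of non-gap bases in the aligned region
    strand: str     # "+" or "-"
    src_size: int   # total length of the source sequence
    text: str       # aligned bases, with "-" for gaps


def read_maf_blocks(lines):
    """Yield each alignment block as a list of MafSequence records."""
    block = None
    for line in lines:
        fields = line.split()
        if not fields or fields[0] == "a":
            # A blank line or a new 'a' line closes the current block.
            if block:
                yield block
            block = [] if fields else None
        elif fields[0] == "s" and block is not None:
            src, start, size, strand, src_size, text = fields[1:7]
            block.append(
                MafSequence(src, int(start), int(size), strand, int(src_size), text)
            )
    if block:
        yield block
```

For a pairwise alignment file, each yielded block would be expected to hold two records, one for the reference species and one for the other species; passing an open file handle as lines streams the file without loading it all into memory.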