How does VectorBase annotate repeats and mask sequences?


It is standard practice to annotate repetitive regions of a genome as repeat features. This may be interesting in its own right, but it is also a prerequisite for subsequent analyses such as gene prediction or whole-genome alignment. VectorBase uses the same methodology for almost all species; exceptions are noted below.

Repeat annotation: simple repeats

Dust is used to identify low-complexity regions, and TRF (Tandem Repeats Finder) to identify tandem repeats; both are run on all species.

Repeat annotation: complex repeats

VectorBase draws its repeat libraries from three primary sources; RepeatMasker then uses these libraries to annotate repeat features:

  • de novo libraries generated with RepeatModeler (which includes RECON and RepeatScout)
  • transposable element libraries from TEfam
  • generic Repbase libraries

TEfam libraries are only available for some of our mosquito species, and the Repbase library has relatively few arthropod repeats for non-reference species (i.e. anything other than Aedes aegypti, Anopheles gambiae, and Culex quinquefasciatus). Most of our annotation of complex repeats is therefore derived from the de novo, species-specific libraries.

Repeat annotation: exceptions

Occasionally, a species is provided to VectorBase with repeat features already annotated; in these cases we annotate only with Dust, TRF, and the Repbase library, and do not generate de novo libraries. This currently applies to Aedes albopictus and Musca domestica.

The de novo, species-specific libraries are sometimes augmented with repeat libraries from other species, taken from sources such as GenBank or FlyBase. This currently applies to Aedes aegypti, Anopheles gambiae, Glossina morsitans, and Ixodes scapularis.

Repeat masking

Genome sequences available for download are in "softmasked" format: regions annotated with repeat features are written in lower-case characters, and the rest of the sequence is in upper case. Contrast this with the strict definition of "masking" (often called "hardmasking"), in which repetitive regions are replaced with 'N' characters.
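
Because softmasking encodes the repeat annotation in letter case, the masked regions can be recovered from the sequence alone. The following is a minimal Python sketch of that idea, not VectorBase's own tooling; the function name lowercase_intervals and the example sequence are invented for illustration.

```python
import re


def lowercase_intervals(seq):
    """Return 0-based, half-open (start, end) spans of lower-case runs.

    In a softmasked sequence, these runs correspond to regions that
    were annotated with repeat features.
    """
    return [(m.start(), m.end()) for m in re.finditer(r"[a-z]+", seq)]


# Hypothetical softmasked sequence, made up for this example.
example = "ACGTacgtacgtTTGAatatGG"
print(lowercase_intervals(example))  # [(4, 12), (16, 20)]
```

The function expects sequence characters only, so FASTA header lines (those starting with '>') should be excluded before calling it.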

The very simple example below demonstrates the difference between the two types of masking. Note that unmasked or hardmasked sequence can be derived from softmasked sequence, by converting the lower-case characters to upper case or to 'N' characters, respectively.

    >seq1 (softmasked)
    >seq1 (hardmasked)
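
Deriving hardmasked or unmasked sequence from softmasked sequence can be sketched in a few lines of Python. This is an illustrative sketch under simple assumptions (plain sequence strings, with FASTA header lines handled separately); the function names hardmask and unmask and the sequence are hypothetical and are not part of any VectorBase tool.

```python
def hardmask(softmasked):
    """Replace every lower-case (repeat-annotated) base with 'N'."""
    return "".join("N" if base.islower() else base for base in softmasked)


def unmask(softmasked):
    """Remove the masking by converting every base to upper case."""
    return softmasked.upper()


# Hypothetical softmasked sequence, made up for this example.
soft = "ACGTacgtacgtTTGA"
print(hardmask(soft))  # ACGTNNNNNNNNTTGA
print(unmask(soft))    # ACGTACGTACGTTTGA
```

Hardmasking discards the underlying bases, so the conversion only works in this direction: softmasked sequence cannot be rebuilt from hardmasked sequence alone.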

Note that the BLAST service provided at VectorBase searches against unmasked sequences.