Glossary

A

Accession number

A unique identifier given to a sequence, sample or other item when it is submitted to a repository (for example GenBank).

AGP file (A Golden Path)

A file provided to VectorBase that describes how the longer sequences in the genome assembly were assembled from shorter sequences. For example, an AGP file can describe how a chromosome is assembled from a collection of scaffolds or a collection of contigs. For an AGP file that describes how a scaffold is assembled from a collection of contigs, each contig will be listed on a separate line in the AGP file and the line will include information about where the contig lies within the scaffold and the orientation of the contig.
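
As a rough illustration of the per-contig lines described above, the sketch below reads one component line of an AGP file in Python. It assumes the tab-separated, nine-column layout of the public AGP specification; the function name, the dictionary keys and the scaffold and contig identifiers are made up for this example.

```python
def parse_agp_component(line):
    """Parse one non-gap AGP line into a dictionary (illustrative sketch)."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("expected 9 tab-separated columns")
    if cols[4] in ("N", "U"):
        # Component types N and U describe gaps, which use different columns.
        raise ValueError("gap line, not a component line")
    return {
        "object": cols[0],               # e.g. the scaffold being built
        "object_start": int(cols[1]),    # where the component lies in the object
        "object_end": int(cols[2]),
        "part_number": int(cols[3]),
        "component_type": cols[4],
        "component_id": cols[5],         # e.g. the contig
        "component_start": int(cols[6]),
        "component_end": int(cols[7]),
        "orientation": cols[8],          # "+" or "-" in the common case
    }


# Hypothetical line: contig_7 covers bases 1-5000 of scaffold_1, reversed.
example = "scaffold_1\t1\t5000\t1\tW\tcontig_7\t1\t5000\t-"
record = parse_agp_component(example)
print(record["component_id"], record["object_start"], record["orientation"])
```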

Algorithm

A sequence of computational tasks or actions that carry out a specific function.

Alignment

A comparison between two or more sequences by matching identical and/or similar residues and assigning a score to the match.

Allele

An allele is an alternative form of a nucleotide sequence, a gene or a locus in the genome. The term was originally used to describe variation among protein coding genes, but it also refers to variation among non-coding genes or DNA sequences.

Allele frequency

A measure of how prevalent an allele or genotype is in a population. In Ensembl, it is displayed ranging from 0 (zero) to 1 (one).
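
A minimal sketch of how an allele frequency between 0 and 1 can be computed from a list of diploid genotypes. The function name and the sample genotypes are invented for illustration; this is not how Ensembl derives its displayed values.

```python
from collections import Counter


def allele_frequencies(genotypes):
    """Return {allele: frequency} for a list of genotypes (tuples of alleles)."""
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


# Four hypothetical individuals typed at one biallelic site.
genotypes = [("A", "A"), ("A", "G"), ("G", "G"), ("A", "A")]
freqs = allele_frequencies(genotypes)
print(freqs)
```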

Alternate sequence

Genomic sequence that differs from the genomic DNA on the primary assembly. Alternate sequences come in two types: allelic sequence (haplotypes and novel patches) and fix patches. Novel patches represent new allelic loci, but they are not necessarily haplotypes. Fix patches mark places where the primary assembly was found to be incorrect, and the patch reflects the corrected sequence. Haplotypes, novel patches and fix patches are all determined by the Genome Reference Consortium (GRC), not by Ensembl. When using the API, the primary assembly is referred to as the reference sequence and alternate sequence is referred to as non-reference sequence.

Ambiguity code

The standard ambiguity codes for nucleotides are provided by IUPAC (International Union of Pure and Applied Chemistry) and indicate the possible nucleotides that can occur at a given position. The symbols are valid for both DNA and RNA and are shown below:

A = adenine
C = cytosine
G = guanine
T = thymine
R = G A (purine)
Y = T C (pyrimidine)
K = G T (keto)
M = A C (amino)
S = G C (strong bonds)
W = A T (weak bonds)
B = G T C (all but A)
D = G A T (all but C)
H = A C T (all but G)
V = G C A (all but T)
N = A G C T (any)
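
The code table above can be captured as a simple lookup. The sketch below, in Python, checks whether a concrete DNA sequence is compatible with a pattern written in ambiguity codes. The names IUPAC and matches are chosen for this example only, and U (uracil in RNA) is not handled.

```python
# Each ambiguity code maps to the set of bases it may stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "K": "GT", "M": "AC",
    "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def matches(pattern, sequence):
    """True if every base of sequence is allowed by the code at that position."""
    if len(pattern) != len(sequence):
        return False
    return all(
        base in IUPAC[code]
        for code, base in zip(pattern.upper(), sequence.upper())
    )


print(matches("RYN", "GCA"))  # R allows G, Y allows C, N allows A
print(matches("RYN", "CCA"))  # R does not allow C
```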

Ambiguous ORF

Ambiguous Open Reading Frame. A transcript believed to be protein coding, but with more than one possible ORF.

Anthropophilic

Anthropophilic mosquitoes preferentially feed on humans.

Antisense

Non-coding transcript believed to be an antisense product used in the regulation of the gene to which it belongs.

API (Application Programming Interface)

A series of routines that applications can use to make the operating system request and carry out lower-level services.

Artifact (in the context of a transcript)

Error in the sequence in a public database (for example UniProtKB, NCBI RefSeq). Annotation is by the VEGA/Havana project.

Assembly

When the genome or transcriptome of a species is sequenced, typically many short random fragments are sequenced and reassembled by a computer algorithm into longer contiguous sequences called contigs. Genomic contigs may be assembled into longer sequences called scaffolds and sometimes, if the depth of sequencing is high enough, there may be enough information to assemble most of the scaffolds into chromosomes.

ATV (A Tree Viewer)

An application (Java tool) for the visualisation of phylogenetic trees. It also allows data to be edited and exported. See Zmasek et al.

B

BAC (Bacterial Artificial Chromosome)

A vector used to clone DNA fragments (100 to 300-kb insert size; average, 150 kb) from another species so that it can be replicated in bacteria.

Base pairs (number of base pairs in the genome)

The base pair length shown on pages such as the whole genome display (next to the golden path length) is based on the assembled end position of the last seq_region in each chromosome (from the AGP file). If there is a terminal gap, it is set to the assembled end location of that terminal gap.

Biotype

A gene or transcript classification. Transcript types include protein coding, pseudogene, and non-coding RNAs.

BLAST

BLAST (Basic Local Alignment Search Tool) is a sequence comparison algorithm optimised for speed which is used to search sequence databases for optimal local alignments to a query. (Altschul et al., J Mol Biol 215:403-410; 1990)

BLAT (BLAST-Like Alignment Tool)

An mRNA/DNA and cross-species protein sequence analysis tool that quickly finds sequences of 95% and greater similarity of length 40 bases or more. (Kent, W.J. 2002. BLAT -- The BLAST-Like Alignment Tool. Genome Research 12(4): 656-664)

BLOSUM 62 (Blocks Substitution Matrix)

A matrix that defines scores for amino acid substitutions, reflecting the similarity of physicochemical properties and observed substitution frequencies. The BLOSUM 62 matrix is built from sequences sharing no more than 62% identity (sequences that are evolutionarily closer were represented by a single sequence in the alignment, to avoid bias from closely related family members). (Henikoff and Henikoff, Proc Natl Acad Sci U S A 89:10915-10919; 1992)

C

Canonical transcript

For human, the canonical transcript for a gene is set according to the following hierarchy:

1. Longest CCDS translation with no stop codons.
2. If no (1), choose the longest Ensembl/Havana merged translation with no stop codons.
3. If no (2), choose the longest translation with no stop codons.
4. If no translation, choose the longest non-protein-coding transcript.
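
The hierarchy above can be sketched as a short selection routine. This is only an illustration in Python: the dictionary keys (source, translation_length, has_stop, length) and the source labels "ccds" and "merged" are assumptions made for the example, not fields of the Ensembl database or API.

```python
def pick_canonical(transcripts):
    """Pick a canonical transcript by walking the four-step hierarchy."""
    def longest(candidates, key):
        return max(candidates, key=lambda t: t[key]) if candidates else None

    # Steps 1-3 only consider translations with no stop codons.
    coding = [t for t in transcripts
              if t["translation_length"] > 0 and not t["has_stop"]]
    for source in ("ccds", "merged", None):  # None means any source (step 3)
        pool = [t for t in coding if source is None or t["source"] == source]
        best = longest(pool, "translation_length")
        if best:
            return best

    # Step 4: no usable translation, so take the longest non-coding transcript.
    noncoding = [t for t in transcripts if t["translation_length"] == 0]
    return longest(noncoding, "length")


# Hypothetical gene with three transcripts.
transcripts = [
    {"id": "T1", "source": "merged", "translation_length": 400,
     "has_stop": False, "length": 1500},
    {"id": "T2", "source": "ccds", "translation_length": 350,
     "has_stop": False, "length": 1300},
    {"id": "T3", "source": "other", "translation_length": 0,
     "has_stop": False, "length": 2000},
]
print(pick_canonical(transcripts)["id"])
```

In this made-up example the CCDS transcript is chosen even though the merged transcript has a longer translation, because step 1 takes priority over step 2.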

cDNA (Complementary DNA)

DNA obtained by reverse transcription of an mRNA template. In bioinformatics jargon, cDNA is thought of as a DNA version of the mRNA sequence. Generally, cDNAs are denoted in coding or 'sense' orientation.

CDS (Coding sequence)

The portion of a gene or an mRNA that codes for a protein. Introns are not coding sequences, nor are the 5' or 3' UTR. The coding sequence in a cDNA or mature mRNA includes everything from the start codon through to the stop codon, inclusive.

Centimorgan (cM)

A unit of genetic distance, determined by how frequently two genes on the same chromosome are inherited together. One centimorgan equals 1% recombinant offspring. In humans, 1 cM corresponds to about 1 x 10^6 bp.

Chr:bp

The chromosome location and coordinates in base pairs.

CIGAR (Compact Idiosyncratic Gapped Alignment Report)

The cigar line defines the sequence of matches/mismatches and deletions (or gaps). For example, the cigar line 2MD3M2D2M means that the alignment contains 2 matches/mismatches, 1 deletion (the number 1 is omitted to save space), 3 matches/mismatches, 2 deletions and 2 matches/mismatches. If the original sequence is AACGCTT, the aligned sequence will be:

Original sequence: AACGCTT
Cigar line: 2MD3M2D2M

M M D M M M D D M M
A A - C G C - - T T
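
The expansion described above can be sketched in a few lines of Python. The function name apply_cigar is made up for this example, and only the M and D operations mentioned in this entry are handled.

```python
import re


def apply_cigar(sequence, cigar):
    """Insert gap characters into sequence according to a cigar line."""
    aligned = []
    pos = 0
    # Each token is an optional count followed by M or D; no count means 1.
    for count, op in re.findall(r"(\d*)([MD])", cigar):
        n = int(count) if count else 1
        if op == "M":
            aligned.append(sequence[pos:pos + n])  # copy n residues
            pos += n
        else:
            aligned.append("-" * n)                # n gap positions
    return "".join(aligned)


print(apply_cigar("AACGCTT", "2MD3M2D2M"))  # AA-CGC--TT
```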

Clone

A segment of DNA that has been inserted into a vector molecule, such as a plasmid, and then replicated to form many identical copies.

CNV

Copy number variation. It is defined by SO (Sequence Ontology) as a variation that increases or decreases the copy number of a given region.

Codon

Three consecutive nucleotides in either DNA or RNA that code for an amino acid (or signal the end of translation).

Contig

A contig is a contiguous stretch of DNA sequence without gaps that has been assembled solely based on direct sequencing information. Short sequences (reads) from a fragmented genome are compared against one another, and overlapping reads are merged to produce one long sequence. This merging process is iterative: overlapping reads are added to the merged sequence whenever possible and so the merged sequence becomes even longer. When no further reads overlap the long merged sequence, then this sequence - called a contig - has reached its maximum length. Contig can be used in other contexts: A contig can be the sequence corresponding to only one clone. A contig map shows the regions of a chromosome where contiguous DNA segments overlap.

Coordinate system

In VectorBase, the term "coordinate system" or "coord_system" identifies which level of the assembly we are working on. A genome assembly imported into VectorBase has up to three coordinate systems defined in the coord_system table: contigs, scaffolds or chromosomes.

We define one additional coordinate system: toplevel. Toplevel sequences are tagged in the seq_region_attrib table. Most gene annotation is done on toplevel sequence.

Cosmid

DNA from a bacterial virus spliced with a small fragment of a genome (up to 50 kb) to be amplified and sequenced.

Cytogenetic map

A banding pattern on a chromosome resulting from staining and examination by microscopy. Cytogenetic abnormalities such as deletions or inverted nucleotide sequences may be detected by examining and comparing banding patterns.

D

DAS (Distributed Annotation System)

A protocol for requesting and returning annotation data for genomic regions. See the BioDAS site for more information.

dbSNP

The Single Nucleotide Polymorphism database (dbSNP) is a public-domain archive for a broad collection of simple (short) genetic polymorphisms. This collection of polymorphisms is maintained by NCBI and includes single-base nucleotide substitutions (also known as single nucleotide polymorphisms or SNPs), small-scale multi-base deletions or insertions (also called deletion insertion polymorphisms, indels or DIPs), and retroposable element insertions and microsatellite repeat variations (also called short tandem repeats or STRs).

dbVar

Database of genomic structural variation (SV), such as copy number variation. See the glossary term for SV.

DDBJ (DNA Data Bank of Japan)

DDBJ is the sole DNA data bank in Japan, which is officially certified to collect DNA sequences from researchers and to issue the internationally recognized accession number to data submitters. Data is exchanged with EMBL/EBI and GenBank/NCBI on a daily basis, and the three data banks share virtually the same data at any given time.

Disrupted domain (in the context of a transcript)

Coding region omitted due to a splice variation. Annotation is by the VEGA/Havana project.

Domain

A region of special biological interest within a single protein sequence. However, a domain may also be defined as a region within the three-dimensional structure of a protein that may encompass regions of several distinct protein sequences that accomplishes a specific function. A domain class is a group of domains that share a common set of well-defined properties or characteristics.

Dotter

Ensembl DotterView is based on the program Dotter, a dot-matrix program with dynamic threshold control suited for genomic DNA and protein sequence analysis. The Dotter tool provides a visual display of the sequence alignment it represents. The dotplot displays a detailed comparison of two sequences. Every residue in one sequence is compared to every residue in the other sequence. The first sequence runs along the x-axis and the second sequence along the y-axis. In regions where the two sequences are similar to each other, a row of high scores will run diagonally across the dot matrix. If you're comparing a sequence against itself to find internal repeats, you'll notice that the main diagonal scores maximally, since it's the 100% perfect self-match. To make the score matrix more intelligible, the pairwise scores are averaged over a sliding window that runs diagonally. The averaged score matrix forms a three-dimensional landscape, with the two sequences in two dimensions and the height of the peaks in the third. This landscape is projected onto two dimensions by aid of grayscales - higher peaks are indicated by darker grays. Dotter was written by Erik L.L. Sonnhammer and Richard Durbin (Gene 167: GC1-10; 1995).

DUST

A standalone application that looks for low complexity sequences.

DWGA (Derived from Whole Genome Alignments)

Human versus Chimpanzee exception: The human versus chimpanzee orthologue predictions were obtained in a completely different manner. Since the current chimpanzee genome sequence assembly is the result of low-coverage sequencing, the assembled sequence is of too poor quality to generate a gene set on the classical Ensembl gene build pipeline. The chimpanzee gene set produced by Ensembl has rather been generated by "projecting" human genes to the chimpanzee genome through whole genome BLASTz alignments between both species and filtering for orthologue sequence alignments. The result of this procedure is de facto the human - chimpanzee orthologue set that has been Derived from Whole Genome Alignments (DWGA). See the Prediction Method section on a relevant Ensembl Gene Report page.

E

EMBL (European Molecular Biology Laboratory)

Europe's primary nucleotide sequence resource. The main sources of the DNA and RNA sequences in the database are submissions from individual researchers, genome sequencing projects and patent applications.

End phase

In protein-coding exons, it is usually the case that end phase = (phase + exon_length)%3 but end_phase could be -1 if the exon is half-coding and its 3 prime end is UTR.
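
A small worked example of the formula above, written as a Python sketch. The function name and the ends_in_utr flag are invented for illustration, and the start phase is assumed to be 0, 1 or 2 (an exon that also starts in UTR is not covered).

```python
def end_phase(phase, exon_length, ends_in_utr=False):
    """Return the end phase of an exon, or -1 if its 3' end is UTR."""
    if ends_in_utr:
        return -1
    return (phase + exon_length) % 3


print(end_phase(0, 100))                   # (0 + 100) % 3 = 1
print(end_phase(2, 100))                   # (2 + 100) % 3 = 0
print(end_phase(1, 50, ends_in_utr=True))  # -1
```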

Endophagic

Behavioural trait for feeding indoors.

Endophilic

Tendency to inhabit/rest indoors.

Ensembl genes

Set of Ensembl gene predictions based on experimental evidence from protein sequences and/or near-full-length cDNA available from public sequence databases. "Ensembl known genes" are predicted on the basis of species-specific database entries from manually curated UniProt/Swiss-Prot, partially manually curated RefSeq and UniProt/TrEMBL databases. Predictions of "Ensembl novel genes" are based on other experimental evidence such as protein and cDNA sequence information from related species. Golden genes are the result of a merge between a Havana transcript (manually curated) and an Ensembl gene prediction from the annotation pipeline. See "havana transcript".

Eponine

Eponine is a probabilistic method for detecting transcription start sites (TSS) in mammalian genomic sequence, with good specificity and excellent positional accuracy. Eponine models consist of a set of DNA weight matrices recognizing specific sequence motifs. Each of these is associated with a position distribution relative to the TSS.

EST (Expressed Sequence Tags)

Coarse sequence reads from flanking vector regions into the inserts of cDNA libraries. ESTs act as physical markers for cloning and full length sequencing of the cDNAs of expressed genes. Typically identified by purifying mRNAs, converting to cDNAs, and then sequencing a portion of the cDNAs. Usually short, single reads from a tissue or stage in development.

EST genes

Set of Ensembl gene predictions solely based on EST evidence. The process of EST gene prediction uses a combination of Exonerate, BLAST and Est2Genome to map ESTs onto the genomic sequence. Redundant ESTs are merged, before GenomeWise is used to assign 5' and 3' UTRs to the longest found ORF. See Eyras et al. for a more complete explanation of the EST gene prediction process.

Exon

The part of the genomic sequence that remains in the transcript (mRNA) after introns have been spliced out.

Exonerate

A fast gapped DNA-DNA alignment algorithm. It can be used for aligning various types of sequences such as genomic DNA, cDNAs/ESTs, and proteins.

Exophagic

Behavioural trait for feeding outdoors.

Exophilic

Tendency to inhabit/rest in outdoor areas.

F

Feature

Any annotation on a specific location in the genomic sequence.

Fgenes

FGENES, also known as Find Genes, is a human gene predictor that is based on pattern recognition of different types of exons, promoters and poly A signals. It is built on linear discriminant functions for internal, 5'-coding, and 3'-coding exon recognition. It is designed to find the optimal combination of these components and to construct a set of gene models along a given sequence.

Flanking sequence

Sequence 5' or 3' to a DNA or RNA sequence of interest (for example gene, transcript, SNP or repeat).

Frameshift intron

Frameshift introns are 1, 2, 4, or 5 base pairs long. They are introduced by the Ensembl genebuild in order to fit the cDNA sequence to the genome.

G

GenBank WGS

GenBank accession for a Whole Genome Shotgun (WGS) project. This is the master accession, which should link to both contig and scaffold data for the species. If an assembly is changed, the WGS accession will be modified. An assembly improvement may involve only re-scaffolding, in which case the underlying contig sequences can be unchanged.

Gene set

The set of predicted genes which represent the reference for an assembly. Gene sets contain both protein-coding and ncRNA loci.

Genebuild

The Ensembl gene annotation system is used to annotate genome assemblies. This process is called the genebuild.

GeneWise

GeneWise is a sequence analysis tool for comparing proteins or profile HMMs to DNA sequences, allowing for introns and frameshifts. The Wise2 package was written by Ewan Birney. More information about the package can be obtained at: http://www.ebi.ac.uk/Tools/psa/genewise/help/

Genomic marker

A short sequence whose placement on the genome is known.

Genotype

Specific alleles present in an individual's genome, or the genetic makeup of one organism.

GENSCAN

An application for identification of complete gene structures in genomic DNA (Burge, C. and Karlin, S. (1997) Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268, 78-94). The splice site models used are described in more detail in: Burge, C. B. (1998) Modeling dependencies in pre-mRNA splicing signals. In Salzberg, S., Searls, D. and Kasif, S., eds. Computational Methods in Molecular Biology, Elsevier Science, Amsterdam, pp. 127-163.

GO (Gene Ontology)

An organized hierarchy of terms produced by the Gene Ontology Consortium, used to describe biological processes, cellular components, and molecular functions. The three ontologies are as follows:

Molecular Function Ontology. Tasks performed by individual gene products; examples are carbohydrate binding and ATPase activity.
Biological Process Ontology. Broad biological goals, such as mitosis or purine metabolism, that are accomplished by ordered assemblies of molecular functions.
Cellular Component Ontology. Subcellular structures, locations, and macromolecular complexes; examples include nucleus, telomere, and origin recognition complex.

A gene may be indexed under many GO terms depending on the GO classification system. A gene product has one or more molecular functions and is used in one or more biological processes; it might be associated with one or more cellular components. For instance, cytochrome c can be described by the molecular function term electron transporter activity, the biological process terms oxidative phosphorylation and induction of cell death, and the cellular component terms mitochondrial matrix and mitochondrial inner membrane.

Golden path length

The golden path is the length of the reference assembly. It consists of the sum of all top-level sequences in the seq_region table, omitting any redundant regions such as haplotypes and PARs (pseudoautosomal regions).

H

Haplotype

Known variations to the primary assembly, due to variability in the human genome sequence (e.g. the highly variable MHC locus containing haplotypes HSCHR6_MHC_COX, HSCHR6_MHC_SSTO, HSCHR6_MHC_APD, HSCHR6_MHC_DBB, HSCHR6_MHC_MANN, HSCHR6_MHC_MCF, and HSCHR6_MHC_QBL). In Region in Detail, the haplotype regions are coloured with a red background.

Haplotypes

A set of genes or markers on one chromosome that are inherited together. Often refers to SNPs that are closely linked (i.e. have a high linkage disequilibrium (LD) value and are inherited together). In Region in Detail, the haplotype regions are coloured with a red background.

High-coverage genome

Refers to the number of overlapping sequences used to build the genomic assembly. High coverage, such as human and mouse genomes, indicates a good amount of sequence information. This is also referred to as deep-coverage. Low coverage reflects a low amount of sequence information.

Homologues

Specific sequences that are descended from the same common sequence in an ancestor. See orthologues or paralogues.

I

In-del (Insertion-deletion)

A mutation or polymorphism in which one or more base pairs have been inserted into or removed from a genomic sequence.

InterPro

InterPro is an integrated resource for protein families, domains and sites, combining information from several different protein signature databases. InterPro IDs are linked to the summary of information about that domain or family. InterPro is managed by EBI. A number of databases (SwissProt, TrEMBL, PROSITE, PRINTS, Pfam, ProDom, SMART, TIGRFAMs, PIR SuperFamilies and SUPERFAMILY) with different approaches to biological information are used to derive protein signatures. ProteinView, GeneView and DomainView provide links to the relevant InterPro entries.

Intron

The part of the genomic sequence that is transcribed and then spliced out of the transcript (mRNA). Noncoding.

J

Jalview

Jalview is a multiple alignment editor used by the EBI ClustalW server and the Pfam protein domain database. It is also available as a general purpose alignment editor.

K

Karyotype

The Karyotype View in VectorBase displays the set of chromosomes for a species, including the centromere and banding pattern as they would appear under a light microscope. Dark bands indicate heterochromatin and light bands indicate euchromatin.
The Karyotype View is only available for species where the genome assembly provided to us has been assembled into chromosomes, currently only Anopheles gambiae. For many species in VectorBase, the genome assembly consists only of unplaced scaffolds.

Known gene

A known gene is an Ensembl gene for which at least one known transcript has been annotated.

Known transcript

A known Ensembl transcript matches a sequence for the same species in a public, scientific database such as UniProtKB or NCBI RefSeq.

L

LD (Linkage Disequilibrium)

A measure of how often two SNPs or specific sequences are inherited together.

Length (aa)

The number of amino acids in, for example, a protein.

Length (bp)

The number of base pairs in, for example, a transcript.

LincRNA

Large intergenic non-coding RNAs, usually associated with open chromatin signatures such as histone modification sites.

Linkage

A measure of how often features (genes, specific sequences) on a chromosome are inherited together.

lncRNA

A gene that encodes a long non-coding RNA. Source: http://www.sequenceontology.org/browser/current_svn/term/SO:0002127

Low-complexity region

A region in the sequence with a biased composition (for example repeated sequences or residues).

Low-coverage genome

Refers to the number of overlapping sequences used to build the genomic assembly. High coverage, such as human and mouse genomes, indicates a good amount of sequence information. This is also referred to as deep-coverage. Low coverage, such as the lesser-known mammals, reflects a low amount of sequence information. 2X genomes are low coverage.

M

MBRH (Multiple Best Reciprocal Hit)

When, due to gene duplications, there are multiple 'best' hits with identical score, E-value, % identity and % positivity, one is unable to pick a unique orthologue for a gene. This results in more complex graphs of 'best' relationships. This often occurs when different genes have identical translations, which could be due to a duplication event, an assembly error, or chance. On average 3% of the genes have an identical translation to some other gene either within its genome or in another genome.

* MBRH / DUP 1.# - MBRH set where in one genome there is only one gene, but the other genome has multiple genes, all on the same chromosome and within 1.5 megabases of each other. This could be due to recent gene duplication events where sequences have not diverged or a mis-assembly of the genome sequence leading to artificial, apparent gene duplications (e.g. MBRH / DUP 1.2 or MBRH / DUP 1.4).
* MBRH / SYN - This is a more complex MBRH set where there are multiple genes in each genome split across multiple chromosomes. The one(s) labeled MBRH / SYN satisfies both the MBRH criteria and the RHS search criteria.
* MBRH / COMPLEX - This is a more complex MBRH set where there are multiple genes in each genome split across multiple chromosomes. This MBRH pair does not satisfy the RHS criteria.

Microsatellite

A region in the genomic sequence containing short tandem repeats.

miRNA

MicroRNA is single-stranded RNA, typically 21-23 nucleotides long, that is thought to be involved in gene regulation (especially inhibition of protein expression).

miRNA pseudogene

MicroRNA pseudogene.

Misc RNA

Miscellaneous RNA.

Misc RNA pseudogene

Miscellaneous RNA pseudogene.

Motif

A conserved region of sequence with a specific function/ structure.

mt rRNA

Mitochondrial ribosomal RNA.

mt tRNA

Mitochondrial transfer RNA.

Mt tRNA pseudogene

Mitochondrial transfer RNA pseudogene.

Mutation

A modification (insertion, deletion, or alteration) in the genomic or amino acid sequence.

N

ncRNA (non-coding RNA)

Short non-coding RNAs such as rRNA, scRNA, snRNA, snoRNA and miRNA are annotated by the Ensembl ncRNA pipeline. To view these short ncRNAs, go to Region In Detail and open the Configure This Page window. Select ncRNA from the Genes menu.

Transfer RNAs (tRNAs) are identified by tRNAscan. To view tRNAs, go to Region In Detail and open the Configure This Page window. Select tRNA from the Simple Feature menu.

Long intergenic ncRNAs have only been annotated for human and mouse. To view long ncRNAs, go to Region In Detail and open the Configure This Page window. Select lincRNA from the Genes menu.

An RNA transcript that does not encode a protein; rather, the RNA molecule itself is the gene product. Source: http://www.sequenceontology.org/browser/current_svn/term/SO:0000655

Nonsense mediated decay

Transcript is thought to undergo nonsense mediated decay, a process which detects nonsense mutations and prevents the expression of truncated or erroneous proteins.

Novel gene

A novel gene is an Ensembl gene for which only novel transcripts (one or more) have been annotated.

Novel transcript

A novel Ensembl transcript does not match a sequence for the same species in a public, scientific database such as UniProtKB or NCBI RefSeq.

O

ORF (Open Reading Frame)

A DNA sequence that possesses a start codon and a large window of sequence with no stop codon that could potentially code for a protein.
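
To make the definition concrete, here is a minimal Python sketch that scans the three forward reading frames for stretches that begin with ATG and run to the next stop codon. The function name, the min_codons threshold and the requirement that an ORF ends at a stop codon are assumptions of this example; real gene-finding tools also scan the reverse strand and are considerably more sophisticated.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end, sequence) tuples for forward-strand ORFs.

    start and end are 0-based, end-exclusive; the stop codon is included.
    min_codons counts the codons before the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, dna[start:i + 3]))
                start = None                   # look for the next start codon
    return orfs


print(find_orfs("CCATGAAATTTTAGGG"))
```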

Ortholog

Orthologs are genes derived from a common ancestor through vertical descent (or speciation) and can be thought of as the direct evolutionary counterpart. In contrast, paralogues are genes within the same genome that have evolved by duplication.

P

PAR (Pseudoautosomal region)

Small regions of sequence identity located at the tips of the short and the long arms of the X and Y chromosomes where recombination and genetic exchange take place. Genes within the pseudoautosomal region are not sex linked. The Genome Reference Consortium defines two PARs for the human genome assembly. The first pseudoautosomal region, PAR1, is located at the tip of the short arm and consists entirely of N's. The second pseudoautosomal region, PAR2, is located at the tip of the long arm.

In the Ensembl human database, DNA for the complete X chromosome is stored and annotated. Only the two unique regions of the Y chromosome are stored and annotated. We are able to represent the complete Y chromosome by filling the 'gaps' with the two PAR regions from the X chromosome. This is done on-the-fly using our assembly_exceptions table.

Please note that when using the API, SliceAdaptor by default will fetch only the unique regions of the genome. This means that the PARs on chromosome X will be fetched but only the unique regions on Y will be fetched. To fetch the full length of the Y chromosome using the SliceAdaptor, set the 4th argument to '1' as shown:

    my $slices = $slice_adaptor->fetch_all( 'toplevel', 'GRCh37', 0, 1 );

Paralog

A sequence that has evolved by duplication: "Anopheles albimanus has four paralogs of gene X while Anopheles gambiae has only one copy"

PDB (Protein Data Bank)

A repository for 3-D biological macromolecular structure data. PDB archives protein structures deduced from crystallography and nuclear magnetic resonance (NMR) experiments. The Protein Data Bank (PDB) is operated by Rutgers, The State University of New Jersey; the San Diego Supercomputer Center at the University of California, San Diego; and the Center for Advanced Research in Biotechnology of the National Institute of Standards and Technology -- three members of the Research Collaboratory for Structural Bioinformatics (RCSB). The RCSB PDB is supported by funds from the National Science Foundation, the Department of Energy, and the National Institutes of Health.

Pfam

Pfam is a large collection of multiple sequence alignments and hidden Markov models covering many common protein domains and families. Pfam can be used to view the domain organization of proteins, to view multiple alignments, protein domain architectures, protein structures, and species distributions.

Pmatch

Pmatch is a fast, exact matching program for aligning protein sequences with either protein or DNA sequence.

Polymorphic pseudogene

Pseudogene owing to a polymorphism in the reference genome, translated in other individuals/haplotypes/strains.

Pre-release site

Initial annotations of upcoming Ensembl genomes, usually without gene predictions or validation, are regularly made available on the pre-release site (pre.ensembl.org).

pre_miRNA search for term

The 60-70 nucleotide region remaining after Drosha processing of the primary transcript, that folds back upon itself to form a hairpin structure. Source: http://www.sequenceontology.org/browser/current_svn/term/SO:0001244

Prints search for term

The PRINTS protein fingerprint database is a compendium of protein fingerprints. A fingerprint is a group of conserved motifs used to characterise a protein family; its diagnostic power is refined by iterative scanning of a SwissProt/TrEMBL composite. Usually the motifs do not overlap, but are separated along a sequence, though they may be contiguous in 3D-space. Fingerprints can encode protein folds and functionalities more flexibly and powerfully than can single motifs, full diagnostic potency deriving from the mutual context provided by motif neighbors.

Processed pseudogene search for term

Noncoding pseudogene produced by integration of a reverse transcribed mRNA into the genome.

Processed transcript search for term

Noncoding transcript that does not contain an open reading frame (ORF).

Projected gene (or known by_projection) search for term

A projected Ensembl gene has only novel transcripts annotated (one or more), and has a known gene from human or mouse as an orthologue. The gene symbol and description are projected from the human or mouse orthologue.

Prosite search for term

PROSITE is a database of protein families and domains run by the Expert Protein Analysis System (ExPASy) proteomics server of the Swiss Institute of Bioinformatics (SIB). It consists of biologically significant sites, patterns and profiles that help to reliably identify to which known protein family (if any) a new sequence belongs.

Protein coding search for term

A protein coding transcript is a spliced mRNA that leads to a protein product.

Protein ID search for term

Ensembl protein IDs are unique for differing translations.

Pseudogene search for term

A noncoding sequence similar to an active protein. A sequence that closely resembles a known functional gene, at another locus within a genome, that is non-functional as a consequence of (usually several) mutations that prevent either its transcription or translation (or both). In general, pseudogenes result from either reverse transcription of a transcript of their "normal" paralog (SO:0000043) (in which case the pseudogene typically lacks introns and includes a poly(A) tail) or from recombination (SO:0000044) (in which case the pseudogene is typically a tandem duplication of its "normal" paralog). Source: http://www.sequenceontology.org/browser/current_svn/term/SO:0000336

Q

QTL (Quantitative Trait Locus) search for term

Genetic loci where allelic variation is associated with variation in a quantitative trait (e.g. blood pressure). The presence of QTL is inferred from genetic mapping. Total variation is partitioned into components linked to a number of discrete, mapped chromosome markers described by statistical association to quantitative variation in a particular phenotypic trait that is thought to be controlled by the cumulative action of alleles at multiple loci.

Query %id search for term

Query %id indicates the percentage of the query sequence matching the target sequence.
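
As a rough illustration of this kind of percentage, the sketch below computes a query %id (and the corresponding target %id) from a gapped pairwise alignment. It assumes the alignment is supplied as two equal-length strings with '-' marking gaps, and that "matching" means identical residues; the function name is hypothetical and the exact calculation used by Ensembl may differ.

```python
from typing import Tuple


def percent_id(query_aln: str, target_aln: str) -> Tuple[float, float]:
    """Return (query %id, target %id) for a gapped pairwise alignment.

    Assumption: both strings have equal length and use '-' for gaps.
    """
    if len(query_aln) != len(target_aln):
        raise ValueError("aligned strings must have equal length")

    # Count columns where both sequences carry the same residue.
    matches = sum(
        1 for q, t in zip(query_aln, target_aln) if q != "-" and q == t
    )

    # Ungapped lengths of each sequence.
    query_len = len(query_aln.replace("-", ""))
    target_len = len(target_aln.replace("-", ""))
    if query_len == 0 or target_len == 0:
        raise ValueError("sequences must contain at least one residue")

    return 100.0 * matches / query_len, 100.0 * matches / target_len


query_pct, target_pct = percent_id("ACGT--", "ACGTAC")
print(query_pct, round(target_pct, 1))
```

The two values differ whenever the sequences differ in length: here the whole query matches, but only four of the six target residues are covered.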

R

Reference SNP (Reference Single Nucleotide Polymorphism) search for term

A SNP assigned to eliminate redundancy in the NCBI dbSNP database. All SNPs submitted at the position of a reference SNP are given the reference SNP identifier (a number preceded by 'rs').

RefSeq search for term

NCBI's Reference Sequences (RefSeq) database is a curated database of GenBank's genomes, mRNAs and proteins. RefSeq attempts to provide a comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products, providing a stable reference for gene identification and characterization, mutation analysis, expression studies, polymorphism discovery, and comparative analyses.

repeat search for term

Repetitive DNA in which the same sequence occurs multiple times.

Repeat Masking search for term

The method by which repeated sequences and low-complexity regions are hidden, usually used in searches by alignment and homology-searching programs.
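
A minimal sketch of two common ways of hiding repeats, assuming the repeat coordinates are already known: "soft" masking (repeats written in lower case) and "hard" masking (repeats replaced by N). The function name and the 0-based, half-open interval convention are assumptions made for this illustration.

```python
def mask_repeats(seq, repeats, soft=True):
    """Mask repeat intervals in a DNA sequence.

    repeats: iterable of (start, end) pairs, 0-based and half-open
             (an assumption of this sketch).
    soft=True  -> lower-case the repeat bases (soft masking)
    soft=False -> replace the repeat bases with 'N' (hard masking)
    """
    chars = list(seq.upper())
    for start, end in repeats:
        # Clamp each interval to the sequence bounds.
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = chars[i].lower() if soft else "N"
    return "".join(chars)


print(mask_repeats("ACGTACGTAC", [(2, 5)]))              # soft-masked
print(mask_repeats("ACGTACGTAC", [(2, 5)], soft=False))  # hard-masked
```

Soft masking keeps the original bases available to programs that choose to read them, whereas hard masking discards them.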

RepeatMasker search for term

RepeatMasker (AFA Smit & P Green) is a standard software tool used in computational genomics to identify repetitive elements and low-complexity sequences.

Retained intron search for term

Noncoding transcript containing intronic sequence.

Retrotransposed search for term

A noncoding pseudogene produced by integration of a reverse transcribed mRNA into the genome.

RH map (Radiation Hybrid map) search for term

A technique for identifying landmarks (STSs) every 100 kb in the human genome. The landmarks are ordered according to the frequency with which they are separated by radiation-induced breaks. The frequency is assayed by analysing a panel of human-hamster hybrid cell lines.

ribozyme search for term

An RNA with catalytic activity. Source: http://www.sequenceontology.org/browser/current_svn/term/SO:0000374

RNase_MRP_RNA search for term

The RNA molecule essential for the catalytic activity of RNase MRP, an enzymatically active ribonucleoprotein with two distinct roles in eukaryotes. In mitochondria it plays a direct role in the initiation of mitochondrial DNA replication. In the nucleus it is involved in precursor rRNA processing, where it cleaves the internal transcribed spacer 1 between 18S and 5.8S rRNAs. Source: http://www.sequenceontology.org/browser/current_svn/term/SO:0000385

RNase_P_RNA search for term

The RNA component of Ribonuclease P (RNase P), a ubiquitous endoribonuclease, found in archaea, bacteria and eukarya as well as chloroplasts and mitochondria. Its best characterized activity is the generation of mature 5 prime ends of tRNAs by cleaving the 5 prime leader elements of precursor-tRNAs. Cellular RNase Ps are ribonucleoproteins. RNA from bacterial RNase Ps retains its catalytic activity in the absence of the protein subunit, i.e. it is a ribozyme. Isolated eukaryotic and archaeal RNase P RNA has not been shown to retain its catalytic function, but is still essential for the catalytic activity of the holoenzyme. Although the archaeal and eukaryotic holoenzymes have a much greater protein content than the bacterial ones, the RNA cores from all the three lineages are homologous. Helices corresponding to P1, P2, P3, P4, and P10/11 are common to all cellular RNase P RNAs. Yet, there is considerable sequence variation, particularly among the eukaryotic RNAs. Source: http://www.sequenceontology.org/browser/current_svn/term/SO:0000386

rRNA search for term

Ribosomal RNA. RNA that comprises part of a ribosome, and that can provide both structural scaffolding and catalytic activity. Source: http://www.sequenceontology.org/browser/current_svn/term/SO:0000252

rRNA pseudogene search for term

Ribosomal RNA pseudogene.

S

SARA (Same As Reference Assembly) search for term

An acronym used to indicate a SNP (single nucleotide polymorphism) that has the same sequence as the strain used in the assembly.

Scaffold search for term

Supercontigs or scaffolds are sets of ordered, oriented contigs. They are longer sequences than contigs, but shorter than full chromosomes.

scaRNA search for term

A ncRNA, specific to the Cajal body, that has been demonstrated to function as a guide RNA in the site-specific synthesis of 2'-O-ribose-methylated nucleotides and pseudouridines in the RNA polymerase II-transcribed U1, U2, U4 and U5 spliceosomal small nuclear RNAs (snRNAs). Source: http://www.sequenceontology.org/browser/current_svn/term/SO:0002095

ScRNA search for term

Small cytoplasmic RNA.

ScRNA pseudogene search for term

Small cytoplasmic RNA pseudogene.

SEG search for term

SEG divides sequences into contrasting segments of low complexity and high complexity. Low-complexity segments defined by the algorithm represent "simple sequences" or "compositionally-biased regions". Segment lengths and the number of segments per sequence are determined automatically by the algorithm.

sense_intronic search for term

A non-coding transcript found within an intron of a coding or non-coding gene, with no overlap of exonic sequence. Source: http://www.sequenceontology.org/browser/current_svn/term/SO:0002131

Sequence identity search for term

A measure of how similar two sequences are, specifically, what percent of amino acids are the same in type and position between the two sequences.

Shotgun method search for term

(also whole genome shotgun) Semi-automated sequencing method that involves sequencing randomly cloned pieces of the genome (size selected, usually 2, 10, 50 and 150 kb), with no prior knowledge of their location. The clones are then sequenced from both ends. The two ends of the same clone are referred to as mate pairs. The distance between two "mate pairs" can be inferred if the library size is known and has a narrow window of deviation. This approach can be contrasted with "directed" strategies, in which pieces of DNA from known chromosomal locations are sequenced.

Shotgun sequencing search for term

A method in which small, random DNA sequences are generated that overlap. The fragments are sequenced and the full, connected sequence determined through the overlaps.
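
The idea of reconnecting fragments through their overlaps can be sketched with a toy greedy merge. This is an illustration only: it assumes error-free reads that all come from the same strand, it ignores repeats, and the function names are hypothetical. Real assemblers are considerably more sophisticated.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # no remaining overlaps; leave the reads unjoined
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


print(greedy_assemble(["ATGCGT", "CGTACG", "ACGGA"]))
```

Each merge joins two fragments at their shared end, so three overlapping reads collapse into a single connected sequence.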

SIFT search for term

A tool which predicts the variation effect on protein function based on sequence homology and the physico-chemical similarity between the alternate amino acids. See the SIFT website for more information.

SignalP search for term

The SignalP application predicts the presence and location of signal peptide cleavage sites in amino acid sequences from different organisms: Gram-positive prokaryotes, Gram-negative prokaryotes, and eukaryotes. The method incorporates a prediction of cleavage sites and a signal peptide/non-signal peptide prediction based on a combination of several artificial neural networks. Signal peptides indicate a protein that will be secreted. Prediction of signal peptides is quite accurate; however, care must be exercised and these regions should be verified by other means. (Henrik Nielsen, Jacob Engelbrecht, Søren Brunak and Gunnar von Heijne. Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Protein Engineering 10, 1-6 (1997).)

Similarity search for term

How well one sequence matches another, as determined by an alignment program from the identical and conserved residues.

Slice search for term

The term "slice" in Ensembl refers to a length of DNA sequence. A slice can be any length, from one base long to the entire length of a chromosome. A slice is defined as follows: 'coord_system_name:coord_system_version:seq_region_name:start:end:strand', e.g. 'chromosome:GRCm38:X:1000:2000:1'
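
The slice definition string can be split into its six named parts. The sketch below is a standalone illustration, not part of the Ensembl API; the class and function names are hypothetical, and the length calculation assumes that start and end are both inclusive.

```python
from typing import NamedTuple


class SliceName(NamedTuple):
    coord_system_name: str
    coord_system_version: str
    seq_region_name: str
    start: int
    end: int
    strand: int

    @property
    def length(self) -> int:
        # Assumes inclusive start and end coordinates.
        return self.end - self.start + 1


def parse_slice_name(name: str) -> SliceName:
    """Split a colon-separated slice definition into its six fields."""
    parts = name.split(":")
    if len(parts) != 6:
        raise ValueError("expected 6 colon-separated fields, got %d" % len(parts))
    cs_name, cs_version, region, start, end, strand = parts
    return SliceName(cs_name, cs_version, region, int(start), int(end), int(strand))


s = parse_slice_name("chromosome:GRCm38:X:1000:2000:1")
print(s.seq_region_name, s.start, s.end, s.strand, s.length)
```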

SNAP search for term

1. (Synonymous/Non-synonymous Analysis Program) A program which calculates synonymous and non-synonymous substitution rates from a set of codon-aligned nucleotide sequences, using the method of Nei and Gojobori and incorporating a statistic developed in Ota and Nei. 2. An ab initio gene prediction program developed by Ian Korf that models protein coding sequences in genomic DNA by means of hidden Markov models.

snoRNA search for term

Small nucleolar RNA, involved in modifications of other RNAs. A snoRNA (small nucleolar RNA) is any one of a class of small RNAs that are associated with the eukaryotic nucleus as components of small nucleolar ribonucleoproteins. They participate in the processing or modifications of many RNAs, mostly ribosomal RNAs (rRNAs) though snoRNAs are also known to target other classes of RNA, including spliceosomal RNAs, tRNAs, and mRNAs via a stretch of sequence that is complementary to a sequence in the targeted RNA. Source: http://www.sequenceontology.org/browser/current_svn/term/SO:0000275

SnoRNA pseudogene search for term

Pseudogene of a small nucleolar RNA (snoRNA), a class of RNA involved in modifications of other RNAs.

SNP (Single Nucleotide Polymorphism) search for term

SNPs are common variations that occur in DNA with a 0.1% frequency. Ensembl displays SNPs obtained from dbSNP (the SNP repository maintained by NCBI), the Human Genic Bi-Allelic Sequences Database (HGVBase) and The SNP Consortium Ltd. (TSC).

snRNA search for term

A small nuclear RNA molecule involved in pre-mRNA splicing and processing. Source: http://www.sequenceontology.org/browser/current_svn/term/SO:0000274

SnRNA pseudogene search for term

Small nuclear RNA pseudogene.

SNV search for term

A Single Nucleotide Variant (SNV) is a nucleotide position in genomic DNA at which different sequence alternatives (alleles) exist. SNVs include SNPs and single nucleotide insertions or deletions. See more details here.

Solexa sequencing search for term

Type of Next-generation sequencing (NGS).

SRP_RNA search for term

The signal recognition particle (SRP) is a universally conserved ribonucleoprotein. It is involved in the co-translational targeting of proteins to membranes. The eukaryotic SRP consists of a 300-nucleotide 7S RNA and six proteins: SRPs 72, 68, 54, 19, 14, and 9. Archaeal SRP consists of a 7S RNA and homologues of the eukaryotic SRP19 and SRP54 proteins. In most eubacteria, the SRP consists of a 4.5S RNA and the Ffh protein (a homologue of the eukaryotic SRP54 protein). Eukaryotic and archaeal 7S RNAs have very similar secondary structures, with eight helical elements. These fold into the Alu and S domains, separated by a long linker region. Eubacterial SRP is generally a simpler structure, with the M domain of Ffh bound to a region of the 4.5S RNA that corresponds to helix 8 of the eukaryotic and archaeal SRP S domain. Some Gram-positive bacteria (e.g. Bacillus subtilis), however, have a larger SRP RNA that also has an Alu domain. The Alu domain is thought to mediate the peptide chain elongation retardation function of the SRP. The universally conserved helix which interacts with the SRP54/Ffh M domain mediates signal sequence recognition. In eukaryotes and archaea, the SRP19-helix 6 complex is thought to be involved in SRP assembly and stabilizes helix 8 for SRP54 binding. Source: http://www.sequenceontology.org/browser/current_svn/term/SO:0000590

SSAHA (Sequence Search and Alignment by Hashing Algorithm) search for term

A search designed to detect exact matches, or nearly exact matches, in DNA or protein databases. The SSAHA search has been optimized for alignments of high percentage identity and displays as results the most significant matches for ungapped alignments between sequences. Each exact match in an SSAHA alignment is analogous to finding a high-scoring segment pair in BLAST. A number of consecutive matches on a contig may represent features of a gene such as exons or 5' and 3' untranslated regions, depending on the nature of the query sequence.

Stable ID (Stable identifier) search for term

Stable identifiers are defined for a number of features including genes, transcripts, translations, exons. Stable IDs for all species follow the same format:

species stable ID prefix + 6 numbers + feature type suffix

Stable IDs are versioned.
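
A minimal sketch of how an identifier in this format could be split into its three parts. The regular expression assumes a letters-only prefix and treats everything after the six digits as the suffix; the identifier 'ABCD000123-RA' is invented for illustration, and the function name is hypothetical. The format of the version is not described here, so it is not handled.

```python
import re

# Assumed pattern: letters-only prefix, exactly six digits, then any suffix.
STABLE_ID_PATTERN = re.compile(r"^([A-Za-z]+)(\d{6})(.*)$")


def split_stable_id(stable_id):
    """Return (prefix, number, suffix) for a stable ID in the format above."""
    match = STABLE_ID_PATTERN.match(stable_id)
    if match is None:
        raise ValueError("not in prefix + 6 digits + suffix form: %r" % stable_id)
    return match.groups()


# Invented identifiers, for illustration only.
print(split_stable_id("ABCD000123-RA"))
print(split_stable_id("ABCD000123"))
```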

Start phase search for term

In protein-coding exons, the start phase is the place where the intron lands inside the codon: 0 between codons, 1 between the first and second base, 2 between the second and third base. Exons therefore have a start phase and an end phase, but introns have just one phase.
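
As a worked illustration, the phases of consecutive exons can be derived from their lengths. The sketch assumes that every exon is entirely coding and that the first exon begins on the first base of a codon (phase 0); under those assumptions an exon's end phase is (start phase + exon length) modulo 3, and it becomes the start phase of the next exon. The function name is hypothetical.

```python
def exon_phases(exon_lengths, first_phase=0):
    """Return a (start_phase, end_phase) pair for each coding exon.

    Assumes the exons are fully coding and listed in transcript order.
    """
    phases = []
    phase = first_phase
    for length in exon_lengths:
        end_phase = (phase + length) % 3
        phases.append((phase, end_phase))
        # The intron after this exon has a single phase, shared by
        # the end of this exon and the start of the next.
        phase = end_phase
    return phases


print(exon_phases([100, 50, 30]))
```

The shared value is why an intron has just one phase while each exon has two.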

STS markers search for term

STS markers are short sequences of genomic DNA that can be uniquely amplified by the polymerase chain reaction (PCR) using a pair of primers. Because each is unique, STSs are often used in linkage and radiation hybrid mapping techniques. STSs serve as landmarks on the physical map of the human genome.

Supercontig search for term

Supercontigs or scaffolds are sets of ordered, oriented contigs. They are longer sequences than contigs, but shorter than full chromosomes.

supercontigs search for term

Assemblies consist of sequence contigs combined into scaffolds, also known as supercontigs. Supercontigs are combined and ordered according to their orientation and linking information provided by mated sequences from the ends of genomic sub-clones. For some species, supercontigs are combined into ultracontigs, in which neighboring supercontigs are organized into their proper order and orientation using linking information provided by the physical map of BAC clones independently assembled using restriction fragment patterns and the FPC program.

SV search for term

Structural variation. It is generally defined as a region of DNA approximately 1 kb and larger in size and can include inversions and balanced translocations or genomic imbalances (insertions and deletions), commonly referred to as copy number variants (CNVs). More details here.

Synteny search for term

The term synteny was originally defined to mean that two gene loci share the same chromosome. In a genomic context we refer to syntenic regions if both sequence and gene order are conserved between two (closely related) species.

T

tandem repeats search for term

Multiple copies of the same base sequence on a chromosome; used as markers in physical mapping.

Target %id search for term

Target %id indicates the percentage of the target sequence matching the query sequence.

Toplevel search for term

The largest continuous sequence for an organism. The official technical definition of toplevel sequences is 'sequence regions in the genome assembly that are not a component of another sequence region'. For example, when a genome is assembled into chromosomes, toplevel sequences will be chromosomes and unplaced scaffolds. If a genome has only been assembled into scaffolds, then toplevel sequences are scaffolds and unplaced contigs.
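
The technical definition can be expressed directly as a set operation. The sketch below assumes the assembly is available as a list of (assembled region, component region) pairs; the region names and the function name are invented for illustration.

```python
def toplevel_regions(all_regions, assembly):
    """Return the regions that are not a component of another region.

    assembly: iterable of (assembled_region, component_region) pairs.
    """
    components = {component for _, component in assembly}
    return sorted(r for r in all_regions if r not in components)


# Invented example: one chromosome built from two scaffolds,
# plus a third scaffold that has not been placed on a chromosome.
regions = ["chr1", "scaffold_1", "scaffold_2", "scaffold_3",
           "contig_1", "contig_2", "contig_3"]
assembly = [
    ("chr1", "scaffold_1"),
    ("chr1", "scaffold_2"),
    ("scaffold_1", "contig_1"),
    ("scaffold_2", "contig_2"),
    ("scaffold_3", "contig_3"),
]
print(toplevel_regions(regions, assembly))
```

In this example the toplevel sequences are the chromosome and the unplaced scaffold, matching the first case described above.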

Transcribed processed pseudogene search for term

Processed pseudogene with evidence of expression.

Transcribed unprocessed pseudogene search for term

Unprocessed pseudogene with evidence of expression.

Transcript search for term

Nucleotide sequence resulting from the transcription of the genomic DNA to mRNA. One gene can have different transcripts or splice variants resulting from the alternative splicing of different exons in genes.

Transcript ID search for term

Ensembl transcript identifiers are unique for each splice variant.

translation start site search for term

The position within an mRNA at which synthesis of a protein begins. The translation start site is usually an AUG codon, but occasionally, GUG or CUG codons are used to initiate protein synthesis.

tRNA search for term

Transfer RNA. These are identified using tRNAscan. Transfer RNA (tRNA) molecules are approximately 80 nucleotides in length. Their secondary structure includes four short double-helical elements and three loops (D, anti-codon, and T loops). Further hydrogen bonds mediate the characteristic L-shaped molecular structure. Transfer RNAs have two regions of fundamental functional importance: the anti-codon, which is responsible for specific mRNA codon recognition, and the 3' end, to which the tRNA's corresponding amino acid is attached (by aminoacyl-tRNA synthetases). Transfer RNAs cope with the degeneracy of the genetic code in two manners: having more than one tRNA (with a specific anti-codon) for a particular amino acid; and 'wobble' base-pairing, i.e. permitting non-standard base-pairing at the 3rd anti-codon position. Source: http://www.sequenceontology.org/browser/current_svn/term/SO:0000253

tRNA pseudogene search for term

Transfer RNA pseudogene.

U

Unigene search for term

UniGene is an experimental system for automatically partitioning GenBank sequences into a non-redundant set of gene-oriented clusters. Each UniGene cluster contains sequences that represent a unique gene, as well as related information such as the tissue types in which the gene has been expressed and map location.

UniProt/Swiss-Prot search for term

UniProt (Universal Protein Resource) is the world's most comprehensive catalogue of information on proteins. UniProt/Swiss-Prot is a curated protein sequence database that provides a high level of annotation, a minimal level of redundancy and a high level of integration with other databases. Swiss-Prot is maintained collaboratively by the Swiss Institute for Bioinformatics (SIB) and the European Bioinformatics Institute (EBI).

UniProt/TrEMBL search for term

SPTrEMBL is a subset of TrEMBL (Translated EMBL database) containing the computer-annotated protein translations of all coding sequences (CDS) present in the EMBL nucleotide sequence database that are not yet incorporated into the UniProt/Swiss-Prot database.

UniSTS search for term

UniSTS is an NCBI resource for non-redundant Sequence Tagged Sites (STS) markers. For each marker, UniSTS displays the primer sequences, product size, and mapping information, as well as cross references to dbSNP, RHdb, GDB, MGD, etc. The marker report also lists GenBank and RefSeq records that contain the primer sequences determined by ePCR.

Unitary pseudogene search for term

An unprocessed pseudogene with an active orthologue in another species.

Unprocessed pseudogene search for term

A noncoding pseudogene arising from gene duplication.

UTR (Untranslated Region) search for term

The 5' UTR is the portion of an mRNA from the 5' end to the position of the first codon used in translation. The 3' UTR is the portion of an mRNA from the position of the last codon that is used in translation to the 3' end.
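
Given an mRNA and the bounds of its coding sequence, the two UTRs are simply the flanking portions. The sketch assumes 0-based, half-open coordinates in which cds_start is the first base of the first codon used in translation and cds_end is the position just after the last codon; the sequence and the function name are invented for illustration.

```python
def split_utrs(mrna, cds_start, cds_end):
    """Return (five_prime_utr, cds, three_prime_utr) for an mRNA string.

    cds_start, cds_end: 0-based, half-open bounds of the coding sequence
    (an assumption of this sketch).
    """
    if not 0 <= cds_start <= cds_end <= len(mrna):
        raise ValueError("coding sequence bounds fall outside the mRNA")
    return mrna[:cds_start], mrna[cds_start:cds_end], mrna[cds_end:]


# Invented mRNA: 3-base 5' UTR, AUG AAA UAG coding sequence, 3-base 3' UTR.
five_utr, cds, three_utr = split_utrs("GGCAUGAAAUAGCCA", 3, 12)
print(five_utr, cds, three_utr)
```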

V

Validation status search for term

A measure of confidence that a variant is a true polymorphism. It includes 1000 Genomes, HapMap and other validation statuses from dbSNP such as frequency and cluster. See a detailed description on the dbSNP website.

Variation source search for term

The origin of the variation data (e.g. dbSNP, COSMIC, DGVa).

Vector competence search for term

The ability to transmit a pathogen, e.g. Plasmodium, dengue.

VEP (Variant Effect Predictor) search for term

An Ensembl tool that allows users to provide a list of variants and export a results file containing consequence types.

Y

YAC (Yeast Artificial Chromosome) search for term

Originating from a bacterial plasmid, a YAC contains a yeast centromeric region (CEN), a yeast origin of DNA replication, a cluster of unique restriction sites, a selectable marker and a telomere region at the end of each arm. YACs are capable of cloning extremely large segments of DNA (over 1 megabase long) into a host cell, where the DNA is propagated along with the other chromosomes of the yeast cell.

Z

Zoophilic search for term

Zoophilic mosquitoes preferentially feed on animals.