How are "high confidence" orthologs defined?


Ortholog quality metrics are calculated for selected groups of species and are used to define a "high confidence" set of orthologs. The method draws on two independent sources of information: gene order conservation (GOC) and whole genome alignments (WGA).

The "GOC score" metric for a pair of orthologs measures whether the two genes upstream and the two genes downstream of each gene in the pair are also orthologous; it allows for inversions and gene insertions. The "WGA coverage" metric measures the extent to which the orthologous regions are covered by pairwise LASTZ alignments; it is based primarily on exonic coverage, with a small contribution from intronic coverage. Both metrics range from 0 to 100.

Gene order is only expected to be conserved between evolutionarily close species, so the GOC score is calculated only within Diptera, Chelicerata, and Hemiptera. Similarly, pairwise WGAs, and therefore the WGA coverage metric, are only available for a subset of fairly closely related species.

To classify orthologs as "high confidence", thresholds are applied to the ortholog metrics according to the evolutionary distance between the species. Within Anophelinae and Glossinidae, the GOC and WGA thresholds are both 50; within Brachycera, Culicinae, Hemiptera, and Phlebotominae, both thresholds are 25; no thresholds are applied beyond these clades. Where the GOC and WGA metrics are not available, a "tree-compliance" metric is used to identify orthologs inferred from dubious tree topologies, which are then excluded from the "high confidence" set. Finally, to be included in the "high confidence" set, both orthologous proteins must have a percentage identity above a threshold, currently set at 25% for all species.
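The classification rules above can be sketched as a small function. This is a minimal illustration under stated assumptions, not the pipeline's actual code: the function and parameter names are hypothetical, tree compliance is modelled as a simple boolean, a threshold is treated as met when a score is greater than or equal to it, and an ortholog is assumed to pass if either available metric meets the clade threshold (the text does not say whether one or both are required).

```python
# Clade-specific thresholds taken from the text; the same value applies to
# both the GOC score and the WGA coverage.
THRESHOLDS = {
    "Anophelinae": 50, "Glossinidae": 50,
    "Brachycera": 25, "Culicinae": 25, "Hemiptera": 25, "Phlebotominae": 25,
}
MIN_IDENTITY = 25.0  # percentage identity threshold, applied to all species


def is_high_confidence(clade, goc, wga, identity_a, identity_b,
                       tree_compliant=None):
    """Return True if an ortholog pair would be classed as high confidence.

    clade: smallest listed clade containing both species, or None if the
        species are more distantly related (no metric thresholds apply).
    goc, wga: scores from 0 to 100, or None when a metric is unavailable.
    tree_compliant: assumed boolean; only consulted when goc and wga are
        both unavailable.
    """
    available = [score for score in (goc, wga) if score is not None]
    threshold = THRESHOLDS.get(clade)

    if not available:
        # No GOC or WGA metric: fall back to tree compliance.
        if tree_compliant is not True:
            return False
    elif threshold is not None:
        # Assumption: one metric reaching the threshold is sufficient.
        if not any(score >= threshold for score in available):
            return False

    # Both proteins must reach the identity threshold (>= is an assumption).
    return identity_a >= MIN_IDENTITY and identity_b >= MIN_IDENTITY


examples = [
    is_high_confidence("Anophelinae", 75, 30, 60.0, 55.0),
    is_high_confidence("Anophelinae", 25, 30, 60.0, 55.0),
    is_high_confidence(None, None, None, 60.0, 55.0, tree_compliant=True),
]
```

Passing the clade in as an argument keeps the sketch independent of any particular taxonomy lookup; in practice the caller would determine which listed clade, if any, contains both species.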

The metrics are displayed in the genome browser in the ortholog table, and are available in BioMart.

The following plots show, respectively, the percentage of orthologs with some degree of gene order conservation; the mean WGA coverage; and the mean percentage identity between orthologous protein sequences:

Gene Order Conservation: Score Metric
Whole Genome Alignment: Coverage Metric
Percentage Identity