SOP004
From VectorBase Help System
SOP definition
Gene accuracy, gene structure inferred from transcript data only
Introduction
This document describes the VectorBase gene structure predictions, with particular reference to the lines of evidence used in prediction. It does not cover the downstream analyses that provide functional annotation or assign gene symbols and descriptions.
Summary of product assigned to this SOP
A gene structure prediction associated with VB:SOP004 has been predicted on the basis of known expressed transcript sequences (predominantly ESTs). The prediction consists of the exon/intron structure for the CDS (CoDing Sequence) and may include untranslated regions. The correct initiation methionine and stop codon may not have been predicted, and many structures will be partial predictions.
Such predictions are regarded as high quality because they are supported by transcript data: the exon/intron structures can be taken as correct and the presence of a gene locus is highly likely.
Note that this SOP is assigned to a transcript. The assignment of this SOP to a transcript does not preclude the existence of alternative splice forms (isoforms) for the parental gene. Annotated alternative isoforms have independent SOP assignments.
Prediction process
Preparation
* Mask repeats in the genome sequence with RepeatMasker[1] using a custom-built library (lib) file.
* Align EST sequences to the genome using Exonerate[2] at a stringency that tolerates mismatches within the (presumed) exonic alignments.
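The two preparation steps could be driven from a script along the following lines. This is only a sketch: the SOP does not record the actual command lines, so the choice of the `est2genome` model, the `--percent` threshold value, and all file names are assumptions made for illustration. The functions only build argument lists; they do not run anything.

```python
def repeatmasker_cmd(genome_fasta, repeat_lib):
    """Build a RepeatMasker command that uses a custom repeat library.

    File names are placeholders; -lib selects the custom library file.
    """
    return ["RepeatMasker", "-lib", repeat_lib, genome_fasta]


def exonerate_cmd(est_fasta, masked_genome, min_percent=50):
    """Build an Exonerate command for aligning ESTs to a masked genome.

    Assumptions (not stated in the SOP): the est2genome model is used and
    stringency is controlled through --percent. The default of 50 is an
    arbitrary placeholder, not the value used in the build.
    """
    return [
        "exonerate",
        "--model", "est2genome",
        "--query", est_fasta,
        "--target", masked_genome,
        "--percent", str(min_percent),
        "--showtargetgff", "yes",  # emit GFF lines describing the alignment
    ]


if __name__ == "__main__":
    print(" ".join(repeatmasker_cmd("genome.fa", "custom_repeats.lib")))
    print(" ".join(exonerate_cmd("ests.fa", "genome.fa.masked")))
```

The resulting lists can be passed to `subprocess.run` once the paths and thresholds have been set to values appropriate for the genome in question.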
Predicting gene structures
* The output from the Exonerate program was parsed to yield a set of partial gene structure predictions, each corresponding to one transcript alignment (i.e. each mapped EST produces a gene structure).
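The parsing step can be illustrated with a small sketch. It is not the parser used in the build. It assumes tab-separated GFF lines in which each alignment starts with a `gene` feature whose attribute column contains `sequence <EST id>`, followed by that alignment's `exon` features; the function names and the dictionary layout are hypothetical.

```python
def parse_attributes(text):
    """Parse 'key value ; key value' attribute text into a dict."""
    attrs = {}
    for part in text.split(";"):
        tokens = part.split()
        if len(tokens) >= 2:
            attrs[tokens[0]] = tokens[1]
    return attrs


def parse_structures(gff_lines):
    """Return one gene structure (dict) per aligned EST.

    Each 'gene' line opens a new structure; subsequent 'exon' lines are
    attached to it. Other feature types (introns, splice sites) are ignored.
    """
    structures = []
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a GFF feature line
        feature = fields[2]
        if feature == "gene":
            structures.append({
                "est_id": parse_attributes(fields[8]).get("sequence", "unknown"),
                "seq_region": fields[0],
                "strand": fields[6],
                "exons": [],
            })
        elif feature == "exon" and structures:
            structures[-1]["exons"].append((int(fields[3]), int(fields[4])))
    for structure in structures:
        structure["exons"].sort()  # order exons by genomic start
    return structures
```

Each returned dictionary represents one partial gene structure, so two ESTs aligned to the same locus yield two separate structures at this stage.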
EST GeneBuilder
The Ensembl ClusterMerge algorithm was used to collapse the EST/cDNA-based alignments into a non-redundant set of transcripts with complete open reading frames (n.b. complete here means only that the predicted coding length is a multiple of three; it does not imply that an initiation or stop codon has been predicted).
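The sense of "complete" used here can be made concrete with a short sketch. It is not the ClusterMerge algorithm: it only removes structures with identical exon coordinates and keeps those whose total length is a multiple of three, whereas ClusterMerge also merges overlapping, compatible alignments. The function names are hypothetical, and exons are assumed to be given as 1-based inclusive `(start, end)` coding coordinates.

```python
def cds_length(exons):
    """Total length of a list of (start, end) exons, inclusive coordinates."""
    return sum(end - start + 1 for start, end in exons)


def has_complete_frame(exons):
    """True if the length is a multiple of three.

    This says nothing about whether a start or stop codon is present.
    """
    return cds_length(exons) % 3 == 0


def collapse_identical(structures):
    """Drop exact duplicates, keeping the first occurrence of each structure."""
    seen = set()
    unique = []
    for exons in structures:
        key = tuple(sorted(exons))
        if key not in seen:
            seen.add(key)
            unique.append(list(key))
    return unique


def non_redundant_in_frame(structures):
    """Simplified stand-in: deduplicate, then keep multiple-of-three lengths."""
    return [s for s in collapse_identical(structures) if has_complete_frame(s)]
```

For example, a structure with exons `(1, 30)` and `(101, 130)` has a length of 60 and passes the check, while a single exon `(1, 31)` has a length of 31 and does not.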
Mark up in VectorBase gene build
Transcripts associated with this SOP are marked up as EST Genes. They will have DNA_align_features as supporting evidence but no protein_align_features.
Methodology of tracking in the build
Transcripts are tracked with MySQL queries that select those having only dna_align_feature entries as supporting_feature evidence.
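The kind of query involved might look like the sketch below. The SOP does not give the actual SQL, so the table and column names (`transcript_supporting_feature`, `transcript_id`, `feature_type`) are assumptions modelled on an Ensembl-style schema, and the example runs against a small in-memory SQLite table with made-up rows rather than a real MySQL build database.

```python
import sqlite3

# Select transcripts that have at least one dna_align_feature as supporting
# evidence and no supporting feature of any other type.
QUERY = """
SELECT transcript_id
FROM transcript_supporting_feature
GROUP BY transcript_id
HAVING SUM(feature_type = 'dna_align_feature') > 0
   AND SUM(feature_type <> 'dna_align_feature') = 0
ORDER BY transcript_id
"""


def est_only_transcripts(conn):
    """Return ids of transcripts supported by dna_align_features only."""
    return [row[0] for row in conn.execute(QUERY)]


def demo_connection():
    """Build an in-memory table with illustrative (made-up) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE transcript_supporting_feature ("
        "transcript_id INTEGER, feature_type TEXT, feature_id INTEGER)"
    )
    conn.executemany(
        "INSERT INTO transcript_supporting_feature VALUES (?, ?, ?)",
        [
            (1, "dna_align_feature", 10),
            (1, "dna_align_feature", 11),
            (2, "dna_align_feature", 12),
            (2, "protein_align_feature", 20),
            (3, "protein_align_feature", 21),
        ],
    )
    return conn


if __name__ == "__main__":
    print(est_only_transcripts(demo_connection()))
```

In the demo data only transcript 1 qualifies: transcript 2 also has protein evidence and transcript 3 has no DNA alignment evidence at all.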
