SOP005

From VectorBase Help System

SOP definition

Gene accuracy: gene structure predicted by an ab initio predictor, with protein domain supporting evidence.


Introduction

This document describes the VectorBase gene structure predictions, with particular reference to the lines of evidence used in prediction. It does not cover the downstream analyses that provide functional annotation or assign gene symbols and descriptions.


Summary of product assigned to this SOP

A gene structure prediction associated with VB:SOP005 has been produced by an ab initio predictor using statistical approaches: codon bias measures of coding potential, matches to the splice site consensus, and termination signals. A number of prediction programs are available for this task, including SNAP (ref), Genefinder (ref), Augustus (ref) and fgenesh (ref). Furthermore, the resulting peptide prediction has a significant match to a protein domain (Pfam analysis), which provides supporting evidence for the validity of the gene call but little evidence for the accuracy of the exon structure.

Such gene structure predictions are viewed with a low level of confidence, with the expectation that many structures will have problems (exon extensions, exon truncations or missed exons). The level of false positives (predictions which are wholly incorrect) for this SOP is expected to be negligible, because conceptual translation of the exon structure yields a peptide with similarity to Pfam domains.

Note that this SOP is assigned to a transcript. The assignment of this SOP to a transcript does not preclude the existence of alternative splice forms (isoforms) for the parental gene. Annotated alternative isoforms have independent SOP assignments.


Prediction process

Preparation

  • Mask repeats in the sequence with RepeatMasker[1], using a custom-produced library (lib) file.

ab initio prediction

  • Generate gene structure predictions using SNAP[2] with default parameters.

Screen predictions for presence of protein domains

  • Translate CDS into peptide set.
  • Screen this set against the Pfam[3] database of protein domains.
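
The translation step above can be sketched as follows. This is a minimal illustration and not the VectorBase pipeline code: it assumes each CDS is already spliced, given 5' to 3' and in frame 0, and it uses the standard genetic code. The function names `translate_cds` and `translate_cds_set` are hypothetical.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), listed in the
# conventional TCAG codon order (first base varies slowest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(cds: str) -> str:
    """Translate one spliced CDS (assumed frame 0) into a peptide.

    Codons containing ambiguous bases become 'X'; an incomplete trailing
    codon is ignored; a terminal stop codon is dropped.
    """
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3
    peptide = "".join(
        CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, usable, 3)
    )
    return peptide[:-1] if peptide.endswith("*") else peptide


def translate_cds_set(cds_by_transcript: dict) -> dict:
    """Translate a mapping of transcript ID -> CDS into ID -> peptide."""
    return {tid: translate_cds(seq) for tid, seq in cds_by_transcript.items()}
```

The resulting peptide set would then be written out (for example as FASTA) and searched against Pfam with whichever domain search tool the build uses; that step is not shown because the source does not specify the tool or its parameters.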

Filtering of results

  • Discard predictions that match Pfam domains of transposable element (TE) origin[4].
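
A minimal sketch of this filter is given below, assuming the Pfam hits have already been parsed into a mapping from transcript ID to Pfam accessions. The function name `filter_te_predictions` and the data layout are hypothetical, and the accessions in the usage example are illustrative only; they are not taken from the VectorBase TE domain list[4].

```python
from typing import Dict, Iterable, Set, Tuple


def strip_version(accession: str) -> str:
    """Drop a trailing version suffix, e.g. 'PF00078.30' -> 'PF00078'."""
    return accession.split(".", 1)[0]


def filter_te_predictions(
    pfam_hits: Dict[str, Iterable[str]],
    te_accessions: Iterable[str],
) -> Tuple[Dict[str, Set[str]], Dict[str, Set[str]]]:
    """Split predictions into (kept, discarded).

    A prediction is discarded if ANY of its Pfam hits is in the TE list.
    """
    te = {strip_version(acc) for acc in te_accessions}
    kept: Dict[str, Set[str]] = {}
    discarded: Dict[str, Set[str]] = {}
    for transcript_id, accessions in pfam_hits.items():
        hits = {strip_version(acc) for acc in accessions}
        if hits & te:
            discarded[transcript_id] = hits
        else:
            kept[transcript_id] = hits
    return kept, discarded


# Usage with illustrative accessions (assumption: not the real TE list).
example_hits = {
    "transcript_A": ["PF00069.28"],
    "transcript_B": ["PF00078.30", "PF00069.28"],
}
example_te_list = ["PF00078", "PF00665"]
kept, discarded = filter_te_predictions(example_hits, example_te_list)
```

Discarding on any TE hit is the strictest reading of the rule above; a build could instead require that all hits be TE-derived, but the source does not say so.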


Mark up in VectorBase gene build

  • Standard build transcripts with no supporting features


Methodology of tracking in the build

  • Transcripts have no peptide_supporting_feature or dna_supporting_feature.
  • Transcripts are not part of the VB-CHADO curation warehouse.
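
The first tracking criterion can be illustrated with a small check on an in-memory transcript record. This is only a sketch: the `Transcript` class and its fields are hypothetical stand-ins for whatever the build database exposes, and the check does not on its own distinguish this SOP from any other SOP that also lacks supporting features.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Transcript:
    """Hypothetical transcript record (not a real VectorBase or Ensembl class)."""
    stable_id: str
    peptide_supporting_features: List[str] = field(default_factory=list)
    dna_supporting_features: List[str] = field(default_factory=list)


def lacks_supporting_features(transcript: Transcript) -> bool:
    """True if the transcript carries neither peptide nor DNA supporting features."""
    return not (
        transcript.peptide_supporting_features
        or transcript.dna_supporting_features
    )


def transcripts_without_support(transcripts: List[Transcript]) -> List[str]:
    """Return the stable IDs of transcripts with no supporting features."""
    return [t.stable_id for t in transcripts if lacks_supporting_features(t)]
```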


References

  1. Smit, AFA, Hubley, R & Green, P. RepeatMasker Open-3.0. (1996-2004) RepeatMasker
  2. Korf, I. (2004) BMC Bioinformatics 5:59
  3. Pfam
  4. List of TE based Pfam domains