SOP002

From VectorBase Help System

SOP definition

Gene accuracy, gene structure inferred from protein data with support from transcript data


Introduction

This document describes the VectorBase gene structure predictions, with particular reference to the lines of evidence used in prediction. It does not cover the downstream analyses that provide functional annotation, nor the assignment of gene symbols or descriptions.


Summary of product assigned to this SOP

A gene structure prediction associated with VB:SOP002 has been predicted on the basis of a homologous protein sequence. The prediction consists of the exon/intron structure corresponding to the CDS (CoDing Sequence) only. The correct initiation methionine and stop codon may be missing, and many structures will be partial predictions. Further credence for these structures comes from supporting transcript coverage, which may include the 5'-UTR and/or 3'-UTR.

Such gene structure predictions are viewed as moderately reliable with strong support for a gene locus.

Note that this SOP is assigned to a transcript. The assignment of this SOP to a transcript does not preclude the existence of alternative splice forms (isoforms) for the parental gene. Annotated alternative isoforms have independent SOP assignments.
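
The caveat above about missing initiation methionine and stop codons can be checked mechanically. The following is a minimal sketch, not part of the VectorBase pipeline: the function name `cds_completeness` is hypothetical, and it assumes the standard genetic code and a CDS supplied as a spliced, forward-strand nucleotide string.

```python
# Assumption: standard genetic code stop codons.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def cds_completeness(cds):
    """Return (has_start, has_stop) for a spliced CDS string.

    has_start: the CDS begins with an ATG (initiation methionine).
    has_stop:  the CDS length is a multiple of 3 and the final codon
               is a stop codon.
    """
    cds = cds.upper()
    has_start = cds.startswith("ATG")
    has_stop = len(cds) >= 3 and len(cds) % 3 == 0 and cds[-3:] in STOP_CODONS
    return has_start, has_stop


# Made-up sequences, for illustration only.
complete = cds_completeness("ATGGCCTAA")
partial = cds_completeness("GCCGGTAAA")
```

A prediction for which either flag is false would count as partial in the sense used above.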


Prediction process

Preparation

  • Mask repeats in the sequence with RepeatMasker[1], using a custom-produced library (lib) file
  • Map the non-redundant protein sequence database[2] to the genomic scaffolds using WU-BLAST[3]
wublastx <database> <query> -hitdist 40 T=14 V=500000 B=500000 wordmask=seg lcmask

Selection of significant hits for prediction

  • The genome is split into 1 Mb slices
  • For each slice the peptide similarities with a score greater than 200 are extracted for re-alignment to the genome.
default blastx score threshold = 200
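
The selection step above can be sketched as follows. This is an illustration under stated assumptions rather than the build's actual code: the hit tuple layout, the rule that a hit belongs to the slice containing its start coordinate, and the names `select_hits`, `pepA` and `pepB` are all hypothetical. Only the 1 Mb slice size and the score threshold of 200 come from the text.

```python
from collections import defaultdict

SLICE_SIZE = 1_000_000   # 1 Mb slices, as stated above
SCORE_THRESHOLD = 200    # default blastx score threshold, as stated above


def select_hits(hits, slice_size=SLICE_SIZE, threshold=SCORE_THRESHOLD):
    """Group peptide hits by genome slice, keeping scores above the threshold.

    `hits` is an iterable of (scaffold, start, end, peptide_id, score)
    tuples with 1-based coordinates (an assumed layout).
    Returns {(scaffold, slice_index): set of peptide ids}.
    """
    selected = defaultdict(set)
    for scaffold, start, end, peptide_id, score in hits:
        if score <= threshold:
            continue  # the text requires a score greater than 200
        # Assumption: assign each hit to the slice containing its start.
        slice_index = (start - 1) // slice_size
        selected[(scaffold, slice_index)].add(peptide_id)
    return dict(selected)


# Made-up hits, for illustration only.
hits = [
    ("scaffold_1", 1_500, 2_400, "pepA", 350.0),
    ("scaffold_1", 999_000, 1_000_800, "pepB", 200.0),
    ("scaffold_1", 1_200_000, 1_201_000, "pepA", 512.0),
]
selection = select_hits(hits)
```

Here `pepB` is dropped because its score equals the threshold rather than exceeding it, and `pepA` is selected for re-alignment in both slices.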

Gene prediction

  • The selected peptides are realigned to the genome using genewise[4].
  • Genewise produces a predicted gene structure that attempts to represent the maximal alignment of the peptide to the genome, including knowledge of splice site donor/acceptor consensus sequences and the ability to accommodate frame-shifts in the alignment (through small introns < 10 bp).
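
Because frame-shifts are represented as small introns, a consumer of these structures may want to tell them apart from real introns. The sketch below shows one way to do that, assuming exons are given as sorted, non-overlapping, 1-based inclusive coordinate pairs; the function name `classify_introns` is hypothetical and only the 10 bp cut-off comes from the text.

```python
FRAMESHIFT_MAX_INTRON = 10  # introns shorter than this model frame-shifts


def classify_introns(exons, max_frameshift=FRAMESHIFT_MAX_INTRON):
    """Split the introns implied by `exons` into (real, frameshift) lists.

    `exons` must be sorted by start, with 1-based inclusive (start, end)
    coordinates. Each intron is returned as an inclusive (start, end) pair.
    """
    real, frameshift = [], []
    for (_, prev_end), (next_start, _) in zip(exons, exons[1:]):
        intron = (prev_end + 1, next_start - 1)
        length = intron[1] - intron[0] + 1
        if length < max_frameshift:
            frameshift.append(intron)
        else:
            real.append(intron)
    return real, frameshift


# Made-up exon coordinates, for illustration only.
exons = [(100, 250), (253, 400), (900, 1050)]
real, frameshift = classify_introns(exons)
```

In this example the 2 bp gap between the first two exons is reported as a frame-shift, while the 499 bp gap is reported as a real intron.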


Merge of possible gene predictions into a single canonical one

  • Multiple gene structures for a locus are merged/collapsed into single predictions using the Ensembl ClusterMerge algorithm. Note that the resulting structures can be a composite of the primary genewise structures and hence be supported by multiple peptides.
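
The idea of collapsing overlapping structures into one composite per locus can be illustrated with a simple interval-overlap clustering. This is deliberately not the Ensembl ClusterMerge algorithm: it is a simplified stand-in that works on coordinates only, ignores reading frame, and assumes all predictions lie on the same scaffold and strand. The names `merge_intervals` and `collapse_predictions` are hypothetical.

```python
def merge_intervals(intervals):
    """Union overlapping inclusive (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def collapse_predictions(predictions):
    """Collapse overlapping predictions into composite loci.

    `predictions` is a list of (peptide_id, [(exon_start, exon_end), ...]).
    Predictions whose genomic spans overlap are placed in the same locus;
    the composite keeps the union of their exons and the list of
    supporting peptides.
    """
    ordered = sorted(predictions, key=lambda p: min(s for s, _ in p[1]))
    loci = []
    for peptide_id, exons in ordered:
        start = min(s for s, _ in exons)
        end = max(e for _, e in exons)
        if loci and start <= loci[-1]["end"]:
            locus = loci[-1]
            locus["end"] = max(locus["end"], end)
            locus["exons"] = merge_intervals(locus["exons"] + exons)
            locus["peptides"].append(peptide_id)
        else:
            loci.append({
                "start": start,
                "end": end,
                "exons": merge_intervals(exons),
                "peptides": [peptide_id],
            })
    return loci


# Made-up predictions, for illustration only.
predictions = [
    ("pepA", [(100, 200), (300, 400)]),
    ("pepB", [(350, 450), (600, 700)]),
    ("pepC", [(5000, 5100)]),
]
loci = collapse_predictions(predictions)
```

The first locus ends up as a composite supported by both `pepA` and `pepB`, which mirrors the note above that a merged structure can be supported by multiple peptides.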


Common problems with genewise predictions

  • Partial predictions (loss of alignment quality near the edges of the alignment)
  • Regions of low alignment quality (e.g. low-complexity regions) are poorly represented
  • Tandemly repeated gene families can confuse genewise, producing hybrid predictions from the repeated genes
  • Repeated multi-domain proteins (e.g. multiple Ig/EGF domains) can confuse genewise and lead to many alternative isoform predictions


Mark up in VectorBase gene build

  • Standard similarity-build transcript predictions that coincide with EST_genes and may have added UTR spans.


Methodology of tracking in the build

  • Transcripts have peptide_supporting_feature and dna_supporting_feature.
  • Transcripts are not part of the VB-CHADO curation warehouse.


References

  1. Smit, AFA, Hubley, R & Green, P. RepeatMasker Open-3.0. (1996-2004) RepeatMasker
  2. UniProt
  3. Gish, W. (1996-2004) WU-BLAST
  4. Birney, E, Clamp, M & Durbin, R. (2004) GeneWise and Genomewise. Genome Research 14:988-995