Expression Browser User Guide

VBGE: VectorBase Gene Expression

VectorBase aims to make all published gene expression data available in an easy to use format. The primary goal is to provide gene-based summaries of expression across a range of experiments, and to integrate these seamlessly into the VectorBase genome browser and other resources. Data from different laboratories is processed through the same pipeline so that results can be compared side-by-side.

Some screenshots of VBGE's capabilities are shown below. Click through to see the full-size versions.

Screenshot (VBGE Gene Summary): gene expression summary for AGAP011294.
Screenshot (VBGE Plots): developmental series profile for DEF1. Above: gene-averaged profile for two reporters. Below: the two unaveraged reporter profiles.

Data processing

Before any statistical analysis is done, the VectorBase Gene Expression curator(s) load the array designs and microarray data into our databases.

Microarrays

In collaboration with the group that developed the microarray, we obtain the sequence(s) for each probe on the array. These sequences are aligned to the relevant genome assembly (see "How do you map the microarrays to the genome" for more details) and the alignments are added to the VectorBase genome database so that they may be displayed as tracks on the genome browser. Associations with genes are also made in the VectorBase genome database (for display on gene pages).

The alignments of reporters to the genome and gene associations are redone when necessary (i.e. new assembly or gene build).

Microarray designs are also loaded into our BASE database.

RNA-Seq

Trimmed RNA-Seq reads are obtained from the authors to ensure data quality. The reads are then run through a pipeline consisting of alignment to a reference genome (using TopHat) and transcript assembly (using Cufflinks) to obtain FPKM values (Fragments Per Kilobase of exon per Million fragments mapped). Reads are re-run through this pipeline whenever the reference genome or transcriptome is updated. FPKM values are then transformed to a scale comparable to the microarray expression data, i.e. log2(FPKM + 1).
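As a minimal sketch of the final transformation step only (the function name is hypothetical and this is not the VectorBase pipeline code), the log2(FPKM + 1) conversion could be written as:

```python
import math


def fpkm_to_log2(fpkm: float) -> float:
    """Transform an FPKM value to the log2(FPKM + 1) scale."""
    if fpkm < 0:
        raise ValueError("FPKM cannot be negative")
    # Adding 1 keeps the result defined (and equal to 0) when FPKM is 0.
    return math.log2(fpkm + 1)


print(fpkm_to_log2(0.0))  # 0.0
print(fpkm_to_log2(3.0))  # 2.0
```

The +1 offset means that unexpressed transcripts map to zero rather than to negative infinity.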

Reference transcriptomes are loaded into our BASE database, in the same fashion as a microarray design.

Experimental data

All expression data in the resource is re-analysed from raw unprocessed data. For microarrays, the quantified scans enter the pipelines below. The raw reads from RNA-Seq are processed as described in the RNA-Seq section above.

Standard processing pipeline: two channel data

  1. calculate spot intensities for each channel from the raw data ("raw bioassay" in BASE):
    1. ch1 = channel 1 foreground median - channel 1 background median
      (foreground median = median of pixel intensities inside spot, background median = median of pixel intensities surrounding spot)
    2. ch2 = channel 2 foreground median - channel 2 background median
  2. discard spots flagged as problematic (by the investigator or image analysis software)
  3. perform lowess normalisation using default parameters in BASE
  4. calculate log2(ch1/ch2) as the final "expression value" used in subsequent statistical analysis

In our database, channel 1 is always the "experimental sample" and channel 2 is always the "reference" or "control sample". That is, we swap the channels accordingly while loading the data to take into account dye swaps or different laboratory procedures.
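The per-spot arithmetic in steps 1, 2 and 4, together with the channel swap, can be sketched as follows. This is an illustration only: the function and parameter names are hypothetical, the lowess normalisation of step 3 is performed in BASE and is deliberately omitted here, and discarding spots with non-positive background-subtracted intensities is an assumption made so that the logarithm is defined.

```python
import math
from typing import Optional


def log2_ratio(ch1_fg: float, ch1_bg: float,
               ch2_fg: float, ch2_bg: float,
               flagged: bool = False,
               dye_swap: bool = False) -> Optional[float]:
    """Return log2(ch1/ch2) for one spot, or None if the spot is discarded."""
    if flagged:
        return None  # step 2: spot flagged as problematic
    # Step 1: foreground median minus background median for each channel.
    ch1 = ch1_fg - ch1_bg
    ch2 = ch2_fg - ch2_bg
    if dye_swap:
        # Keep channel 1 as the experimental sample and channel 2 as the reference.
        ch1, ch2 = ch2, ch1
    if ch1 <= 0 or ch2 <= 0:
        return None  # assumption: such spots cannot be log-transformed
    # Step 4 (step 3, lowess normalisation, is not shown in this sketch).
    return math.log2(ch1 / ch2)


print(log2_ratio(900, 100, 300, 100))                 # 2.0
print(log2_ratio(900, 100, 300, 100, dye_swap=True))  # -2.0
```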

Standard processing pipeline: Affymetrix data

Note that each "spot" (=probe) on an Affymetrix array belongs to a probeset. On the original Anopheles/Plasmodium array there are 11 probes per probeset. A probeset is designed to report on a specific gene or a part of the gene (coding sequence or UTR for example).

  1. probeset intensities are calculated in BASE directly from raw data (.CEL) files using the Robust Multichip Average algorithm implemented in the RMAExpress BASE plug-in.
  2. the resulting logged intensities are the final "expression values" used in subsequent statistical analysis

Statistical analysis

Averaging for multiple spots and reporters

All of the statistical methods described in the following sections ask the same question in one way or another: "Do the observed expression values differ from those expected by chance?" An assumption is made that the observed expression values are independent of each other and are normally distributed. We define independent expression values as those which come from different hybridisations (we currently treat biological and technical hybridisation replicates as the same but acknowledge that this is not ideal; see VBGE feature requests).

However, a reporter may be spotted more than once onto an array, or if we are calculating a gene expression summary, there may be more than one reporter corresponding to that gene. We therefore perform an arithmetic average of expression values over all spots for a particular reporter, or all reporters for a particular gene (or a combination of both) for each hybridisation. The result is one value per hybridisation (e.g. one value per replicate) that is then fed into the statistical analysis. All the variance contributing to that mean is ignored in the subsequent analysis (again, we concede that more sophisticated methods might be appropriate).
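A minimal sketch of this averaging step is shown below. The input layout, identifiers and function name are hypothetical, and the sketch assumes a flat average over all contributing spots in a hybridisation (rather than averaging per-reporter means first), which is one possible reading of the description above.

```python
from collections import defaultdict
from statistics import mean


def average_per_hybridisation(spots):
    """Collapse spot-level values to one mean value per hybridisation.

    spots: iterable of (hybridisation_id, reporter_id, expression_value).
    """
    by_hyb = defaultdict(list)
    for hyb_id, _reporter_id, value in spots:
        by_hyb[hyb_id].append(value)
    # The spread of the values within a hybridisation is discarded here.
    return {hyb_id: mean(values) for hyb_id, values in by_hyb.items()}


spots = [
    ("hyb1", "repA", 1.0), ("hyb1", "repA", 2.0), ("hyb1", "repB", 3.0),
    ("hyb2", "repA", 0.5), ("hyb2", "repB", 1.5),
]
print(average_per_hybridisation(spots))  # {'hyb1': 2.0, 'hyb2': 1.0}
```

Only the resulting per-hybridisation values would then enter a statistical test.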

Only uniquely mapped reporters are included in gene-averages

When a reporter maps to more than one gene, its data is not included in the expression summary for any of those genes. These ambiguous reporters are listed in a second "Reporter details" table on the gene expression summary page (example valid at time of writing: AGAP001762). You can always click on any reporter link to see the expression data for that reporter. You can also create custom URLs to average any set of reporters you like (see unaveraged reporter plots for details).
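The rule can be illustrated with a short sketch that separates uniquely mapped reporters from ambiguous ones. The mapping structure and all identifiers below are hypothetical and are not taken from the VectorBase database.

```python
def split_reporters(reporter_to_genes):
    """Split reporters into uniquely mapped and ambiguous ones.

    reporter_to_genes: dict of reporter ID -> set of gene IDs it maps to.
    Returns (dict of reporter -> its single gene, sorted list of ambiguous reporters).
    """
    unique = {}
    ambiguous = []
    for reporter, genes in reporter_to_genes.items():
        if len(genes) == 1:
            unique[reporter] = next(iter(genes))
        elif len(genes) > 1:
            ambiguous.append(reporter)
    return unique, sorted(ambiguous)


mapping = {
    "repA": {"GENE1"},
    "repB": {"GENE1", "GENE2"},  # maps to two genes, so excluded from both averages
    "repC": {"GENE2"},
}
unique, ambiguous = split_reporters(mapping)
print(unique)     # {'repA': 'GENE1', 'repC': 'GENE2'}
print(ambiguous)  # ['repB']
```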

Statistical test: ANOVA

ANOVA tests for differential expression between two or more groups.


A standard one-way ANOVA, as implemented in the Apache Commons Math Java library, is used. When only two groups are compared it is equivalent to an unpaired t-test.
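For readers unfamiliar with the calculation, the following sketch computes the one-way ANOVA F statistic from first principles. It is an illustration using the Python standard library only, not the Java implementation used by VBGE, and it stops at the F statistic: converting F to a p-value requires the F distribution, which is not shown.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of values."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    if k < 2 or n_total <= k:
        raise ValueError("need at least two groups and more values than groups")
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the values around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = k - 1
    df_within = n_total - k
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within


f, df1, df2 = one_way_anova_f([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]])
print(round(f, 6), df1, df2)  # 13.0 2 6
```

With exactly two groups, F equals the square of the pooled-variance unpaired t statistic, which is why the two tests give the same p-value.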


Statistical test: t-test

A t-test is used to identify a significant fold-change. Applicable to true ratio data only.


The null hypothesis is that the arithmetic mean of the log2 ratios is zero (equivalent to a fold change of 1, meaning no change). You can visualise this test on the plot by looking for overlap between the 95% confidence interval error bars and the y=0 line.

Only true ratios from two colour experiments can be analysed with this test. Two colour standard reference experiments, such as the Koutsos et al. developmental series, do not give true ratios, and so this test is not applicable.

Note that no fold-change cut-off (e.g. 2-fold or more) is applied. Therefore, for example, we may identify statistically significant 1.1-fold changes that have little or no biological significance.
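A sketch of the test statistic for this null hypothesis (mean log2 ratio equal to zero) is given below. The function name is hypothetical; the sketch returns the t statistic, the degrees of freedom and the mean fold change, and leaves out the p-value, which requires the t distribution.

```python
import math
from statistics import mean, stdev


def one_sample_t(log2_ratios):
    """Test statistic for H0: the mean log2 ratio is zero.

    Returns (t, degrees_of_freedom, mean_fold_change).
    """
    n = len(log2_ratios)
    if n < 2:
        raise ValueError("at least two replicate values are needed")
    m = mean(log2_ratios)
    standard_error = stdev(log2_ratios) / math.sqrt(n)
    return m / standard_error, n - 1, 2 ** m


t, df, fold = one_sample_t([1.0, 1.5, 0.5, 1.0])
print(round(t, 3), df, fold)  # 4.899 3 2.0
```

Because the statistic divides the mean by its standard error, a small mean ratio with very consistent replicates can still give a large t value, which is how a 1.1-fold change can be statistically significant.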


Statistical test: Neighbour t-test

An unpaired t-test is applied to expression values from neighbouring points in a series.


This test applies to time series and developmental profiles.
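The idea can be sketched as a loop over adjacent points in the series. The names below are hypothetical, and the sketch assumes the pooled-variance (Student) form of the unpaired t-test; the guide does not state which variant VBGE uses. Only the t statistic is computed, not the p-value.

```python
import math
from statistics import mean, variance


def pooled_t(a, b):
    """Unpaired t statistic (pooled variance) for mean(a) - mean(b)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var * (1 / na + 1 / nb))


def neighbour_t(series):
    """series: ordered list of (label, values). Compare each point with the next."""
    return [((l1, l2), pooled_t(v2, v1))
            for (l1, v1), (l2, v2) in zip(series, series[1:])]


series = [("0h", [1.0, 2.0, 3.0]), ("2h", [2.0, 3.0, 4.0]), ("4h", [5.0, 6.0, 7.0])]
for pair, t in neighbour_t(series):
    print(pair, round(t, 3))
# ('0h', '2h') 1.225
# ('2h', '4h') 3.674
```

A series of n points yields n - 1 neighbour comparisons, each with its own p-value.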


Web interface

Data model

To navigate this web application effectively, some understanding of the underlying data model is useful. We have tried to make it as simple as possible. Every arrow on the diagram below represents a navigable route between the web pages (e.g. from sample to hybridisation, reporter to experiment).

Image:VBGE_data_model.png

For example, if you searched for "pupa" and got back a list of samples, you can navigate in a few steps through to experiment and reporters to see some expression data. Note that the data model in our underlying database is more complex and comprehensive.

Expanding and contracting

Click on the arrows or yellow area to toggle between displaying just the best p-value (default) and all calculated p-values for each experiment.


Some experiments have more than one p-value calculated for them. For example, the blood meal time series has an ANOVA p-value and several t-test p-values calculated between each neighbouring time step. In the summary for AGAP007032 you can toggle between showing only the best p-value and all p-values by clicking on the double down arrows below the blood meal time series row (or on any nearby area that lights up yellow on mouse-over). To collapse the table, click on the double up arrows at the bottom of the expanded region.

Note that when a page is reloaded the table reverts to the default unexpanded state.


Unexpanded:

Image:vbge_unexpanded.jpg

Expanded:

Image:vbge_expanded.jpg

Expression summary columns

The columns of the expression summary table provide the following information:

Expression summary column: Experiment

This column contains the experiment name and links to the experiment page, detailed plots and tables, and a summary of the reporters that were used.


Most of this information is self-explanatory. However, be sure to click at least once on the plots and data link, as this is the main way to see summary plots and data tables for an experiment.

The last piece of information in this row summarises which reporters provided data for the expression summary statistics. Some experiments use microarrays where more than one reporter is associated with some or all genes. One such gene is AGAP007032, where some experiments (e.g. those using the Affymetrix array) have two reporters associated with this gene. Click the "show details" link to see the reporters used (see screenshot below). The links take you to the reporter-experiment page.

VBGE Experiment Column


Expression summary column: P-value

This column contains the p-values calculated for each experiment or sub-experiment. They are sorted best-first.


A p-value is the probability of obtaining a result at least as extreme as the one observed if the null hypothesis (no differential expression) were true. VectorBase applies no correction for multiple hypothesis testing because we do not know how many genes you are looking at. The default significance threshold (0.05) has been chosen arbitrarily. If you are looking at 100 genes, we recommend that you adopt a p-value threshold of 0.05/100 = 0.0005 (this is the widely used Bonferroni correction).
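The Bonferroni adjustment is simple enough to apply yourself. The sketch below uses hypothetical function names and made-up p-values purely for illustration.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test p-value threshold after Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


def significant(p_values, alpha=0.05):
    """Return the p-values that pass the Bonferroni-corrected threshold."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [p for p in p_values if p < threshold]


# Looking at 100 genes at an overall level of 0.05 gives a per-gene threshold near 0.0005.
print(bonferroni_threshold(0.05, 100))

# Three hypothetical p-values: the threshold is 0.05 / 3, so only the first passes.
print(significant([0.001, 0.02, 0.04]))  # [0.001]
```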


Expression summary column: Statistical test

This column indicates which type of statistical test was used to generate the p-value.


See Statistical analysis.


Expression summary column: Experimental factor

This column shows you which experimental factors were investigated and used for statistical analysis for each experiment. Mouse-over the icons for further information.


See Experimental factor icons.


Expression summary column: Summary

This column provides a text summary of the gene expression based on the statistics performed at the 0.05 significance level.


For the ANOVA tests, the up and down arrows indicate the conditions where the highest and lowest expression values were found.

The left and right arrows specify which conditions were compared in the neighbour t-test.

A text summary indicating the fold change and direction is given for the standard t-test.


P-value colour code

The P-values calculated from each experiment or sub-experiment are colour-coded for your convenience. The strongest red indicates the most significant result, grading through pink to white, which indicates no significance at the 0.05 level.


Here is the colour key for the P-values presented in the expression summaries.


Experimental factor icons

The expression summary for a gene or reporter contains quite a lot of information. Experimental factor icons are provided to allow the frequent user to locate the experiments of interest without having to read down the entire list of experiment names. For example, if you are interested in male/female expression differences you can scan down the list looking for the gender symbol.

Many thanks to VectorBase's resident artist Neil Lobo for the excellent icons!

Icon: Age

Experimental factor Age. This experiment aims to measure gene expression with respect to age (usually hours or days).

Age.png


Icon: Time

Experimental factor Time. This experiment aims to measure gene expression with respect to time (usually minutes, hours or days).

Time.png


Icon: Compound

Experimental factor Compound. This experiment aims to measure gene expression with respect to treatment with a drug or other foreign compound.

Compound.png



Icon: DevelopmentalStage

Experimental factor Developmental stage. This experiment aims to measure gene expression with respect to the developmental stage of an organism, e.g. embryo, larva, pupa, adult.

DevelopmentalStage.png


Icon: DiseaseStaging

Experimental factor Disease staging. This experiment aims to measure gene expression with respect to the progression of a disease or infection.

DiseaseStaging.png


Icon: DiseaseState

Experimental factor DiseaseState. This experiment aims to measure gene expression with respect to one or more forms of a disease or infection.

DiseaseState.png


Icon: Dose

Experimental factor Dose. This experiment aims to measure gene expression with respect to the amount of Compound given.

Dose.png

See also: Compound.


Icon: GeneticModification

Experimental factor GeneticModification. This experiment aims to measure gene expression with respect to an inheritable genetic modification (e.g. mutant).

GeneticModification.png


Icon: Genotype

Experimental factor Genotype. This experiment aims to measure gene expression with respect to different genotypes.

Genotype.png


Icon: GrowthCondition

Experimental factor GrowthCondition. This experiment aims to measure gene expression with respect to growth conditions, typically relating to feeding.

GrowthCondition.png


Icon: MaterialType

Experimental factor Material Type. This experiment aims to measure gene expression with respect to different fractions of RNA.

MaterialType.png


Icon: OrganismPart

Experimental factor Organism part. This experiment aims to measure gene expression with respect to different tissues, organs or body parts. In the icon, the body parts are coloured differently to signify this.

OrganismPart.png


Icon: RNAi

Experimental factor RNAi. This experiment aims to measure gene expression with respect to a knock-down of one or more genes using double-stranded RNA. This is a temporary effect (cf. GeneticModification).

RNAi.png

See also: GeneticModification.


Icon: Sex

Experimental factor Sex. This experiment aims to measure gene expression with respect to the gender of the organism(s).

Sex.png


Icon: StrainOrLine

Experimental factor StrainOrLine. This experiment aims to measure gene expression with respect to the strain(s) or cell line(s) used.

StrainOrLine.png


Expression plots and tables

Main expression plot

log2 transformed mean expression values (ratios or absolute values, depending on the microarray technology) are plotted with error bars indicating the 95% confidence interval of the mean.


Expression values are log2 transformed ratios from two channel experiments or log2 transformed expression values from single channel platforms, such as Affymetrix.

If the confidence interval is very narrow or not visible, check the number of replicates in the table below the plot. Confidence intervals cannot be calculated when n=1.

Lines are drawn between data points for all experiments identified as being time or developmental series.

The "y" axis is forced to pass through zero for true ratio experiments.

Note that a PDF download link is available underneath the plot. If you require a bitmap image, tell your browser to view the image on its own (usually a right-click menu option). Then you may edit the URL (width=nnn and height=nnn parameters) to produce an image of the size you need (up to a maximum of 1600x1200 pixels), and then save the image using your web browser.

Help with log2

  • A log2 ratio close to zero means no change.
  • A log2 ratio of 1 means a two-fold upregulation.
  • A log2 ratio of 2 means a four-fold upregulation.
  • A log2 ratio of -1 means a two-fold downregulation.
  • A log2 ratio of -3 means an eight-fold downregulation.
  • If two non-ratio (e.g. Affymetrix or two-colour standard reference) log2 expression values differ by 2 then this represents a four-fold change.
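The rules of thumb above all follow from fold change = 2^|log2 ratio|. A small helper (the name is hypothetical) makes the conversion explicit:

```python
def describe_log2_ratio(log2_ratio: float):
    """Convert a log2 ratio to (fold_change, direction)."""
    fold = 2.0 ** abs(log2_ratio)
    if log2_ratio > 0:
        direction = "up"
    elif log2_ratio < 0:
        direction = "down"
    else:
        direction = "no change"
    return fold, direction


print(describe_log2_ratio(1.0))   # (2.0, 'up')
print(describe_log2_ratio(2.0))   # (4.0, 'up')
print(describe_log2_ratio(-3.0))  # (8.0, 'down')
```

The same function applies to the difference between two non-ratio log2 expression values: a difference of 2 gives a four-fold change.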


Unaveraged reporter plots

Clicking on these plots or the reporter ID takes you to the full expression report for the individual reporter and experiment.


Check that these plots agree with the overall averaged plot above them. If one or more reporters do not agree (an example is AGAP000029 with respect to blood feeding), you may wish to calculate a custom average. To do this, type a URL of the format https://www.vectorbase.org/expression-browser/reporter/id1/id2/id3 into your browser (replacing id1, id2 etc. with the IDs of the reporters that do agree with each other).

Note that differences in expression could result from

  • real biological differences (e.g. alternative transcripts)
  • hybridisation problems for particular reporters
  • technical problems with the reporter↔gene mapping in VectorBase (which you can evaluate in the genome browser)
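If you build custom-average URLs often, the format described above can be generated with a few lines of code. This is a sketch only: the function name is hypothetical, the reporter IDs are the placeholders from the text, and percent-encoding of the IDs is an assumption rather than a documented requirement of the site.

```python
from urllib.parse import quote

# URL prefix taken from the format given in this guide.
BASE_URL = "https://www.vectorbase.org/expression-browser/reporter"


def custom_average_url(reporter_ids):
    """Build a custom-average URL from a list of reporter IDs."""
    if not reporter_ids:
        raise ValueError("at least one reporter ID is required")
    # Assumption: IDs are percent-encoded so that unusual characters stay inside one path segment.
    return BASE_URL + "/" + "/".join(quote(r, safe="") for r in reporter_ids)


print(custom_average_url(["id1", "id2", "id3"]))
# https://www.vectorbase.org/expression-browser/reporter/id1/id2/id3
```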


Statistical summary table

The table of statistical summary information (mean, standard deviation, etc.) corresponds to the data plotted above it and is based on hybridisation-averaged data. The table can be re-sorted by clicking on the appropriate column heading. You may export the entire table to XML or CSV (comma-separated value) format text files.

Spot data table

The spot data table lists all normalised but unaveraged spot data. The table can be re-sorted by clicking on the appropriate column heading. You may export the entire table to XML or CSV (comma-separated value) format text files.

Click on the hybridisation links to learn more about the samples (age, sex, strain etc).

Other documentation

Frequently asked questions

Please see Frequently asked questions

Feature requests

Please see VBGE feature requests