Data to Diamonds: Multivariate Datamining Leads to Concise Gene Signatures for Disease Classification

In recent years there has been significant growth in the volume of data generated by expression experiments. The advent of high-throughput gene chips (Affymetrix, cDNA, ...) has led to the capture of massive amounts of information in a single experiment. Currently, experimenters can measure the expression levels of 5,000 to 30,000 genes for each sample. Protein expression technology is not far behind.

CSIRO Bioinformatics has developed a data analysis technique which rapidly sifts through very large numbers of expression measurements to identify the genes that form the best predictive set.

Data volumes, data problems

Of course, access to such large quantities of data raises interesting new questions about how the data should be analysed.

  1. In many cases, scientists are interested simply in finding relationships among the samples or genes measured, with no particular target property in mind. This unsupervised approach has led to many interesting studies based on various forms of cluster analysis (hierarchical, k-means, self-organising maps).
  2. In other cases, there may be a design amongst the samples, and the experimenter is interested in which genes are affected by the differences between samples, and how; for example, cell cycle experiments in which the samples are taken through time.
  3. A third form of analysis has a target property of the samples in mind and asks which genes might be used to predict or explain that property. Examples include finding the genes whose expression levels diagnose the presence of a disease, predicting the outcome of a particular treatment given gene expression, and predicting the survival time of a particular patient with known gene expression.

In most analyses of this third type, the majority of the genes are not relevant. It is for this supervised data analysis problem that CSIRO Bioinformatics has designed the Gene-Rave methodology.

Gene-Rave - an integrated analysis

Most supervised methods require some form of preselection of genes, and this is usually achieved by considering the association of each gene with the target in turn. This form of preselection can have serious drawbacks: ad hoc methods often need to be used, depending on the nature of the target. Further, if genes are relevant only in combination, gene-by-gene selection can miss them. Gene-Rave overcomes this problem by integrating the modelling of the target and gene selection into a single process.

Figure: Two genes separate classes, but neither does on its own.
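
The point that two genes can separate classes jointly when neither does alone can be illustrated with a small made-up data set. The sketch below is only an illustration: the expression values, the helper `best_threshold_accuracy`, and the choice of the difference `gene1 - gene2` as the combined score are all assumptions, not part of Gene-Rave.

```python
def best_threshold_accuracy(values, labels):
    """Best accuracy of any single-threshold rule on one score (either direction)."""
    n = len(values)
    best = 0.0
    for threshold in sorted(set(values)):
        # Rule: predict class 1 when the score is at or above the threshold.
        correct = sum((v >= threshold) == bool(y) for v, y in zip(values, labels))
        # Also consider the flipped rule (predict class 0 above the threshold).
        best = max(best, correct / n, (n - correct) / n)
    return best


# Hypothetical (gene1, gene2) expression values for eight samples.
samples = [(1.0, 2.0), (2.0, 3.0), (3.0, 4.0), (4.0, 5.0),   # class 0: gene2 > gene1
           (2.0, 1.0), (3.0, 2.0), (4.0, 3.0), (5.0, 4.0)]   # class 1: gene1 > gene2
labels = [0, 0, 0, 0, 1, 1, 1, 1]

gene1 = [s[0] for s in samples]
gene2 = [s[1] for s in samples]
combined = [a - b for a, b in samples]   # a linear combination of the two genes

acc_gene1 = best_threshold_accuracy(gene1, labels)
acc_gene2 = best_threshold_accuracy(gene2, labels)
acc_combined = best_threshold_accuracy(combined, labels)
print(acc_gene1, acc_gene2, acc_combined)
```

On data like this, a gene-by-gene screen sees two mediocre genes, while a model that weights both genes together classifies every sample correctly.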

Gene-Rave models target sample properties using a Generalised Linear Models framework. This family of models encompasses:

Gene selection is achieved by use of a Bayesian prior built into the model. The prior formalises the idea that most genes have zero weight in the model, and as the model fitting process proceeds, genes are very rapidly eliminated from the model.

Figure panels: Iteration Number 1, Iteration Number 2, Iteration Number 3, Iteration Number 20.

The plots above show model weight for more than 4,000 genes, at four different iterations. Gene-Rave rapidly eliminates genes from the model.
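
The effect of a sparsity-favouring prior can be sketched with a well-known stand-in: logistic regression with an L1 penalty, which corresponds to a Laplace prior on the weights under maximum a posteriori estimation. This is not the Gene-Rave prior or its fitting algorithm, neither of which is specified here. The function names, the penalty and step values, the dense starting weights, and the simulated data are all assumptions made for illustration.

```python
import math
import random


def soft_threshold(value, threshold):
    """Shrink value towards zero; values within the threshold become exactly 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_sparse_logistic(X, y, penalty=0.1, step=0.5, iterations=100, start=0.1):
    """L1-penalised logistic regression fitted by proximal gradient descent.

    Returns the weights, the intercept, and the number of genes with a
    nonzero weight after each iteration.
    """
    n, p = len(X), len(X[0])
    weights = [start] * p          # assumption: begin with every gene in the model
    intercept = 0.0
    genes_in_model = []
    for _ in range(iterations):
        grad_w = [0.0] * p
        grad_b = 0.0
        for row, target in zip(X, y):
            z = intercept + sum(w * x for w, x in zip(weights, row))
            residual = 1.0 / (1.0 + math.exp(-z)) - target
            grad_b += residual / n
            for j in range(p):
                grad_w[j] += residual * row[j] / n
        intercept -= step * grad_b
        # Gradient step followed by shrinkage; small weights are set to exactly zero.
        weights = [soft_threshold(w - step * g, step * penalty)
                   for w, g in zip(weights, grad_w)]
        genes_in_model.append(sum(1 for w in weights if w != 0.0))
    return weights, intercept, genes_in_model


# Simulated data: 100 samples, 30 "genes"; only genes 0 and 1 carry signal.
rng = random.Random(0)
X = [[rng.gauss(0.0, 1.0) for _ in range(30)] for _ in range(100)]
y = [1 if row[0] > row[1] else 0 for row in X]

weights, intercept, genes_in_model = fit_sparse_logistic(X, y)
selected = [j for j, w in enumerate(weights) if w != 0.0]
print("genes in model after iterations 1, 2, 3, 20:",
      [genes_in_model[i] for i in (0, 1, 2, 19)])
print("genes with nonzero weight at the end:", selected)
```

The printed counts play the role of the plots: the number of genes with nonzero weight drops as fitting proceeds, and the two informative genes remain in the final model.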

Advantages

The parsimonious models constructed in this way:

Verification

Cross validation is used to estimate prediction error. It involves dividing the data samples into v groups. Each group in turn is removed from the data and the model is fitted to the remaining v - 1 groups. The model is then used to predict the left-out group. This process is repeated for each of the groups and the overall prediction error is assessed.
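
The v-fold procedure described above can be sketched as follows. The function `cross_validation_error`, the toy nearest-class-mean model (`fit_class_means`, `predict_nearest_mean`), and the data are hypothetical; in practice `fit` would be the full model-building step, including any gene selection, so that selection is repeated within each fold.

```python
def cross_validation_error(samples, labels, fit, predict, v=5):
    """Estimate the misclassification rate by v-fold cross validation."""
    n = len(samples)
    # Assign sample i to group i mod v, giving v groups of near-equal size.
    groups = [list(range(start, n, v)) for start in range(v)]
    errors = 0
    for held_out in groups:
        held = set(held_out)
        train_x = [samples[i] for i in range(n) if i not in held]
        train_y = [labels[i] for i in range(n) if i not in held]
        model = fit(train_x, train_y)            # fit on the other v - 1 groups
        errors += sum(predict(model, samples[i]) != labels[i] for i in held_out)
    return errors / n


def fit_class_means(values, labels):
    """Toy model: the mean expression value of each class."""
    means = {}
    for label in set(labels):
        members = [v for v, l in zip(values, labels) if l == label]
        means[label] = sum(members) / len(members)
    return means


def predict_nearest_mean(means, value):
    """Predict the class whose mean is closest to the value."""
    return min(means, key=lambda label: abs(value - means[label]))


# Hypothetical expression values of a single gene in two well-separated classes.
values = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

cv_error = cross_validation_error(values, labels, fit_class_means,
                                  predict_nearest_mean, v=5)
print("cross-validated error rate:", cv_error)
```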

Permutation tests are used to assess the significance of a model. By randomly permuting the target labels or values of the samples and rebuilding the model each time, the likelihood of achieving an equally good result by chance can be evaluated.
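
A permutation test of this kind can be sketched as below. To keep the example short, the "model" is reduced to a simple statistic, the absolute difference between class means; a real analysis would rebuild the full model for each permutation. The function names, the data, and the number of permutations are assumptions.

```python
import random


def mean_difference(values, labels):
    """Absolute difference between the class means (a stand-in for model fit)."""
    group1 = [v for v, l in zip(values, labels) if l == 1]
    group0 = [v for v, l in zip(values, labels) if l == 0]
    return abs(sum(group1) / len(group1) - sum(group0) / len(group0))


def permutation_p_value(values, labels, statistic, n_permutations=999, seed=0):
    """Fraction of label permutations scoring at least as well as the real labels."""
    rng = random.Random(seed)
    observed = statistic(values, labels)
    shuffled = list(labels)
    at_least_as_good = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)                    # break the link to the target
        if statistic(values, shuffled) >= observed:
            at_least_as_good += 1
    # Count the observed labelling itself so the estimate is never exactly zero.
    return (at_least_as_good + 1) / (n_permutations + 1)


# Hypothetical expression values of a single gene in two well-separated classes.
values = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

p_value = permutation_p_value(values, labels, mean_difference)
print("permutation p-value:", p_value)
```

A small p-value indicates that few random relabellings do as well as the real labels, so the observed result is unlikely to be due to chance.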

For further information contact us