Gene-Rave predicts leukemia subtype using only ten genes
Gene-Rave is CSIRO Bioinformatics technology for the analysis of gene expression microarray data. It finds small sets of genes whose predictive accuracy is equal to or better than that of the much larger sets typically found by existing methods.
Gene-Rave is extremely fast and therefore scalable to large datasets, and amenable to computationally intensive resampling techniques such as cross validation and permutation testing.
Subtype classification in pediatric ALL
The success of treatment of pediatric acute lymphoblastic leukemia (ALL) is critically dependent on the accurate assignment of individual patients to various risk groups. Determining this assignment is currently complicated and expensive.
In "Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling" (Cancer Cell, Vol. 1, March 2002), the authors report a study of 360 pediatric ALL patients using Affymetrix oligonucleotide microarrays containing 12,600 probe sets.
Of the 360 ALL patients, 248 are clearly allocated to one of six disease subtypes. The authors apply a number of gene selection methods, followed by a number of different supervised classification techniques, and obtain very good prediction results. Typically, the gene preselection methods yield 200 to 300 genes.
Gene-Rave parsimonious classifiers
In this classification setting, Gene-Rave builds models using multinomial regression, with a Bayesian prior that assigns a high probability to the coefficient of any given gene being zero. Gene-Rave initially uses all available genes, not just those preselected, and rapidly eliminates genes that do not contribute to the model.
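The idea of a sparsity-inducing prior can be illustrated with a minimal sketch. It is not Gene-Rave's algorithm: it swaps in an L1 penalty (equivalent to a Laplace prior on the coefficients) fitted by proximal gradient descent, and the function names, the penalty `lam`, the step size `lr`, and the toy data are all assumptions made for illustration. In the toy data, gene 0 separates the two classes and gene 1 carries no class information, so its coefficients stay at exactly zero.

```python
import math

def softmax(scores):
    """Convert a list of class scores into probabilities."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_proba(weights, x):
    """Class probabilities for one sample; weights[k][j] is class k, gene j."""
    return softmax([sum(w * v for w, v in zip(wk, x)) for wk in weights])

def soft_threshold(value, amount):
    """Shrink value towards zero, snapping small values to exactly zero."""
    if value > amount:
        return value - amount
    if value < -amount:
        return value + amount
    return 0.0

def fit_sparse_multinomial(X, y, n_classes, lam=0.1, lr=0.1, n_iter=500):
    """L1-penalised multinomial regression (a stand-in for the Bayesian prior)."""
    n, p = len(X), len(X[0])
    weights = [[0.0] * p for _ in range(n_classes)]
    for _ in range(n_iter):
        # Gradient of the average negative log-likelihood.
        grad = [[0.0] * p for _ in range(n_classes)]
        for x, label in zip(X, y):
            probs = predict_proba(weights, x)
            for k in range(n_classes):
                err = probs[k] - (1.0 if k == label else 0.0)
                for j in range(p):
                    grad[k][j] += err * x[j] / n
        # Gradient step followed by the L1 shrinkage step.
        for k in range(n_classes):
            for j in range(p):
                stepped = weights[k][j] - lr * grad[k][j]
                weights[k][j] = soft_threshold(stepped, lr * lam)
    return weights

def selected_genes(weights):
    """Indices of genes with a non-zero coefficient in at least one class."""
    n_genes = len(weights[0])
    return [j for j in range(n_genes) if any(wk[j] != 0.0 for wk in weights)]

# Hypothetical toy data: four samples, two genes, two classes.
X = [[2.0, 1.0], [2.0, -1.0], [-2.0, 1.0], [-2.0, -1.0]]
y = [0, 0, 1, 1]
weights = fit_sparse_multinomial(X, y, n_classes=2)
print(selected_genes(weights))  # [0]
```

On real microarray data the number of genes is in the thousands, so a vectorised implementation would be needed; the sketch only shows how a penalty of this kind drives uninformative coefficients to zero.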
Two dimensional representation of the ten genes found by Gene-Rave. (view 1)
The resulting Gene-Rave model predicts the probability that a sample belongs to each of the six classes, using only ten genes.
To visualise the results of such a model, we have further projected the ten genes' expression values onto two dimensions. We have done this several times to produce the figures displayed in this document. The six classes are represented by six different colours.
Prediction accuracy
To assess the prediction accuracy of the Gene-Rave model we used 15-fold cross-validation. This involves leaving out 1/15th of the samples and rebuilding the model on the remaining samples using an identical procedure. The rebuilt model is then used to predict the class of the samples left out, and the number of errors is counted. This process is repeated 15 times, so that each sample is left out exactly once. The classification table below shows the results.
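The cross-validation procedure described above can be sketched as follows. The `fit` and `predict` arguments are hypothetical callables standing in for the full model-building procedure, and the nearest-centroid classifier and toy data are placeholders chosen only so the example runs; they are not part of Gene-Rave.

```python
import random

def kfold_indices(n_samples, n_folds=15, seed=0):
    """Split sample indices into n_folds disjoint folds of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]

def cross_validate(X, y, fit, predict, n_folds=15, seed=0):
    """Return the number of held-out samples that are misclassified."""
    errors = 0
    for fold in kfold_indices(len(X), n_folds, seed):
        held_out = set(fold)
        train = [i for i in range(len(X)) if i not in held_out]
        # Rebuild the model from scratch (gene selection included)
        # using only the training samples.
        model = fit([X[i] for i in train], [y[i] for i in train])
        errors += sum(predict(model, X[i]) != y[i] for i in fold)
    return errors

# Placeholder classifier: nearest class centroid.
def fit_centroids(X, y):
    sums, counts = {}, {}
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for j, v in enumerate(x):
            acc[j] += v
        counts[label] = counts.get(label, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}

def predict_centroid(model, x):
    def squared_distance(label):
        return sum((a - b) ** 2 for a, b in zip(model[label], x))
    return min(model, key=squared_distance)

# Hypothetical toy data: 30 samples, one gene, two well-separated classes.
X = [[10.0 * (i % 2)] for i in range(30)]
y = [i % 2 for i in range(30)]
print(cross_validate(X, y, fit_centroids, predict_centroid))  # 0
```

The important design point is that every step of model building, including the choice of genes, is repeated inside each fold; selecting genes once on all samples and cross-validating only the final classifier would give an optimistic error estimate.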
The 12 misclassified samples out of 248 represent an overall estimated future error rate of just under 5%.
Second two dimensional representation of the ten genes found by Gene-Rave. (view 2)
Third two dimensional representation of the ten genes found by Gene-Rave. (view 3)
Are the results better than random chance?
One problem with methods that select from many possible genes to build a predictor is the possibility that the relationship found arose purely by chance. That is, if there were no real relationship, what are the chances of finding one when given enough random genes? This commonly occurs when there are many more measurements made on each sample than there are samples.
To guard against this kind of error, we can use a permutation test to estimate the probability of finding as good a result by chance. Essentially, this involves randomly reassigning the samples to the six classes (keeping the number in each class the same) and then rebuilding a model using an identical process. The error rates thus achieved indicate the error rates achievable by chance in such a data set.
For this data set, none of the models built on 100 random permutations achieved results as good as those of the original model. We therefore suggest that the ten-gene model is highly significant (an empirical p-value of about 0.01 or less).
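A permutation test of this kind can be sketched as below. The `error_count` argument is a hypothetical callable that rebuilds the model for a given labelling and returns its number of errors; the add-one correction in the p-value is a common convention and an assumption here, not something stated for Gene-Rave.

```python
import random

def permutation_p_value(observed_errors, labels, error_count,
                        n_permutations=100, seed=0):
    """Estimate how often permuted labels do at least as well as the real ones."""
    rng = random.Random(seed)
    as_good = 0
    for _ in range(n_permutations):
        shuffled = list(labels)
        # Shuffling reassigns samples to classes while keeping
        # the number of samples in each class the same.
        rng.shuffle(shuffled)
        if error_count(shuffled) <= observed_errors:
            as_good += 1
    # Add-one correction: the observed labelling counts as one permutation.
    return (as_good + 1) / (n_permutations + 1)

# Hypothetical stand-in: every permuted model makes 100 errors.
labels = [1] * 15 + [2] * 27 + [3] * 64 + [4] * 20 + [5] * 43 + [6] * 79
p = permutation_p_value(12, labels, lambda permuted: 100)
print(round(p, 4))  # 0.0099
```

With 100 permutations and no permuted model matching the observed result, this estimate is 1/101, which is the smallest value the test can report at that number of permutations.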
| True subtype (rows) vs. predicted subtype (columns) | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| 1 | 11 | 0 | 3 | 1 | 0 | 0 |
| 2 | 0 | 27 | 0 | 0 | 0 | 0 |
| 3 | 3 | 0 | 60 | 0 | 1 | 0 |
| 4 | 2 | 0 | 2 | 16 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 43 | 0 |
| 6 | 0 | 0 | 0 | 0 | 0 | 79 |
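The figures quoted for the cross-validation results can be checked directly from the classification table above. The short sketch below copies the table into a list of rows and counts the off-diagonal entries; the variable names are illustrative.

```python
# Rows are true subtypes 1 to 6, columns are predicted subtypes 1 to 6.
confusion = [
    [11, 0, 3, 1, 0, 0],
    [0, 27, 0, 0, 0, 0],
    [3, 0, 60, 0, 1, 0],
    [2, 0, 2, 16, 0, 0],
    [0, 0, 0, 0, 43, 0],
    [0, 0, 0, 0, 0, 79],
]

total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(confusion)))
errors = total - correct
print(errors, total, round(100 * errors / total, 1))  # 12 248 4.8

# Errors per true subtype: (subtype, misclassified, samples in subtype).
per_class = [(i + 1, sum(row) - row[i], sum(row))
             for i, row in enumerate(confusion)]
print(per_class)
```

All 12 errors fall in subtypes 1, 3 and 4, with four misclassified samples each; subtypes 2, 5 and 6 are predicted without error.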
Expanding the set of genes
Of course, the ten genes found are not the only genes associated with the leukemia subtypes. Many other genes may have an association, but the ten we have found are the minimum necessary to produce a good predictor using our models.
Additional genes can be found in two ways:
- By association with the ten already found, using correlation for example.
- By rebuilding a model with the initial ten removed. In this example, the method finds a second set of eleven genes.
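The first approach, expansion by correlation, can be sketched as follows. The gene names, the expression values, and the correlation threshold of 0.8 are assumptions chosen for illustration; the source does not specify how correlation is measured or what cut-off is used.

```python
import math

def pearson(a, b):
    """Pearson correlation between two equal-length lists of values."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant gene is treated as uncorrelated
    return cov / math.sqrt(var_a * var_b)

def correlated_genes(expression, selected, threshold=0.8):
    """Find unselected genes strongly correlated with any selected gene.

    expression maps a gene name to its values across samples.
    """
    found = {}
    for gene, values in expression.items():
        if gene in selected:
            continue
        for chosen in selected:
            r = pearson(values, expression[chosen])
            if abs(r) >= threshold:
                found.setdefault(gene, []).append((chosen, r))
    return found

# Hypothetical expression values for three genes over four samples.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],   # already in the model
    "geneB": [2.0, 4.0, 6.0, 8.0],   # tracks geneA closely
    "geneC": [1.0, -1.0, 1.0, -1.0], # weakly related to geneA
}
print(sorted(correlated_genes(expression, {"geneA"})))  # ['geneB']
```

The second approach needs no new code: the model-building procedure is simply rerun on the expression data with the columns for the initial ten genes removed.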
