Gene-Rave builds a survival index based on just six genes
Gene-Rave is CSIRO Bioinformatics technology for the analysis of gene expression microarray data. The technology is able to find small sets of genes with the same or better predictive accuracy than the usually much larger sets found by existing technology.
Gene-Rave is extremely fast and therefore scalable to large datasets, and amenable to computationally intensive resampling techniques such as cross validation and permutation testing.
Survival analysis in diffuse large B-cell lymphoma
From "Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling", Alizadeh et al., Nature, February 2000.
Diffuse large B-cell lymphoma (DLBCL), the most common subtype of non-Hodgkin's lymphoma, is clinically heterogeneous: 40% of patients respond well to current therapy and have prolonged survival, ...
The authors describe an experiment examining the gene expression profiles of tissue from 96 patients using Lymphochip cDNA microarrays; 42 of these patients were diagnosed with DLBCL. After some preprocessing, the authors focused on 4,026 genes. Using cluster analysis and substantial prior knowledge, the paper divides the DLBCL patients into two new subgroups, referred to as germinal centre B-like DLBCL and activated B-like DLBCL.
... patients assigned [...] to either DLBCL subgroup shared a large gene expression program that distinguished them from the other subgroup.
... no single gene [...] was absolutely correlated in expression with the DLBCL subgroup taxonomy.
Gene-Rave finds two genes that separate DLBCL subgroups
The Gene-Rave technology allows us to fit logistic regression models to this data using (initially) all 4,026 genes. Gene-Rave uses a Bayesian prior that puts a large probability on the coefficient of any gene in the model being zero. When applied to this data set, Gene-Rave selects just two genes to separate the DLBCL subgroups.
Gene-Rave uses two genes to separate DLBCL subgroups.
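The details of the Gene-Rave prior are not given here, so the following is only a minimal sketch of the general idea of fitting a logistic regression in which most coefficients are driven to exactly zero. An L1 (lasso) penalty is swapped in for the Bayesian prior; it is a stand-in, not the Gene-Rave algorithm. The function names, the penalty and step values, and the toy data are all assumptions made for illustration.

```python
import math

def soft_threshold(value, threshold):
    """Shrink value towards zero; anything within the threshold becomes exactly 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def fit_sparse_logistic(X, y, penalty=0.1, step=0.5, n_iter=2000):
    """L1-penalised logistic regression fitted by proximal gradient descent.

    Assumption: the L1 penalty stands in for a sparsity-favouring prior;
    this is not the Gene-Rave method itself.
    """
    n, p = len(X), len(X[0])
    weights, intercept = [0.0] * p, 0.0
    for _ in range(n_iter):
        # Residuals (predicted probability minus label) at the current fit.
        residuals = []
        for row, label in zip(X, y):
            z = intercept + sum(w * x for w, x in zip(weights, row))
            residuals.append(1.0 / (1.0 + math.exp(-z)) - label)
        # The intercept is not penalised.
        intercept -= step * sum(residuals) / n
        # Gradient step followed by shrinkage for every gene coefficient.
        for j in range(p):
            gradient = sum(r * row[j] for r, row in zip(residuals, X)) / n
            weights[j] = soft_threshold(weights[j] - step * gradient, step * penalty)
    return weights, intercept

# Toy data (hypothetical): 8 arrays x 4 genes, class labels 0/1.
# Gene 0 separates the classes; gene 3 is only partly correlated with them.
X = [[-1,  1,  1,  1],
     [-1, -1,  1, -1],
     [-1,  1, -1, -1],
     [-1, -1, -1, -1],
     [ 1,  1,  1,  1],
     [ 1, -1,  1,  1],
     [ 1,  1, -1, -1],
     [ 1, -1, -1,  1]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

weights, intercept = fit_sparse_logistic(X, y)
selected = [j for j, w in enumerate(weights) if w != 0.0]
print("selected genes:", selected)
```

In this toy example the partly informative gene enters the model during the first iterations and is then shrunk back out, which is the behaviour a zero-favouring prior is meant to encourage.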
Prediction accuracy
The predictor produced makes 1 error out of 42 when used to predict the original training data. Of course, this result is likely to be optimistic, so in the absence of an independent test set, we have used leave-one-out cross validation to assess the accuracy of our classifier. This involves removing one array and rebuilding the classifier using the remaining 41 arrays. The classifier thus derived is used to predict the left-out array, and the whole process is repeated until each array has been left out once.
Using this process, the assessed error rate of the Gene-Rave classifier is 2 out of 42, or just under 5%.
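The leave-one-out procedure described above can be sketched as follows. The classifier here is a simple nearest-centroid rule used purely as a placeholder, since the Gene-Rave model itself is not reproduced; the function names and the toy data are assumptions.

```python
from statistics import mean

def nearest_centroid_fit(X, y):
    """Placeholder classifier: one mean expression vector per class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [mean(column) for column in zip(*rows)]
    return centroids

def nearest_centroid_predict(centroids, x):
    """Return the label of the centroid closest to x (squared Euclidean distance)."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(x, centroid))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))

def leave_one_out_errors(X, y, fit=nearest_centroid_fit, predict=nearest_centroid_predict):
    """Count the arrays misclassified when each is left out of training in turn."""
    errors = 0
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]   # all arrays except array i
        y_train = y[:i] + y[i + 1:]
        model = fit(X_train, y_train)  # rebuild the classifier from scratch
        if predict(model, X[i]) != y[i]:
            errors += 1
    return errors

# Toy data (hypothetical): 6 arrays x 2 genes, two subgroups.
X = [[0.1, 1.0], [0.2, 0.9], [0.0, 1.1], [1.0, 0.1], [0.9, 0.0], [1.1, 0.2]]
y = ["GC", "GC", "GC", "ABC", "ABC", "ABC"]
errors = leave_one_out_errors(X, y)
print(f"{errors} errors out of {len(X)} ({errors / len(X):.1%})")
```

The important design point is that the model is refitted inside the loop, so the left-out array never influences the classifier that predicts it. If gene selection is part of the fitting, it belongs inside the loop as well.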
To assess whether such a result could have occurred by random chance, we use permutation testing. The original 42 arrays are randomly relabelled as being members of one of the two subgroups, and a model is fitted to the relabelled data. If the original result could easily arise by chance, then a random relabelling should often achieve a similar error rate. In this example, we calculate the error rate for 99 random permutations, and no random labelling achieved an error rate as low as the original true labelling: a highly significant result.
Histogram of error rates for 99 random permutations and the original model. The vertical red line is the error rate from the true labelling.
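A minimal sketch of the permutation test follows. The p-value uses one common convention, counting the true labelling as one of the labellings considered; whether the original analysis used exactly this convention is an assumption. The function names are hypothetical, and toy_error_rate is a deliberately simple placeholder for "fit a classifier and measure its error".

```python
import random

def permutation_p_value(observed_error, permuted_errors):
    """Share of labellings (true one included) with error <= the observed error."""
    at_least_as_good = sum(1 for e in permuted_errors if e <= observed_error)
    return (at_least_as_good + 1) / (len(permuted_errors) + 1)

def permuted_error_rates(X, y, error_rate, n_permutations=99, seed=0):
    """Shuffle the class labels, refit, and record the error rate each time."""
    rng = random.Random(seed)
    rates = []
    for _ in range(n_permutations):
        shuffled = list(y)
        rng.shuffle(shuffled)          # random relabelling of the arrays
        rates.append(error_rate(X, shuffled))
    return rates

def toy_error_rate(X, y):
    """Hypothetical stand-in for fitting a model and measuring its error.

    Thresholds gene 0 at 0.5 and keeps whichever class orientation fits better.
    """
    wrong = sum(1 for row, label in zip(X, y) if (row[0] > 0.5) != bool(label))
    return min(wrong, len(y) - wrong) / len(y)

# Toy data (hypothetical): 42 arrays, one gene, two balanced subgroups.
X = [[i / 41] for i in range(42)]
y = [0] * 21 + [1] * 21

observed = toy_error_rate(X, y)
rates = permuted_error_rates(X, y, toy_error_rate)
print("p-value:", permutation_p_value(observed, rates))
```

Under this convention, 99 permutations with none as good as the true labelling give (0 + 1) / (99 + 1) = 0.01, which is the smallest p-value that 99 permutations can produce.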
Plot of actual survival time against the Gene-Rave six-gene survival index. Red dots are deaths.
Clinical outcomes: patient survival time
The original authors discovered that the survival rates of the two subgroups of DLBCL patients are very different. We take the analysis one stage further and consider building a predictor of survival time using the gene expression data.
Using Gene-Rave tailored to censored survival data, we built a Cox's proportional hazards model of the survival times, and Gene-Rave selected just 6 genes. The model built has no information about treatment regimes or other patient covariates, and it is likely that, given sufficient data, the predictions could be improved.
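To make the modelling step concrete, here is a small sketch of the quantity a Cox proportional hazards model maximises, the partial log-likelihood, together with a linear survival index. It assumes no tied event times and says nothing about how Gene-Rave selects genes or estimates coefficients; the function names, the coefficients, and the toy data are hypothetical.

```python
import math

def survival_index(expression, coefficients):
    """Linear risk score: sum of coefficient x expression over the chosen genes."""
    return sum(b * x for b, x in zip(coefficients, expression))

def cox_partial_log_likelihood(times, events, risk_scores):
    """Cox partial log-likelihood, assuming no tied event times.

    times:       follow-up time for each patient
    events:      1 if the patient died at that time, 0 if censored
    risk_scores: survival index for each patient (higher = higher hazard)
    """
    total = 0.0
    for i, (t_i, died) in enumerate(zip(times, events)):
        if not died:
            continue  # censored patients contribute only through risk sets
        # Risk set: everyone still under observation at time t_i.
        at_risk = [risk_scores[j] for j, t_j in enumerate(times) if t_j >= t_i]
        total += risk_scores[i] - math.log(sum(math.exp(s) for s in at_risk))
    return total

# Toy data (hypothetical): 4 patients, the third is censored.
times = [2, 5, 7, 11]
events = [1, 1, 0, 1]
well_ordered = [2.0, 1.0, 0.5, 0.0]   # highest index dies first
badly_ordered = [0.0, 0.5, 1.0, 2.0]  # highest index survives longest

print(cox_partial_log_likelihood(times, events, well_ordered))
print(cox_partial_log_likelihood(times, events, badly_ordered))
```

An index that ranks early deaths highest scores a larger partial log-likelihood than one that ranks them lowest, which is why a single number per patient can be plotted against actual survival time.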
Are the other genes interesting?
Gene-Rave selects a minimal set of genes that it needs to build a model. Of course, this set is not exhaustive, in the sense that it may not contain all the genes with information relevant to the target.
Additional genes can be found in two ways:
- By association with the ten already found, using correlation for example.
- By rebuilding a model with the initial ten removed. In this example, the method finds a second set of eleven genes.
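The first of these two routes, association by correlation, can be sketched as below. Pearson correlation and the 0.9 cut-off are illustrative assumptions, as are the function names and the toy expression profiles.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant profile carries no correlation information
    return cov / math.sqrt(var_a * var_b)

def correlated_genes(expression, selected, threshold=0.9):
    """Map each unselected gene to its best-matching selected gene.

    expression: dict of gene name -> expression values across arrays
    selected:   list of gene names already chosen by the model
    Only genes whose absolute correlation reaches the threshold are returned.
    """
    hits = {}
    for gene, profile in expression.items():
        if gene in selected:
            continue
        best = max(selected, key=lambda s: abs(pearson(profile, expression[s])))
        r = pearson(profile, expression[best])
        if abs(r) >= threshold:
            hits[gene] = (best, r)
    return hits

# Toy data (hypothetical): 4 genes measured on 5 arrays.
expression = {
    "geneA": [1, 2, 3, 4, 5],
    "geneB": [2.1, 3.9, 6.2, 8.0, 9.9],   # tracks geneA closely
    "geneC": [3, 1, 4, 1, 5],             # weakly related
    "geneD": [5, 4, 3, 2, 1],             # mirror image of geneA
}
hits = correlated_genes(expression, selected=["geneA"])
for gene, (partner, r) in sorted(hits.items()):
    print(f"{gene} correlates with {partner}: r = {r:.3f}")
```

Using the absolute correlation keeps genes that move in the opposite direction to a selected gene, since they carry much the same information. The second route amounts to dropping the selected columns from the data and refitting the model.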
For further information contact us
