In recent years, gene expression and proteomics platforms of various types have produced massively multivariate signatures of gene activity. It is now commonplace to assay the expression of tens of thousands of genes from a single sample.
One of the most interesting uses of this technology has been to develop diagnostic signatures which can be used to discriminate between diseased and healthy patients, or between different types of disease.
The usefulness of these platforms for diagnostic development has been handicapped by deficiencies in data analysis. Naïve approaches based on simple clustering, and even more sophisticated approaches such as support vector machines, generate diagnostic signatures with tens, or perhaps hundreds, of genes.
These high-dimensional signatures can only be made operational with expensive or volatile platforms such as Affymetrix chips and microarrays. They do not lead to the development of robust and cost-effective diagnostics.
We have developed GeneRave, a procedure based on rapid variable elimination from a Bayesian model. GeneRave generates simple diagnostic fingerprints which are easy to use. It is CPU-efficient and has a low memory footprint.
For more information about GeneRave, see "Data to Diamonds: Multivariate Datamining Leads to Concise Gene Signatures for Disease Classification".
Our Technology and Its Value
Our technology is directed to the production of robust, low-dimensional diagnostic signatures, which can be implemented in low-cost and stable diagnostic platforms. These diagnostic signatures typically:
- outperform high-dimensional signatures,
- are easy both to interpret and to implement.
Many experimental models may be used to generate data for diagnostic development. The common paradigm is to measure some response variable on each sample, and then to predict that response from gene expression data.
Experimental procedures differ in the nature of the response variable. Our algorithms have been designed to work when the response variable is:
- Membership of two or more groups;
- Censored survival data of the type usually analysed by Cox's proportional hazards regression;
- Continuously variable;
- Ordered categories (e.g. cancer staging data or prognostic class);
- Distributed according to any member of the regular exponential family of distributions.
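To make the first case in the list above (membership of two groups) concrete, the following is a minimal sketch of how a sparse signature can be obtained from many candidate genes. It is not the GeneRave procedure, which is proprietary and based on a Bayesian model; L1-penalised logistic regression fitted by proximal gradient descent is swapped in here as a widely known sparse method. The function names, the penalty and step values, and the synthetic data (in which only genes 0 and 1 carry signal) are all assumptions made for illustration.

```python
import math
import random


def soft_threshold(value, threshold):
    """Shrink value toward zero; anything within the threshold becomes exactly 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_sparse_logistic(X, y, penalty=0.1, step=0.25, iterations=400):
    """L1-penalised logistic regression (illustrative stand-in, not GeneRave).

    X: list of samples, each a list of expression values (one per gene).
    y: list of 0/1 group labels.
    Returns (intercept, weights); the L1 penalty drives many weights to zero.
    """
    n, p = len(X), len(X[0])
    intercept, weights = 0.0, [0.0] * p
    for _ in range(iterations):
        # Predicted probability minus label, evaluated at the current model.
        residuals = [
            sigmoid(intercept + sum(w * x for w, x in zip(weights, row))) - label
            for row, label in zip(X, y)
        ]
        # Plain gradient step for the (unpenalised) intercept.
        intercept -= step * sum(residuals) / n
        # Gradient step followed by soft-thresholding for each gene weight.
        for j in range(p):
            gradient = sum(r * row[j] for r, row in zip(residuals, X)) / n
            weights[j] = soft_threshold(weights[j] - step * gradient, step * penalty)
    return intercept, weights


def make_synthetic_data(n_samples=80, n_genes=30, seed=0):
    """Hypothetical data: only genes 0 and 1 differ between the two groups."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n_samples):
        label = i % 2
        row = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
        row[0] += 2.0 * label - 1.0  # higher in group 1
        row[1] -= 2.0 * label - 1.0  # lower in group 1
        X.append(row)
        y.append(label)
    return X, y


if __name__ == "__main__":
    X, y = make_synthetic_data()
    intercept, weights = fit_sparse_logistic(X, y)
    selected = [j for j, w in enumerate(weights) if w != 0.0]
    print("genes retained in the signature:", selected)
```

Increasing the penalty retains fewer genes, which mirrors the trade-off between signature size and fit that any sparse method has to manage. The other response types listed above would need a different likelihood in place of the logistic one.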
Our technology is embodied in a number of patent-protected algorithms. These proprietary algorithms are implemented in a rigorously engineered library that has been proven in multi-platform applications.
These algorithms have been designed to scale to problems with tens of thousands of variables, measured on hundreds of thousands of observations. In a reference application, observations represent different microarrays or gene chips, and variables represent expression levels (or ratios) for different genes.
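To give a feel for the problem sizes mentioned above, the short sketch below estimates the memory needed to hold an observations-by-variables expression matrix in dense form. The helper name and the example dimensions are hypothetical, and the calculation says nothing about how the library itself stores data.

```python
def dense_matrix_gib(n_observations, n_variables, bytes_per_value=8):
    """Memory (in GiB) for a dense matrix of 8-byte floating point values."""
    return n_observations * n_variables * bytes_per_value / 2**30


if __name__ == "__main__":
    # Hypothetical sizes: 100,000 observations by 10,000 variables.
    print(f"{dense_matrix_gib(100_000, 10_000):.2f} GiB")
```

At this scale a dense copy of the data already runs to several gibibytes, which is why memory footprint matters as much as CPU time for problems of this size.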
Our target architectures include mid-range PC workstations and high-performance SMP hardware, such as the IBM p-series server, which is scalable up to supercomputer performance. We are also currently developing cluster (Beowulf) versions of the algorithms.
