Rank-Based Classification of Gene Expression Profiles
Donald Geman (Johns Hopkins University), geman@jhu.edu,
Daniel Naiman (Johns Hopkins University), daniel.naiman@jhu.edu, and
Christian d'Avignon (Johns Hopkins University), davic@bme.jhu.edu
Abstract
Statistical inference from gene expression microarray data is difficult because the number of observations, typically tens, is small relative to the number of genes, typically thousands. Consequently, standard machine learning methods may over-fit and yield inflated estimates of performance in detecting disease, identifying tumors, and predicting treatment response, especially when not every aspect of learning a classifier is properly cross-validated. Equally important, the results may be very difficult to interpret in biological terms. We address these problems with a purely rank-based analysis, for instance comparing the mRNA counts within selected pairs of genes. As an example, we attempt to distinguish among cancer types with a maximum likelihood classifier based on a single pair of genes. The results so far are very promising: we obtain accurate and transparent decisions from small samples in standard classification tasks. However, many questions, both statistical and biological, remain unanswered.
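To make the idea of a single-pair, rank-based decision concrete, the following is a minimal sketch and not the authors' implementation. It assumes two classes and toy data invented for illustration, and it assumes the pair is chosen as the one whose ordering frequency differs most between classes; the abstract does not state the selection criterion. The function names (order_frequencies, select_pair, classify) are hypothetical. Given the chosen pair (i, j), a new profile is assigned to the class under which the observed ordering of the two expression values is most likely.

```python
from itertools import combinations


def order_frequencies(samples, labels, i, j):
    """Per-class fraction of samples in which gene i is expressed below gene j."""
    counts, totals = {}, {}
    for x, y in zip(samples, labels):
        totals[y] = totals.get(y, 0) + 1
        counts[y] = counts.get(y, 0) + (x[i] < x[j])
    return {c: counts[c] / totals[c] for c in totals}


def select_pair(samples, labels):
    """Pick the pair whose ordering frequency differs most across classes.

    Assumed criterion for illustration; the abstract does not specify one.
    """
    n_genes = len(samples[0])

    def spread(pair):
        p = order_frequencies(samples, labels, *pair).values()
        return max(p) - min(p)

    return max(combinations(range(n_genes), 2), key=spread)


def classify(x, pair, freqs):
    """Maximum likelihood decision from the single observed ordering."""
    i, j = pair
    observed = x[i] < x[j]
    likelihood = {c: (p if observed else 1.0 - p) for c, p in freqs.items()}
    return max(likelihood, key=likelihood.get)


# Toy expression profiles (three genes, six samples), invented for illustration.
train = [
    [5.0, 1.0, 3.0], [6.0, 2.0, 1.5], [4.0, 0.5, 3.5],  # class "A"
    [3.5, 4.0, 3.0], [0.5, 5.0, 2.0], [2.0, 6.0, 3.5],  # class "B"
]
labels = ["A", "A", "A", "B", "B", "B"]

pair = select_pair(train, labels)
freqs = order_frequencies(train, labels, *pair)
prediction = classify([2.0, 7.0, 1.0], pair, freqs)
```

Because the decision depends only on which of the two values is larger, it is unchanged by any strictly increasing transformation applied to a sample, such as a different normalization of that array.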