Interface 2004
Abstract

Some statistical issues related to feature detection using random forests
Grant Izmirlian (National Cancer Institute), izmirlian@nih.gov

The random forest (RF) algorithm of Breiman and Cutler is arguably one of the best "off the shelf" classification algorithms available to date: with practically no tuning, it classifies extremely underspecified statistical problems at expected losses near the Bayes error. In the analysis of proteomic profiling data, as in almost all classification problems, the target is not so much the classifier per se as the identification of important features. Toward this end, the RF algorithm supplies a peak importance measure based upon the average decrease in correct votes that occurs when a given feature is "noised". Two Monte Carlo studies are presented. (1) Under the null hypothesis, the normalized importance measures display non-normal, fat-tailed asymptotics, so that a step-up procedure such as the Benjamini-Hochberg false discovery rate procedure results in observed false discovery rates that are highly inflated. (2) A second Monte Carlo study, generated under an alternative hypothesis containing one important peak with simple odds ratio 10 for the affected class in a balanced design, hints at some implications for power and sample size.
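The inflation described in point (1) can be illustrated with a minimal sketch. It is not the simulation used in this work: it does not fit a random forest, and it assumes, purely for illustration, that the null statistics follow a Student t distribution with 3 degrees of freedom as a stand-in for "fat tailed", while p-values are computed against a standard normal reference. The function names, the 200 features, the 200 replicates, and the level q = 0.05 are all hypothetical choices.

```python
import math
import random


def normal_two_sided_p(z):
    """Two-sided p-value for z under a standard normal reference."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def benjamini_hochberg(pvalues, q=0.05):
    """Return the sorted indices rejected by the Benjamini-Hochberg rule."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest rank k (1-based) with p_(k) <= k * q / m,
    # then reject the hypotheses holding ranks 1..k.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank * q / m:
            k_max = rank
    return sorted(order[:k_max])


def student_t(rng, df):
    """Draw one Student-t variate as normal / sqrt(chi-square / df)."""
    z = rng.gauss(0.0, 1.0)
    chi2 = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df))
    return z / math.sqrt(chi2 / df)


def mean_null_rejections(draw, n_features=200, n_reps=200, q=0.05, seed=0):
    """Average number of rejections per replicate when every feature is null.

    Every rejection counted here is a false discovery by construction.
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(n_reps):
        pvals = [normal_two_sided_p(draw(rng)) for _ in range(n_features)]
        total += len(benjamini_hochberg(pvals, q))
    return total / n_reps


# Assumed null models: standard normal versus fat-tailed t with 3 df.
normal_rate = mean_null_rejections(lambda rng: rng.gauss(0.0, 1.0))
fat_rate = mean_null_rejections(lambda rng: student_t(rng, 3))
print(f"mean false rejections per replicate, normal null: {normal_rate:.3f}")
print(f"mean false rejections per replicate, t(3) null:   {fat_rate:.3f}")
```

Under these assumptions the normal null should produce almost no rejections, while the t(3) null produces several per replicate, every one of them false. The same harness could take importance scores from a real RF fit in place of the simulated draws.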

