Learning From Extremely Imbalanced Data with Random Forests
Andy Liaw, (Merck Research Labs), andy_liaw@merck.com,
Chao Chen, (UC Berkeley), chenchao@stat.berkeley.edu, and
Leo Breiman, (UC Berkeley), leo@stat.berkeley.edu
Abstract
Classification of extremely imbalanced data arises in many practical applications; e.g., fraud detection, drug discovery, direct marketing, rare disease diagnosis, etc., where the class of interest comprise of very small fraction of the data. Usually the primary interest in learning from such data is to have high accuracy in predicting the rare ("positive") class, whereas the prediction accuracy of the large ("negative") class is relatively de-emphasized. Such data can pose a serious challenge to most of the classification algorithms in common use, because these algorithms, without modification, aim at maximizing the overall accuracy. Even algorithms that can accommodate class weights can break down in the extreme cases. Recent research in dealing with imbalanced data suggests that sampling techniques can be quite effective. We propose two modifications to the random forest algorithm (Breiman, 2001) for such data; one is based on an ensemble of trees where the majority ! class is down-sampled, and the other takes class weights into account in both tree induction and counting votes for the ensemble. We compare the performance of these two approaches with some algorithms that were proposed recently by various authors (one-sided sampling, SHRINK, SMOTE, and SMOTEboost). An empirical comparison on a few datasets shows that both approaches tend to perform better than the existing methods, but neither of these two shows a clear advantage over the other.