Computationally Efficient Identification of Outliers in Large Data Sets
Mark Werner (Oakland University), werner@oakland.edu
Abstract
We present a computationally fast procedure for identifying outliers, suitable for use in large data sets. The procedure uses a modification of Tukey's biweight function to obtain robust location and scale estimates and, from these, a robust Mahalanobis distance (RMD) for each observation. We estimate the density of these RMDs and determine a final rejection point from the empirical density function; a point is then classified as an outlier if its RMD exceeds this rejection point. Because the procedure makes no distributional assumptions about the data (such as normality), it achieves a high degree of accuracy on a wide variety of data sets, including skewed and correlated data. It is computationally efficient and can rapidly identify outliers in large, high-dimensional data sets. We also examine the influence function of the robust estimator defined by the first half of this algorithm and compute its asymptotic robustness properties. These properties are compared with those of other well-known estimators to give a deeper understanding of robust estimation, which need not be performed in conjunction with outlier identification.
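The pipeline outlined above (robust location and scale, then RMDs, then a rejection point) can be illustrated with a minimal Python sketch. It is not the procedure of this paper: it uses the standard Tukey biweight rather than the modified version, and it replaces the density-based rejection point with a simple median-plus-MAD cutoff as a placeholder. The function names (`mahalanobis`, `robust_distances`, `flag_outliers`), the tuning constants `k` and `n_mads`, and the fixed iteration count are all assumptions made for illustration, and the data are assumed to be continuous so that no scale estimate is zero.

```python
import numpy as np


def mahalanobis(X, mu, cov):
    """Mahalanobis distance of each row of X from mu under scatter matrix cov."""
    diff = X - mu
    solved = np.linalg.solve(cov, diff.T).T  # cov^{-1} (x - mu), row by row
    return np.sqrt(np.maximum(np.einsum("ij,ij->i", diff, solved), 0.0))


def robust_distances(X, k=3.0, n_iter=20):
    """Iteratively reweighted location/scatter with standard biweight weights.

    Assumption: the biweight cutoff is k times the median distance, a
    hypothetical choice that avoids any distributional tuning constant.
    """
    X = np.asarray(X, dtype=float)
    # Start from the coordinatewise median and a diagonal MAD-based scatter.
    mu = np.median(X, axis=0)
    mad = 1.4826 * np.median(np.abs(X - mu), axis=0)
    cov = np.diag(mad ** 2)
    for _ in range(n_iter):
        d = mahalanobis(X, mu, cov)
        u = d / (k * np.median(d))
        # Tukey biweight weights: (1 - u^2)^2 inside the cutoff, 0 outside.
        w = np.where(u < 1.0, (1.0 - u ** 2) ** 2, 0.0)
        mu = (w[:, None] * X).sum(axis=0) / w.sum()
        diff = X - mu
        cov = (w[:, None] * diff).T @ diff / w.sum()
    return mahalanobis(X, mu, cov)


def flag_outliers(rmd, n_mads=3.0):
    """Placeholder rejection rule: median + n_mads * scaled MAD of the RMDs.

    This stands in for the density-based rejection point, which is not
    reproduced here.
    """
    med = np.median(rmd)
    mad = 1.4826 * np.median(np.abs(rmd - med))
    return rmd > med + n_mads * mad


# Usage: 200 standard-normal points plus 10 shifted points in 3 dimensions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(200, 3)), rng.normal(loc=10.0, size=(10, 3))])
flags = flag_outliers(robust_distances(X))
```

Because points beyond the biweight cutoff receive zero weight, gross outliers do not influence the final location and scatter estimates, which is what keeps their distances large and makes them easy to separate at the rejection step.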