Massive Data Sets and Streams (David Scott, chair)


Tamraparni Dasu (AT&T Labs - Research)
Shankar Krishnan (AT&T Labs - Research)
Suresh Venkatasubramanian (AT&T Labs - Research)
Ke Yi (Duke University)
An Information-Theoretic Approach to Detecting Changes in Multi-Dimensional Data Streams

Saturday 10:30-11:00, Fountain II

Abstract:

An important problem in processing large data streams is detecting changes in the underlying distribution that generates the data. In this paper, we take a general, information-theoretic approach to the change detection problem, which works for multidimensional as well as categorical data. We use relative entropy, also called the Kullback-Leibler distance, to measure the difference between two given distributions.

Our scheme is general; it is nonparametric and requires no assumptions on the underlying distributions. It employs a statistical inference procedure based on the theory of bootstrapping, which allows us to determine whether our measurements are statistically significant. The scheme is also quite flexible from a practical perspective; it can be implemented using any spatial partitioning scheme that scales well with dimensionality. In addition to providing change detections, our method generalizes Kulldorff's spatial scan statistic, allowing us to quantitatively identify specific regions in space where large changes have occurred.

We provide an experimental study that demonstrates the generality and efficiency of our approach with both synthetic and real multidimensional datasets.


Deepak Agarwal (AT&T Labs - Research)
Junlan Feng (AT&T Labs - Research)
Valerie Torres (AT&T Labs - Research)
Monitoring Massive Streams Simultaneously: A Holistic Approach

Saturday 11:00-11:30, Fountain II

Abstract:

A holistic approach to prospective anomaly detection for massive number of streams is proposed. The method works by building a baseline model to capture normal behavior. Any baseline model that provides a p-value for the observed, relative to the predicted can be used. Anomalies are detected by tracking the EWMA of normal scores derived from p-values. A flexible and fast five-parameter Bayesian model adjusts for multiple testing at each time point. Methods to delete uninformative streams from the monitoring process are also discussed. The procedure is illustrated with a biosurveillance application which involves monitoring Chinese websites for a set of patterns intended to provide leading indicators of social disruption events in China.



David Poole (AT&T Labs - Research)
TurboPRIM: Hotspot Setection in Massive Datasets

Saturday 11:30-12:00, Fountain II

Abstract:

The patient rule induction method (PRIM), introduced by Friedman and Fisher (1999), is a powerful information mining algorithm. The objective of PRIM is to find response ``hotspots'' (or bumps) in a potentially high-dimensional space of predictor variables. PRIM seeks box-shaped subregions in the predictor space where the average value of the response is significantly larger than its average over the entire space. This paper describes TurboPRIM, a modification of PRIM designed specifically for massive datasets. There are various practical challenges when the number of cases and/or predictors is very large. Our approach is to create an out-of-memory, disk-based implementation of PRIM where the dataset is never stored in the memory of a computer, and all calculations are performed by making a minimal series of passes over the data on disk. We illustrate TurboPRIM on both simulated and real data, and show that TurboPRIM outperforms sampling-based standard PRIM for some large datasets. We also describe pertinent features of TurboPRIM such as rapid single-pass quantile calculation and a method for approximate significance testing. We discuss future research possibilities, particularly ways in which TurboPRIM could be adapted for massive data streams.