Streaming Data II (Bill Szewczyk, organizer)


Pedro Domingos (University of Washington)
A General Framework for Mining Massive Data Streams

Friday 2:00-2:30, Fountain I

Abstract:

In many domains, data now arrives faster than we are able to mine it. To avoid wasting this data, we must switch from the traditional "one-shot" data mining approach to systems that are able to mine continuous, high-volume, open-ended data streams as they arrive. In this talk I will identify some desiderata for such systems, and outline our framework for realizing them. A key property of our approach is that it minimizes the time required to build a model on a stream, while guaranteeing (as long as the data is i.i.d.) that the model learned is effectively indistinguishable from the one that would be obtained using infinite data. Using this framework, we have successfully adapted several learning algorithms to massive data streams, including decision tree induction, Bayesian network learning, k-means clustering, and the EM algorithm for mixtures of Gaussians. These algorithms are able to process on the order of billions of examples per day using off-the-shelf hardware. Building on this, we have developed VFML, a library of software primitives for scaling arbitrary learning algorithms to massive data streams with minimal effort.
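The abstract does not say how the guarantee is obtained. One standard tool for this kind of argument is the Hoeffding inequality, which bounds how far a sample mean over n i.i.d. examples can stray from the true mean. The sketch below is only an illustration under that assumption, not the framework's actual code. The function names and parameters (value_range, delta, epsilon) are hypothetical and are not taken from VFML.

```python
import math


def hoeffding_epsilon(value_range, delta, n):
    """Half-width of the Hoeffding confidence interval after n i.i.d. examples.

    With probability at least 1 - delta, the sample mean of a quantity bounded
    in an interval of width value_range is within this distance of its true mean.
    """
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def examples_needed(value_range, delta, epsilon):
    """Smallest n for which the Hoeffding half-width is at most epsilon."""
    return math.ceil(value_range ** 2 * math.log(1.0 / delta) / (2.0 * epsilon ** 2))


if __name__ == "__main__":
    # Hypothetical setting: a score bounded in [0, 1], 1e-6 failure probability.
    n = examples_needed(value_range=1.0, delta=1e-6, epsilon=0.01)
    print(n, hoeffding_epsilon(1.0, 1e-6, n))
```

The design point is that n depends only on the desired precision and confidence, not on the length of the stream, so a learner can stop reading examples for a given decision once the bound is met.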



William F. Szewczyk (National Security Agency)
Time-evolving Adaptive Regression

Friday 2:30-3:00, Fountain I

Abstract:

The ability of sensors to track data in real time for extended periods gives one the opportunity to monitor evolving functional relationships. In this talk I will demonstrate how one can capture complex, nonstationary relations using time-evolving adaptive mixtures.



Olivier Verscheure (IBM T.J. Watson Research Center)
Quantization for Adapted GMM-Based Speaker Verification

Friday 3:00-3:30, Fountain I

Abstract:

State-of-the-art speaker verification systems are built around the likelihood ratio test, using Gaussian Mixture Models (GMMs) for the likelihood functions, a universal background model (UBM) to represent alternative speakers, and a form of Bayesian adaptation to derive speaker models from the UBM. This work tackles optimal quantizer design for the speech cepstral features (MFCCs) used in such systems. The problem is posed as minimizing the loss in log-likelihood ratio between the quantized and unquantized speech features. First, we show that the conventional mean squared error (MSE) quantizer for the top-scoring UBM Gaussian is optimal under practical assumptions. Then we derive the optimal bit allocation strategy across the dimensions of the feature vectors. Finally, we demonstrate the validity of the approach against various quantization and bit allocation schemes by running experiments on an appropriately modified IBM Speaker Verification system. Experimental results on the HUB4 corpora show negligible impact on verification performance at average bit rates below 1 bit per dimension, in contrast to 32 bits per dimension in the original system.
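To make the objective concrete, the following minimal sketch computes the log-likelihood ratio between a speaker GMM and a UBM for one feature vector, then measures how much that ratio changes when the vector is quantized. It is a toy under stated assumptions: diagonal covariances, hand-picked model parameters, and a plain uniform scalar quantizer standing in for the optimized quantizer described in the abstract. All function names and numbers are hypothetical.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at x."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def gmm_loglik(x, weights, means, variances):
    """Log-likelihood of x under a GMM, using log-sum-exp for stability."""
    terms = [math.log(w) + log_gauss_diag(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))


def llr(x, speaker, ubm):
    """Log-likelihood ratio: speaker model versus universal background model."""
    return gmm_loglik(x, *speaker) - gmm_loglik(x, *ubm)


def quantize(x, step):
    """Uniform scalar quantizer (an illustrative stand-in, not the optimal design)."""
    return [step * round(xi / step) for xi in x]


# Toy models (assumed values): the speaker model shares the UBM's weights and
# variances and differs only in its means, as with mean-only adaptation.
ubm = ([0.5, 0.5], [[0.0, 0.0], [2.0, 2.0]], [[1.0, 1.0], [1.0, 1.0]])
speaker = ([0.5, 0.5], [[0.2, -0.1], [2.3, 1.8]], [[1.0, 1.0], [1.0, 1.0]])

if __name__ == "__main__":
    x = [0.3, 1.7]
    for step in (1.0, 0.5, 0.1):
        loss = abs(llr(x, speaker, ubm) - llr(quantize(x, step), speaker, ubm))
        print(step, loss)
```

The quantity printed for each step size is the per-vector loss in log-likelihood ratio, which is the kind of distortion the abstract proposes to minimize in place of raw feature error.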