Statistical Learning and Model Selection I (Susan Paddock, chair)


Tao Shi (The Ohio State University)
Bin Yu (UC Berkeley)
Binning in Gaussian Kernel Regularization

Thursday 4:00-4:20, San Marino

Abstract:

Gaussian kernel regularization is widely used in the machine learning literature and has proved successful in many empirical experiments. The periodic version of Gaussian kernel regularization has been shown to be minimax rate optimal in estimating functions in any finite order Sobolev space. However, for a data set with n points, the computation complexity of the Gaussian kernel regularization method is of order O(n^3). In this paper we propose to use binning to reduce the computation of Gaussian kernel regularization in both regression and classification. For periodic Gaussian kernel regression, we show that the binned estimator achieves the same minimax rates as the unbinned estimator, but the computation is reduced to O(m^3) with m as the number of bins. To achieve the minimax rate in the k-th order Sobolev space, m needs to be in the order of O(kn^{1/(2k+1)}), which makes the binned estimator computation of order O(n) for k=1, and even less for larger k. Our simulations show that the binned estimator (binning 120 data points into 20 bins in our simulation) provides almost the same accuracy with only 0.4% of computation time. For classification, binning with L2-loss Gaussian kernel regularization and Gaussian kernel Support Vector Machines is tested in a polar cloud detection problem.



Haixun Wang (IBM T.J. Watson Research Center)
Jian Yin (IBM T.J. Watson Research Center)
Phillip S. Yu (IBM T.J. Watson Research Center)
Classifying Streams of Changing Data Distribution

Thursday 4:20-4:40, San Marino

Abstract:

Mining data streams of changing distribution is important for real-time business decision support. The stream classifiers must evolve to reflect the current data distribution, which is embodied by the most recent training data. However, relying solely on the most recent training data can result in a biased classifier, since the most recent training data is usually a biased sample of the current data distribution. This reduces classification accuracy, especially when we are classifying rare events (e.g., positive instances may not occur in the most recent training data). Incorporating historical data can form a more representative sample, but it also increases the chances of learning obsolete models. To address this issue, we use a stochastic model to describe the concept shifting patterns and formulize this problem as an optimization problem: given historical data distributions and the most recent training data, find the most likely current data distribution, and learn a classifier based on the most-likely distribution. We derive an analytic solution and approximate this solution with an efficient algorithm. We evaluate our algorithm with both synthetic and real-world data. Our results show that our algorithm produces accurate and efficient classification.



Zhengjun Zhang (University of Wisconsin, Madison)
Quotient Metric for Microarray Data

Thursday 4:40-5:00, San Marino

Abstract:

In this paper, a new dependence measure is proposed for microarray data. This new measure uses the concept of the quotient correlation which is different from the classical correlations: Pearson's product correlations, and Spearman's rank correlations. It measures nonlinear dependencies, tail dependencies between variables. It can be used to perform gene clustering, gene classification, and to identify unusual genes.


Wolfgang Jank (Department of Decision and Information Technologies, The Robert H. Smith School of Business, University of Maryland)
Galit Shmeuli (Department of Decision and Information Technologies, The Robert H. Smith School of Business, University of Maryland)
Modeling Concurrent Events in Online Auctions Using Spatio-temporal Semiparametric Models

Thursday 5:00-5:20, San Marino

Abstract:

We introduce a semiparametric approach for modeling the effect of concurrent events on an outcome of interest. Concurrency manifests itself as temporal and spatial dependencies. By temporal dependency we mean the effect of an event in the past. Modeling this effect is challenging since events arrive at irregularly spaced time intervals and thus the standard definition of time-lags does not apply. Our concurrency model also takes spatial effects into account. We interpret the meaning of ``space" in a slightly non-traditional way in order to conceptualize the more abstract notion of space among a set of item-features. We motivate our model in the context of eBay's online auctions. In particular, we model the effect of concurrent auctions on an auction's price. Our concurrency model consists of three components: a transaction-related component that accounts for auction-design and bidding-competition, a spatial component that takes into account similarity among item-features, and a temporal component that accounts for events in the past. To construct each of these model components, we borrow ideas from spatial modeling and from the mixed model methodology. We also develop a new time-lag metric to handle unevenly-spaced time series by interpreting a time-lag as the information distributed over a certain time-window in the past and by incorporating this information via inclusion of appropriate summary measures. We illustrate the power of this model by applying it to a large and diverse set of laptop auctions crawled off eBay.com. We show that our model results in superior predictive performance compared to a set of competitor models. Our model also allows for new insight into the factors that drive price in eBay's online auctions and their relationship to bidding-competition, auction-design, product-variety and temporal learning effects.