Statistical Learning and Model Selection I (Susan Paddock, chair)
Tao Shi (The Ohio State University)
Bin Yu (UC Berkeley)
Binning in Gaussian Kernel Regularization
Thursday 4:00-4:20, San Marino
Abstract:
Gaussian kernel regularization is widely
used in the machine learning literature and has proved successful in
many empirical experiments. The periodic version of Gaussian
kernel regularization has been shown to be minimax rate optimal in
estimating functions in any finite order Sobolev space. However,
for a data set with n points, the computation complexity of the
Gaussian kernel regularization method is of order O(n^3).
In this paper we propose to use binning to reduce the computation
of Gaussian kernel regularization in both regression and
classification. For periodic Gaussian kernel regression, we
show that the binned estimator achieves the same minimax rates as
the unbinned estimator, but the computation is reduced to O(m^3)
with m as the number of bins. To achieve the minimax rate in the
k-th order Sobolev space, m needs to be in the order of
O(kn^{1/(2k+1)}), which makes the binned estimator computation
of order O(n) for k=1, and even less for larger k. Our
simulations show that the binned estimator (binning 120 data
points into 20 bins in our simulation) provides almost the same
accuracy with only 0.4% of computation time. For classification,
binning with L2-loss Gaussian kernel
regularization and Gaussian kernel Support Vector Machines is
tested in a polar cloud detection problem.
Haixun Wang (IBM T.J. Watson Research Center)
Jian Yin (IBM T.J. Watson Research Center)
Phillip S. Yu (IBM T.J. Watson Research Center)
Classifying Streams of Changing Data Distribution
Thursday 4:20-4:40, San Marino
Abstract:
Mining data streams of changing distribution is important for real-time
business decision support. The stream classifiers must evolve to
reflect the current data distribution, which is embodied by the most
recent training data. However, relying solely on the most recent
training data can result in a biased classifier, since the most recent
training data is usually a biased sample of the current data
distribution. This reduces classification accuracy, especially when
we are classifying rare events (e.g., positive instances may not occur
in the most recent training data). Incorporating historical data can
form a more representative sample, but it also increases the chances
of learning obsolete models. To address this issue, we use a
stochastic model to describe the concept shifting patterns and
formulize this problem as an optimization problem: given historical
data distributions and the most recent training data, find the most
likely current data distribution, and learn a classifier based on the
most-likely distribution. We derive an analytic solution and
approximate this solution with an efficient algorithm. We evaluate our
algorithm with both synthetic and real-world data. Our results show
that our algorithm produces accurate and efficient classification.
Zhengjun Zhang (University of Wisconsin, Madison)
Quotient Metric for Microarray Data
Thursday 4:40-5:00, San Marino
Abstract:
In this paper, a new dependence measure is proposed for microarray
data. This new measure uses the concept of the quotient correlation
which is different from the classical correlations: Pearson's
product correlations, and Spearman's rank correlations. It measures
nonlinear dependencies, tail dependencies between variables. It can
be used to perform gene clustering, gene classification, and to
identify unusual genes.
Wolfgang Jank (Department of Decision and Information Technologies, The Robert H. Smith School of Business, University of Maryland)
Galit Shmeuli (Department of Decision and Information Technologies, The Robert H. Smith School of Business, University of Maryland)
Modeling Concurrent Events in Online Auctions Using Spatio-temporal Semiparametric Models
Thursday 5:00-5:20, San Marino
Abstract:
We introduce a semiparametric approach for modeling the effect of concurrent events on an outcome of interest. Concurrency manifests itself as temporal and spatial dependencies. By temporal dependency we mean the effect of an event in the past. Modeling this effect is challenging since events arrive at irregularly spaced time intervals and thus the standard definition of time-lags does not apply. Our concurrency model also takes spatial effects into account. We interpret the meaning of ``space" in a slightly non-traditional way in order to conceptualize the more abstract notion of space among a set of item-features. We motivate our model in the context of eBay's online auctions. In particular, we model the effect of concurrent auctions on an auction's price. Our concurrency model consists of three components: a transaction-related component that accounts for auction-design and bidding-competition, a spatial component that takes into account similarity among item-features, and a temporal component that accounts for events in the past. To construct each of these model components, we borrow ideas from spatial modeling and from the mixed model methodology. We also develop a new time-lag metric to handle unevenly-spaced time series by interpreting a time-lag as the information distributed over a certain time-window in the past and by incorporating this information via inclusion of appropriate summary measures. We illustrate the power of this model by applying it to a large and diverse set of laptop auctions crawled off eBay.com. We show that our model results in superior predictive performance compared to a set of competitor models. Our model also allows for new insight into the factors that drive price in eBay's online auctions and their relationship to bidding-competition, auction-design, product-variety and temporal learning effects.