Statistical Learning and Model Selection II
(John Rigsby, chair)
Jose-Miguel Yamal (Rice University and the University of Texas M.D. Anderson Cancer Center)
Dennis Cox (Rice University)
Multilevel Classification for Heterogeneous Data
Friday 10:30-10:50, San Rafael
Abstract:
Multilevel classification has gained increasing importance in many
real-world applications, but it has not yet
received the same statistical understanding as the general problem of
classification. Our goal is to detect cervical neoplasia (pre-cancer) using
quantitative cytology, measurements on the cells from a pap smear. In
particular, the case where we have a high-dimensional feature vector on each
cell has proved to be a challenging problem for researchers in the
statistical and machine learning areas. They have historically approached
this problem in two ways: a) ignoring this multilevel structure of the data
and performing classification at the microscopic level, using mainly ad-hoc
methods to classify at the macroscopic level, or b) summarizing the
micro-level data into a few summary statistics and then using these to
compare the subjects at the macro level, and hence not using the data in an
optimal way. We propose using a more rigorous statistical approach, the
Cumulative Log-Odds (CLO) Method, to classify patients with cervical
neoplasia. Combining the CLO method, which also can handle problems of
high-dimensionality, with clustering in a likelihood framework helps to
account for latent classes where independence assumptions may be better
satisfied. This method is well-suited for the challenging problem of
classification of heterogeneous data.
Steven Cen (Biostatistics Division, Department of Preventative Medicine, Keck School of Medicine,
University of Southern California)
Catherine Sugar (Information and Operations Management Department, University of Southern California)
Doug Stahl (City of Hope National Medical Center; Beckman Research)
David Conti (Biostatistics Division, Department of Preventative Medicine, Keck School of Medicine,
University of Southern California)
Bryan Langholz (Biostatistics Division, Department of Preventative Medicine, Keck School of Medicine,
University of Southern California)
Stanley Azen (Biostatistics Division, Department of Preventative Medicine, Keck School of Medicine,
University of Southern California)
"STEAM Engine" with a Double Supervised Machine Learning in the Approach of Individualizing the Medical Treatment
Friday 10:50-11:10, San Rafael
Abstract:
The "STEAM (Searching Treatment Effect/Adverse-Effect Modifiers) Engine" is a clinical decision support application designed to identify optimal patient treatment options, taking into account multiple domains such as treatment side effects, disease prognostic factors, and genetic characteristics. The system employs a new double supervised machine learning technique called the Modified Homogeneity Score Searching Method (MHS-SM). MHS-SM combines a homogeneity score derived from the Breslow-Day homogeneity test with a search strategy adopted from a supervised machine learning method, Classification and Regression Trees (CART). To study the benefit of extending supervised machine learning to double supervised machine learning, we compared MHS-SM and CART in simulation studies. The results showed that MHS-SM is more adept at detecting simulated treatment effect modifiers in the presence of marginal effects or independent confounding main effects. We also compared the two methods using data from a large-scale clinical trial in acute lymphoblastic leukemia. The results showed that MHS-SM was able to detect treatment effect modifiers in this complex dataset, while CART was not.
Guilherme Rocha (UC Berkeley)
Peng Zhao (UC Berkeley)
Bin Yu (UC Berkeley)
Grouped and Hierarchical Model Selection through Composite Absolute Penalties (CAP)
Friday 11:10-11:30, San Rafael
Abstract:
Recently much attention has been devoted to model selection through
regularization methods in regression and classification where
features are selected by use of a penalty function (e.g. Lasso in
Tibshirani, 1996). While the resulting sparsity leads to more
interpretable models, one may want to further incorporate natural
groupings or hierarchical structures present within the
features.
Natural grouping arises in many situations. For gene expression data
analysis, genes belonging to the same pathway might be viewed as a
group. In ANOVA factor analysis, the dummy variables corresponding to
the same factor form a natural group. For both cases, we want the
features to be excluded and included in the estimated model
together as a group. Furthermore, if interaction terms are to be
considered in ANOVA, a natural hierarchy exists as the interaction term between two factors should only be included after the corresponding main effects. In other
cases, as in the fitting of multi-resolution models such as wavelet
regression, the hierarchy between bases on different
resolution levels should be enforced, that is, the lower resolution
base should be included before any higher resolution base in the same
region. Our goal is to obtain model estimates that approximate the true model while preserving such group or hierarchical structures.
Assume the data are given in the form
{(Y_i,X_i);i=1,...,n}, where the X_i in X, a subset of R^{d}, are
explanatory variables and the Y_i in Y are responses.
Assume also that the estimate for Y is of the form
f(X)beta, where beta in R^{p} are the model coefficients and
f:X -> X^{*}, a subset of R^{p}, gives the features.
We obtain our model estimates by jointly minimizing a goodness-of-fit criterion, represented by a convex loss function L(beta, Y, X), and a suitably
crafted Composite Absolute Penalty (CAP) function. This
framework falls within that of penalized regression.
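Written out, the criterion described above might take the following form. The regularization parameter lambda >= 0 and the symbol T for the CAP penalty are notation assumed here for illustration; neither appears in the abstract.

```latex
\hat{\beta}(\lambda)
  = \arg\min_{\beta \in \mathbb{R}^{p}}
    \Bigl[ L(\beta, Y, X) + \lambda \, T(\beta) \Bigr],
  \qquad \lambda \ge 0
```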
The CAP penalty function is constructed by first defining groups G_i,
i=1,...,k that reflect the natural structure among the features. A
new vector is then formed by collecting the L_{gamma_{i}} (i=1,...,k) norm of the coefficients beta_{G_i} associated with the features
within each of the groups. These are the group-norms and they are
allowed to differ from group to group. The CAP penalty is then defined to
be the L_{gamma_{0}} norm (the overall norm) of this new vector. By
properly selecting the group-norms and the overall norm, selection of
variables can be done in a grouped fashion (Grouped Lasso by Yuan and
Lin, 2004 and Blockwise Sparse Regression by Kim et al., 2005 are
special cases of this penalty class). In addition, when the
groups are defined to overlap, this construction of penalty provides a
mechanism for expressing hierarchical relationships between the features.
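The construction just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names lp_norm and cap_penalty are hypothetical, groups are given as lists of coefficient indices, and the overall penalty is taken to be the L_{gamma_{0}} norm exactly as worded above.

```python
import numpy as np


def lp_norm(values, gamma):
    """L_gamma norm of a vector; gamma may be np.inf for the max norm."""
    v = np.abs(np.asarray(values, dtype=float))
    if np.isinf(gamma):
        return float(v.max())
    return float((v ** gamma).sum() ** (1.0 / gamma))


def cap_penalty(beta, groups, group_gammas, overall_gamma):
    """Composite penalty: overall norm of the vector of group norms.

    beta          -- coefficient vector of length p
    groups        -- list of index lists G_1..G_k (may overlap)
    group_gammas  -- gamma_1..gamma_k, one norm order per group
    overall_gamma -- gamma_0, the order of the overall norm
    (All argument names are assumptions made for this sketch.)
    """
    beta = np.asarray(beta, dtype=float)
    # Step 1: one norm per group, each with its own order gamma_i.
    group_norms = [
        lp_norm(beta[list(idx)], g) for idx, g in zip(groups, group_gammas)
    ]
    # Step 2: the overall norm of the resulting k-vector.
    return lp_norm(group_norms, overall_gamma)


# Two disjoint groups with L_2 group norms and an L_1 overall norm,
# which gives a grouped-lasso-style penalty.
beta = [3.0, -4.0, 0.0, 2.0]
grouped = cap_penalty(beta, [[0, 1], [2, 3]], [2, 2], 1)

# Overlapping groups: the first group contains the second, one way
# such a construction could encode a hierarchy between features.
nested = cap_penalty(beta, [[0, 1], [1]], [np.inf, np.inf], 1)
```

With disjoint groups, L_{2} group norms, and an L_{1} overall norm, the penalty is the sum of the Euclidean norms of the groups, so whole groups tend to enter or leave the model together.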
When constructed with gamma_{i} >= 1 for i=0,...,k,
the CAP penalty functions closely resemble proper norms
and are proven to be convex, which makes CAP computationally feasible.
In this case, the BLASSO algorithm (Zhao & Yu, 2004) can be used to trace the regularization path.
In particular, in least squares regression, when the norms are
restricted to combinations of L_{1} and L_{infty} norms,
the regularization paths are piecewise linear.
We therefore provide LARS-style (Efron et al., 2004) algorithms,
which jump between the turning points of the piecewise linear path, to
compute the entire regularization path efficiently.
Hanying Zhou (Jet Propulsion Laboratory, California Institute of Technology)
Abhijit Shevade (Jet Propulsion Laboratory, California Institute of Technology)
Christine Pelletier (Jet Propulsion Laboratory, California Institute of Technology)
Margie Homer (Jet Propulsion Laboratory, California Institute of Technology)
Margaret Ryan (Jet Propulsion Laboratory, California Institute of Technology)
Quasi Real Time Data Analysis for Air Quality Monitoring with an Electronic Nose
Friday 11:30-11:50, San Rafael
Abstract:
JPL is developing a third-generation Electronic Nose (ENose) for a technology demonstration of air quality event monitoring aboard the International Space Station (ISS). Currently there is no device capable of continuously monitoring the air quality of human habitats in spacecraft. The ENose is an array-based sensing system with 32 polymer/carbon composite conductometric sensors. The ability of the ENose to autonomously and continuously detect, identify, and quantify, in quasi real time, specific hazardous compounds that might be released through a leak or a spill in a spacecraft crew cabin will greatly enhance the safety of astronauts. In this paper, we discuss various issues and techniques in mining and analyzing the ENose sensor data, including baseline drift accommodation, event detection, event identification and quantification, humidity subtraction, functional group classification, and potential model building for unknowns or model updates. Most of the discussion will be based on our second-generation sensor data.