Statistical Learning and Model Selection II (John Rigsby, chair)


Jose-Miguel Yamal (Rice University and the University of Texas M.D. Anderson Cancer Center)
Dennis Cox (Rice University)
Multilevel Classification for Heterogeneous Data

Friday 10:30-10:50, San Rafael

Abstract:

Multilevel classification has gained increasing importance in many real-world applications, but it is not yet as well understood statistically as the general classification problem. Our goal is to detect cervical neoplasia (pre-cancer) using quantitative cytology, that is, measurements on the cells from a Pap smear. The case in which a high-dimensional feature vector is available for each cell has proved particularly challenging for researchers in statistics and machine learning. Historically, they have approached the problem in one of two ways: (a) ignoring the multilevel structure of the data and classifying at the microscopic (cell) level, then using mainly ad hoc methods to classify at the macroscopic (patient) level; or (b) reducing the micro-level data to a few summary statistics and using these to compare subjects at the macro level, which does not use the data optimally. We propose a more rigorous statistical approach, the Cumulative Log-Odds (CLO) method, to classify patients with cervical neoplasia. The CLO method can also handle high-dimensional problems; combining it with clustering in a likelihood framework helps to account for latent classes within which independence assumptions may be better satisfied. The method is well suited to the challenging problem of classifying heterogeneous data.
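The abstract does not give the CLO formula, so the following is only a minimal sketch of one plausible reading: if the cells of a patient are treated as conditionally independent, the patient-level log-odds is the prior log-odds plus the sum of per-cell log-likelihood ratios. The function names, the zero decision threshold, and the assumption that cell-level log-likelihood ratios are already available from some cell-level model are all hypothetical and not taken from the authors' work.

```python
def cumulative_log_odds(cell_log_likelihood_ratios, prior_log_odds=0.0):
    """Accumulate per-cell evidence into a patient-level log-odds.

    Assumes (hypothetically) that cells are conditionally independent given
    the patient's class, so that log-likelihood ratios simply add.
    """
    return prior_log_odds + sum(cell_log_likelihood_ratios)


def classify_patient(cell_log_likelihood_ratios, prior_log_odds=0.0, threshold=0.0):
    """Return True (neoplasia) when the accumulated log-odds exceeds the threshold."""
    score = cumulative_log_odds(cell_log_likelihood_ratios, prior_log_odds)
    return score > threshold


# Illustrative, made-up log-likelihood ratios for three cells of one patient.
example_cells = [0.5, -0.2, 1.0]
example_score = cumulative_log_odds(example_cells)
example_label = classify_patient(example_cells)
```

In this reading, clustering into latent classes would change how the per-cell ratios are computed, not how they are accumulated.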



Steven Cen (Biostatistics Division, Department of Preventative Medicine, Keck School of Medicine, University of Southern California)
Catherine Sugar (Information and Operations Management Department, University of Southern California)
Doug Stahl (City of Hope National Medical Center; Beckman Research)
David Conti (Biostatistics Division, Department of Preventative Medicine, Keck School of Medicine, University of Southern California)
Bryan Langholz (Biostatistics Division, Department of Preventative Medicine, Keck School of Medicine, University of Southern California)
Stanley Azen (Biostatistics Division, Department of Preventative Medicine, Keck School of Medicine, University of Southern California)
"STEAM Engine" with a Double Supervised Machine Learning in the Approach of Individualizing the Medical Treatment

Friday 10:50-11:10, San Rafael

Abstract:

The "STEAM (Searching Treatment Effect/Adverse-Effect Modifiers) Engine" is a clinical decision support application designed to identify optimal patient treatment options, taking into account multiple domains such as treatment side effects, disease prognostic factors, and genetic characteristics. The system employs a new double supervised machine learning technique called the Modified Homogeneity Score Searching Method (MHS-SM). MHS-SM combines a homogeneity score derived from the Breslow-Day test of homogeneity with a search strategy adopted from a supervised machine learning method, Classification and Regression Trees (CART). To study the benefit of extending supervised machine learning to double supervised machine learning, we compared MHS-SM and CART in simulation studies. The results showed that MHS-SM is more adept at detecting simulated treatment effect modifiers in the presence of marginal effects or independent confounding main effects. We also compared the two methods using data from a large-scale clinical trial in acute lymphoblastic leukemia. MHS-SM was able to detect treatment effect modifiers in this complex dataset, while CART was not.



Guilherme Rocha (UC Berkeley)
Peng Zhao (UC Berkeley)
Bin Yu (UC Berkeley)
Grouped and Hierarchical Model Selection through Composite Absolute Penalties (CAP)

Friday 11:10-11:30, San Rafael

Abstract:

Recently, much attention has been devoted to model selection through regularization methods in regression and classification, where features are selected by use of a penalty function (e.g., the Lasso of Tibshirani, 1996). While the resulting sparsity leads to more interpretable models, one may want to further incorporate natural groupings or hierarchical structures present within the features. Natural groupings arise in many situations. In gene expression data analysis, genes belonging to the same pathway might be viewed as a group. In ANOVA factor analysis, the dummy variables corresponding to the same factor form a natural group. In both cases, we want the features to be included in or excluded from the estimated model together as a group. Furthermore, if interaction terms are to be considered in ANOVA, a natural hierarchy exists, as the interaction term between two factors should only be included after the corresponding main effects. In other cases, as in the fitting of multi-resolution models such as wavelet regression, the hierarchy between bases on different resolution levels should be enforced; that is, the lower-resolution base should be included before any higher-resolution base in the same region. Our goal is to obtain model estimates that approximate the true model while preserving such group or hierarchical structures. Assume the data are given in the form {(Y_i, X_i); i=1,...,n}, where X_i in X, a subset of R^{d}, are the explanatory variables and Y_i in Y is a response variable. Assume also that the estimate for Y is of the form f(X)beta, where beta in R^{p} are the model coefficients and f: X -> X^{*}, with X^{*} a subset of R^{p}, gives the features. We obtain our model estimates by jointly minimizing a goodness-of-fit criterion, represented by a convex loss function L(beta, Y, X), and a suitably crafted CAP (Composite Absolute Penalty) function. Such a framework fits within that of penalized regression.
The CAP penalty function is constructed by first defining groups G_i, i=1,...,k, that reflect the natural structure among the features. A new vector is then formed by collecting the L_{gamma_{i}} norms (i=1,...,k) of the coefficients beta_{G_i} associated with the features within each of the groups. These are the group-norms, and they are allowed to differ from group to group. The CAP penalty is then defined to be the L_{gamma_{0}} norm (the overall norm) of this new vector. By properly selecting the group-norms and the overall norm, selection of variables can be done in a grouped fashion (Grouped Lasso by Yuan and Lin, 2004, and Blockwise Sparse Regression by Kim et al., 2005, are special cases of this penalty class). In addition, when the groups are defined to overlap, this construction of the penalty provides a mechanism for expressing hierarchical relationships between the features. When constructed with gamma_{i} >= 1 for i=0,...,k, the CAP penalty functions closely resemble proper norms and are proven to be convex, which renders CAP computationally feasible. In this case, the BLASSO algorithm (Zhao & Yu, 2004) can be used to trace the regularization path. In particular, in least squares regression, when the norms are restricted to combinations of L_{1} and L_{infty} norms, the regularization paths are piecewise linear. We therefore provide LARS-fashioned (Efron et al., 2004) algorithms, which jump between the turning points of the piecewise linear path, to compute the entire regularization path efficiently.
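The penalty construction described above can be sketched in a few lines of Python. This is a minimal illustration that follows the wording of the abstract (the overall L_{gamma_{0}} norm of the vector of group-norms), not the authors' implementation; the function names `lp_norm` and `cap_penalty`, the index-list representation of groups, and the example coefficients are all assumptions made for illustration.

```python
import math


def lp_norm(values, gamma):
    """L_gamma norm of a sequence of numbers; gamma may be math.inf."""
    abs_vals = [abs(v) for v in values]
    if not abs_vals:
        return 0.0
    if math.isinf(gamma):
        return max(abs_vals)
    return sum(a ** gamma for a in abs_vals) ** (1.0 / gamma)


def cap_penalty(beta, groups, group_gammas, overall_gamma):
    """Composite absolute penalty as worded in the abstract.

    beta          -- list of coefficients
    groups        -- list of index lists G_1..G_k (groups may overlap)
    group_gammas  -- gamma_1..gamma_k, one norm order per group
    overall_gamma -- gamma_0, the order of the overall norm
    """
    if len(groups) != len(group_gammas):
        raise ValueError("need exactly one gamma per group")
    group_norms = [
        lp_norm([beta[j] for j in group], gamma)
        for group, gamma in zip(groups, group_gammas)
    ]
    return lp_norm(group_norms, overall_gamma)


# Non-overlapping groups with L_2 group-norms and an L_1 overall norm,
# the combination the abstract associates with grouped selection.
beta = [3.0, -4.0, 0.0, 2.0]
grouped = cap_penalty(beta, [[0, 1], [2, 3]], [2, 2], 1)

# Overlapping groups: index 0 is a main effect, index 1 an interaction.
# The interaction sits in both groups, so it is penalized twice.
hierarchical = cap_penalty([1.0, 0.5], [[0, 1], [1]], [math.inf, math.inf], 1)
```

With gamma_{i} >= 1 throughout, each group-norm and the overall norm are ordinary norms, which is the setting in which the abstract states the penalty is convex.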



Hanying Zhou (Jet Propulsion Laboratory, California Institute of Technology)
Abhijit Shevade (Jet Propulsion Laboratory, California Institute of Technology)
Christine Pelletier (Jet Propulsion Laboratory, California Institute of Technology)
Margie Homer (Jet Propulsion Laboratory, California Institute of Technology)
Margaret Ryan (Jet Propulsion Laboratory, California Institute of Technology)
Quasi Real Time Data Analysis for Air Quality Monitoring with an Electronic Nose

Friday 11:30-11:50, San Rafael

Abstract:

JPL is developing a third-generation Electronic Nose (ENose) for a technology demonstration of air quality event monitoring aboard the International Space Station (ISS). Currently there is no device capable of continuously monitoring the air quality of human habitats in spacecraft. The ENose is an array-based sensing system with 32 polymer/carbon composite conductometric sensors. The ability of the ENose to autonomously and continuously detect, identify, and quantify, in quasi real time, specific hazardous compounds that might be released through a leak or a spill in a spacecraft crew cabin will greatly enhance the safety of astronauts. In this paper, we discuss various issues and techniques in mining and analyzing the ENose sensor data, including baseline drift accommodation, event detection, event identification and quantification, humidity subtraction, functional group classification, and potential model building for unknowns or model updates. Most of the discussion is based on data from our second-generation sensor.