Curve Fitting for Massive, Complex Data Sets (Michael G. Schimek, organizer)


Simon Wood (University of Bath)
Smooth Modelling of Large Datasets

Friday 4:00-4:30, Fountain II

Abstract:

Smooth modelling of very large datasets presents a computational challenge, particularly with regard to estimating the degree of model smoothness. This talk examines cross-validation-based smoothing parameter estimation, particularly in the context of additive models for large datasets. After reviewing additive models represented using penalized regression splines and smoothness selection by generalized cross validation (GCV), numerical methods are proposed that substantially increase the size of dataset that can feasibly be modelled in this way.
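As a rough illustration of GCV-based smoothness selection for a penalized regression spline, the following Python sketch fits a single smooth term and picks the smoothing parameter on a grid. It is not the speaker's method for large datasets: the truncated linear basis, the ridge-type penalty, the dense influence matrix, the simulated data, and all function names are assumptions chosen for brevity.

```python
import numpy as np

def truncated_linear_basis(x, knots):
    """Design matrix with columns 1, x, and (x - k)_+ for each knot k."""
    cols = [np.ones_like(x), x] + [np.maximum(x - k, 0.0) for k in knots]
    return np.column_stack(cols)

def gcv_and_edf(X, y, S, lam):
    """Return (GCV score, effective degrees of freedom) for penalty weight lam."""
    n = len(y)
    # Influence matrix A = X (X'X + lam * S)^{-1} X' of the penalized fit.
    A = X @ np.linalg.solve(X.T @ X + lam * S, X.T)
    resid = y - A @ y
    edf = np.trace(A)
    # GCV score: n * RSS / (n - tr(A))^2
    return n * float(resid @ resid) / (n - edf) ** 2, edf

def select_lambda(x, y, knots, lambdas):
    """Grid search for the smoothing parameter that minimizes GCV."""
    X = truncated_linear_basis(x, knots)
    # Assumed penalty: ridge on the knot coefficients only,
    # leaving the intercept and the linear term unpenalized.
    S = np.eye(X.shape[1])
    S[0, 0] = S[1, 1] = 0.0
    results = [gcv_and_edf(X, y, S, lam) for lam in lambdas]
    scores = np.array([r[0] for r in results])
    edfs = np.array([r[1] for r in results])
    return lambdas[int(np.argmin(scores))], scores, edfs

# Simulated data (purely illustrative, not from the talk).
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, 200))
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, 200)
knots = np.linspace(0.05, 0.95, 20)
lambdas = 10.0 ** np.arange(-4, 7)

best_lam, scores, edfs = select_lambda(x, y, knots, lambdas)
print("selected lambda:", best_lam)
```

The sketch forms the full n-by-n influence matrix, which is only workable for small n. The same trace can be obtained from the p-by-p matrix (X'X + lam S)^{-1} X'X, where p is the number of basis functions, which hints at why careful numerical methods matter as n grows.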



Marlene Mueller (ITWM Fraunhofer)
Michael G. Schimek (Medical University of Graz)
Classification of High-dimensional Data by Semiparametric Generalized Regression Models

Friday 4:30-5:00, Fountain II

Abstract:

Besides achieving optimal classification results, it is often necessary to visualize and interpret the fitted classification rules. A main issue is how the features affect a classifier. Our approach is to consider semiparametric variants of the logistic regression model or, more generally, of the generalized linear regression model. A wide class of such models can be defined by using nonparametric function estimates within the argument of the link function. This class includes generalized additive and generalized partial linear models as well as combinations of their components (Mueller, 2001). We introduce and compare different estimation approaches that have been proposed over the years for this class and that might even be suitable for high-dimensional data. These include, in particular, backfitting (Mammen, Linton and Nielsen, 1999; Nielsen and Sperlich, 2005) and marginal integration (Hengartner and Sperlich, 2005). Finally, computationally efficient implementations in R are considered.
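For orientation, the two best-known members of this model class can be written as follows. The notation is an assumption and may differ from the speakers' own: G is the known link function in the sense used in the abstract (the inverse link in GLM terminology), m and m_j are unknown smooth functions, c is a constant, and beta is a parametric coefficient vector.

```latex
% Generalized additive model (GAM)
E(Y \mid X) = G\Bigl\{ c + \sum_{j=1}^{d} m_j(X_j) \Bigr\}

% Generalized partial linear model (GPLM)
E(Y \mid X, T) = G\bigl\{ X^{\top}\beta + m(T) \bigr\}
```

For a binary class label Y, taking G(u) = 1/(1 + exp(-u)) gives the semiparametric variants of logistic regression.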

References:

Hengartner, N.W. and Sperlich, S. (2005): Rate-optimal estimation with the integration method in the presence of many covariates. J. Multivariate Anal., 95, 246-272.

Mammen, E., Linton, O. and Nielsen, J.P. (1999): The existence and asymptotic properties of a backfitting projection algorithm under weak conditions. Ann. Statist., 27, 1443-1490.

Mueller, M. (2001): Estimation and testing in generalized partial linear models - A comparative study. Statistics and Computing, 11, 299-309.

Nielsen, J.P. and Sperlich, S. (2005): Smooth backfitting in practice. J. Royal Statist. Soc., B, 67, 43-61.



Lijian Yang (Michigan State University)
Li Wang (Michigan State University)
Efficient and Fast Spline-backfitted Kernel Smoothing of Additive Regression Models

Friday 5:00-5:30, Fountain II

Abstract:

We propose a one-step backfitting estimator of the component functions in an additive regression model, using spline smoothing in the first stage and kernel smoothing in the second. Under weak conditions, the estimator is asymptotically equivalent to a univariate kernel estimator, effectively reducing the dimension to one. Monte Carlo evidence for a wide range of dimensions and sample sizes supports the asymptotic results.
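The two-stage idea can be sketched in Python as follows: a least-squares spline fit of the full additive model provides pilot estimates, the pilot estimates of all other components are subtracted from the response, and the resulting pseudo-response is smoothed against the covariate of interest with a kernel estimator. This is only a sketch under stated assumptions, not the authors' estimator: the centered piecewise-linear basis, the Gaussian Nadaraya-Watson smoother, the knots, the bandwidth, the simulated data, and all function names are illustrative choices.

```python
import numpy as np

def spline_basis(x, knots):
    """Centered piecewise-linear spline basis (no intercept) for one covariate."""
    B = np.column_stack([x] + [np.maximum(x - k, 0.0) for k in knots])
    return B - B.mean(axis=0)

def sbk_component(X, y, target, knots, bandwidth, grid):
    """Spline-then-kernel estimate of component `target`, evaluated on `grid`."""
    n, d = X.shape
    # Stage 1: least-squares spline fit of the whole additive model.
    blocks = [spline_basis(X[:, j], knots) for j in range(d)]
    design = np.column_stack([np.ones(n)] + blocks)
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    # Pseudo-response: remove the intercept and the pilot fits of other components.
    width = blocks[0].shape[1]
    pseudo = y - coef[0]
    for j in range(d):
        if j != target:
            beta_j = coef[1 + j * width: 1 + (j + 1) * width]
            pseudo = pseudo - blocks[j] @ beta_j
    # Stage 2: univariate Nadaraya-Watson smoothing with a Gaussian kernel.
    u = (grid[:, None] - X[None, :, target]) / bandwidth
    w = np.exp(-0.5 * u ** 2)
    return (w @ pseudo) / w.sum(axis=1)

# Simulated two-covariate additive model (illustrative only).
rng = np.random.default_rng(1)
n = 1000
X = rng.uniform(0.0, 1.0, size=(n, 2))
y = (2.0 + np.sin(2 * np.pi * X[:, 0])
     + (X[:, 1] ** 2 - 1.0 / 3.0)
     + rng.normal(0.0, 0.2, n))

grid = np.linspace(0.1, 0.9, 17)
m1_hat = sbk_component(X, y, target=0, knots=np.linspace(0.1, 0.9, 8),
                       bandwidth=0.05, grid=grid)
max_err = np.max(np.abs(m1_hat - np.sin(2 * np.pi * grid)))
print("max abs error on grid:", max_err)
```

The second stage is a purely one-dimensional smoothing problem regardless of the number of covariates, which is the sense in which the dimension is reduced to one.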



Joan Staniswalis (University of Texas, El Paso)
A Novel Application of Functional Data Analysis to High-resolution Data from Environmental Epidemiology

Friday 5:30-6:00, Fountain II

Abstract:

In most of the literature, the daily average of hourly particulate matter (PM) measurements is one of many covariates in a log-linear model predicting daily mortality. Estimates of the change in mortality after exposure to PM vary by geographical region. For example, in places such as El Paso County, where the level and type of PM exposure can vary greatly throughout a given day, reducing the hourly PM measurements to the daily mean results in a great loss of information, to the extent that average PM is not significantly associated with mortality. Furthermore, the daily average PM is usually included in the predictive model with lags of 1 to 3 days. This has led to much discussion as to whether the effects of PM are long-term, leading to chronic health problems, or short-term, in that mostly only highly sensitive individuals are affected. Here the 24 hourly observations of PM are viewed as curve data to which techniques from functional data analysis can be applied. Each day in the four-year time period contributes one PM profile for prediction of daily mortality. Principal component analysis (Rice and Silverman 1991; Silverman 1996; Ramsay and Silverman 1997) using the Karhunen-Loeve expansion of the profiles and the historical functional linear model (Malfait and Ramsay 2003) implemented with P-splines (Eilers and Marx 1996) are considered here. Principal component analysis of the PM hourly profiles provides better summary statistics than the daily mean for estimating risk of mortality.
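To illustrate why component scores of the hourly profiles can carry information that the daily mean discards, the following Python sketch applies ordinary PCA (the discrete sample analogue of the Karhunen-Loeve expansion) to a days-by-24 matrix. The data are synthetic and do not represent the El Paso measurements; the morning/evening peak shapes, noise level, and all names are assumptions, and no smoothing of the components is attempted.

```python
import numpy as np

def profile_pca(profiles, n_components=3):
    """PCA of a (days x 24) matrix of hourly profiles via the SVD."""
    mean_profile = profiles.mean(axis=0)
    centered = profiles - mean_profile
    _, s, Vt = np.linalg.svd(centered, full_matrices=False)
    components = Vt[:n_components]       # rows are principal component curves
    scores = centered @ components.T     # one score vector per day
    explained = (s ** 2 / np.sum(s ** 2))[:n_components]
    return mean_profile, components, scores, explained

# Synthetic profiles: days differ in WHEN the exposure peaks (morning vs evening),
# while the daily mean stays roughly constant by construction.
rng = np.random.default_rng(2)
hours = np.arange(24)
morning = np.exp(-0.5 * ((hours - 8) / 2.0) ** 2)
evening = np.exp(-0.5 * ((hours - 18) / 2.0) ** 2)
contrast = morning - evening
contrast -= contrast.mean()              # zero-mean shape, so it cannot move the daily mean
shape_weight = rng.normal(0.0, 5.0, 400)
profiles = (30.0 + shape_weight[:, None] * contrast
            + rng.normal(0.0, 1.0, (400, 24)))

mean_profile, components, scores, explained = profile_pca(profiles)
daily_mean = profiles.mean(axis=1)
corr_pc1 = abs(np.corrcoef(scores[:, 0], shape_weight)[0, 1])
corr_mean = abs(np.corrcoef(daily_mean, shape_weight)[0, 1])
print("|corr| of PC1 score with shape weight:", corr_pc1)
print("|corr| of daily mean with shape weight:", corr_mean)
```

In this constructed example the first component score tracks the within-day timing of exposure, whereas the daily mean is nearly unrelated to it. The scores could then be used as covariates in a mortality model in place of the daily mean.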

References:

Eilers, P.H.C. and Marx, B.D. (1996): Flexible smoothing with B-splines and penalties. Statist. Sci., 11, 89-121.

Malfait, N. and Ramsay, J.O. (2003): The historical functional linear model. Canad. J. Statist., 31, 1-15.

Ramsay, J.O. and Silverman, B.W. (1997): Functional Data Analysis. New York, Springer.

Rice, J.A. and Silverman, B.W. (1991): Estimating the mean and covariance structure nonparametrically when the data are curves. J. Royal Statist. Soc., B, 53, 233-243.

Silverman, B.W. (1996): Smoothed functional principal components analysis by choice of norm. Ann. Statist., 24, 1-24.