Modelling (Guilherme Rocha, chair)


Christine Smyth (Statistics and Intelligent Data Analysis Group, School of Mathematical and Physical Sciences, James Cook University)
Danny Coomans (Statistics and Intelligent Data Analysis Group, School of Mathematical and Physical Sciences, James Cook University)
Parsimonious Ensembles for Regression

Friday 4:00-4:30, Fountain I

Abstract:

An ensemble of regression models predicts by taking a weighted average of the predictions made by the individual models. Predictions based on ensembles have been shown to be very effective on large datasets. Calculating the weights so that they reflect the accuracy of the individual models (post-processing the ensemble) has been shown to increase an ensemble's accuracy. The success of previous research motivates the study of other strategies as potential post-processing techniques. This paper introduces post-processing techniques and demonstrates the improvements attained by using more parsimonious ensembles of linear regression models and regression trees.
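
As an illustration of the general idea only, the following minimal Python sketch weights each model by the inverse of its mean squared error on held-out data and, optionally, zeroes the weights of all but the most accurate models to obtain a more parsimonious ensemble. The abstract does not say which post-processing techniques the paper uses, so the inverse-MSE rule, the `keep` parameter, and both function names are assumptions made for this example.

```python
def inverse_mse_weights(predictions, targets, keep=None):
    """Return one normalised weight per model (illustrative rule, not the paper's).

    predictions: list of per-model prediction lists on held-out data
    targets:     observed responses for the same held-out data
    keep:        if given, only the `keep` most accurate models get non-zero weight
    """
    n = len(targets)
    mses = [sum((p - y) ** 2 for p, y in zip(preds, targets)) / n
            for preds in predictions]
    # Small constant avoids division by zero for a model with a perfect fit.
    raw = [1.0 / (m + 1e-12) for m in mses]
    if keep is not None:
        # Parsimony step (assumed): drop models below the keep-th best weight.
        cutoff = sorted(raw, reverse=True)[keep - 1]
        raw = [w if w >= cutoff else 0.0 for w in raw]
    total = sum(raw)
    return [w / total for w in raw]


def ensemble_predict(weights, model_outputs):
    """Weighted average of the individual models' predictions for one case."""
    return sum(w * o for w, o in zip(weights, model_outputs))


# Hypothetical held-out predictions from three models.
targets = [1.0, 2.0, 3.0]
predictions = [
    [1.1, 1.9, 3.2],  # accurate model
    [1.5, 2.5, 3.5],  # biased model
    [3.0, 0.0, 5.0],  # poor model
]
weights = inverse_mse_weights(predictions, targets, keep=2)
new_prediction = ensemble_predict(weights, [2.0, 2.4, 9.0])
```

With `keep=2` the poorest model receives zero weight, so its prediction does not affect the ensemble's output.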



Hadley Wickham (Iowa State University)
Doina Caragea (Iowa State University)
Di Cook (Iowa State University)
Exploring High-dimensional Classification Boundaries

Friday 4:30-5:00, Fountain I

Abstract:

Given p-dimensional training data containing d groups (the design space), a classification algorithm (classifier) predicts which group new data belongs to. Generally the input to these algorithms is high dimensional, and the boundaries between groups will be high dimensional and perhaps curvilinear or multi-faceted. This paper discusses methods for understanding the division of space between the groups, and provides an implementation in an R package, explore, which links R to GGobi.

If the classifier is mathematically tractable we can extract the boundaries directly; if the classifier provides posterior probabilities we can use these to find uncertain points which lie on boundaries; otherwise we can treat the classifier as a black box and use a k-nearest neighbours technique to remove non-boundary points. These techniques allow us to work with any classifier, and we demonstrate LDA, QDA, SVM, tree and neural net classifiers.
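
The two generic strategies described above can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the interface of the explore package (which is written in R): the function names, the `tol` threshold on the gap between the two largest posterior probabilities, and the rule "keep a point if any of its k nearest neighbours is predicted to be in a different group" are all choices made for this example.

```python
import math


def uncertain_points(points, posteriors, tol=0.1):
    """Keep points whose two largest posterior probabilities differ by less than tol."""
    kept = []
    for x, probs in zip(points, posteriors):
        top = sorted(probs, reverse=True)
        if top[0] - top[1] < tol:
            kept.append(x)
    return kept


def knn_boundary_points(points, labels, k=3):
    """Black-box variant: drop points whose k nearest neighbours all share their label.

    points: list of coordinate tuples; labels: predicted group for each point.
    """
    kept = []
    for i, x in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: math.dist(x, points[j]))
        if any(labels[j] != labels[i] for j in others[:k]):
            kept.append(x)
    return kept


# Hypothetical one-dimensional example: two groups meeting between 2 and 3.
grid = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (5.0,)]
predicted = [0, 0, 0, 1, 1, 1]
near_boundary = knn_boundary_points(grid, predicted, k=2)
```

The brute-force neighbour search is quadratic in the number of points, which is fine for a sketch but would need a spatial index for a dense grid in many dimensions.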


Rida E.A. Moustafa (Center for Computational Statistics, George Mason University; AALCPAs)
Ali S. Hadi (Department of Statistical Science, Cornell University; Department of Mathematics, The American University in Cairo)
Fast and Effective Graphs for Exploring Massive, Hyperdimensional Data

Friday 5:00-5:30, Fountain I

Abstract:

Massive (large number of observations) and hyperdimensional (large number of variables) data are hard to visualize, both because human visual perception is limited to three-dimensional space and because of limitations inherent in existing visualization tools. Such large data sets appear frequently in real-life applications such as data mining and knowledge discovery. In this paper, we introduce efficient and effective graphs for the visualization of massive, hyperdimensional data. The graphs are based on projections of the data onto two-dimensional space. Each projection is a scatter plot of two statistical measures of the individual observations. Theoretical and empirical investigations of the properties of the proposed graphs show that they capture various patterns of complex geometry, such as linear structures, nonlinear structures, and even mixtures of the two. They are easy to interpret and calculate, and hence suitable for visually exploring very large multivariate data sets. We illustrate the efficiency and effectiveness of these graphs using several real and constructed data sets.
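
To make the idea of plotting two per-observation measures concrete, here is a minimal Python sketch. The abstract does not name the two statistical measures, so the choice below (the mean and the population standard deviation of each observation's values) is purely an assumption for illustration, as is the function name `project_rows`.

```python
from statistics import mean, pstdev


def project_rows(data):
    """Map each observation (a row of p values) to a 2-D point.

    The pair (row mean, row standard deviation) is an assumed stand-in
    for the two measures used in the paper, which the abstract does not name.
    """
    return [(mean(row), pstdev(row)) for row in data]


# Hypothetical data: three observations on four variables.
data = [
    [1.0, 1.0, 1.0, 1.0],
    [0.0, 2.0, 4.0, 6.0],
    [5.0, 5.5, 4.5, 5.0],
]
points_2d = project_rows(data)  # one (x, y) pair per observation, ready for a scatter plot
```

Each observation costs one pass over its p values, so the projection scales linearly in both the number of observations and the number of variables.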