High-dimensional Data Analysis (Kerby Shedden, chair)


Sayeed Qaiyumi (Goldsmiths, University of London)
Derrick Takishi Mirkitani (Goldsmiths, University of London)
An Extension of the K-Means Algorithm for the Dimensionality Curse of Multi-attributes

Thursday 4:00-4:20, San Rafael

Abstract:

The aim of this work is to generalize the K-Means algorithm so that it can be applied to multi-attribute data, to reduce dimensionality through entropy analysis, and to find the best initial conditions, or starting points, for K-Means. We start by choosing a criterion attribute. Based on comparison and selection, the technique considers the values surrounding it and picks the best value after comparing the values with one another and assessing their gains. Dwarf data cubes, which are thought to reduce the curse of dimensionality specifically in a database environment where we have suffixes and prefixes, are also compared. Finally, we explain how the entropy-generated result is used to set the initial conditions for K-Means.
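The abstract does not spell out its entropy analysis, so the following is only a minimal sketch of one common reading: rank attributes by information gain with respect to a chosen criterion attribute and keep the highest-ranked ones. The function names, the toy rows, and the choice of "y" as the criterion attribute are illustrative assumptions, not the authors' method.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attr, target):
    """Reduction in the entropy of `target` after splitting `rows` on `attr`."""
    base = entropy([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[attr], []).append(r[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder

def rank_attributes(rows, attrs, target):
    """Attributes ordered from most to least informative about `target`."""
    return sorted(attrs, key=lambda a: information_gain(rows, a, target), reverse=True)

# Hypothetical toy data: "a" determines the criterion attribute "y", "b" does not.
rows = [
    {"a": 0, "b": 0, "y": "low"},
    {"a": 0, "b": 1, "y": "low"},
    {"a": 1, "b": 0, "y": "high"},
    {"a": 1, "b": 1, "y": "high"},
]
ranking = rank_attributes(rows, ["b", "a"], "y")
```

Attributes with near-zero gain are candidates for removal before clustering. How the retained attributes would seed K-Means is left open here, since the abstract gives no details.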



Roy Welsch (Massachusetts Institute of Technology)
Feature Selection, Prediction, and Robustness When There are More Variables than Observations

Thursday 4:20-4:40, San Rafael

Abstract:

Classical statistical procedures generally assume that the number of observations exceeds the number of inputs or explanatory variables. In many application areas, such as finance and bioinformatics, this may not be the case. In this talk, we explore and compare a number of existing and new methods for addressing such problems, including support vector machines, regularized logistic regression, random forests, elastic nets, and various combinations of these methods, to see how they perform and how robust they are. We consider performance criteria related to both model accuracy (are the variables or features in the model the correct ones?) and predictive performance on new data.
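As a rough illustration of one of the listed methods, the sketch below fits an elastic net by coordinate descent on a design with more variables (5) than observations (4). The objective, the parameter names `lam` and `alpha`, and the toy data are assumptions chosen for the example; none of them is taken from the talk. The columns are assumed to be centered, and no intercept is fitted.

```python
def soft_threshold(z, t):
    """Soft-thresholding operator used by the L1 part of the penalty."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def elastic_net(X, y, lam=0.5, alpha=0.5, n_iter=100):
    """Coordinate descent for
    (1/2n)||y - Xb||^2 + lam * (alpha * ||b||_1 + (1 - alpha)/2 * ||b||_2^2)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - Xb, with b = 0 at the start
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # an all-zero column carries no information
            # Correlation of column j with the partial residual (column j excluded)
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam * alpha) / (col_sq[j] + lam * (1 - alpha))
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta

# Hypothetical toy design with n = 4 observations and p = 5 variables.
# Columns 3 and 4 are scaled copies of columns 1 and 2; only column 0 drives y.
X = [
    [1, 1, 1, 0.5, 0.5],
    [1, -1, -1, -0.5, -0.5],
    [-1, 1, -1, 0.5, -0.5],
    [-1, -1, 1, -0.5, 0.5],
]
y = [3, 3, -3, -3]
beta = elastic_net(X, y)
```

Ordinary least squares has no unique solution here because the columns are linearly dependent, whereas the penalized fit is well defined: it keeps a shrunken coefficient on the first variable and sets the rest to zero. The design is deliberately simple, so it shows the mechanics only and says nothing about how the methods compare in practice.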



Jong Soo Lee (M.D. Anderson Cancer Center)
Pointwise Testing with Functional Data

Thursday 4:40-5:00, San Rafael

Abstract:

We consider the pointwise testing of functional data from two or more populations. There exist testing procedures to compare overall pointwise differences between populations. However, these tests are limited in that they cannot pinpoint which points or regions differ between the mean functions. Thus, we propose a follow-up testing procedure to locate significantly different regions. For this, we utilize a multiple comparison procedure proposed by Westfall and Young (1993). We show some theoretical properties of our method and demonstrate that it works well in a simulation study. Finally, we conclude by applying our method to cervical cancer screening data.
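The follow-up procedure itself is not given in the abstract. The sketch below shows a generic single-step max-statistic permutation adjustment in the spirit of Westfall and Young (1993) for two groups of curves observed on a shared grid. It uses the absolute difference of group means as the pointwise statistic, which is a simplification (a standardized statistic is usually preferable). All names and the simulated curves are illustrative assumptions.

```python
import random

def mean_diff(a, b):
    """Absolute pointwise difference between the two group mean curves."""
    m = len(a[0])
    return [abs(sum(c[j] for c in a) / len(a) - sum(c[j] for c in b) / len(b))
            for j in range(m)]

def maxstat_adjusted_pvalues(a, b, n_perm=999, seed=0):
    """Adjusted p-value at each grid point, based on the permutation
    distribution of the maximum statistic over the whole grid."""
    rng = random.Random(seed)
    observed = mean_diff(a, b)
    pooled = a + b
    exceed = [0] * len(observed)
    for _ in range(n_perm):
        perm = pooled[:]
        rng.shuffle(perm)  # reassign whole curves to groups at random
        max_stat = max(mean_diff(perm[:len(a)], perm[len(a):]))
        for j, obs in enumerate(observed):
            if max_stat >= obs:
                exceed[j] += 1
    return [(1 + e) / (n_perm + 1) for e in exceed]

# Simulated curves (assumed for illustration): the groups differ only at grid points 4 to 6.
data_rng = random.Random(1)
grid = 10
group_a = [[data_rng.gauss(0, 0.1) for _ in range(grid)] for _ in range(8)]
group_b = [[data_rng.gauss(0, 0.1) + (5.0 if 4 <= j <= 6 else 0.0) for j in range(grid)]
           for _ in range(8)]
adj_p = maxstat_adjusted_pvalues(group_a, group_b)
significant = [j for j, p in enumerate(adj_p) if p <= 0.05]
```

Because each observed statistic is compared with the permutation distribution of the maximum over all grid points, the grid points flagged in `significant` are protected against multiplicity across the grid.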



Heike Hofmann (Iowa State University)
Graph-theoretic Scagnostics for Projection Pursuit

Thursday 5:00-5:20, San Rafael

Abstract:

Finding "interesting" projections in high-dimensional space has a long tradition, resulting in methodological solutions such as the grand tour (Asimov 1985) and projection pursuit methods. While the grand tour walks through the high-dimensional space on a path that covers all possible lower-dimensional projections, projection pursuit methods optimize this path for one specific optimality index. John and Paul Tukey (1985) suggested "scagnostics" as a way to describe diagnostic properties of a scatterplot. Wilkinson et al. (2005) recently extended this concept to a graph-theoretic approach. Using a set of nine indices, they applied graph-theoretic scagnostics to re-order the scatterplots in a scatterplot matrix according to "skinniness", "clumpiness", number of outliers, etc. We propose using graph-theoretic indices as the optimization criterion in projection pursuit: instead of a single criterion, a combination of these indices allows us to look for projections that are, e.g., 80% "skinny" and 15% "clumpy" with 0% outliers. By letting the analyst change the index settings interactively, the method allows the projection pursuit to be guided in more ways than the usual parameters allow.
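One way to read the combined criterion is as a weighted average of scagnostic values, with the percentages acting as weights. That reading is an assumption, as are the function names and the precomputed index values below. Computing the actual graph-theoretic measures of Wilkinson et al. is outside the scope of this sketch.

```python
def composite_index(scores, weights):
    """Weighted average of the scagnostic values for one projection."""
    total = sum(weights.values())
    return sum(w * scores[name] for name, w in weights.items()) / total

def best_projection(candidates, weights):
    """Name of the candidate projection with the highest composite index."""
    return max(candidates, key=lambda name: composite_index(candidates[name], weights))

# Hypothetical scagnostic values in [0, 1] for two candidate projections.
candidates = {
    "proj_1": {"skinny": 0.9, "clumpy": 0.1, "outlying": 0.0},
    "proj_2": {"skinny": 0.2, "clumpy": 0.9, "outlying": 0.3},
}
weights = {"skinny": 0.80, "clumpy": 0.15, "outlying": 0.0}
chosen = best_projection(candidates, weights)
```

In an interactive setting, changing `weights` and re-running the search would steer the pursuit toward different kinds of structure.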