INVITED SESSIONS

Applications of Clustering and Classification to Large Datasets
Organizer: William Shannon
(shannon@osler.wustl.edu)
Washington University School of Medicine

Description:
Recent advances in data collection and storage have resulted in massive datasets that require new data analytical tools to extract meaningful information. These tools depend on developing interdisciplinary collaborations merging statistics, mathematics, computer science, and subject-area expertise into a cohesive whole. This fundamental change in data analysis has opened entirely new opportunities to address important and difficult scientific questions, as well as new challenges for developing more powerful data analytical methods.

An important framework for analyzing massive datasets comes from exploratory cluster analysis, which groups data points that exhibit some measure of similarity into homogeneous subgroups. Follow-up subgroup analyses then search for subgroup-specific patterns that may provide insight into the scientific questions being studied. In this session, data analysis problems involving massive datasets from astrophysics and genetics will be described, and innovative computational approaches combining computer science, statistics, and machine learning will be presented.

This session is sponsored by the Classification Society of North America (http://www.pitt.edu/~csna), which is actively involved in promoting the scientific study of classification and clustering (including systematic methods of creating classifications from data) and in disseminating scientific and educational information related to its fields of interest. The Society has been strongly interdisciplinary since its origin in 1968, attracting researchers from fields such as biology, computer science, marketing, mathematics, psychology, and statistics, and has published the Journal of Classification since 1984.

Format:
The format will be two (30-minute) papers with 10 minutes of open floor discussion, followed by a discussant (15 minutes) and an additional 10 minutes for open floor discussion.

Participants:
Andrew Moore (presentation, Inner-Loop Statistics in Automated Scientific Discovery from Massive Datasets)
Andrew Moore, Ph.D. in Computer Science, is the A. Nico Habermann Associate Professor of Robotics and Computer Science at the School of Computer Science, Carnegie Mellon University. The talk will focus on the use of multiresolution kd-trees. (http://www.cs.cmu.edu/~awm/hp.html)

Daniel Weaver (presentation, Current Approaches to Gene Chip Data Analysis)
Dan Weaver, Ph.D. in Molecular Biology, is a leading scientist in bioinformatics with the Genomica Corporation in Boulder, Colorado. The talk will emphasize flexible, robust, and automated microarray analysis. (http://www.genomica.com/home_index.html)

William Shannon (poster, Preliminary Studies on Combining Wavelet and Cluster Analysis for Gene Chip Data)
Washington University School of Medicine

Stephen D. Bay, Dennis Kibler, Michael J. Pazzani, and Padhraic Smyth (poster, The UC Irvine Knowledge Discovery in Databases Archive)
University of California, Irvine

William Shannon (Discussant)

 
