George Mason University
AES/CCS/SCS/Statistics Colloquium Series
Seminar Announcement


A Probabilistic Approach to Mining Massive Earth Science Data Sets

Amy Braverman

Jet Propulsion Laboratory
California Institute of Technology

Location: Johnson Center, Assembly Room C
Time: 10:30 a.m. Refreshments, 10:45 a.m. Colloquium Talk
Date: January 25, 2006



ABSTRACT

Modern Earth science data sets are massive, complex and unwieldy. They are difficult to manipulate, let alone mine for unknown or unexpected structural features. One approach to the first problem is to reduce the data to a manageable size in a way that preserves the statistical character of the original data, and then work with the smaller, less complex "compressed" data set. However, this does not address the issue of how to mine in a way that doesn't depend on knowing the data structure in the first place.

In this talk we describe (1) a method of data reduction that lends itself to moderately agnostic data mining, and (2) some methods for mining the resulting compressed data to characterize large-scale data structure. We partition a massive data set into space-time regions (e.g., monthly, five-degree grid cells), and replace the raw data in each grid cell with a discrete, multivariate probability distribution estimate. These distributions are the "signatures" of the physical processes generating the data, and data mining is the characterization and quantification of how those distributions evolve in time and space. Moreover, the aim is not only to characterize this evolution, but to explain it physically. We use data from JPL's Atmospheric Infrared Sounder instrument to demonstrate the scientific utility of this approach.
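
As a rough illustration of the reduction step described above, the following Python sketch groups observations into monthly, five-degree grid cells and replaces each cell's raw data with a discrete multivariate distribution. It is only a sketch under stated assumptions: the abstract does not say how the distribution estimate is formed, so a plain fixed-width multivariate histogram is swapped in here, and the names cell_key, quantize, summarize, the bin width, and the sample records are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def cell_key(month, lat, lon, cell_deg=5.0):
    """Map an observation to its (month, lat band, lon band) grid cell."""
    return (month, math.floor(lat / cell_deg), math.floor(lon / cell_deg))


def quantize(values, bin_width):
    """Snap a multivariate observation to the center of its histogram bin."""
    return tuple((math.floor(v / bin_width) + 0.5) * bin_width for v in values)


def summarize(records, cell_deg=5.0, bin_width=10.0):
    """Replace the raw data in each cell with a discrete distribution.

    Assumption: a fixed-width histogram stands in for whatever estimator
    the talk actually uses.
    """
    counts = defaultdict(Counter)
    for month, lat, lon, values in records:
        key = cell_key(month, lat, lon, cell_deg)
        counts[key][quantize(values, bin_width)] += 1

    summary = {}
    for key, counter in counts.items():
        total = sum(counter.values())
        # Each cell becomes {representative point: probability mass}.
        summary[key] = {point: n / total for point, n in counter.items()}
    return summary


# Made-up records: (month, latitude, longitude, (variable 1, variable 2)).
records = [
    ("2006-01", 12.3, -47.9, (251.0, 3.2)),
    ("2006-01", 13.8, -46.1, (254.0, 7.9)),
    ("2006-01", 14.9, -45.2, (268.0, 4.1)),
    ("2006-01", 38.0, -77.3, (243.0, 1.0)),
]
summary = summarize(records)
for key in sorted(summary):
    print(key, summary[key])
```

Each grid cell is reduced to a small table of representative points and their probabilities, which is the kind of per-cell "signature" that could then be compared across neighboring cells or successive months.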