Distributed Data, Parallel Computing, and Computational Strategies I
(David Marchette, chair)


Kirk Borne (George Mason University)
Astronomy Data Collections of the Future: Massive Data Mining Opportunities

Thursday 10:30-10:50, San Marino

Abstract:

Astronomers are producing ever-increasing data volumes from a variety of large projects, including NASA missions and ground-based telescope projects. One of the largest of these will be the Large Synoptic Survey Telescope (LSST) project, which is planned to produce nearly 30 Terabytes of data per night of observation, every night, for 10 years. The LSST will image the entire visible sky repeatedly every 4 nights. Astronomers refer to the resulting combined data set as "cosmic cinematography". Other similar projects are also envisioned. The temporal, spatial, and high-dimensional data mining opportunities posed by this massive data environment are exploding before us. Consequently, the scientific discovery potential of these data sets is enormous. All of these large astronomical data repositories will be geographically distributed. To facilitate access to these distributed data collections, astronomers are working with database experts and computer scientists to build a worldwide virtual data system: the National Virtual Observatory (NVO). I will describe the NVO and LSST projects, plus other large astronomy data-producing projects, including some of the corresponding discipline-driven research challenges, and finally some astronomical data mining activities now underway. It is anticipated that an Astro-informatics research paradigm will evolve from this -- in fact, this is already becoming important and soon will become imperative. Grid-based mining, Web Services-enabled mining, and ontology-enhanced semantic mining will all play a role in the astronomical research of the future.



Abbas Alhakim (Clarkson University)
On the Parallelization of a Shift Register Sequence

Thursday 10:50-11:10, San Marino

Abstract:

The problem of generating multiple streams of random numbers that withstand statistical tests of independence is increasingly important in Parallel Monte Carlo. The validity of simulation results depends strongly on the independence of the individual streams that run on different processors. Known methods of parallelization can be divided into parameterization and splitting. We propose a technique for `splitting' a shift register sequence (also known as a de Bruijn sequence) into multiple sequences via an inverse of a 2^k-to-one homomorphism from the binary de Bruijn graph B_{n+k} of order n+k to the lower order de Bruijn graph B_n. We show that there is a large family of such homomorphisms that generalize the famous Lempel D-morphism. Experimentally, choosing a nonlinear homomorphism yields streams that look uncorrelated.
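
As a rough illustration of the splitting idea, the sketch below covers only the classical case k = 1, the Lempel D-morphism, which maps a binary sequence to the XOR of adjacent bits. It is not the generalized or nonlinear homomorphisms proposed in the talk. The function names (de_bruijn, d_morphism, lift, windows) are illustrative choices, and the de Bruijn cycle is generated with a standard recursive construction chosen here for convenience.

```python
def de_bruijn(n):
    """Return one binary de Bruijn cycle of order n (length 2**n) as a list of bits."""
    a = [0] * (n + 1)
    seq = []

    def db(t, p):
        if t > n:
            if n % p == 0:
                seq.extend(a[1:p + 1])
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            for j in range(a[t - p] + 1, 2):
                a[t] = j
                db(t + 1, t)

    db(1, 1)
    return seq


def d_morphism(cycle):
    """Lempel D-morphism on a cyclic sequence: XOR of each bit with its successor."""
    n = len(cycle)
    return [cycle[i] ^ cycle[(i + 1) % n] for i in range(n)]


def lift(cycle, start):
    """One preimage of `cycle` under the D-morphism, fixed by its first bit."""
    out = [start]
    for bit in cycle[:-1]:
        out.append(out[-1] ^ bit)
    return out


def windows(cycle, width):
    """Set of all length-`width` words read cyclically from `cycle`."""
    n = len(cycle)
    return {tuple(cycle[(i + j) % n] for j in range(width)) for i in range(n)}


n = 4
base = de_bruijn(n)    # every n-bit word appears exactly once
lift0 = lift(base, 0)  # first preimage
lift1 = lift(base, 1)  # second preimage, the bitwise complement of the first
```

Under these assumptions, the two lifted cycles are complementary, both map back to the base cycle, and together they contain every (n+1)-bit word exactly once. The order-n cycle is thus split into two streams that live in the graph of order n+1.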



Nathaniel Beagley (Pacific Northwest National Laboratory)
A Strategy for Cross Sample Comparisons of Massive Data Sets in the Search for Environmental Biomarkers

Thursday 11:10-11:30, San Marino

Abstract:

The search for environmental biomarkers using laboratory analytical techniques attempts to find chemical compounds that can reliably differentiate between two populations (for example, organisms that have been exposed to a toxin vs. unexposed organisms, or sick animals vs. healthy ones) in a statistically robust way. Our approach uses comprehensive two-dimensional gas chromatography coupled with mass spectrometry (GCxGC-MS), which allows non-invasive screening and provides excellent separation for identification of chemical compounds. The barrier is that GCxGC-MS produces data on the order of 1.5 GB per sample, which makes comparisons across multiple samples and multiple populations extremely time-consuming. We present our strategies for statistical comparisons with this massive data set, including optimal data storage design and multi-pass analysis algorithms.
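
The abstract does not spell out the analysis algorithms, so the following is only a generic sketch of one pass of a streaming comparison. It assumes each sample has already been reduced to a single intensity for one candidate compound. It keeps running statistics per population (Welford's update) so that no sample needs to stay in memory, then forms a Welch t-statistic. The class, the function names, and the intensity values are all hypothetical.

```python
import math


class RunningStats:
    """Welford's online mean and variance: one value at a time, O(1) memory."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Unbiased sample variance (requires at least two values)."""
        return self.m2 / (self.n - 1)


def accumulate(stream):
    """Consume an iterable of intensities and return its running statistics."""
    stats = RunningStats()
    for value in stream:
        stats.update(value)
    return stats


def welch_t(a, b):
    """Welch's t-statistic for two populations with unequal variances."""
    return (a.mean - b.mean) / math.sqrt(a.variance / a.n + b.variance / b.n)


# Hypothetical per-sample intensities of one compound in each population.
exposed = accumulate(iter([5.1, 4.9, 5.3, 5.0]))
unexposed = accumulate(iter([3.9, 4.1, 4.0, 4.2]))
t_value = welch_t(exposed, unexposed)
```

In a multi-pass setting, a first pass of this kind could flag candidate compounds cheaply, leaving more expensive analysis to later passes over the flagged subset.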



Faleh Al-Shameri (George Mason University)
Edward J. Wegman (George Mason University)
Automated Generation of Metadata for Mining Image and Text Data

Thursday 11:30-11:50, San Marino

Abstract:

Recent years have witnessed an explosion in the amount of digitally stored data, the rate at which these data are being generated, and the diversity of disciplines relying on them. They are increasingly important in a wide range of applications including observational sciences, product marketing, and the monitoring and operations of large systems. Massive datasets are also collected routinely in many areas such as astrophysics, particle physics, genetic sequencing, geographical information systems, weather prediction, medical applications, telecommunications, sensors, government databases, and banking.

This research addresses the challenges of autonomous discovery and triage of contextually relevant information in massive and complex datasets. The aim is to extract feature vectors that function as digital object summaries of the data from which they are derived, thereby effectively reducing the volume of information that needs to be considered. We have developed an automated metadata system that scans for statistically appropriate feature vectors derived from summaries of the data's distributional characteristics. These features allow data miners to use boolean searches to quickly identify relevant portions of the dataset.

We consider two types of data here: text and imagery. The text data are documents from the Topic Detection and Tracking (TDT) Pilot Corpus collected by the Linguistic Data Consortium of Philadelphia, PA. The TDT corpus comprises a set of 15,863 news articles from CNN and Reuters over the period July 1, 1994 to June 30, 1995. Four features were extracted that capture topics, discriminating words, verbs, and word bigrams. These features were attached to each document, thus allowing us to identify and retrieve related articles. The remote sensing data were 50 GB of imagery acquired by JPL's Multi-angle Imaging SpectroRadiometer (MISR) instrument, aboard NASA's Terra satellite. This large set of images provides an excellent prototype database for demonstrating the feasibility of our system. We developed a set of features derived from gray level co-occurrence matrices (GLCMs), including homogeneity, contrast, dissimilarity, entropy, angular second moment (ASM), and energy.
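
The GLCM features named above can be sketched with their common textbook definitions. The abstract does not give the exact formulas, offsets, or number of gray levels used, so everything below is an assumption: a symmetric, normalized co-occurrence matrix for a one-pixel horizontal offset, homogeneity weighted by 1/(1 + (i-j)^2), and entropy in natural-log units. The 4x4 test image is a toy example, not MISR data.

```python
import math


def glcm(image, levels, dx=1, dy=0):
    """Symmetric, normalized gray level co-occurrence matrix for offset (dx, dy)."""
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                i, j = image[r][c], image[r2][c2]
                counts[i][j] += 1
                counts[j][i] += 1  # count each pixel pair in both directions
    total = sum(map(sum, counts))
    return [[v / total for v in row] for row in counts]


def glcm_features(p):
    """Texture features from a normalized GLCM `p` (common definitions, assumed)."""
    cells = [(i, j, v) for i, row in enumerate(p) for j, v in enumerate(row)]
    asm = sum(v * v for _, _, v in cells)
    return {
        "homogeneity": sum(v / (1 + (i - j) ** 2) for i, j, v in cells),
        "contrast": sum(v * (i - j) ** 2 for i, j, v in cells),
        "dissimilarity": sum(v * abs(i - j) for i, j, v in cells),
        "entropy": -sum(v * math.log(v) for _, _, v in cells if v > 0),
        "asm": asm,
        "energy": math.sqrt(asm),
    }


# Toy 4x4 image with four gray levels (0..3).
IMAGE = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 2],
    [2, 2, 3, 3],
]
P = glcm(IMAGE, levels=4)
FEATURES = glcm_features(P)
```

Each image tile would then carry a six-number feature vector of this kind as metadata, which is far smaller than the tile itself.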

In this talk we show how feature-based metadata can be used to understand the large-scale structure of massive sets of documents and images. This allows analysts to perform boolean searches based on specific criteria, and data miners to discover previously unknown relationships.
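
A boolean search over feature-based metadata of this kind might look like the minimal sketch below, which uses word bigrams as the only feature. The documents, identifiers, and function names are hypothetical, tokenization is a plain whitespace split, and the abstract does not describe the system's actual index or query interface.

```python
def bigrams(text):
    """Set of adjacent word pairs in `text` (lowercased, whitespace-tokenized)."""
    words = text.lower().split()
    return set(zip(words, words[1:]))


def build_index(docs):
    """Map each bigram to the set of document ids whose metadata contains it."""
    index = {}
    for doc_id, text in docs.items():
        for bg in bigrams(text):
            index.setdefault(bg, set()).add(doc_id)
    return index


def search(index, all_ids, must=(), must_not=()):
    """Boolean query: documents containing every `must` bigram and no `must_not` bigram."""
    result = set(all_ids)
    for bg in must:
        result &= index.get(bg, set())
    for bg in must_not:
        result -= index.get(bg, set())
    return result


# Hypothetical three-document collection.
DOCS = {
    "d1": "stock market falls sharply",
    "d2": "stock market rallies",
    "d3": "earthquake hits coastal city",
}
INDEX = build_index(DOCS)
HITS = search(
    INDEX,
    DOCS.keys(),
    must=[("stock", "market")],
    must_not=[("market", "falls")],
)
```

Because the query touches only the index, the full text of the collection never has to be scanned at search time.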