This research addresses the challenges of autonomous discovery and triage of contextually relevant information in massive and complex datasets. The aim is to extract feature vectors that serve as digital object summaries of the data from which they are derived, thereby reducing the volume of information that needs to be considered. We have developed an automated metadata system that scans for statistically appropriate feature vectors derived from summaries of the data's distributional characteristics. These features allow data miners to use boolean searches to quickly identify relevant portions of the dataset.
We consider two types of data here: text and imagery. The text data are documents from the Topic Detection and Tracking (TDT) Pilot Corpus collected by the Linguistic Data Consortium of Philadelphia, PA. The TDT corpus comprises 15,863 news articles from CNN and Reuters covering the period July 1, 1994 to June 30, 1995. Four features are extracted from each document, capturing topics, discriminating words, verbs, and word bigrams. These features are attached to each document, allowing us to identify and retrieve related articles. The remote sensing data are 50 GB of imagery acquired by JPL's Multi-angle Imaging SpectroRadiometer (MISR) instrument aboard NASA's Terra satellite. This large set of images provides an excellent prototype database for demonstrating the feasibility of our system. For the imagery, we developed a set of features derived from gray level co-occurrence matrices (GLCMs), including homogeneity, contrast, dissimilarity, entropy, angular second moment (ASM), and energy.
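As an illustration of the GLCM-derived image features named above, the following minimal sketch computes a co-occurrence matrix for a single pixel offset and derives the six features from it. It uses common textbook definitions, which are not necessarily the ones used in this work. The function names `glcm` and `glcm_features`, the symmetric normalization, and the assumption that the image is already quantized to integers in `[0, levels)` are all illustrative choices.

```python
import numpy as np

def glcm(image, levels, dx=1, dy=0):
    """Normalized, symmetric gray level co-occurrence matrix for one offset.

    Assumes `image` is a 2-D integer array with values in [0, levels)
    and that dx, dy are non-negative (illustrative simplification).
    """
    img = np.asarray(image)
    rows, cols = img.shape
    # Pair each reference pixel with its neighbor at offset (dy, dx).
    ref = img[:rows - dy, :cols - dx]
    nbr = img[dy:, dx:]
    counts = np.zeros((levels, levels), dtype=float)
    # Accumulate one count per (reference, neighbor) gray level pair.
    np.add.at(counts, (ref.ravel(), nbr.ravel()), 1.0)
    counts += counts.T            # count each pair in both directions
    return counts / counts.sum()  # convert counts to joint probabilities

def glcm_features(p):
    """Texture features from a normalized GLCM `p` (textbook definitions)."""
    i, j = np.indices(p.shape)
    diff = i - j
    asm = np.sum(p ** 2)
    nonzero = p[p > 0]            # skip zero entries so log() is defined
    return {
        "homogeneity": np.sum(p / (1.0 + diff ** 2)),
        "contrast": np.sum(p * diff ** 2),
        "dissimilarity": np.sum(p * np.abs(diff)),
        "entropy": -np.sum(nonzero * np.log(nonzero)),
        "asm": asm,
        "energy": np.sqrt(asm),
    }

# Hypothetical 2-level checkerboard patch, used only as a toy input.
patch = np.array([[0, 1], [1, 0]])
features = glcm_features(glcm(patch, levels=2))
```

In a metadata system of the kind described, such a feature dictionary would be computed per image tile and stored alongside it; the tile size, offsets, and number of gray levels are tuning choices that this sketch does not attempt to reproduce.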
In this talk we show how feature-based metadata can be used to understand the large-scale structure of massive sets of documents and images. This allows analysts to perform boolean searches based on specific criteria, and data miners to discover previously unknown relationships.
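A boolean search over feature-based metadata can be sketched as set operations on the tags attached to each object. The example below is a minimal illustration under assumed conventions, not the system described here: the `search` function, the tag naming scheme (`topic:...`, `verb:...`, `glcm:...`), and the sample records are all hypothetical.

```python
def search(metadata, all_of=(), any_of=(), none_of=()):
    """Return sorted IDs of objects whose tag sets satisfy a boolean query.

    metadata: dict mapping object ID -> set of feature tags (hypothetical).
    all_of:   every tag must be present (AND).
    any_of:   at least one tag must be present, if any are given (OR).
    none_of:  no tag may be present (NOT).
    """
    required = set(all_of)
    optional = set(any_of)
    excluded = set(none_of)
    hits = []
    for obj_id, tags in metadata.items():
        if not required <= tags:
            continue  # missing a required tag
        if optional and not (optional & tags):
            continue  # none of the alternative tags present
        if excluded & tags:
            continue  # contains an excluded tag
        hits.append(obj_id)
    return sorted(hits)

# Hypothetical metadata records, for illustration only.
metadata = {
    "doc_a": {"topic:election", "bigram:prime minister"},
    "doc_b": {"topic:election", "verb:resign"},
    "img_c": {"glcm:high_contrast", "glcm:low_entropy"},
}

# Election articles that do not mention a resignation.
hits = search(metadata, all_of=["topic:election"], none_of=["verb:resign"])
```

Because the query runs against compact tag sets instead of the underlying documents or images, the cost of a search depends on the size of the metadata, not on the size of the raw data.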