Streaming Data I (David Scott, organizer)


Suhrid Balakrishnan (Rutgers University)
A Streaming/Sequential Algorithm for Learning Sparse Classifiers

Thursday 2:00-2:30, Fountain III

Abstract:

Classifiers favoring sparse solutions, such as support vector machines, the relevance vector machine, and LASSO-regression-based classifiers, have proven to be very accurate for classification problems in high dimensions. Sparsity is essential in this domain because it regularizes the classifier function class, which yields trained classifiers with improved generalization performance. However, current algorithms for training sparse classifiers typically scale quite unfavorably with the number of data points in the training dataset. This talk outlines both a streaming/online algorithm and a multipass but sequential algorithm for training sparse classifiers on high-dimensional data; their computational complexity and memory requirements make learning on massive datasets feasible. The central idea that makes this possible is an analysis based on a simple quadratic approximation to the likelihood function.
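
The abstract does not give the details of the algorithm, so the following Python sketch is only an illustration of the general idea, not the speaker's method. It assumes a logistic model with an L1 penalty: each incoming example contributes a second-order Taylor expansion of its negative log-likelihood to a running quadratic, and the penalized quadratic is re-solved by coordinate descent with soft-thresholding. The class name `StreamingSparseLogistic`, its parameters, and the synthetic data in `demo` are all hypothetical.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; this produces exact zeros."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


class StreamingSparseLogistic:
    """Hypothetical one-pass L1-penalized logistic regression (illustrative)."""

    def __init__(self, dim, l1_penalty=1.0, sweeps=5):
        self.dim = dim
        self.l1 = l1_penalty
        self.sweeps = sweeps
        self.w = [0.0] * dim
        # Running quadratic 0.5 * w'Aw - b'w that stands in for the
        # negative log-likelihood of every example seen so far.
        self.A = [[0.0] * dim for _ in range(dim)]
        self.b = [0.0] * dim

    def update(self, x, y):
        """Absorb one example (x: list of floats, y: 0 or 1), then re-solve."""
        z = sum(wj * xj for wj, xj in zip(self.w, x))
        p = sigmoid(z)
        curv = max(p * (1.0 - p), 1e-6)  # floor keeps the curvature positive
        # Taylor expansion around the current weights gives a quadratic
        # 0.5 * w'Hw - c'w + const with H = curv * x x' and
        # c = (curv * z - (p - y)) * x.
        coef = curv * z - (p - y)
        for j in range(self.dim):
            self.b[j] += coef * x[j]
            for k in range(self.dim):
                self.A[j][k] += curv * x[j] * x[k]
        self._solve()

    def _solve(self):
        """Coordinate descent on 0.5 * w'Aw - b'w + l1 * ||w||_1."""
        for _ in range(self.sweeps):
            for j in range(self.dim):
                if self.A[j][j] <= 0.0:
                    self.w[j] = 0.0
                    continue
                rho = self.b[j] - sum(
                    self.A[j][k] * self.w[k] for k in range(self.dim) if k != j
                )
                self.w[j] = soft_threshold(rho, self.l1) / self.A[j][j]


def demo(seed=0, n=300, dim=4):
    """Stream synthetic data in which only feature 0 determines the label."""
    rng = random.Random(seed)
    model = StreamingSparseLogistic(dim, l1_penalty=2.0)
    for _ in range(n):
        x = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        model.update(x, 1 if x[0] > 0 else 0)
    return model.w
```

In this sketch the stored state (`A`, `b`, `w`) does not grow with the number of examples, which is the property the abstract emphasizes. The dense `dim`-by-`dim` matrix is an assumption made for brevity; it would not be practical for truly high-dimensional data without a sparser or diagonal representation.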



Mark Hansen (UCLA)
Viewing Machines: Embedded Coupled Human-observational Systems

Thursday 2:30-3:00, Fountain III

Abstract:

The communications revolution of the 1990s has forever changed how individuals and organizations exchange information at a global scale. By combining these technologies with microsensing capabilities, we directly couple the virtual and physical worlds and seed the next revolution, the rise of Embedded Networked Sensing (ENS). This new local revolution seeks to make the world "transparent" by enabling observation of physical, biological, and chemical processes up close, and at a level of spatial and temporal detail that is simply impossible with larger-scale traditional remote or manual sensing. The ability to observe the physical world with high fidelity permits creation and refinement of models that can make predictions and eventually even manage the physical world in a range of application contexts. This has profound implications for scientific discovery and technological advances.

Through our design, development, and deployment activities at the Center for Embedded Networked Sensing (CENS), we have found that effective embedded sensing is not always about the largest number of the smallest sensors. Robust, scalable, and flexible systems require a layered mix of observational resources, including sensor capabilities, networking, modeling, and user interaction. Our future observing systems will employ programmable, adaptive, and autonomous coordination among heterogeneous embedded devices to export information, not just raw data, and will fuse this high-fidelity, in situ information with external data sources. In this talk, we will characterize the role of "data scientists" in architecting ENS technologies and in developing data formats, communication protocols, and software systems that provide rich feedback loops among environmental phenomena, the operational characteristics of sensing systems, human-orchestrated design, and data analysis and inference.

We will consider two use cases. First, with the growing recognition that significant advancements in environmental research and protection of environmental quality must be associated with a better understanding of the dynamics of complex ecological systems, there has been an expanded national and international focus on the development of environmental observatories to facilitate research at multiple spatial and temporal scales. The existing and proposed systems of environmental observatories share a common mission to support new research through innovations in sensing and informatics. Because environmental observatories will create massive amounts of new data, there are important concerns as to how researchers will access these data streams in real time, how data will be analyzed and models developed, and how instrumentation used by individual researchers will be linked with data streams from the observatory instruments.

In the second use case, we consider applications of ENS outside of scientific, engineering, and industrial settings. These projects are not managed by a central agency or institution, but are deployed by private citizens and operate in personal, social, or "urban" spaces. These applications are attractive because they offer a tremendous range of possibilities for data sharing experiences and for enabling social exchange. They require new algorithms and software mechanisms because, unlike scientific applications of distributed sensing, a single system is widely distributed, intermittently connected, and privately administered; and, unlike traditional Internet applications, the physical inputs are critical to the behavior.



David Marchette (George Mason University and Naval Surface Warfare Center)
Analysis of Streaming Text

Thursday 3:00-3:30, Fountain III

Abstract:

We consider the problem of analysing streaming text, such as news feeds or email streams. Models based on simple word frequency statistics are used to characterize the different topics within the stream, providing a clustering of the documents in real time. These methods show that very simple statistics can be used to extract interesting information from the stream. Documents are linked together based on their word co-occurrence, producing a time series of graphs which can be used for further analysis of the stream. Methods for updating the models of the text are discussed and illustrated on real data, along with limitations of the methodology.
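
The abstract does not specify the statistics or the linking rule, so the Python sketch below is only one plausible reading of it, not the speaker's method. It assumes exponentially decayed word counts as the "simple word frequency statistics" and links each new document to recent documents whose word sets overlap (Jaccard similarity) above a threshold. The class name `StreamingTextLinker`, the parameters `window`, `decay`, and `min_overlap`, and the sample sentences are all hypothetical.

```python
import re
from collections import Counter, deque


def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters as words."""
    return re.findall(r"[a-z]+", text.lower())


class StreamingTextLinker:
    """Hypothetical streaming word statistics plus document linking."""

    def __init__(self, window=100, decay=0.99, min_overlap=0.2):
        self.recent = deque(maxlen=window)  # (doc_id, set of words)
        self.decay = decay
        self.min_overlap = min_overlap
        self.word_weight = Counter()  # decayed document frequencies
        self.next_id = 0

    def add(self, text):
        """Update word statistics and return edges to recent documents.

        Each edge is (older_doc_id, new_doc_id, jaccard_overlap).
        """
        doc_id = self.next_id
        self.next_id += 1
        words = set(tokenize(text))

        # Age every existing count, then credit the words in this document.
        for word in list(self.word_weight):
            self.word_weight[word] *= self.decay
        for word in words:
            self.word_weight[word] += 1.0

        # Link to documents in the window that share enough vocabulary.
        edges = []
        for other_id, other_words in self.recent:
            union = words | other_words
            if not union:
                continue
            overlap = len(words & other_words) / len(union)
            if overlap >= self.min_overlap:
                edges.append((other_id, doc_id, overlap))

        self.recent.append((doc_id, words))
        return edges


linker = StreamingTextLinker(min_overlap=0.3)
edges_0 = linker.add("Stocks fall as markets slide")
edges_1 = linker.add("Markets slide while stocks fall")
edges_2 = linker.add("Rain expected this weekend")
```

Collecting the edges returned over successive time intervals would give a sequence of graphs of the kind the abstract mentions. Decaying every stored word on each arrival is a simplification; a real stream would likely need a cheaper update.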