Streaming Data I (David Scott, organizer)
Suhrid Balakrishnan (Rutgers University)
A Streaming/Sequential Algorithm for Learning Sparse Classifiers
Thursday 2:00-2:30, Fountain III
Abstract:
Classifiers favoring sparse solutions, such as support vector
machines, the relevance vector machine, and LASSO-regression-based
classifiers, have proven very accurate on classification problems
in high dimensions.
Sparsity is essential in this domain, as it is regularization of the
classifier function class that results in trained classifiers with
improved generalization performance. However, current algorithms for
training sparse classifiers typically scale quite unfavorably with
respect to the number of data points in training datasets. This talk
outlines both a streaming/online and a multipass but sequential
algorithm for training sparse classifiers for high dimensional data
whose computational complexity and memory requirements make learning on
massive datasets feasible. The central idea that makes this possible is
an analysis based on a simple quadratic approximation to the likelihood
function.
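To illustrate how a quadratic approximation to the likelihood can keep memory independent of the number of data points, here is a minimal Python sketch. It is not taken from the talk and should not be read as the speaker's algorithm. The class name StreamingSparseLogistic, the parameters lam, ridge and sweeps, the choice of logistic regression with an L1 penalty solved by coordinate descent, and the synthetic data are all assumptions made for illustration. Each incoming point contributes a second-order Taylor expansion of its log-likelihood around the current estimate; only the accumulated quadratic terms are stored, and the raw point is then discarded.

```python
import math
import random


def soft_threshold(z, lam):
    """Shrink z toward zero by lam (the L1 proximal step)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def sigmoid(s):
    """Numerically stable logistic function."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)


class StreamingSparseLogistic:
    """Hypothetical one-pass sparse logistic classifier (illustrative only)."""

    def __init__(self, dim, lam=1.0, ridge=1e-3, sweeps=5):
        self.dim, self.lam, self.sweeps = dim, lam, sweeps
        self.beta = [0.0] * dim
        # Accumulated quadratic surrogate of the log-likelihood:
        #   -0.5 * beta' A beta + c' beta
        # The small ridge keeps every diagonal entry of A positive.
        self.A = [[ridge if i == j else 0.0 for j in range(dim)]
                  for i in range(dim)]
        self.c = [0.0] * dim

    def predict_proba(self, x):
        return sigmoid(sum(b * xi for b, xi in zip(self.beta, x)))

    def update(self, x, y):
        """Fold one labelled point (y in {0, 1}) into the surrogate, then re-solve."""
        s = sum(b * xi for b, xi in zip(self.beta, x))
        p = sigmoid(s)
        w = p * (1.0 - p)        # curvature of the log-likelihood at beta
        coef = (y - p) + w * s   # gradient term plus Hessian applied to beta
        for i in range(self.dim):
            self.c[i] += coef * x[i]
            for j in range(self.dim):
                self.A[i][j] += w * x[i] * x[j]
        self._coordinate_descent()

    def _coordinate_descent(self):
        """Maximize the surrogate minus lam * ||beta||_1, warm-started."""
        for _ in range(self.sweeps):
            for j in range(self.dim):
                r = self.c[j] - sum(self.A[j][k] * self.beta[k]
                                    for k in range(self.dim) if k != j)
                self.beta[j] = soft_threshold(r, self.lam) / self.A[j][j]


def train_on_synthetic_stream(n=400, seed=0):
    """Stream n synthetic points; only the first two of five features matter."""
    rng = random.Random(seed)
    model = StreamingSparseLogistic(dim=5)
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(5)]
        y = 1 if rng.random() < sigmoid(2.0 * x[0] - 1.5 * x[1]) else 0
        model.update(x, y)
    return model
```

Calling train_on_synthetic_stream() returns a model whose beta list holds the fitted coefficients. Storage in this sketch is a dense d-by-d matrix in the number of features d, so it does not grow with the length of the stream; a method aimed at genuinely high-dimensional data would need a leaner representation than this.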
Mark Hansen (UCLA)
Viewing Machines: Embedded Coupled Human-observational Systems
Thursday 2:30-3:00, Fountain III
Abstract:
The communications revolution of the 1990s has forever
changed how individuals and organizations exchange information at a
global scale. By combining these technologies with microsensing
capabilities, we directly couple the virtual and physical worlds and
seed the next revolution, the rise of Embedded Networked Sensing
(ENS). This new local revolution seeks to make the world "transparent"
by enabling observation of physical, biological, and chemical
processes up close, and at a level of spatial and temporal detail that is
simply impossible with larger-scale traditional remote or manual
sensing. The ability to observe the physical world with high fidelity
permits creation and refinement of models that can make predictions
and eventually even manage the physical world in a range of
application contexts. This has profound implications for scientific
discovery and technological advances.
Through our design, development, and deployment activities at the
Center for Embedded Networked Sensing (CENS), we have found that
effective embedded sensing is not always about the largest number of
the smallest sensors. Robust, scalable and flexible systems require a
layered mix of observational resources, including sensor capabilities,
networking, modeling, and user interaction. Our future observing
systems will employ programmable, adaptive, and autonomous
coordination among heterogeneous embedded devices to export
information, not just raw data, and will fuse this high-fidelity, in
situ information with external data sources. In this talk, we will
characterize the role of "data scientists" in architecting ENS
technologies, in developing data formats, communication protocols and
software systems which provide rich feedback loops between
environmental phenomena, the operational characteristics of sensing
systems, human-orchestrated design and data analysis and inference.
We will consider two use cases. First, with the growing recognition
that significant advancements in environmental research and protection
of environmental quality must be associated with a better
understanding of the dynamics of complex ecological systems, there has
been an expanded national and international focus on the development
of environmental observatories to facilitate research at multiple
spatial and temporal scales. The existing and proposed systems of
environmental observatories share a common mission to support new
research through innovations in sensing and informatics. Because
environmental observatories will create massive amounts of new data,
there are important concerns as to how researchers will access these
data streams in real time, how data will be analyzed and models
developed, and how instrumentation used by individual researchers will
be linked with data streams from the observatory instruments.
In the second use case, we consider applications of ENS outside of
scientific, engineering and industrial settings. These projects are
not managed by a central agency or institution, but are deployed by
private citizens and operate in personal, social or "urban" spaces.
These applications are attractive because they offer a tremendous
range of possibilities for data sharing experiences and for enabling
social exchange. They require new algorithms and software mechanisms
because, unlike scientific applications of distributed sensing, a
single system is widely distributed, intermittently connected, and
privately administered; and, unlike traditional Internet applications,
the physical inputs are critical to the behavior.
David Marchette (George Mason University and Naval Surface Warfare Center)
Analysis of Streaming Text
Thursday 3:00-3:30, Fountain III
Abstract:
We consider the problem of analysing streaming text, such as news feeds or email streams.
Models based on simple word frequency statistics are used to characterize the different
topics within the stream, providing a clustering of the documents in real time. These methods
show that very simple statistics can be used to extract interesting information from the stream.
Documents are linked together based on their word co-occurrence, producing a time series of
graphs which can be used for further analysis of the stream. Methods for updating the models
of the text are discussed and illustrated on real data, along with limitations of the
methodology.
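The idea of linking documents by word co-occurrence as they arrive can be sketched in a few lines of Python. This is an illustrative assumption, not the speaker's method. The class name StreamingTextLinker, the parameters window, top_k and min_shared, the TF-IDF-style weighting, and the sliding window of recent documents are all hypothetical choices. Running document-frequency counts are updated with each new document, each document is reduced to its highest-weighted words, and an edge is emitted to every recent document sharing enough of those words. The edge lists returned over time form a sequence of graphs.

```python
import math
import re
from collections import Counter, deque


class StreamingTextLinker:
    """Hypothetical streaming document linker (illustrative only)."""

    def __init__(self, window=50, top_k=5, min_shared=2):
        self.doc_freq = Counter()           # word -> number of docs containing it
        self.n_docs = 0                     # documents seen so far
        self.recent = deque(maxlen=window)  # (doc_id, key-word set) pairs
        self.top_k = top_k
        self.min_shared = min_shared

    def _keywords(self, tokens):
        """Return the top_k words of a document by a TF-IDF-style weight."""
        tf = Counter(tokens)

        def weight(word):
            idf = math.log((1 + self.n_docs) / (1 + self.doc_freq[word])) + 1.0
            return tf[word] * idf

        # Ties are broken alphabetically so the result is deterministic.
        ranked = sorted(tf, key=lambda w: (-weight(w), w))
        return set(ranked[:self.top_k])

    def add(self, doc_id, text):
        """Ingest one document; return edges (older_id, doc_id, shared_count)."""
        tokens = re.findall(r"[a-z]+", text.lower())
        # Update the running statistics before weighting the new document.
        self.n_docs += 1
        self.doc_freq.update(set(tokens))
        keys = self._keywords(tokens)
        edges = []
        for other_id, other_keys in self.recent:
            shared = len(keys & other_keys)
            if shared >= self.min_shared:
                edges.append((other_id, doc_id, shared))
        self.recent.append((doc_id, keys))
        return edges
```

For example, after creating linker = StreamingTextLinker(top_k=10), feeding it two short headlines about markets and interest rates followed by one about football yields an edge between the first two and none for the third. No stop-word removal is done here, so common words can create spurious links on real text.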