Statistics and Information Technology (Bin Yu, organizer)


Akshay Adhikari (Avaya Labs Research)
Lorraine Denby (Avaya Labs Research)
Jim Landwehr (Avaya Labs Research)
Jean Meloche (Avaya Labs Research)
Monitoring the Converged Network

Thursday 10:30-11:00, Fountain I

Abstract:

Converged networks used for voice, video, and conferencing demand real-time monitoring and analysis of numerous quality of service (QoS) metrics. The data is voluminous: packet stream measurements are collected simultaneously for several different tests, generating metrics of delay, jitter, and loss between pairs of subnets. Statistical analysis and visualization of these streaming, real-time QoS metrics are vital in helping the network administrator learn quickly of QoS degradation that can affect the converged services, pinpoint the problem areas, and determine the root cause. In this talk we will describe the role that statistical analysis plays in meeting these needs.



John Lafferty (Carnegie Mellon University)
The Evolution of Science: Time Series Models of Scientific Journals and Other Large Text Databases

Thursday 11:00-11:30, Fountain I

Abstract:

A surge of recent research in machine learning and statistics has developed new techniques for automatically finding patterns of words in document collections using hierarchical probabilistic models. These models are called "topic models" because the word patterns often reflect the underlying topics that are combined to form the documents; however, topic models also apply naturally to data such as images and biological sequences.

While previous topic models have assumed that the corpus is static, many document collections actually change over time: scientific articles, emails, and search queries reflect evolving content, and it is important to model the corresponding evolution of the underlying topics. We describe new work on probabilistic models designed to capture the dynamics of the topics as they evolve over time.

Traditional time series modeling has focused on continuous data, but topic models are designed for categorical data. Our approach is to use state space models on the natural parameter space of multinomial and logistic normal distributions that represent topic models as points on a high-dimensional probability simplex over the word vocabulary. Due to the nonconjugacy of the Gaussian and multinomial models, posterior inference is intractable, and we develop variational approximations based on Kalman filters and nonparametric wavelet regression to carry out approximate posterior inference over the latent topics.
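
As a rough illustration of the state space idea only, the sketch below lets one topic's natural parameters follow a Gaussian random walk and maps each state to a word distribution on the simplex with a softmax. It covers just this generative step, not the variational inference described above. The function names, the zero starting state, the vocabulary size, and the noise scale sigma are all assumptions made for the example and are not taken from the talk.

```python
import math
import random


def softmax(beta):
    """Map a vector of natural parameters to a point on the probability simplex."""
    m = max(beta)  # subtract the max for numerical stability
    exps = [math.exp(b - m) for b in beta]
    total = sum(exps)
    return [e / total for e in exps]


def simulate_topic(vocab_size, n_steps, sigma, seed=0):
    """Simulate one topic whose natural parameters follow a Gaussian random walk.

    Hypothetical helper: returns a list of word distributions, one per time step.
    """
    rng = random.Random(seed)
    beta = [0.0] * vocab_size  # assumed starting state
    path = []
    for _ in range(n_steps):
        # State transition: beta_t = beta_{t-1} + Gaussian noise
        beta = [b + rng.gauss(0.0, sigma) for b in beta]
        # Observation side: word probabilities on the simplex
        path.append(softmax(beta))
    return path


topics = simulate_topic(vocab_size=5, n_steps=3, sigma=0.5)
for t, dist in enumerate(topics):
    print(t, [round(p, 3) for p in dist])
```

Each printed row is a valid distribution over the toy vocabulary, and successive rows drift smoothly because only the underlying Gaussian state is perturbed.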

In addition to giving quantitative, predictive models of a corpus, topic models provide a qualitative window into the contents of a large document collection, allowing a user to explore the structure of the corpus in a topic-guided fashion. We demonstrate the capabilities of these new models on the archives of the journal Science, founded in 1880 by Thomas Edison. Our models are built on the noisy text resulting from an optical character recognition engine run over the original bound journals by JSTOR, the online scholarly journal archive.


Creon Levit (NASA Ames Research Center)
Using Graphics Processing Unit (GPU) Hardware for Interactive Exploration of Large Multivariate Data

Thursday 11:30-12:00, Fountain I

Abstract:

Exploratory data analysis, real-time visualization, interactive data mining, manually assisted projection pursuit, and related techniques are compelling and useful. Unfortunately, for large multivariate data (say, much more than 10^5 samples, or records, with more than 5 variables per sample), the interactive graphics response of most statistical analysis programs stutters, stalls, or stops working altogether. However, the graphics processing units (GPUs) built into all professional desktop and laptop computers currently on the market are capable of transforming, filtering, and rendering hundreds of millions of points per second. We present a prototype open-source, cross-platform application that leverages some of the power latent in the GPU to enable smooth interactive exploration and analysis of large high-dimensional data using a variety of classical and recent techniques.