Detecting Outliers, Changes, and Extreme Events
(Diane Lambert, chair)


Robert Grossman (University of Illinois at Chicago; Open Data Partners)
Anushka Aanand (University of Illinois at Chicago)
John Chaves (Open Data Partners)
Michal Sabala (University of Illinois at Chicago)
Steve Vejik (Open Data Partners)
Lee Wilkinson (SPSS; University of Illinois at Chicago)
Detecting Changes Using Data Cubes of Detection Classifiers (DCDCDC)

Friday 4:00-4:20, San Marino

Abstract:

We are interested in the problem of establishing baselines and detecting changes in large volumes of multi-modal sensor data.

In this paper, we introduce a new technique called Detecting Changes Using Data Cubes of Detection Classifiers, or the DCDCDC Algorithm. For each cell in a multidimensional data cube, we develop a separate baseline and a corresponding change detection algorithm. For example, a three-dimensional data cube might include dimensions for time, spatial location, and weather conditions.

Our novelty is a methodology for quickly computing and updating thousands or tens of thousands of such baselines, providing an effective means of detecting changes even in very large volumes of multi-modal sensor data.

We developed a testbed containing real-time data from over 830 highway traffic sensors in the Chicago region, weather data, and text data about events that might affect traffic. The goal was to detect interesting changes in traffic conditions in real time.

For this study, we built a separate baseline for each hour of the day, for each day of the week, and for every 2 or 3 traffic sensors, resulting in over 42,000 separate baseline models. We also built a baseline engine to construct the necessary baselines automatically. We modified an open source scoring engine to process each new sensor reading in real time, update the appropriate feature vectors, score the updated feature vectors using the baseline models, and send real-time alerts to handheld devices when deviations from the baselines were detected.
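
The per-cell idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say what form the baseline models take, so the sketch assumes a running mean and variance per cell with a z-score threshold. The names Baseline, BaselineCube, score_and_update, threshold, and min_count, and the cell key (day, hour, sensor group), are all hypothetical.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Baseline:
    """Running mean/variance for one cube cell (Welford's online update).

    Assumption: a Gaussian-style baseline; the abstract does not specify one.
    """
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the running mean

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def zscore(self, x: float) -> float:
        if self.n < 2:
            return 0.0
        std = math.sqrt(self.m2 / (self.n - 1))
        return 0.0 if std == 0.0 else (x - self.mean) / std


class BaselineCube:
    """One independent baseline per cell of the data cube."""

    def __init__(self, threshold: float = 3.0, min_count: int = 5):
        self.cells = defaultdict(Baseline)  # cell key -> Baseline
        self.threshold = threshold          # hypothetical alert threshold
        self.min_count = min_count          # readings required before alerting

    def score_and_update(self, cell, value: float) -> bool:
        """Score a reading against its cell's baseline, then fold it in."""
        baseline = self.cells[cell]
        alert = (baseline.n >= self.min_count
                 and abs(baseline.zscore(value)) > self.threshold)
        baseline.update(value)
        return alert


# Hypothetical cell key: (day of week, hour of day, sensor group).
cube = BaselineCube()
cell = ("Mon", 8, "sensor_group_17")
for speed in [55.0, 57.0, 54.0, 56.0, 55.5, 56.5]:
    cube.score_and_update(cell, speed)

print(cube.score_and_update(cell, 20.0))                             # far below this cell's baseline
print(cube.score_and_update(("Mon", 3, "sensor_group_17"), 20.0))    # new cell, no baseline yet
```

Because each cell keeps only three numbers and updates in constant time, tens of thousands of cells stay cheap to maintain, which is the property the abstract emphasizes.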


Jim Shine (US Army Topographic Engineering Center)
Paul Krause (US Army Topographic Engineering Center)
Predicting Times and Locations of Insurgent Attacks

Friday 4:20-4:40, San Marino

Abstract:

US armed forces are under attack from mortar and other devices in several foreign locations. Any models that can achieve some degree of success in predicting future attacks based on past ones obviously have high relevance and priority to the mission of the US military and the safety of its members. We present some analysis and visualization of past events and discuss conclusions and future work.



Juergen Symanzik (Department of Mathematics and Statistics, Utah State University)
Robert Gillies (Department of Aquatic, Watershed, and Earth Resources, Utah State University)
Hee Lee (Department of Aquatic, Watershed, and Earth Resources, Utah State University)
Peter Ma (Department of Aquatic, Watershed, and Earth Resources, Utah State University)
NDVI Data Reduction for Fuzzy Statistical Evaluation

Friday 4:40-5:00, San Marino

Abstract:

Previous time-series studies of the 1 km Advanced Very High Resolution Radiometer (AVHRR) and the 250 m Moderate Resolution Imaging Spectroradiometer (MODIS) NDVI datasets have yet to establish quantitative techniques that evaluate distributions of NDVI composites to identify seasonal deviations of those distributions from a normal curve. This study was conducted to determine whether an effective technique could be developed to isolate anomalous NDVI distributions in both datasets. NDVI composites are compared against normal baseline distributions through quantitative assessments of probability density functions (pdfs) and cumulative distribution functions (cdfs). The effects of sample size, interval assignment techniques, and baseline development on two statistical tests (chi-square and Kolmogorov-Smirnov) were investigated for the Washington, D.C. area. A sample size of 50 was established, which allowed accurate results to be obtained from both statistical tests. Assignment of class intervals to NDVI distributions showed that a data-driven mean and standard deviation method was the most effective, given pixel assignments across class intervals and agreement in pdf plots. Influenced by the measure of normality, the two baseline techniques (11-year and comparative) functioned differently and produced different results.
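
As a rough illustration of the two tests named above, the following Python sketch computes a Kolmogorov-Smirnov statistic and a Pearson chi-square statistic for a sample of 50 values against a normal baseline. It is not the authors' procedure: the baseline mean and standard deviation (0.6 and 0.1), the class-interval edges at the mean plus or minus one standard deviation, and all function names are assumptions made for the example.

```python
import math
from statistics import NormalDist


def normal_cdf(x: float, mu: float, sigma: float) -> float:
    """Cumulative distribution function of N(mu, sigma^2)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def ks_statistic(sample, mu: float, sigma: float) -> float:
    """One-sample Kolmogorov-Smirnov statistic D against N(mu, sigma^2)."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = normal_cdf(x, mu, sigma)
        # The empirical cdf jumps from i/n to (i+1)/n at x.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


def chi_square_statistic(sample, mu: float, sigma: float, edges=(-1.0, 0.0, 1.0)) -> float:
    """Pearson chi-square statistic over class intervals cut at mu + e * sigma."""
    bounds = [-math.inf] + [mu + e * sigma for e in edges] + [math.inf]
    n = len(sample)
    stat = 0.0
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        observed = sum(1 for x in sample if lo < x <= hi)
        expected = n * (normal_cdf(hi, mu, sigma) - normal_cdf(lo, mu, sigma))
        stat += (observed - expected) ** 2 / expected
    return stat


# Hypothetical baseline and two deterministic samples of size 50.
mu, sigma, n = 0.6, 0.1, 50
baseline = NormalDist(mu, sigma)
normal_like = [baseline.inv_cdf((i + 0.5) / n) for i in range(n)]
shifted = [x + 2 * sigma for x in normal_like]  # stand-in for an anomalous composite

for name, sample in [("normal-like", normal_like), ("shifted", shifted)]:
    print(name, ks_statistic(sample, mu, sigma), chi_square_statistic(sample, mu, sigma))
```

For n = 50, the usual large-sample 5% critical value for D is about 1.36 / sqrt(50), roughly 0.19, so the shifted sample would be flagged by the Kolmogorov-Smirnov test while the normal-like sample would not.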



Ramalingam Shanmugam (Texas State University)
Chance Models of Earthquake/Tsunami for Quick Warning

Friday 5:00-5:20, San Marino

Abstract:

Forecast earthquake times often do not match actual incidence times. Earthquake and tsunami forecasting is in an embryonic stage and needs considerable improvement. Interdisciplinary teams in seismology are working hard to make breakthroughs that avoid destruction from these natural disasters. For example, on December 28, 2004, Sandra Blakeslee of the New York Times reported that on Feb 4, 1975 the Chinese government evacuated the town of Haicheng based on an earthquake forecast. A year later a 7.6 earthquake struck Tangshan, China, without any warning, and an estimated 250,000 people died. The importance of making accurate and timely forecasts is obvious. Disastrous events like the tsunami that occurred in the Indian Ocean on December 26, 2004 are frightening, but 21st century Earth science has the potential to avert such catastrophes. The statistics community should take up the challenge as well, since occurrences of such natural disasters do exhibit some statistical regularity. In this presentation we discuss how probability models and data analysis techniques can help in the construction and implementation of early warning systems. Two types of data are pertinent for accurate early warning: the collected or collectable type and the non-collectable type. The collectable type includes the location of the epicenter, the magnitude of the quake, the elevation or depth relative to sea level, and the distances (south, north, east, west) to land. The non-collectable type includes the amounts of mass emitted or absorbed. We review the scientific literature related to concepts and analytical tools, and evaluate their strengths and weaknesses. We then discuss our attempt to develop better probability models and data mining methods for early warning of tsunami and earthquake occurrences. We use real data to test these models and methods, and present some comments on the scope and limitations of our efforts.



William Heavlin (Sun Microsystems)
Archetypal Analysis of Computer Performance

Friday 5:20-5:40, San Marino

Abstract:

In 1994, Cutler and Breiman introduced the multivariate method of archetypal analysis. Useful because its descriptions are composites of the original data, archetypes essentially approximate the data's convex hull to calculate representative extremes. Such extremes help considerably in understanding computer performance studies, where the performance scores form a complete matrix of hardware (rows) and software tasks (columns); our examples here span several generations of x86 microprocessors. In applying archetypes to such data, we confront and resolve these issues: (a) data transformations, (b) attenuation effects, (c) relative performance ratios, and (d) the row-column orientation. The last of these motivates a generalization, termed here co-archetypes, which is invariant to transposing the rows and columns of the data matrix.
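
For orientation, the standard archetypal analysis problem of Cutler and Breiman can be written as below. The notation is introduced here for illustration and is not taken from the abstract: X is the n-by-m data matrix, k is the number of archetypes, B (k-by-n) builds each archetype as a convex combination of data rows, and A (n-by-k) approximates each data row as a convex combination of archetypes. The co-archetype generalization is not covered by this formula.

```latex
\min_{A,\,B}\; \lVert X - A B X \rVert_F^2
\quad \text{subject to} \quad
a_{ij} \ge 0,\ \sum_{j=1}^{k} a_{ij} = 1 \ (i = 1,\dots,n),
\qquad
b_{ji} \ge 0,\ \sum_{i=1}^{n} b_{ji} = 1 \ (j = 1,\dots,k).
```

The rows of Z = BX are the archetypes. Because each is a convex combination of observed rows, it lies within the convex hull of the data, which is why archetypes can be read as representative extremes.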