Spatio-temporal Data Mining (Zoran Obradovic, organizer)
Richard A. Berk (Department of Statistics, UCLA)
Using Ensemble Statistical Procedures for Imputing Homeless Counts for Los Angeles County
Friday 10:30-11:00, Fountain II
Abstract:
Early in 2005, the number of homeless individuals was counted
in a random sample of Los Angeles County census tracts. From these
counts, a credible homeless estimate and standard error were computed
for the County as a whole. However, there was also a policy need for
counts at the level of individual tracts, even tracts in which counts
were not available. Using tract-level predictors from the US Census and
land use information, random forests and boosted trees were
used to construct imputed counts for the tracts not visited. The
nature of these counts and their properties will be examined.
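
As a rough illustration of the imputation idea in the abstract above, the sketch below fits an ensemble on "visited" tracts and predicts counts for tracts that were not visited. It is not the author's procedure: it uses bagged regression stumps with a random feature choice, a much-simplified stand-in for random forests, and the two predictors, the synthetic counts, and all function names are assumptions made for the example.

```python
import random

def fit_stump(rows, targets, feature):
    """Best single-split regression stump on one feature (least squared error)."""
    order = sorted(range(len(rows)), key=lambda i: rows[i][feature])
    best = None
    for k in range(1, len(order)):
        lo, hi = rows[order[k - 1]][feature], rows[order[k]][feature]
        if lo == hi:
            continue  # no valid threshold between identical values
        left = [targets[i] for i in order[:k]]
        right = [targets[i] for i in order[k:]]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((t - lm) ** 2 for t in left) + sum((t - rm) ** 2 for t in right)
        if best is None or sse < best[0]:
            best = (sse, (lo + hi) / 2, lm, rm)
    if best is None:  # feature is constant in this sample: predict the mean
        m = sum(targets) / len(targets)
        return (feature, float("inf"), m, m)
    return (feature, best[1], best[2], best[3])

def fit_ensemble(rows, targets, n_stumps=200, seed=0):
    """Bagging: each stump sees a bootstrap sample and one random feature."""
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    stumps = []
    for _ in range(n_stumps):
        idx = [rng.randrange(n) for _ in range(n)]
        feature = rng.randrange(n_features)
        stumps.append(fit_stump([rows[i] for i in idx],
                                [targets[i] for i in idx], feature))
    return stumps

def predict(stumps, row):
    """Average the predictions of all stumps."""
    preds = [lm if row[f] <= thr else rm for f, thr, lm, rm in stumps]
    return sum(preds) / len(preds)

# Synthetic data (NOT real counts): two hypothetical tract-level predictors.
rng = random.Random(1)
tracts = [[rng.uniform(0, 1), rng.uniform(0, 1)] for _ in range(60)]
counts = [max(0, round(20 * d + 10 * c + rng.gauss(0, 1))) for d, c in tracts]

visited = list(range(40))        # tracts in the random sample
unvisited = list(range(40, 60))  # tracts needing imputed counts

stumps = fit_ensemble([tracts[i] for i in visited], [counts[i] for i in visited])
imputed = {i: predict(stumps, tracts[i]) for i in unvisited}
```

Averaging many weak predictors fitted to bootstrap samples is what makes the imputed counts more stable than those from any single tree; boosted trees, the other method named in the abstract, instead fit each new tree to the residuals of the previous ones.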
Zoran Obradovic (Information Science and Technology Center, Temple University)
Integration of Deterministic and Statistical Algorithms for Retrieval and Analysis of Geophysical Parameters
Friday 11:00-11:30, Fountain II
Abstract:
Current methods for the retrieval of geophysical information from satellite data are based on deterministic forward simulation algorithms. In this approach, physical models predict what the instruments will observe under possible atmospheric and reflective surface conditions. These predictions are then compared to observations, and the condition corresponding to the best prediction is assumed to hold. The drawbacks include high computational cost and the need for manual enhancements to the postulated physical models. We propose a novel data-mining-based method that addresses these drawbacks by complementing deterministic models with computationally cheaper statistical algorithms that can exploit data of varying quality obtained from multiple sources. Statistical retrieval involves learning classification or regression mappings from observed attributes to the corresponding geophysical parameters.
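
The contrast drawn in the abstract above can be sketched in a few lines. The example is a toy under stated assumptions: `forward_model` is a hypothetical one-parameter linear simulation, not a real radiative transfer model, and the training pairs are synthetic. Deterministic retrieval searches candidate conditions for the best match to the observation, while statistical retrieval learns a regression from observations back to the parameter.

```python
import random

def forward_model(param):
    """Hypothetical forward simulation: parameter -> simulated observation."""
    return 0.05 + 0.3 * param

def retrieve_by_lookup(observed, candidates):
    """Deterministic retrieval: the candidate whose simulation best matches."""
    return min(candidates, key=lambda p: (forward_model(p) - observed) ** 2)

def fit_linear(xs, ys):
    """Ordinary least squares for y = a + b * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

# Synthetic training pairs: noisy observations with known parameters.
rng = random.Random(0)
true_params = [rng.uniform(0.0, 2.0) for _ in range(200)]
observations = [forward_model(p) + rng.gauss(0.0, 0.01) for p in true_params]

# Statistical retrieval: learn the mapping observation -> parameter once.
a, b = fit_linear(observations, true_params)

obs = forward_model(1.2)                      # a new observation
candidates = [i / 100 for i in range(201)]    # conditions to simulate
lookup_estimate = retrieve_by_lookup(obs, candidates)
regression_estimate = a + b * obs
```

The lookup runs the forward model once per candidate for every observation, whereas the fitted regression costs one multiplication and one addition per observation, which is the computational saving the abstract refers to.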
Shashi Shekhar (Department of Computer Science, University of Minnesota)
What is Special About Mining Spatial Datasets?
Friday 11:30-12:00, Fountain II
Abstract:
The importance of spatial data mining is growing with the increasing availability of large geo-spatial datasets such as maps, remote-sensing images, and the decennial census. Applications include geo-spatial intelligence, location-based services, predicting the clustering or spread of disease, finding crime hot spots, and Mission to Planet Earth (global change and climatology, land-use classification).
Classical data mining techniques often perform poorly when applied to spatial datasets, for the following reasons. First, spatial data is embedded in a continuous space (with notions of center and edge), whereas classical datasets are often discrete. Second, spatial patterns are often local, whereas classical data mining techniques often focus on global patterns. Finally, one of the common assumptions in classical statistical analysis is that data samples are independently generated. When it comes to the analysis of spatial data, however, the assumption of independent samples is generally false because spatial data tends to be highly self-correlated. For example, people with similar characteristics, occupations, and backgrounds tend to cluster together in the same neighborhoods. In spatial statistics this tendency is called spatial autocorrelation. Ignoring spatial autocorrelation when analyzing data with spatial characteristics may produce hypotheses or models that are inaccurate or inconsistent with the dataset.
Thus, new methods are needed to analyze spatial data and detect spatial patterns. This talk surveys some of these methods, including those for discovering spatial co-locations, detecting spatial outliers, and location prediction.
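
The spatial autocorrelation described in this abstract is commonly quantified with Moran's I, which is positive when neighboring locations hold similar values and negative when they hold dissimilar ones. The sketch below is a minimal pure-Python version under assumptions not taken from the talk: a regular grid, binary rook (edge-sharing) neighbor weights, and two made-up value patterns.

```python
def morans_i(values, weights):
    """Moran's I for n values and an n x n spatial weight matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    w_sum = sum(sum(row) for row in weights)
    num = sum(weights[i][j] * dev[i] * dev[j]
              for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / w_sum) * (num / den)

def rook_weights(rows, cols):
    """Binary rook-adjacency weights for a grid stored in row-major order."""
    n = rows * cols
    w = [[0.0] * n for _ in range(n)]
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    w[i][rr * cols + cc] = 1.0
    return w

w = rook_weights(4, 4)

# Similar values grouped together: left half 1, right half 0.
clustered = [1, 1, 0, 0] * 4

# Every cell differs from all of its rook neighbors.
checkerboard = [(r + c) % 2 for r in range(4) for c in range(4)]

i_clustered = morans_i(clustered, w)        # positive autocorrelation
i_checker = morans_i(checkerboard, w)       # negative autocorrelation
```

A value near zero is what independently generated samples would produce on average, so a clearly positive Moran's I is a practical signal that the independence assumption of classical methods does not hold for the data at hand.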