Best of SIAM Data Mining 2006 (Vipin Kumar, organizer; Arnold Goodman, chair)


Indrajit Bhattacharya (University of Maryland, College Park)
Lisa Getoor (University of Maryland, College Park)
A Latent Dirichlet Model for Unsupervised Entity Resolution

Friday 2:00-2:20, Fountain III

Abstract:

Entity resolution has received considerable attention in recent years. Given many references to underlying entities, the goal is to predict which references correspond to the same entity. We show how to extend the Latent Dirichlet Allocation model for this task and propose a probabilistic model for collective entity resolution for relational domains where references are connected to each other. Our approach differs from other recently proposed entity resolution approaches in that it is a) generative, b) does not make pair-wise decisions and c) captures relations between entities through a hidden group variable. We propose a novel sampling algorithm for collective entity resolution which is unsupervised and also takes entity relations into account. Additionally, we do not assume the domain of entities to be known and show how to infer the number of entities from the data. We demonstrate the utility and practicality of our relational entity resolution approach for author resolution in two real-world bibliographic datasets. In addition, we present preliminary results on characterizing conditions under which relational information is useful.



Jerry Scripps (Michigan State University)
Pang-Ning Tan (Michigan State University)
Clustering in the Presence of Bridge-Nodes

Friday 2:20-2:40, Fountain III

Abstract:

In this paper, we study the ill-effects of bridge-nodes, which causes many dissimilar objects to be placed together in the same cluster by existing clustering algorithms. We offer two new metrics for measuring how well a clustering algorithm handles the presence of bridge-nodes. We also illustrate how algorithms that produce overlapping clusters help to alleviate the effect of bridge-nodes and form more meaningful clusters. However, if there is too much overlap, the clusters become less informative. To address this problem, we present a novel clustering algorithm called MIN-CUT. Our experimental results with real data sets show that the MIN-CUT algorithm leads to purer clusters that have very little overlap.



Hongyan Liu (University of Illinois, Urbana-Champaign)
Jiawei Han (University of Illinois, Urbana-Champaign)
Dong Xin (University of Illinois, Urbana-Champaign)
Zheng Shao (University of Illinois, Urbana-Champaign)
Mining Interesting Patterns from Very High Dimensional Data: A Top-Down Row Enumeration Approach

Friday 2:40-3:00, Fountain III

Abstract:

Data sets of very high dimensionality, such as microarray data, pose great challenges on efficient processing to most existing data mining algorithms. Recently, there comes a row-enumeration method that performs a bottom-up search of row combination space to find corresponding frequent patterns. Due to limited number of rows in micro-array data, this method is more efficient than column enumeration-based algorithms. However, the bottom-up search strategy cannot take advantage of user-specified minimum support threshold to efficiently prune search space, and therefore leads to long runtime and much memory overhead.

In this paper, we propose a new search strategy, top-down mining, integrated with a novel row enumeration tree, which makes full use of the pruning power of the minimum support threshold to cut down search space dramatically. Using this kind of searching strategy, we design an algorithm, TD-Close, to find a complete set of frequent closed patterns from very high-dimensional data. Furthermore, an effective closeness-checking method is also developed that avoids scanning the dataset multiple times. Our performance study shows that the TD-Close algorithm outperforms substantially both Carpenter, a bottom-up searching algorithm, and FPclose, a column enumeration-based frequent closed pattern mining algorithm.


Rong Ge (Simon Fraser University)
Martin Ester (Simon Fraser University)
Byron J. Gao (Simon Fraser University)
Zengjian Hu (Simon Fraser University)
Boaz Ben-Moshe (Simon Fraser University)
Joint Cluster Analysis of Attribute Data and Relationship Data: the Connected k-Center Problem

Friday 3:00-3:20, Fountain III

Abstract:

Attribute data and relationship data are two principle types of data, representing the intrinsic and extrinsic properties of entities. While attribute data has been the main source of data for cluster analysis, relationship data such as social networks or metabolic networks are becoming increasingly available. It is also common to observe both data types carry orthogonal information such as in market segmentation and community identification, which calls for a joint cluster analysis of both data types so as to achieve more accurate results. For this purpose, we introduce the novel Connected k-Center problem, taking into account attribute data as well as relationship data. We analyze the complexity of this problem and prove its NP-completeness. We also present a constant factor approximation algorithm, based on which we further design NetScan, a heuristic algorithm that is efficient for large, real databases. Our experimental evaluation demonstrates the meaningfulness and accuracy of the NetScan results.