George Mason University
CDS/CCDS/Statistics Colloquium Series
Seminar Announcement


Text Mining, Social Networks, and High Dimensional Analysis

Edward J. Wegman

Department of Computational and Data Sciences
and Department of Statistics
George Mason University

Research 1, Room 301, Fairfax Campus
George Mason University, 4400 University Drive, Fairfax, VA 22030

Time: 10:30 a.m. Refreshments, 10:45 a.m. Colloquium Talk
Date: April 25, 2008



ABSTRACT

A traditional approach to text mining has been to represent a document by a vector. In the bag-of-words representation, binary vectors are used, and two documents are regarded as similar if the angle between their corresponding vectors is small (i.e., the correlation between the vectors is high). The document vectors may be assembled into a term-document matrix (TDM). A more satisfying representation of a document can be formulated in terms of bigrams or trigrams, because these have a better chance of capturing semantic content. Bigram vectors can be assembled into bigram-document matrices (BDM). The TDM and BDM resemble the two-mode adjacency matrices associated with social network analysis (SNA). Using cues from SNA, we formulate the one-mode social network adjacency matrices to form document-document matrices (DD) and bigram-bigram matrices (BB). In this talk I outline the basics, discuss the connection between text mining and social networks, and, by example, illustrate the dimensionality issues raised by such vector space methods.
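
The constructions named in the abstract can be illustrated with a minimal sketch. It is not the speaker's implementation. The toy corpus, the helper names (binary_matrix, matmul, cosine), and the rows-as-features orientation are all assumptions made for illustration, and the one-mode matrices are formed with the usual two-mode projection (a matrix multiplied by its transpose), which the abstract does not spell out.

```python
import math

# Hypothetical toy corpus, invented for illustration only.
docs = [
    "text mining of documents",
    "social network analysis",
    "text mining and social network analysis",
]

def bigrams(tokens):
    """Adjacent word pairs of a token list."""
    return list(zip(tokens, tokens[1:]))

def binary_matrix(feature_lists):
    """Binary feature-by-document matrix: rows are features, columns are documents."""
    vocab = sorted({f for feats in feature_lists for f in feats})
    feature_sets = [set(feats) for feats in feature_lists]
    matrix = [[1 if f in s else 0 for s in feature_sets] for f in vocab]
    return vocab, matrix

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    b_cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]

def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

tokens = [d.split() for d in docs]

# Two-mode matrices: terms x documents (TDM) and bigrams x documents (BDM).
terms, tdm = binary_matrix(tokens)
bigram_vocab, bdm = binary_matrix([bigrams(t) for t in tokens])

# One-mode projections (assumed construction):
# DD[i][j] = number of terms shared by documents i and j.
# BB[i][j] = number of documents containing both bigram i and bigram j.
dd = matmul(transpose(tdm), tdm)
bb = matmul(bdm, transpose(bdm))

# Document vectors are the columns of the TDM.
doc_vectors = transpose(tdm)
similarity = cosine(doc_vectors[1], doc_vectors[2])

print(len(terms), "terms;", len(bigram_vocab), "bigrams")
print("DD =", dd)
print("cosine(doc 1, doc 2) =", round(similarity, 4))
```

Even in this three-document example, BB is a square matrix whose side equals the number of distinct bigrams, so its size grows quadratically with the bigram vocabulary. This is one concrete form of the dimensionality issue the abstract mentions.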