George Mason University
AES/CCS/SCS/Statistics Colloquium Series
Seminar Announcement


Weighting the Text Proximity Matrix

Wendy L. Martinez

Office of Naval Research

Location: Johnson Center: Meeting Room D
Time: 10:30 a.m. Refreshments, 10:45 a.m. Colloquium Talk
Date: December 3, 2004



ABSTRACT

The bigram proximity matrix (BPM) was first developed by Martinez and Wegman [2002] as a way of encoding free-form text so textual data can be used in applications requiring numerical computation. Previous studies with the BPM indicated that documents can be successfully classified using k nearest neighbors and other methods when they are encoded in this way. The objective of the current work is to define bigram weights analogous to the term weights found in natural language processing and to investigate the utility of using them in document classification.

The BPM is a non-symmetric matrix that captures the number of word co-occurrences in a moving two-word window. The elements of the BPM represent the raw frequency of bigrams in a document. We can weight those frequencies using local, global and document weights. We define two local weights that attempt to down-weight the frequencies: logarithmic and augmented normalized frequency. We use cosine normalization for the document weight and inverse document frequency for the global weight. Finally, we define pointwise mutual information between a bigram and a document and use this as the entry in the BPM, instead of the raw frequency.
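
As an illustration of the weighting scheme described above, the following is a minimal Python sketch, not the authors' implementation. It makes several assumptions the abstract does not spell out: text is tokenized by lowercasing and splitting on whitespace, the BPM is stored sparsely as a mapping from ordered word pairs to counts, and the weights take the common textbook forms log(1 + f) for the logarithmic weight, 0.5 + 0.5 f / max f for the augmented normalized frequency, and log(N / n) for the inverse document frequency. The function names are hypothetical. Pointwise mutual information is left out because its exact definition is not given here.

```python
import math
from collections import Counter


def bigram_counts(text):
    """Raw BPM entries, stored sparsely: (first_word, second_word) -> count.

    Word order is kept, so ("the", "cat") and ("cat", "the") are different
    entries, which is why the matrix is non-symmetric.
    Assumed tokenization: lowercase, split on whitespace.
    """
    words = text.lower().split()
    return Counter(zip(words, words[1:]))


def log_weight(f):
    """Assumed logarithmic local weight: log(1 + f)."""
    return math.log(1 + f)


def augmented_weight(f, max_f):
    """Assumed augmented normalized frequency: 0.5 + 0.5 * f / max_f."""
    return 0.5 + 0.5 * f / max_f if f > 0 else 0.0


def weight_corpus(docs, local="log"):
    """Apply local weight * global (IDF) weight, then cosine-normalize."""
    raw = [bigram_counts(d) for d in docs]
    n_docs = len(raw)
    # Number of documents in which each bigram occurs at least once.
    doc_freq = Counter(b for counts in raw for b in counts)

    weighted = []
    for counts in raw:
        max_f = max(counts.values(), default=1)
        vec = {}
        for bigram, f in counts.items():
            if local == "log":
                loc = log_weight(f)
            else:
                loc = augmented_weight(f, max_f)
            idf = math.log(n_docs / doc_freq[bigram])  # assumed IDF form
            vec[bigram] = loc * idf
        # Cosine normalization: scale the document to unit Euclidean length.
        norm = math.sqrt(sum(v * v for v in vec.values()))
        if norm > 0:
            vec = {b: v / norm for b, v in vec.items()}
        weighted.append(vec)
    return weighted
```

Under this IDF form, a bigram that occurs in every document receives weight zero, so only bigrams that discriminate between documents contribute to the normalized vector.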

In the experiments conducted for this research, a BPM was created for each document in a corpus of 503 tagged documents, and the raw bigram frequencies were weighted as described above. We used the k nearest neighbor classification method to determine whether adjusting the bigram frequencies yields better results.
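
The classification step can be sketched in the same hedged spirit. The abstract does not state the distance measure or the value of k, so the example below assumes cosine similarity between weighted BPMs that have already been scaled to unit length, represents each document as a sparse dictionary keyed by bigram, and uses a simple majority vote. The names `dot` and `knn_classify` and the topic labels in the usage lines are hypothetical.

```python
from collections import Counter


def dot(u, v):
    """Dot product of two sparse vectors stored as dicts."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the smaller vector
    return sum(w * v.get(key, 0.0) for key, w in u.items())


def knn_classify(query, train_vecs, train_labels, k=3):
    """Label a query document by majority vote among its k most similar
    training documents. For unit-length vectors the dot product equals
    the cosine similarity (an assumption of this sketch).
    """
    scored = sorted(
        ((dot(query, vec), label) for vec, label in zip(train_vecs, train_labels)),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Toy usage with hand-made unit vectors and made-up topic labels.
train_vecs = [
    {("a", "b"): 1.0},
    {("a", "b"): 0.8, ("b", "c"): 0.6},
    {("x", "y"): 1.0},
]
train_labels = ["sports", "sports", "finance"]
predicted = knn_classify({("a", "b"): 1.0}, train_vecs, train_labels, k=3)
```

Comparing the accuracy of such a classifier on raw versus weighted BPMs is one way to frame the question the experiments address.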

Martinez, A. R. and E. J. Wegman, 2002, "A text stream transformation for semantic-based clustering," Computing Science and Statistics, 34: 184-203.