Text Mining (Edward Wegman, organizer)
John Rigsby (Naval Surface Warfare Center)
Multi-Mode Co-clustering of Different Text Data Attributes
Saturday 8:30-9:00, Fountain II
Abstract:
Text analysis can draw on many attributes: words, documents, bigrams, trigrams, n-grams, contextual relationships, latent semantics, and many others. This paper covers a spectral graph method for co-clustering multiple attributes at the same time. Co-clustering is useful not only because it turns a two-step process into a one-step process, but also because it reveals the relationships between different sets of attributes. This paper goes beyond conventional two-mode co-clustering (e.g., words and documents) into co-clustering multiple modes (e.g., words, documents, bigrams, and trigrams) all at the same time.
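For orientation only, the two-mode baseline that the abstract contrasts itself with can be sketched as a spectral bipartition of a word-by-document count matrix. This is a minimal sketch of the standard bipartite spectral approach, not the multi-mode method of the talk. The function name `spectral_cocluster_bipartition` and the toy matrix are assumptions made for illustration, and the code assumes every word and every document has at least one nonzero count.

```python
import numpy as np

def spectral_cocluster_bipartition(A):
    """Split rows (words) and columns (documents) of A into two co-clusters.

    Illustrative two-mode sketch only; assumes no all-zero rows or columns.
    """
    A = np.asarray(A, dtype=float)
    # Degree scalings: row sums and column sums of the bipartite graph.
    d1_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    d2_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=0))
    # Normalized matrix D1^{-1/2} A D2^{-1/2}.
    An = d1_inv_sqrt[:, None] * A * d2_inv_sqrt[None, :]
    U, s, Vt = np.linalg.svd(An, full_matrices=False)
    # The second singular vector pair carries the two-way partition.
    row_embed = d1_inv_sqrt * U[:, 1]
    col_embed = d2_inv_sqrt * Vt[1, :]
    # Words and documents with the same sign fall in the same co-cluster.
    row_labels = (row_embed > 0).astype(int)
    col_labels = (col_embed > 0).astype(int)
    return row_labels, col_labels

# Hypothetical toy counts: 4 words x 4 documents with two noisy blocks.
A = np.array([[5, 4, 0, 0],
              [4, 5, 1, 0],
              [0, 1, 5, 4],
              [0, 0, 4, 5]])
word_labels, doc_labels = spectral_cocluster_bipartition(A)
```

Words and documents are labeled in a single step here, which is the "one-step" property the abstract refers to. Which group receives label 0 and which receives label 1 depends on the sign convention of the SVD.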
Padhraic Smyth (Department of Computer Science, UC Irvine)
Text Mining Using Statistical Topic Models
Saturday 9:00-9:30, Fountain II
Abstract:
This talk will describe recent work on latent variable models for large sets of text documents. The focus will be on topic models, where documents are represented as finite mixtures of topics, and topics are narrowly focused distributions over a word vocabulary. A key point is that the topic-word distributions are learned automatically from the data with no manual labelling required. Topic models have been found to be highly effective for automated summarization of documents, for tagging document content, for information retrieval, and for a variety of other text-related applications. The talk will discuss the underlying principles of topic models as well as parameter estimation techniques using Gibbs sampling. More recent extensions such as the author-topic model will also be briefly discussed. The talk will include illustrative applications using large document sets from MEDLINE, CiteSeer, Enron emails, and Pennsylvania Gazette articles from the 18th century.
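As a rough illustration of the kind of estimation the abstract mentions, the following is a minimal sketch of a collapsed Gibbs sampler for a basic topic model with symmetric Dirichlet priors. It is not taken from the talk. The function name `lda_gibbs`, the hyperparameter values `alpha` and `beta`, and the toy corpus are all assumptions made for illustration.

```python
import numpy as np

def lda_gibbs(docs, n_topics, vocab_size, n_iter=200,
              alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling sketch; docs is a list of lists of word ids."""
    rng = np.random.default_rng(seed)
    ndk = np.zeros((len(docs), n_topics))    # document-topic counts
    nkw = np.zeros((n_topics, vocab_size))   # topic-word counts
    nk = np.zeros(n_topics)                  # total words per topic
    z = []                                   # topic assignment per token
    for d, doc in enumerate(docs):
        zd = rng.integers(n_topics, size=len(doc))  # random initialization
        z.append(zd)
        for w, k in zip(doc, zd):
            ndk[d, k] += 1
            nkw[k, w] += 1
            nk[k] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # Remove the current token from the counts.
                ndk[d, k] -= 1
                nkw[k, w] -= 1
                nk[k] -= 1
                # Conditional probability of each topic for this token.
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + vocab_size * beta)
                k = rng.choice(n_topics, p=p / p.sum())
                # Add the token back under its newly sampled topic.
                z[d][i] = k
                ndk[d, k] += 1
                nkw[k, w] += 1
                nk[k] += 1
    # Smoothed estimates of the document-topic and topic-word distributions.
    theta = (ndk + alpha) / (ndk + alpha).sum(axis=1, keepdims=True)
    phi = (nkw + beta) / (nkw + beta).sum(axis=1, keepdims=True)
    return theta, phi

# Hypothetical toy corpus over a 6-word vocabulary (word ids 0..5).
docs = [[0, 1, 2, 0, 1], [1, 2, 0, 2], [3, 4, 5, 3], [4, 5, 3, 5, 4]]
theta, phi = lda_gibbs(docs, n_topics=2, vocab_size=6)
```

Each row of `theta` is a document's mixture over topics and each row of `phi` is a topic's distribution over the vocabulary, matching the representation described in the abstract. No labels are supplied at any point; the distributions are estimated from word counts alone.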
Jeffrey L. Solka (Naval Surface Warfare Center)
Literature-based Discovery for the Identification of New Methods of Water Purification
Saturday 9:30-10:00, Fountain II
Abstract:
This talk will discuss our recent work in the identification of new approaches to water purification using literature-based discovery. We have developed several new approaches to the problem of finding interesting relationships among loosely related corpora. In this case we seek relationships between a set of core articles from the water purification literature and an expanded set of articles taken from the medical arena. Preliminary results obtained via the use of mathematical, statistical, and visualization methodologies will be presented.