Text Mining (Edward Wegman, organizer)


John Rigsby (Naval Surface Warfare Center)
Multi-Mode Co-clustering of Different Text Data Attributes


Saturday 8:30-9:00, Fountain II

Abstract:

Text analysis involves many attributes: words, documents, bigrams, trigrams, n-grams, contextual relationships, latent semantics, and many others. This paper covers a spectral graph method for co-clustering multiple attributes at the same time. Co-clustering is useful not only because it turns a two-step process into a one-step process, but also because it shows the relationships between different sets of attributes. This paper goes beyond the usual two-mode co-clustering (e.g., words and documents) to co-clustering multiple modes (e.g., words, documents, bigrams, and trigrams) all at the same time.
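
For readers unfamiliar with the two-mode case that the abstract takes as its starting point, the following is a minimal sketch of spectral co-clustering on a document-by-word count matrix. It is not the speaker's multi-mode method, which the abstract does not specify. It assumes the common bipartite-graph formulation: normalize the matrix by row and column sums, take the second singular vectors, and split at zero (a simplification of the usual k-means step). The function name `spectral_cocluster` and the toy matrix are hypothetical.

```python
import numpy as np


def spectral_cocluster(counts):
    """Split rows (documents) and columns (words) into two co-clusters.

    Assumes every row and every column of `counts` has a positive sum.
    Returns (row_labels, col_labels) as 0/1 arrays; the same label marks
    rows and columns that belong to the same co-cluster.
    """
    a = np.asarray(counts, dtype=float)
    row_deg = a.sum(axis=1)  # total count per document
    col_deg = a.sum(axis=0)  # total count per word
    # Degree-normalized matrix D1^(-1/2) A D2^(-1/2).
    a_norm = a / np.sqrt(row_deg)[:, None] / np.sqrt(col_deg)[None, :]
    u, _, vt = np.linalg.svd(a_norm, full_matrices=False)
    # The first singular pair is trivial; the second carries the split.
    row_embed = u[:, 1] / np.sqrt(row_deg)
    col_embed = vt[1, :] / np.sqrt(col_deg)
    # Simplification: threshold at zero instead of running k-means.
    return (row_embed > 0).astype(int), (col_embed > 0).astype(int)


if __name__ == "__main__":
    # Hypothetical toy data: documents 0-1 mostly use words 0-1,
    # documents 2-3 mostly use words 2-3.
    toy = [[5, 4, 0, 1],
           [4, 5, 1, 0],
           [0, 1, 5, 4],
           [1, 0, 4, 5]]
    rows, cols = spectral_cocluster(toy)
    print("document labels:", rows)
    print("word labels:    ", cols)
```

Because the sign of a singular vector is arbitrary, which group receives label 0 or 1 may differ between runs or platforms; only the grouping itself is meaningful. Documents and words are clustered in a single step, which is the two-step-to-one-step point made in the abstract.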



Padhraic Smyth (Department of Computer Science, UC Irvine)
Text Mining Using Statistical Topic Models


Saturday 9:00-9:30, Fountain II

Abstract:

This talk will describe recent work on latent variable models for large sets of text documents. The focus will be on topic models, where documents are represented as finite mixtures of topics, and topics are narrowly focused distributions over a word vocabulary. A key point is that the topic-word distributions are learned automatically from the data with no manual labelling required. Topic models have been found to be highly effective for automated summarization of documents, for tagging document content, for information retrieval, and for a variety of other text-related applications. The talk will discuss the underlying principles of topic models as well as parameter estimation techniques using Gibbs sampling. More recent extensions such as the author-topic model will also be briefly discussed. The talk will include illustrative applications using large document sets from MEDLINE, CiteSeer, Enron emails, and Pennsylvania Gazette articles from the 18th century.
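
As an illustration of the ideas named in the abstract (documents as mixtures of topics, topics as distributions over words, estimation by Gibbs sampling), here is a minimal sketch of a collapsed Gibbs sampler for latent Dirichlet allocation, one standard topic model. It is not taken from the talk. The function name `lda_gibbs`, the symmetric hyperparameters `alpha` and `beta`, their default values, and the toy corpus are all assumptions made for the example.

```python
import numpy as np


def lda_gibbs(docs, n_topics, vocab_size, alpha=0.1, beta=0.01,
              n_iter=200, seed=0):
    """Collapsed Gibbs sampling for LDA on lists of integer word ids.

    Returns (theta, phi): theta[d] is the topic mixture of document d,
    phi[k] is the word distribution of topic k.
    """
    rng = np.random.default_rng(seed)
    doc_topic = np.zeros((len(docs), n_topics))   # tokens per (doc, topic)
    topic_word = np.zeros((n_topics, vocab_size))  # tokens per (topic, word)
    topic_total = np.zeros(n_topics)               # tokens per topic

    # Random initial topic assignment for every token.
    assign = []
    for d, doc in enumerate(docs):
        z = rng.integers(n_topics, size=len(doc))
        assign.append(z)
        for w, k in zip(doc, z):
            doc_topic[d, k] += 1
            topic_word[k, w] += 1
            topic_total[k] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assign[d][i]
                doc_topic[d, k] -= 1
                topic_word[k, w] -= 1
                topic_total[k] -= 1
                # Conditional distribution over topics for this token.
                p = ((doc_topic[d] + alpha) * (topic_word[:, w] + beta)
                     / (topic_total + vocab_size * beta))
                k = rng.choice(n_topics, p=p / p.sum())
                # Record the newly sampled assignment.
                assign[d][i] = k
                doc_topic[d, k] += 1
                topic_word[k, w] += 1
                topic_total[k] += 1

    # Point estimates from the final counts.
    theta = (doc_topic + alpha) / (
        doc_topic.sum(axis=1, keepdims=True) + n_topics * alpha)
    phi = (topic_word + beta) / (topic_total[:, None] + vocab_size * beta)
    return theta, phi


if __name__ == "__main__":
    # Hypothetical toy corpus: word ids 0-2 and 3-5 tend to co-occur.
    corpus = [[0, 1, 0, 1, 2], [3, 4, 3, 4, 5], [0, 2, 1, 0], [5, 3, 4, 4]]
    theta, phi = lda_gibbs(corpus, n_topics=2, vocab_size=6)
    print(np.round(theta, 2))
    print(np.round(phi, 2))
```

No labels are supplied anywhere: the topic-word distributions in `phi` are estimated from word co-occurrence alone, which is the point about learning without manual labelling. A practical implementation would average over several samples after a burn-in period rather than use the final state only.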



Jeffrey L. Solka (Naval Surface Warfare Center)
Literature-based Discovery for the Identification of New Methods of Water Purification


Saturday 9:30-10:00, Fountain II

Abstract:

This talk will discuss our recent work on identifying new approaches to water purification using literature-based discovery. We have developed several new approaches to the problem of finding interesting relationships among loosely related corpora. In this case we seek relationships between a set of core articles from the water purification literature and an expanded set of articles taken from the medical arena. Preliminary results obtained using mathematical, statistical, and visualization methods will be presented.