While previous topic models have assumed that the corpus is static, many document collections actually change over time: scientific articles, emails, and search queries reflect evolving content, and it is important to model the corresponding evolution of the underlying topics. We describe new work on probabilistic models designed to capture the dynamics of topics as they evolve over time.
Traditional time series modeling has focused on continuous data, but topic models are designed for categorical data. Our approach is to use state space models on the natural parameter space of multinomial and logistic normal distributions that represent topic models as points on a high-dimensional probability simplex over the word vocabulary. Because the Gaussian and multinomial models are not conjugate, exact posterior inference is intractable, and we develop variational approximations based on Kalman filters and nonparametric wavelet regression to carry out approximate posterior inference over the latent topics.
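To make the state space idea concrete, the following is a minimal generative sketch, not the inference procedure described above. It assumes the simplest possible dynamics, a Gaussian random walk on a topic's natural parameters, and maps each time slice to the simplex with a softmax. The function names, the drift scale `sigma`, and the vocabulary size are illustrative assumptions rather than details taken from this work.

```python
import numpy as np


def softmax(beta):
    """Map natural parameters to points on the probability simplex."""
    # Subtract the row-wise maximum for numerical stability.
    shifted = beta - beta.max(axis=-1, keepdims=True)
    expd = np.exp(shifted)
    return expd / expd.sum(axis=-1, keepdims=True)


def simulate_topic_drift(num_steps, vocab_size, sigma=0.1, seed=0):
    """Simulate one topic whose natural parameters follow a Gaussian random walk.

    Assumed dynamics (illustrative only):
        beta[t] = beta[t - 1] + Normal(0, sigma^2 * I)
    Returns an array of shape (num_steps, vocab_size); each row is a
    distribution over the vocabulary at one time slice.
    """
    rng = np.random.default_rng(seed)
    beta = np.zeros((num_steps, vocab_size))
    beta[0] = rng.normal(0.0, 1.0, size=vocab_size)
    for t in range(1, num_steps):
        beta[t] = beta[t - 1] + rng.normal(0.0, sigma, size=vocab_size)
    return softmax(beta)


# One topic over 5 time slices and a toy vocabulary of 8 words.
topics = simulate_topic_drift(num_steps=5, vocab_size=8)

# Categorical observations: 100 word tokens drawn from the topic at each slice.
rng = np.random.default_rng(1)
counts = np.stack([rng.multinomial(100, p) for p in topics])
```

The Gaussian steps act on unconstrained parameters while the observed words are multinomial draws from the softmax of those parameters. That mismatch is the source of the nonconjugacy that motivates approximate inference.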
In addition to giving quantitative, predictive models of a corpus, topic models provide a qualitative window into the contents of a large document collection, allowing a user to explore the structure of the corpus in a topic-guided fashion. We demonstrate the capabilities of these new models on the archives of the journal Science, founded in 1880 by Thomas Edison. Our models are built on the noisy text resulting from an optical character recognition engine run over the original bound journals by JSTOR, the online scholarly journal archive.