Interface 2004
Abstract

Bayesian Hierarchical Models of the Browsing Behavior of World Wide Web Users
Juana Sanchez, (University of California Los Angeles), jsanchez@stat.ucla.edu, and
Ching-Ti Liu, (University of California Los Angeles), ctliu@stat.ucla.edu


We consider the case of surfing within a single large Web site, which is important from the point of view of site design, Web server proxy efficiency, and optimal search-engine ranking of pages. The site used as an example to illustrate the methods is msnbc.com. We use a set of server log data on the Web pages chosen by 989,818 users in a twenty-five-hour period, where the response measure for each user is an ordered sequence of choices among 17 categories (UCI KDD Archive). A common way to model browsing behavior is to assume that a user's decision process is a random walk whose first passage time to a threshold follows a two-parameter inverse Gaussian distribution. Another hypothesis that has been examined is that users conduct an independent Bernoulli trial at each page to make a stopping decision, which implies a geometric distribution for the number of pages viewed. Mixtures of first-order Markov processes, and model-based clustering with and without a Bayesian flavor, have offered very useful exploratory data analyses. All these studies have shown evidence that Web-surfing behavior may be non-Markov in nature and have illustrated how hard it is to capture dependencies in the data; the evidence on how the models perform over a wide range of Web site formats is still inconclusive. Performance has been measured by the ability to predict page hits, by the resulting distribution of page hits, and by the contribution to efficient Web caching schemes. Some models have been tested with server log data from AOL or similar sites, and others have been tested within a single Web site such as msnbc.com. The levels of aggregation of pages and of clustering of user behavior have also varied across studies. In this paper, we argue that for browsing within a news portal such as msnbc.com, where contents are continually changing, the server log data are meaningful only when pages are aggregated into categories, as they are in the msnbc.com data set, and that the order of the browsing may not be relevant. We use a complex Bayesian hierarchical model of the page counts per user. This model has enough parameters to fit the data well, while using a population distribution to structure dependence among the parameters. The model can be generalized to different types of Web sites, different levels of aggregation of pages, and different clustering schemes. We compare the performance of this new model to that of previous models.
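
As a rough illustration of the two earlier stopping hypotheses mentioned in the abstract, the following Python sketch writes out the geometric probability mass function implied by independent Bernoulli stopping and the two-parameter inverse Gaussian density used for random-walk first passage times. The function names, the parameter names (`p_stop`, `mu`, `lam`), and the simulation helper are assumptions made for this example only; this is not the authors' code and it does not implement the hierarchical model proposed in the paper.

```python
import math
import random


def geometric_pmf(k, p_stop):
    """P(session length = k pages) when the user stops after each page
    with probability p_stop, independently of earlier pages."""
    if k < 1:
        return 0.0
    return (1.0 - p_stop) ** (k - 1) * p_stop


def inverse_gaussian_pdf(x, mu, lam):
    """Two-parameter inverse Gaussian density with mean mu and shape lam,
    the first-passage-time law of a random walk with drift to a threshold."""
    if x <= 0:
        return 0.0
    scale = math.sqrt(lam / (2.0 * math.pi * x ** 3))
    exponent = -lam * (x - mu) ** 2 / (2.0 * mu ** 2 * x)
    return scale * math.exp(exponent)


def simulate_session_length(p_stop, rng):
    """Simulate one session: view a page, then stop with probability p_stop."""
    length = 1
    while rng.random() >= p_stop:  # user continues to another page
        length += 1
    return length


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed, purely for reproducibility
    p_stop = 0.3            # hypothetical stopping probability
    lengths = [simulate_session_length(p_stop, rng) for _ in range(20000)]
    print("simulated mean length:", sum(lengths) / len(lengths))
    print("theoretical mean length:", 1.0 / p_stop)
```

Under the Bernoulli stopping hypothesis the mean session length is `1 / p_stop`, so comparing the simulated and theoretical means gives a quick sanity check on the sketch.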

