Rebeka Jornsten (Department of Statistics, Rutgers University)
Clustering of miRNA and mRNA Expression via Rate-distortion Based Model Selection
Friday 2:30-3:00, San Rafael
Abstract:
We introduce a novel approach to model selection in the analysis of mRNA
and miRNA expression data. By reformulating selection in terms of
rate-distortion theory, we can simultaneously select genes that are
differentially expressed, and identify which
conditions are discriminating. This also simplifies the simultaneous
selection in model-based clustering by making it entirely parallel across
clusters. The goal is to allocate model complexity to each cluster of
mRNA/miRNA such that the trade-off between goodness-of-fit and model
complexity is balanced equally across all clusters. In the simplest case,
MSE is our distortion measure. Model complexity, or rate, is a function of
the cluster size and of the number of conditions for which the mRNA and
miRNA are differentially expressed. Other rate criteria
can also be derived using predictive densities.
For each cluster, a rate-distortion curve is traced by computing the MSE
for models of different complexity. If we use, e.g., L2-boosting, these
curves are continuous; in subset selection the curves are linear
interpolations between subset models. It has long been known that the
optimal rate allocation corresponds to operating at points of equal slope
on the rate-distortion curves for all data subsets (here, clusters) k. Any
other allocation of the same total rate leads to an increase in overall
distortion (Ortega et al., 1998).
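
The equal-slope condition follows from a short Lagrangian argument, sketched here. The notation is assumed for illustration and does not appear in the abstract: D_k(R_k) is the distortion of cluster k at rate R_k, taken to be convex and decreasing, R is a total rate budget, and \lambda \ge 0 is the operating slope.

```latex
\min_{R_1,\dots,R_K} \sum_{k=1}^{K} D_k(R_k)
\quad \text{subject to} \quad \sum_{k=1}^{K} R_k \le R,
\qquad
J(\lambda) = \sum_{k=1}^{K} \bigl[ D_k(R_k) + \lambda R_k \bigr],
\qquad
\frac{\partial J}{\partial R_k} = 0
\;\Longrightarrow\;
D_k'(R_k) = -\lambda \quad \text{for all } k .
```

When no rate on the k-th curve attains slope -\lambda, the minimum of D_k(R_k) + \lambda R_k sits at the boundary R_k = 0.
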
Fixing an operating slope, we pick the cluster models at this slope of the
rate-distortion curves. If no point on the k-th curve satisfies this slope
constraint, the null model is automatically selected for cluster k.
Several clusters may thus form a joint null-model
cluster. Finally, the overall fit of the mixture model is evaluated using
BIC, and the operating slope is chosen by a line search.
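
The selection steps described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation. All function names are hypothetical. Each cluster's curve is assumed to be a list of (rate, distortion) pairs with the null model first. The operating point at a given slope is found by minimizing distortion + slope * rate, which is the usual Lagrangian form of the equal-slope rule. The Gaussian BIC-style score, with summed rates standing in for the parameter count, is also an assumption.

```python
import math


def pick_at_slope(curve, lam):
    """Index of the point on one curve minimizing distortion + lam * rate.

    curve: list of (rate, distortion) pairs; index 0 is the null model.
    Ties go to the lower index, i.e. the simpler model.
    """
    costs = [d + lam * r for r, d in curve]
    return min(range(len(curve)), key=costs.__getitem__)


def allocate(curves, lam):
    """Pick one model per cluster at a common operating slope lam."""
    return [pick_at_slope(curve, lam) for curve in curves]


def gaussian_bic(curves, choice, n_obs):
    """Assumed BIC-style score: n * log(SSE / n) + (total rate) * log(n)."""
    total_d = sum(c[i][1] for c, i in zip(curves, choice))
    total_r = sum(c[i][0] for c, i in zip(curves, choice))
    return n_obs * math.log(total_d / n_obs) + total_r * math.log(n_obs)


def line_search(curves, slopes, n_obs):
    """Try each candidate slope and keep the one with the lowest score."""
    best = min(
        slopes,
        key=lambda lam: gaussian_bic(curves, allocate(curves, lam), n_obs),
    )
    return best, allocate(curves, best)


# Toy curves (made-up numbers): cluster 0 gains a lot from one extra
# parameter, cluster 1 gains almost nothing and falls back to the null model.
curves = [
    [(0, 10.0), (1, 4.0), (2, 3.0)],
    [(0, 5.0), (1, 4.8), (2, 4.7)],
]
best_slope, best_choice = line_search(curves, [0.05, 2.0, 100.0], n_obs=20)
```

Because each cluster is handled by its own call to `pick_at_slope`, the per-cluster selection at a fixed slope can run in parallel; only the line search over slopes couples the clusters.
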
We apply our method to the analysis of developmental miRNA/mRNA
expression. Two cell-lines (one experimentally confirmed to be
'pre-programmed' to become neurons, the other to
become glia) are observed at 0, 1 and 3 hours after a growth factor is
added to the medium. To determine which miRNA-mRNA pairs differ between
the cell-lines, and at what time points, we fit a multi-level mixture
model to the data. The first level of the hierarchy models the
time-course, allowing for a sign-flip to account for negative association
within miRNA-mRNA pairs (repressor vs. activator); the second level models
the cell-line/time interactions. A total of five clusters are selected,
corresponding to diverging, converging, and static cell-line differences.
Biological validation of the diverging miRNAs, believed to determine the
fate of the cell population, is now underway.