Is Multinomial PCA Multi-faceted Clustering or
Dimensionality Reduction?
Wray Buntine and Sami Perttu
Helsinki Inst. of Information Technology
HIIT, P.O. Box 9800
FIN-02015 HUT, Finland
{wray.buntine,sami.perttu}@hiit.fi
Appeared in AI and Statistics 2003.
Postscript version.
PDF version.
GPL'd test suite for this system now available for trial. Contact authors!
Abstract:
Discrete analogues of
Principal Components Analysis (PCA) are intended to handle discrete or
positive-only data, such as sets of documents.
The class of methods is appropriately called multinomial PCA because
it replaces the Gaussian in the probabilistic formulation of
PCA with a multinomial.
Experiments to date, however, have used only small
data sets, for instance those from early information retrieval collections.
This paper demonstrates the method on two large data sets
and considers two extremes of behaviour:
(1) dimensionality reduction where the feature set (i.e., bag of words) is
considerably reduced, and
(2) multi-faceted clustering (or aspect modelling),
where items are clustered but can now belong
to several clusters at once.
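
The idea of replacing the Gaussian with a multinomial can be illustrated with a small generative sketch. This is only a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not specify a prior over component proportions, so the Dirichlet used here is an assumption, and the names `sample_document`, `alpha`, and `omega` are hypothetical. Each document draws proportions over components and then draws its bag-of-words counts from a multinomial whose word probabilities mix the components, which is how one item can belong to several clusters at once.

```python
import numpy as np

def sample_document(rng, alpha, omega, length):
    """Draw one bag-of-words count vector from a multinomial-PCA-style model.

    alpha  : (K,) Dirichlet parameters over components (assumed prior)
    omega  : (K, V) matrix; each row is one component's word distribution
    length : total number of words in the document
    """
    # Component proportions for this document (non-negative, sum to 1).
    m = rng.dirichlet(alpha)
    # Word probabilities are a proportion-weighted mix of the components.
    word_probs = m @ omega
    # A multinomial takes the place of the Gaussian noise model of PCA.
    counts = rng.multinomial(length, word_probs)
    return m, counts

rng = np.random.default_rng(0)
K, V = 3, 8                                  # hypothetical sizes
omega = rng.dirichlet(np.ones(V), size=K)    # made-up component-word matrix
alpha = np.full(K, 0.5)                      # made-up prior parameters
m, counts = sample_document(rng, alpha, omega, length=50)
```

In this sketch, a document whose `m` puts nearly all its weight on one component behaves like a member of a single cluster, while a spread-out `m` corresponds to membership in several clusters. Using `m` in place of the `V` word counts as the document's representation corresponds to the dimensionality-reduction reading.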
Wray Buntine
Last modified: Tue Jan 21 12:54:17 EET 2003