Is Multinomial PCA Multi-faceted Clustering or Dimensionality Reduction?

Wray Buntine and Sami Perttu

Helsinki Inst. of Information Technology
HIIT, P.O. Box 9800
FIN-02015 HUT, Finland


Appeared in AI and Statistics 2003. Available in Postscript and PDF versions.

A GPL'd test suite for this system is now available for trial. Contact the authors!


Discrete analogues to Principal Components Analysis (PCA) are intended to handle discrete or positive-only data, for instance, sets of documents. The class of methods is appropriately called multinomial PCA because it replaces the Gaussian in the probabilistic formulation of PCA with a multinomial. Experiments to date, however, have been on small data sets, for instance, those from early information retrieval collections. This paper demonstrates the method on two large data sets and considers two extremes of behaviour: (1) dimensionality reduction, where the feature set (i.e., the bag of words) is considerably reduced, and (2) multi-faceted clustering (or aspect modelling), where items are clustered but can belong to several clusters at once.
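To make the idea of "replacing the Gaussian with a multinomial" concrete, the following is a minimal sketch of one way such a generative model can be written down. It is an illustration under stated assumptions, not the system described in the paper: the Dirichlet prior on the component proportions, the function names `sample_dirichlet` and `sample_document`, and all parameter values are hypothetical choices made for this example.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one point on the probability simplex via normalised Gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_document(alpha, components, length, rng):
    """Generate one bag of words under the assumed model.

    alpha      -- Dirichlet parameters, one per component (assumed prior)
    components -- K rows, each a probability distribution over the vocabulary
    length     -- number of word tokens in the document
    """
    # Component proportions for this document (the low-dimensional representation).
    m = sample_dirichlet(alpha, rng)
    vocab_size = len(components[0])
    # Mix the component distributions: p[j] = sum_k m[k] * components[k][j].
    word_probs = [
        sum(m[k] * components[k][j] for k in range(len(m)))
        for j in range(vocab_size)
    ]
    # Multinomial draw of word counts, in place of a Gaussian observation model.
    counts = [0] * vocab_size
    for j in rng.choices(range(vocab_size), weights=word_probs, k=length):
        counts[j] += 1
    return m, counts


rng = random.Random(0)
# Two hypothetical components over a four-word vocabulary.
components = [
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5],
]
m, counts = sample_document([0.5, 0.5], components, 100, rng)
```

In this sketch the proportions `m` play both roles discussed above. With far fewer components than vocabulary words, `m` is a reduced representation of the document; and because `m` can put weight on several components at once, a document can belong to several "clusters" rather than exactly one.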

Wray Buntine
Last modified: Tue Jan 21 12:54:17 EET 2003