Home

Scaling em (expectation-maximization) clustering to large databases


Author(s) : Usama Fayyad P. S. Bradley Cory A. Reina Usama M. Fayyad Paul S. Bradley Cory Reina, 
Publisher : N/A
Publication Date : 1998
ISSN : N/A
Abstract : Practical statistical data clustering algorithms require multiple data scans to converge. For large databases, these scans become prohibitively expensive. We present a scalable clustering framework requiring at most one scan of the database, and apply it to the Expectation-Maximization (EM) algorithm. Unlike distance-based or hard membership algorithms (such as k-Means) EM is known to be an appropriate optimization algorithm for constructing proper statistical models of the data. It also easily accommodates categorical and continuous data fields. It admits varying degrees of data membership in multiple clusters. Our scalable method is based on identifying regions of the data that are compressible and regions that must be maintained in memory. The approach operates within the confines of a limited memory buffer. Data resolution is preserved to the extent possible based upon the size of the memory buffer and the fit of the current clustering model to the data. We extend the framework to update multiple clustering models simultaneously. Computational tests indicate that this scalable scheme outperforms sampling-based and incremental approaches-- the straightforward alternatives to 'scaling' existing traditional in-memory implementations to large databases.,