Best K: the Critical Clustering Structure in Categorical Datasets
Keke Chen and Ling Liu
Demand for cluster analysis of categorical data has grown steadily over the
last decade. A well-known problem in categorical clustering is determining the
"best K", the most appropriate number of clusters. Although several categorical
clustering algorithms have been developed, surprisingly, none has satisfactorily
addressed the Best K problem for categorical clustering. Because categorical
data has no inherent distance function to serve as a similarity measure,
traditional cluster validation techniques based on geometric shape and
density distribution are not appropriate for it. In this paper,
we study the entropy property between the clustering results of categorical
data and propose the BKPlot method to address two important
problems: 1) How can we determine whether there is significant clustering
structure in a categorical dataset? 2) If there is significant clustering
structure, what is the set of candidate "best Ks"? We develop a
hierarchical categorical clustering algorithm, ACE, to help explore the entropy
property of clustering structure and to generate high-quality BKPlots. We also
investigate issues in applying the ACE algorithm to generate approximate
BKPlots for very large datasets and data streams. Experimental results show
that the BKPlot method with the ACE algorithm can effectively identify the
significant clustering structures in categorical datasets.
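To make the entropy idea concrete, here is a minimal sketch of how the quality of a categorical partition might be scored without any distance function. The abstract does not spell out the exact criterion, so the formulation below is an assumption: the entropy of a cluster is taken as the sum of the Shannon entropies of its attribute columns, and a partition is scored by the size-weighted average over its clusters. The function names `cluster_entropy` and `expected_entropy` are hypothetical and are not taken from the paper or from ACE.

```python
from collections import Counter
from math import log2


def cluster_entropy(rows):
    """Sum, over attributes, of the Shannon entropy (in bits) of the values
    that attribute takes within one cluster. Assumed formulation."""
    if not rows:
        return 0.0
    n = len(rows)
    total = 0.0
    # zip(*rows) iterates over attribute columns rather than records.
    for column in zip(*rows):
        for count in Counter(column).values():
            p = count / n
            total -= p * log2(p)
    return total


def expected_entropy(rows, labels):
    """Size-weighted average of per-cluster entropies for a partition
    given as one cluster label per record."""
    clusters = {}
    for row, label in zip(rows, labels):
        clusters.setdefault(label, []).append(row)
    n = len(rows)
    return sum(len(members) / n * cluster_entropy(members)
               for members in clusters.values())


# Toy dataset: four records, two categorical attributes.
rows = [("a", "x"), ("a", "x"), ("b", "y"), ("b", "y")]

one_cluster = expected_entropy(rows, [0, 0, 0, 0])   # everything together
two_clusters = expected_entropy(rows, [0, 0, 1, 1])  # the natural split

print(one_cluster, two_clusters)
```

On this toy data the single-cluster partition scores 2.0 bits and the two-cluster partition scores 0.0, which illustrates why tracking how such a score changes as K varies can hint at where significant clustering structure lies.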
Representative papers: