Best K: the Critical Clustering Structure in Categorical Datasets

Keke Chen and Ling Liu


The demand for cluster analysis of categorical data has continued to grow over the last decade. A well-known problem in categorical clustering is determining the best number of clusters, the "best K". Although several categorical clustering algorithms have been developed, surprisingly, none has satisfactorily addressed the best-K problem. Because categorical data has no inherent distance function to serve as a similarity measure, traditional cluster validation techniques based on geometric shape and density distribution are not appropriate for it. In this paper, we study the entropy property between the clustering results of categorical data and propose the BKPlot method to address two important problems: 1) How can we determine whether there is significant clustering structure in a categorical dataset? 2) If there is significant clustering structure, what is the set of candidate "best Ks"? We develop a hierarchical categorical clustering algorithm, ACE, to help explore the entropy property of clustering structure and to generate high-quality BKPlots. We also investigate issues in applying the ACE algorithm to generate approximate BKPlots for very large datasets and data streams. Experimental results show that the BKPlot method with the ACE algorithm can effectively identify the significant clustering structures in categorical datasets.
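To make the entropy idea in the abstract concrete, here is a minimal Python sketch of one way to score a categorical partition without a distance function. It is an illustration, not the authors' ACE algorithm or their exact BKPlot definition. It assumes that a cluster's entropy is the sum of its per-attribute Shannon entropies and that a partition is scored by the size-weighted average over clusters. The function names (`column_entropy`, `cluster_entropy`, `expected_entropy`, `second_differences`) are hypothetical, and the second difference of the entropy curve over K is used only as a simple stand-in for spotting sharp bends of the kind a BKPlot is meant to highlight.

```python
from collections import Counter
from math import log2


def column_entropy(values):
    """Shannon entropy (in bits) of one categorical attribute."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def cluster_entropy(rows):
    """Assumed cluster score: sum of the per-attribute entropies."""
    return sum(column_entropy(col) for col in zip(*rows))


def expected_entropy(rows, labels):
    """Size-weighted average of cluster entropies for one partition."""
    clusters = {}
    for row, label in zip(rows, labels):
        clusters.setdefault(label, []).append(row)
    n = len(rows)
    return sum(len(m) / n * cluster_entropy(m) for m in clusters.values())


def second_differences(curve):
    """Discrete second difference of a curve indexed by K (interior points)."""
    return [curve[i - 1] - 2 * curve[i] + curve[i + 1]
            for i in range(1, len(curve) - 1)]


# Toy data with two obvious groups; attribute values are made up.
rows = [("a", "x"), ("a", "x"), ("b", "y"), ("b", "y")]
one_cluster = expected_entropy(rows, [0, 0, 0, 0])
two_clusters = expected_entropy(rows, [0, 0, 1, 1])
```

On this toy data the score drops from the single-cluster value to zero once the two pure groups are separated. A large drop followed by a flat stretch is the kind of bend in the entropy curve that suggests a candidate K.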

 

Representative papers:

  • Keke Chen and Ling Liu: "Best K: the Critical Clustering Structures in Categorical Data", Knowledge and Information Systems, 2008.
  • Keke Chen and Ling Liu: "The ‘Best K’ for Entropy-based Categorical Clustering", Proc. of Scientific and Statistical Database Management (SSDBM05), Santa Barbara, CA, June 2005.