Internal Quality Measures for Subspace Clusterings

Theoretical (Analytical):

Practical (Implementation):

Literature Work:


Overview and Background

Clustering is a well-known unsupervised data mining technique for finding groups of similar data objects. Different data distributions and cluster shapes call for different algorithms (e.g., k-means for roughly spherical clusters, or DBSCAN for arbitrarily shaped clusters). Applying clustering to high-dimensional data (data objects with a large number of attributes) is hampered by the so-called curse of dimensionality. Put simply, as the number of dimensions grows, the distances between objects become increasingly similar, which undermines the similarity measures (e.g., Euclidean distance) that form the foundation of any distance-based clustering algorithm.
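The effect can be illustrated with a small experiment. The following is a minimal sketch, assuming uniformly distributed random data and the Euclidean distance; the function name relative_contrast and all parameter values are chosen for illustration only. It measures how much farther the farthest object is than the nearest one, relative to the nearest distance.

```python
import math
import random


def relative_contrast(n_points, n_dims, seed=0):
    """Return (d_max - d_min) / d_min for the Euclidean distances between
    one random query object and n_points uniformly random objects."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)


# The contrast shrinks as the dimensionality grows: nearest and farthest
# neighbours become harder to tell apart.
for n_dims in (2, 20, 200):
    print(n_dims, round(relative_contrast(500, n_dims), 3))
```

With these settings the printed contrast should drop noticeably from 2 to 200 dimensions; the exact numbers depend on the random seed.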
To overcome the curse of dimensionality, subspace clustering algorithms have been developed. These algorithms search for (a large number of) different clusters in different subspaces (i.e., subsets of the dimensions). However, the results of subspace clustering algorithms are often difficult to analyze or interpret, especially for non-experts. Furthermore, most of these algorithms depend on a large number of parameters that strongly influence the detected clusters. Therefore, there is a need to compute the quality of a subspace clustering result in order to automatically determine the best parameter setting.
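To make the notion of a subspace concrete, the sketch below restricts the Euclidean distance to a chosen subset of dimensions. The helper subspace_distance and the two example objects are hypothetical and not taken from any of the referenced algorithms.

```python
import math


def subspace_distance(a, b, dims):
    """Euclidean distance between a and b using only the dimensions in dims."""
    return math.sqrt(sum((a[d] - b[d]) ** 2 for d in dims))


# Two made-up objects: similar in dimensions 0 and 1, very different elsewhere.
a = [1.0, 2.0, 0.1, 9.5, 3.3]
b = [1.1, 2.1, 8.7, 0.4, 7.9]

full_space = range(len(a))
print("full space:     ", subspace_distance(a, b, full_space))
print("subspace {0, 1}:", subspace_distance(a, b, (0, 1)))
```

The two objects are far apart in the full space but close together in the subspace spanned by dimensions 0 and 1, which is why a full-space algorithm could miss a cluster that a subspace algorithm reports.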

Problem Statement

The goal of this project is to develop novel quality measures for subspace clusters and subspace clustering results. While quite a few external quality measures (which compare a result against a given ground truth) exist for subspace clusterings, internal measures, which are based purely on cluster characteristics (e.g., density), do not. This makes it impossible to assess subspace clustering results on real-world data where no ground-truth information exists.
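As an illustration of what an internal measure could look like, the sketch below naively transfers the idea behind the silhouette coefficient (a full-space internal measure) to a single subspace cluster. It is only a starting point under stated assumptions, not the measure to be developed in this project: the function silhouette_style_score, the toy data, and the choice to compare cluster members against all non-members within the cluster's own subspace are hypothetical.

```python
import math
from itertools import combinations


def subspace_distance(a, b, dims):
    """Euclidean distance between a and b using only the dimensions in dims."""
    return math.sqrt(sum((a[d] - b[d]) ** 2 for d in dims))


def silhouette_style_score(data, members, dims):
    """Hypothetical internal score in [-1, 1] for one subspace cluster.

    a = mean pairwise distance among the cluster members (compactness)
    b = mean distance between members and non-members (separation)
    Both are measured only in the cluster's subspace dims.
    """
    members = sorted(set(members))
    outside = [i for i in range(len(data)) if i not in members]
    if len(members) < 2 or not outside:
        raise ValueError("need at least two members and one non-member")
    intra = [subspace_distance(data[i], data[j], dims)
             for i, j in combinations(members, 2)]
    inter = [subspace_distance(data[i], data[j], dims)
             for i in members for j in outside]
    a = sum(intra) / len(intra)
    b = sum(inter) / len(inter)
    if max(a, b) == 0:
        return 0.0
    return (b - a) / max(a, b)


# Made-up data: objects 0-2 agree in dimensions 0 and 1 only.
data = [
    [1.0, 2.0, 0.1, 9.5],
    [1.1, 2.1, 8.7, 0.4],
    [0.9, 1.9, 4.2, 5.0],
    [7.0, 8.0, 0.3, 9.1],
    [6.5, 7.5, 8.5, 0.2],
]
cluster = [0, 1, 2]
print("subspace {0, 1}:", silhouette_style_score(data, cluster, (0, 1)))
print("full space:     ", silhouette_style_score(data, cluster, (0, 1, 2, 3)))
```

No ground truth is needed to compute this score, which is the defining property of an internal measure. A naive transfer like this one leaves open questions, for example how to compare scores across subspaces of different dimensionality or how to treat overlapping clusters.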

Tasks

  • Literature review and analysis of existing external quality measures.
  • Literature review and analysis of internal quality measures for full-space clustering.
  • Development of different novel quality measures.
  • Thorough evaluation of the developed measures.
  • Investigation of how visualization can help in the quality assessment.

Requirements

  • Good knowledge in data mining (especially clustering).
  • Good understanding of complex concepts.
  • Motivation to read scientific papers with mathematical formulas.
  • Good programming skills in Java.

Scope/Duration/Start

  • Scope: Bachelor/Master
  • 6-month project, 3-month thesis (Bachelor) / 6-month thesis (Master)
  • Start: immediately

Contact

References

  • Subspace Clustering for High Dimensional Data: A Review [Parsons et al., 2004]
  • Visual Quality Assessment of Subspace Clusterings [Hund et al., 2016]
  • Evaluating Clustering in Subspace Projections of High Dimensional Data [Müller et al., 2009]
  • On using class-labels in evaluation of clusterings [Färber et al., 2010]