Tuesday, May 15, 2012

CRM and data mining Day 09

For cluster analysis, the central question is how to evaluate the “goodness” of the resulting clusters.

But “clusters are in the eye of the beholder”!

Then why do we want to evaluate them?
  • To compare clustering algorithms
  • To avoid finding patterns in noise
  • To compare two sets of clusters
  • To compare two clusters

Different Aspects of Cluster Validation
  1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data.
  2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels.
  3. Evaluating how well the results of a cluster analysis fit the data without reference to external information.
    1. Use only the data
  4. Comparing the results of two different sets of cluster analyses to determine which is better.
  5. Determining the ‘correct’ number of clusters.

For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters.

Measures of Cluster Validity

Internal Measures: Cohesion and Separation
  1. Cluster Cohesion: measures how closely related the objects in a cluster are.
    • Example: SSE (the within-cluster sum of squared errors)
  2. Cluster Separation: measures how distinct or well-separated a cluster is from other clusters.
    • Separation is measured by the between-cluster sum of squares
  3. Silhouette Coefficient combines ideas of both cohesion and separation, but for individual points, as well as clusters and clusterings.
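As a rough illustration of the first two measures, here is a minimal sketch in plain Python. It assumes each cluster is a list of numeric tuples and that the cluster centroid (mean) is the reference point; the function names `centroid`, `sq_dist`, `sse` and `bss` and the toy data are made up for this example and are not from the lecture.

```python
def centroid(points):
    """Mean of a list of equal-length numeric tuples."""
    n = len(points)
    dim = len(points[0])
    return [sum(p[d] for p in points) / n for d in range(dim)]


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def sse(clusters):
    """Cohesion: sum of squared distances of points to their own centroid."""
    total = 0.0
    for pts in clusters:
        c = centroid(pts)
        total += sum(sq_dist(p, c) for p in pts)
    return total


def bss(clusters):
    """Separation: size-weighted squared distance of each cluster
    centroid to the overall centroid of all points."""
    all_points = [p for pts in clusters for p in pts]
    overall = centroid(all_points)
    return sum(len(pts) * sq_dist(centroid(pts), overall) for pts in clusters)


# Toy 1-D data (hypothetical): two clusters, {1, 2} and {4, 5}
clusters = [[(1.0,), (2.0,)], [(4.0,), (5.0,)]]
cohesion = sse(clusters)     # lower means tighter clusters
separation = bss(clusters)   # higher means clusters lie further apart
```

With centroid-based definitions like these, the two values add up to the total sum of squares of the data, so for a fixed data set lowering SSE and raising the between-cluster sum of squares are the same goal.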

For an individual point i:

Calculate a = average distance of i to the other points in its own cluster

Calculate b = min (average distance of i to the points in another cluster), with the minimum taken over all other clusters

The silhouette coefficient for a point is then given by 
s = 1 - a/b   if a < b,
s = b/a - 1   if a ≥ b (not the usual case)

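The silhouette coefficient ranges from -1 to 1 and is typically between 0 and 1. The closer to 1, the better; a negative value means the point is on average closer to another cluster than to its own.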


The average silhouette width can be calculated for a cluster or for a whole clustering by averaging the coefficients of its points.
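The silhouette calculation above can be sketched in plain Python as follows. This is only an illustration under a few assumptions: Euclidean distance, points given as numeric tuples with a parallel list of cluster labels, at least two clusters, and a coefficient of 0 for a point that is alone in its cluster. The names `dist`, `silhouette` and `average_silhouette` and the toy data are hypothetical.

```python
import math


def dist(p, q):
    """Euclidean distance between two points (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def silhouette(i, points, labels):
    """Silhouette coefficient of point i; needs at least two clusters."""
    own = labels[i]
    same = [j for j in range(len(points)) if labels[j] == own and j != i]
    if not same:
        return 0.0  # assumption: a point alone in its cluster scores 0

    # a: average distance to the other points in the same cluster
    a = sum(dist(points[i], points[j]) for j in same) / len(same)

    # b: smallest average distance to the points of any other cluster
    b = math.inf
    for other in set(labels) - {own}:
        members = [j for j in range(len(points)) if labels[j] == other]
        avg = sum(dist(points[i], points[j]) for j in members) / len(members)
        b = min(b, avg)

    if a < b:
        return 1 - a / b
    if a > b:
        return b / a - 1
    return 0.0  # a == b (also covers a == b == 0)


def average_silhouette(points, labels):
    """Average silhouette width over the whole clustering."""
    scores = [silhouette(i, points, labels) for i in range(len(points))]
    return sum(scores) / len(scores)


# Toy 1-D data (hypothetical): clusters {1, 2} and {4, 5}
points = [(1.0,), (2.0,), (4.0,), (5.0,)]
labels = [0, 0, 1, 1]
width = average_silhouette(points, labels)
```

For the point at 1, a = 1 and b = 3.5, so s = 1 - 1/3.5; for the point at 2, a = 1 and b = 2.5, so s = 0.6. Averaging only the scores of one cluster's points would give that cluster's silhouette width instead of the whole clustering's.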
