But “clusters are in the eye of the beholder”!
Then why do we want to evaluate them?
- To compare clustering algorithms
- To avoid finding patterns in noise
- To compare two sets of clusters
- To compare two clusters
Different Aspects of Cluster Validation
1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data.
2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels.
3. Evaluating how well the results of a cluster analysis fit the data without reference to external information, i.e., using only the data.
4. Comparing the results of two different sets of cluster analyses to determine which is better.
5. Determining the ‘correct’ number of clusters.
For items 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters.
Measures of Cluster Validity
Internal Measures: Cohesion and Separation
- Cluster Cohesion: measures how closely related the objects in a cluster are.
- Example: the within-cluster sum of squared errors (SSE)
- Cluster Separation: measures how distinct or well-separated a cluster is from other clusters.
- Example: the between-cluster sum of squares (BSS)
- The Silhouette Coefficient combines ideas of both cohesion and separation, but for individual points, as well as for clusters and clusterings.
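The cohesion and separation bullets above can be illustrated with a small sketch. The text names SSE and the between-cluster sum of squares but gives no formulas, so the definitions used here are assumptions: SSE as the squared Euclidean distance of each point to its cluster centroid, and the between-cluster sum of squares as the squared distance of each centroid to the overall mean, weighted by cluster size. The function names `centroid`, `sq_dist`, `sse`, and `bss` are hypothetical.

```python
from math import fsum


def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(fsum(coords) / n for coords in zip(*points))


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return fsum((x - y) ** 2 for x, y in zip(p, q))


def sse(clusters):
    """Cohesion (assumed definition): squared distance of every point to its own centroid."""
    total = 0.0
    for pts in clusters:
        c = centroid(pts)
        total += fsum(sq_dist(p, c) for p in pts)
    return total


def bss(clusters):
    """Separation (assumed definition): size-weighted squared distance of each centroid to the overall mean."""
    overall = centroid([p for pts in clusters for p in pts])
    return fsum(len(pts) * sq_dist(centroid(pts), overall) for pts in clusters)


# Toy 1-D data: points 1, 2, 4, 5 split into two clusters.
clusters = [[(1.0,), (2.0,)], [(4.0,), (5.0,)]]
print(sse(clusters), bss(clusters))  # prints: 1.0 9.0
```

Under these definitions, lower SSE means tighter clusters and higher between-cluster sum of squares means better separated clusters. For this toy data the two values add up to 10, the total sum of squares around the overall mean of 3.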
For an individual point i:
Calculate a = average distance of i to the other points in its cluster
Calculate b = min over all other clusters of (average distance of i to the points in that cluster)
The silhouette coefficient for the point is then given by
s = 1 - a/b if a < b,
s = b/a - 1 if a >= b (not the usual case)
The silhouette coefficient ranges from -1 to 1 and is typically between 0 and 1. The closer to 1, the better.
The average silhouette width can be calculated for a cluster or for an entire clustering.
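The silhouette calculation described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a reference implementation. It assumes Euclidean distance, at least two clusters, and a value of 0 for a point that is alone in its cluster or for which a equals b. The function names `silhouette` and `average_silhouette` are hypothetical.

```python
from math import dist, fsum


def silhouette(i, labels, points):
    """Silhouette coefficient of points[i]; requires at least two distinct labels."""
    own = labels[i]
    # a: average distance from i to the other points in its own cluster
    same = [dist(points[i], points[j])
            for j in range(len(points)) if labels[j] == own and j != i]
    if not same:
        return 0.0  # assumed convention for a singleton cluster
    a = fsum(same) / len(same)
    # b: smallest average distance from i to the points of any other cluster
    b = min(
        fsum(dist(points[i], points[j])
             for j in range(len(points)) if labels[j] == other) / labels.count(other)
        for other in set(labels) if other != own
    )
    if a == b:
        return 0.0  # also covers a == b == 0 (assumed convention)
    return 1 - a / b if a < b else b / a - 1


def average_silhouette(labels, points, cluster=None):
    """Average silhouette width of one cluster, or of the whole clustering if cluster is None."""
    idx = [i for i in range(len(points)) if cluster is None or labels[i] == cluster]
    return fsum(silhouette(i, labels, points) for i in idx) / len(idx)


# Toy 1-D data: points 1, 2, 4, 5 with two clusters.
points = [(1.0,), (2.0,), (4.0,), (5.0,)]
labels = [0, 0, 1, 1]
print([round(silhouette(i, labels, points), 3) for i in range(len(points))])
print(round(average_silhouette(labels, points), 3))
```

For the point at 2, a = 1 and b = 2.5, so s = 1 - 1/2.5 = 0.6. The point at 1 is farther from the other cluster and scores slightly higher.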