Guide to interpreting clustering quality metrics.

---

## Internal Validation Metrics

These metrics evaluate clustering using only the data, without external labels.

### Silhouette Score

**Range:** [-1, 1]
**Higher is better**

**Formula:** For each sample i:

```
s(i) = (b(i) - a(i)) / max(a(i), b(i))
```

- a(i) = mean distance to samples in same cluster
- b(i) = mean distance to samples in nearest other cluster

**Interpretation:**
- **> 0.7**: Strong, well-separated clusters
- **0.5 - 0.7**: Reasonable structure
- **0.25 - 0.5**: Weak structure, clusters overlap
- **< 0.25**: No substantial structure
- **Negative**: Samples likely assigned to wrong cluster

**Use when:**
- Comparing different k values
- Assessing overall clustering quality
- Identifying poorly clustered samples

**Limitations:**
- Biased toward compact, spherical clusters
- Computationally expensive for large data

---

### Davies-Bouldin Index (DB Index)

**Range:** [0, ∞]
**Lower is better**

**Formula:** Average similarity between each cluster and its most similar cluster:

```
DB = (1/k) * Σ max(R_ij)
```

where R\_ij measures ratio of within-cluster to between-cluster distances.

**Interpretation:**
- **< 1.0**: Good clustering (compact, separated clusters)
- **1.0 - 2.0**: Acceptable clustering
- **> 2.0**: Poor clustering (overlapping or diffuse)

**Use when:**
- Comparing different clustering algorithms
- Selecting optimal k (minimum DB index)

**Advantages:**
- Faster to compute than silhouette
- Intuitive (measures cluster separation vs. compactness)

---

### Calinski-Harabasz Index (CH Index)

**Range:** [0, ∞]
**Higher is better**

**Formula:** Ratio of between-cluster to within-cluster variance:

```
CH = [B(k) / (k-1)] / [W(k) / (n-k)]
```

- B(k) = between-cluster variance
- W(k) = within-cluster variance
- n = samples, k = clusters

**Interpretation:**
- No universal threshold (compare across k values)
- Higher values = better defined clusters
- Peak often indicates optimal k

**Use when:**
- Fast computation needed
- Comparing different k values

**Limitations:**
- Biased toward convex clusters
- Sensitive to dataset size

---

## External Validation Metrics

These metrics compare clustering to known "true" labels (e.g., validation dataset).

### Adjusted Rand Index (ARI)

**Range:** [-1, 1] (expected value 0 for random)
**1 = perfect agreement**

**Interpretation:**
- **> 0.9**: Excellent agreement
- **0.7 - 0.9**: Good agreement
- **0.5 - 0.7**: Moderate agreement
- **0.3 - 0.5**: Weak agreement
- **< 0.3**: Poor agreement
- **~ 0**: No better than random

**Use when:**
- Validating against known labels
- Comparing different clustering solutions
- Benchmark datasets

**Advantages:**
- Corrects for chance agreement
- Symmetric (switching labels doesn't change value)

---

### Normalized Mutual Information (NMI)

**Range:** [0, 1]
**1 = perfect agreement**

**Interpretation:**
- **> 0.9**: Very high agreement
- **0.7 - 0.9**: High agreement
- **0.5 - 0.7**: Moderate agreement
- **< 0.5**: Low agreement

**Use when:**
- Comparing to true labels
- Information-theoretic interpretation desired

**Advantages:**
- Normalized (accounts for different numbers of clusters)
- Based on information theory

---

### Fowlkes-Mallows Index (FMI)

**Range:** [0, 1]
**1 = perfect agreement**

**Formula:** Geometric mean of precision and recall:

```
FMI = √[(TP / (TP + FP)) * (TP / (TP + FN))]
```

**Interpretation:**
- Similar to NMI
- **> 0.7**: Good agreement
- **< 0.5**: Poor agreement

---

## Custom Metrics

### Separation / Compactness Ratio

**Higher is better**

- **Separation**: Average distance between cluster centroids
- **Compactness**: Average within-cluster distance
- **Ratio**: Separation / Compactness

**Interpretation:**
- **> 2.0**: Well-separated clusters
- **1.0 - 2.0**: Moderate separation
- **< 1.0**: Poorly separated (overlapping)

---

## Choosing Optimal k

### Method Comparison

| Method | When to Use | Computational Cost | Reliability |
| --- | --- | --- | --- |
| **Elbow** | Quick estimate | Fast | Subjective |
| **Silhouette** | General purpose | Medium | Reliable |
| **Gap statistic** | Rigorous test | Slow | Very reliable |
| **Calinski-Harabasz** | Fast screening | Fast | Good |
| **Davies-Bouldin** | Alternative to silhouette | Fast | Good |

### Recommended Workflow

1. **Quick screening**: Elbow + Calinski-Harabasz (fast)
2. **Validation**: Silhouette score (reliable)
3. **Rigorous test**: Gap statistic (if time permits)
4. **Consensus**: Choose k where multiple metrics agree

---

## Common Scenarios

### Scenario: Metrics disagree on optimal k

**Example:** Elbow suggests k=5, Silhouette suggests k=3

**Action:**
1. Try both k values
2. Visualize clusters (PCA/UMAP)
3. Consider biological/domain knowledge
4. Check stability (bootstrap analysis)
5. Sometimes k=4 (between suggestions) is optimal

---

### Scenario: Low silhouette scores across all k

**Example:** Best silhouette = 0.3

**Possible causes:**
1. No clear cluster structure in data
2. Continuous variation rather than discrete groups
3. Wrong distance metric
4. Need dimensionality reduction first

**Actions:**
- Try different distance metrics (correlation vs. Euclidean)
- Apply PCA before clustering
- Try HDBSCAN (may be density-based structure)
- Consider data may not cluster well

---

### Scenario: One cluster dominates (>80% of samples)

**Possible causes:**
1. k too small
2. Data naturally has one main group
3. Wrong clustering method

**Actions:**
- Increase k
- Try different method (HDBSCAN to auto-detect)
- Check if data has outliers pulling centroids

---

## Quality Control Checklist

Before trusting clustering results, verify:

- [ ] **Silhouette score** > 0.5 (preferably > 0.7)
- [ ] **Davies-Bouldin** < 1.0
- [ ] **No singleton clusters** (<2% of samples)
- [ ] **No dominant cluster** (>80% of samples)
- [ ] **Stability analysis** mean > 0.8
- [ ] **Multiple metrics agree** on optimal k
- [ ] **Biological interpretability** (if applicable)
- [ ] **Visualization** shows clear separation

---

## Reporting Recommendations

When reporting clustering results, include:

1. **Primary metric**: Silhouette score
2. **Secondary metric**: Davies-Bouldin or Calinski-Harabasz
3. **Stability**: Bootstrap or cross-validation stability
4. **Method comparison**: Show multiple k values tested
5. **Visualization**: PCA/UMAP plots
6. **Cluster sizes**: Distribution of samples across clusters

**Example:**

> "K-means clustering with k=5 achieved a silhouette score of 0.68 and Davies-Bouldin index of 0.82, indicating reasonable cluster separation. Bootstrap stability analysis (100 iterations) yielded a mean stability of 0.85, suggesting robust cluster assignments."

---

## Further Reading

1. Rousseeuw, P.J. (1987). "Silhouettes: A graphical aid to the interpretation and validation of cluster analysis." *J. Comput. Appl. Math.*
2. Davies, D.L. & Bouldin, D.W. (1979). "A cluster separation measure." *IEEE TPAMI*.
3. Hubert, L. & Arabie, P. (1985). "Comparing partitions." *Journal of Classification*.