⌂ / Overview / Bulk Transcriptomics / Omics Clustering

View companion source

Omics Clustering

Cluster samples or genes to find subtypes, with rigorous k-selection.

Overview

Problem. Find natural groupings and compare clustering methods.

Use when: Discovering subtypes or co-expression patterns

Avoid when: Concluding without stability checks

Learning goals

k is chosen systematically, not by guess
Trust clusters that survive resampling

Figures

Tutorial

Systematic workflow for clustering biological samples, features, or any quantitative data matrix. Implements multiple clustering algorithms with rigorous validation, comparison, and interpretation to identify meaningful data groupings.

When to Use This Skill

Use clustering analysis when you need to:

✅ Group biological samples by gene expression profiles (bulk RNA-seq, proteomics)
✅ Identify feature patterns (genes/proteins with similar expression across conditions)
✅ Discover subtypes in disease or treatment response groups
✅ Analyze trajectories in time-series or developmental data
✅ Quality control by detecting batch effects or outliers
✅ Compare methods by systematically evaluating multiple clustering approaches

Don't use this skill for:

❌ Single-cell RNA-seq clustering → Use scrnaseq-scanpy-core-analysis or scrnaseq-seurat-core-analysis
❌ Gene co-expression network analysis → Use coexpression-network

Key Concept: Clustering reveals natural groupings in data without prior labels. Different algorithms make different assumptions—this workflow helps you choose and validate the right approach for your data.

Language Support: This skill supports both Python and R implementations. Choose based on your preference and existing analysis pipeline. Python offers scikit-learn ecosystem integration; R offers ComplexHeatmap and rich Bioconductor tools.

Quick Start (5-Minute Example)

Test the workflow with the ALL (Acute Lymphoblastic Leukemia) dataset - 128 pediatric ALL patients with B-cell and T-cell subtypes:

R (Recommended for ALL dataset):

# 1. Load example data (ALL dataset from Chiaretti et al. 2004)
source("scripts/load_example_data.R")
data_list <- load_example_clustering_data()
data <- data_list$data
sample_names <- data_list$sample_names
feature_names <- data_list$feature_names
metadata <- data_list$metadata

# 2. Run clustering
source("scripts/hierarchical_clustering.R")
result <- hierarchical_clustering(data, n_clusters = 2)
cluster_labels <- result$cluster_labels
hclust_obj <- result$clustering_object

# 3. Visualize
source("scripts/plot_cluster_heatmap.R")
plot_cluster_heatmap(
  data,
  cluster_labels,
  output_dir = "quick_test_results"
)

# 4. Export results
# Results automatically saved by plotting functions
cat("\n✓ Quick start complete!\n")
cat(sprintf("Cell types: B-cell ALL (n=%d), T-cell ALL (n=%d)\n",
            sum(metadata$cell_type == "B"),
            sum(metadata$cell_type == "T")))

Python users: The Python implementation can use the same ALL dataset via rpy2, or use your own data. For ALL dataset examples in Python, you'll need pip install rpy2 and R installed. See full workflow below.

Expected output: Dendrogram distinguishing B-cell vs T-cell ALL, PCA plots, silhouette plots, heatmaps in quick_test_results/

Note: For your own data, follow the full workflow below with the Clarification Questions.

Installation

Language Choice

This workflow supports both Python and R. Choose one based on:

Python: Better for large-scale data (>10k samples), integration with machine learning pipelines, modern plotting (plotnine)
R: Better for heatmap visualization (ComplexHeatmap), Bioconductor integration, traditional bioinformatics workflows

You can also mix: use R for visualization (heatmaps) and Python for clustering algorithms.

Python Installation

Required Software

Software	Version	License	Commercial Use	Installation
numpy	≥1.20	BSD-3-Clause	✅ Permitted	`pip install numpy`
pandas	≥1.3	BSD-3-Clause	✅ Permitted	`pip install pandas`
scikit-learn	≥1.0	BSD-3-Clause	✅ Permitted	`pip install scikit-learn`
scipy	≥1.7	BSD-3-Clause	✅ Permitted	`pip install scipy`
hdbscan	≥0.8.28	BSD-3-Clause	✅ Permitted	`pip install hdbscan`
umap-learn	≥0.5	BSD-3-Clause	✅ Permitted	`pip install umap-learn`
plotnine	≥0.10	MIT	✅ Permitted	`pip install plotnine`
plotnine-prism	Latest	MIT	✅ Permitted	`pip install plotnine-prism`
seaborn	≥0.11	BSD-3-Clause	✅ Permitted	`pip install seaborn`
matplotlib	≥3.4	PSF-based	✅ Permitted	`pip install matplotlib`
adjustText	≥0.8	MIT	✅ Permitted	`pip install adjustText`
statsmodels	≥0.13	BSD-3-Clause	✅ Permitted	`pip install statsmodels`

Optional: gap-stat (Apache 2.0), yellowbrick (Apache 2.0) - both permitted for commercial use

Minimum Python version: Python ≥3.8

License Compliance: All software packages use BSD-3-Clause, MIT, or similar permissive licenses that allow commercial use in AI agent applications.

R Installation

Minimum R version: R ≥4.0

Required packages: cluster, factoextra, NbClust, ComplexHeatmap, pheatmap, dendextend, dbscan, mclust, ggplot2, ggprism (all GPL/MIT licensed, commercial use permitted)

Quick install:

install.packages(c('cluster', 'factoextra', 'NbClust', 'pheatmap',
                   'dendextend', 'dbscan', 'mclust', 'ggplot2', 'ggprism'))
if (!require('BiocManager')) install.packages('BiocManager')
BiocManager::install('ComplexHeatmap')

Detailed R installation guide: references/r-quick-start.md#r-installation

Inputs

Required Input

1. Data Matrix (CSV, TSV, Excel, or HDF5):

Sample clustering: Rows = samples, Columns = features (genes/proteins)
Feature clustering: Rows = features, Columns = samples
Values should be normalized and comparable (TPM, FPKM, VST, Z-scores)
Missing values: handle by imputation or removal before clustering

2. Sample/Feature Metadata (CSV or TSV, optional but recommended):

IDs matching data matrix rows/columns
Annotations for validation (tissue type, condition, known groups)
Used to interpret and validate clustering results

Data Requirements

Minimum samples/features: 10+ (20+ recommended for robust clustering)
Normalization: Required—use appropriate method for data type
Batch effects: Remove or regress out before clustering
Outliers: Identify and consider removing extreme outliers
Dimensionality: High-dimensional data (>1000 features) benefits from PCA first

Supported Data Types

Transcriptomics: Bulk RNA-seq (normalized counts, TPM, FPKM)
Proteomics: Protein abundance data
Metabolomics: Metabolite concentrations
Multi-omics: Integrated features from multiple assays
Clinical: Patient feature matrices
Any quantitative matrix: General-purpose application

Outputs

Cluster assignments:

clustering_assignments.csv - Sample/feature IDs with cluster labels
clustering_statistics.csv - Size, centroids, and characteristics per cluster

Validation metrics:

clustering_validation_metrics.json - Silhouette, Davies-Bouldin, Calinski-Harabasz scores
stability_results.csv - Bootstrap resampling stability scores (if run)

Characterization:

clustering_cluster*_features.csv - Top distinguishing features for each cluster (ANOVA/Kruskal-Wallis)
clustering_all_cluster_features.csv - Combined features for all clusters

Analysis objects (pickle/RDS):

clustering_analysis_object.pkl - Complete clustering object for downstream use
Load with: import pickle; obj = pickle.load(open('clustering_analysis_object.pkl', 'rb'))
Contains: linkage matrix (hierarchical), fitted model (k-means, GMM, HDBSCAN)
Required for: Advanced analysis, re-cutting dendrograms, model inspection
clustering_data_matrix.csv - Normalized/transformed data matrix used for clustering

Visualizations (PNG + SVG at 300 DPI):

Dendrogram (hierarchical clustering)
PCA/UMAP scatter plots with cluster colors
Silhouette plots
Cluster heatmaps
Cluster size barplots
Optimal k determination plots (if run)

Supporting data:

clustering_parameters.json - All parameters used in analysis

Clarification Questions

Before starting analysis, gather the following information:

1. Input Files (ASK THIS FIRST):

Do you have specific gene expression or omics data file(s) to cluster?
If you've uploaded a file, is it the data matrix (samples × features or features × samples) you'd like to analyze?
Expected file types: CSV, TSV, Excel (.xlsx), HDF5 (.h5), or RDS (R data objects)
Or use ALL (Acute Lymphoblastic Leukemia) example data? (Real patient data: 128 pediatric ALL samples with B-cell and T-cell subtypes, 1000 most variable genes. Dynamically loads from Bioconductor in ~30 seconds via scripts/load_example_data.R. Citation: Chiaretti et al. 2004. R recommended for this dataset.)

2. Which programming language do you prefer?

Python: Modern ecosystem with scikit-learn, better for large datasets (>10k samples), machine learning integration
R: Rich Bioconductor tools, excellent heatmap visualization (ComplexHeatmap), traditional bioinformatics workflows
Both: Use Python for clustering algorithms, R for heatmap visualization (recommended for best of both worlds)
No preference: Will use Python (default, more comprehensive scripts available)

3. What are you clustering?

Samples (rows = samples, columns = features like genes)
- Example: Group patients by gene expression profiles
Features (rows = features, columns = samples)
- Example: Group genes by expression patterns across conditions
Both (explore both clustering directions)

4. What is your data format and normalization status?

Bulk RNA-seq: TPM, FPKM, VST, rlog
Proteomics: TMT/iTRAQ normalized intensities, LFQ
Metabolomics: Normalized peak areas
Already normalized: Z-scores, scaled values
Raw data: Needs normalization (specify data type)

5. What clustering approach do you prefer?

Hierarchical (recommended for exploration): Creates a tree-like dendrogram showing relationships at all scales. You can cut the tree at any height to get different numbers of clusters after seeing the structure. Best for visualization and when you don't know k. Deterministic (same results every run). Works well for small-medium datasets (<5k samples).
K-means: Fast partitioning method that requires specifying k upfront. Best for large datasets (>5k samples) with spherical/compact clusters. Results may vary between runs, so use high n_init.
Density-based (HDBSCAN): Automatically finds k by identifying dense regions. Can detect arbitrary cluster shapes and marks outliers. No need to specify k. Best when clusters have varying densities or you want outlier detection.
Model-based (GMM): Probabilistic soft clustering where samples have membership probabilities for each cluster. Useful when clusters overlap or you need uncertainty estimates. Requires specifying k.
Compare all methods: Systematic comparison (recommended for exploratory analysis)

6. Do you know the expected number of clusters (k)?

Yes, k = [number]: Will validate this choice with quality metrics
No, exploratory: Will test multiple k values (2-15) and help determine optimal k
Approximate range: k between [min] and [max]
Not needed: Hierarchical clustering lets you choose k later; HDBSCAN finds k automatically

7. What distance metric is most appropriate?

Euclidean: Default for most continuous data
Correlation: For gene expression (focuses on pattern similarity)
Manhattan: Robust to outliers
Cosine: For high-dimensional sparse data
Unsure: Will use Euclidean (default)

8. Should we apply dimensionality reduction first?

Yes, use PCA: Recommended for >1000 features
Yes, use UMAP: For visualization and manifold learning
No: Keep original feature space (<100 features)
Both: PCA for clustering, UMAP for visualization

9. How should cluster quality be validated?

Internal metrics only: Silhouette, Davies-Bouldin, Calinski-Harabasz
Known labels available: Compare to known groups (adjusted Rand index)
Stability analysis: Bootstrap resampling to test robustness
Biological validation: Check for differential features between clusters

Standard Workflow

🚨 MANDATORY: USE SCRIPTS EXACTLY AS SHOWN - DO NOT WRITE INLINE CODE 🚨

Note: This skill requires decisions (algorithm, distance metric, k value) - adapt parameters to your data using guidance in references/decision-guide.md.

Step 1 - Load data:

For ALL example data (R recommended):

source("scripts/load_example_data.R")
data_list <- load_example_clustering_data()
data <- data_list$data
sample_names <- data_list$sample_names
feature_names <- data_list$feature_names
metadata <- data_list$metadata

For your own data (Python or R):

# Python
from scripts.prepare_data import load_and_prepare_data
data, sample_names, feature_names = load_and_prepare_data(
    "your_data.csv",
    normalize_method="zscore"
)

# R
source("scripts/prepare_data.R")
data_list <- load_and_prepare_data("your_data.csv", normalize_method = "zscore")
data <- data_list$data

DO NOT write custom data loading code. Use the provided functions.

Step 2 - Run clustering analysis:

# Choose ONE clustering method and run it:

# Option A: Hierarchical clustering (recommended for exploration)
from scripts.hierarchical_clustering import hierarchical_clustering
linkage_matrix, cluster_labels = hierarchical_clustering(
    data,
    n_clusters=2,  # For ALL dataset: B-cell vs T-cell; adjust for your data
    linkage_method="ward",
    distance_metric="euclidean"
)
clustering_object = linkage_matrix

# Option B: K-means clustering (fast, for large datasets)
from scripts.kmeans_clustering import kmeans_clustering
cluster_labels, kmeans_model = kmeans_clustering(
    data,
    n_clusters=2,  # For ALL dataset: B-cell vs T-cell; adjust for your data
    n_init=50
)
clustering_object = kmeans_model

# Option C: HDBSCAN (automatic k detection)
from scripts.density_clustering import hdbscan_clustering
cluster_labels, hdbscan_model = hdbscan_clustering(
    data,
    min_cluster_size=10
)
clustering_object = hdbscan_model

# Validate clustering quality
from scripts.cluster_validation import validate_clustering
validation_results = validate_clustering(data, cluster_labels, metrics="all")
print(f"Silhouette score: {validation_results['silhouette']:.3f}")

DO NOT write custom clustering code. Use one of the provided methods.

Step 3 - Generate visualizations:

# Generate comprehensive plots
from scripts.plot_clustering_results import plot_all_results
from scripts.dimensionality_reduction import apply_pca, apply_umap

# Optional: compute PCA/UMAP for visualization
pca_data = apply_pca(data, n_components=2)
umap_data = apply_umap(data, n_components=2)

# Create all plots
plot_all_results(
    data=data,
    cluster_labels=cluster_labels,
    sample_names=sample_names,
    feature_names=feature_names,
    pca_data=pca_data,
    umap_embedding=umap_data,
    linkage_matrix=clustering_object if 'linkage_matrix' in locals() else None,
    output_dir="clustering_results"
)

DO NOT write inline plotting code (matplotlib.pyplot.savefig, seaborn.clustermap, etc.). Use plot_all_results().

The script handles PNG + SVG export with graceful fallback for optional dependencies.

Step 4 - Export results:

# Export all results including analysis objects for downstream use
from scripts.export_results import export_all

export_all(
    data=data,
    cluster_labels=cluster_labels,
    sample_names=sample_names,
    feature_names=feature_names,
    validation_results=validation_results,
    clustering_object=clustering_object,
    output_dir="clustering_results"
)

DO NOT write custom export code. Use export_all().

✅ VERIFICATION - You should see:

After Step 1: "✓ Data loaded successfully!"
After Step 2: "✓ Clustering completed successfully!"
After Step 3: "✓ Plots saved to clustering_results"
After Step 4: "=== Export Complete ==="

❌ IF YOU DON'T SEE THESE: You wrote inline code. Stop and use the scripts as shown.

⚠️ CRITICAL - DO NOT:

❌ Write inline clustering algorithms → STOP: Use provided clustering functions
❌ Write inline plotting code (matplotlib.pyplot.savefig, seaborn.clustermap, etc.) → STOP: Use plot_all_results()
❌ Write custom export code → STOP: Use export_all()
❌ Try to install plotting dependencies manually → scripts handle fallback automatically

⚠️ IF SCRIPTS FAIL - Script Failure Hierarchy:

Fix and Retry (90%) - Install missing package, re-run script
Modify Script (5%) - Edit the script file itself, document changes
Use as Reference (4%) - Read script, adapt approach, cite source
Write from Scratch (1%) - Only if genuinely impossible, explain why

NEVER skip directly to writing inline code without trying the script first.

For detailed parameter guidance and decision-making, see:

Algorithm selection: references/decision-guide.md#algorithm
Distance metrics: references/decision-guide.md#distance
Optimal k: references/decision-guide.md#optimal-k

Common Issues

Error	Cause	Solution
`ValueError: n_clusters exceeds n_samples`	Too many clusters requested for dataset size	Reduce k or ensure data has ≥10 samples
`Ward linkage requires euclidean distance`	Incompatible linkage method and distance metric	Use euclidean distance with ward, or switch to average/complete linkage
`ModuleNotFoundError: No module named 'plotnine_prism'`	Missing optional visualization dependency	Install: `pip install plotnine-prism`
Silhouette score < 0.25	Poor cluster separation, k may be wrong	Try different k values, check data quality, use optimal k methods
HDBSCAN returns only noise (-1 labels)	min_cluster_size too large for dataset	Reduce min_cluster_size (try 5-10 for small datasets) or increase min_samples
`ValueError: Input contains NaN`	Missing values in data matrix	Impute or remove NaN values before clustering
Clusters of very unequal sizes	Algorithm bias or true biological structure	Validate with biology; try HDBSCAN for density-based approach
Different runs give different clusters (k-means)	Random initialization varies	Increase n_init (50-100) for stable results, or use hierarchical clustering

Debugging checklist:

Check data shape and missing values: data.shape, np.isnan(data).sum()
Verify normalization: Data should be comparable across features (z-scores, scaled)
Check cluster labels: Should range from 0 to k-1 (or include -1 for noise in HDBSCAN)
Validate with metrics: Silhouette >0.3 is decent, >0.5 is good
Inspect visually: PCA/UMAP plots should show clear groupings

Decision Guide

Make three critical decisions before clustering:

Decision	Quick Guide	Detailed Reference
Algorithm	Hierarchical (<5k samples, exploration), K-means (>5k, speed), HDBSCAN (unknown k, outliers), GMM (soft clustering)	decision-guide.md#algorithm
Distance	Correlation (gene expression), Euclidean (normalized data), Manhattan (outliers), Cosine (sparse)	decision-guide.md#distance
Cluster # (k)	Use multiple metrics (elbow, silhouette, gap), prioritize silhouette, consider biology	decision-guide.md#optimal-k

Comprehensive decision guidance: references/decision-guide.md

Common Patterns

Patterns:

Pattern 1: Sample Subtype Discovery - references/common-patterns.md#pattern-1
Pattern 2: Gene Co-clustering - references/common-patterns.md#pattern-2
Pattern 3: Method Comparison - references/common-patterns.md#pattern-3 Additional: QC/outlier detection, stability testing, batch validation - references/common-patterns.md

Suggested Next Steps

After completing clustering:

Differential Analysis - Use bulk-rnaseq-counts-to-de-deseq2 to identify cluster-specific signatures
Functional Enrichment - Use functional-enrichment-from-degs to interpret cluster biology
Refinement - If unclear: try different metrics, adjust k, remove outliers, use variable features subset

Related Skills

Prerequisite: bulk-rnaseq-counts-to-de-deseq2 (normalize RNA-seq data) Alternatives: scrnaseq-scanpy-core-analysis, scrnaseq-seurat-core-analysis (single-cell), coexpression-network Downstream: de-results-to-gene-lists, functional-enrichment-from-degs, de-results-to-plots

References

Reference documentation:

references/decision-guide.md - Comprehensive decision guidance (algorithm, distance, k)
references/common-patterns.md - Additional use case patterns
references/clustering_methods_comparison.md - Algorithm selection guide
references/validation_metrics_guide.md - Metric interpretation
references/distance_metrics_guide.md - Distance metric selection
references/parameter_guide.md - Parameter tuning details
references/best_practices.md - Workflow best practices

Python scripts:

scripts/prepare_data.py - Data loading and preprocessing
scripts/distance_metrics.py - Distance calculations
scripts/dimensionality_reduction.py - PCA, UMAP
scripts/hierarchical_clustering.py - Hierarchical methods
scripts/kmeans_clustering.py - K-means variants
scripts/density_clustering.py - DBSCAN, HDBSCAN
scripts/model_based_clustering.py - GMM
scripts/optimal_clusters.py - Optimal k determination
scripts/cluster_validation.py - Quality metrics
scripts/stability_analysis.py - Bootstrap validation
scripts/characterize_clusters.py - Feature importance
scripts/plot_clustering_results.py - Visualizations
scripts/export_results.py - Export functions

R scripts:

scripts/hierarchical_clustering.R - Hierarchical clustering with dendextend
scripts/plot_cluster_heatmap.R - ComplexHeatmap visualization (recommended for publication figures)
Additional R implementations available upon request

Evaluation:

eval/complete_example_analysis.py - Full workflow example

Online resources:

Python:

Key papers:

Rousseeuw (1987) - Silhouettes for cluster validation
Tibshirani et al. (2001) - Gap statistic
McInnes et al. (2017) - HDBSCAN algorithm
D'haeseleer (2005) - Clustering in genomics

Code preview

scripts/characterize_clusters.py

"""
Characterize clusters by identifying distinguishing features.

This module performs statistical tests to find features that differentiate clusters.
"""

import numpy as np
import pandas as pd
from scipy import stats
from sklearn.preprocessing import StandardScaler
from typing import Optional, List
import warnings


def characterize_clusters(
    data: np.ndarray,
    cluster_labels: np.ndarray,
    feature_names: List[str],
    method: str = "anova",
    top_n: int = 50,
    fdr_threshold: float = 0.05,
    plot_heatmap: bool = False,
    output_path: Optional[str] = None
) -> dict:
    """
    Identify features that distinguish clusters.

    Parameters
    ----------
    data : np.ndarray
        Data matrix (samples × features)
    cluster_labels : np.ndarray
        Cluster assignments
    feature_names : List[str]
        Feature names
    method : str, default="anova"
        Statistical test: "anova", "kruskal", or "ttest"
    top_n : int, default=50
        Number of top features to return per cluster
    fdr_threshold : float, default=0.05
        FDR threshold for significance
    plot_heatmap : bool, default=False
        If True, plot heatmap of top features
    output_path : str, optional
        Path to save heatmap

    Returns
    -------
    cluster_features : dict
        Dictionary mapping cluster ID to DataFrame of top features
    """

    print(f"\nCharacterizing clusters using {method}...")

    # Remove noise points
    mask = cluster_labels >= 0
    data_clean = data[mask]
    labels_clean = cluster_labels[mask]

    unique_labels = np.unique(labels_clean)
    n_clusters = len(unique_labels)

    # Test each feature
    feature_results = []

    for i, feature_name in enumerate(feature_names):
        feature_data = data_clean[:, i]

        # Perform statistical test
        if method == "anova":
            # One-way ANOVA across all clusters
            groups = [feature_data[labels_clean == label] for label in unique_labels]
            f_stat, p_value = stats.f_oneway(*groups)

        elif method == "kruskal":
            # Kruskal-Wallis (non-parametric ANOVA)
            groups = [feature_data[labels_clean == label] for label in unique_labels]
            h_stat, p_value = stats.kruskal(*groups)

        else:

scripts/cluster_validation.py

"""
Cluster validation metrics and quality assessment.

This module provides comprehensive metrics for evaluating clustering quality:
- Internal validation (silhouette, Davies-Bouldin, Calinski-Harabasz)
- External validation (if true labels known)
- Cluster separation and compactness metrics
"""

import numpy as np
from sklearn.metrics import (
    silhouette_score,
    davies_bouldin_score,
    calinski_harabasz_score,
    adjusted_rand_score,
    adjusted_mutual_info_score,
    fowlkes_mallows_score,
    normalized_mutual_info_score
)
from typing import Optional, List
import warnings


def validate_clustering(
    data: np.ndarray,
    cluster_labels: np.ndarray,
    metrics: str = "all",
    true_labels: Optional[np.ndarray] = None,
    plot_silhouette: bool = False,
    output_path: Optional[str] = None
) -> dict:
    """
    Comprehensive clustering validation using multiple metrics.

    Parameters
    ----------
    data : np.ndarray
        Data matrix (samples × features)
    cluster_labels : np.ndarray
        Cluster assignments
    metrics : str or list, default="all"
        Metrics to compute: "all", "internal", "external", or list of metric names
    true_labels : np.ndarray, optional
        True labels (if known) for external validation
    plot_silhouette : bool, default=False
        If True, create detailed silhouette plot
    output_path : str, optional
        Path to save silhouette plot

    Returns
    -------
    validation_results : dict
        Dictionary with validation metrics
    """

    print("\nValidating clustering quality...")

    results = {}

    # Check for noise points (label -1)
    has_noise = -1 in cluster_labels
    if has_noise:
        n_noise = np.sum(cluster_labels == -1)
        print(f"Note: {n_noise} noise points detected (label -1)")

        # Create clean version without noise for some metrics
        clean_mask = cluster_labels >= 0
        data_clean = data[clean_mask]
        labels_clean = cluster_labels[clean_mask]
    else:
        clean_mask = np.ones(len(cluster_labels), dtype=bool)
        data_clean = data
        labels_clean = cluster_labels

    # Check if we have enough clusters
    n_clusters = len(np.unique(labels_clean))
    if n_clusters < 2:
        print("Warning: Less than 2 clusters. Validation metrics not meaningful.")
        return results

scripts/density_clustering.py

"""
Density-based clustering (DBSCAN and HDBSCAN).

This module provides density-based clustering methods that can find
arbitrary-shaped clusters and identify noise/outliers.
"""

import numpy as np
from sklearn.cluster import DBSCAN
from typing import Tuple, Optional
import warnings

# HDBSCAN is optional but recommended
try:
    import hdbscan
    HDBSCAN_AVAILABLE = True
except ImportError:
    HDBSCAN_AVAILABLE = False
    warnings.warn("HDBSCAN not available. Install with: pip install hdbscan")


def hdbscan_clustering(
    data: np.ndarray,
    min_cluster_size: int = 10,
    min_samples: Optional[int] = None,
    metric: str = "euclidean",
    cluster_selection_method: str = "eom",
    plot_hierarchy: bool = False
) -> Tuple[np.ndarray, np.ndarray, int]:
    """
    Perform HDBSCAN (Hierarchical Density-Based Spatial Clustering).

    HDBSCAN advantages:
    - Automatically determines number of clusters
    - Finds arbitrary-shaped clusters
    - Identifies noise/outliers (label = -1)
    - More stable than DBSCAN
    - Provides cluster membership probabilities

    Parameters
    ----------
    data : np.ndarray
        Data matrix (samples × features)
    min_cluster_size : int, default=10
        Minimum number of samples in a cluster
        Smaller = more clusters; Larger = fewer, denser clusters
    min_samples : int, optional
        Number of samples in neighborhood for core point
        If None, uses min_cluster_size
        Higher = more conservative (denser clusters)
    metric : str, default="euclidean"
        Distance metric
    cluster_selection_method : str, default="eom"
        Method for selecting clusters from hierarchy:
        - "eom": Excess of Mass (default, good general choice)
        - "leaf": Selects leaf clusters (more granular)
    plot_hierarchy : bool, default=False
        If True, plot cluster hierarchy

    Returns
    -------
    cluster_labels : np.ndarray
        Cluster assignments (noise points labeled as -1)
    probabilities : np.ndarray
        Cluster membership probabilities (0-1)
    n_clusters : int
        Number of clusters found (excluding noise)
    """

    if not HDBSCAN_AVAILABLE:
        raise ImportError("HDBSCAN not installed. Install with: pip install hdbscan")

    print(f"Performing HDBSCAN clustering...")
    print(f"  min_cluster_size={min_cluster_size}, min_samples={min_samples}")

    if min_samples is None:
        min_samples = min_cluster_size

    clusterer = hdbscan.HDBSCAN(
        min_cluster_size=min_cluster_size,

Companion files

Type	Path	Bytes
Markdown	references/best_practices.md	8,616
Markdown	references/clustering_methods_comparison.md	7,576
Markdown	references/common-patterns.md	34,294
Markdown	references/decision-guide.md	25,403
Markdown	references/distance_metrics_guide.md	7,915
Markdown	references/parameter_guide.md	9,265
Markdown	references/r-quick-start.md	3,441
Markdown	references/validation_metrics_guide.md	7,105
Python	scripts/characterize_clusters.py	7,839
Python	scripts/cluster_validation.py	14,619
Python	scripts/density_clustering.py	9,758
Python	scripts/dimensionality_reduction.py	10,186
Python	scripts/distance_metrics.py	7,756
Python	scripts/export_results.py	13,000
Python	scripts/hierarchical_clustering.py	7,835
R	scripts/hierarchical_clustering.R	7,124
Python	scripts/kmeans_clustering.py	9,062
Python	scripts/load_example_data.py	6,001
R	scripts/load_example_data.R	5,976
Python	scripts/model_based_clustering.py	11,490
Python	scripts/optimal_clusters.py	12,044
R	scripts/plot_cluster_heatmap.R	4,282
Python	scripts/plot_clustering_results.py	13,629
Python	scripts/prepare_data.py	8,592
Python	scripts/stability_analysis.py	14,674
Markdown	SKILL.md	23,239
JSON	skill.meta.json	4,840