# Single-Cell RNA-seq Core Analysis (Scanpy)

Complete workflow for single-cell RNA-seq analysis using Scanpy and the scverse ecosystem. Process raw data through quality control, normalization, clustering, and cell type annotation with publication-ready visualizations.

## When to Use This Skill

- **Analyze 10X Chromium data** (CellRanger output, H5 files, raw/filtered matrices)
- **Process Drop-seq, Smart-seq2, or inDrop** single-cell RNA-seq data
- **Integrate multi-batch data** using scVI, scANVI, or Harmony
- **Annotate cell types** manually or with automated reference-based methods
- **Compare conditions** using pseudobulk differential expression (multi-sample data)

**Don't use for:** Bulk RNA-seq (use bulk-rnaseq-counts-to-de-deseq2), R-based scRNA-seq (use scrnaseq-seurat-core-analysis), Spatial transcriptomics (coming soon)

## Installation

| Package | Version | License | Commercial Use | Installation |
| --- | --- | --- | --- | --- |
| scanpy | ≥1.9 | BSD-3-Clause | Permitted | `pip install scanpy` |
| anndata | ≥0.8 | BSD-3-Clause | Permitted | `pip install anndata` |
| numpy | ≥1.20 | BSD-3-Clause | Permitted | `pip install numpy` |
| pandas | ≥1.3 | BSD-3-Clause | Permitted | `pip install pandas` |
| matplotlib | ≥3.4 | PSF | Permitted | `pip install matplotlib` |
| seaborn | ≥0.12 | BSD-3-Clause | Permitted | `pip install seaborn` |
| adjustText | ≥0.8 | MIT | Permitted | `pip install adjustText` |
| scrublet | ≥0.2.3 | MIT | Permitted | `pip install scrublet` |
| scvi-tools | ≥1.0 | BSD-3-Clause | Permitted | `pip install scvi-tools` |
| harmonypy | ≥0.0.9 | GPL-3 | Permitted | `pip install harmonypy` |
| celltypist | ≥1.0 | MIT | Permitted | `pip install celltypist` |
| pydeseq2 | ≥0.4 | MIT | Permitted | `pip install pydeseq2` |

**Install all:** `pip install scanpy anndata numpy pandas matplotlib seaborn adjustText scrublet`

**Minimum versions:** Python ≥3.8, scanpy ≥1.9, anndata ≥0.8

## Inputs

**Required:**
- **Raw or filtered count matrix:** CellRanger output (`filtered_feature_bc_matrix/`), H5 files (`.h5`), AnnData (`.h5ad`), or count matrices (CSV/TSV)

**Optional:** Sample metadata (CSV/TSV) with sample IDs, conditions, batches, donor IDs

**Data requirements:** Min 500 cells/sample (1000+ recommended), Human or Mouse, UMI-based or read counts. See [references/qc\_guidelines.md](#) for tissue-specific thresholds.

## Outputs

**Analysis objects:**
- `adata_processed.h5ad` - Complete annotated AnnData object
- **Load with:** `import scanpy as sc; adata = sc.read_h5ad('adata_processed.h5ad')`
- Contains: raw counts, normalized data, QC metrics, clusters, cell types, UMAP/PCA
- **Note:** `adata.X` contains log-normalized (not scaled) data. Scaled data is used internally for PCA but not stored in .X. This is correct for downstream analysis (DE, visualization).
- **Required for:** trajectory inference, cell-cell communication, downstream analyses

**Reports:**
- `scrna_analysis_report.pdf` - Agent-generated comprehensive PDF with Methods, Results, Figures, Conclusions
- `analysis_summary.txt` - Text summary of dataset, QC, clustering, integration (generated by `export_anndata_results()`)

**⚠️ PDF style rules:**
- **US Letter page size (8.5 × 11 in)** — always set page dimensions explicitly; do not rely on library defaults
- **No Unicode superscripts** — use `3.36e-06` or `3.36 × 10^(-6)`, not Unicode superscript chars (they render as ■ in PDF fonts)
- **No half-empty pages** — group headings with their content; only page-break before major sections (Results, Conclusions)
- **Figures ≥80% page width** — multi-panel figures must be large enough to read; never embed below 50% width

**Tables:** `cell_metadata.csv`, `expression_matrix_counts.csv`, `expression_matrix_normalized.csv`, `pca_coordinates.csv`, `umap_coordinates.csv`, `cluster_markers_all.csv`, `{celltype}_deseq2_results.csv`

**Visualizations (PNG + SVG at 300 DPI):** QC violins, UMAP plots, marker heatmaps, dot plots, volcano/MA plots

## Clarification Questions

**Default settings (use unless user specifies otherwise):**
- Format: Filtered 10X CellRanger output | Species: Human | Tissue: PBMC
- Normalization: Standard (target sum + log1p) | Clustering: Test 0.4, 0.6, 0.8, 1.0

### 1. **Input Files** (ASK THIS FIRST):

- Do you have specific single-cell data file(s) to analyze?
- Expected: CellRanger output, H5, H5AD, or count matrix
- **Or use example/demo data?** (PBMC 3k dataset available)

### 2. **Data format and species:**

- *(If using your own data)* What format? Filtered 10X (default), Raw 10X, H5, H5AD? Human or Mouse? Tissue type?
- *(If using example data)* Defaults apply: filtered 10X, human, PBMC — skip to Q3

### 3. **Batch structure:**

- a) Single sample (no integration needed)
- b) Multiple batches (scVI recommended, Harmony for speed)

### 4. **Analysis scope:**

- a) Standard: QC + normalize + cluster + annotate (recommended)
- b) Standard + pseudobulk DE (requires ≥2 samples per condition)
- c) Custom: specify which steps

### 5. **Clustering granularity:**

- a) Coarse (0.3-0.5) — major cell types only
- b) Standard (0.6-0.8) — recommended
- c) Fine (1.0-1.5) — subtypes
- d) Test multiple: 0.4, 0.6, 0.8, 1.0 (recommended)

## Standard Workflow

🚨 **MANDATORY: USE SCRIPTS EXACTLY AS SHOWN - DO NOT WRITE INLINE CODE** 🚨

**Detailed step-by-step code:** [references/workflow-details.md](#)

**CRITICAL - DO NOT:**
- Write inline analysis code → **STOP: Use the script functions**
- Write custom export code → **STOP: Use `export_anndata_results()`**
- Skip verification messages → **STOP: Check for "✓" messages after each step**

**IF SCRIPTS FAIL - Script Failure Hierarchy:**
1. **Fix and Retry (90%)** - Install missing package, re-run script
2. **Modify Script (5%)** - Edit the script file itself, document changes
3. **Use as Reference (4%)** - Read script, adapt approach, cite source
4. **Write from Scratch (1%)** - Only if genuinely impossible, explain why

**NEVER skip directly to writing inline code without trying the script first.**

---

**Step 1 — Load and QC** | [scripts/setup\_and\_import.py](#), [scripts/qc\_metrics.py](#), [scripts/filter\_cells.py](#)

```
# Load data (Option A: example, Option B: your data)
from load_example_data import load_example_data
adata = load_example_data("pbmc3k")

# QC metrics + adaptive filtering + doublet detection
from qc_metrics import calculate_qc_metrics, batch_mad_outlier_detection
from filter_cells import run_scrublet_detection, filter_by_mad_outliers
adata = calculate_qc_metrics(adata, species="human")
adata = batch_mad_outlier_detection(adata, batch_key="batch")  # creates 'outlier' column
adata = run_scrublet_detection(adata, batch_key="batch")
adata = filter_by_mad_outliers(adata, remove_doublets=True)
```

**DO NOT write inline QC code.** Doublet rate auto-scales per batch (~0.8% per 1,000 cells). Aim for >70% cell retention. For raw data, prepend ambient RNA correction: [references/ambient\_rna\_correction.md](#)

**✅ VERIFICATION:** `"✓ Data loaded successfully!"` → QC metrics added → filtering summary with retention %.

**Step 2 — Normalize, reduce, integrate** | [scripts/normalize\_data.py](#), [scripts/scale\_and\_pca.py](#), [scripts/integrate\_scvi.py](#)

```
from normalize_data import run_standard_normalization
from find_variable_genes import find_highly_variable_genes
from scale_and_pca import scale_data, run_pca_analysis
adata = run_standard_normalization(adata, target_sum=1e4)
adata = find_highly_variable_genes(adata, n_top_genes=2000)
adata = scale_data(adata, vars_to_regress=["total_counts", "pct_counts_mt"])
adata = run_pca_analysis(adata, n_pcs=50)

# Multi-batch only: integration + diagnostics
from integrate_scvi import run_scvi_integration
from integration_diagnostics import compute_lisi_scores
adata = run_scvi_integration(adata, batch_key="batch", condition_key="condition")
lisi = compute_lisi_scores(adata, batch_key="batch", use_rep="X_scVI")
```

**DO NOT write inline normalization or integration code.** The integration script auto-detects batch-condition confounding. [Details →](#)

**✅ VERIFICATION:**
- Normalization: `"✓ Normalization complete"`
- PCA: `"PCA loadings verified: N HVG rows have non-zero loadings"` — if you see a WARNING about zero loadings, re-run PCA with `use_highly_variable=True`
- After PCA, call `suggest_n_pcs(adata)` to get recommended PC count for Step 3
- Integration (multi-batch): LISI scores printed

**Step 3 — Cluster, annotate, visualize** | [scripts/cluster\_cells.py](#), [scripts/find\_markers.py](#), [scripts/annotate\_celltypes.py](#)

```
from cluster_cells import build_neighbor_graph, cluster_leiden_multiple_resolutions
from run_umap import run_umap_reduction
from find_markers import find_all_cluster_markers
from plot_dimreduction import plot_umap_clusters
use_rep = "X_scVI" if "X_scVI" in adata.obsm else "X_pca"
# Default n_pcs=30 is standard. NEVER use <15 PCs.
adata = build_neighbor_graph(adata, use_rep=use_rep, n_neighbors=10, n_pcs=30)
adata = cluster_leiden_multiple_resolutions(adata, resolutions=[0.4, 0.6, 0.8, 1.0])
adata = run_umap_reduction(adata)
markers = find_all_cluster_markers(adata, cluster_key="leiden_0.8")
plot_umap_clusters(adata, cluster_key="leiden_0.8", output_dir="results/umap")

# Annotate (manual or CellTypist)
from annotate_celltypes import annotate_clusters_manual
annotations = {"0": "CD4 T cells", "1": "CD14+ Monocytes", ...}
adata = annotate_clusters_manual(adata, annotations, cluster_key="leiden_0.8")
```

**DO NOT write inline clustering or annotation code.** CellTypist validates labels post-hoc (flags suspect ILC/HSC, contamination, low-complexity). [Markers →](#)

**⚠️ n\_pcs for neighbor graph:** Default is 30 PCs (standard). Using <15 PCs risks collapsing distinct populations. If you used `suggest_n_pcs()` in Step 2, pass that value here.

For **pseudobulk DE** (multi-sample, ≥2 replicates/condition): [scripts/pseudobulk\_de.py](#). Script blocks with N=1. [Details →](#)

**✅ VERIFICATION:**
- Cluster counts → UMAP plots saved → marker genes identified → cell type annotations added
- After `find_all_cluster_markers()`: verify the returned DataFrame columns and print `markers.head()` to confirm values are sensible before saving to CSV.

**Step 4 — Export results** | [scripts/export\_results.py](#)

```
from export_results import export_anndata_results

export_anndata_results(adata, output_dir="results", cluster_key="cell_type")
```

**DO NOT write custom export code. Use export\_anndata\_results().**

Exports: H5AD, expression matrices (raw + normalized CSV), cell metadata, UMAP/PCA coordinates, text summary.

**✅ VERIFICATION:** You MUST see:

```
=== Export Complete ===
```

**If you don't see "Export Complete":** The export did not complete. Re-run the export function.

**⚠️ Large datasets (>20k cells):** H5AD export may take 10-60 seconds depending on dataset size. The script prints progress updates (size estimate, compression method, elapsed time). If export appears stuck for >2 minutes, interrupt and retry with `save_h5ad(adata, path, compression=None)` to skip compression.

## Decision Guide

| Decision | Quick Guide | Reference |
| --- | --- | --- |
| **Ambient RNA** | Skip for filtered/PBMC. CellBender for raw/high-soup (brain, lung, tumor) | [ambient\_rna\_correction.md](#) |
| **QC Strategy** | MAD (multi-batch). Fixed (single batch, tissue-specific) | [qc\_guidelines.md](#) |
| **Normalization** | Standard (most data). Pearson (heteroscedastic) | [scanpy\_best\_practices.md](#) |
| **Integration** | scVI (complex batches). Harmony (fast, simple) | [integration\_methods.md](#) |
| **Resolution** | Test 0.4, 0.6, 0.8, 1.0. Choose by biology and stability | [scanpy\_best\_practices.md](#) |
| **Annotation** | Manual (accurate). CellTypist (fast). Both (validate) | [marker\_gene\_database.md](#) |

**Complete working examples:** [references/common-patterns.md](#)

## Common Issues

| Issue | Cause | Solution |
| --- | --- | --- |
| `ImportError: No module named 'scanpy'` | Not installed | `pip install scanpy anndata numpy pandas matplotlib seaborn` |
| Low cell retention (<70%) | Strict QC thresholds | Use MAD (nmads=5→7) or tissue-specific thresholds |
| Out of memory | Large dataset (>50k cells) | Use backed mode or subsample |
| Clusters driven by batch | Insufficient integration | Use scVI, increase n\_latent, check confounding |
| Poor UMAP separation | Wrong parameters | Check PCA elbow, use 20-40 PCs, adjust n\_neighbors |
| High MT% in all cells | Degradation or tissue-specific | Check distribution — bimodal: stricter filter; uniform: may be biological |
| `FileNotFoundError: barcodes.tsv.gz` | Wrong directory | Verify 10X output files present. Use `import_h5_data()` for .h5 |
| H5AD export hangs or is very slow | Large file write with compression | Normal for >50 MB files. Script uses fast `lzf` compression. If still slow, pass `compression=None` to `save_h5ad()`. |
| `NameError` after export interruption | Kernel restart lost variables | Re-run from Step 1 to restore `adata`. Export is idempotent — safe to re-run. |

**Expected warnings (not errors):**

| Warning | Meaning | Action |
| --- | --- | --- |
| SVG export failed | Optional SVG dependency unavailable | Normal — PNG always generated. Both created in most environments. |
| Detected doublet rate differs from expected | Scrublet threshold or pre-filtering | Inspect `adata.obs['doublet_score'].hist()`. Adjust threshold if needed. |
| Pseudobulk DE blocked: N=1 | Insufficient replicates | Need ≥2 per condition. Use cell-level Wilcoxon for exploratory only. |
| Batch-condition confounding | Condition has 1 sample | Clustering valid. Composition comparisons need caveats. |
| CellTypist labels suspect (ILC, HSC) | Automated misclassification | Cross-check markers. Relabel ILC→NK or HSC→Unassigned. |
| Multimodal data detected | RNA-only workflow on CITE-seq | Note in reports: "RNA modality only; ADT not analyzed." |

**Detailed troubleshooting:** [references/troubleshooting\_guide.md](#)

## Suggested Next Steps

1. **Functional Enrichment** — functional-enrichment-from-degs for pathway analysis of DE results
2. **Trajectory Analysis** — PAGA, Palantir, or scVelo for developmental datasets
3. **Cell-Cell Communication** — CellPhoneDB, LIANA, or NicheNet for ligand-receptor interactions

## Related Skills

**Alternative:** scrnaseq-seurat-core-analysis (R-based) | **Downstream:** functional-enrichment-from-degs, de-results-to-plots, de-results-to-gene-lists | **Complementary:** bulk-omics-clustering, experimental-design-statistics

## References

1. **Scanpy:** Wolf FA, et al. (2018) *Genome Biol*. 19:15.
2. **Best Practices:** Luecken MD, Theis FJ. (2019) *Mol Syst Biol*. 15:e8746.
3. **Pseudobulk DE:** Squair JW, et al. (2021) *Nat Commun*. 12:5692.
4. **scVI:** Lopez R, et al. (2018) *Nat Methods*. 15:1053-1058.

**Detailed guides:** [workflow-details.md](#) | [common-patterns.md](#) | [scanpy\_best\_practices.md](#) | [qc\_guidelines.md](#) | [integration\_methods.md](#) | [pseudobulk\_de\_guide.md](#) | [marker\_gene\_database.md](#) | [troubleshooting\_guide.md](#)

**Scripts:** scripts/ | **Evaluation:** assets/eval/complete\_example\_analysis.py
