# Single-Cell Trajectory Inference

## When to Use This Skill

**Use when you have preprocessed scRNA-seq data and want to:**
- ✅ Order cells along a differentiation or disease trajectory (pseudotime)
- ✅ Identify branching points and terminal cell fates
- ✅ Discover genes driving cell state transitions
- ✅ Visualize RNA velocity (direction of cell state change)
- ✅ Compute cell fate probabilities with CellRank
- ✅ Chain from `scrnaseq-scanpy-core-analysis` output

**Do NOT use when:**
- ❌ Data is not yet preprocessed (use `scrnaseq-scanpy-core-analysis` first)
- ❌ You have bulk RNA-seq (use `disease-progression-longitudinal` instead)
- ❌ Cells are terminally differentiated with no trajectory (e.g., resting PBMCs)
- ❌ Fewer than 200 cells

## Installation

```
pip install scanpy anndata scvelo cellrank numpy pandas matplotlib seaborn scipy statsmodels reportlab
```

| Package | Version | License | Commercial Use | Notes |
| --- | --- | --- | --- | --- |
| scanpy | ≥1.9 | BSD-3 | ✅ Permitted | Core trajectory (PAGA, DPT) |
| anndata | ≥0.8 | BSD-3 | ✅ Permitted | Data container |
| scvelo | ≥0.2.5 | BSD-3 | ✅ Permitted | RNA velocity (optional but recommended) |
| cellrank | ≥2.0 | BSD-3 | ✅ Permitted | Fate mapping (optional) |
| matplotlib | ≥3.4 | PSF | ✅ Permitted | Plotting |
| seaborn | ≥0.11 | BSD-3 | ✅ Permitted | Statistical plotting, heatmaps |
| scipy | ≥1.7 | BSD-3 | ✅ Permitted | Statistics |
| statsmodels | ≥0.13 | BSD-3 | ✅ Permitted | FDR correction |
| reportlab | ≥3.6 | BSD | ✅ Permitted | PDF report (optional) |

**Graceful degradation:** Core analysis (PAGA + pseudotime) requires only scanpy. scVelo and CellRank are optional — scripts detect availability and skip gracefully.

## Inputs

**Required:**
- Preprocessed AnnData (`.h5ad`) with PCA, UMAP, and cluster annotations
- Output from `scrnaseq-scanpy-core-analysis` (`adata_processed.h5ad`) works directly
- Must have ≥200 cells, ≥100 genes, cluster labels in `.obs`

**For RNA velocity (optional):**
- Spliced/unspliced count layers (`adata.layers['spliced']`, `adata.layers['unspliced']`)
- Generated by STARsolo, Cell Ranger, or velocyto

## Outputs

**Analysis objects (for downstream skills):**
- `adata_trajectory.h5ad` — AnnData with pseudotime, PAGA, diffusion map embedded
- Load with: `adata = sc.read_h5ad('adata_trajectory.h5ad')`
- Required for: downstream enrichment, regulatory network analysis
- `trajectory_results.pkl` — Full results dict (pseudotime, gene lists, model objects)
- Load with: `results = pickle.loads(Path('trajectory_results.pkl').read_bytes())`

**Primary results (CSV):**
- `pseudotime_assignments.csv` — Cell barcode, pseudotime, cell type
- `trajectory_genes.csv` — Genes correlated with pseudotime (gene, correlation, FDR, direction)
- `velocity_genes.csv` — Top RNA velocity genes (if scVelo ran)
- `fate_probabilities.csv` — Cell fate probabilities (if CellRank ran)
- `driver_genes_*.csv` — Driver genes per terminal fate (if CellRank ran)

**Visualizations (PNG + SVG at 300 DPI):**
- `paga_graph.png/.svg` — PAGA cluster connectivity + UMAP
- `pseudotime_umap.png/.svg` — UMAP colored by pseudotime
- `pseudotime_violin.png/.svg` — Pseudotime distribution per cell type
- `diffusion_components.png/.svg` — Diffusion map components
- `gene_heatmap.png/.svg` — Top trajectory genes heatmap
- `gene_trends.png/.svg` — Gene expression trends along pseudotime
- `paga_connectivity.png/.svg` — PAGA connectivity heatmap
- `velocity_stream.png/.svg` — RNA velocity stream plot (if scVelo)
- `velocity_confidence.png/.svg` — Velocity confidence (if scVelo)
- `latent_time.png/.svg` — scVelo latent time (if dynamical model)
- `velocity_top_genes.png/.svg` — Phase portraits for top velocity genes (if scVelo)
- `fate_probabilities.png/.svg` — Cell fate UMAP (if CellRank)
- `fate_heatmap.png/.svg` — Fate probability heatmap (if CellRank)
- `driver_genes.png/.svg` — Top driver genes per terminal fate (if CellRank)

**Reports:**
- `trajectory_analysis_report.pdf` — Publication-quality PDF (requires reportlab)
- `trajectory_analysis_report.md` — Markdown fallback report
- `analysis_metadata.json` — Parameters and quality metrics

## Clarification Questions

🚨 **ALWAYS ask Question 1 FIRST. Do not ask about analysis parameters before the user has answered Question 1.**

### 1. Input Files (ASK THIS FIRST)

- **Do you have a preprocessed scRNA-seq object (.h5ad)?**
  - If uploaded: Is this your processed AnnData file?
  - Expected: `.h5ad` with PCA, UMAP, and cluster annotations
- **Or use example/demo data?**
  - Pancreatic endocrinogenesis (3,696 cells, branching trajectory into alpha/beta/delta/epsilon cells)

> 🚨 **IF EXAMPLE DATA SELECTED:** All parameters are pre-defined. **Only ask Question 2.** Then proceed to Step 1.

### 2. Analysis Scope (structured — works for both demo and user data)

- a) **Core trajectory only** — PAGA + pseudotime (~3 min)
- b) **Core + RNA velocity** — adds scVelo streams (~5 min) *(recommended)*
- c) **Full analysis** — adds CellRank fate mapping (~8 min)

**Questions 3-5 are ONLY for users providing their own data:**

### 3. Root Cell Type

- Which cell type represents the starting point of the trajectory?
- *(For demo data: Ductal cells are the progenitors — pre-selected)*

### 4. Cluster Key

- Which column in `.obs` contains your cell type annotations?
- Common: `clusters`, `cell_type`, `leiden`, `louvain`

### 5. Expected Trajectory Structure

- a) Linear differentiation (one lineage)
- b) Branching (multiple fates from one progenitor)
- c) Not sure — let the data decide

## Standard Workflow

🚨 **MANDATORY: USE SCRIPTS EXACTLY AS SHOWN — DO NOT WRITE INLINE CODE** 🚨

**Step 1 — Load data:**

```
from scripts.load_example_data import load_example_data
adata = load_example_data()
```

**DO NOT write inline data loading code. Just use the script.**

**✅ VERIFICATION:** You MUST see: `"✓ Data loaded successfully!"`

**Step 2 — Run trajectory analysis:**

```
from scripts.run_trajectory_analysis import run_trajectory
results = run_trajectory(adata, root_cell_type="Ductal", cluster_key="clusters")
```

**DO NOT write inline trajectory code. Just use the script.**

**✅ VERIFICATION:** You MUST see: `"✓ Trajectory analysis completed successfully!"`

**Step 3 — Generate visualizations:**

```
from scripts.generate_all_plots import generate_all_plots
generate_all_plots(adata, results, output_dir="trajectory_results", cluster_key="clusters")
```

🚨 **DO NOT write inline plotting code. Just use the script.** 🚨

**✅ VERIFICATION:** You MUST see: `"✓ All plots generated successfully!"`

**Step 4 — Export results:**

```
from scripts.export_results import export_all
export_all(adata, results, output_dir="trajectory_results")
```

**DO NOT write custom export code. Use export\_all().**

**✅ VERIFICATION:** You MUST see: `"=== Export Complete ==="`

---

⚠️ **CRITICAL — DO NOT:**
- ❌ **Write inline data loading code** → **STOP: Use `load_example_data()` or `load_user_data()`**
- ❌ **Write inline PAGA/DPT/scVelo code** → **STOP: Use `run_trajectory()`**
- ❌ **Write inline plotting code (plt.savefig, sc.pl, etc.)** → **STOP: Use `generate_all_plots()`**
- ❌ **Write custom export code** → **STOP: Use `export_all()`**

**⚠️ IF SCRIPTS FAIL — Script Failure Hierarchy:**
1. **Fix and Retry (90%)** — Install missing package, re-run script
2. **Modify Script (5%)** — Edit the script file itself, document changes
3. **Use as Reference (4%)** — Read script, adapt approach, cite source
4. **Write from Scratch (1%)** — Only if genuinely impossible, explain why

**NEVER skip directly to writing inline code without trying the script first.**

## Common Issues

| Error | Cause | Solution |
| --- | --- | --- |
| **"Root cell type not found"** | Cluster name mismatch | Check `adata.obs['clusters'].unique()` for exact names |
| **"scvelo not installed"** | Missing optional dependency | `pip install scvelo` — or skip velocity (core trajectory still works) |
| **"cellrank not installed"** | Missing optional dependency | `pip install cellrank` — or skip fate mapping |
| **scVelo dynamical model fails** | Insufficient spliced/unspliced counts | Script auto-falls back to stochastic model |
| **SVG export failed** | Missing system library | Normal — PNG still generated. Script handles fallback. |
| **"Too few cells"** | Dataset too small | Need ≥200 cells for meaningful trajectory |
| **CellRank terminal states incorrect** | Auto-detection picked wrong states | Check PAGA graph to verify expected terminal fates |

## Suggested Next Steps

1. **Functional enrichment** → Use `functional-enrichment-from-degs` skill
   - Input: `trajectory_genes.csv` (top up/down genes along pseudotime)
   - Find pathways driving differentiation
2. **Gene regulatory networks** → Use `grn-pyscenic` skill
   - Input: `adata_trajectory.h5ad`
   - Identify transcription factors controlling cell fate decisions
3. **Upstream regulator analysis** → Use `upstream-regulator-analysis` skill
   - Input: `trajectory_genes.csv`
   - Predict upstream regulators of trajectory dynamics
4. **Differential expression at branch points** → Use `scrnaseq-scanpy-core-analysis` (marker finding)
   - Compare cells at branch points to identify fate-determining genes

## Related Skills

**Upstream (data generation):**
- `scrnaseq-scanpy-core-analysis` — Preprocessing, clustering, UMAP → feeds `.h5ad` into this skill
- `scrnaseq-seurat-core-analysis` — R alternative (convert with SeuratDisk)

**Downstream (interpretation):**
- `functional-enrichment-from-degs` — Pathway analysis of trajectory genes
- `grn-pyscenic` — Gene regulatory networks
- `upstream-regulator-analysis` — Upstream regulators of trajectory genes

**Alternative trajectory methods:**
- `disease-progression-longitudinal` — Bulk/multi-omics longitudinal trajectories (TimeAx)

## References

### Primary Citations

1. **PAGA:** Wolf FA, Hamey FK, Plass M, et al. PAGA: graph abstraction reconciles clustering with trajectory inference through a topology preserving map of single cells. *Genome Biol*. 2019;20:59.
2. **Diffusion Pseudotime:** Haghverdi L, Büttner M, Wolf FA, et al. Diffusion pseudotime robustly reconstructs lineage branching. *Nat Methods*. 2016;13:845-848.
3. **scVelo:** Bergen V, Lange M, Peidli S, et al. Generalizing RNA velocity to transient cell states through dynamical modeling. *Nat Biotechnol*. 2020;38:1408-1414.
4. **CellRank:** Lange M, Bergen V, Klein M, et al. CellRank for directed single-cell fate mapping. *Nat Methods*. 2022;19:159-170.
5. **Example dataset:** Bastidas-Ponce A, Tritschler S, Dony L, et al. Comprehensive single cell mRNA profiling reveals a detailed roadmap for pancreatic endocrinogenesis. *Development*. 2019;146:dev173849.

### Software

| Software | Version | License | Commercial Use |
| --- | --- | --- | --- |
| scanpy | ≥1.9 | BSD-3 | ✅ Permitted |
| scVelo | ≥0.2.5 | BSD-3 | ✅ Permitted |
| CellRank | ≥2.0 | BSD-3 | ✅ Permitted |
| seaborn | ≥0.11 | BSD-3 | ✅ Permitted |
| reportlab | ≥3.6 | BSD | ✅ Permitted |
