View companion source

Pooled CRISPR Screens

Analyse pooled CRISPR screens with scRNA-seq readout.

Overview

Problem. Perturb many genes; read each effect transcriptomically.

Use when: Perturb-seq / CROP-seq data
Avoid when: Insufficient library coverage

Learning goals

Figures

Pooled CRISPR Screen Overview
Perturb-seq Concept
Tiered Analysis Workflow
QC Thresholds by Cell Type
Calling Hits
CRISPR Screen Pitfalls

Tutorial

Analyze pooled CRISPR screens with single-cell RNA-seq readout using a tiered workflow: fast screening → target validation → rigorous differential expression.

When to Use This Skill

Use this skill when you have: - ✅ Pooled CRISPR screens with scRNA-seq (Perturb-seq, CROP-seq, CRISPRi/a) - ✅ 10X Feature Barcoding data (sgRNA captured as feature barcodes) - ✅ Multi-library experiments with biological replicates - ✅ sgRNA-to-cell mapping files (already assigned)

Don't use this skill for: - ❌ Arrayed CRISPR screens (separate wells per perturbation) → use bulk RNA-seq DE skills

  • ❌ Non-transcriptional readouts (e.g., protein, flow cytometry)
  • ❌ Data without sgRNA assignments → use CellRanger or CROP-seq pipeline first

Quick Start (Example Data)

Test this skill with a real CRISPRi Perturb-seq dataset (~10 minutes):

from scripts.load_example_data import load_example_data
data = load_example_data()  # Downloads Papalexi 2021 (~140MB, cached after first run)

adata_list = data['adata_list']  # List of AnnData objects (one per batch)
mapping_files = data['mapping_files']  # sgRNA mapping files

What you get:

  • Dataset: Papalexi & Satija 2021 ECCITE-seq CRISPRi screen (THP-1 cells)
  • Size: ~20,700 cells x 18,649 genes across 4 batches
  • Perturbations: 25 target genes (~4 guides each) targeting immune checkpoint regulators + non-targeting controls
  • Screen type: CRISPRi (expected knockdown direction: down)
  • Reference: Papalexi et al. (2021) Nature Genetics 53:322-331

For offline testing: Use load_example_data(dataset='demo') for a small synthetic dataset.

For your own data: Replace with your 10X feature-barcode matrices and sgRNA mapping files (see Inputs).

Installation

Core packages (required):

# Create conda environment
conda create -n crispr-screen python=3.8
conda activate crispr-screen

# Install packages
pip install scanpy==1.9+ anndata==0.8+ pandas numpy scipy
pip install scikit-learn  # For outlier detection
pip install diffxpy  # For differential expression

Visualization packages (required):

pip install matplotlib seaborn

Optional packages:

# For glmGamPoi (requires R ≥4.0)
# Install R, then:
# install.packages("BiocManager")
# BiocManager::install("glmGamPoi")
pip install rpy2  # Python-R interface

# For PDF report generation (optional, recommended)
pip install reportlab

Version requirements:

Software Version License Commercial Use Install
scanpy ≥1.9 BSD-3 ✅ Permitted pip install scanpy
anndata ≥0.8 BSD-3 ✅ Permitted pip install anndata
matplotlib ≥3.5 PSF ✅ Permitted pip install matplotlib
seaborn ≥0.12 BSD-3 ✅ Permitted pip install seaborn
scikit-learn ≥1.0 BSD-3 ✅ Permitted pip install scikit-learn
diffxpy ≥0.7 BSD-3 ✅ Permitted pip install diffxpy
reportlab ≥3.6 BSD ✅ Permitted pip install reportlab (optional)
rpy2 ≥3.5 GPL-2 ✅ Permitted pip install rpy2 (optional)

Inputs

Required:

  • 10X Feature-Barcode matrices (.h5 format, one per library)
  • Contains gene expression + sgRNA counts
  • From CellRanger with Feature Barcoding
  • sgRNA mapping files (TSV with cell-sgRNA assignments)
  • Format: cell_barcode\tsgRNA_id
  • One file per library
  • Should filter to single sgRNA assignments (doublets removed)

sgRNA naming conventions:

  • With gene names: GENE_sgRNA1, GENE_sgRNA2 (delimiter: _)
  • Without gene names: Requires separate gene lookup table

Alternative inputs: CROP-seq pipeline output, custom sgRNA assignment files

See references/crispr_screen_best_practices.md for data format details.

Outputs

Primary results:

  • screening_summary.txt - Per-perturbation DE statistics from t-test
  • validation_results.csv - Target gene knockdown/activation validation
  • glmgampoi/ - Batch-corrected DE results (top hits)
  • hit_list.csv - Final validated hits with DE gene counts

Processed data:

  • adata_normalized.h5ad - Normalized AnnData object for downstream analysis
  • Load with: adata = sc.read_h5ad('adata_normalized.h5ad')
  • Required for: Clustering, pseudotime, trajectory analysis
  • adata_processed.h5ad - Final processed data with all annotations

Analysis objects (pickle):

  • analysis_objects.pkl - Analysis results for downstream skills
  • Load with: import pickle; objs = pickle.load(open('analysis_objects.pkl', 'rb'))
  • Contains: perturbation_summary, de_results, outlier_cells
  • Required for: functional-enrichment-from-degs, coexpression-network

Report:

  • crispr_screen_report.pdf - Publication-quality PDF with Introduction, Methods, Results, Conclusions
  • Requires: pip install reportlab (optional — text report generated regardless)

QC plots (PNG + SVG format):

  • qc_metrics/ - Cell counts, mapping rates, doublet rates per library
  • volcano_plots/ - Per-perturbation volcano plots (top hits)
  • target_validation/ - Target gene expression heatmaps

DE results (per perturbation):

  • initial_screening/ - Fast t-test results (CSV per perturbation)
  • glmgampoi/ - Rigorous DE with batch correction (top 50-100 hits)

Clarification Questions

🚨 ALWAYS ask Question 1 FIRST. Do not ask about screen type, library details, or analysis goals before the user has answered Question 1.

1. Input Files (ASK THIS FIRST):

🚨 IF EXAMPLE DATA SELECTED: All parameters are pre-defined (CRISPRi, THP-1 immune cells, 25 target genes). DO NOT ask questions 2-6. Proceed directly to Step 1.

Questions 2-6 are ONLY for users providing their own data:

2. Screen Type: CRISPRi (knockdown), CRISPRa (activation), CRISPRko (knockout), or other?

3. Library Information: How many biological replicates? Non-targeting controls present? Approximate perturbations (10-100, 100-1000, 1000+)? sgRNAs per gene (1-2, 3-5, 5+)?

4. Cell Type: Primary neurons, iPSC-derived, immune cells, cancer cell lines, other? Expected phenotype (cell death, differentiation, activation, subtle)?

5. sgRNA Naming: Do sgRNA IDs include gene names (e.g., "GENE_sgRNA1")? If not, need gene lookup table?

6. Analysis Goals (skip for example data):

Typical Complete Workflow

This skill performs core Perturb-seq analysis: screening, validation, and hit identification. For a complete CRISPR screen workflow:

  1. This skill: Screen analysis → validated hits with transcriptional phenotypes (H5AD + hit lists)
  2. functional-enrichment-from-degs: Pathway enrichment on hit perturbations → biological interpretation
  3. coexpression-network: Network analysis from perturbed cells → regulatory relationships
  4. bulk-omics-clustering: Cluster perturbations by signature → functional groups

Quick start: "Analyze my pooled CRISPR screen and identify hits with pathway enrichment"

Why separate skills? Modular design works across screen types (CRISPRi/a/ko) and readouts (scRNA-seq, bulk). See Suggested Next Steps for details.

Standard Workflow

🚨 MANDATORY: USE SCRIPTS EXACTLY AS SHOWN - DO NOT WRITE INLINE CODE 🚨

This skill uses low-freedom script execution. Pooled CRISPR screen analysis requires careful multi-step processing with QC checkpoints. Scripts handle:

  • Multi-library loading and concatenation
  • sgRNA doublet filtering
  • Cell-type specific QC thresholds
  • Batch effect handling
  • Target validation checks

Step 1 - Load data and map sgRNAs:

For example data:

from scripts.load_example_data import load_example_data
data = load_example_data()  # Papalexi 2021 CRISPRi (~140MB, cached)
adata_list = data['adata_list']
mapping_files = data['mapping_files']

For your own data:

from scripts.load_10x_libraries import load_multiple_libraries
from scripts.map_sgrna_to_cells import map_sgrna_to_adata
from scripts.qc_filtering import apply_qc_filters
from scripts.concatenate_libraries import concatenate_libraries

# Load, map, filter, concatenate
adata_list = load_multiple_libraries(["lib1.h5", "lib2.h5"])
adata_mapped = [map_sgrna_to_adata(ad, mf, sgrna_delimiter="_")
                for ad, mf in zip(adata_list, mapping_files)]
adata_filtered = [apply_qc_filters(ad, min_genes=2000, min_counts=8000, max_mito_pct=0.15)
                  for ad in adata_mapped]
adata = concatenate_libraries(adata_filtered, batch_labels=["lib1", "lib2"])

DO NOT write inline loading/QC code. Use the functions above.

✅ VERIFICATION: You should see "✓ sgRNA mapping complete: [N] cells retained" for each library, then "✓ Libraries concatenated: [N] total cells"

Step 2 - Preprocess and normalize:

from scripts.gene_name_corrections import correct_gene_names
from scripts.expression_filtering import filter_by_expression
from scripts.normalize_and_scale import normalize_only

# Preprocessing pipeline
adata = correct_gene_names(adata, corrections={}, gene_col='gene')
adata = filter_by_expression(adata, group_key='sgRNA', min_mean_expression=0.5)
adata = normalize_only(adata, target_sum=1e6, exclude_highly_expressed=True)

DO NOT write inline preprocessing code. Use the functions above.

✅ VERIFICATION: You should see "✓ Gene name corrections applied", then "✓ Expression filtering: [N] genes retained", then "✓ Normalization complete"

Step 3 - Run tiered analysis:

from scripts.screen_all_perturbations import screen_all_perturbations, call_hits
from scripts.validate_perturbations import validate_target_effect
from scripts.differential_expression_glmgampoi import run_glmgampoi_batch

# Fast screening → preliminary hits → validation
de_results = screen_all_perturbations(adata, control_group='non-targeting',
                                       gene_col='gene', test_method='t-test',
                                       output_dir='results/initial_screening/')
preliminary_hits = call_hits(de_results, min_de_genes=10)
validation = validate_target_effect(de_results, expected_direction='down',  # 'up' for CRISPRa
                                     min_log2fc=-0.5)
validated_hits = preliminary_hits[
    preliminary_hits['perturbation'].isin(
        validation[validation['target_affected']]['perturbation']
    )
]

# Rigorous DE for top hits (optional, requires R + glmGamPoi)
top_hits = validated_hits.head(50)
glmgampoi_results = run_glmgampoi_batch(adata, perturbations=top_hits['perturbation'].tolist(),
                                         donor_col='batch', output_dir='results/glmgampoi/')

DO NOT write inline DE code. Use the functions above.

✅ VERIFICATION: You should see "✓ Screening complete: [N] perturbations tested", then "Validated hits: [N]/[M]", then (if glmGamPoi runs) "✓ glmGamPoi complete"

Step 4 - Visualize and export:

from scripts.visualize_perturbations import create_volcano_plots
from scripts.export_results import export_all_results

# Volcano plots for top hits
for gene in top_hits['perturbation'].head(10):
    create_volcano_plots(de_results[gene], perturbation=gene, output_dir='figures/')

# Export all results (CSV, H5AD, pickle, PDF report, hit lists)
export_all_results(adata=adata, perturbation_summary=validated_hits,
                   de_results=de_results, output_dir='results/')

DO NOT write custom export code. Use export_all_results().

✅ VERIFICATION: You should see "✓ Volcano plots generated: [N] perturbations", then "=== Export Complete ===" with list of files saved

❌ IF YOU DON'T SEE VERIFICATION MESSAGES: You wrote inline code instead of using the functions. Stop and use the provided functions.

⚠️ CRITICAL - DO NOT:

  • Write inline data loading codeSTOP: Use load_multiple_libraries() and map_sgrna_to_adata()
  • Write inline QC/filtering codeSTOP: Use apply_qc_filters() and filter_by_expression()
  • Write inline DE screening codeSTOP: Use screen_all_perturbations() and validate_target_effect()
  • Write custom export codeSTOP: Use export_all_results()

⚠️ IF SCRIPTS FAIL - Script Failure Hierarchy:

  1. Fix and Retry (90%) - Install missing package, re-run script
  2. Modify Script (5%) - Edit the script file itself, document changes
  3. Use as Reference (4%) - Read script, adapt approach, cite source
  4. Write from Scratch (1%) - Only if genuinely impossible, explain why

NEVER skip directly to writing inline code without trying the script first.

📁 QC Thresholds by Cell Type:

Cell Type min_genes min_counts max_mito_pct
Primary neurons 2000-3000 8000-10000 0.10-0.15
iPSC-derived 1500-2500 5000-8000 0.15-0.20
Immune cells 1000-1500 3000-5000 0.15-0.20
Macrophages 1000 3000-5000 0.18-0.20
Cancer cell lines 1500-2500 5000-8000 0.15-0.25

See references/qc_guidelines.md for detailed thresholds.

Common Issues

Issue Cause Solution
Low sgRNA mapping rate (<30%) Poor capture efficiency, wrong mapping file Check feature barcode library prep, verify sgRNA reference matches
High doublet rate (>10%) High MOI, barcode collisions Lower viral MOI in future screens, filter doublets computationally
Target gene not DE Incomplete knockdown, ineffective sgRNA, compensation Check sgRNA design, validate with protein, check for paralogs
Low validation rate (<50%) Weak perturbations, wrong expected direction Check CRISPRi vs CRISPRa, adjust log2fc threshold, check positive controls
Memory errors Large dataset (>100k cells) Process libraries separately, use backed mode (adata = sc.read_h5ad('file.h5ad', backed='r'))
Inconsistent replicates Batch effects, population drift Check batch balance, perform batch correction, analyze replicates separately first
glmGamPoi fails R not installed, missing package Install R ≥4.0 and glmGamPoi, or skip and use t-test results
SVG export failed Missing SVG backend Normal — script falls back automatically. Both PNG and SVG attempted.

See references/troubleshooting_guide.md for detailed solutions.

Suggested Next Steps

After running this skill:

  1. functional-enrichment-from-degs - Pathway enrichment on hit perturbations
  2. coexpression-network - Build gene regulatory networks from perturbed cells
  3. bulk-omics-clustering - Cluster perturbations by transcriptional signature

Related analyses: - Compare hit genes to GWAS loci (genetic-variant-annotation)

Related Skills

References

Key Papers:

  • Dixit et al. (2016) "Perturb-Seq" Cell 167:1853-1866
  • Datlinger et al. (2017) "Pooled CRISPR screening with single-cell readout" Nature Methods 14:297-301
  • Papalexi et al. (2021) "Characterizing the molecular regulation of inhibitory immune checkpoints" Nature Genetics 53:322-331 (example dataset)
  • Replogle et al. (2020) "Direct guide RNA capture" Nature Biotechnology 38:954-961
  • Peidli et al. (2024) "scPerturb: harmonized single-cell perturbation data" Nature Methods 21:531-540 (data source)

Online Resources: - 10X Feature Barcoding: https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/algorithms/crispr

  • CROP-seq pipeline: https://github.com/argschwind/CROP-seq-pipeline
  • Alector Perturb-seq analysis: https://github.com/Alector-BIO/CRISPR_PerturbSeq_pHM_publication

See references/crispr_screen_best_practices.md for comprehensive guidelines.

Code preview

scripts/concatenate_libraries.py

"""
Concatenate Multiple Libraries

Merge filtered libraries into a single AnnData object with batch labels.
"""

import scanpy as sc
import anndata as ad
from typing import List, Optional
import warnings


def concatenate_libraries(
    adata_list: List[ad.AnnData],
    batch_labels: Optional[List[str]] = None,
    batch_key: str = 'batch'
) -> ad.AnnData:
    """
    Concatenate multiple AnnData objects (libraries) into one.

    Parameters
    ----------
    adata_list : list of AnnData
        List of filtered AnnData objects to concatenate
    batch_labels : list of str, optional
        Labels for each batch (e.g., ["lib1", "lib2", "lib3"])
        If None, uses numeric labels (0, 1, 2, ...)
    batch_key : str, default='batch'
        Column name for batch labels in obs

    Returns
    -------
    adata_combined : AnnData
        Concatenated AnnData object with batch labels

    Example
    -------
    >>> adata_combined = concatenate_libraries(
    ...     [adata1_filtered, adata2_filtered, adata3_filtered, adata4_filtered],
    ...     batch_labels=["lib1", "lib2", "lib3", "lib4"]
    ... )
    """
    print(f"Concatenating {len(adata_list)} libraries...")

    # Print summary of each library
    for i, adata in enumerate(adata_list):
        label = batch_labels[i] if batch_labels else f"batch_{i}"
        print(f"  {label}: {adata.n_obs} cells x {adata.n_vars} genes")

        # Check for required columns
        if 'gene' in adata.obs.columns:
            n_genes = adata.obs['gene'].nunique()
            print(f"    Unique perturbations: {n_genes}")

    # Concatenate using modern anndata.concat (not deprecated concatenate method)
    if batch_labels:
        # Add batch labels to each AnnData
        for i, adata in enumerate(adata_list):
            adata.obs[batch_key] = batch_labels[i]

        adata_combined = ad.concat(
            adata_list,
            join='outer',
            label=None,  # We already added batch labels above
            keys=None,
            index_unique=None
        )
    else:
        # Simple concatenation without batch labels
        adata_combined = ad.concat(
            adata_list,
            join='outer'
        )

    print(f"\nCombined dataset:")
    print(f"  Total cells: {adata_combined.n_obs}")
    print(f"  Total genes: {adata_combined.n_vars}")

    if 'gene' in adata_combined.obs.columns:
        n_unique_genes = adata_combined.obs['gene'].nunique()

scripts/detect_perturbed_cells.py

"""
Detect Perturbed Cells Using Outlier Detection

For each perturbation, identify cells with significant phenotypic changes using
outlier detection methods in PCA space computed on differentially expressed genes.
"""

import numpy as np
import pandas as pd
import scanpy as sc
import anndata as ad
from sklearn.svm import OneClassSVM
from sklearn.neighbors import LocalOutlierFactor
from sklearn.ensemble import IsolationForest
import diffxpy.api as de
from typing import Dict, List, Optional, Literal


def detect_perturbed_cells(
    adata: ad.AnnData,
    control_group: str = 'non-targeting',
    gene_col: str = 'gene',
    sgrna_col: str = 'sgRNA',
    n_pcs: int = 4,
    method: Literal['LocalOutlierFactor', 'IsolationForest', 'OneClassSVM'] = 'LocalOutlierFactor',
    min_de_genes: int = 10,
    min_outliers: int = 10,
    pval_threshold: float = 0.05,
    save_plots: bool = True,
    plot_dir: str = 'figures/'
) -> Dict:
    """
    Detect perturbed cells with significant phenotypes using outlier detection.

    For each perturbation vs control:
    1. Run differential expression (t-test)
    2. Select top DE genes (p < pval_threshold)
    3. If ≥min_de_genes found, compute PCA on DE genes
    4. Train outlier detector on control cells
    5. Predict outliers in perturbed cells
    6. If ≥min_outliers detected, flag as "hit"

    Parameters
    ----------
    adata : AnnData
        Log-normalized AnnData (not scaled)
    control_group : str, default='non-targeting'
        Label for non-targeting control
    gene_col : str, default='gene'
        Column in obs with gene/perturbation labels
    sgrna_col : str, default='sgRNA'
        Column in obs with sgRNA labels
    n_pcs : int, default=4
        Number of PCs to compute on DE genes
    method : str, default='LocalOutlierFactor'
        Outlier detection method: 'LocalOutlierFactor', 'IsolationForest', 'OneClassSVM'
    min_de_genes : int, default=10
        Minimum DE genes required to proceed with outlier detection
    min_outliers : int, default=10
        Minimum outliers to call perturbation as "hit"
    pval_threshold : float, default=0.05
        P-value threshold for calling DE genes
    save_plots : bool, default=True
        Generate UMAP/PCA plots for each perturbation
    plot_dir : str, default='figures/'
        Directory to save plots

    Returns
    -------
    results : dict
        Dictionary containing:
        - 'outlier_cells': list of cell barcodes flagged as outliers
        - 'perturbation_summary': DataFrame with hit calls per perturbation
        - 'de_results': dict of DE results per perturbation
        - 'adata_dict': dict of per-perturbation AnnData objects with classifications

    Example
    -------
    >>> results = detect_perturbed_cells(
    ...     adata,

scripts/differential_expression.py

"""
Differential Expression Analysis per Perturbation

For each perturbation, perform differential expression analysis comparing
perturbed cells to non-targeting controls.
"""

import os
import pandas as pd
import anndata as ad
import diffxpy.api as de
from typing import Dict, Optional, Literal


def run_de_analysis(
    adata: ad.AnnData,
    control_group: str = 'non-targeting',
    gene_col: str = 'gene',
    test_method: Literal['t-test', 'wilcoxon', 'negative_binomial'] = 't-test',
    use_outliers_only: bool = True,
    outlier_cells: Optional[list] = None,
    top_n_genes: int = 50,
    output_dir: str = 'DEG/rank_test/',
    is_logged: bool = True
) -> Dict[str, pd.DataFrame]:
    """
    Run differential expression analysis for each perturbation vs control.

    Parameters
    ----------
    adata : AnnData
        Log-normalized AnnData (use adata.raw or normalized data)
    control_group : str, default='non-targeting'
        Label for control group
    gene_col : str, default='gene'
        Column in obs with perturbation labels
    test_method : str, default='t-test'
        Statistical test: 't-test', 'wilcoxon', 'negative_binomial'
    use_outliers_only : bool, default=True
        If True and outlier_cells provided, use only outlier cells for DE
    outlier_cells : list, optional
        List of cell barcodes identified as outliers (from detect_perturbed_cells)
    top_n_genes : int, default=50
        Number of top genes to highlight in output
    output_dir : str, default='DEG/rank_test/'
        Directory to save per-perturbation DE results
    is_logged : bool, default=True
        Whether data is log-transformed

    Returns
    -------
    de_results : dict
        Dictionary mapping gene -> DE results DataFrame

    Example
    -------
    >>> de_results = run_de_analysis(
    ...     adata,
    ...     control_group='non-targeting',
    ...     test_method='t-test',
    ...     use_outliers_only=True,
    ...     outlier_cells=results['outlier_cells'],
    ...     top_n_genes=50
    ... )
    """
    os.makedirs(output_dir, exist_ok=True)

    # Subset to outliers + controls if requested
    if use_outliers_only and outlier_cells is not None:
        print(f"Subsetting to outlier cells + controls...")
        control_cell_barcodes = adata[adata.obs[gene_col] == control_group].obs_names.tolist()
        kept_cells = list(set(outlier_cells + control_cell_barcodes))
        adata_selected = adata[adata.obs_names.isin(kept_cells)].copy()
        print(f"  Using {len(outlier_cells)} outlier cells + {len(control_cell_barcodes)} control cells")
    else:
        adata_selected = adata.copy()
        print("Using all cells for DE analysis")

    # Get unique perturbations
    genes = adata_selected.obs[gene_col].unique().tolist()

Companion files

TypePathBytes
Markdownreferences/crispr_screen_best_practices.md10,033
Markdownreferences/qc_guidelines.md19,037
Markdownreferences/statistical_methods.md16,067
Markdownreferences/troubleshooting_guide.md17,341
Markdownreferences/umi_optimization.md8,274
Pythonscripts/concatenate_libraries.py4,154
Pythonscripts/detect_perturbed_cells.py9,271
Pythonscripts/differential_expression.py6,364
Pythonscripts/differential_expression_glmgampoi.py9,804
Pythonscripts/export_results.py8,954
Pythonscripts/expression_filtering.py4,425
Pythonscripts/gene_name_corrections.py5,335
Pythonscripts/generate_report.py19,776
Pythonscripts/load_10x_libraries.py1,613
Pythonscripts/load_example_data.py15,150
Pythonscripts/map_sgrna_to_cells.py3,454
Pythonscripts/normalize_and_scale.py3,845
Pythonscripts/qc_filtering.py5,267
Rscripts/run_glmgampoi.R5,862
Pythonscripts/screen_all_perturbations.py9,968
Pythonscripts/validate_perturbations.py11,843
Pythonscripts/visualize_perturbations.py9,421
MarkdownSKILL.md16,975
JSONskill.meta.json4,359