View companion source

Upstream Regulator Analysis

Combine ChIP-Atlas binding with DE to find driver TFs.

Overview

Problem. A pile of changed genes — which TF drives them?

Use when: You already have DE results
Avoid when: Reading correlation as causation

Learning goals

Figures

Upstream Regulator Overview
Integration Workflow
Regulatory Score: 3 Axes
Activator / Repressor / Mixed
Reading Regulator Plots
Key Caveats

Tutorial

Identify transcription factors (TFs) driving observed differential expression by integrating ChIP-Atlas TF binding data (epigenomics) with RNA-seq DE results (transcriptomics). Ranks TFs by a combined regulatory score incorporating binding enrichment, target-DE overlap (Fisher's exact test), and directional concordance (activator vs repressor).

When to Use This Skill

Use when you: - Have DE results and want to identify TFs driving expression changes

  • Need to go beyond simple gene list enrichment to mechanistic TF-level evidence
  • Want to distinguish activators (targets upregulated) from repressors (targets downregulated)
  • Want to integrate epigenomics (ChIP-seq) with transcriptomics (RNA-seq) in one analysis

Don't use for: - Single-cell DE results (designed for bulk RNA-seq DE)

  • Organisms not in ChIP-Atlas (see supported genomes below)
  • Histone mark analysis (use chip-atlas-peak-enrichment directly)
  • When you only need TF binding enrichment without target gene integration

Requires: Internet access (ChIP-Atlas API + data server). Runtime: 15-25 minutes (API polling + target gene downloads).

Installation

pip install pandas numpy scipy requests matplotlib seaborn reportlab
Package Version License Commercial Use
pandas ≥1.5 BSD-3 ✅ Permitted
numpy ≥1.21 BSD-3 ✅ Permitted
scipy ≥1.9 BSD-3 ✅ Permitted
requests ≥2.28 Apache-2.0 ✅ Permitted
matplotlib ≥3.6 PSF ✅ Permitted
seaborn ≥0.12 BSD-3 ✅ Permitted
reportlab ≥3.6 BSD ✅ Permitted

Sibling skill dependencies: Requires chip-atlas-peak-enrichment and chip-atlas-target-genes directories at the same level.

Inputs

Outputs

Analysis objects:

  • analysis_object.pkl - Complete analysis for downstream use
  • Load with: import pickle; obj = pickle.load(open('analysis_object.pkl', 'rb'))
  • Contains: regulon_scores, enrichment results, target gene data, DE data, parameters

CSV results:

  • regulon_scores_all.csv - All scored TFs with regulatory score, Fisher's p-value, concordance, direction
  • regulon_scores_top.csv - Top 20 TFs
  • target_overlaps.csv - Per-TF target gene overlap with DE status (up/down)
  • enrichment_up.csv / enrichment_down.csv - ChIP-Atlas peak enrichment results

Visualizations (PNG + SVG):

  • upstream_regulators_top_regulators - Bar chart: TFs ranked by regulatory score
  • upstream_regulators_target_overlap - Stacked bar: TF targets classified as up/down/unchanged
  • upstream_regulators_evidence_scatter - Scatter: ChIP enrichment vs Fisher significance
  • upstream_regulators_heatmap - Clustermap: TFs × regulatory evidence metrics

Reports:

  • summary_report.md - Human-readable analysis summary
  • analysis_report.pdf - Publication-quality PDF with Introduction, Methods, Results, Conclusions
  • Requires: pip install reportlab (optional — markdown report generated regardless)

Clarification Questions

  1. Input Files (ASK THIS FIRST):

    • Do you have DE results (CSV/TSV) to analyze?
    • If uploaded: Is this the DE results file you'd like to find upstream regulators for?
    • Expected columns: gene symbol + log2FoldChange + adjusted p-value
    • Or use example data? Three options:
    • a) Estrogen/MCF7 dataset (recommended) — real DE results from GSE51403 (estradiol-treated MCF7 breast cancer cells, ~58K genes). Expected top regulator: ESR1
    • b) Airway dataset — real DE results from GSE52778 (dexamethasone-treated airway smooth muscle cells, ~58K genes). Expected top regulator: NR3C1
    • c) Synthetic TP53-driven data (~200 genes, fast, offline)
  2. Analysis Options:

    • (If using example data) Choose analysis parameters:
    • a) Standard analysis (top 10 TFs, q < 0.05) (recommended)
    • b) Comprehensive analysis (top 15 TFs, q < 0.1)
    • (If using your own data) What species/genome?
    • a) Human (hg38)
    • b) Human (hg19)
    • c) Mouse (mm10)
    • d) Other (specify)

Standard Workflow

🚨 MANDATORY: USE SCRIPTS EXACTLY AS SHOWN - DO NOT WRITE INLINE CODE 🚨

Step 1 - Load data:

# For example data (real estrogen/MCF7 dataset, downloads from EBI Expression Atlas):
from scripts.load_example_data import load_example_data
de_data = load_example_data(source="estrogen")

# Alternative: airway dataset (dexamethasone, real data):
de_data = load_example_data(source="airway")

# For synthetic data (offline, fast, TP53-driven):
de_data = load_example_data(source="synthetic")

# For user data:
from scripts.load_de_results import load_de_results
de_data = load_de_results("path/to/de_results.csv")

✅ VERIFICATION: "✓ Data loaded successfully: N total genes, M DE genes (X up, Y down)"

Step 2 - Run integration analysis:

from scripts.run_integration_workflow import run_integration_workflow
results = run_integration_workflow(de_data, genome="hg38", output_dir="regulator_results")

DO NOT write inline API code or custom scoring. Just call the workflow function.

⏱️ This step takes 15-25 minutes (ChIP-Atlas API polling + target gene downloads).

✅ VERIFICATION: "✓ Integration analysis completed successfully!"

Step 3 - Generate visualizations:

from scripts.generate_all_plots import generate_all_plots
generate_all_plots(results, output_dir="regulator_results")

DO NOT write inline plotting code (ggplot, ggsave, etc.). Just use the script.

✅ VERIFICATION: "✓ All visualizations generated successfully!"

Step 4 - Export results:

from scripts.export_all import export_all
export_all(results, output_dir="regulator_results")

DO NOT write custom export code. Use export_all().

✅ VERIFICATION: "=== Export Complete ==="

⚠️ CRITICAL - DO NOT:

  • Write inline API codeSTOP: Use run_integration_workflow()
  • Write inline plotting codeSTOP: Use generate_all_plots()
  • Write custom export codeSTOP: Use export_all()
  • Write custom Fisher's test codeSTOP: Built into score_regulons()

⚠️ IF SCRIPTS FAIL - Script Failure Hierarchy:

  1. Fix and Retry (90%) - Install missing package, re-run script
  2. Modify Script (5%) - Edit the script file itself, document changes
  3. Use as Reference (4%) - Read script, adapt approach, cite source
  4. Write from Scratch (1%) - Only if genuinely impossible, explain why

NEVER skip directly to writing inline code without trying the script first.

Common Issues

Error Cause Solution
ImportError: sibling skill not found Missing chip-atlas-peak-enrichment or chip-atlas-target-genes Ensure both sibling skills are installed at the same directory level
API 400 error Empty cellClass or invalid parameters Use cell_class="All cell types" (must be non-empty)
Both enrichment analyses failed Too few DE genes per direction Need ≥3 genes in at least one direction (up or down)
No TFs passed enrichment threshold Stringent cutoff or few DE genes Try min_enrichment_qvalue=0.1 or add more DE genes
Target gene download timeout Large TF file or slow connection Script retries; if persistent, reduce max_tfs
No TFs with target gene data Enriched TFs are histone marks Filter with antigen_class="TFs and others" (default)
SVG export failed Missing svglite/cairo Normal - PNG always generated; SVG is optional

Interpretation Guidelines

Regulatory Score

Combined evidence: -log10(Fisher P) × Concordance × -log10(ChIP Q)

Score Evidence
>100 Very strong — high ChIP enrichment + significant target overlap + high concordance
50-100 Strong
20-50 Moderate
<20 Weak — interpret with caution

Direction Classification

Key Caveats

Suggested Next Steps

After identifying upstream regulators: - Validate binding: Use chip-atlas-target-genes to examine cell-type-specific binding patterns for top TFs - Functional enrichment: Use functional-enrichment-from-degs on TF-target gene subsets - Co-expression: Use gene-correlation-archs4 to check if TF and targets co-express - Network inference: Use grn-pyscenic for single-cell GRN validation - Literature review: Use literature-review to validate TF-disease associations

Related Skills

References

Code preview

scripts/__init__.py

# upstream-regulator-analysis scripts package

scripts/export_all.py

"""
Export all upstream regulator analysis results (Step 4).

Exports:
1. analysis_object.pkl - Complete results for downstream use
2. regulon_scores_all.csv - All scored TFs
3. regulon_scores_top.csv - Top TFs by regulatory score
4. target_overlaps.csv - Per-TF target gene overlap details
5. enrichment_up.csv - Peak enrichment results for upregulated genes
6. enrichment_down.csv - Peak enrichment results for downregulated genes
7. summary_report.md - Human-readable summary
8. analysis_report.pdf - Publication-quality PDF report
"""

import os
import pickle
from datetime import datetime

import pandas as pd


def export_all(results, output_dir="regulator_results"):
    """
    Export all upstream regulator results with pickle object.

    Parameters
    ----------
    results : dict
        Results from run_integration_workflow().
    output_dir : str
        Output directory.

    Verification
    ------------
    Prints "=== Export Complete ===" when done.
    """
    os.makedirs(output_dir, exist_ok=True)

    print("\n--- Exporting results ---")

    regulon_scores = results["regulon_scores"]
    parameters = results["parameters"]
    metadata = results["metadata"]

    # 1. Analysis object (pickle)
    pkl_path = os.path.join(output_dir, "analysis_object.pkl")
    analysis_object = {
        "regulon_scores": regulon_scores,
        "enrichment_results_up": results.get("enrichment_results_up"),
        "enrichment_results_down": results.get("enrichment_results_down"),
        "top_tfs": results.get("top_tfs"),
        "target_gene_data": results.get("target_gene_data"),
        "de_data": results["de_data"],
        "parameters": parameters,
        "metadata": metadata,
        "timestamp": datetime.now().isoformat(),
    }
    with open(pkl_path, "wb") as f:
        pickle.dump(analysis_object, f)
    print(f"   1. {pkl_path}")
    print(f"      (Load with: import pickle; obj = pickle.load(open('{pkl_path}', 'rb')))")

    # 2. All regulon scores
    if len(regulon_scores) > 0:
        csv_all = os.path.join(output_dir, "regulon_scores_all.csv")
        regulon_scores.to_csv(csv_all, index=False)
        print(f"   2. {csv_all} ({len(regulon_scores)} TFs)")

        # 3. Top regulon scores
        top_n = min(20, len(regulon_scores))
        csv_top = os.path.join(output_dir, "regulon_scores_top.csv")
        regulon_scores.head(top_n).to_csv(csv_top, index=False)
        print(f"   3. {csv_top} (top {top_n})")
    else:
        print("   2-3. No regulon scores to export")

    # 4. Target overlaps detail
    _export_target_overlaps(results, output_dir)

    # 5-6. Enrichment results

scripts/generate_all_plots.py

"""
Visualization for upstream regulator analysis results.

Generates 4-panel publication-quality figure:
1. Top Regulators bar chart (regulatory score, colored by direction)
2. Target-DE Overlap stacked bars (up/down/unchanged per TF)
3. Evidence Scatter (ChIP enrichment vs Fisher p, sized by concordance)
4. Regulatory Heatmap (seaborn clustermap, TFs x metrics)
"""

import os

import numpy as np
import pandas as pd

import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import seaborn as sns

# --- Theme setup per CLAUDE.md standard ---
sns.set_style("ticks")
plt.rcParams["font.family"] = "sans-serif"
plt.rcParams["font.sans-serif"] = ["Helvetica"]


# Direction color palette
_DIRECTION_COLORS = {
    "activator": "#E74C3C",
    "repressor": "#2E86C1",
    "mixed": "#95A5A6",
}


def _save_plot(fig, base_path, dpi=300):
    """Save a matplotlib figure to PNG + SVG with graceful fallback."""
    png_path = base_path + ".png"
    svg_path = base_path + ".svg"

    fig.savefig(png_path, dpi=dpi, bbox_inches="tight", facecolor="white")
    print(f"   Saved: {png_path}")
    try:
        fig.savefig(svg_path, format="svg", bbox_inches="tight", facecolor="white")
        print(f"   Saved: {svg_path}")
    except Exception:
        print("   (SVG export failed, PNG available)")
    plt.close(fig)


def _plot_top_regulators(regulon_scores, top_n=15):
    """Panel 1: Top regulators ranked by regulatory score, colored by direction."""
    df = regulon_scores.head(top_n).copy()
    if len(df) == 0:
        return None

    # Reverse for top-at-top horizontal bar
    df = df.iloc[::-1].reset_index(drop=True)

    fig, ax = plt.subplots(figsize=(10, max(6, top_n * 0.4)))
    colors = [_DIRECTION_COLORS.get(d, "#95A5A6") for d in df["direction"]]
    bars = ax.barh(df["tf"], df["regulatory_score"], color=colors, height=0.7)

    # Add concordance labels
    for i, (_, row) in enumerate(df.iterrows()):
        ax.text(
            row["regulatory_score"] + df["regulatory_score"].max() * 0.02,
            i, f"{row['concordance']:.0%}",
            va="center", fontsize=8,
        )

    ax.set_xlabel("Regulatory Score", fontsize=11)
    ax.set_title("Top Upstream Regulators", fontsize=14, fontweight="bold")
    sns.despine(ax=ax)

    # Legend
    from matplotlib.patches import Patch
    legend_elements = [
        Patch(facecolor=_DIRECTION_COLORS["activator"], label="Activator"),
        Patch(facecolor=_DIRECTION_COLORS["repressor"], label="Repressor"),
        Patch(facecolor=_DIRECTION_COLORS["mixed"], label="Mixed"),

Companion files

TypePathBytes
Markdownreferences/integration_methods.md5,505
Pythonscripts/__init__.py46
Pythonscripts/export_all.py10,981
Pythonscripts/generate_all_plots.py8,489
Pythonscripts/generate_report.py14,678
Pythonscripts/load_de_results.py5,293
Pythonscripts/load_example_data.py11,678
Pythonscripts/run_integration_workflow.py12,704
Pythonscripts/score_regulons.py5,459
MarkdownSKILL.md10,214
JSONskill.meta.json2,196