⌂ / Overview / Epigenomics & Gene Regulation / Upstream Regulator Analysis

View companion source

Upstream Regulator Analysis

Combine ChIP-Atlas binding with DE to find driver TFs.

Overview

Problem. A pile of changed genes — which TF drives them?

Use when: You already have DE results

Avoid when: Reading correlation as causation

Learning goals

Binding + expression = regulatory evidence
Direction tells activator from repressor

Figures

Tutorial

Identify transcription factors (TFs) driving observed differential expression by integrating ChIP-Atlas TF binding data (epigenomics) with RNA-seq DE results (transcriptomics). Ranks TFs by a combined regulatory score incorporating binding enrichment, target-DE overlap (Fisher's exact test), and directional concordance (activator vs repressor).

When to Use This Skill

Use when you:

Have DE results and want to identify TFs driving expression changes
Need to go beyond simple gene list enrichment to mechanistic TF-level evidence
Want to distinguish activators (targets upregulated) from repressors (targets downregulated)
Want to integrate epigenomics (ChIP-seq) with transcriptomics (RNA-seq) in one analysis

Don't use for:

Single-cell DE results (designed for bulk RNA-seq DE)
Organisms not in ChIP-Atlas (see supported genomes below)
Histone mark analysis (use chip-atlas-peak-enrichment directly)
When you only need TF binding enrichment without target gene integration

Requires: Internet access (ChIP-Atlas API + data server). Runtime: 15-25 minutes (API polling + target gene downloads).

Installation

pip install pandas numpy scipy requests matplotlib seaborn reportlab

Package	Version	License	Commercial Use
pandas	≥1.5	BSD-3	✅ Permitted
numpy	≥1.21	BSD-3	✅ Permitted
scipy	≥1.9	BSD-3	✅ Permitted
requests	≥2.28	Apache-2.0	✅ Permitted
matplotlib	≥3.6	PSF	✅ Permitted
seaborn	≥0.12	BSD-3	✅ Permitted
reportlab	≥3.6	BSD	✅ Permitted

Sibling skill dependencies: Requires chip-atlas-peak-enrichment and chip-atlas-target-genes directories at the same level.

Inputs

DE results CSV/TSV with columns: gene symbol, log2 fold change, adjusted p-value
Supports DESeq2 (log2FoldChange, padj), edgeR (logFC, FDR), limma (logFC, adj.P.Val)
Column names auto-detected; override with parameters if needed
Genome: hg38, hg19, mm10, mm9, rn6, dm6, dm3, ce11, ce10, sacCer3

Outputs

Analysis objects:

analysis_object.pkl - Complete analysis for downstream use
Load with: import pickle; obj = pickle.load(open('analysis_object.pkl', 'rb'))
Contains: regulon_scores, enrichment results, target gene data, DE data, parameters

CSV results:

regulon_scores_all.csv - All scored TFs with regulatory score, Fisher's p-value, concordance, direction
regulon_scores_top.csv - Top 20 TFs
target_overlaps.csv - Per-TF target gene overlap with DE status (up/down)
enrichment_up.csv / enrichment_down.csv - ChIP-Atlas peak enrichment results

Visualizations (PNG + SVG):

upstream_regulators_top_regulators - Bar chart: TFs ranked by regulatory score
upstream_regulators_target_overlap - Stacked bar: TF targets classified as up/down/unchanged
upstream_regulators_evidence_scatter - Scatter: ChIP enrichment vs Fisher significance
upstream_regulators_heatmap - Clustermap: TFs × regulatory evidence metrics

Reports:

summary_report.md - Human-readable analysis summary
analysis_report.pdf - Publication-quality PDF with Introduction, Methods, Results, Conclusions
Requires: pip install reportlab (optional — markdown report generated regardless)

Clarification Questions

Input Files (ASK THIS FIRST):
- Do you have DE results (CSV/TSV) to analyze?
- If uploaded: Is this the DE results file you'd like to find upstream regulators for?
- Expected columns: gene symbol + log2FoldChange + adjusted p-value
- Or use example data? Three options:
- a) Estrogen/MCF7 dataset (recommended) — real DE results from GSE51403 (estradiol-treated MCF7 breast cancer cells, ~58K genes). Expected top regulator: ESR1
- b) Airway dataset — real DE results from GSE52778 (dexamethasone-treated airway smooth muscle cells, ~58K genes). Expected top regulator: NR3C1
- c) Synthetic TP53-driven data (~200 genes, fast, offline)
Analysis Options:
- (If using example data) Choose analysis parameters:
- a) Standard analysis (top 10 TFs, q < 0.05) (recommended)
- b) Comprehensive analysis (top 15 TFs, q < 0.1)
- (If using your own data) What species/genome?
- a) Human (hg38)
- b) Human (hg19)
- c) Mouse (mm10)
- d) Other (specify)

Standard Workflow

🚨 MANDATORY: USE SCRIPTS EXACTLY AS SHOWN - DO NOT WRITE INLINE CODE 🚨

Step 1 - Load data:

# For example data (real estrogen/MCF7 dataset, downloads from EBI Expression Atlas):
from scripts.load_example_data import load_example_data
de_data = load_example_data(source="estrogen")

# Alternative: airway dataset (dexamethasone, real data):
de_data = load_example_data(source="airway")

# For synthetic data (offline, fast, TP53-driven):
de_data = load_example_data(source="synthetic")

# For user data:
from scripts.load_de_results import load_de_results
de_data = load_de_results("path/to/de_results.csv")

✅ VERIFICATION: "✓ Data loaded successfully: N total genes, M DE genes (X up, Y down)"

Step 2 - Run integration analysis:

from scripts.run_integration_workflow import run_integration_workflow
results = run_integration_workflow(de_data, genome="hg38", output_dir="regulator_results")

DO NOT write inline API code or custom scoring. Just call the workflow function.

⏱️ This step takes 15-25 minutes (ChIP-Atlas API polling + target gene downloads).

✅ VERIFICATION: "✓ Integration analysis completed successfully!"

Step 3 - Generate visualizations:

from scripts.generate_all_plots import generate_all_plots
generate_all_plots(results, output_dir="regulator_results")

DO NOT write inline plotting code (ggplot, ggsave, etc.). Just use the script.

✅ VERIFICATION: "✓ All visualizations generated successfully!"

Step 4 - Export results:

from scripts.export_all import export_all
export_all(results, output_dir="regulator_results")

DO NOT write custom export code. Use export_all().

✅ VERIFICATION: "=== Export Complete ==="

⚠️ CRITICAL - DO NOT:

❌ Write inline API code → STOP: Use run_integration_workflow()
❌ Write inline plotting code → STOP: Use generate_all_plots()
❌ Write custom export code → STOP: Use export_all()
❌ Write custom Fisher's test code → STOP: Built into score_regulons()

⚠️ IF SCRIPTS FAIL - Script Failure Hierarchy:

Fix and Retry (90%) - Install missing package, re-run script
Modify Script (5%) - Edit the script file itself, document changes
Use as Reference (4%) - Read script, adapt approach, cite source
Write from Scratch (1%) - Only if genuinely impossible, explain why

NEVER skip directly to writing inline code without trying the script first.

Common Issues

Error	Cause	Solution
ImportError: sibling skill not found	Missing chip-atlas-peak-enrichment or chip-atlas-target-genes	Ensure both sibling skills are installed at the same directory level
API 400 error	Empty cellClass or invalid parameters	Use `cell_class="All cell types"` (must be non-empty)
Both enrichment analyses failed	Too few DE genes per direction	Need ≥3 genes in at least one direction (up or down)
No TFs passed enrichment threshold	Stringent cutoff or few DE genes	Try `min_enrichment_qvalue=0.1` or add more DE genes
Target gene download timeout	Large TF file or slow connection	Script retries; if persistent, reduce `max_tfs`
No TFs with target gene data	Enriched TFs are histone marks	Filter with `antigen_class="TFs and others"` (default)
SVG export failed	Missing svglite/cairo	Normal - PNG always generated; SVG is optional

Interpretation Guidelines

Regulatory Score

Combined evidence: -log10(Fisher P) × Concordance × -log10(ChIP Q)

Score	Evidence
>100	Very strong — high ChIP enrichment + significant target overlap + high concordance
50-100	Strong
20-50	Moderate
<20	Weak — interpret with caution

Direction Classification

Activator (concordance >60%, majority up): TF likely activates these genes
Repressor (concordance >60%, majority down): TF likely represses these genes
Mixed (concordance ≤60%): No clear directional bias — context-dependent regulation

Key Caveats

Results biased toward well-studied TFs/cell types in ChIP-Atlas
Binding enrichment ≠ regulatory causation (validate with perturbation)
Directional labels assume simple activation/repression (ignores context-dependent regulation)
Combined score is a heuristic ranking, not a formal multi-test correction
Fisher's test assumes independence (may be violated if targets cluster in pathways)

Suggested Next Steps

After identifying upstream regulators:

Validate binding: Use chip-atlas-target-genes to examine cell-type-specific binding patterns for top TFs
Functional enrichment: Use functional-enrichment-from-degs on TF-target gene subsets
Co-expression: Use gene-correlation-archs4 to check if TF and targets co-express
Network inference: Use grn-pyscenic for single-cell GRN validation
Literature review: Use literature-review to validate TF-disease associations

Related Skills

chip-atlas-peak-enrichment - Component: TF binding enrichment analysis
chip-atlas-target-genes - Component: TF target gene retrieval
bulk-rnaseq-counts-to-de-deseq2 - Upstream: generates DE results input
de-results-to-gene-lists - Upstream: generates filtered gene lists
functional-enrichment-from-degs - Complementary: pathway-level enrichment

References

Zou Z, et al. (2024) ChIP-Atlas 3.0: a gene regulation data-mining platform. Nucleic Acids Res. 52(W1):W159-W166
Oki S, et al. (2018) ChIP-Atlas: a data-mining suite. EMBO Rep. 19(12):e46255
Fisher RA (1922) On the interpretation of chi-squared. J R Stat Soc. 85(1):87-94

Code preview

scripts/init.py

# upstream-regulator-analysis scripts package

scripts/export_all.py

"""
Export all upstream regulator analysis results (Step 4).

Exports:
1. analysis_object.pkl - Complete results for downstream use
2. regulon_scores_all.csv - All scored TFs
3. regulon_scores_top.csv - Top TFs by regulatory score
4. target_overlaps.csv - Per-TF target gene overlap details
5. enrichment_up.csv - Peak enrichment results for upregulated genes
6. enrichment_down.csv - Peak enrichment results for downregulated genes
7. summary_report.md - Human-readable summary
8. analysis_report.pdf - Publication-quality PDF report
"""

import os
import pickle
from datetime import datetime

import pandas as pd


def export_all(results, output_dir="regulator_results"):
    """
    Export all upstream regulator results with pickle object.

    Parameters
    ----------
    results : dict
        Results from run_integration_workflow().
    output_dir : str
        Output directory.

    Verification
    ------------
    Prints "=== Export Complete ===" when done.
    """
    os.makedirs(output_dir, exist_ok=True)

    print("\n--- Exporting results ---")

    regulon_scores = results["regulon_scores"]
    parameters = results["parameters"]
    metadata = results["metadata"]

    # 1. Analysis object (pickle)
    pkl_path = os.path.join(output_dir, "analysis_object.pkl")
    analysis_object = {
        "regulon_scores": regulon_scores,
        "enrichment_results_up": results.get("enrichment_results_up"),
        "enrichment_results_down": results.get("enrichment_results_down"),
        "top_tfs": results.get("top_tfs"),
        "target_gene_data": results.get("target_gene_data"),
        "de_data": results["de_data"],
        "parameters": parameters,
        "metadata": metadata,
        "timestamp": datetime.now().isoformat(),
    }
    with open(pkl_path, "wb") as f:
        pickle.dump(analysis_object, f)
    print(f"   1. {pkl_path}")
    print(f"      (Load with: import pickle; obj = pickle.load(open('{pkl_path}', 'rb')))")

    # 2. All regulon scores
    if len(regulon_scores) > 0:
        csv_all = os.path.join(output_dir, "regulon_scores_all.csv")
        regulon_scores.to_csv(csv_all, index=False)
        print(f"   2. {csv_all} ({len(regulon_scores)} TFs)")

        # 3. Top regulon scores
        top_n = min(20, len(regulon_scores))
        csv_top = os.path.join(output_dir, "regulon_scores_top.csv")
        regulon_scores.head(top_n).to_csv(csv_top, index=False)
        print(f"   3. {csv_top} (top {top_n})")
    else:
        print("   2-3. No regulon scores to export")

    # 4. Target overlaps detail
    _export_target_overlaps(results, output_dir)

    # 5-6. Enrichment results

scripts/generate_all_plots.py

"""
Visualization for upstream regulator analysis results.

Generates 4-panel publication-quality figure:
1. Top Regulators bar chart (regulatory score, colored by direction)
2. Target-DE Overlap stacked bars (up/down/unchanged per TF)
3. Evidence Scatter (ChIP enrichment vs Fisher p, sized by concordance)
4. Regulatory Heatmap (seaborn clustermap, TFs x metrics)
"""

import os

import numpy as np
import pandas as pd

import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import seaborn as sns

# --- Theme setup per CLAUDE.md standard ---
sns.set_style("ticks")
plt.rcParams["font.family"] = "sans-serif"
plt.rcParams["font.sans-serif"] = ["Helvetica"]


# Direction color palette
_DIRECTION_COLORS = {
    "activator": "#E74C3C",
    "repressor": "#2E86C1",
    "mixed": "#95A5A6",
}


def _save_plot(fig, base_path, dpi=300):
    """Save a matplotlib figure to PNG + SVG with graceful fallback."""
    png_path = base_path + ".png"
    svg_path = base_path + ".svg"

    fig.savefig(png_path, dpi=dpi, bbox_inches="tight", facecolor="white")
    print(f"   Saved: {png_path}")
    try:
        fig.savefig(svg_path, format="svg", bbox_inches="tight", facecolor="white")
        print(f"   Saved: {svg_path}")
    except Exception:
        print("   (SVG export failed, PNG available)")
    plt.close(fig)


def _plot_top_regulators(regulon_scores, top_n=15):
    """Panel 1: Top regulators ranked by regulatory score, colored by direction."""
    df = regulon_scores.head(top_n).copy()
    if len(df) == 0:
        return None

    # Reverse for top-at-top horizontal bar
    df = df.iloc[::-1].reset_index(drop=True)

    fig, ax = plt.subplots(figsize=(10, max(6, top_n * 0.4)))
    colors = [_DIRECTION_COLORS.get(d, "#95A5A6") for d in df["direction"]]
    bars = ax.barh(df["tf"], df["regulatory_score"], color=colors, height=0.7)

    # Add concordance labels
    for i, (_, row) in enumerate(df.iterrows()):
        ax.text(
            row["regulatory_score"] + df["regulatory_score"].max() * 0.02,
            i, f"{row['concordance']:.0%}",
            va="center", fontsize=8,
        )

    ax.set_xlabel("Regulatory Score", fontsize=11)
    ax.set_title("Top Upstream Regulators", fontsize=14, fontweight="bold")
    sns.despine(ax=ax)

    # Legend
    from matplotlib.patches import Patch
    legend_elements = [
        Patch(facecolor=_DIRECTION_COLORS["activator"], label="Activator"),
        Patch(facecolor=_DIRECTION_COLORS["repressor"], label="Repressor"),
        Patch(facecolor=_DIRECTION_COLORS["mixed"], label="Mixed"),

Companion files

Type	Path	Bytes
Markdown	references/integration_methods.md	5,505
Python	scripts/__init__.py	46
Python	scripts/export_all.py	10,981
Python	scripts/generate_all_plots.py	8,489
Python	scripts/generate_report.py	14,678
Python	scripts/load_de_results.py	5,293
Python	scripts/load_example_data.py	11,678
Python	scripts/run_integration_workflow.py	12,704
Python	scripts/score_regulons.py	5,459
Markdown	SKILL.md	10,214
JSON	skill.meta.json	2,196

Upstream Regulator Analysis

Overview

Learning goals

Figures

Tutorial

When to Use This Skill

Installation

Inputs

Outputs

Clarification Questions

Standard Workflow

Step 1 - Load data:

Step 2 - Run integration analysis:

Step 3 - Generate visualizations:

Step 4 - Export results:

Common Issues

Interpretation Guidelines

Regulatory Score

Direction Classification

Key Caveats

Suggested Next Steps

Related Skills

References

Code preview

scripts/__init__.py

scripts/export_all.py

scripts/generate_all_plots.py

Companion files

scripts/init.py