⌂ / Overview / Epigenomics & Gene Regulation / Differential Peaks / DMR

View companion source

Differential Peaks / DMR

Compare two groups for differential peaks or methylation.

Overview

Problem. Where do binding, accessibility or methylation differ?

Use when: Two groups of accession IDs

Avoid when: A single group with no comparison

Learning goals

DE thinking extends to genomic regions
Learn to read BED intervals

Figures

Tutorial

Compare two groups of experiments to identify differential peak regions (DPR) or differentially methylated regions (DMR) using the ChIP-Atlas Diff Analysis API.

When to Use This Skill

Use ChIP-Atlas diff analysis when you need to:

Find differential peaks between two conditions (treated vs control, tissue A vs B)
Identify differentially methylated regions between sample groups (Bisulfite-seq)
Compare chromatin accessibility between cell types using ATAC-seq/DNase-seq data
Leverage edgeR-based statistical framework on ChIP-Atlas public experiment data
Avoid raw data downloads — works directly with experiment accession IDs

Don't use for:

Single experiment analysis or peak calling (use MACS2/MACS3 workflows)
Enrichment of factors near a gene list (use chip-atlas-peak-enrichment)
Offline analysis (requires internet for API calls)

Key Concept: Submits two groups of experiment IDs to ChIP-Atlas. Server performs edgeR differential analysis (for DPR) or metilene (for DMR), returning BED files with genomic coordinates, logFC, p-values, q-values (FDR), and per-experiment normalized counts.

Installation

Software	Version	License	Commercial Use	Installation
pandas	>=1.3	BSD-3-Clause	Permitted	`pip install pandas`
requests	>=2.25	Apache-2.0	Permitted	`pip install requests`
numpy	>=1.20	BSD-3-Clause	Permitted	`pip install numpy`
plotnine	>=0.10	MIT	Permitted	`pip install plotnine`
plotnine-prism	>=0.2	MIT	Permitted	`pip install plotnine-prism`

pip install pandas requests numpy plotnine plotnine-prism

System requirements: Internet connection (API calls to ChIP-Atlas)

Inputs

Experiment IDs (two groups, minimum 2 per group):

SRA accessions: SRX, ERX, DRX (e.g., SRX18419259)
GEO accessions: GSM (e.g., GSM6765200)
Formats: Python list, plain text (one per line), CSV with ID column

Parameters:

Genome: hg38 (default), hg19, mm10, mm9, rn6, dm6, dm3, ce11, ce10, sacCer3
Analysis type: "diffbind" (default, DPR for ChIP/ATAC/DNase-seq) or "dmr" (DMR for Bisulfite-seq)

Outputs

Analysis objects (Pickle):

analysis_object.pkl - Complete results for downstream use
Load with: import pickle; obj = pickle.load(open('analysis_object.pkl', 'rb'))
Contains: diff_regions, diff_regions_unfiltered, qc_warnings, experiments_a/b, raw_files, log_content, parameters

Results (CSV):

diff_regions_all.csv - All differential regions, QC-filtered (chrom, start, end, logFC, pvalue, qvalue, direction, counts)
diff_regions_significant.csv - Significant regions (FDR < 0.05)
diff_regions_top50.csv - Top 50 by significance (with nearest gene annotation when available)
diff_regions_unfiltered.csv - Pre-QC regions (includes non-standard contigs and regions < 10bp)

Raw results:

raw_results/ - Original BED, IGV session, and log files from ChIP-Atlas

Visualizations (PNG + SVG, plotnine with Prism theme):

volcano_plot.png/.svg - Volcano plot (logFC vs -log10 Q-value)
chromosome_distribution.png/.svg - Differential regions by chromosome (stacked bar)
region_size_distribution.png/.svg - Region size histogram (log-scale)
ma_plot.png/.svg - MA plot (mean counts vs logFC)

Reports:

summary_report.md - Human-readable analysis summary with QC warnings, nearest gene annotations, and statistical caveats

Clarification Questions

Input Files (ASK THIS FIRST):
- Do you have experiment IDs (SRX/GSM) for two groups to compare?
- Expected: At least 2 IDs per group (3+ recommended for statistical power)
- Formats: Plain text (one ID per line), CSV with ID column, or Python list
- Or use example data? tp53_k562_dmso_vs_daunorubicin (recommended: TP53 K-562 DMSO vs Daunorubicin, n=8 vs n=8) or tp53_molm13_vs_k562 (MOLM-13 vs K-562, n=4 vs n=4, demonstrates sex-chr QC)
Analysis parameters:
- Species/genome? Human hg38 (default), hg19, mouse mm10/mm9, rat rn6, fly, worm, yeast
- Analysis type? "diffbind" (default, for ChIP/ATAC/DNase-seq) or "dmr" (for Bisulfite-seq)
- Group labels? Names for each group (e.g., "Treated" vs "Control"). (Auto-set for example data.)

Standard Workflow

🚨 MANDATORY: USE SCRIPTS EXACTLY AS SHOWN - DO NOT WRITE INLINE CODE 🚨

Step 1 - Load data:

# Option 1: Example data
from scripts.load_example_data import load_example_data
data = load_example_data("tp53_k562_dmso_vs_daunorubicin")
experiments_a = data['experiments_a']
experiments_b = data['experiments_b']

# Option 2: Your own experiment IDs
# from scripts.load_user_data import load_user_data
# experiments_a = load_user_data("group_a.txt")
# experiments_b = load_user_data("group_b.txt")

✅ VERIFICATION: "✓ Data loaded successfully"

Step 2 - Run diff analysis:

from scripts.run_diff_workflow import run_diff_workflow

results = run_diff_workflow(
    experiments_a=experiments_a,
    experiments_b=experiments_b,
    genome=data.get('genome', 'hg38'),
    analysis_type=data.get('analysis_type', 'diffbind'),
    description_a=data.get('description_a', 'group_A'),
    description_b=data.get('description_b', 'group_B'),
    output_dir="diff_analysis_results"
)

DO NOT write inline API query code. Just use the script.

✅ VERIFICATION: "✓ Diff analysis completed successfully!"

Step 3 - Generate visualizations:

from scripts.generate_all_plots import generate_all_plots
generate_all_plots(results, output_dir="diff_analysis_results")

DO NOT write inline plotting code. The script handles PNG + SVG with graceful fallback.

✅ VERIFICATION: "✓ All visualizations generated successfully!"

Step 4 - Export results:

from scripts.export_all import export_all
export_all(results, output_dir="diff_analysis_results")

DO NOT write custom export code. Use export_all().

✅ VERIFICATION: "=== Export Complete ==="

⚠️ IF SCRIPTS FAIL - Hierarchy:

Fix and Retry (90%) - Install missing package, check internet, re-run
Modify Script (5%) - Edit the script, document changes
Use as Reference (4%) - Read script, adapt approach
Write from Scratch (1%) - Only if impossible, explain why

Common Issues

Error	Cause	Solution
API 400 error	Invalid parameters	Check experiment IDs are valid (SRX/ERX/DRX/GSM format). See references/chipatlas_diff_api_format.md.
API timeout (>15 min)	Large dataset or server load	Normal for many experiments. Default timeout is 15 min. Reduce experiments per group if needed.
No BED file in ZIP	Analysis produced no results	Check experiment IDs exist in ChIP-Atlas. Ensure both groups have valid data for the selected genome.
Empty differential regions	No significant differences	Groups may be too similar. Try experiments from more distinct conditions.
requestId parse error	API response format changed	Check ChIP-Atlas server status. The API uses text format (not JSON) for diff analysis.
SVG export error	Missing optional dependency	Normal - `generate_all_plots()` handles fallback automatically. PNG always created.
chrY/chrX regions all one-directional	Sex chromosome confound	Groups differ in sex (male vs female cell lines). QC check flags this automatically. Exclude sex chromosome regions from biological interpretation.
Gene annotations missing	UCSC API unavailable	Normal in offline environments. Gene annotations are optional — all statistical results remain complete.

Interpretation Guidelines

logFC: Log2 fold change — positive = enriched in Group A, negative = enriched in Group B Q-value (FDR): Benjamini-Hochberg corrected p-value — regions with FDR < 0.05 are significant Counts: TMM-normalized read counts per experiment (comma-separated in raw BED) Region Size: 100-500bp (TF sites), 500-2000bp (enhancers), >2000bp (broad marks)

Nearest gene annotations: Top regions in the summary report and top50 CSV are annotated with overlapping or nearby genes via UCSC REST API. Shows "GENE" for direct overlap or "GENE (+2.1kb)" for nearest gene within 5kb. Annotation is optional and skipped gracefully if the UCSC API is unavailable.

Automated QC checks (run automatically after Step 2):

Sex chromosome confounds: If chrY/chrX regions are overwhelmingly one-directional with near-zero counts in the depleted group, this signals a sex mismatch between groups (e.g., male vs female cell lines). Checks both all regions (>80% threshold) and significant regions specifically (>67% threshold) to catch cases where noisy non-significant regions mask the artifact. These peaks should NOT be interpreted as genuine differential binding.
Non-standard contigs: Regions on random scaffolds or unplaced sequences are filtered by default (mapping artifacts). Unfiltered data available in diff_regions_unfiltered.csv.
Implausibly small regions: Regions < 10bp are filtered by default (edge artifacts — TF motifs are ~20bp, histone marks 200bp+).
Genomic clusters: Dense clusters of one-directional significant regions (e.g., 10+ in <2 Mb) may indicate copy number amplification or structural variants rather than genuine differential binding. Investigate flagged clusters in a genome browser.
chrM enrichment (ChIP-seq/ATAC-seq only): Mitochondrial DNA lacks histones and canonical chromatin structure. Significant differential regions on chrM in ChIP-seq or ATAC-seq experiments almost always reflect mitochondrial DNA contamination or non-specific antibody binding, not genuine transcription factor binding. Regions on chrM that are also <20bp are doubly suspect (too small for a TF binding site). chrM regions are NOT flagged for Bisulfite-seq (DMR mode), where mitochondrial methylation differences can be biologically meaningful.
Sample size: With n ≤ 3, edgeR FDR values are unreliable — treat as approximate rankings. With n < 5, FDR estimates may be unstable. Focus on large effect sizes and validate key findings.
MA plot asymmetry: If significant hits in one direction cluster at substantially lower mean counts than the other direction (>1 log2 unit difference in median), this may indicate normalization artifacts or noise-driven signal. Focus biological interpretation on higher-count significant regions.
Unusually small regions: If the median significant region size is < 100bp for ChIP-seq/ATAC-seq, peak fragmentation may be inflating region counts. Check representative regions in a genome browser.

Reporting top hits: When summarizing top differential regions, present them in strict Q-value ranking order without skipping unannotated regions. Regions lacking gene annotations are still statistically significant. If you curate a subset (e.g., only gene-annotated hits), explicitly label the table as "selected highlights" — never imply it is a strict significance ranking. Accuracy: When quoting specific statistics (logFC, Q-values, coordinates) in your summary, copy values directly from summary_report.md or the CSV files — do NOT re-derive, round, or paraphrase numerical values from memory, as this introduces transcription errors.

Gene annotations: Nearest gene annotations indicate genomic proximity, not functional regulation. Describe associations factually (e.g., "a differential region near PCNA") without asserting direct regulatory relationships. Do not use individual gene annotations as validation of the analysis — gene proximity does not prove functional regulation.

Experimental design: edgeR treats all experiments within a group as biological replicates. If experiments within a group differ in important ways (e.g., different genotypes, tissues, or treatments), state this clearly — the differential analysis captures the average effect between groups, and within-group heterogeneity is treated as noise. When summarizing results to the user, design caveats MUST be stated prominently — not buried at the end. If within-group experiments are not true biological replicates, this is the single most important limitation to communicate.

MA plot interpretation: The MA plot shows mean expression (average log2 counts, x-axis) vs log fold change (y-axis). Normal pattern: funnel/fan shape narrowing at high counts (more power to detect small changes). Red flags: asymmetric significant hits clustering at low counts in one direction (potential normalization artifact), horizontal bands of significant hits (batch effects), or all significant hits at very low counts (noise-driven). CRITICAL: Do NOT claim MA plot asymmetry in your summary unless the automated QC check flagged it. The quantitative threshold is >1 log2 unit difference in median mean count between directions. Visual impressions of asymmetry below this threshold are not meaningful — small differences (e.g., 0.2 log2 units) are negligible and do not support claims of noise-driven signal. If no MA asymmetry QC warning appears, describe the MA plot as showing a normal pattern.

Biological interpretation of top hits: When discussing top differential regions near known genes, briefly note the gene's known function for context but do NOT claim the differential peak proves direct regulation. Genomic proximity does not establish causation. For exploratory analyses, suggest formal pathway/GO enrichment analysis as a next step to identify systematic patterns. Distinguish between confirmatory hits (regions near genes expected to differ) and exploratory hits (unexpected associations requiring validation). When the target factor is well-characterized (e.g., TP53, CTCF, ESR1, NF-kB), note whether canonical target genes appear in the top differential regions — the absence of expected targets is itself a notable finding that may indicate the differential signal is dominated by artifacts or indirect effects. After excluding artifact regions, comment on the effect sizes (logFC) of remaining biologically interpretable hits — modest effect sizes (|logFC| < 1.5) combined with annotations near pseudogenes or lncRNAs suggest limited genuine differential binding.

Citations: When writing your final summary, cite the data source (e.g., GSE accession) and key tools: ChIP-Atlas (Zou et al. 2024), edgeR (Robinson et al. 2010) for DPR, or metilene (Jühling et al. 2016) for DMR. Full references are in the References section below.

Caveats: Results depend on available public data quality. Always check QC warnings in the summary report. Validate key regions with orthogonal methods. See references/output_format.md for detailed output format.

Suggested Next Steps

Visualize in IGV using the .igv.xml session file or .igv.bed file
Annotate regions with nearby genes using bedtools or GREAT
Motif analysis on differential peaks (HOMER, MEME-ChIP)
Integrate with expression data to link peaks to gene regulation

Related Skills

chip-atlas-peak-enrichment - Enrichment of ChIP-seq peaks near gene lists
bulk-rnaseq-counts-to-de-deseq2 - Differential expression to identify genes for downstream analysis

References

Zou et al. (2024). ChIP-Atlas 3.0: a data-mining suite to explore chromosome architecture. Nucleic Acids Research. doi:10.1093/nar/gkad884
Zou et al. (2022). ChIP-Atlas 2021 update. Nucleic Acids Research. doi:10.1093/nar/gkab933
Robinson et al. (2010). edgeR: a Bioconductor package for differential expression analysis. Bioinformatics 26(1):139-140.
Jühling et al. (2016). metilene: fast and sensitive calling of differentially methylated regions. Genome Research 26(2):256-262.
ChIP-Atlas: https://chip-atlas.org
API documentation: See references/chipatlas_diff_api_format.md
Statistical methods: See references/diff_analysis_methods.md

Code preview

scripts/init.py

# chip-atlas-diff-analysis scripts package

scripts/annotate_genes.py

"""
Annotate differential regions with nearest/overlapping genes via UCSC REST API.

Uses the UCSC Genome Browser REST API to find genes overlapping or near
each genomic region. Requires internet (same as ChIP-Atlas API dependency).

Graceful fallback: if UCSC API is unavailable, returns empty annotations
without raising errors.
"""

import pandas as pd
import requests

UCSC_API_URL = "https://api.genome.ucsc.edu/getData/track"
FLANK_BP = 5000


def annotate_nearest_genes(df, genome="hg38", max_regions=50):
    """
    Annotate regions with overlapping/nearest gene symbols via UCSC API.

    Parameters
    ----------
    df : pd.DataFrame
        DataFrame with chrom, chromStart, chromEnd columns.
        Should be pre-sorted by significance (top regions first).
    genome : str
        UCSC genome assembly (hg38, mm10, etc.)
    max_regions : int
        Maximum number of regions to annotate (default: 50).

    Returns
    -------
    pd.Series
        Gene annotations indexed to match df. Each entry is a string
        like "SBF2" (overlapping) or "SBF2 (+2.1kb)" (nearest within flank).
        Empty string for unannotated regions.
    """
    annotations = pd.Series("", index=df.index)

    regions_to_query = df.head(max_regions)
    if len(regions_to_query) == 0:
        return annotations

    print(f"   Annotating top {len(regions_to_query)} regions with nearest genes (UCSC API)...")

    n_annotated = 0
    n_failed = 0

    for idx, row in regions_to_query.iterrows():
        try:
            gene_str = _query_genes_for_region(
                row["chrom"],
                int(row["chromStart"]),
                int(row["chromEnd"]),
                genome=genome,
            )
            if gene_str:
                annotations.at[idx] = gene_str
                n_annotated += 1
        except Exception:
            n_failed += 1
            if n_failed >= 3:
                print(f"   Warning: UCSC API unavailable ({n_failed} failures). "
                      f"Skipping remaining gene annotations.")
                break

    print(f"   Annotated {n_annotated}/{len(regions_to_query)} regions with gene symbols")
    return annotations


def _query_genes_for_region(chrom, start, end, genome="hg38"):
    """
    Query UCSC REST API for genes overlapping or near a region.

    Returns
    -------
    str
        Gene symbol(s) or empty string.
        Direct overlap: "GENE1, GENE2"

scripts/export_all.py

"""
Export all ChIP-Atlas diff analysis results (Step 4).

Exports:
1. analysis_object.pkl - Complete results for downstream use
2. diff_regions_all.csv - All differential regions
3. diff_regions_significant.csv - Significant regions (FDR < 0.05)
4. diff_regions_top50.csv - Top 50 by significance
5. summary_report.md - Human-readable summary
"""

import os
import pickle
from datetime import datetime

try:
    try:
        from scripts.annotate_genes import annotate_nearest_genes
    except ImportError:
        from annotate_genes import annotate_nearest_genes
    _HAS_GENE_ANNOTATION = True
except ImportError:
    _HAS_GENE_ANNOTATION = False


def export_all(results, output_dir="diff_analysis_results", qvalue_threshold=0.05):
    """
    Export all ChIP-Atlas diff analysis results with pickle object.

    Parameters
    ----------
    results : dict
        Results from run_diff_workflow()
    output_dir : str
        Output directory
    qvalue_threshold : float
        Maximum Q-value (FDR) for "significant" regions (default: 0.05)

    Verification
    ------------
    Prints "=== Export Complete ===" when done
    """
    os.makedirs(output_dir, exist_ok=True)

    print("\n" + "=" * 70)
    print("EXPORTING RESULTS")
    print("=" * 70 + "\n")

    df = results["diff_regions"]
    params = results["parameters"]

    qc_warnings = results.get("qc_warnings", [])
    unfiltered_df = results.get("diff_regions_unfiltered", df)

    # 1. Save analysis object as pickle (CRITICAL for downstream skills)
    print("1. Saving analysis objects for downstream use...")

    n_a = (df["direction"] == "A_enriched").sum() if len(df) > 0 else 0
    n_b = (df["direction"] == "B_enriched").sum() if len(df) > 0 else 0

    analysis_object = {
        "diff_regions": df,
        "diff_regions_unfiltered": unfiltered_df,
        "qc_warnings": qc_warnings,
        "experiments_a": results["experiments_a"],
        "experiments_b": results["experiments_b"],
        "raw_files": results.get("raw_files", {}),
        "log_content": results.get("log_content", ""),
        "parameters": params,
        "n_regions_total": len(df),
        "n_regions_unfiltered": len(unfiltered_df),
        "n_regions_a_enriched": int(n_a),
        "n_regions_b_enriched": int(n_b),
        "timestamp": datetime.now().isoformat(),
    }

    pickle_path = os.path.join(output_dir, "analysis_object.pkl")
    with open(pickle_path, "wb") as f:
        pickle.dump(analysis_object, f)

Companion files

Type	Path	Bytes
Markdown	references/chipatlas_diff_api_format.md	3,114
Markdown	references/diff_analysis_methods.md	3,082
Markdown	references/output_format.md	2,955
Python	scripts/__init__.py	43
Python	scripts/annotate_genes.py	4,346
Python	scripts/export_all.py	23,617
Python	scripts/filter_regions.py	5,816
Python	scripts/generate_all_plots.py	10,544
Python	scripts/load_example_data.py	6,108
Python	scripts/load_user_data.py	6,774
Python	scripts/parse_bed_results.py	8,029
Python	scripts/qc_checks.py	23,954
Python	scripts/query_chipatlas_api.py	10,536
Python	scripts/run_diff_workflow.py	8,941
Markdown	SKILL.md	16,221
JSON	skill.meta.json	3,044

Differential Peaks / DMR

Overview

Learning goals

Figures

Tutorial

When to Use This Skill

Installation

Inputs

Outputs

Clarification Questions

Standard Workflow

Step 1 - Load data:

Step 2 - Run diff analysis:

Step 3 - Generate visualizations:

Step 4 - Export results:

Common Issues

Interpretation Guidelines

Suggested Next Steps

Related Skills

References

Code preview

scripts/__init__.py

scripts/annotate_genes.py

scripts/export_all.py

Companion files

scripts/init.py