View companion source

Target Genes (TF → genes)

Pre-computed target genes for any transcription factor.

Overview

Problem. Which genes does this TF bind and possibly regulate?

Use when: Known TF; want its target list
Avoid when: Treating binding as proven regulation

Learning goals

Figures

TF Target Genes Overview
MACS2 Binding Score
Output Visualizations
Targets vs Enrichment
Interpretation Pitfalls

Tutorial

Find target genes for any transcription factor using pre-computed ChIP-Atlas public ChIP-seq data.

When to Use This Skill

Use ChIP-Atlas target genes when you need to: - Identify target genes of a specific TF from all public ChIP-seq experiments - Rank potential targets by MACS2 binding score across hundreds of experiments - Compare TF binding across cell types using per-experiment binding scores - Validate known TF-target relationships with independent ChIP-seq evidence - Cross-reference with STRING protein interaction data for high-confidence targets

Don't use for:

  • Finding which TFs bind near your genes (use chip-atlas-peak-enrichment instead)
  • Histone mark targets (only non-histone antigens/TFs available)
  • Offline analysis (requires internet for data download)
  • Raw ChIP-seq analysis from FASTQ/BAM files

Key Concept: Downloads pre-computed TSV files containing MACS2 binding scores for every gene, across all public ChIP-seq experiments for the specified protein. Genes are ranked by average binding score. No API job submission needed — data is served as static files. STRING protein interaction scores are pre-embedded columns in the ChIP-Atlas TSV — no separate STRING API query is performed.

Installation

Software Version License Commercial Use Installation
pandas >=1.3 BSD-3-Clause Permitted pip install pandas
requests >=2.25 Apache-2.0 Permitted pip install requests
numpy >=1.20 BSD-3-Clause Permitted pip install numpy
plotnine >=0.12 MIT Permitted pip install plotnine
plotnine_prism >=0.1 MIT Permitted pip install plotnine_prism
matplotlib >=3.4 PSF-based Permitted pip install matplotlib
seaborn >=0.12 BSD-3-Clause Permitted pip install seaborn
pip install pandas requests numpy plotnine plotnine_prism matplotlib seaborn

System requirements: Internet connection (downloads from ChIP-Atlas data server)

Inputs

Query parameters:

  • Protein/TF name: Case-sensitive gene symbol (e.g., "TP53", "CTCF", "MYC")
  • Genome: hg38 (default), hg19, mm10, mm9, rn6, dm6, dm3, ce11, ce10, sacCer3
  • Distance from TSS: 1kb, 5kb (default), or 10kb

Optional filters:

  • min_score: Minimum average MACS2 binding score (default: 0)
  • top_n: Keep top N genes (default: 500)
  • cell_types: List of cell types to subset (recalculates averages)
  • min_string_score: Minimum STRING interaction score
  • min_binding_rate: Minimum fraction of experiments with binding

Outputs

Analysis objects (Pickle):

  • analysis_object.pkl - Complete results for downstream use
  • Load with: import pickle; obj = pickle.load(open('analysis_object.pkl', 'rb'))
  • Contains: target_genes, experiment_data, cell_types, protein, parameters, metadata

Results (CSV):

  • target_genes_all.csv - All target genes (gene, avg_score, string_score, binding_rate, num_bound, max_score, colocated_group)
  • target_genes_top50.csv - Top 50 by average binding score
  • target_genes_with_string.csv - Genes with STRING interaction evidence
  • experiment_scores_top50.csv - Wide-format per-experiment scores for top 50

Visualizations (PNG + SVG, plotnine with Prism theme):

  • target_genes_top_targets.png/.svg - Top target genes barplot
  • target_genes_score_distribution.png/.svg - Binding score distribution histogram
  • target_genes_heatmap.png/.svg - Binding heatmap (top genes × experiments)
  • target_genes_string_vs_binding.png/.svg - STRING vs binding scatter

Reports:

  • summary_report.md - Human-readable analysis summary

Clarification Questions

🚨 ALWAYS ask Question 1 FIRST. Do not ask about species, genome, or analysis parameters before the user has answered Question 1.

1. Query (ASK THIS FIRST):

🚨 IF EXAMPLE DATA SELECTED: All parameters are pre-defined (human hg38, ±5kb TSS, all cell types, no score threshold). DO NOT ask question 2. Proceed directly to Step 1.

Question 2 is ONLY for users providing their own query:

2. Analysis parameters:

Standard Workflow

🚨 MANDATORY: USE SCRIPTS EXACTLY AS SHOWN - DO NOT WRITE INLINE CODE 🚨

Step 1 - Load query:

# Option 1: Example query
from scripts.load_example_query import load_example_query
query = load_example_query("tp53")

# Option 2: Your own protein
# from scripts.load_user_query import load_user_query
# query = load_user_query("TP53", genome="hg38", distance=5)

✅ VERIFICATION: "✓ Query loaded: TP53 target genes (hg38, ±5kb)"

Step 2 - Run target genes analysis:

from scripts.run_target_genes_workflow import run_target_genes_workflow

results = run_target_genes_workflow(
    protein=query['protein'],
    genome=query['genome'],
    distance=query['distance'],
    top_n=500,
    output_dir="target_genes_results"
)

DO NOT write inline download/parsing code. Just use the script.

✅ VERIFICATION: "✓ Target genes analysis completed successfully!"

Step 3 - Generate visualizations:

from scripts.generate_all_plots import generate_all_plots
generate_all_plots(results, output_dir="target_genes_results", top_n=25)

DO NOT write inline plotting code. The script handles PNG + SVG with graceful fallback.

✅ VERIFICATION: "✓ All visualizations generated successfully!"

Step 4 - Export results:

from scripts.export_all import export_all
export_all(results, output_dir="target_genes_results")

DO NOT write custom export code. Use export_all().

✅ VERIFICATION: "=== Export Complete ==="

⚠️ CRITICAL - DO NOT:

  • Write inline download/parsing codeSTOP: Use run_target_genes_workflow()
  • Write inline plotting codeSTOP: Use generate_all_plots()
  • Write custom export codeSTOP: Use export_all()

⚠️ IF SCRIPTS FAIL - Script Failure Hierarchy:

  1. Fix and Retry (90%) - Install missing package, check internet, re-run
  2. Modify Script (5%) - Edit the script file itself, document changes
  3. Use as Reference (4%) - Read script, adapt approach, cite source
  4. Write from Scratch (1%) - Only if genuinely impossible, explain why

NEVER skip directly to writing inline code without trying the script first.

Common Issues

Error Cause Solution
HTTP 404 for protein Invalid or unavailable antigen Check case sensitivity ("TP53" not "tp53"). Histone marks not available. See references/target_genes_data_format.md.
Download timeout Large file or slow connection TP53 is ~13MB; allow up to 2 minutes. Try smaller TF first (e.g., MYC).
Memory error on large file Very wide TSV (100s of columns) Use top_n parameter to limit genes. Cell-type filter reduces columns.
No STRING data (all zeros) Protein not in STRING database Normal for less-studied TFs. Binding scores still valid without STRING.
Empty results after filtering Filters too strict Lower min_score, remove cell_type filter, increase top_n.
SVG export error Missing optional dependency Normal - generate_all_plots() handles fallback. PNG always created.

Interpretation Guidelines

Average Binding Score (MACS2): −10 × log10(Q-value). Higher = stronger binding evidence.

  • ≥500: Very strong binding (Q ≤ 1e-50) — high-confidence direct target
  • 100-500: Strong binding — likely direct target
  • 50-100: Moderate binding — possible target, may be cell-type-specific
  • <50: Weak binding — marginal evidence

Note: The Q-value thresholds above apply to individual experiment scores. Average scores include zeros from non-binding experiments, so an average of 500 reflects a consensus level — not that every experiment shows Q ≤ 1e-50.

Binding Rate: Fraction of experiments with any binding, shown as % with n/N count (e.g., "66.3% (260/392)"). >50% = consistent across cell types; <10% = cell-type-specific.

STRING Score: Independent evidence of regulatory interaction. >400 = medium confidence; >700 = high confidence. Genes with BOTH high binding + high STRING = highest-confidence targets. STRING score of 0 does NOT mean "not a target" — even well-characterized targets (e.g., BBC3/PUMA for TP53) can have STRING score 0 due to gaps in STRING coverage.

Caveats:

  • Averaging includes zeros: Average scores are computed across ALL experiments (including those with score 0). Use max_score and binding_rate for complementary views.
  • Cell-type bias: Experiments are unevenly distributed — a few well-studied cell lines dominate. See the "Experiment Composition" section in summary_report.md for exact distribution.
  • Co-located genes: Some genes share identical binding scores because they sit at the same genomic locus within the TSS window. The colocated_group column in the CSV flags these genes. For pathway enrichment, consider collapsing co-located groups to avoid double-counting loci. See summary_report.md Caveats for exact counts.
  • External annotations: ChIP-Atlas provides binding data only. Any biological role descriptions in the agent's summary are from general knowledge, not from this analysis output. Always cite the actual data columns (avg_score, binding_rate, string_score) when reporting results.

Reporting Results

🚨 CRITICAL: Follow these rules when presenting results to the user.

  1. Rankings MUST come from the data files. Read summary_report.md or target_genes_all.csv for exact gene names, ranks, and scores. NEVER construct ranking tables from general biological knowledge — even if a gene is a well-known target, its rank must match the data.
  2. Do NOT substitute biologically famous genes into top-N lists where the data ranks them lower. If a well-known target is not in the top 10, say so explicitly (e.g., "BAX, a well-characterized target, ranks #17 with avg_score 368.8").
  3. Use exact values from the CSV/report (1 decimal place for scores, 1 decimal for percentages). Do not round to integers.
  4. Cite the data source as: ChIP-Atlas (Zou et al., 2024) with the DOI from the References section.
  5. Mention co-located gene groups if the summary report flags them — they affect the effective number of independent targets.
  6. Label biological annotations: When describing gene functions or pathway roles, explicitly note these come from general knowledge (e.g., "CDKN1A — known cell cycle arrest effector — ranks #1"), not from ChIP-Atlas output.

Suggested Next Steps

  1. Run peak enrichment with top target genes to find co-regulatory factors (chip-atlas-peak-enrichment)
  2. Cell-type-specific analysis — re-run with cell_types filter matching your experimental system
  3. Gene regulatory network construction using top targets as nodes
  4. Functional enrichment of top target genes (GO, pathway analysis)

Related Skills

References

Code preview

scripts/download_target_genes.py

"""
Download and parse ChIP-Atlas Target Genes pre-computed TSV data.

Core script for the chip-atlas-target-genes skill.
Downloads wide-format TSV from ChIP-Atlas and parses into summary + experiment DataFrames.
"""

import io
import re

import pandas as pd
import requests

# ChIP-Atlas data server base URL
BASE_URL = "https://chip-atlas.dbcls.jp/data"

# Valid genomes
VALID_GENOMES = ["hg38", "hg19", "mm10", "mm9", "rn6", "dm6", "dm3", "ce11", "ce10", "sacCer3"]

# Valid distance values (kb from TSS)
VALID_DISTANCES = [1, 5, 10]


def _build_url(protein, genome, distance):
    """Build the download URL for target genes TSV."""
    return f"{BASE_URL}/{genome}/target/{protein}.{distance}.tsv"


def check_antigen_available(protein, genome="hg38", distance=5):
    """
    Check if target gene data exists for a given protein/antigen.

    Args:
        protein: Protein/TF name (case-sensitive, e.g., "TP53")
        genome: Genome assembly (default: "hg38")
        distance: Distance from TSS in kb (1, 5, or 10)

    Returns:
        bool: True if data is available, False otherwise
    """
    if genome not in VALID_GENOMES:
        print(f"  ERROR: Invalid genome '{genome}'. Valid: {', '.join(VALID_GENOMES)}")
        return False

    if distance not in VALID_DISTANCES:
        print(f"  ERROR: Invalid distance {distance}. Valid: {VALID_DISTANCES}")
        return False

    url = _build_url(protein, genome, distance)

    try:
        resp = requests.head(url, timeout=15, allow_redirects=True)
        if resp.status_code == 200:
            return True
        elif resp.status_code == 404:
            # Provide helpful suggestions
            print(f"  WARNING: No target gene data for '{protein}' ({genome}, ±{distance}kb)")
            print(f"  - Protein names are case-sensitive (e.g., 'TP53' not 'tp53')")
            print(f"  - Histone marks (H3K4me3, etc.) are NOT available in Target Genes")
            print(f"  - Check https://chip-atlas.org/target_genes for available antigens")
            return False
        else:
            print(f"  WARNING: Unexpected HTTP {resp.status_code} for {url}")
            return False
    except requests.RequestException as e:
        print(f"  ERROR: Network error checking antigen availability: {e}")
        return False


def download_target_genes(protein, genome="hg38", distance=5):
    """
    Download and parse ChIP-Atlas target genes TSV data.

    Args:
        protein: Protein/TF name (case-sensitive, e.g., "TP53")
        genome: Genome assembly (default: "hg38")
        distance: Distance from TSS in kb (1, 5, or 10)

    Returns:
        tuple: (summary_df, experiment_df)

scripts/export_all.py

"""
Export results for ChIP-Atlas Target Genes analysis.

Saves analysis object (pickle), CSV files, and markdown summary report.
"""

import os
import pickle
from datetime import datetime


def export_all(results, output_dir="target_genes_results"):
    """
    Export all target genes results to files.

    Args:
        results: Results dict from run_target_genes_workflow()
        output_dir: Output directory (default: "target_genes_results")

    Exports:
        - analysis_object.pkl (complete results for downstream skills)
        - target_genes_all.csv (all target genes with summary scores)
        - target_genes_top50.csv (top 50 by average score)
        - target_genes_with_string.csv (genes with STRING score > 0, conditional)
        - experiment_scores_top50.csv (wide-format per-experiment for top 50)
        - summary_report.md (human-readable report)
    """
    os.makedirs(output_dir, exist_ok=True)

    target_genes = results["target_genes"]
    experiment_data = results["experiment_data"]
    protein = results["protein"]
    metadata = results["metadata"]
    parameters = results["parameters"]

    print(f"\n  Exporting results to: {output_dir}/")

    # 1. Analysis object (pickle)
    pkl_path = os.path.join(output_dir, "analysis_object.pkl")
    export_data = {
        "target_genes": target_genes,
        "experiment_data": experiment_data,
        "cell_types": results["cell_types"],
        "protein": protein,
        "parameters": parameters,
        "metadata": metadata,
        "exported_at": datetime.now().isoformat(),
    }
    with open(pkl_path, "wb") as f:
        pickle.dump(export_data, f)
    print(f"    Saved: analysis_object.pkl")
    print(f"    (Load with: import pickle; obj = pickle.load(open('{pkl_path}', 'rb')))")

    # 2. All target genes CSV
    all_csv_path = os.path.join(output_dir, "target_genes_all.csv")
    target_genes.to_csv(all_csv_path, index=False)
    print(f"    Saved: target_genes_all.csv ({len(target_genes)} genes)")

    # 3. Top 50 target genes CSV
    top50 = target_genes.head(50)
    top50_path = os.path.join(output_dir, "target_genes_top50.csv")
    top50.to_csv(top50_path, index=False)
    print(f"    Saved: target_genes_top50.csv ({len(top50)} genes)")

    # 4. Genes with STRING interactions (conditional)
    with_string = target_genes[target_genes["string_score"] > 0]
    if len(with_string) > 0:
        string_path = os.path.join(output_dir, "target_genes_with_string.csv")
        with_string.to_csv(string_path, index=False)
        print(f"    Saved: target_genes_with_string.csv ({len(with_string)} genes)")

    # 5. Wide-format experiment scores for top 50 genes
    if experiment_data is not None:
        top50_genes = top50["gene"].tolist()
        exp_top50 = experiment_data[experiment_data["gene"].isin(top50_genes)]
        exp_path = os.path.join(output_dir, "experiment_scores_top50.csv")
        exp_top50.to_csv(exp_path, index=False)
        print(f"    Saved: experiment_scores_top50.csv ({len(exp_top50)} genes)")

    # 6. Summary report

scripts/filter_targets.py

"""
Post-download filtering for ChIP-Atlas Target Genes results.

Filters target genes by binding score, cell type, STRING score, and more.
"""

import numpy as np
import pandas as pd


def filter_targets(
    target_genes_df,
    experiment_df=None,
    min_avg_score=0,
    cell_types=None,
    min_string_score=0,
    top_n=None,
    min_binding_rate=0,
):
    """
    Filter target genes by various criteria.

    Args:
        target_genes_df: Summary DataFrame (gene, avg_score, string_score, etc.)
        experiment_df: Wide-format per-experiment DataFrame (optional, needed for cell_type filter)
        min_avg_score: Minimum average MACS2 binding score (default: 0)
        cell_types: List of cell types to subset (recalculates average from those experiments only)
        min_string_score: Minimum STRING interaction score (default: 0)
        top_n: Keep only top N genes by average score (default: None = all)
        min_binding_rate: Minimum fraction of experiments with binding (0-1, default: 0)

    Returns:
        tuple: (filtered_target_genes_df, filtered_experiment_df or None)
    """
    initial_count = len(target_genes_df)
    df = target_genes_df.copy()
    exp_df = experiment_df.copy() if experiment_df is not None else None

    # Cell-type filtering (must come first — recalculates avg_score)
    if cell_types and exp_df is not None:
        df, exp_df = _filter_by_cell_type(df, exp_df, cell_types)

    # Score filters
    if min_avg_score > 0:
        df = df[df["avg_score"] >= min_avg_score]

    if min_string_score > 0:
        df = df[df["string_score"] >= min_string_score]

    if min_binding_rate > 0:
        df = df[df["binding_rate"] >= min_binding_rate]

    # Top N (applied last, after other filters)
    if top_n is not None and top_n > 0:
        df = df.head(top_n)

    # Sync experiment_df to match filtered genes
    if exp_df is not None:
        exp_df = exp_df[exp_df["gene"].isin(df["gene"])]

    df = df.reset_index(drop=True)
    if exp_df is not None:
        exp_df = exp_df.reset_index(drop=True)

    print(f"  ✓ Filtered: {initial_count} → {len(df)} target genes")
    return df, exp_df


def _filter_by_cell_type(target_genes_df, experiment_df, cell_types):
    """
    Filter to specific cell types and recalculate average scores.

    Args:
        target_genes_df: Summary DataFrame
        experiment_df: Wide-format experiment DataFrame
        cell_types: List of cell type names to keep (case-insensitive)

    Returns:
        tuple: (updated_target_genes_df, filtered_experiment_df)
    """

Companion files

TypePathBytes
JSON_unavailable_files.json269
Markdownreferences/macs2_binding_scores.md3,417
Markdownreferences/string_scores.md2,307
Markdownreferences/target_genes_data_format.md2,486
Pythonscripts/download_target_genes.py7,007
Pythonscripts/export_all.py19,053
Pythonscripts/filter_targets.py7,506
Pythonscripts/generate_all_plots.py11,853
Pythonscripts/load_example_query.py2,102
Pythonscripts/load_user_query.py1,628
Pythonscripts/run_target_genes_workflow.py4,716
MarkdownSKILL.md12,755
JSONskill.meta.json2,477