Omics Dataset Retrieval

Find, catalogue and relevance-audit public omics datasets.

Overview

Problem. Inventory what public data already exists before starting.

Use when: Surveying public omics datasets
Avoid when: Not recording inclusion / exclusion rationale

Learning goals

Figures

Omics Dataset Retrieval Overview
Retrieval Workflow
Repository Tier Map
Relevance Labels
Required Catalog Columns
Common Pitfalls

Tutorial

Scope

Systematically retrieve, deduplicate, classify, and relevance-audit publicly available omics datasets for a user-specified disease, phenotype, gene, or biological process. Covers all major omics types (transcriptomics, proteomics, metabolomics, epigenomics, genomics, single-cell, spatial transcriptomics, lipidomics, multi-omics) and the broadest possible set of public repositories. Does not download raw data files or perform downstream analysis.


Inputs

Parameter Type Description
disease_or_topic string Disease name, phenotype, gene, or biological process (e.g., "sickle cell disease", "Alzheimer's disease", "BCL11A")
synonyms list[str] Alternative names, abbreviations, gene symbols (e.g., ["SCD", "SCA", "HbSS", "sickle cell anemia"])
omics_types list[str] or "all" Restrict to specific omics types, or "all" (default)
organism string "all" (default) — includes human, mouse, and all other organisms; restrict to "human" or "mouse" only if explicitly requested
year_min int Earliest publication year to include (default: no limit)
output_dir path Where to save outputs (default: /mnt/results/)

Outputs

File Description
<disease>_omics_datasets_MASTER.csv One comprehensive catalog with all retained candidates and evidence-backed relevance labels
<disease>_query_manifest.csv One row per executed query with repository, exact query, query category, API/source, timestamp, raw hit count, and post-dedup count
Optional split CSVs, e.g. <disease>_gene_expression_datasets.csv, <disease>_proteomics_datasets.csv Only if the user explicitly asks for separate tables; must include the same required relevance audit columns
<disease>_omics_summary.md Markdown summary: search breadth, counts by relevance label / omics type / repository, label definitions, evidence examples, limitations
<disease>_omics_landscape.png Overview figure: donut chart + repository bar + timeline (if requested)

Do not create a default filtered or validated CSV. Save one master CSV by default so that no potentially relevant datasets disappear silently. If the user explicitly asks for a short example list, still use the same columns and state the search breadth in the Markdown summary. If the user asks for separate tables, specific output columns, or a custom table schema, treat those requested columns as additional presentation columns, not an exclusive schema. The required relevance audit columns below must be present in every CSV and Markdown table.

Every CSV and Markdown table must include these relevance audit columns and must not include a numeric relevance score:

Column Required content
relevance_label One of direct, adjacent, weak, out_of_scope, false_positive
include_in_final Boolean. true for rows included in final user-facing tables; false for adjacent/weak rows retained only in the master catalog and for excluded candidates
relevance_reason One concise sentence explaining why the label was assigned
relevance_evidence Short title/summary/metadata excerpt supporting the label
evidence_source Field(s) where the evidence came from, e.g. Title, Summary, Publication; never use source_query or Query as relevance evidence
matched_terms Semicolon-separated topic, synonym, tissue, gene, omics, or exclusion terms that matched

Non-negotiable output rule: never rename, drop, hide, or summarize away these six columns. Use exactly these column names. Do not rename relevance_label to Relevance. Do not save a proteomics/gene-expression/custom CSV without these columns. Do not omit them from a final Markdown table because the user listed other columns; the user's listed columns are minimum requested fields, not a maximum schema.

The final deliverable may exclude rows where include_in_final == false, but exclusions and non-final adjacent/weak rows must be auditable. Keep them in the master audit catalog or an excluded-candidates CSV, and summarize counts and examples in the Markdown report. By default, primary final split tables include direct rows only; include adjacent or weak rows only when the user explicitly asks for a broader final table.


Repository Coverage Map

Work through repositories in priority order. Tier 1 repositories have programmatic APIs and are the highest-yield sources. Tier 2 have APIs but are more specialized. Tier 3 require web search or manual curation. In comprehensive inventory mode, attempt all tiers because rare diseases often have data only in Tier 2/3 repositories. In representative examples mode, start with Tier 1 and add Tier 2/3 only as needed to cover the requested omics types.

Tier 1 — High-yield, programmatic APIs (query first)

Repository Omics Focus API Base URL Notes
GEO (NCBI) All transcriptomics, epigenomics https://eutils.ncbi.nlm.nih.gov/entrez/eutils/ Largest source; run 20–40 targeted queries
SRA (NCBI) Raw sequencing reads https://eutils.ncbi.nlm.nih.gov/entrez/eutils/ (db=sra) Complements GEO; finds studies not deposited as GEO series
ArrayExpress / BioStudies (EBI) Transcriptomics, functional genomics https://www.ebi.ac.uk/biostudies/api/v1/search European mirror of GEO; many unique studies
PRIDE / ProteomeXchange Proteomics (MS) https://www.ebi.ac.uk/pride/ws/archive/v2/projects/search Primary proteomics repository
OmicsDI (aggregator) Multi-omics aggregator https://www.omicsdi.org/ws/dataset/search Covers MetaboLights, MassIVE, GNPS, Metabolomics Workbench, ArrayExpress, GEO, PRIDE — use to catch metabolomics and avoid re-querying individual repos
CZ CELLxGENE scRNA-seq, spatial https://api.cellxgene.cziscience.com/curation/v1/collections 33M+ cells; query by disease using cellxgene_census Python package
GDC / TCGA (NCI) Cancer multi-omics https://api.gdc.cancer.gov/ Best for cancer diseases; covers TCGA, TARGET, CGCI, CPTAC-GDC

Tier 2 — Specialized APIs (query when relevant to disease/omics type)

Repository Omics Focus API / Access When to use
ENCODE Epigenomics (ChIP-seq, ATAC-seq, RNA-seq) https://www.encodeproject.org/search/?format=json Regulatory genomics; TF binding; chromatin accessibility
Expression Atlas (EBI) Bulk + single-cell RNA-seq https://www.ebi.ac.uk/gxa/json/experiments Curated, baseline + differential expression experiments
Human Cell Atlas (HCA) scRNA-seq, spatial https://service.azul.data.humancellatlas.org/index/projects Reference cell type atlases; healthy tissue baselines
Metabolomics Workbench (NIH) Metabolomics, lipidomics https://www.metabolomicsworkbench.org/rest/study/study_id/ST/named_json NIH-funded metabolomics; REST API available
MetaboLights (EBI) Metabolomics https://www.ebi.ac.uk/metabolights/ws/studies/ European metabolomics repository
MassIVE / GNPS Metabolomics, lipidomics https://massive.ucsd.edu/ProteoSAFe/datasets.jsp MS-based metabolomics; use OmicsDI to search
jPOST Proteomics (MS) https://repository.jpostdb.org/search Japanese proteomics repository; ProteomeXchange member
iProX Proteomics (MS) https://www.iprox.cn/page/project.html Chinese proteomics repository; ProteomeXchange member
cBioPortal Cancer multi-omics https://www.cbioportal.org/api/ Cancer genomics; mutation, CNA, expression, methylation
EpiRR / IHEC Epigenomics reference https://www.ebi.ac.uk/epirr/api/ IHEC reference epigenomes; healthy tissue baselines
Human Protein Atlas (HPA) Proteomics, RNA-seq https://www.proteinatlas.org/api/ Tissue/cell-type protein and RNA expression
ENA (EBI) Raw sequencing https://www.ebi.ac.uk/ena/portal/api/search European mirror of SRA; raw reads

Tier 3 — Web search + manual curation (always attempt in comprehensive mode)

Repository Omics Focus Search Strategy
Zenodo All types WebSearch: "{disease}" omics dataset site:zenodo.org
Figshare All types WebSearch: "{disease}" RNA-seq proteomics site:figshare.com
Dryad All types WebSearch: "{disease}" omics data site:datadryad.org
OSF (Open Science Framework) All types WebSearch: "{disease}" omics dataset site:osf.io
Harvard Dataverse All types WebSearch: "{disease}" omics dataset site:dataverse.harvard.edu
Synapse (Sage Bionetworks) Neuroscience, cancer WebSearch: "{disease}" omics dataset site:synapse.org
ICGC Data Portal Cancer genomics WebSearch: "{disease}" ICGC site:dcc.icgc.org
CPTAC (NCI) Cancer proteomics WebSearch: "{disease}" CPTAC proteomics site:proteomics.cancer.gov
AWS Open Data Registry All types WebSearch: "{disease}" omics site:registry.opendata.aws
dbGaP (NIH, controlled) Genomics, WGS WebSearch: "{disease}" WGS genomics site:ncbi.nlm.nih.gov/gap
EGA (EBI, controlled) Genomics, WGS WebSearch: "{disease}" genome sequencing site:ega-archive.org
JGA (Japan, controlled) Genomics WebSearch: "{disease}" genomics site:ddbj.nig.ac.jp/jga
UK Biobank Population genomics WebSearch: "{disease}" UK Biobank omics
FinnGen Population genomics WebSearch: "{disease}" FinnGen GWAS

Workflow

Step 1 — Ask upfront clarification questions (MANDATORY before any search)

This step is not optional. Before running any queries, use AskUserQuestion to collect the information below. Do not assume defaults — the answers materially affect which repositories are queried, how many results are returned, and how the relevance audit is tuned.

Ask ALL of the following in a single AskUserQuestion call (skip any already answered in the user's initial message, but always ask the rest):

Q1 — Search breadth

  • Should the result be a comprehensive inventory or only a few representative examples?
  • Comprehensive inventory (default) — search as broadly as possible and keep all plausible candidates in the master CSV with relevance labels.
  • Representative examples — return only a small curated sample, still with relevance labels and evidence. Ask the user for an approximate count if they do not provide one.

Q2 — Disease / topic (if not already provided) - What disease, phenotype, gene, or biological process should we search for? - Any synonyms, abbreviations, or alternative names to include? (e.g., "SCD", "SCA", "HbSS" for sickle cell disease — more synonyms = better recall) - Any key genes or pathways central to this topic? (used to build additional targeted GEO queries beyond the disease name alone)

Q3 — Omics types - Which omics types should be included?

  • All types (default) — transcriptomics, proteomics, metabolomics, epigenomics, genomics, single-cell, spatial, multi-omics
  • Transcriptomics only — bulk RNA-seq, microarray, scRNA-seq, spatial
  • Epigenomics only — ChIP-seq, ATAC-seq, methylation, CUT&RUN, Hi-C
  • Proteomics / metabolomics only — MS proteomics, metabolomics, lipidomics
  • Custom — user specifies which types to include

Q4 — Organism - Which organisms should be included?

  • All organisms (default) — human, mouse, and any other species
  • Human only — Homo sapiens studies only
  • Specific species — user specifies (e.g., mouse only, zebrafish only)

Q5 — Year range - Any restriction on publication/deposit year?

  • No restriction (default) — all years
  • Recent only — e.g., 2018 onwards, 2020 onwards
  • Custom range — user specifies start and/or end year

Q6 — Controlled-access repositories - Should we include controlled-access repositories (dbGaP, EGA, JGA)? These require institutional data access agreements to download data, but we can still catalog their existence and metadata. - Yes, include and flag them (default) — catalog with "Controlled" label - No, open-access only — skip dbGaP, EGA, JGA entirely

Q7 — Goal / output format - What is the primary goal?

  • Browse and select (default) — CSV catalog + Markdown summary for review
  • Landscape overview — also generate a visual overview figure (donut + timeline + repository bar)
  • Download and analyze — also include direct download links and file formats where available
  • All of the above

Q8 — Tissue or cell type focus (optional but significantly improves recall) - Are there specific tissues, cell types, or sample sources to prioritize? (e.g., "whole blood", "brain cortex", "CD34+ HSCs", "plasma", "tumor") These are added as targeted search terms to GEO and other repositories. If none specified, broad disease-name queries are used.

Q9 — Final table inclusion - Which relevance labels should appear in final user-facing tables?

  • Direct only (default) — final split tables include only datasets with direct disease/topic evidence; adjacent and weak candidates stay in the master audit catalog.
  • Direct + adjacent — include direct rows plus mechanistically related adjacent rows in final tables.
  • Direct + adjacent + weak — include broad or tangential weak rows too; use only when the user wants maximum recall in the final tables.

Q10 — Raw-read granularity - How should SRA/ENA raw-read repositories be represented?

  • Study/project level (default) — aggregate raw reads by study/project accession to avoid duplicate SRR/SRX run inflation.
  • Run level — expand individual raw runs only if the user explicitly wants run-level accessions.

Only proceed to Step 2 once all answers are collected. If the user provides partial information upfront (e.g., just the disease name), ask only the remaining unanswered questions.

Step 2 — Build search term matrix

Build the search term matrix according to the user's requested breadth.

Do not use a disease-specific hardcoded query manifest. Use a fixed query-generation protocol so the behavior is consistent while still adapting to the user's disease/topic.

For comprehensive inventory mode, generate a broad set of search queries by combining: - Primary disease name + synonyms - Omics-type-specific terms (RNA-seq, microarray, ChIP-seq, ATAC-seq, scRNA-seq, proteomics, metabolomics, methylation, WGS, SNP array, CUT&RUN, Hi-C, spatial transcriptomics, lipidomics) - Tissue/cell-type terms relevant to the disease - Clinical context terms (treatment, crisis, biomarker, pediatric, longitudinal) - Key gene/pathway terms central to the disease

For representative examples mode, use the same term categories but run fewer high-precision queries, prefer records with strong metadata and publication links, and do not describe the result as exhaustive.

For every query, record provenance fields on each returned row:

Field Meaning
source_repository Repository or aggregator queried
source_query Exact query string or structured API filters used
source_query_category One of disease_exact, synonym, omics_type, tissue_cell_type, clinical_context, gene_pathway, controlled_access, manual_web
source_api API/tool endpoint or web search source
retrieved_at Timestamp for the retrieval run
n_raw_hits Number of raw hits returned by that query before cross-query deduplication
n_records_after_dedup Number of unique records from that query category/source retained after accession-level deduplication

Save these provenance fields as <disease>_query_manifest.csv. source_query and related query fields explain why a candidate was retrieved, but they are provenance only. Never use source_query, source_query_category, source_repository, source_api, retrieved_at, or Query as positive relevance evidence for direct or adjacent labels.

query_manifest_rows = []

def add_query_manifest_row(source_repository, source_query, source_query_category,
                           source_api, retrieved_at, n_raw_hits, n_records_after_dedup):
    query_manifest_rows.append({
        "source_repository": source_repository,
        "source_query": source_query,
        "source_query_category": source_query_category,
        "source_api": source_api,
        "retrieved_at": retrieved_at,
        "n_raw_hits": n_raw_hits,
        "n_records_after_dedup": n_records_after_dedup,
    })

GEO query syntax tips:

  • Use [Title] field tag for high-precision hits: "sickle cell"[Title] AND "RNA-seq"
  • Use [DataSet Type] for omics filtering: "expression profiling by array"[DataSet Type]
  • Use retmax=100 per query; run 20–40 queries to maximize recall
  • Deduplicate by GEO UID across all queries

Step 3 — Query GEO (NCBI E-utilities) — PRIMARY SOURCE

import requests, time, pandas as pd

BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

# Comprehensive mode: build 20-40 queries combining disease + omics + tissue + gene terms.
# Representative mode: run a smaller high-precision subset and report that it is not exhaustive.
# Example for any disease:
search_queries = {
    "title_rnaseq":    f'"{disease}"[Title] AND "RNA-seq"',
    "title_scrna":     f'"{disease}"[Title] AND "single cell"',
    "title_array":     f'"{disease}"[Title] AND "expression profiling by array"[DataSet Type]',
    "title_chip":      f'"{disease}" AND "ChIP-seq"',
    "title_atac":      f'"{disease}" AND "ATAC-seq"',
    "title_methyl":    f'"{disease}" AND "methylation"',
    "title_wgs":       f'"{disease}" AND "whole genome sequencing"',
    "title_proteom":   f'"{disease}" AND "proteomics"',
    "title_metabolom": f'"{disease}" AND "metabolomics"',
    # Add synonym-based queries:
    # f'"{synonym}"[Title] AND "RNA-seq"' for each synonym
    # Add tissue-specific queries:
    # f'"{disease}" AND "{tissue_term}"' for key tissues
    # Add gene-specific queries:
    # f'"{disease}" AND "{key_gene}"' for key disease genes
}

all_ids = set()
for label, query in search_queries.items():
    r = requests.get(f"{BASE}esearch.fcgi",
                     params={"db": "gds", "term": query, "retmax": 100, "retmode": "json"})
    ids = r.json().get("esearchresult", {}).get("idlist", [])
    all_ids.update(ids)
    time.sleep(0.34)  # Stay under NCBI rate limit (3 req/sec without API key)

# Fetch summaries in batches of 20
records = []
for i in range(0, len(list(all_ids)), 20):
    batch = list(all_ids)[i:i+20]
    r = requests.get(f"{BASE}esummary.fcgi",
                     params={"db": "gds", "id": ",".join(batch), "retmode": "json"})
    result = r.json().get("result", {})
    for uid in result.get("uids", []):
        item = result[uid]
        records.append({
            "Accession": item.get("accession", ""),
            "Title": item.get("title", ""),
            "GEO_Type": item.get("gdstype", ""),
            "Organism": item.get("taxon", ""),
            "N_Samples": item.get("n_samples", ""),
            "Date": item.get("pdat", ""),
            "Summary": item.get("summary", "")[:500],
            "Repository": "GEO",
        })
    time.sleep(0.34)

df_geo = pd.DataFrame(records)
# Filter: GSE/GDS accessions only (exclude GPL platform records)
# By default, keep ALL organisms. Only filter by organism if user explicitly requested a specific species.
# Example organism filter (apply only if requested):
#   df_geo = df_geo[df_geo["Organism"].str.contains("Homo sapiens", na=False)]
df_geo = df_geo[df_geo["Accession"].str.startswith(("GSE", "GDS"))]

NCBI API key: Set api_key param to increase rate limit to 10 req/sec (register free at NCBI).

Step 4 — Query SRA (NCBI) for raw sequencing studies not in GEO

# SRA catches studies deposited as raw reads without a GEO series.
# Default: aggregate at study/project level to avoid SRR/SRX run-level inflation.
# Expand to individual runs only if the user explicitly requested run-level raw reads.
RAW_READ_GRANULARITY = "study_project"  # or "run_level" if explicitly requested
EXPAND_RAW_RUNS = RAW_READ_GRANULARITY == "run_level"

sra_records = []
for query in [disease] + synonyms[:3]:
    # Default: search all organisms. Add AND "Homo sapiens"[Organism] only if user requested human-only.
    r = requests.get(f"{BASE}esearch.fcgi",
                     params={"db": "sra", "term": f'"{query}"',
                             "retmax": 100, "retmode": "json"})
    ids = r.json().get("esearchresult", {}).get("idlist", [])
    if ids:
        r2 = requests.get(f"{BASE}esummary.fcgi",
                          params={"db": "sra", "id": ",".join(ids), "retmode": "json"})
        result = r2.json().get("result", {})
        for uid in result.get("uids", []):
            item = result[uid]
            exp_acc = item.get("experiment", {}).get("acc", "")
            study_acc = (
                item.get("study", {}).get("acc", "")
                or item.get("bioproject", "")
                or exp_acc
            )
            runs_obj = item.get("runs", {})
            runs = runs_obj.get("Run", []) if isinstance(runs_obj, dict) else []
            if isinstance(runs, dict):
                runs = [runs]
            run_accs = [run.get("acc", "") for run in runs if run.get("acc")]
            accession = ";".join(run_accs) if EXPAND_RAW_RUNS and run_accs else study_acc

            sra_records.append({
                "Accession": accession,
                "Raw_Run_Accessions": ";".join(run_accs),
                "Raw_Read_Granularity": RAW_READ_GRANULARITY,
                "Title": item.get("title", ""),
                "GEO_Type": item.get("exptype", ""),
                "Organism": item.get("organism", {}).get("ScientificName", "unknown"),
                "N_Samples": len(run_accs) if run_accs else (runs_obj.get("@total", "") if isinstance(runs_obj, dict) else ""),
                "Date": item.get("createdate", ""),
                "Summary": item.get("summary", "")[:500],
                "Repository": "SRA",
            })
    time.sleep(0.34)

Step 5 — Query ArrayExpress / BioStudies (EBI)

# BioStudies API — covers ArrayExpress functional genomics collection
# Source: https://www.ebi.ac.uk/biostudies/api/v1/search
biostudies_url = "https://www.ebi.ac.uk/biostudies/api/v1/search"
biostudies_records = []

for keyword in [disease] + synonyms[:2]:
    r = requests.get(biostudies_url,
                     params={"query": keyword, "collection": "arrayexpress",
                             "pageSize": 100, "page": 1},
                     headers={"Accept": "application/json"}, timeout=60)
    if r.status_code == 200:
        data = r.json()
        for hit in data.get("hits", []):
            biostudies_records.append({
                "Accession": hit.get("accession", ""),
                "Title": hit.get("title", ""),
                "GEO_Type": hit.get("type", ""),
                "Organism": "; ".join([o.get("scientificName","") for o in hit.get("organism",[])]),
                "N_Samples": hit.get("numberOfSamples", ""),
                "Date": hit.get("releaseDate", ""),
                "Summary": hit.get("description", "")[:500],
                "Repository": "ArrayExpress/BioStudies",
            })
    time.sleep(1)

df_biostudies = pd.DataFrame(biostudies_records).drop_duplicates(subset="Accession")
# Exclude GEO mirror accessions (E-GEOD- prefix = same study already captured via GEO)
# Keep all organisms by default. Apply organism filter only if user explicitly requested a specific species.
# Example: df_biostudies = df_biostudies[df_biostudies["Organism"].str.contains("Homo sapiens|human", case=False, na=False)]
df_biostudies = df_biostudies[~df_biostudies["Accession"].str.startswith("E-GEOD")]

Step 6 — Query PRIDE / ProteomeXchange (proteomics)

# PRIDE REST API v2
# Source: https://www.ebi.ac.uk/pride/ws/archive/v2/projects/search
pride_url = "https://www.ebi.ac.uk/pride/ws/archive/v2/projects/search"
pride_records = []

for keyword in [disease] + synonyms:
    r = requests.get(pride_url,
                     params={"keyword": keyword, "pageSize": 100, "page": 0},
                     headers={"Accept": "application/json"}, timeout=60)
    if r.status_code == 200:
        data = r.json()
        # Handle both list and dict response formats
        projects = data if isinstance(data, list) else data.get("_embedded", {}).get("compactprojects", [])
        for p in projects:
            pride_records.append({
                "Accession": p.get("accession", ""),
                "Title": p.get("title", ""),
                "GEO_Type": "Proteomics (MS)",
                "Organism": "; ".join(p.get("organisms", [])),
                "N_Samples": p.get("numberOfSamples", ""),
                "Date": p.get("submissionDate", ""),
                "Summary": p.get("projectDescription", "")[:500],
                "Repository": "PRIDE",
            })
    time.sleep(1)

df_pride = pd.DataFrame(pride_records).drop_duplicates(subset="Accession")
# Keep all organisms by default. Apply organism filter only if user explicitly requested a specific species.
# Example: df_pride = df_pride[df_pride["Organism"].str.contains("Homo sapiens|human", case=False, na=False)]

Also check jPOST and iProX (ProteomeXchange members not always in PRIDE search):

# jPOST REST API
# Source: https://repository.jpostdb.org/search
jpost_records = []
for keyword in [disease] + synonyms[:2]:
    r = requests.get("https://repository.jpostdb.org/search",
                     params={"keyword": keyword, "format": "json"}, timeout=30)
    if r.status_code == 200:
        for p in r.json().get("projects", []):
            jpost_records.append({
                "Accession": p.get("projectId", ""),
                "Title": p.get("title", ""),
                "GEO_Type": "Proteomics (MS)",
                "Organism": p.get("organism", ""),
                "N_Samples": "",
                "Date": p.get("releaseDate", ""),
                "Summary": p.get("description", "")[:500],
                "Repository": "jPOST",
            })
    time.sleep(1)

Step 7 — Query OmicsDI (metabolomics aggregator)

OmicsDI is the single best entry point for metabolomics — it aggregates MetaboLights, MassIVE, GNPS, Metabolomics Workbench, and Metabolon. Filter by source to avoid GEO duplicates.

# OmicsDI aggregator
# Source: https://www.omicsdi.org/ws/dataset/search
omicsdi_url = "https://www.omicsdi.org/ws/dataset/search"
omicsdi_records = []
METABOLOMICS_REPOS = {"MetaboLights", "MassIVE", "GNPS", "Metabolon",
                      "Metabolomics Workbench", "HMDB", "Lipidmaps"}

for keyword in [disease] + synonyms[:2]:
    r = requests.get(omicsdi_url,
                     params={"query": keyword, "size": 100, "start": 0},
                     headers={"Accept": "application/json"}, timeout=60)
    if r.status_code == 200:
        for ds in r.json().get("datasets", []):
            repo = ds.get("source", "")
            if repo in METABOLOMICS_REPOS:
                omicsdi_records.append({
                    "Accession": ds.get("id", ""),
                    "Title": ds.get("name", ""),
                    "GEO_Type": "Metabolomics",
                    "Organism": "; ".join(ds.get("organisms", {}).get("name", [])),
                    "N_Samples": "",
                    "Date": ds.get("publicationDate", ""),
                    "Summary": ds.get("description", "")[:500],
                    "Repository": repo,
                })
    time.sleep(1)

df_metabolomics = pd.DataFrame(omicsdi_records).drop_duplicates(subset="Accession")

Also query Metabolomics Workbench REST API directly for NIH-funded studies:

# Metabolomics Workbench REST API
# Source: https://www.metabolomicsworkbench.org/tools/mw_rest.php
mw_url = "https://www.metabolomicsworkbench.org/rest/study/study_title"
for keyword in [disease] + synonyms[:2]:
    r = requests.get(f"{mw_url}/{requests.utils.quote(keyword)}/summary/json", timeout=30)
    if r.status_code == 200 and r.text.strip():
        # Response is a dict of study_id -> study metadata
        for sid, meta in r.json().items():
            omicsdi_records.append({
                "Accession": sid,
                "Title": meta.get("study_title", ""),
                "GEO_Type": "Metabolomics",
                "Organism": meta.get("subject_species", ""),
                "N_Samples": meta.get("subject_count", ""),
                "Date": meta.get("submit_date", ""),
                "Summary": meta.get("study_summary", "")[:500],
                "Repository": "Metabolomics Workbench",
            })

Step 8 — Query CZ CELLxGENE (single-cell)

# CZ CELLxGENE Collections API
# Source: https://api.cellxgene.cziscience.com/curation/v1/collections
cxg_url = "https://api.cellxgene.cziscience.com/curation/v1/collections"
r = requests.get(cxg_url, headers={"Accept": "application/json"}, timeout=60)
cxg_records = []
if r.status_code == 200:
    for col in r.json():
        title = col.get("name", "")
        desc = col.get("description", "")
        # Filter by disease keywords
        text = (title + " " + desc).lower()
        if any(kw.lower() in text for kw in [disease] + synonyms):
            cxg_records.append({
                "Accession": col.get("collection_id", ""),
                "Title": title,
                "GEO_Type": "scRNA-seq",
                "Organism": "; ".join(sorted({d.get("organism","") for d in col.get("datasets",[]) if d.get("organism")})) or "unknown",
                "N_Samples": col.get("cell_count", ""),
                "Date": col.get("published_at", ""),
                "Summary": desc[:500],
                "Repository": "CZ CELLxGENE",
            })

# Also use cellxgene_census Python package for cell-level metadata queries:
# import cellxgene_census
# census = cellxgene_census.open_soma()
# obs = census["census_data"]["homo_sapiens"].obs.read(
#     value_filter=f'disease == "{disease_ontology_term}"'
# ).concat().to_pandas()

Step 9 — Query GDC / TCGA (cancer diseases)

Only relevant for cancer diseases. Skip for non-cancer diseases.

# GDC REST API
# Source: https://api.gdc.cancer.gov/
gdc_url = "https://api.gdc.cancer.gov/projects"
r = requests.get(gdc_url,
                 params={"filters": json.dumps({"op": "in", "content": {
                             "field": "disease_type", "value": [disease] + synonyms}}),
                         "fields": "project_id,name,disease_type,primary_site,summary",
                         "format": "json", "size": 100},
                 timeout=60)
# Parse and add to records with Repository = "GDC/TCGA"

Step 10 — Query ENCODE (epigenomics)

Only relevant when the disease has known epigenomic/regulatory components.

# ENCODE REST API — rate limit: 10 req/sec
# Source: https://www.encodeproject.org/help/rest-api
encode_url = "https://www.encodeproject.org/search/"
r = requests.get(encode_url,
                 params={"searchTerm": disease, "type": "Experiment",
                         "status": "released", "format": "json", "limit": 100},
                 headers={"Accept": "application/json"}, timeout=60)
if r.status_code == 200:
    for exp in r.json().get("@graph", []):
        encode_records.append({
            "Accession": exp.get("accession", ""),
            "Title": exp.get("description", ""),
            "GEO_Type": exp.get("assay_title", ""),
            "Organism": exp.get("organism", {}).get("scientific_name", ""),
            "N_Samples": len(exp.get("replicates", [])),
            "Date": exp.get("date_released", ""),
            "Summary": exp.get("biosample_summary", "")[:500],
            "Repository": "ENCODE",
        })

Step 11 — Query Expression Atlas (EBI)

# Expression Atlas REST API
# Source: https://www.ebi.ac.uk/gxa/json/experiments
# Default: no species filter — returns all organisms. Add species param only if user requested a specific organism.
atlas_url = "https://www.ebi.ac.uk/gxa/json/experiments"
r = requests.get(atlas_url, timeout=60)  # omit species= to get all organisms
atlas_records = []
if r.status_code == 200:
    for exp in r.json().get("experiments", []):
        text = (exp.get("experimentDescription","") + " " +
                " ".join(exp.get("factors",[])) + " " +
                " ".join(exp.get("experimentalFactors",[]))).lower()
        if any(kw.lower() in text for kw in [disease] + synonyms):
            atlas_records.append({
                "Accession": exp.get("experimentAccession", ""),
                "Title": exp.get("experimentDescription", ""),
                "GEO_Type": exp.get("experimentType", ""),
                "Organism": "; ".join(exp.get("species", [])) or "unknown",
                "N_Samples": exp.get("numberOfAssays", ""),
                "Date": exp.get("lastUpdate", ""),
                "Summary": "",
                "Repository": "Expression Atlas",
            })

Step 12 — Web search for Tier 3 repositories

Use the WebSearch tool for repositories without disease-specific APIs. Run all of these:

# Open repositories
WebSearch: "{disease}" omics dataset site:zenodo.org
WebSearch: "{disease}" RNA-seq proteomics site:figshare.com
WebSearch: "{disease}" omics data site:datadryad.org
WebSearch: "{disease}" omics dataset site:osf.io
WebSearch: "{disease}" omics dataset site:dataverse.harvard.edu
WebSearch: "{disease}" omics dataset site:synapse.org

# Controlled-access repositories
WebSearch: "{disease}" WGS genomics site:ncbi.nlm.nih.gov/gap
WebSearch: "{disease}" genome sequencing site:ega-archive.org
WebSearch: "{disease}" genomics site:ddbj.nig.ac.jp/jga

# Consortium/specialized
WebSearch: "{disease}" ICGC genomics dataset
WebSearch: "{disease}" CPTAC proteomics dataset
WebSearch: "{disease}" UK Biobank omics
WebSearch: "{disease}" AWS open data omics

Extract accession IDs, titles, and descriptions from search results. Add as rows with the appropriate Repository label. Flag dbGaP/EGA/JGA as Access = "Controlled".

Step 13 — Classify omics type

Apply rule-based classification to each record's GEO_Type, Title, and Summary. The GEO gdstype field is unreliable — always use title + summary as primary signal.

def classify_omics(row):
    combined = (str(row.get("GEO_Type","")) + " " +
                str(row.get("Title","")) + " " +
                str(row.get("Summary",""))).lower()

    # Single-cell (check before bulk RNA-seq)
    if any(x in combined for x in ["single cell", "scrna", "10x chromium", "dropseq",
                                    "smart-seq", "single-nucleus", "snrna"]):
        return "scRNA-seq"
    # Spatial transcriptomics
    elif any(x in combined for x in ["spatial transcriptom", "visium", "slide-seq",
                                      "merfish", "seqfish", "stereo-seq"]):
        return "Spatial Transcriptomics"
    # Chromatin accessibility
    elif "atac" in combined:
        return "ATAC-seq"
    # CUT&RUN / CUT&TAG
    elif any(x in combined for x in ["cut&run", "cut and run", "cutana", "cut&tag"]):
        return "CUT&RUN/CUT&TAG"
    # ChIP-seq
    elif any(x in combined for x in ["chip-seq", "chip seq", "binding/occupancy",
                                      "chromatin immunoprecipitation"]):
        return "ChIP-seq"
    # 3D genome
    elif any(x in combined for x in ["hi-c", "hic", "3d genome", "chromatin conformation",
                                      "capture-c", "4c-seq", "5c"]):
        return "Hi-C / 3D Genome"
    # DNA methylation
    elif any(x in combined for x in ["methylat", "bisulfite", "wgbs", "rrbs",
                                      "epic array", "850k", "450k", "dnmt"]):
        return "DNA Methylation"
    # miRNA / ncRNA
    elif any(x in combined for x in ["mirna", "microrna", "ncrna", "lncrna",
                                      "small rna", "pirna", "circrna"]):
        return "miRNA/ncRNA"
    # Ribo-seq / RIP-seq
    elif any(x in combined for x in ["ribo-seq", "ribosome profiling", "rip-seq",
                                      "clip-seq", "iclip"]):
        return "Other (Ribo/RIP/CLIP-seq)"
    # Bulk RNA-seq
    elif any(x in combined for x in ["high throughput sequencing", "rna-seq", "rnaseq",
                                      "mrna-seq", "total rna", "poly-a"]):
        return "Bulk RNA-seq"
    # Microarray
    elif any(x in combined for x in ["expression profiling by array", "microarray",
                                      "affymetrix", "illumina beadchip", "agilent"]):
        return "Microarray"
    # Genomics
    elif any(x in combined for x in ["wgs", "whole genome sequencing", "whole-genome",
                                      "snp array", "genotyping", "gwas", "exome"]):
        return "Genomics (WGS/SNP/Exome)"
    # Proteomics
    elif any(x in combined for x in ["proteom", "mass spectrometry", "lc-ms", "tmtpro",
                                      "label-free", "dia-ms", "dda-ms", "phosphoproteom"]):
        return "Proteomics (MS)"
    # Metabolomics / lipidomics
    elif any(x in combined for x in ["metabolom", "metabolite", "nmr", "gcms", "lcms",
                                      "lipidom", "lipidome", "untargeted ms"]):
        return "Metabolomics/Lipidomics"
    # Multi-omics
    elif any(x in combined for x in ["multi-omics", "multiomics", "multi omics",
                                      "integrat", "joint profil"]):
        return "Multi-omics"
    else:
        return "Other/Mixed"

Step 14 — Relevance audit and final-inclusion decision

Classify every candidate with an evidence-backed relevance label and an explicit include_in_final decision. Do not use a numeric score, confidence score, or rank score. Do not silently drop candidates. You may exclude out-of-scope or false-positive rows from final user-facing split tables, but excluded rows must remain auditable in the master catalog or excluded-candidates CSV with label, reason, evidence, source, and matched terms.

Tune keyword lists for each disease — the examples below are for SCD.

POSITIVE_EVIDENCE_FIELDS = [
    "Title",
    "Summary",
    "Description",
    "Abstract",
    "Publication",
    "GEO_Type",
    "Omics_Type",
    "Technology",
    "Organism",
    "Species",
    "Tissue",
    "Cell_Type",
]
NEGATIVE_EVIDENCE_FIELDS = POSITIVE_EVIDENCE_FIELDS + ["Access"]
PROVENANCE_ONLY_FIELDS = {
    "source_query",
    "source_query_category",
    "source_repository",
    "source_api",
    "retrieved_at",
    "Query",
    "Repository",
}

# Set from Q9. Defaults keep final user-facing tables strict and reproducible.
FINAL_TABLE_INCLUSION = "direct_only"  # direct_only | direct_adjacent | direct_adjacent_weak
INCLUDE_ADJACENT_IN_FINAL = FINAL_TABLE_INCLUSION in {"direct_adjacent", "direct_adjacent_weak"}
INCLUDE_WEAK_IN_FINAL = FINAL_TABLE_INCLUSION == "direct_adjacent_weak"

def _collect_hits(row, terms, fields):
    hits = []
    for field in fields:
        value = str(row.get(field, "") or "")
        value_lower = value.lower()
        for term in terms:
            if term.lower() in value_lower:
                hits.append({"term": term, "field": field, "value": value})
    return hits

def _include_in_final(label):
    if label == "direct":
        return True
    if label == "adjacent":
        return INCLUDE_ADJACENT_IN_FINAL
    if label == "weak":
        return INCLUDE_WEAK_IN_FINAL
    return False

def _short_evidence(hit, max_len=220):
    if not hit:
        return ""
    value = " ".join(hit["value"].split())
    term = hit["term"]
    idx = value.lower().find(term.lower())
    if idx == -1:
        return value[:max_len]
    start = max(0, idx - 80)
    end = min(len(value), idx + len(term) + 120)
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(value) else ""
    return prefix + value[start:end] + suffix

def audit_relevance(row, direct_terms, adjacent_terms, weak_terms, out_of_scope_terms, false_positive_terms):
    """
    Return required relevance columns:
    relevance_label, include_in_final, relevance_reason, relevance_evidence, evidence_source, matched_terms.

    direct_terms: disease/topic terms indicating patient data, primary disease biology, or a validated model
    adjacent_terms: mechanistic, pathway, gene, tissue, treatment, or model terms relevant to the topic
    weak_terms: broad or tangential terms that may still be useful for user review
    out_of_scope_terms: terms showing the row is outside requested omics/species/tissue/scope
    false_positive_terms: terms showing the row is unrelated to the disease/topic
    """
    # Do not include source_query/Query/provenance fields here. They explain retrieval,
    # not biological relevance.
    direct_hits = _collect_hits(row, direct_terms, POSITIVE_EVIDENCE_FIELDS)
    adjacent_hits = _collect_hits(row, adjacent_terms, POSITIVE_EVIDENCE_FIELDS)
    weak_hits = _collect_hits(row, weak_terms, POSITIVE_EVIDENCE_FIELDS)
    out_of_scope_hits = _collect_hits(row, out_of_scope_terms, NEGATIVE_EVIDENCE_FIELDS)
    false_positive_hits = _collect_hits(row, false_positive_terms, NEGATIVE_EVIDENCE_FIELDS)

    if false_positive_hits and not direct_hits:
        label = "false_positive"
        primary_hits = false_positive_hits
        reason = "Excluded from final tables: false-positive term matched and no direct disease/topic evidence was found."
    elif out_of_scope_hits and not direct_hits:
        label = "out_of_scope"
        primary_hits = out_of_scope_hits
        reason = "Excluded from final tables: row is outside the requested omics/species/tissue/scope."
    elif direct_hits:
        label = "direct"
        primary_hits = direct_hits + adjacent_hits[:3]
        reason = "Directly relevant: disease/topic term matched in dataset metadata."
    elif adjacent_hits:
        label = "adjacent"
        primary_hits = adjacent_hits
        reason = "Adjacent relevance: related mechanism, gene, tissue, model, or treatment term matched in dataset metadata, but direct disease/topic evidence was not found."
    elif weak_hits:
        label = "weak"
        primary_hits = weak_hits
        reason = "Weak relevance: only broad or tangential metadata terms matched; retain for user review in the master catalog."
    else:
        label = "weak"
        primary_hits = []
        reason = "Weak relevance: retained from repository query but no configured relevance terms matched in available metadata; do not use query provenance as evidence."

    matched_terms = sorted({hit["term"] for hit in primary_hits})
    evidence_fields = sorted({hit["field"] for hit in primary_hits})
    if label in {"direct", "adjacent"} and any(field in PROVENANCE_ONLY_FIELDS for field in evidence_fields):
        raise ValueError(
            f"{label} evidence used provenance-only fields {evidence_fields}. "
            "Do not use source_query, Query, repository, or API metadata as relevance evidence."
        )

    return {
        "relevance_label": label,
        "include_in_final": _include_in_final(label),
        "relevance_reason": reason,
        "relevance_evidence": _short_evidence(primary_hits[0]) if primary_hits else "",
        "evidence_source": "; ".join(evidence_fields),
        "matched_terms": "; ".join(matched_terms),
    }

Relevance label definitions:

Label Meaning Catalog action
direct Actual disease/topic samples, patient data, primary disease biology, or validated disease model include_in_final = true
adjacent Mechanistically relevant gene/pathway/treatment/tissue/model but not clearly disease/topic samples include_in_final = false by default; set true only if the user explicitly asks to include adjacent rows in final tables
weak Broad, tangential, incomplete, or metadata-sparse candidate needing user review include_in_final = false by default; set true only if the user explicitly asks for maximum-recall final tables
out_of_scope Dataset is real but outside requested omics/species/tissue/year/access scope include_in_final = false; keep in audit trail
false_positive Likely unrelated hit based on exclusion terms and lack of direct/adjacent evidence include_in_final = false; keep in audit trail

Example keyword lists for SCD:

direct_terms = ["sickle cell disease", "sickle cell anemia", "sickle-cell disease",
                "sickle-cell anemia", "sickle cell", "HbSS", "hemoglobin S",
                "vaso-occlus", "sickling", "SCA patient", "SCD patient"]
adjacent_terms = ["HbF", "fetal hemoglobin", "BCL11A", "globin switching",
                  "erythroid", "CD34", "hydroxyurea", "hemoglobinopathy"]
weak_terms = ["sickle cell trait", "HbAS", "beta-thalassemia", "thalassemia",
              "anemia", "erythrocyte", "hemoglobin"]
out_of_scope_terms = ["non-omics", "review article", "protocol only"]  # tune per run
false_positive_terms = ["lung cancer", "gastric cancer", "colon cancer",
                        "BACH1 lung", "unrelated disease"]  # tune per run

Manually review excluded candidates in the Markdown summary. The point of the audit is transparent triage, not silent removal.

Step 15 — Assemble one master catalog

import pandas as pd

# Combine all sources
all_dfs = [df for df in [df_geo, df_sra, df_biostudies, df_pride, df_jpost,
                          df_metabolomics, df_cxg, df_encode, df_atlas,
                          df_zenodo, df_figshare, df_controlled]
           if df is not None and len(df) > 0]
df_all = pd.concat(all_dfs, ignore_index=True)

# Deduplicate by accession (cross-repository duplicates are common)
df_all = df_all.drop_duplicates(subset="Accession")

# Apply omics classification
df_all["Omics_Type"] = df_all.apply(classify_omics, axis=1)

# Apply relevance audit and expand required audit fields
audit_rows = pd.DataFrame(
    [audit_relevance(row, direct_terms, adjacent_terms, weak_terms, out_of_scope_terms, false_positive_terms)
     for _, row in df_all.iterrows()],
    index=df_all.index,
)
df_all = pd.concat([df_all, audit_rows], axis=1)

REQUIRED_RELEVANCE_COLUMNS = [
    "relevance_label",
    "include_in_final",
    "relevance_reason",
    "relevance_evidence",
    "evidence_source",
    "matched_terms",
]
QUERY_MANIFEST_COLUMNS = [
    "source_repository",
    "source_query",
    "source_query_category",
    "source_api",
    "retrieved_at",
    "n_raw_hits",
    "n_records_after_dedup",
]

def require_relevance_columns(df, table_name):
    missing = [col for col in REQUIRED_RELEVANCE_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(
            f"{table_name} is missing required relevance audit columns: {missing}. "
            "User-requested columns are additive, not exclusive; do not save this table."
        )
    if "Relevance" in df.columns and "relevance_label" not in df.columns:
        raise ValueError(
            f"{table_name} renamed relevance_label to Relevance. "
            "Use the exact required column names."
        )
    return df

def require_query_manifest_columns(df):
    missing = [col for col in QUERY_MANIFEST_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"query manifest is missing required columns: {missing}")
    return df

# Add access level column
CONTROLLED_REPOS = {"dbGaP", "EGA", "JGA", "UK Biobank", "FinnGen"}
df_all["Access"] = df_all["Repository"].apply(
    lambda r: "Controlled" if any(c in r for c in CONTROLLED_REPOS) else "Open"
)

# Sort for readability without saving any numeric score/rank column
label_order = {"direct": 0, "adjacent": 1, "weak": 2, "out_of_scope": 3, "false_positive": 4}
df_all["_label_order"] = df_all["relevance_label"].map(label_order)
df_all = df_all.sort_values(["_label_order", "Date"], ascending=[True, False]).drop(columns=["_label_order"])

# Save one CSV by default
df_all = require_relevance_columns(df_all, "master catalog")
df_all.to_csv(f"{output_dir}/{disease_slug}_omics_datasets_MASTER.csv", index=False)

query_manifest = pd.DataFrame(query_manifest_rows, columns=QUERY_MANIFEST_COLUMNS)
query_manifest = require_query_manifest_columns(query_manifest)
query_manifest.to_csv(f"{output_dir}/{disease_slug}_query_manifest.csv", index=False)

print(f"Total datasets: {len(df_all)}")
print(f"direct: {(df_all.relevance_label=='direct').sum()}")
print(f"adjacent: {(df_all.relevance_label=='adjacent').sum()}")
print(f"weak: {(df_all.relevance_label=='weak').sum()}")
print(f"out_of_scope: {(df_all.relevance_label=='out_of_scope').sum()}")
print(f"false_positive: {(df_all.relevance_label=='false_positive').sum()}")
print(f"final table rows: {(df_all.include_in_final == True).sum()}")
print(f"adjacent retained in master only: {((df_all.relevance_label == 'adjacent') & (df_all.include_in_final == False)).sum()}")
print(f"weak retained in master only: {((df_all.relevance_label == 'weak') & (df_all.include_in_final == False)).sum()}")
print(f"excluded from final: {(df_all.include_in_final == False).sum()}")
print(f"query manifest: {output_dir}/{disease_slug}_query_manifest.csv")

Step 15b — Optional split tables must preserve relevance columns

If the user asks for separate gene-expression/proteomics tables, create them only as derived views of the audited master catalog. Do not rebuild them from scratch and do not select only the user's requested display columns. Always append the required relevance audit columns. Final split tables should include only rows where include_in_final == true. By default, this means direct rows only. adjacent and weak rows remain in the master catalog unless Q9 explicitly requested a broader final table. Excluded rows must remain in the master catalog or be written to a separate excluded-candidates CSV.

USER_REQUESTED_COLUMNS = [
    "Accession", "Title", "Species", "Tissue", "N_Samples",
    "Technology", "Publication", "PMID",
]

def make_output_table(df, requested_columns, table_name):
    cols = []
    for col in requested_columns:
        if col in df.columns and col not in cols:
            cols.append(col)
    for col in REQUIRED_RELEVANCE_COLUMNS:
        if col not in cols:
            cols.append(col)
    out = df[cols].copy()
    return require_relevance_columns(out, table_name)

df_final = df_all[df_all["include_in_final"] == True].copy()

df_gene_expression = make_output_table(
    df_final[
        df_final["Omics_Type"].isin(["Bulk RNA-seq", "Microarray", "scRNA-seq", "Spatial Transcriptomics", "miRNA/ncRNA"])
    ],
    USER_REQUESTED_COLUMNS,
    "gene expression table",
)
df_proteomics = make_output_table(
    df_final[
        df_final["Omics_Type"].isin(["Proteomics (MS)"])
    ],
    USER_REQUESTED_COLUMNS,
    "proteomics table",
)
df_excluded = require_relevance_columns(
    df_all[df_all["include_in_final"] == False].copy(),
    "excluded candidates table",
)

df_gene_expression.to_csv(f"{output_dir}/{disease_slug}_gene_expression_datasets.csv", index=False)
df_proteomics.to_csv(f"{output_dir}/{disease_slug}_proteomics_datasets.csv", index=False)
df_excluded.to_csv(f"{output_dir}/{disease_slug}_excluded_candidates.csv", index=False)

Step 16 — Generate Markdown summary report

Write <disease>_omics_summary.md with these sections:

  1. Overview: total candidates found, final table rows, repositories searched, date of search, whether the user requested comprehensive inventory or representative examples, final inclusion rule, raw-read aggregation mode, and query manifest path
  2. By repository: table of dataset counts per repository
  3. By omics type: table of dataset counts per omics type, split by relevance_label and include_in_final when useful
  4. Relevance breakdown: counts per label with definitions of direct, adjacent, weak, out_of_scope, and false_positive; explicitly report final direct rows, adjacent rows retained in master only, weak rows retained in master only, and excluded rows
  5. Evidence examples: representative rows for each relevance label showing relevance_reason, relevance_evidence, evidence_source, and matched_terms
  6. Top direct datasets: largest N, most recent, most cited (if available); include adjacent datasets in this section only if Q9 requested adjacent rows in final tables
  7. Excluded candidates: counts by exclusion label (out_of_scope, false_positive) and 5-10 representative examples with evidence and matched terms
  8. Query provenance: count by source_query_category and repository; mention exact query categories used and link/name <disease>_query_manifest.csv
  9. Controlled-access datasets: list with access instructions
  10. Coverage gaps and limitations: which repositories failed, which APIs require keys, which omics types are data-sparse for this disease

Explicitly state that the CSV is an inventory with relevance annotations, not a silently filtered list. Explicitly state that source_query is query provenance, not relevance evidence. If rows were excluded from final tables, state that they were excluded by include_in_final=false and are available in the master audit catalog or excluded-candidates CSV. If adjacent or weak rows were kept out of final tables, state that they remain available in the master catalog. If the user requested representative examples, state that the output is not exhaustive. When rendering Markdown tables, include the required relevance audit columns even when the user asked for specific biomedical metadata columns. It is acceptable to make the table wide; do not omit relevance evidence to make the table shorter.

Step 17 — Generate overview figure (if requested)

Create a 3-panel figure saved as <disease>_omics_landscape.png:

import matplotlib.pyplot as plt, matplotlib
matplotlib.rcParams['font.family'] = ['Liberation Sans', 'Arimo', 'DejaVu Sans']
matplotlib.rcParams['svg.fonttype'] = 'none'

fig, axes = plt.subplots(1, 3, figsize=(18, 6))

# Panel A: Donut chart — direct datasets by omics type
# Panel B: Horizontal bar chart — all datasets by repository
# Panel C: Stacked bar timeline — datasets per year, colored by omics type

fig.savefig(f"{output_dir}/{disease_slug}_omics_landscape.png", dpi=150, bbox_inches="tight")

After saving, run Read(file_path=..., mode="media_output_check") to verify the figure renders correctly.


Known API Limitations and Workarounds

Repository Issue Workaround
GEO retmax=100 per query cap Run 20–40 targeted queries; use [Title] field tags
PRIDE Response is list or dict depending on result count data if isinstance(data, list) else data.get("_embedded", {}).get("compactprojects", [])
MetaboLights Direct search API returns 404 Use OmicsDI aggregator instead
Metabolomics Workbench REST API requires exact study title match Use keyword search via OmicsDI; use REST for known study IDs
ArrayExpress/BioStudies Frequently times out Retry with shorter timeout; most studies mirrored in GEO
DisGeNET API v7 requires paid API key; returns HTML without key Skip; use GEO + PubMed searches instead
GWAS Catalog findByDiseaseTrait returns 0 for many disease names (trait name mismatch) Search by EFO ontology ID; browse manually at https://www.ebi.ac.uk/gwas/
CellxGene No disease-specific collection tagging for rare diseases Filter by disease field in Census cell metadata
ENCODE Rate limit 10 req/sec Add time.sleep(0.1) between requests
dbGaP / EGA / JGA No open programmatic search API Web search + manual curation; flag as Controlled access
Zenodo / Figshare / Dryad No disease-specific API WebSearch with site: operator
Synapse Requires login for some datasets Web search for public projects; note login requirement
GDC/TCGA Only relevant for cancer diseases Skip for non-cancer diseases
jPOST / iProX API may be unstable Fall back to ProteomeCentral search at http://central.proteomexchange.org

Scientific Caveats

  1. GEO gdstype field is unreliable: Often says "Other" or "Expression profiling by high throughput sequencing" for ChIP-seq, ATAC-seq, and Ribo-seq studies. Always re-classify using title + summary text. Many studies are misclassified as "Hi-C" when they are RNA-seq.

  2. Cross-repository duplicates are common: The same study may appear in GEO, ArrayExpress, SRA, and OmicsDI. Deduplicate by accession; note that GEO accessions (GSE) and ArrayExpress accessions (E-GEOD-) often refer to the same study.

  3. Controlled-access datasets: dbGaP, EGA, and JGA studies require institutional data access agreements. Include them in the catalog but clearly label as "Controlled" access.

  4. Relevance audit is keyword-based: Automated labeling produces false positives (unrelated studies mentioning disease keywords) and false negatives (relevant studies with unusual terminology). Query strings are provenance only and must not be used as biological evidence. Keep all candidates in the master audit CSV and manually review out_of_scope and false_positive rows before making exclusion decisions.

  5. GEO retmax cap: Each individual query is capped at 100 results. For common diseases (e.g., Alzheimer's, cancer), run many targeted queries to maximize recall.

  6. PRIDE sample counts: Not always reported in the API response. Report as "N/A" when missing.

  7. CellxGene disease ontology: CellxGene uses MONDO ontology terms for disease annotation. Look up the correct MONDO ID for your disease before filtering cell metadata.

  8. Metabolomics is data-sparse: For most rare diseases, metabolomics datasets are few. Check Metabolomics Workbench and HMDB disease pages manually if OmicsDI returns nothing.

  9. Spatial transcriptomics is emerging: Most spatial data (Visium, MERFISH) is in GEO or CellxGene. The field is growing rapidly — search with recent year filters.

  10. Raw-read repositories inflate counts: SRA and ENA can expose many SRR/SRX run records for one biological study. Default to study/project-level aggregation and expand run-level rows only when explicitly requested.


Example Usage

User prompt: "Find all publicly available omics datasets for Alzheimer's disease"

Skill execution:

  1. Synonyms: ["AD", "Alzheimer", "LOAD", "EOAD", "dementia", "amyloid", "tau"]
  2. Tissue terms: brain, cortex, hippocampus, CSF, blood, iPSC neurons, microglia, astrocytes
  3. Gene terms: APOE, APP, PSEN1, PSEN2, TREM2, BIN1, CLU, ABCA7
  4. Run 30+ GEO queries, ArrayExpress, PRIDE, OmicsDI, CellxGene, Expression Atlas, Zenodo, Figshare
  5. Classify omics types, audit relevance using metadata fields only, and keep adjacent/weak rows in the master catalog by default
  6. Output: alzheimers_omics_datasets_MASTER.csv, alzheimers_query_manifest.csv, alzheimers_omics_summary.md

User prompt: "Find gene expression and proteomics datasets for sickle cell disease and summarize into separate tables with accession, title, species, tissue, samples, technology, publication, PMID"

Skill execution:

  1. Ask search breadth / organism / year / access / final-table inclusion / raw-read granularity questions.
  2. Build one audited master catalog first.
  3. Create optional split gene-expression and proteomics CSVs as direct-only views of the master catalog by default.
  4. Include the user's requested columns plus relevance_label, include_in_final, relevance_reason, relevance_evidence, evidence_source, and matched_terms in each split CSV and Markdown table.
  5. Do not use source_query as relevance evidence; use title, summary, publication, organism, tissue, technology, or other dataset metadata.
  6. Do not rename relevance_label to Relevance; do not save the split CSVs if any required relevance audit column is missing.
  7. Save scd_query_manifest.csv; if adjacent/weak/false-positive/out-of-scope records are excluded from the split tables, keep them in the master catalog and summarize counts with evidence.

User prompt: "What RNA-seq data is available for BCL11A in erythroid cells?"

Skill execution:

  1. Topic: BCL11A erythroid
  2. Synonyms: ["BCL11A", "CTIP1", "fetal hemoglobin", "HbF", "globin switching"]
  3. Tissue terms: erythroid, HUDEP, CD34, erythroblast, reticulocyte
  4. Run targeted GEO + ArrayExpress queries; skip PRIDE/metabolomics (RNA-seq focus)
  5. Output: bcl11a_erythroid_omics_datasets_MASTER.csv

User prompt: "Survey all cancer proteomics datasets for pancreatic ductal adenocarcinoma"

Skill execution:

  1. Synonyms: ["PDAC", "pancreatic cancer", "pancreatic adenocarcinoma"]
  2. Query PRIDE, jPOST, iProX, GDC/TCGA (PAAD project), CPTAC, cBioPortal
  3. Skip metabolomics-only repos; focus on MS proteomics
  4. Output: pdac_omics_datasets_MASTER.csv

Code preview

No Python/R preview files were found.

Companion files

TypePathBytes
MarkdownSKILL.md61,519
JSONskill.meta.json1,191