Preclinical Literature Mining

Find preclinical studies and extract structured in vitro / in vivo details.

Overview

Problem. Evidence is scattered across more papers than anyone can read.

Use when: Surveying preclinical evidence for a target

Avoid when: You need to treat the extracted fields as ground truth

Learning goals

Structured extraction turns abstracts into tables
Watch in vitro vs in vivo concordance

Tutorial

Search Consensus (consensus.app) for preclinical studies on a molecular target in a disease, then extract structured in vitro and in vivo experiment details from each paper.

When to Use This Skill

Use this skill when you need to:

Survey preclinical evidence for a drug target in a disease indication
Extract in vitro experiments — cell lines, assays (viability, migration, apoptosis, etc.), key findings
Extract in vivo experiments — animal models (xenograft, PDX, syngeneic, transgenic), endpoints, key findings
Identify common model systems — which cell lines and animal models are most used for your target
Compare in vitro vs in vivo concordance — papers reporting both experiment types
Support IND-enabling decisions — compile preclinical evidence landscape

Do NOT use this skill for:

❌ Clinical trial literature (use literature-review instead)
❌ Automated full-text parsing (the scripts parse abstracts only; the agent reads full text for the top papers manually in Step 5)
❌ Meta-analysis or statistical pooling of preclinical results
❌ Citation management / formatting only

Installation

Python (Search + Extraction)

pip install requests pandas

PDF Report Generation (Optional)

pip install reportlab

R (Visualization)

install.packages(c("ggplot2", "ggprism", "dplyr", "tidyr", "patchwork"))
# Optional for high-quality SVG:
install.packages("svglite")

Package Licenses

Software	Version	License	Commercial Use	Installation
requests	≥2.25	Apache 2.0	✅ Permitted	`pip install requests`
pandas	≥1.3	BSD	✅ Permitted	`pip install pandas`
reportlab	≥3.6	BSD	✅ Permitted	`pip install reportlab`
ggplot2	≥3.4	MIT	✅ Permitted	`install.packages("ggplot2")`
ggprism	≥1.0.3	GPL (≥3)	✅ Permitted	`install.packages("ggprism")`
dplyr	≥1.1	MIT	✅ Permitted	`install.packages("dplyr")`
tidyr	≥1.3	MIT	✅ Permitted	`install.packages("tidyr")`
patchwork	≥1.1	MIT	✅ Permitted	`install.packages("patchwork")`

API Requirements:

Consensus API: Requires the CONSENSUS_API_KEY environment variable. Get a key at https://consensus.app/home/api/ and export it in your shell:

export CONSENSUS_API_KEY="your_key_here"
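
A fail-fast check for the key before the search step can save a confusing error later. The following is a minimal sketch and not part of the skill's scripts; the helper name `require_api_key` and its `env` parameter are hypothetical and exist only for illustration.

```python
import os


def require_api_key(env=None, name="CONSENSUS_API_KEY"):
    """Return the API key from the environment, or raise if it is missing."""
    # `env` defaults to the real process environment; passing a plain dict
    # makes the check easy to exercise without touching os.environ.
    env = os.environ if env is None else env
    key = (env.get(name) or "").strip()
    if not key:
        raise RuntimeError(f"{name} not set; export it before running Step 1")
    return key
```

Calling `require_api_key()` with no arguments reads the real environment and raises a `RuntimeError` when the variable is unset or blank.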

Inputs

Required Inputs

Target — Molecular target name (e.g., "CDK4/6", "BRAF", "PD-L1", "HER2")
Disease — Disease context (e.g., "breast cancer", "melanoma", "NSCLC")

Optional Inputs

max_results — Maximum papers to retrieve (default: 50, API max per query: 50)
years — Search last N years (default: 5)

Outputs

Generated Files

File	Description
`preclinical_search_results.csv`	All papers with metadata (PMID, DOI, title, abstract, etc.)
`experiment_extraction.csv`	Per-paper extraction: experiment type, cell lines, assays, models, endpoints, findings
`preclinical_synthesis_report.md`	Structured markdown report with narrative synthesis, frequency tables, hyperlinked references, and full-text insights
`preclinical_synthesis_report.pdf`	Publication-quality PDF with Introduction, Methods, Results, Conclusions
`preclinical_plots.png`	4-panel visualization (300 DPI)
`preclinical_plots.svg`	Vector format (with graceful fallback)
`analysis_object.pkl`	Complete analysis object for downstream use
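
Once the extraction CSV exists, a quick tally shows how many papers fall into each experiment type and what share report both. This is a rough sketch using only the standard library. The column name `experiment_type` is an assumption (check the real header row), the labels follow the four categories named in the plotting script's header comment, and the sample rows are hypothetical, not real extraction output.

```python
import csv
import io
from collections import Counter


def experiment_type_counts(csv_text, column="experiment_type"):
    """Count papers per experiment type in CSV text.

    `column` is an assumed header name; check the real file's header row.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    # Rows with a missing or empty value are counted as "unclassified".
    return Counter((row.get(column) or "unclassified").strip() for row in reader)


# Hypothetical rows, for illustration only (not real extraction output).
sample = (
    "paper_id,experiment_type\n"
    "P1,both\n"
    "P2,in vitro\n"
    "P3,in vitro\n"
    "P4,\n"
)
counts = experiment_type_counts(sample)
# Share of papers that report both in vitro and in vivo experiments.
share_both = counts["both"] / sum(counts.values())
print(counts.most_common(), round(share_both, 2))
```

For a real run, read the text of `preclinical_results/experiment_extraction.csv` and pass it in place of `sample`.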

Analysis object (pickle):

analysis_object.pkl — Contains search results, experiments, synthesis
Load with: import pickle; obj = pickle.load(open('analysis_object.pkl', 'rb'))
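
A `with` block closes the file handle once the object is loaded. The sketch below uses hypothetical helper names (`load_analysis_object`, `top_level_keys`). The object's internal layout is not documented here, so the code inspects it rather than assuming specific keys. Unpickling can execute arbitrary code, so load only files you generated yourself.

```python
import pickle
from pathlib import Path


def load_analysis_object(path):
    """Load a pickled analysis object, closing the file afterwards."""
    with Path(path).open("rb") as fh:
        return pickle.load(fh)


def top_level_keys(obj):
    """List the object's top-level keys if it is a dict, else its type name."""
    if isinstance(obj, dict):
        return sorted(str(k) for k in obj)
    return [type(obj).__name__]
```

A typical call would be `top_level_keys(load_analysis_object("preclinical_results/analysis_object.pkl"))`, assuming the default output directory used in the workflow.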

Clarification Questions

🚨 ALWAYS ask Question 1 FIRST.

1. Target and Disease (ASK THIS FIRST):

Do you have a specific target and disease to search?
- If so: provide the target name and disease
Or would you like to try the skill with an example search?
- a) CDK4/6 in triple-negative breast cancer
- b) KRAS in pancreatic cancer
- c) PD-L1 in non-small cell lung cancer
- d) BRAF in melanoma

🚨 IF EXAMPLE SEARCH SELECTED: All parameters are pre-defined (last 5 years, 50 results). DO NOT ask questions 2-3. Proceed directly to Step 1.

Questions 2-3 are ONLY for users providing their own target/disease:

2. Search Parameters:

Date range? (default: last 5 years)
- a) Last 3 years
- b) Last 5 years (default)
- c) Last 10 years

3. Results Scope:

Maximum papers to retrieve?
- a) 20 (quick scan)
- b) 50 (standard, default)

4. Full-Text Depth:

How many papers should be reviewed in full text?
- a) Up to 30 papers (default — deeper analysis, takes longer)
- b) Up to 10 papers (faster, focuses on highest-impact papers)
- c) All available open-access papers (comprehensive, slowest)

Standard Workflow

🚨 MANDATORY: USE SCRIPTS EXACTLY AS SHOWN - DO NOT WRITE INLINE CODE 🚨

⚠️ CRITICAL - DO NOT:

❌ Write inline Consensus/PubMed API code → STOP: Use search_preclinical()
❌ Write inline extraction code → STOP: Use extract_all_experiments()
❌ Write inline plotting code (ggplot, ggsave, etc.) → STOP: Use generate_all_plots()
❌ Write custom export code → STOP: Use export_all()
❌ Try to install svglite → script handles SVG fallback automatically

Steps 1-4 are automated (scripts). Step 5 is agent-guided (manual full-text reading).

Step 1 — Search for preclinical studies:

import sys
sys.path.append("scripts")
from preclinical_search import search_preclinical

results = search_preclinical(
    target="CDK4/6",
    disease="triple-negative breast cancer",
    max_results=50,
    years=5,
    output_dir="preclinical_results"
)

DO NOT write inline Consensus API code. Use the script.

Step 2 — Extract in vitro and in vivo experiments:

from extract_experiments import extract_all_experiments

experiments = extract_all_experiments(results, output_dir="preclinical_results")

DO NOT write inline extraction code. The script handles all keyword matching.

Step 3 — Generate visualizations:

source("scripts/generate_plots.R")
generate_all_plots(input_dir = "preclinical_results", output_dir = "preclinical_results")

DO NOT write inline plotting code (ggplot, ggsave, etc.). Just source the script.

The script handles PNG + SVG export with graceful fallback for SVG dependencies.

Step 4 — Synthesize and export results:

from preclinical_synthesis import synthesize_preclinical, export_all

synthesis = synthesize_preclinical(results, experiments, target="CDK4/6")
export_all(results, experiments, synthesis,
           target="CDK4/6", disease="triple-negative breast cancer",
           output_dir="preclinical_results")

DO NOT write custom export code. Use export_all().

Step 5 — Full-text deep dive (top papers):

Read the full-text enrichment guide and follow its instructions to read full text for the top papers (default: up to 30). Replace the ## Full-Text Insights placeholder in the report with per-paper findings.

DO NOT skip this step. Select papers based on the criteria in the guide.

✅ VERIFICATION — You should see:

After Step 1: "✓ Literature search completed successfully!"
After Step 2: "✓ Experiment extraction completed successfully!"
After Step 3: "✓ All plots generated successfully!"
After Step 4: "=== Export Complete ==="
After Step 5: Report contains ## Full-Text Insights section with per-paper details

❌ IF YOU DON'T SEE THESE: You most likely wrote inline code, or a script failed. Stop, use the scripts, and check Common Issues below.

Common Issues

Issue	Cause	Solution
"CONSENSUS_API_KEY not set"	API key missing	`export CONSENSUS_API_KEY='your_key_here'` — get key at https://consensus.app/home/api/
"Invalid or expired API key"	Bad API key	Verify your CONSENSUS_API_KEY is valid and not expired
"HTTP 429: Rate limited"	Consensus rate limit exceeded	The script retries with exponential backoff. If the error persists, wait and retry
"No results found"	Query too specific or target name mismatch	Try alternative target names (e.g., "CDK4" vs "CDK4/6" vs "cyclin-dependent kinase 4")
"experiment_extraction.csv not found"	Step 2 not run before Step 3	Run Steps 1-2 (Python) before Step 3 (R)
"Most papers classified as unclassified"	Abstracts don't contain expected keywords	Expected for some targets — check if papers use different terminology
"Missing R package: ggprism"	R packages not installed	`install.packages(c("ggplot2", "ggprism", "dplyr", "tidyr", "patchwork"))`
SVG export failed	Missing svglite dependency	Normal — script falls back to base R svg() device. PNG always generated
"PDF skipped"	Missing reportlab package	`pip install reportlab`. Markdown report always generated regardless
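
The search script already retries on HTTP 429, so the following only illustrates what exponential backoff means; it is not the script's actual code. The function name, retry count, and delays are assumptions. `send` is any zero-argument callable that returns an object with a `status_code` attribute, such as a `requests.Response`.

```python
import time


def get_with_backoff(send, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call send() until it returns a non-429 response or retries run out."""
    for attempt in range(max_retries + 1):
        response = send()
        if response.status_code != 429:
            return response
        if attempt == max_retries:
            break  # no point sleeping after the final attempt
        # Wait 1x, 2x, 4x, ... the base delay between successive attempts.
        sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"still rate limited after {max_retries} retries")
```

With `requests`, `send` could be `lambda: requests.get(url, params=params, headers=headers, timeout=30)`, where `url`, `params`, and `headers` are placeholders you supply. Passing `sleep` as a parameter keeps the function testable without real waiting.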

⚠️ IF SCRIPTS FAIL — Script Failure Hierarchy:

Fix and Retry (90%) — Install missing package, re-run script
Modify Script (5%) — Edit the script file itself, document changes
Use as Reference (4%) — Read script, adapt approach, cite source
Write from Scratch (1%) — Only if genuinely impossible, explain why

NEVER skip directly to writing inline code without trying the script first.

Suggested Next Steps

After extracting preclinical experiments:

Deep dive — Read full-text papers for the most relevant "both" papers (in vitro + in vivo)
Expand search — Try alternative target names or broader disease terms
Functional enrichment — Use functional-enrichment-from-degs on genes from relevant pathways
Literature review — Use literature-review for broader context including clinical studies
Target gene analysis — Use chip-atlas-target-genes to identify transcription factor targets

Related Skills

Upstream Skills

literature-review — General literature search and synthesis (broader scope)

Downstream Skills

functional-enrichment-from-degs — Pathway analysis of target-related genes
chip-atlas-target-genes — Identify TF binding targets for your target gene
chip-atlas-peak-enrichment — Check ChIP-seq enrichment near your target

References

Search Strategy

[references/preclinical-search-guide.md] — Consensus API search strategies, query construction, API parameters

Extraction Methods

[references/experiment-extraction-guide.md] — Keyword-based extraction approach, dictionaries, limitations
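
To make the keyword-based approach and its limitations concrete, here is a simplified sketch of abstract classification. It is not the code in `extract_experiments.py`. The in vitro keywords are a small subset of that script's list, the in vivo keywords are assumed (the script's list is not shown here), and the function names are hypothetical.

```python
import re

# A few in vitro indicators taken from the keyword list in extract_experiments.py.
IN_VITRO = ["cell line", "cell culture", "in vitro", "siRNA", "organoid"]
# Assumed in vivo indicators; the script's actual list is not shown in this document.
IN_VIVO = ["in vivo", "xenograft", "syngeneic", "transgenic", "mice"]


def _mentions_any(text, keywords):
    """True if any keyword occurs in text (case-insensitive substring match)."""
    return any(re.search(re.escape(k), text, re.IGNORECASE) for k in keywords)


def classify_abstract(abstract):
    """Label an abstract as 'in vitro', 'in vivo', 'both', or 'unclassified'."""
    vitro = _mentions_any(abstract, IN_VITRO)
    vivo = _mentions_any(abstract, IN_VIVO)
    if vitro and vivo:
        return "both"
    if vitro:
        return "in vitro"
    if vivo:
        return "in vivo"
    return "unclassified"
```

Substring matching has no notion of negation: `classify_abstract("No in vivo experiments were performed.")` still returns `"in vivo"`. This is one reason extraction output should be spot-checked rather than trusted as ground truth.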

External Documentation

Consensus API: https://consensus.app/home/api/

Code preview

assets/eval/simple_test.py

#!/usr/bin/env python3
"""
Simple test for literature-preclinical skill.

Mock test: Validates experiment extraction logic (Steps 2-4) without network.
Live test: Full Consensus search + extraction (requires network + CONSENSUS_API_KEY).

Usage:
    python3 simple_test.py          # Mock test only
    python3 simple_test.py --live   # Mock + live test (needs CONSENSUS_API_KEY)
"""

import os
import sys
import json
import shutil

# Add scripts/ to path
SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
SKILL_DIR = os.path.join(SCRIPT_DIR, "..", "..")
sys.path.insert(0, os.path.join(SKILL_DIR, "scripts"))

TEST_OUTPUT_DIR = os.path.join(SCRIPT_DIR, "test_results")


# ---------------------------------------------------------------------------
# Mock abstracts with known in vitro / in vivo content
# ---------------------------------------------------------------------------

MOCK_PAPERS = [
    {
        "pmid": "MOCK001",
        "doi": "10.1234/mock001",
        "title": "CDK4/6 inhibition suppresses triple-negative breast cancer cell proliferation in vitro and in vivo",
        "authors": "Smith J, Jones A, Wang L et al.",
        "journal": "Cancer Research (2024)",
        "publication_date": "2024-03-15",
        "abstract": (
            "CDK4/6 inhibitors have shown promise in hormone receptor-positive breast cancer, "
            "but their role in triple-negative breast cancer (TNBC) remains unclear. "
            "We investigated the effects of palbociclib on TNBC cell lines in vitro and in vivo. "
            "MDA-MB-231 and BT-549 cells were treated with palbociclib. Cell viability was assessed "
            "by MTT assay, and apoptosis was measured by annexin V/PI staining and flow cytometry. "
            "Western blot analysis showed reduced phosphorylation of Rb protein. "
            "Colony formation assays demonstrated significantly reduced proliferation. "
            "In a subcutaneous xenograft model using nude mice, palbociclib significantly inhibited "
            "tumor growth compared to vehicle control. Tumor volume was reduced by 65% at day 28. "
            "Immunohistochemistry of tumor sections showed decreased Ki-67 staining. "
            "Body weight monitoring showed no significant toxicity."
        ),
        "keywords": "CDK4/6; breast cancer; palbociclib; xenograft",
        "url": "https://pubmed.ncbi.nlm.nih.gov/MOCK001/",
        "source": "PubMed",
    },
    {
        "pmid": "MOCK002",
        "doi": "10.1234/mock002",
        "title": "KRAS G12C inhibitor demonstrates anti-tumor activity in pancreatic cancer cell lines",
        "authors": "Chen X, Kim Y, Park S",
        "journal": "Molecular Cancer Therapeutics (2023)",
        "publication_date": "2023-11-01",
        "abstract": (
            "KRAS G12C mutations are found in a subset of pancreatic cancers. "
            "We evaluated a novel KRAS G12C inhibitor in pancreatic cancer cell lines. "
            "PANC-1 and MiaPaCa-2 cells were treated with increasing doses of the inhibitor. "
            "CCK-8 assay showed dose-dependent reduction in cell viability with IC50 values "
            "of 2.3 uM and 4.1 uM respectively. Transwell migration assay revealed significantly "
            "reduced invasion capacity. qPCR analysis demonstrated downregulation of MYC and "
            "ERK pathway genes. Caspase-3/7 activity was significantly increased, indicating "
            "apoptosis induction."
        ),
        "keywords": "KRAS; pancreatic cancer; targeted therapy",
        "url": "https://pubmed.ncbi.nlm.nih.gov/MOCK002/",
        "source": "PubMed",
    },
    {
        "pmid": "MOCK003",
        "doi": "10.1234/mock003",
        "title": "PD-L1 blockade enhances anti-tumor immunity in syngeneic mouse models of NSCLC",
        "authors": "Rodriguez M, Taylor B, Wilson K",

scripts/extract_experiments.py

"""
Preclinical Experiment Extraction Module

Parse abstracts to extract structured in vitro and in vivo experiment details.
Uses keyword-based extraction to identify cell lines, assays, animal models,
endpoints, and key findings from each paper.
"""

import re
import os
from typing import List, Dict, Tuple
import pandas as pd


# ---------------------------------------------------------------------------
# Keyword dictionaries
# ---------------------------------------------------------------------------

# In vitro indicators
IN_VITRO_KEYWORDS = [
    "cell line", "cell lines", "cell culture", "in vitro", "cultured cells",
    "transfect", "transduct", "knockdown", "overexpress", "overexpression",
    "siRNA", "shRNA", "CRISPR", "sgRNA",
    "co-culture", "monolayer", "spheroid", "organoid",
]

# Common cell line names (case-insensitive matching handled separately)
CELL_LINE_NAMES = [
    "MCF-7", "MCF7", "MDA-MB-231", "MDA-MB-468", "T47D", "BT-474", "BT474",
    "BT-549", "BT549", "MDA-MB-453", "CAL-51", "HCC1937", "HCC1806",
    "SK-BR-3", "SKBR3", "ZR-75", "4T1", "EMT6",
    "HeLa", "HEK293", "HEK-293", "293T", "HEK293T",
    "A549", "H1299", "H460", "H1975", "PC9", "HCC827",
    "HCT116", "HT29", "SW480", "SW620", "LoVo", "Caco-2",
    "U87", "U251", "T98G", "LN229",
    "PC3", "PC-3", "LNCaP", "DU145", "22Rv1", "VCaP",
    "K562", "HL60", "HL-60", "Jurkat", "THP-1", "U937",
    "HepG2", "Hep3B", "Huh7", "SMMC-7721",
    "PANC-1", "MiaPaCa-2", "BxPC-3", "AsPC-1",
    "A375", "SK-MEL-28", "B16", "B16F10",
    "OVCAR3", "SKOV3", "A2780",
    "CHO", "NIH3T3", "3T3", "COS-7",
    "Raji", "Ramos", "Daudi",
    "SH-SY5Y", "Neuro-2a", "N2a",
    "RAW264.7", "RAW 264.7", "J774",
]

# Assay keyword categories
ASSAY_KEYWORDS = {
    "viability": [
        "viability", "MTT", "CCK-8", "CCK8", "WST", "cell counting",
        "CellTiter", "MTS", "XTT", "alamarBlue", "resazurin",
        "cytotoxicity", "IC50", "EC50", "dose-response",
    ],
    "proliferation": [
        "proliferation", "colony formation", "clonogenic", "BrdU", "EdU",
        "Ki-67", "Ki67", "cell growth", "growth curve", "doubling time",
    ],
    "apoptosis": [
        "apoptosis", "annexin", "caspase", "TUNEL", "cell death",
        "sub-G1", "programmed cell death", "Bcl-2", "BAX",
        "cleaved PARP", "cytochrome c release",
    ],
    "migration_invasion": [
        "migration", "invasion", "wound healing", "transwell", "Boyden",
        "scratch assay", "chemotaxis", "Matrigel",
    ],
    "gene_expression": [
        "qPCR", "RT-PCR", "real-time PCR", "qRT-PCR",
        "mRNA expression", "RNA-seq", "RNAseq", "transcriptom",
        "gene expression", "Northern blot",
    ],
    "protein_analysis": [
        "Western blot", "immunoblot", "ELISA", "immunoprecipitation",
        "phosphorylation", "Co-IP", "pull-down", "mass spectrometry",
        "proteomics", "immunofluorescence",
    ],
    "flow_cytometry": [
        "flow cytometry", "FACS", "cell cycle", "cell sorting",
        "intracellular staining", "surface marker",

scripts/generate_plots.R

#!/usr/bin/env Rscript
#
# Generate 4-panel preclinical experiment visualization (Step 3).
#
# Creates:
# 1. Experiment type breakdown (in vitro / in vivo / both / unclassified)
# 2. Top assay types (horizontal bar chart)
# 3. Animal model distribution (bar chart)
# 4. Publication timeline by experiment type (stacked bars)
#
# Uses ggplot2 with ggprism publication theme.
# Exports both PNG (300 DPI) and SVG with graceful fallback.
#

# --- Load packages -----------------------------------------------------------

required_pkgs <- c("ggplot2", "ggprism", "dplyr", "tidyr", "patchwork")

for (pkg in required_pkgs) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    stop(paste0("Missing required package: ", pkg,
                "\nInstall with: install.packages('", pkg, "')"))
  }
}

library(ggplot2)
library(ggprism)
library(dplyr)
library(tidyr)
library(patchwork)

# Try to load svglite for high-quality SVG (optional)
.has_svglite <- requireNamespace("svglite", quietly = TRUE)
if (.has_svglite) library(svglite)


# --- Main function -----------------------------------------------------------

generate_all_plots <- function(input_dir = "preclinical_results",
                               output_dir = "preclinical_results") {
  cat("\n", paste(rep("=", 70), collapse = ""), "\n")
  cat("GENERATING VISUALIZATIONS\n")
  cat(paste(rep("=", 70), collapse = ""), "\n\n")

  dir.create(output_dir, showWarnings = FALSE, recursive = TRUE)

  # Read extraction CSV
  extract_file <- file.path(input_dir, "experiment_extraction.csv")
  if (!file.exists(extract_file)) {
    stop(paste("File not found:", extract_file,
               "\nRun Steps 1-2 first to generate experiment_extraction.csv"))
  }

  df <- read.csv(extract_file, stringsAsFactors = FALSE)
  cat("  Read", nrow(df), "papers from", extract_file, "\n\n")

  # Build 4 panels
  cat("1. Generating experiment type breakdown...\n")
  p1 <- .plot_experiment_types(df)

  cat("2. Generating top assay types...\n")
  p2 <- .plot_assay_types(df)

  cat("3. Generating animal model distribution...\n")
  p3 <- .plot_animal_models(df)

  cat("4. Generating publication timeline...\n")
  p4 <- .plot_timeline(df)

  # Combine with patchwork
  cat("\n5. Saving combined figure...\n")
  combined <- (p1 | p2) / (p3 | p4) +
    plot_annotation(
      title = "Preclinical Literature Extraction",
      subtitle = paste(nrow(df), "papers analyzed"),
      theme = theme(
        plot.title = element_text(hjust = 0.5, face = "bold", size = 16),
        plot.subtitle = element_text(hjust = 0.5, size = 12)
      )
    )

Companion files

Type	Path	Bytes
Python	assets/eval/simple_test.py	17,253
Text	assets/eval/test_results/analysis_object.pkl.UNAVAILABLE.txt	319
CSV	assets/eval/test_results/experiment_extraction.csv	2,265
CSV	assets/eval/test_results/preclinical_search_results.csv	3,593
Markdown	assets/eval/test_results/preclinical_synthesis_report.md	4,965
Text	assets/eval/test_results/preclinical_synthesis_report.pdf.UNAVAILABLE.txt	345
Python	scripts/extract_experiments.py	13,714
R	scripts/generate_plots.R	9,639
Python	scripts/generate_report.py	22,210
Python	scripts/narrative_synthesis.py	24,508
Python	scripts/preclinical_search.py	9,917
Python	scripts/preclinical_synthesis.py	8,866
Python	scripts/report_generation.py	11,998
Markdown	SKILL.md	10,910
JSON	skill.meta.json	3,044