Preclinical Literature Mining

Find preclinical studies and extract structured in vitro / in vivo details.

Overview

Problem. Evidence is scattered across more papers than anyone can read.

Use when: Surveying preclinical evidence for a target
Avoid when: You would treat the keyword-based extraction as ground truth without checking the papers

Learning goals

Figures

Preclinical Literature Overview
Five-Step Workflow
In Vitro vs In Vivo
Output Deliverables
Common Issues

Tutorial

Search Consensus (consensus.app) for preclinical studies on a molecular target in a disease, then extract structured in vitro and in vivo experiment details from each paper.


When to Use This Skill

Use this skill when you need to:

  • Survey preclinical evidence for a drug target in a disease indication
  • Extract in vitro experiments: cell lines, assays (viability, migration, apoptosis, etc.), key findings
  • Extract in vivo experiments: animal models (xenograft, PDX, syngeneic, transgenic), endpoints, key findings
  • Identify common model systems: which cell lines and animal models are most used for your target
  • Compare in vitro vs in vivo concordance: papers reporting both experiment types
  • Support IND-enabling decisions: compile the preclinical evidence landscape

Do NOT use this skill for:

  • ❌ Clinical trial literature (use literature-review instead)
  • ❌ Automated full-text parsing (agent reads full text for top papers after abstract extraction)
  • ❌ Meta-analysis or statistical pooling of preclinical results
  • ❌ Citation management / formatting only

Installation

Python (Search + Extraction)

pip install requests pandas

PDF Report Generation (Optional)

pip install reportlab

R (Visualization)

install.packages(c("ggplot2", "ggprism", "dplyr", "tidyr", "patchwork"))
# Optional for high-quality SVG:
install.packages("svglite")

Package Licenses

| Software | Version | License | Commercial Use | Installation |
|---|---|---|---|---|
| requests | ≥2.25 | Apache 2.0 | ✅ Permitted | pip install requests |
| pandas | ≥1.3 | BSD | ✅ Permitted | pip install pandas |
| reportlab | ≥3.6 | BSD | ✅ Permitted | pip install reportlab |
| ggplot2 | ≥3.4 | MIT | ✅ Permitted | install.packages("ggplot2") |
| ggprism | ≥1.0.3 | GPL (≥3) | ✅ Permitted | install.packages("ggprism") |
| dplyr | ≥1.1 | MIT | ✅ Permitted | install.packages("dplyr") |
| tidyr | ≥1.3 | MIT | ✅ Permitted | install.packages("tidyr") |
| patchwork | ≥1.1 | MIT | ✅ Permitted | install.packages("patchwork") |

API Requirements:

  • Consensus API: Requires the CONSENSUS_API_KEY environment variable. Get a key at https://consensus.app/home/api/ and export it in your shell:

export CONSENSUS_API_KEY="your_key_here"
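Before running Step 1, it can help to confirm that the key is visible to the Python process. The sketch below is illustrative only: the helper name require_consensus_key is hypothetical and not part of the skill's scripts. Only the variable name CONSENSUS_API_KEY comes from this document.

```python
import os


def require_consensus_key(env=None):
    """Return the Consensus API key, or raise a clear error if it is unset.

    Hypothetical helper for illustration; not part of the skill's scripts.
    """
    # Allow a plain dict to be passed in so the check is easy to test.
    env = os.environ if env is None else env
    key = env.get("CONSENSUS_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "CONSENSUS_API_KEY not set. Export it in your shell before running Step 1."
        )
    return key
```

Calling require_consensus_key() with no arguments reads the real environment. Passing a dict is only a convenience for testing.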

Inputs

Required Inputs

  1. Target — Molecular target name (e.g., "CDK4/6", "BRAF", "PD-L1", "HER2")
  2. Disease — Disease context (e.g., "breast cancer", "melanoma", "NSCLC")

Optional Inputs


Outputs

Generated Files

| File | Description |
|---|---|
| preclinical_search_results.csv | All papers with metadata (PMID, DOI, title, abstract, etc.) |
| experiment_extraction.csv | Per-paper extraction: experiment type, cell lines, assays, models, endpoints, findings |
| preclinical_synthesis_report.md | Structured markdown report with narrative synthesis, frequency tables, hyperlinked references, and full-text insights |
| preclinical_synthesis_report.pdf | Publication-quality PDF with Introduction, Methods, Results, Conclusions |
| preclinical_plots.png | 4-panel visualization (300 DPI) |
| preclinical_plots.svg | Vector format (with graceful fallback) |
| analysis_object.pkl | Complete analysis object for downstream use |

Analysis object (pickle):

  • analysis_object.pkl — Contains search results, experiments, synthesis
  • Load with: import pickle; obj = pickle.load(open('analysis_object.pkl', 'rb'))
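The one-liner above leaves the file handle open until garbage collection. A slightly more careful variant is sketched here. It assumes the pickle sits in the preclinical_results output directory used in the workflow examples, and the helper name load_analysis_object is hypothetical. The structure of the loaded object is not specified in this document, so inspect it before relying on any particular keys.

```python
import pickle
from pathlib import Path


def load_analysis_object(path="preclinical_results/analysis_object.pkl"):
    """Load the pickled analysis object and close the file handle afterwards.

    The default path is an assumption based on the output_dir used in the
    workflow examples. Only unpickle files you generated yourself, because
    pickle can execute arbitrary code on load.
    """
    with Path(path).open("rb") as handle:
        return pickle.load(handle)
```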

Clarification Questions

🚨 ALWAYS ask Question 1 FIRST.

1. Target and Disease (ASK THIS FIRST):

🚨 IF EXAMPLE SEARCH SELECTED: All parameters are pre-defined (last 5 years, 50 results). DO NOT ask questions 2-3. Proceed directly to Step 1.

Questions 2-3 are ONLY for users providing their own target/disease:

2. Search Parameters:

3. Results Scope:

4. Full-Text Depth:


Standard Workflow

🚨 MANDATORY: USE SCRIPTS EXACTLY AS SHOWN - DO NOT WRITE INLINE CODE 🚨

⚠️ CRITICAL - DO NOT:

  • Write inline Consensus/PubMed API code → STOP: Use search_preclinical()
  • Write inline extraction code → STOP: Use extract_all_experiments()
  • Write inline plotting code (ggplot, ggsave, etc.) → STOP: Use generate_all_plots()
  • Write custom export code → STOP: Use export_all()
  • Try to install svglite → script handles SVG fallback automatically

Steps 1-4 are automated (scripts). Step 5 is agent-guided (manual full-text reading).

Step 1 — Search for preclinical studies:

import sys
sys.path.append("scripts")
from preclinical_search import search_preclinical

results = search_preclinical(
    target="CDK4/6",
    disease="triple-negative breast cancer",
    max_results=50,
    years=5,
    output_dir="preclinical_results"
)

DO NOT write inline Consensus API code. Use the script.

Step 2 — Extract in vitro and in vivo experiments:

from extract_experiments import extract_all_experiments

experiments = extract_all_experiments(results, output_dir="preclinical_results")

DO NOT write inline extraction code. The script handles all keyword matching.
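To make the labels in experiment_extraction.csv easier to interpret, the following sketch shows the general idea behind keyword-based classification. It is not the logic in extract_all_experiments() and is not a substitute for running the script. The in vitro terms are a small subset of the list in scripts/extract_experiments.py; the in vivo terms are assumptions, since the script's in vivo list is not shown in this document.

```python
import re

# Small subset of the in vitro indicators listed in scripts/extract_experiments.py.
IN_VITRO_HINTS = ["cell line", "in vitro", "siRNA", "organoid"]
# ASSUMPTION: illustrative in vivo terms; the script's real list is not shown here.
IN_VIVO_HINTS = ["in vivo", "xenograft", "syngeneic", "mice", "mouse model"]


def _mentions_any(text, keywords):
    """Return True if any keyword occurs in text, ignoring case."""
    return any(re.search(re.escape(k), text, flags=re.IGNORECASE) for k in keywords)


def classify_abstract(abstract):
    """Label an abstract as 'in vitro', 'in vivo', 'both', or 'unclassified'."""
    vitro = _mentions_any(abstract, IN_VITRO_HINTS)
    vivo = _mentions_any(abstract, IN_VIVO_HINTS)
    if vitro and vivo:
        return "both"
    if vitro:
        return "in vitro"
    if vivo:
        return "in vivo"
    return "unclassified"
```

Plain substring matching like this is why an abstract that uses unfamiliar terminology can end up as unclassified, and why the extraction should be spot-checked against the papers.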

Step 3 — Generate visualizations:

source("scripts/generate_plots.R")
generate_all_plots(input_dir = "preclinical_results", output_dir = "preclinical_results")

DO NOT write inline plotting code (ggplot, ggsave, etc.). Just source the script.

The script handles PNG + SVG export with graceful fallback for SVG dependencies.

Step 4 — Synthesize and export results:

from preclinical_synthesis import synthesize_preclinical, export_all

synthesis = synthesize_preclinical(results, experiments, target="CDK4/6")
export_all(results, experiments, synthesis,
           target="CDK4/6", disease="triple-negative breast cancer",
           output_dir="preclinical_results")

DO NOT write custom export code. Use export_all().

Step 5 — Full-text deep dive (top papers):

Read the full-text enrichment guide and follow its instructions to read full text for the top papers (default: up to 30). Replace the ## Full-Text Insights placeholder in the report with per-paper findings.

DO NOT skip this step. Select papers based on the criteria in the guide.
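The reading in Step 5 is manual, but the final substitution into the report can be done programmatically. The sketch below is one possible approach under stated assumptions: the placeholder is taken to be everything between the ## Full-Text Insights heading and the next level-2 heading, and each paper gets a level-3 subheading. That layout and the helper name fill_full_text_insights are assumptions, not documented behavior of export_all().

```python
import re


def fill_full_text_insights(report_md, insights):
    """Replace the body under '## Full-Text Insights' with per-paper findings.

    `insights` maps a paper label (for example a PMID or title) to a
    findings string. Hypothetical helper; the report layout is assumed.
    """
    body = "\n\n".join(f"### {label}\n\n{text}" for label, text in insights.items())
    # Match the heading line, then lazily consume text up to the next
    # level-2 heading or the end of the document.
    pattern = re.compile(
        r"(^## Full-Text Insights[ \t]*\n)(.*?)(?=^## |\Z)",
        flags=re.MULTILINE | re.DOTALL,
    )
    new_report, count = pattern.subn(
        lambda m: m.group(1) + "\n" + body + "\n\n", report_md, count=1
    )
    if count == 0:
        raise ValueError("No '## Full-Text Insights' heading found in report")
    return new_report
```

Raising when the heading is missing makes it obvious if the report was generated without the placeholder, rather than silently returning the text unchanged.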

✅ VERIFICATION — You should see:

  • After Step 1: "✓ Literature search completed successfully!"
  • After Step 2: "✓ Experiment extraction completed successfully!"
  • After Step 3: "✓ All plots generated successfully!"
  • After Step 4: "=== Export Complete ==="
  • After Step 5: Report contains ## Full-Text Insights section with per-paper details

❌ IF YOU DON'T SEE THESE: You wrote inline code. Stop and use the scripts.
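Beyond the console messages, checking for the output files gives a second signal that Steps 1-4 ran. This sketch assumes every file in the Generated Files list is written directly into output_dir; the helper name missing_outputs is hypothetical.

```python
from pathlib import Path

# File names taken from the "Generated Files" list in this document.
EXPECTED_OUTPUTS = [
    "preclinical_search_results.csv",
    "experiment_extraction.csv",
    "preclinical_synthesis_report.md",
    "preclinical_plots.png",
    "analysis_object.pkl",
]
# Treated as optional here: the PDF is skipped without reportlab, and SVG
# export depends on the available R graphics device.
OPTIONAL_OUTPUTS = ["preclinical_synthesis_report.pdf", "preclinical_plots.svg"]


def missing_outputs(output_dir="preclinical_results"):
    """Return the expected output files that are not present in output_dir."""
    root = Path(output_dir)
    return [name for name in EXPECTED_OUTPUTS if not (root / name).is_file()]
```

An empty list means all required files exist. It does not confirm their contents, so the report should still be opened to check the Full-Text Insights section.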


Common Issues

| Issue | Cause | Solution |
|---|---|---|
| "CONSENSUS_API_KEY not set" | API key missing | export CONSENSUS_API_KEY='your_key_here' (get a key at https://consensus.app/home/api/) |
| "Invalid or expired API key" | Bad API key | Verify your CONSENSUS_API_KEY is valid and not expired |
| "HTTP 429: Rate limited" | Consensus rate limit exceeded | The script handles retries with exponential backoff. Wait and retry if the error persists |
| "No results found" | Query too specific or target name mismatch | Try alternative target names (e.g., "CDK4" vs "CDK4/6" vs "cyclin-dependent kinase 4") |
| "experiment_extraction.csv not found" | Step 2 not run before Step 3 | Run Steps 1-2 (Python) before Step 3 (R) |
| "Most papers classified as unclassified" | Abstracts don't contain expected keywords | Expected for some targets. Check whether the papers use different terminology |
| "Missing required package: ggprism" | R packages not installed | install.packages(c("ggplot2", "ggprism", "dplyr", "tidyr", "patchwork")) |
| SVG export failed | Missing svglite dependency | Normal. The script falls back to the base R svg() device, and the PNG is always generated |
| "PDF skipped" | Missing reportlab package | pip install reportlab. The Markdown report is always generated regardless |

⚠️ IF SCRIPTS FAIL — Script Failure Hierarchy:

  1. Fix and Retry (90%) — Install missing package, re-run script
  2. Modify Script (5%) — Edit the script file itself, document changes
  3. Use as Reference (4%) — Read script, adapt approach, cite source
  4. Write from Scratch (1%) — Only if genuinely impossible, explain why

NEVER skip directly to writing inline code without trying the script first.
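For context on the "HTTP 429: Rate limited" entry in the table above, exponential backoff generally follows the pattern sketched here. This is not the implementation in preclinical_search.py, which is not shown in this document. The helper name call_with_backoff is hypothetical, and RuntimeError stands in for whatever exception the real HTTP client raises on a rate-limit response.

```python
import time


def call_with_backoff(fn, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on RuntimeError wait base_delay * 2**attempt, then retry.

    Re-raises the last error once max_attempts calls have failed.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except RuntimeError:
            if attempt == max_attempts - 1:
                raise
            # Delays double on each failure: base_delay, 2x, 4x, ...
            sleep(base_delay * (2 ** attempt))
```

The sleep parameter is injectable so the schedule can be inspected without actually waiting. In normal use the default time.sleep applies.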


Suggested Next Steps

After extracting preclinical experiments:

  1. Deep dive — Read full-text papers for the most relevant "both" papers (in vitro + in vivo)
  2. Expand search — Try alternative target names or broader disease terms
  3. Functional enrichment — Use functional-enrichment-from-degs on genes from relevant pathways
  4. Literature review — Use literature-review for broader context including clinical studies
  5. Target gene analysis — Use chip-atlas-target-genes to identify transcription factor targets

Related Skills

Upstream Skills

Downstream Skills


References

Search Strategy

Extraction Methods

External Documentation

Code preview

assets/eval/simple_test.py

#!/usr/bin/env python3
"""
Simple test for literature-preclinical skill.

Mock test: Validates experiment extraction logic (Steps 2-4) without network.
Live test: Full Consensus search + extraction (requires network + CONSENSUS_API_KEY).

Usage:
    python3 simple_test.py          # Mock test only
    python3 simple_test.py --live   # Mock + live test (needs CONSENSUS_API_KEY)
"""

import os
import sys
import json
import shutil

# Add scripts/ to path
SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
SKILL_DIR = os.path.join(SCRIPT_DIR, "..", "..")
sys.path.insert(0, os.path.join(SKILL_DIR, "scripts"))

TEST_OUTPUT_DIR = os.path.join(SCRIPT_DIR, "test_results")


# ---------------------------------------------------------------------------
# Mock abstracts with known in vitro / in vivo content
# ---------------------------------------------------------------------------

MOCK_PAPERS = [
    {
        "pmid": "MOCK001",
        "doi": "10.1234/mock001",
        "title": "CDK4/6 inhibition suppresses triple-negative breast cancer cell proliferation in vitro and in vivo",
        "authors": "Smith J, Jones A, Wang L et al.",
        "journal": "Cancer Research (2024)",
        "publication_date": "2024-03-15",
        "abstract": (
            "CDK4/6 inhibitors have shown promise in hormone receptor-positive breast cancer, "
            "but their role in triple-negative breast cancer (TNBC) remains unclear. "
            "We investigated the effects of palbociclib on TNBC cell lines in vitro and in vivo. "
            "MDA-MB-231 and BT-549 cells were treated with palbociclib. Cell viability was assessed "
            "by MTT assay, and apoptosis was measured by annexin V/PI staining and flow cytometry. "
            "Western blot analysis showed reduced phosphorylation of Rb protein. "
            "Colony formation assays demonstrated significantly reduced proliferation. "
            "In a subcutaneous xenograft model using nude mice, palbociclib significantly inhibited "
            "tumor growth compared to vehicle control. Tumor volume was reduced by 65% at day 28. "
            "Immunohistochemistry of tumor sections showed decreased Ki-67 staining. "
            "Body weight monitoring showed no significant toxicity."
        ),
        "keywords": "CDK4/6; breast cancer; palbociclib; xenograft",
        "url": "https://pubmed.ncbi.nlm.nih.gov/MOCK001/",
        "source": "PubMed",
    },
    {
        "pmid": "MOCK002",
        "doi": "10.1234/mock002",
        "title": "KRAS G12C inhibitor demonstrates anti-tumor activity in pancreatic cancer cell lines",
        "authors": "Chen X, Kim Y, Park S",
        "journal": "Molecular Cancer Therapeutics (2023)",
        "publication_date": "2023-11-01",
        "abstract": (
            "KRAS G12C mutations are found in a subset of pancreatic cancers. "
            "We evaluated a novel KRAS G12C inhibitor in pancreatic cancer cell lines. "
            "PANC-1 and MiaPaCa-2 cells were treated with increasing doses of the inhibitor. "
            "CCK-8 assay showed dose-dependent reduction in cell viability with IC50 values "
            "of 2.3 uM and 4.1 uM respectively. Transwell migration assay revealed significantly "
            "reduced invasion capacity. qPCR analysis demonstrated downregulation of MYC and "
            "ERK pathway genes. Caspase-3/7 activity was significantly increased, indicating "
            "apoptosis induction."
        ),
        "keywords": "KRAS; pancreatic cancer; targeted therapy",
        "url": "https://pubmed.ncbi.nlm.nih.gov/MOCK002/",
        "source": "PubMed",
    },
    {
        "pmid": "MOCK003",
        "doi": "10.1234/mock003",
        "title": "PD-L1 blockade enhances anti-tumor immunity in syngeneic mouse models of NSCLC",
        "authors": "Rodriguez M, Taylor B, Wilson K",

scripts/extract_experiments.py

"""
Preclinical Experiment Extraction Module

Parse abstracts to extract structured in vitro and in vivo experiment details.
Uses keyword-based extraction to identify cell lines, assays, animal models,
endpoints, and key findings from each paper.
"""

import re
import os
from typing import List, Dict, Tuple
import pandas as pd


# ---------------------------------------------------------------------------
# Keyword dictionaries
# ---------------------------------------------------------------------------

# In vitro indicators
IN_VITRO_KEYWORDS = [
    "cell line", "cell lines", "cell culture", "in vitro", "cultured cells",
    "transfect", "transduct", "knockdown", "overexpress", "overexpression",
    "siRNA", "shRNA", "CRISPR", "sgRNA",
    "co-culture", "monolayer", "spheroid", "organoid",
]

# Common cell line names (case-insensitive matching handled separately)
CELL_LINE_NAMES = [
    "MCF-7", "MCF7", "MDA-MB-231", "MDA-MB-468", "T47D", "BT-474", "BT474",
    "BT-549", "BT549", "MDA-MB-453", "CAL-51", "HCC1937", "HCC1806",
    "SK-BR-3", "SKBR3", "ZR-75", "4T1", "EMT6",
    "HeLa", "HEK293", "HEK-293", "293T", "HEK293T",
    "A549", "H1299", "H460", "H1975", "PC9", "HCC827",
    "HCT116", "HT29", "SW480", "SW620", "LoVo", "Caco-2",
    "U87", "U251", "T98G", "LN229",
    "PC3", "PC-3", "LNCaP", "DU145", "22Rv1", "VCaP",
    "K562", "HL60", "HL-60", "Jurkat", "THP-1", "U937",
    "HepG2", "Hep3B", "Huh7", "SMMC-7721",
    "PANC-1", "MiaPaCa-2", "BxPC-3", "AsPC-1",
    "A375", "SK-MEL-28", "B16", "B16F10",
    "OVCAR3", "SKOV3", "A2780",
    "CHO", "NIH3T3", "3T3", "COS-7",
    "Raji", "Ramos", "Daudi",
    "SH-SY5Y", "Neuro-2a", "N2a",
    "RAW264.7", "RAW 264.7", "J774",
]

# Assay keyword categories
ASSAY_KEYWORDS = {
    "viability": [
        "viability", "MTT", "CCK-8", "CCK8", "WST", "cell counting",
        "CellTiter", "MTS", "XTT", "alamarBlue", "resazurin",
        "cytotoxicity", "IC50", "EC50", "dose-response",
    ],
    "proliferation": [
        "proliferation", "colony formation", "clonogenic", "BrdU", "EdU",
        "Ki-67", "Ki67", "cell growth", "growth curve", "doubling time",
    ],
    "apoptosis": [
        "apoptosis", "annexin", "caspase", "TUNEL", "cell death",
        "sub-G1", "programmed cell death", "Bcl-2", "BAX",
        "cleaved PARP", "cytochrome c release",
    ],
    "migration_invasion": [
        "migration", "invasion", "wound healing", "transwell", "Boyden",
        "scratch assay", "chemotaxis", "Matrigel",
    ],
    "gene_expression": [
        "qPCR", "RT-PCR", "real-time PCR", "qRT-PCR",
        "mRNA expression", "RNA-seq", "RNAseq", "transcriptom",
        "gene expression", "Northern blot",
    ],
    "protein_analysis": [
        "Western blot", "immunoblot", "ELISA", "immunoprecipitation",
        "phosphorylation", "Co-IP", "pull-down", "mass spectrometry",
        "proteomics", "immunofluorescence",
    ],
    "flow_cytometry": [
        "flow cytometry", "FACS", "cell cycle", "cell sorting",
        "intracellular staining", "surface marker",

scripts/generate_plots.R

#!/usr/bin/env Rscript
#
# Generate 4-panel preclinical experiment visualization (Step 3).
#
# Creates:
# 1. Experiment type breakdown (in vitro / in vivo / both / unclassified)
# 2. Top assay types (horizontal bar chart)
# 3. Animal model distribution (bar chart)
# 4. Publication timeline by experiment type (stacked bars)
#
# Uses ggplot2 with ggprism publication theme.
# Exports both PNG (300 DPI) and SVG with graceful fallback.
#

# --- Load packages -----------------------------------------------------------

required_pkgs <- c("ggplot2", "ggprism", "dplyr", "tidyr", "patchwork")

for (pkg in required_pkgs) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    stop(paste0("Missing required package: ", pkg,
                "\nInstall with: install.packages('", pkg, "')"))
  }
}

library(ggplot2)
library(ggprism)
library(dplyr)
library(tidyr)
library(patchwork)

# Try to load svglite for high-quality SVG (optional)
.has_svglite <- requireNamespace("svglite", quietly = TRUE)
if (.has_svglite) library(svglite)


# --- Main function -----------------------------------------------------------

generate_all_plots <- function(input_dir = "preclinical_results",
                               output_dir = "preclinical_results") {
  cat("\n", paste(rep("=", 70), collapse = ""), "\n")
  cat("GENERATING VISUALIZATIONS\n")
  cat(paste(rep("=", 70), collapse = ""), "\n\n")

  dir.create(output_dir, showWarnings = FALSE, recursive = TRUE)

  # Read extraction CSV
  extract_file <- file.path(input_dir, "experiment_extraction.csv")
  if (!file.exists(extract_file)) {
    stop(paste("File not found:", extract_file,
               "\nRun Steps 1-2 first to generate experiment_extraction.csv"))
  }

  df <- read.csv(extract_file, stringsAsFactors = FALSE)
  cat("  Read", nrow(df), "papers from", extract_file, "\n\n")

  # Build 4 panels
  cat("1. Generating experiment type breakdown...\n")
  p1 <- .plot_experiment_types(df)

  cat("2. Generating top assay types...\n")
  p2 <- .plot_assay_types(df)

  cat("3. Generating animal model distribution...\n")
  p3 <- .plot_animal_models(df)

  cat("4. Generating publication timeline...\n")
  p4 <- .plot_timeline(df)

  # Combine with patchwork
  cat("\n5. Saving combined figure...\n")
  combined <- (p1 | p2) / (p3 | p4) +
    plot_annotation(
      title = "Preclinical Literature Extraction",
      subtitle = paste(nrow(df), "papers analyzed"),
      theme = theme(
        plot.title = element_text(hjust = 0.5, face = "bold", size = 16),
        plot.subtitle = element_text(hjust = 0.5, size = 12)
      )
    )

Companion files

| Type | Path | Bytes |
|---|---|---|
| Python | assets/eval/simple_test.py | 17,253 |
| Text | assets/eval/test_results/analysis_object.pkl.UNAVAILABLE.txt | 319 |
| CSV | assets/eval/test_results/experiment_extraction.csv | 2,265 |
| CSV | assets/eval/test_results/preclinical_search_results.csv | 3,593 |
| Markdown | assets/eval/test_results/preclinical_synthesis_report.md | 4,965 |
| Text | assets/eval/test_results/preclinical_synthesis_report.pdf.UNAVAILABLE.txt | 345 |
| Python | scripts/extract_experiments.py | 13,714 |
| R | scripts/generate_plots.R | 9,639 |
| Python | scripts/generate_report.py | 22,210 |
| Python | scripts/narrative_synthesis.py | 24,508 |
| Python | scripts/preclinical_search.py | 9,917 |
| Python | scripts/preclinical_synthesis.py | 8,866 |
| Python | scripts/report_generation.py | 11,998 |
| Markdown | SKILL.md | 10,910 |
| JSON | skill.meta.json | 3,044 |