View companion source

LASSO Biomarker Panel

A minimal 5–15 feature panel via LASSO.

Overview

Problem. Which few features predict the outcome stably?

Use when: High-dim features, binary outcome
Avoid when: No independent validation cohort

Learning goals

Figures

LASSO Panel Overview
LASSO Regularization
Nested CV & Stability
Reading ROC & Calibration
LASSO Pitfalls
Inputs & Outputs

Tutorial

Select minimal, interpretable biomarker panels from high-dimensional omics data using penalized logistic regression (LASSO/elastic net) with nested cross-validation and stability selection.

When to Use This Skill

Use this skill when you need to: - Select a minimal biomarker panel (5-15 features) from thousands of candidates - Build a predictive model for a binary clinical outcome (responder/non-responder, disease/control) - Validate across cohorts — discovery + independent validation design - Generate regulatory-grade outputs — ROC/AUC, calibration, decision curves - Design clinical exploratory endpoints from omics biomarker signatures

Don't use this skill for:

  • Unsupervised clustering (use bulk-omics-clustering)
  • Differential expression only (use bulk-rnaseq-counts-to-de-deseq2)
  • Multi-omics integration/factor discovery (use multiomics-patient-stratification)
  • Continuous outcomes — this skill is for binary classification

Installation

options(repos = c(CRAN = "https://cloud.r-project.org"))
if (!require('BiocManager', quietly = TRUE)) install.packages('BiocManager')

# Core (required)
install.packages(c('glmnet', 'pROC', 'ggplot2', 'ggprism', 'ggrepel'))

# Heatmap (required for feature heatmap)
BiocManager::install(c('ComplexHeatmap'))
install.packages('circlize')

# Example data — Sepsis MARS consortium (recommended demo) + breast cancer/UNIFI
BiocManager::install(c('GEOquery', 'Biobase'))

# Example data — IMvigor210 bladder cancer IO (alternative demo)
install.packages('remotes')
remotes::install_github('SiYangming/DESeq', upgrade = 'never')
remotes::install_github('SiYangming/IMvigor210CoreBiologies', upgrade = 'never')
BiocManager::install('DESeq2')

# Optional: DE pre-filtering
BiocManager::install('limma')

# Optional: Biological interpretation (pathway enrichment, cell-type context)
BiocManager::install(c('clusterProfiler', 'org.Hs.eg.db', 'fgsea'))
install.packages('msigdbr')
Software Version License Commercial Use Installation
glmnet >=4.1 GPL-2 Permitted install.packages('glmnet')
pROC >=1.18 GPL (>=3) Permitted install.packages('pROC')
ggplot2 >=3.4 MIT Permitted install.packages('ggplot2')
ggprism >=1.0.3 GPL (>=3) Permitted install.packages('ggprism')
ggrepel >=0.9 GPL-3 Permitted install.packages('ggrepel')
ComplexHeatmap >=2.10 MIT Permitted BiocManager::install('ComplexHeatmap')
circlize >=0.4.15 MIT Permitted install.packages('circlize')

Inputs

Required:

  • Expression matrix (genes x samples): TPM, FPKM, normalized counts, or protein abundance
  • Sample metadata: Data frame with a binary outcome column (0/1 or two-level factor)
  • Rownames must match expression column names
  • Minimum 20 samples per group (40+ recommended)

Optional:

  • Validation cohort: Independent expression matrix + metadata with same outcome
  • Pre-filtered feature list: From DE analysis or WGCNA hub genes (see Related Skills)

Outputs

Primary results:

  • biomarker_panel.csv — Final panel features with LASSO coefficients and selection frequencies
  • all_feature_stability.csv — All features ranked by selection frequency
  • discovery_performance.csv — Per-fold AUC, sensitivity, specificity
  • validation_performance.csv — External validation metrics (if validation cohort provided)

Analysis objects (RDS):

  • lasso_model.rds — Complete model result for downstream use
  • Load with: model <- readRDS('results/lasso_model.rds')
  • Predict with: source("scripts/lasso_workflow.R"); predict_biomarker_panel(model, new_X)
  • Required for: multiomics-patient-stratification, downstream prediction

Plots (PNG + SVG at 300 DPI):

  • roc_curve.png/.svg — ROC curve (discovery + validation overlay)
  • stability_barplot.png/.svg — Feature selection frequency across CV folds
  • coefficient_forest.png/.svg — LASSO coefficients with 95% CIs
  • calibration_curve.png/.svg — Predicted vs observed probability
  • auc_distribution.png/.svg — AUC distribution across CV folds
  • feature_heatmap.png/.svg — Clustered heatmap of panel features

Reports:

  • summary_report.md — Comprehensive analysis report (goals, context, datasets, methods, results)
  • summary_report.pdf — PDF report (only if tinytex installed; falls back to HTML automatically — this is expected in most environments)
  • summary_report.Rmd — R Markdown source (customizable)

Clarification Questions

ALWAYS ask Question 1 FIRST.

1. Example or Own Data? (ASK THIS FIRST):

IF EXAMPLE SELECTED (option a): Skip all remaining questions. Run Step 1 with: data <- load_sepsis_data(outcome = "endotype"), then Steps 2-4 with defaults (top_n_variable = 2000, alpha = 0.5, disease = "sepsis").

Questions 2-4 are ONLY asked if the user selected option (b) — own data:

2. Input Files (own data only):

3. Outcome & Organism (own data only):

4. Analysis Options (own data only — structured):

Standard Workflow

MANDATORY: USE SCRIPTS EXACTLY AS SHOWN - DO NOT WRITE INLINE CODE

Step 1 - Load data:

source("scripts/load_example_data.R")
data <- load_sepsis_data(outcome = "endotype")  # 479 ICU patients, Mars1 endotype classification
# OR: data <- load_sepsis_data(outcome = "mortality")  # 479 patients, 28-day mortality
# OR: data <- load_breast_cancer_pcr_data(outcome = "subtype")  # 218 tumors, Basal vs LumA
# OR: data <- load_imvigor210_data()   # 190 bladder cancer patients, atezolizumab
# OR: data <- load_unifi_data()        # 542 UC patients, UNIFI ustekinumab trial

DO NOT write custom data loading code. Use the loader functions.

✅ VERIFICATION: You MUST see: "✓ Sepsis endotype data loaded successfully!" (or similar per dataset)

Step 2 - Run LASSO analysis:

source("scripts/prepare_features.R")
features <- prepare_feature_matrix(data$expression, data$metadata, data$outcome_col, top_n_variable = 2000)

source("scripts/lasso_workflow.R")
model <- run_lasso_panel(features$X, features$y, alpha = 0.5)

DO NOT write inline LASSO code (glmnet, cv.glmnet, etc.). Just use the scripts.

✅ VERIFICATION: You MUST see: "✓ LASSO panel selection completed successfully!"

IF YOU DON'T SEE THIS: You wrote inline code. Stop and use source().

Step 3 - Generate visualizations:

source("scripts/biomarker_plots.R")
generate_all_plots(model, X = features$X, y = features$y, output_dir = "results")

DO NOT write inline plotting code (ggsave, ggplot, etc.). Just use generate_all_plots().

The script handles PNG + SVG export with graceful fallback for SVG dependencies.

✅ VERIFICATION: You MUST see: "✓ All biomarker plots generated successfully!"

Step 4 - Export results:

source("scripts/export_results.R")
export_all(model, output_dir = "results", data = data, features = features)

DO NOT write custom export code. Use export_all().

Pass data and features for a comprehensive report with disease context and methods.

✅ VERIFICATION: You MUST see: "=== Export Complete ==="

NOTE on PDF: If export_all() reports "PDF rendering failed", this is expected in environments without LaTeX. Inform the user that summary_report.html and summary_report.md are available instead. Do NOT silently omit this from the summary.

⚠️ CRITICAL: Do NOT interpret panel biology without running enrichment

After identifying the panel, do NOT describe gene functions or pathway membership from gene names alone. This is a hallucination risk. Instead: - State the panel genes and their coefficients/stability scores - Direct the user to run pathway enrichment as a next step - Only describe biology if functional-enrichment-from-degs has been run in this session

✅ Acceptable: "ACSL6 was selected in 100% of folds with a positive coefficient." ❌ NOT acceptable: "ACSL6 is involved in mitochondrial fatty acid metabolism, consistent with..."

Step 5 (Strongly Recommended) — External Validation:

Without this step, all performance metrics are discovery-cohort estimates only and are expected to be optimistic. Do not present results as a validated panel without completing this step.

source("scripts/load_example_data.R")
val_data <- load_breast_cancer_validation_data("GSE32646")  # or load_pursuit_data()
source("scripts/validate_external.R")
source("scripts/prepare_features.R")
val_features <- prepare_validation_features(val_data$expression, features$feature_names)
val_result <- validate_panel(model, val_features$X, val_data$metadata[[val_data$outcome_col]],
                              cohort_name = "GSE32646")
export_all(model, validation_result = val_result, output_dir = "results",
           data = data, features = features)

CRITICAL - DO NOT:

  • Write inline LASSO code -> STOP: Use source("scripts/lasso_workflow.R")
  • Write inline plotting code -> STOP: Use generate_all_plots()
  • Write custom export code -> STOP: Use export_all()
  • Try to install svglite -> script handles SVG fallback automatically

IF SCRIPTS FAIL - Script Failure Hierarchy:

  1. Fix and Retry (90%) - Install missing package, re-run script
  2. Modify Script (5%) - Edit the script file itself, document changes
  3. Use as Reference (4%) - Read script, adapt approach, cite source
  4. Write from Scratch (1%) - Only if genuinely impossible, explain why

NEVER skip directly to writing inline code without trying the script first.

Common Issues

Error Cause Fix
"No shared samples" Column names don't match rownames Check colnames(expression) match rownames(metadata)
"Outcome must be binary" Non-binary outcome column Ensure 2 unique values in outcome. Recode if multi-level.
cv.glmnet convergence warning Too few samples for 10-fold CV Reduce n_inner_folds to 5 or increase sample size
"No features passed stability" Very noisy data or small effect Script auto-relaxes to top 10 features. Consider alpha=0.5 (elastic net).
Low validation AUC Cross-platform gene loss Check val_features$n_matched. If <50% matched, use same-platform validation.
SVG export error Missing optional dependency Normal - generate_all_plots() falls back to base R svg() automatically.
GEOquery download fails Network/firewall issue Retry or download manually from NCBI GEO website
PDF report not generated tinytex/LaTeX not installed Normal fallback — use summary_report.html or summary_report.md instead. Do NOT silently omit this from the summary.

Agent Summary Guidelines

When presenting final results to the user, the agent MUST:

  1. Always cite the CV AUC (from discovery_performance.csv), never the final model AUC from the ROC plot
  2. Always state whether external validation was performed. If not, include this sentence verbatim:

    "These results are from the discovery cohort only. No external validation was performed. AUC estimates are expected to be optimistic."

  3. Never describe panel gene biology unless functional-enrichment-from-degs was run in this session
  4. Always report PDF status — if PDF generation failed, say so explicitly and note that .html/.md reports are available
  5. Never use the word "validated" to describe a panel that has only been tested in the discovery cohort

Interpretation Guidelines

AUC Values — Which Number to Use

Two AUC figures are produced by this workflow:

Figure Value What it means Use this?
Mean CV AUC (discovery_performance.csv) e.g. 0.986 Average AUC across held-out test sets YES — cite this
Final model AUC (ROC curve plot annotation) e.g. 0.996 Model applied back to its own training data NO — in-sample, optimistic

Always report the mean CV AUC as the performance estimate. The final model AUC is only shown for reference and will always be higher.

Interpreting CV AUC: Known Optimism Bias

The reported CV AUC from this workflow is an estimate within the discovery cohort, not a true out-of-sample performance figure. It is subject to optimism bias because:

Expected magnitude of bias: Typically 0.02–0.05 AUC units for datasets of this size. A discovery AUC of 0.986 should be interpreted as "likely >0.93 in an independent cohort if the signal is real" — not as a guaranteed performance floor.

The only way to get an unbiased estimate is external validation.

Suggested Next Steps

After building a biomarker panel:

  1. [REQUIRED before biological interpretation] Pathway enrichment of panel genes -> functional-enrichment-from-degs
  2. Co-expression context — which WGCNA modules contain panel genes -> coexpression-network
  3. Patient stratification using panel as input -> multiomics-patient-stratification
  4. Literature validation — are panel genes known disease/therapy markers?
  5. Independent cohort replication on external validation datasets

Related Skills

Skill Relationship
bulk-rnaseq-counts-to-de-deseq2 Upstream — DE results can pre-filter features for LASSO
coexpression-network Upstream — WGCNA hub genes as LASSO candidates (Paper 1 cascade)
functional-enrichment-from-degs Downstream — Pathway enrichment of panel genes
bulk-omics-clustering Alternative — Unsupervised patient stratification
multiomics-patient-stratification Downstream — Multi-omics integration + subtyping

References

Code preview

scripts/biological_interpretation.R

# Biological Interpretation of Biomarker Panel
#
# Three complementary analyses:
#   1. Pathway enrichment (ORA on panel genes + GSEA on ranked features)
#   2. Cell-type expression context (CZI CELLxGENE Census)
#   3. GWAS genetic risk / disease gene overlap
#
# Usage:
#   source("scripts/biological_interpretation.R")
#   interp <- run_biological_interpretation(model, features, output_dir = "results",
#                                            disease = "ibd")  # or "bladder_cancer", "breast_cancer", or "sepsis"

cat("Loading biological interpretation functions...\n")

# ============================================================
# 1. Pathway Enrichment
# ============================================================

.run_pathway_enrichment <- function(model_result, features = NULL, output_dir = "results") {
    cat("\n--- Pathway Enrichment Analysis ---\n\n")

    suppressPackageStartupMessages({
        library(clusterProfiler)
        library(org.Hs.eg.db)
        library(fgsea)
        library(msigdbr)
        library(ggplot2)
        library(ggprism)
    })

    panel_genes <- model_result$stable_features
    fi <- model_result$feature_importance

    results <- list(ora_go = NULL, ora_kegg = NULL, gsea_hallmark = NULL,
                    gsea_reactome = NULL, gsea_immunologic = NULL)

    # ---- ORA on panel genes (GO:BP) ----
    cat("  ORA: GO Biological Process on", length(panel_genes), "panel genes...\n")

    # Filter out non-standard gene symbols (IGLV4-60 etc.)
    valid_genes <- panel_genes[!grepl("^IGH|^IGK|^IGL|^LOC|^C\\d+orf", panel_genes)]
    cat("    Valid gene symbols for ORA:", length(valid_genes), "\n")

    # Get universe from all tested features
    all_genes <- fi$feature
    all_valid <- all_genes[!grepl("^IGH|^IGK|^IGL|^LOC|^C\\d+orf", all_genes)]

    ora_go <- tryCatch({
        ego <- enrichGO(
            gene         = valid_genes,
            universe     = all_valid,
            OrgDb        = org.Hs.eg.db,
            keyType      = "SYMBOL",
            ont          = "BP",
            pAdjustMethod = "BH",
            pvalueCutoff = 0.1,   # Relaxed for small gene list
            qvalueCutoff = 0.2,
            minGSSize    = 5,
            maxGSSize    = 500,
            readable     = TRUE
        )
        if (!is.null(ego) && nrow(ego@result[ego@result$p.adjust < 0.1, ]) > 0) {
            cat("    Found", sum(ego@result$p.adjust < 0.1), "significant GO terms (padj < 0.1)\n")
            ego
        } else {
            cat("    No significant GO terms at padj < 0.1 (expected with 10 genes)\n")
            ego
        }
    }, error = function(e) {
        cat("    GO ORA failed:", conditionMessage(e), "\n")
        NULL
    })
    results$ora_go <- ora_go

    # ---- ORA on panel genes (Reactome via msigdbr) ----
    cat("  ORA: Reactome pathways on panel genes...\n")

    reactome_sets <- msigdbr(species = "Homo sapiens", collection = "C2", subcollection = "CP:REACTOME")
    reactome_list <- split(reactome_sets$gene_symbol, reactome_sets$gs_name)

scripts/biomarker_plots.R

# Biomarker Panel Visualization Suite
# Publication-quality plots for LASSO biomarker panel results
# All ggplot-based plots use ggprism::theme_prism()
# Heatmaps use ComplexHeatmap + circlize

library(ggplot2)
library(ggprism)
library(pROC)

# Load plotting helpers
source("scripts/plotting_helpers.R")

#' Generate all biomarker panel plots
#'
#' Master function that generates all diagnostic and result plots.
#'
#' @param model_result Result from run_lasso_panel()
#' @param validation_result Result from validate_panel() (optional)
#' @param X Original feature matrix (for heatmap, optional)
#' @param y Original outcome vector (for heatmap, optional)
#' @param output_dir Output directory for plots (default: "results")
generate_all_plots <- function(model_result, validation_result = NULL,
                                X = NULL, y = NULL,
                                output_dir = "results") {

    cat("\n=== Generating Biomarker Panel Plots ===\n\n")

    if (!dir.exists(output_dir)) {
        dir.create(output_dir, recursive = TRUE)
    }

    # 1. ROC curve
    cat("1. ROC Curve\n")
    plot_roc_curve(model_result, validation_result, output_dir)

    # 2. Stability barplot
    cat("\n2. Feature Stability Barplot\n")
    plot_stability_barplot(model_result, output_dir)

    # 3. Coefficient forest plot
    cat("\n3. Coefficient Forest Plot\n")
    plot_coefficient_forest(model_result, output_dir)

    # 4. Calibration curve
    cat("\n4. Calibration Curve\n")
    plot_calibration_curve(model_result, output_dir)

    # 5. AUC distribution
    cat("\n5. AUC Distribution\n")
    plot_auc_distribution(model_result, output_dir)

    # 6. Feature heatmap (if expression data provided)
    if (!is.null(X) && !is.null(y) && length(model_result$stable_features) > 0) {
        cat("\n6. Feature Heatmap\n")
        plot_feature_heatmap(X, y, model_result$stable_features, output_dir)
    } else {
        cat("\n6. Feature Heatmap (skipped - provide X and y for heatmap)\n")
    }

    cat("\n✓ All biomarker plots generated successfully!\n")
}

#' Plot ROC curve with discovery and optional validation overlay
plot_roc_curve <- function(model_result, validation_result = NULL, output_dir = "results") {

    # Build ROC data from CV predictions
    cv_preds <- model_result$cv_predictions

    # Aggregate: average prediction per sample across folds
    agg <- aggregate(predicted_prob ~ sample_id + true_label, data = cv_preds, FUN = mean)
    roc_disc <- roc(agg$true_label, agg$predicted_prob, quiet = TRUE)

    # Build plot data
    roc_df <- data.frame(
        specificity = 1 - roc_disc$specificities,
        sensitivity = roc_disc$sensitivities,
        cohort = paste0("Discovery (CV AUC = ", round(model_result$mean_auc, 3), ")"),
        stringsAsFactors = FALSE
    )

scripts/export_results.R

# Export LASSO Biomarker Panel Results
# Saves all results including RDS objects for downstream skills
# Generates comprehensive markdown + PDF reports with embedded plots

#' Export all LASSO biomarker panel results
#'
#' @param model_result Result from run_lasso_panel()
#' @param validation_result Result from validate_panel() (optional)
#' @param output_dir Output directory (default: "results")
#' @param data Original data object from load_*_data() (optional, enriches report)
#' @param features Feature matrix result from prepare_feature_matrix() (optional)
#'
#' @export
export_all <- function(model_result, validation_result = NULL,
                        output_dir = "results",
                        data = NULL, features = NULL,
                        interpretation = NULL) {

    cat("\n=== Exporting LASSO Biomarker Panel Results ===\n\n")

    # Create output directory
    if (!dir.exists(output_dir)) {
        dir.create(output_dir, recursive = TRUE)
        cat("Created directory:", output_dir, "\n\n")
    }

    # 1. Biomarker panel (stable features with coefficients)
    cat("1. Exporting biomarker panel...\n")
    panel <- model_result$feature_importance[model_result$feature_importance$is_stable, ]
    write.csv(panel, file.path(output_dir, "biomarker_panel.csv"), row.names = FALSE)
    cat("   Saved: biomarker_panel.csv (", nrow(panel), "features)\n\n")

    # 2. All feature stability scores
    cat("2. Exporting all feature stability scores...\n")
    write.csv(model_result$feature_importance,
              file.path(output_dir, "all_feature_stability.csv"), row.names = FALSE)
    cat("   Saved: all_feature_stability.csv (",
        nrow(model_result$feature_importance), "features)\n\n")

    # 3. Discovery performance (per-fold)
    cat("3. Exporting discovery performance...\n")
    perf <- data.frame(
        fold = seq_along(model_result$fold_aucs),
        auc = model_result$fold_aucs,
        sensitivity = model_result$fold_sensitivities,
        specificity = model_result$fold_specificities,
        stringsAsFactors = FALSE
    )
    write.csv(perf, file.path(output_dir, "discovery_performance.csv"), row.names = FALSE)
    cat("   Saved: discovery_performance.csv\n")
    cat("   Mean AUC:", round(model_result$mean_auc, 3),
        "(95% CI:", round(model_result$auc_ci[1], 3), "-",
        round(model_result$auc_ci[2], 3), ")\n\n")

    # 4. Validation performance (if provided)
    if (!is.null(validation_result)) {
        cat("4. Exporting validation performance...\n")
        write.csv(validation_result$performance_table,
                  file.path(output_dir, "validation_performance.csv"), row.names = FALSE)
        cat("   Saved: validation_performance.csv\n")
        cat("   Validation AUC:", round(validation_result$auc, 3), "\n\n")

        # Validation predictions
        write.csv(validation_result$predictions,
                  file.path(output_dir, "validation_predictions.csv"), row.names = FALSE)
        cat("   Saved: validation_predictions.csv\n\n")
    } else {
        cat("4. No validation result provided (skipped)\n\n")
    }

    # 5. Cross-validation predictions
    cat("5. Exporting CV predictions...\n")
    write.csv(model_result$cv_predictions,
              file.path(output_dir, "cv_predictions.csv"), row.names = FALSE)
    cat("   Saved: cv_predictions.csv (",
        nrow(model_result$cv_predictions), "predictions across",
        length(unique(model_result$cv_predictions$fold)), "folds)\n\n")

    # 6. LASSO model object (CRITICAL for downstream skills)
    cat("6. Saving LASSO model object (RDS)...\n")

Companion files

TypePathBytes
Markdownreferences/decision-guide.md2,649
Markdownreferences/lasso-reference.md4,081
Markdownreferences/validation-guide.md3,742
Rscripts/biological_interpretation.R86,941
Rscripts/biomarker_plots.R13,551
Rscripts/export_results.R63,850
Rscripts/lasso_workflow.R12,534
Rscripts/load_example_data.R101,173
Rscripts/plotting_helpers.R2,840
Rscripts/prepare_features.R8,534
Pythonscripts/query_cellxgene.py3,859
Rscripts/validate_external.R6,846
MarkdownSKILL.md16,748
JSONskill.meta.json2,673