View companion source

MS Differential Proteins

MS differential expression with PSM-aware variance (limma + DEqMS).

Overview

Problem. Find differential proteins between groups — proteomics' DESeq2.

Use when: TMT or LFQ MS data
Avoid when: Blindly imputing unknown missingness

Learning goals

Figures

Proteomics DE Overview
DE Analysis Workflow
DEqMS PSM-Aware Variance
Normalization & Imputation
Reading Output Plots
Common Pitfalls

Tutorial

Differential protein expression analysis for TMT/LFQ mass spectrometry proteomics data using limma linear models with DEqMS PSM-count-aware variance correction.

When to Use This Skill

Use this skill when you have: - ✅ Protein quantification data from TMT or LFQ mass spectrometry - ✅ PSM/peptide counts per protein (for DEqMS variance correction) - ✅ Biological replicates (≥2 per condition, ≥3 recommended) - ✅ Need for PSM-aware statistical testing (improved power over standard limma)

Don't use this skill for: - ❌ RNA-seq data → use bulk-rnaseq-counts-to-de-deseq2

  • ❌ Metabolomics data → different normalization/statistics needed
  • ❌ Pre-computed fold changes without raw intensities

Quick Start (Example Data)

Test this skill with real TMT proteomics data in ~2 minutes:

source("scripts/load_example_data.R")
data <- load_example_data()    # Auto-downloads A431 TMT 10-plex data (~30s)
psm_data <- data$psm_data      # 316,726 PSMs × 10 TMT channels
metadata <- data$metadata       # 4 conditions: ctrl, miR191, miR372, miR519

# Run complete workflow
source("scripts/basic_workflow.R")  # Creates fit_deqms, deqms_results + prints summary

What you get:

  • Dataset: A431 human epidermoid carcinoma cells treated with miRNAs (TMT 10-plex)
  • Comparison: miR372 vs ctrl (3 vs 3 replicates)
  • Expected results: ~9,000 proteins quantified, significant DE proteins at adj.p < 0.05

For your own data: Replace data loading with your protein intensity matrix and metadata (see Inputs section).

Installation

Core packages (required):

options(repos = c(CRAN = "https://cloud.r-project.org"))
if (!require('BiocManager', quietly = TRUE)) install.packages('BiocManager')
BiocManager::install(c('limma', 'DEqMS', 'ExperimentHub'))

Visualization packages (required):

install.packages(c('ggplot2', 'ggprism', 'ggrepel', 'circlize', 'matrixStats'))
BiocManager::install('ComplexHeatmap')

Optional packages:

install.packages(c('rmarkdown', 'knitr'))        # PDF report
BiocManager::install(c('impute', 'vsn'))          # kNN imputation, VSN normalization
Software Version License Commercial Use Installation
limma ≥3.50.0 GPL (≥2) ✅ Permitted BiocManager::install('limma')
DEqMS ≥1.12.0 LGPL ✅ Permitted BiocManager::install('DEqMS')
ExperimentHub ≥2.0.0 Artistic-2.0 ✅ Permitted BiocManager::install('ExperimentHub')
ggplot2 ≥3.4.0 MIT ✅ Permitted install.packages('ggplot2')
ggprism ≥1.0.3 GPL (≥3) ✅ Permitted install.packages('ggprism')
ggrepel ≥0.9.0 GPL-3 ✅ Permitted install.packages('ggrepel')
ComplexHeatmap ≥2.10.0 MIT ✅ Permitted BiocManager::install('ComplexHeatmap')
circlize ≥0.4.15 MIT ✅ Permitted install.packages('circlize')
matrixStats ≥0.60.0 Artistic-2.0 ✅ Permitted install.packages('matrixStats')
rmarkdown ≥2.20 GPL-3 ✅ Permitted Optional

Note: Scripts automatically generate both PNG and SVG formats. SVG export uses base R svg() device (always available) or svglite if installed. No additional setup needed.

Inputs

Required:

  • Protein intensity matrix: Rows = proteins, Columns = samples
  • PSM-level table with gene/protein column (recommended — enables medianSweeping aggregation)
  • OR protein-level intensities (log2 or raw)
  • Sample metadata: data.frame with condition column

Optional but recommended:

  • PSM/peptide counts per protein (critical for DEqMS variance correction)

Supported formats: MaxQuant proteinGroups.txt, Proteome Discoverer export, generic CSV/TSV

Outputs

Result tables (CSV):

  • all_results.csv — Full DEqMS results (logFC, sca.P.Value, sca.adj.pval, count)
  • significant_results.csv — Filtered by adjusted p-value and fold change
  • normalized_protein_matrix.csv — Log2 normalized protein intensities
  • psm_counts.csv — PSM counts per protein
  • top100_proteins.csv — Top 100 by DEqMS adjusted p-value

Analysis objects (RDS):

  • analysis_object.rds — Complete analysis object for downstream skills
  • Load with: obj <- readRDS('results/analysis_object.rds')
  • Contains: fit_deqms, deqms_results, protein_matrix, metadata, psm_counts

Plots (PNG + SVG):

  • intensity_distribution — Before/after normalization boxplots
  • missing_values_heatmap — Missing value pattern across samples
  • pca_plot — PCA colored by condition
  • sample_correlation_heatmap — Pearson correlation between samples
  • volcano_plot — Differential expression with labeled top hits
  • ma_plot — Log2 fold change vs average expression
  • variance_psm_plot — DEqMS variance vs PSM count relationship

Reports:

  • analysis_report.pdf — PDF report (requires rmarkdown + LaTeX)
  • analysis_report.md — Markdown report (always generated)

Clarification Questions

1. Input Files (ASK THIS FIRST):

2. Analysis Options (structured):

3. Thresholds:

Standard Workflow

🚨 MANDATORY: USE SCRIPTS EXACTLY AS SHOWN - DO NOT WRITE INLINE CODE 🚨

Step 1 - Load data:

source("scripts/load_example_data.R")
data <- load_example_data()
psm_data <- data$psm_data
metadata <- data$metadata

For your own data: Replace with your loading code, then call validate_input_data().

Step 2 - Run DE analysis:

source("scripts/basic_workflow.R")

DO NOT expand this into inline code. DO NOT write limma/DEqMS steps manually. Just source the script.

Step 3 - Generate plots:

source("scripts/qc_plots.R")
generate_all_plots(fit_deqms, deqms_results, protein_matrix,
                    metadata, output_dir = "results", raw_matrix = raw_matrix)

DO NOT write inline plotting code (ggsave, ggplot, Heatmap, etc.). Just use the script.

Step 4 - Export results:

source("scripts/export_results.R")
export_all(fit_deqms, deqms_results, protein_matrix, metadata,
            output_dir = "results")

DO NOT write custom export code. Use export_all() to save all outputs including RDS.

✅ VERIFICATION - You should see:

  • After Step 1: "✓ Example data loaded successfully" with PSM/protein counts
  • After Step 2: "✓ Proteomics DE analysis completed successfully!" with summary
  • After Step 3: "✓ All plots generated successfully!"
  • After Step 4: "=== Export Complete ===" with file list

❌ IF YOU DON'T SEE THESE MESSAGES: You wrote inline code. Stop and use source().

⚠️ CRITICAL - DO NOT:

  • Write inline limma/DEqMS codeSTOP: Use source("scripts/basic_workflow.R")
  • Write inline plotting codeSTOP: Use generate_all_plots()
  • Write custom export codeSTOP: Use export_all()
  • Try to install svglite → scripts handle SVG fallback automatically
  • Use absolute paths → Always use scripts/file.R

⚠️ IF SCRIPTS FAIL - Script Failure Hierarchy:

  1. Fix and Retry (90%) - Install missing package, re-run script
  2. Modify Script (5%) - Edit the script file itself, document changes
  3. Use as Reference (4%) - Read script, adapt approach, cite source
  4. Write from Scratch (1%) - Only if genuinely impossible, explain why

NEVER skip directly to writing inline code without trying the script first.

What the scripts provide:

Customizing the Analysis

To change the comparison (before sourcing basic_workflow.R):

comparison_name <- "miR519-ctrl"  # or any valid contrast
source("scripts/basic_workflow.R")

To change imputation/normalization:

imputation_method <- "kNN"         # "MinProb" (default) or "kNN"
normalization_method <- "quantile" # "median" (default), "quantile", or "none"
source("scripts/basic_workflow.R")

For detailed method documentation: See references/proteomics-reference.md For normalization guidance: See references/normalization-guide.md

Common Issues

Issue Cause Fix
Not seeing verification messages Wrote inline code instead of source() Stop and use Standard Workflow commands exactly
"cannot open file" error Using absolute paths Use relative paths: source("scripts/file.R")
ExperimentHub download fails Network timeout Set options(timeout = 300) and retry
Missing package errors Package not installed BiocManager::install('package') or install.packages('package')
SVG export error "svglite required" Missing optional dependency Use generate_all_plots() — it handles fallback automatically. DO NOT try to install svglite manually
svglite dependency conflict System library version mismatch Normal — generate_all_plots() falls back to base R svg() device automatically. Both PNG and SVG will be created
All proteins filtered out Too stringent missing value filter Adjust filter threshold in basic_workflow.R
No significant proteins Weak effect or wrong comparison Check PCA for condition separation; try relaxed thresholds

Suggested Next Steps

After running this skill:

  1. Pathway enrichment → functional-enrichment skill with significant proteins
  2. Biomarker panel → lasso-biomarker-panel with DE proteins as features
  3. Network analysis → coexpression-network with protein matrix
  4. Gene list processing → de-results-to-gene-lists for annotation

Related Skills

Skill Relationship When to Use
bulk-rnaseq-counts-to-de-deseq2 Alternative RNA-seq count data (not proteomics)
lasso-biomarker-panel Downstream Build biomarker panel from DE proteins
coexpression-network Downstream Protein co-expression modules

References

Code preview

scripts/basic_workflow.R

# Proteomics differential expression workflow using limma + DEqMS
# Source this script after loading data with load_example_data.R
#
# Expects in calling environment:
#   psm_data  - PSM-level data.frame from load_example_data() or user data
#   metadata  - Sample metadata data.frame with 'condition' column
#
# Creates in calling environment:
#   protein_matrix  - Log2 normalized protein intensities (proteins x samples)
#   raw_matrix      - Pre-normalization log2 protein intensities
#   fit_deqms       - DEqMS fit object (augmented limma fit)
#   deqms_results   - DEqMS results data.frame
#   psm_counts      - Named vector of PSM counts per protein
#   comparison_name - String describing the contrast

cat("\n=== Proteomics DE Analysis (limma + DEqMS) ===\n\n")

# ---- Load required packages ----
library(limma)
library(DEqMS)
library(matrixStats)

# ---- Configuration ----
# These can be overridden before sourcing this script
if (!exists("comparison_name")) {
    comparison_name <- "miR372-ctrl"
}
if (!exists("padj_threshold")) {
    padj_threshold <- 0.05
}
if (!exists("lfc_threshold")) {
    lfc_threshold <- 0.58  # log2(1.5) — standard proteomics threshold (1.5-fold change)
}
if (!exists("imputation_method")) {
    imputation_method <- "MinProb"  # "MinProb" or "kNN"
}
if (!exists("normalization_method")) {
    normalization_method <- "median"  # "median", "quantile", or "none"
}

# ---- Validate inputs ----
if (!exists("psm_data") && !exists("protein_matrix")) {
    stop("No input data found. Run load_example_data() first or provide psm_data/protein_matrix.")
}
if (!exists("metadata")) {
    stop("No metadata found. Run load_example_data() first or provide metadata.")
}

# ---- Step 1: PSM-to-protein aggregation ----
if (exists("psm_data") && !exists("protein_matrix")) {
    cat("1. Aggregating PSMs to protein level (medianSweeping)...\n")

    # Identify intensity columns (columns 3-12 for example data)
    # For user data, intensity columns are all numeric columns except gene/protein ID
    if ("gene" %in% colnames(psm_data)) {
        gene_col <- which(colnames(psm_data) == "gene")
        # Get sample columns from metadata
        sample_cols <- which(colnames(psm_data) %in% rownames(metadata))
        if (length(sample_cols) == 0) {
            # Fallback: assume intensity columns are numeric columns after first 2
            sample_cols <- 3:ncol(psm_data)
        }
    } else {
        stop("PSM data must have a 'gene' column for protein grouping")
    }

    # Log2 transform PSM intensities
    dat.psm.log <- psm_data
    dat.psm.log[, sample_cols] <- log2(psm_data[, sample_cols])

    # Replace -Inf (from log2(0)) with NA
    for (col in sample_cols) {
        dat.psm.log[is.infinite(dat.psm.log[, col]), col] <- NA
    }

    # Aggregate PSMs to protein level using medianSweeping
    # group_col = column index of gene/protein ID (typically column 2)
    protein_matrix <- as.matrix(medianSweeping(dat.psm.log, group_col = gene_col))

    # Count PSMs per protein

scripts/export_results.R

# Export proteomics DE results and analysis objects
# Generates all output files including RDS for downstream skills

#' Export all proteomics DE results
#'
#' @param fit_deqms DEqMS fit object from basic_workflow.R
#' @param deqms_results DEqMS results data.frame from basic_workflow.R
#' @param protein_matrix Normalized protein intensity matrix
#' @param metadata Sample metadata data.frame
#' @param psm_counts Named vector of PSM counts (optional, extracted from fit if NULL)
#' @param comparison_name Name of the comparison
#' @param output_dir Output directory (default: "results")
#' @param padj_threshold Adjusted p-value threshold (default: 0.05)
#' @param lfc_threshold Log2 fold change threshold (default: 0.58 = 1.5-fold)
#' @export
export_all <- function(fit_deqms, deqms_results, protein_matrix, metadata,
                        psm_counts = NULL, comparison_name = NULL,
                        output_dir = "results",
                        padj_threshold = 0.05, lfc_threshold = 0.58) {

    cat("\n=== Exporting Proteomics DE Results ===\n\n")

    # Create output directory
    if (!dir.exists(output_dir)) {
        dir.create(output_dir, recursive = TRUE)
        cat("Created directory:", output_dir, "\n\n")
    }

    # Infer comparison name if not provided
    if (is.null(comparison_name) && exists("comparison_name", envir = parent.frame())) {
        comparison_name <- get("comparison_name", envir = parent.frame())
    }
    if (is.null(comparison_name)) comparison_name <- "comparison"

    # Get PSM counts from fit if not provided
    if (is.null(psm_counts) && !is.null(fit_deqms$count)) {
        psm_counts <- fit_deqms$count
    }

    # 1. All DEqMS results
    cat("1. Exporting all DEqMS results...\n")
    write.csv(deqms_results,
              file.path(output_dir, "all_results.csv"),
              row.names = FALSE)
    cat("   Saved: all_results.csv (", nrow(deqms_results), " proteins)\n\n")

    # 2. Significant results
    cat("2. Exporting significant results...\n")
    sig <- deqms_results[!is.na(deqms_results$sca.adj.pval) &
                          deqms_results$sca.adj.pval < padj_threshold &
                          abs(deqms_results$logFC) > lfc_threshold, ]
    sig <- sig[order(sig$sca.adj.pval), ]
    write.csv(sig,
              file.path(output_dir, "significant_results.csv"),
              row.names = FALSE)
    n_up <- sum(sig$logFC > 0)
    n_down <- sum(sig$logFC < 0)
    cat("   Saved: significant_results.csv (", nrow(sig), " proteins:",
        n_up, "up,", n_down, "down)\n")
    cat("   Thresholds: sca.adj.pval <", padj_threshold,
        ", |logFC| >", lfc_threshold, "\n\n")

    # 3. Normalized protein matrix
    cat("3. Exporting normalized protein matrix...\n")
    write.csv(protein_matrix,
              file.path(output_dir, "normalized_protein_matrix.csv"),
              row.names = TRUE)
    cat("   Saved: normalized_protein_matrix.csv (",
        nrow(protein_matrix), "x", ncol(protein_matrix), ")\n\n")

    # 4. Analysis object (CRITICAL for downstream skills)
    cat("4. Saving analysis object (RDS)...\n")
    analysis_object <- list(
        fit_deqms = fit_deqms,
        deqms_results = deqms_results,
        protein_matrix = protein_matrix,
        metadata = metadata,
        psm_counts = psm_counts,
        comparison_name = comparison_name,
        thresholds = list(padj = padj_threshold, lfc = lfc_threshold)

scripts/generate_report.R

# Generate PDF analysis report for proteomics DE analysis
# Uses rmarkdown with PDF output (optional dependency)

#' Generate PDF analysis report
#'
#' @param deqms_results DEqMS results data.frame
#' @param metadata Sample metadata data.frame
#' @param comparison_name Name of the comparison (e.g., "miR372-ctrl")
#' @param output_dir Directory containing plots and for output
#' @param n_proteins Total number of proteins tested
#' @return Path to generated PDF, or NULL if generation failed
#' @export
generate_report <- function(deqms_results, metadata,
                             comparison_name = "Treatment vs Control",
                             output_dir = "results",
                             n_proteins = NULL,
                             padj_threshold = 0.05,
                             lfc_threshold = 0.58) {

    # Check rmarkdown availability
    if (!requireNamespace("rmarkdown", quietly = TRUE)) {
        cat("   rmarkdown not installed - skipping PDF report\n")
        cat("   Install with: install.packages('rmarkdown')\n")
        return(NULL)
    }

    # Check for LaTeX
    has_latex <- FALSE
    if (requireNamespace("tinytex", quietly = TRUE)) {
        has_latex <- tinytex::is_tinytex() || nchar(Sys.which("xelatex")) > 0
    }
    if (!has_latex) {
        has_latex <- nchar(Sys.which("pdflatex")) > 0 || nchar(Sys.which("xelatex")) > 0
    }

    if (!has_latex) {
        cat("   No LaTeX installation found - skipping PDF report\n")
        cat("   Install with: tinytex::install_tinytex()\n")
        return(NULL)
    }

    cat("   Generating PDF report...\n")

    # Compute summary stats
    if (is.null(n_proteins)) n_proteins <- nrow(deqms_results)
    n_sig <- sum(deqms_results$sca.adj.pval < padj_threshold &
                  abs(deqms_results$logFC) > lfc_threshold, na.rm = TRUE)
    n_up <- sum(deqms_results$sca.adj.pval < padj_threshold &
                 deqms_results$logFC > lfc_threshold, na.rm = TRUE)
    n_down <- sum(deqms_results$sca.adj.pval < padj_threshold &
                   deqms_results$logFC < -lfc_threshold, na.rm = TRUE)
    n_samples <- nrow(metadata)
    conditions <- paste(levels(metadata$condition), collapse = ", ")

    # Top 20 proteins table
    top20 <- head(deqms_results[order(deqms_results$sca.adj.pval), ], 20)
    top20_table <- data.frame(
        Protein = top20$protein,
        logFC = sprintf("%.3f", top20$logFC),
        `DEqMS.adj.pval` = sprintf("%.2e", top20$sca.adj.pval),
        `limma.adj.pval` = sprintf("%.2e", top20$adj.P.Val),
        PSM.count = top20$count,
        check.names = FALSE
    )

    # Find available plot files
    plot_files <- list.files(output_dir, pattern = "\\.png$", full.names = TRUE)

    # Build Rmd content
    rmd_content <- paste0(
'---
title: "Proteomics Differential Expression Report"
subtitle: "limma + DEqMS Analysis"
date: "', format(Sys.Date(), "%B %d, %Y"), '"
output:
  pdf_document:
    toc: true
    toc_depth: 2
    number_sections: true
---

Companion files

TypePathBytes
Markdownreferences/normalization-guide.md3,020
Markdownreferences/proteomics-reference.md4,882
Rscripts/basic_workflow.R10,225
Rscripts/export_results.R8,366
Rscripts/generate_report.R8,514
Rscripts/load_example_data.R6,831
Rscripts/qc_plots.R15,392
MarkdownSKILL.md11,510
JSONskill.meta.json1,858