⌂ / Overview / Proteomics & Multi-omics / Serum Proteomics, Longitudinal

View companion source

Serum Proteomics, Longitudinal

A full DIA-MS pipeline for longitudinal treatment response.

Overview

Problem. Find response markers in small, sparse, longitudinal data.

Use when: DIA-MS, longitudinal, small cohorts (n=8–20)

Avoid when: Discarding missingness as mere noise

Learning goals

Missingness itself can be signal (MNAR)
Non-parametric for small n; mixed models for longitudinal

Figures

Tutorial

When to Use This Skill

Use this skill when ALL of the following are true:

You have serum or plasma proteomics data (DIA-MS, Olink, SomaScan, or similar)
The study has ≥2 longitudinal timepoints (e.g., baseline T1, mid-treatment T2, end T3)
Samples are from ≥2 response groups (e.g., Early Responder, Poor Responder)
You want MNAR detection, longitudinal modelling, or biomarker tiering — not just standard DE

Do NOT use this skill for:

TMT/LFQ proteomics with PSM counts → use proteomics-diff-exp (limma + DEqMS)
LASSO panel with nested cross-validation → use lasso-biomarker-panel
Pathway enrichment of DE proteins → use functional-enrichment-from-degs
Single-timepoint case-control proteomics without MNAR concern → use proteomics-diff-exp

Inputs

Input	Format	Required
Protein intensity matrix	CSV: rows = protein names, cols = sample IDs; `NA` = not detected (below LOD)	Required
Sample metadata	CSV: `sample_id`, `patient_id`, `timepoint` (T1/T2/T3), `response_group` + optional covariates	Required
Clinical score table	CSV: `patient_id`, `timepoint`, `PASI` (or equivalent score)	Optional — enables trajectory clustering
Cross-species DE table	CSV: `protein`, `log2FC`, `p_value`, `direction` (from mouse model)	Optional — enables cross-species tier bonus

Input format notes

Intensity values: Log2-transformed preferred. If raw intensities (median > 100), scripts auto-apply log2.
NA encoding: Use NA or empty cells for non-detected proteins. Do NOT use 0 — zero is a valid intensity.
Sample IDs: Must match between intensity matrix column names and metadata sample_id column.
Protein names: Use gene symbols (CHGA, SCG2, ITIH1) or UniProt accessions. Consistent naming required.
Supported platforms: Spectronaut 16 (DIA-MS), MaxQuant (LFQ), Olink NPX, SomaScan RFU — any platform that produces a protein × sample matrix with NA for non-detection.

Outputs

File	Description
`results/mnar_results.csv`	Per-protein MNAR classification: detection rates by group, Fisher p, Fisher adj_p, MNAR direction
`results/de_results.csv`	Full DE table: log2FC, Wilcoxon p, BH adj_p, MNAR flag, direction
`results/de_results_significant.csv`	Significant DE proteins only (adj_p < threshold OR MNAR)
`results/lme_results.csv`	Per-protein LME: time×group interaction p, adj_p, trajectory type, T2 coefficient
`results/trajectory_clusters.csv`	Patient-level cluster assignments, cluster label, Flare→Clear flag
`results/biomarker_tiers.csv`	Primary output: Tier 1/2/3 ranking with MNAR flag, LME interaction p, cross-species flag, mechanistic annotation
`results/biomarker_tier1_priority.csv`	Tier 1 proteins only — immediate ELISA validation candidates
`results/analysis_object.rds`	Complete pipeline object for downstream skills
`results/analysis_summary.txt`	Plain-text summary of all modules

Installation

options(repos = c(CRAN = "https://cloud.r-project.org"))

# Core (required)
install.packages(c("dplyr", "tidyr", "nlme"))

# Trajectory clustering (strongly recommended)
install.packages(c("dtw", "cluster"))

# Downstream skills
if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("limma")   # for proteomics-diff-exp upstream
install.packages(c("glmnet", "pROC"))  # for lasso-biomarker-panel downstream

Package	Version	Purpose
dplyr	≥1.1	Data manipulation
tidyr	≥1.3	Pivoting
nlme	≥3.1	Mixed-effects models
dtw	≥1.22	DTW distance for trajectory clustering
cluster	≥2.1	Silhouette width for optimal k

Standard Workflow

MANDATORY: SOURCE SCRIPTS — DO NOT WRITE INLINE CODE

Step 0 — Load data

# Load your protein intensity matrix (rows = proteins, cols = samples, NA = not detected)
intensity_matrix <- read.csv("your_protein_matrix.csv", row.names = 1, check.names = FALSE)

# Load sample metadata
metadata <- read.csv("your_metadata.csv", stringsAsFactors = FALSE)
# Required columns: sample_id, patient_id, timepoint, response_group

# Optional: clinical score table for trajectory clustering
score_table <- read.csv("your_pasi_scores.csv", stringsAsFactors = FALSE)
# Required columns: patient_id, timepoint, PASI (or your score column)

# Optional: cross-species DE table
xspecies_df <- read.csv("your_mouse_de.csv", stringsAsFactors = FALSE)
# Required columns: protein, log2FC, p_value, direction

VERIFICATION: Check dimensions and NA pattern:

cat(sprintf("Matrix: %d proteins × %d samples\n", nrow(intensity_matrix), ncol(intensity_matrix)))
cat(sprintf("Missing values: %.1f%%\n", 100 * mean(is.na(intensity_matrix))))
cat(sprintf("Metadata: %d rows, groups: %s\n",
    nrow(metadata), paste(unique(metadata$response_group), collapse=", ")))

Step 1 — MNAR Detection

source("scripts/01_mnar_detection.R")

mnar_result <- run_mnar_detection(
  intensity_matrix = intensity_matrix,
  metadata         = metadata,
  params = list(
    baseline_timepoint = "T1",       # timepoint for MNAR test
    response_col       = "response_group",
    sample_col         = "sample_id",
    mnar_fisher_p      = 0.05        # Fisher p threshold
  )
)

# Optional: check granin dissociation pattern (CHGA vs SCG2)
check_granin_dissociation(mnar_result$mnar_table,
  params = list(responder_group = "Early"))

VERIFICATION: You MUST see "✓ MNAR detection complete." with a count of MNAR proteins.

DO NOT impute MNAR proteins before this step. Imputation destroys the MNAR signal.

Step 2 — Wilcoxon Differential Expression

source("scripts/02_wilcoxon_de.R")

de_result <- run_wilcoxon_de(
  intensity_matrix = intensity_matrix,
  metadata         = metadata,
  mnar_table       = mnar_result$mnar_table,   # pass MNAR flags
  params = list(
    de_timepoint      = "T1",          # baseline comparison
    response_col      = "response_group",
    sample_col        = "sample_id",
    group_a           = "Early",       # numerator group
    group_b           = "Poor",        # denominator group
    de_adj_p          = 0.05,
    de_log2fc         = 0.58,
    min_detected_frac = 0.5
  )
)

# Quick volcano summary (text)
summarise_volcano(de_result$de_table)

VERIFICATION: You MUST see "✓ Wilcoxon DE complete." with significant protein counts.

Note: MNAR proteins are automatically appended to the DE table with is_mnar = TRUE and wilcox_p = NA. They are included in the biomarker tiering step.

Step 3 — Longitudinal Mixed-Effects Modelling

source("scripts/03_longitudinal_lme.R")

lme_result <- run_longitudinal_lme(
  intensity_matrix = intensity_matrix,
  metadata         = metadata,
  params = list(
    timepoints        = c("T1", "T2", "T3"),
    timepoint_col     = "timepoint",
    response_col      = "response_group",
    patient_col       = "patient_id",
    covariates        = c("baseline_PASI", "age", "sex"),  # adjust to your metadata
    ref_timepoint     = "T1",
    ref_group         = "Poor",          # reference group for contrasts
    lme_interaction_p = 0.05,
    t2_surge_group    = "Early"          # group expected to show T2 surge
  )
)

VERIFICATION: You MUST see "✓ Longitudinal LME complete." with interaction counts.

Note on convergence: 5–15% of proteins may fail to converge — this is normal. They are flagged as model_status = "model_failed" and excluded from interaction ranking but retained in the DE table.

Skip this step if: You have only 1 timepoint, or fewer than 3 patients with repeated measures.

Step 4 — PASI Trajectory Clustering (Optional)

source("scripts/04_trajectory_clustering.R")

cluster_result <- run_trajectory_clustering(
  score_table = score_table,
  params = list(
    patient_col   = "patient_id",
    timepoint_col = "timepoint",
    score_col     = "PASI",
    timepoints    = c("T1", "T2", "T3"),   # NULL = auto-detect
    k_range       = 2:7,
    flare_ratio   = 1.2,    # T2/T1 > 1.2 = flare
    clear_ratio   = 0.5     # T3/T1 < 0.5 = PASI50 response
  )
)

VERIFICATION: You MUST see "✓ Trajectory clustering complete." with cluster summary table.

CRITICAL CLINICAL NOTE: If Flare→Clear patients are identified, do NOT recommend treatment discontinuation at 12 weeks based on T2 PASI alone. These patients show transient worsening before resolution — a pattern that would be misclassified as non-response under a standard 12-week endpoint.

Skip this step if: You have no clinical score data, or only 1 timepoint.

Step 5 — Biomarker Tiering

source("scripts/05_biomarker_tiering.R")

tier_result <- run_biomarker_tiering(
  de_result   = de_result,
  lme_result  = lme_result,    # NULL if Step 3 was skipped
  xspecies_df = xspecies_df,   # NULL if no cross-species data
  params = list(
    tier1_mnar_adj_p      = 0.001,
    tier1_lme_adj_p       = 0.001,
    tier1_require_t2surge = TRUE,
    tier2_de_adj_p        = 0.01,
    tier2_de_log2fc       = 1.0,
    tier3_de_adj_p        = 0.05,
    tier3_de_log2fc       = 0.58
  )
)

VERIFICATION: You MUST see "✓ Biomarker tiering complete." with Tier 1/2/3 counts.

Step 6 — Export All Results

source("scripts/06_export_results.R")

export_all(
  mnar_result    = mnar_result,
  de_result      = de_result,
  lme_result     = lme_result,       # NULL if skipped
  cluster_result = cluster_result,   # NULL if skipped
  tier_result    = tier_result,
  output_dir     = "results",
  study_label    = "MMIP_psoriasis"  # change to your study name
)

VERIFICATION: You MUST see "=== Export Complete ===" with file list.

Parameters Reference

Parameter	Default	Description
`baseline_timepoint`	`"T1"`	Timepoint used for MNAR test and baseline DE
`timepoints`	`c("T1","T2","T3")`	All timepoints in longitudinal order
`response_col`	`"response_group"`	Column name for responder classification
`patient_col`	`"patient_id"`	Column for random effect (repeated measures)
`sample_col`	`"sample_id"`	Column linking metadata to matrix columns
`score_col`	`"PASI"`	Clinical score column for trajectory clustering
`group_a`	first group	Numerator group for log2FC (positive = higher in A)
`group_b`	second group	Denominator group for log2FC
`mnar_fisher_p`	`0.05`	Fisher p threshold for MNAR classification
`de_adj_p`	`0.05`	BH-adjusted p threshold for DE significance
`de_log2fc`	`0.58`	\|log2FC\| threshold (~1.5-fold)
`lme_interaction_p`	`0.05`	Time×group interaction adj_p threshold
`covariates`	`c()`	Fixed-effect covariates in LME (must be in metadata)
`t2_surge_group`	`NULL`	Group expected to show T2 surge (for trajectory classification)
`flare_ratio`	`1.2`	T2/T1 ratio threshold for "flare" classification
`clear_ratio`	`0.5`	T3/T1 ratio threshold for "clear" (PASI50 equivalent)
`tier1_mnar_adj_p`	`0.001`	Fisher adj_p threshold for Tier 1 MNAR
`tier1_lme_adj_p`	`0.001`	LME interaction adj_p threshold for Tier 1
`tier2_de_adj_p`	`0.01`	DE adj_p threshold for Tier 2
`tier2_de_log2fc`	`1.0`	\|log2FC\| threshold for Tier 2 (2-fold)

Worked Example: MMIP Psoriasis Cohort

Study design: n=90 psoriasis patients, DIA-MS serum proteomics, 3 timepoints (T1=baseline, T2=12 weeks, T3=1 year), 3 response groups (Early Responder n=10, Late Responder n=10, Poor Responder n=10 for discovery; n=60 for validation).

Expected results from this pipeline:

Module	Key finding
MNAR detection	CHGA: detected in 1/10 Early Responders vs 9/10 Poor Responders (Fisher p<0.001, FDR q=1.45×10⁻⁹) → Tier 1
MNAR detection	SCG2: inverse pattern — detected in 8/10 Early Responders → granin dissociation
Wilcoxon DE	AGT, CBG, KLKB1, KNG1 elevated in Early Responders (adj_p<0.05) → Tier 2/3
Wilcoxon DE	ITIH1 lower in Poor Responders at T1 (adj_p=0.049) → Tier 3
LME	SAA1, DBI: significant time×group interaction (T2 surge in Early Responders) → Tier 1
LME	ITIH1: progressive decline in Poor Responders (interaction p=0.049) → Tier 2
Trajectory clustering	5 clusters: Stable→Clear, Late→Clear, Flare→Clear, Partial, Non-responder
Trajectory clustering	Flare→Clear patients: T2 PASI worsening then 1yr resolution — do not discontinue
Biomarker tiering	Tier 1: CHGA (MNAR), SAA1 (T2 surge), DBI (T2 surge)
Biomarker tiering	Tier 2: ITIH1, AGT, CBG, KLKB1, KNG1, SCG2
Biomarker panel	CHGA + ITIH1 AUC = 0.87 (95% CI 0.79–0.95) for Early vs Poor Responder

Biological interpretation:

CHGA MNAR in Early Responders = chromaffin granule depletion = "low-reserve" sympatho-adrenal state. These patients have already mobilised their catecholamine stores before treatment, making them primed to respond to psychological intervention.
SAA1 T2 surge in Early Responders = acute-phase reactogenicity at 12 weeks = the immune system is actively responding to the psychological therapy. This is a marker of engagement, not failure.
ITIH1 decline in Poor Responders = progressive ECM destabilisation = ongoing inflammation without resolution.

Clarification Questions

ALWAYS ask Question 1 FIRST.

1. Data availability

Do you have a protein intensity matrix with NA values for non-detected proteins?
If yes: proceed to Question 2
If no (all proteins quantified, no NAs): MNAR step will find nothing — consider proteomics-diff-exp instead

2. Study design

How many timepoints? (2 = skip LME; 3+ = run full longitudinal model)
How many patients per response group? (< 5 per group = results unreliable; 8–15 = typical discovery cohort)
Do you have clinical score data (PASI or equivalent) for trajectory clustering?

3. Response group definition

Are response groups pre-defined in your metadata, or do you need to derive them from clinical scores?
Pre-defined: proceed directly
Need to derive: use PASI50/75/90 thresholds or trajectory clustering first, then re-run DE

4. Covariates

Which covariates are available in your metadata for the LME model?
Recommended: baseline clinical score, age, sex
If none available: run without covariates (set covariates = c())

Common Issues

Issue	Cause	Fix
`"No overlap between metadata sample_ids and intensity_matrix column names"`	Column name mismatch	Check `colnames(intensity_matrix)` vs `metadata$sample_id` — must be identical strings
`"No samples found for timepoint 'T1'"`	Wrong timepoint label	Check `unique(metadata$timepoint)` and set `baseline_timepoint` to match
All proteins show `mnar_class = "undetected_all"`	NA encoding issue	Ensure non-detected proteins are `NA`, not `0` or `""`
LME convergence failures > 50%	Too few repeated measures	Check each patient has ≥2 timepoints; reduce covariates if overparameterised
`dtw` package not found	Not installed	`install.packages("dtw")` — falls back to Euclidean distance automatically
`cluster` package not found	Not installed	`install.packages("cluster")` — falls back to k=3 automatically
Tier 1 is empty	Thresholds too stringent	Relax `tier1_mnar_adj_p` to 0.01 or `tier1_lme_adj_p` to 0.01
log2FC direction unexpected	group_a/group_b swapped	log2FC = median(group_a) - median(group_b); swap group_a and group_b
`"Need at least 4 patients for clustering"`	Too few patients	Trajectory clustering requires ≥4 patients; skip Step 4 for very small cohorts

Interpretation Guidelines

MNAR results

MNAR_up_in_X: protein is MORE OFTEN detected in group X → group X has higher circulating levels (above LOD)
MNAR_down_in_X: protein is LESS OFTEN detected in group X → group X has lower levels (below LOD)
CHGA MNAR in responders = responders have LOWER CHGA (below LOD) = chromaffin granule depletion

DE results

log2FC > 0 = higher in group_a; log2FC < 0 = higher in group_b
MNAR proteins have wilcox_p = NA — their significance comes from fisher_adj_p
Do NOT interpret MNAR proteins using log2FC alone

LME results

trajectory_type = "T2_surge" = protein rises at T2 specifically in the surge group
trajectory_type = "T2_dip" = protein falls at T2 specifically in the surge group
model_status = "model_failed" = convergence failure — not a biological finding

Biomarker tiers

Tier 1 (★★★): Recommend for immediate ELISA validation in independent cohort
Tier 2 (★★): Recommend for multiplexed panel (Luminex/Olink) validation
Tier 3 (★): Hypothesis-generating only; do not prioritise for validation without additional evidence
Mechanistic annotation: "unknown" is not a negative finding — it may indicate a novel axis

AUC / ROC

AUC values reported in analysis_summary.txt are from the discovery cohort only
These are expected to be optimistic; external validation is required before clinical claims
For validated AUC, use lasso-biomarker-panel with an independent cohort

Suggested Next Steps

After running this skill:

ELISA validation of Tier 1 proteins in independent cohort → use biomarker_tier1_priority.csv
LASSO panel from Tier 1+2 proteins → lasso-biomarker-panel (input: de_results_significant.csv)
Pathway enrichment of significant DE proteins → functional-enrichment-from-degs
Mendelian randomisation for causal inference on top candidates → mendelian-randomization-twosamplemr
Cross-species validation in mouse model (IMQ psoriasis) → re-run with mouse serum proteomics and pass as xspecies_df

Related Skills

Skill	Relationship	When to use
`proteomics-diff-exp`	Upstream	PSM aggregation, TMT/LFQ normalization before this skill
`lasso-biomarker-panel`	Downstream	Build minimal ELISA panel from Tier 1+2 proteins
`functional-enrichment-from-degs`	Downstream	Pathway enrichment of DE protein list
`mendelian-randomization-twosamplemr`	Downstream	Causal inference for top biomarker candidates
`disease-progression-longitudinal`	Alternative	Pseudotime-based trajectory inference (single-cell)
`survival-analysis-clinical`	Downstream	Time-to-response analysis using biomarker tiers

References

MNAR handling: Lazar C, et al. (2016) J Proteome Res 15(4):1116–1125
Wilcoxon for small proteomics: Kammers K, et al. (2015) EuPA Open Proteomics 7:11–19
nlme mixed-effects: Pinheiro J, Bates D (2000) Mixed-Effects Models in S and S-PLUS. Springer
DTW clustering: Sakoe H, Chiba S (1978) IEEE Trans Acoust 26(1):43–49
BH correction: Benjamini Y, Hochberg Y (1995) J R Stat Soc B 57(1):289–300
MMIP study: Wan F, et al. (in preparation) — Sympatho-adrenal hypothesis for psoriasis treatment response
See references/method-notes.md for full statistical rationale

Code preview

scripts/01_mnar_detection.R

# =============================================================================
# 01_mnar_detection.R
# Missing-Not-At-Random (MNAR) protein detection for serum proteomics
#
# Purpose:
#   Identify proteins whose absence from serum is non-random with respect to
#   treatment response group. In DIA-MS data, a protein below the detection
#   limit (NA) carries biological information distinct from a low-abundance
#   quantified protein. This script separates MNAR proteins from quantified
#   proteins before downstream DE analysis.
#
# Inputs (loaded from environment or passed as arguments):
#   intensity_matrix  - data.frame, rows = proteins, cols = sample_ids
#                       NA values indicate non-detection (below LOD)
#   metadata          - data.frame with columns: sample_id, patient_id,
#                       timepoint, response_group
#   params            - list of analysis parameters (see run_mnar_detection())
#
# Outputs:
#   mnar_results.csv  - per-protein MNAR classification table
#   Returns:          - list with $mnar_table and $mnar_proteins vector
# =============================================================================

suppressPackageStartupMessages({
  library(dplyr)
  library(tidyr)
})

# -----------------------------------------------------------------------------
# Main function
# -----------------------------------------------------------------------------

run_mnar_detection <- function(
    intensity_matrix,
    metadata,
    params = list()
) {
  # --- Default parameters ---
  p <- modifyList(
    list(
      baseline_timepoint  = "T1",          # timepoint to use for MNAR test
      timepoint_col       = "timepoint",
      response_col        = "response_group",
      sample_col          = "sample_id",
      mnar_fisher_p       = 0.05,          # Fisher p threshold
      min_detected_any    = 2,             # min detections in any group to test
      responder_label     = NULL           # e.g. "Early"; NULL = first level
    ),
    params
  )

  cat("=== MNAR Detection ===\n")
  cat(sprintf("Baseline timepoint: %s\n", p$baseline_timepoint))
  cat(sprintf("Fisher p threshold: %.3f\n", p$mnar_fisher_p))

  # --- Subset metadata to baseline timepoint ---
  meta_t1 <- metadata %>%
    filter(.data[[p$timepoint_col]] == p$baseline_timepoint)

  if (nrow(meta_t1) == 0) {
    stop(sprintf(
      "No samples found for timepoint '%s'. Check params$baseline_timepoint.",
      p$baseline_timepoint
    ))
  }

  # --- Align matrix to baseline samples ---
  t1_samples <- meta_t1[[p$sample_col]]
  t1_samples <- intersect(t1_samples, colnames(intensity_matrix))

  if (length(t1_samples) == 0) {
    stop("No overlap between metadata sample_ids and intensity_matrix column names.")
  }

  mat_t1 <- intensity_matrix[, t1_samples, drop = FALSE]
  meta_t1 <- meta_t1[meta_t1[[p$sample_col]] %in% t1_samples, ]

  groups <- unique(meta_t1[[p$response_col]])
  cat(sprintf("Response groups: %s\n", paste(groups, collapse = ", ")))
  cat(sprintf("Proteins to test: %d\n", nrow(mat_t1)))

scripts/02_wilcoxon_de.R

# =============================================================================
# 02_wilcoxon_de.R
# Wilcoxon rank-sum differential expression for small serum proteomics cohorts
#
# Purpose:
#   Non-parametric DE for discovery cohorts (typically n=8–15 per group) where
#   normality cannot be assumed. Handles MNAR proteins separately: they are
#   flagged from 01_mnar_detection.R and excluded from the Wilcoxon test.
#   log2FC is computed as the difference of group medians (on log2 scale).
#
# Inputs:
#   intensity_matrix  - data.frame, rows = proteins, cols = sample_ids (log2)
#   metadata          - data.frame: sample_id, patient_id, timepoint, response_group
#   mnar_table        - output$mnar_table from run_mnar_detection() [optional]
#   params            - list of analysis parameters
#
# Outputs:
#   de_results.csv    - full DE table with log2FC, p-value, adj_p, MNAR flag
#   Returns:          - list with $de_table
# =============================================================================

suppressPackageStartupMessages({
  library(dplyr)
  library(tidyr)
})

# -----------------------------------------------------------------------------
# Main function
# -----------------------------------------------------------------------------

run_wilcoxon_de <- function(
    intensity_matrix,
    metadata,
    mnar_table  = NULL,
    params      = list()
) {
  p <- modifyList(
    list(
      de_timepoint        = "T1",          # timepoint for baseline DE
      timepoint_col       = "timepoint",
      response_col        = "response_group",
      sample_col          = "sample_id",
      group_a             = NULL,          # "numerator" group (e.g. "Early")
      group_b             = NULL,          # "denominator" group (e.g. "Poor")
      de_adj_p            = 0.05,
      de_log2fc           = 0.58,          # ~1.5-fold
      min_detected_frac   = 0.5,           # min fraction detected in ≥1 group
      log2_transform      = TRUE           # apply log2 if not already done
    ),
    params
  )

  cat("=== Wilcoxon Differential Expression ===\n")
  cat(sprintf("Timepoint: %s\n", p$de_timepoint))

  # --- Subset to DE timepoint ---
  meta_de <- metadata %>%
    filter(.data[[p$timepoint_col]] == p$de_timepoint)

  de_samples <- intersect(meta_de[[p$sample_col]], colnames(intensity_matrix))
  mat_de <- intensity_matrix[, de_samples, drop = FALSE]
  meta_de <- meta_de[meta_de[[p$sample_col]] %in% de_samples, ]

  # --- Determine groups ---
  all_groups <- unique(meta_de[[p$response_col]])
  if (is.null(p$group_a) || is.null(p$group_b)) {
    if (length(all_groups) < 2) stop("Need at least 2 response groups.")
    p$group_a <- as.character(all_groups[1])
    p$group_b <- as.character(all_groups[2])
    cat(sprintf("Auto-selected groups: A=%s vs B=%s\n", p$group_a, p$group_b))
  }
  cat(sprintf("Comparison: %s (A) vs %s (B)\n", p$group_a, p$group_b))
  cat(sprintf("log2FC = median(A) - median(B)\n"))

  samps_a <- meta_de[[p$sample_col]][meta_de[[p$response_col]] == p$group_a]
  samps_b <- meta_de[[p$sample_col]][meta_de[[p$response_col]] == p$group_b]
  samps_a <- intersect(samps_a, colnames(mat_de))
  samps_b <- intersect(samps_b, colnames(mat_de))

  cat(sprintf("n(A)=%d, n(B)=%d\n", length(samps_a), length(samps_b)))

scripts/03_longitudinal_lme.R

# =============================================================================
# 03_longitudinal_lme.R
# Longitudinal mixed-effects modelling for serum proteomics
#
# Purpose:
#   Fit per-protein linear mixed-effects models to detect proteins with
#   significant time × response_group interaction — the statistical signature
#   of "psychological reactogenicity" (T2 surge in responders, absent in
#   non-responders). Random intercept per patient accounts for repeated measures.
#
# Model:
#   intensity ~ time * response_group + [covariates] + (1 | patient_id)
#   Fitted with nlme::lme(); time and response_group are treated as factors.
#
# Inputs:
#   intensity_matrix  - data.frame, rows = proteins, cols = sample_ids (log2)
#   metadata          - data.frame: sample_id, patient_id, timepoint,
#                       response_group [+ optional covariate columns]
#   params            - list of analysis parameters
#
# Outputs:
#   lme_results.csv   - per-protein interaction p-value, trajectory type
#   Returns:          - list with $lme_table
# =============================================================================

suppressPackageStartupMessages({
  library(nlme)
  library(dplyr)
  library(tidyr)
})

# -----------------------------------------------------------------------------
# Main function
# -----------------------------------------------------------------------------

run_longitudinal_lme <- function(
    intensity_matrix,
    metadata,
    params = list()
) {
  p <- modifyList(
    list(
      timepoints          = c("T1", "T2", "T3"),
      timepoint_col       = "timepoint",
      response_col        = "response_group",
      sample_col          = "sample_id",
      patient_col         = "patient_id",
      covariates          = c(),            # e.g. c("age", "sex", "baseline_PASI")
      ref_timepoint       = "T1",           # reference level for time factor
      ref_group           = NULL,           # reference level for response_group
      lme_interaction_p   = 0.05,
      min_obs_per_protein = 10,             # min non-NA observations to fit model
      t2_surge_group      = NULL,           # group expected to show T2 surge
      log2_transform      = TRUE
    ),
    params
  )

  cat("=== Longitudinal Mixed-Effects Modelling ===\n")
  cat(sprintf("Timepoints: %s\n", paste(p$timepoints, collapse = " → ")))
  cat(sprintf("Interaction p threshold: %.3f\n", p$lme_interaction_p))

  # --- Subset metadata to relevant timepoints ---
  meta_long <- metadata %>%
    filter(.data[[p$timepoint_col]] %in% p$timepoints)

  long_samples <- intersect(meta_long[[p$sample_col]], colnames(intensity_matrix))
  mat_long <- intensity_matrix[, long_samples, drop = FALSE]
  meta_long <- meta_long[meta_long[[p$sample_col]] %in% long_samples, ]

  # --- Optional log2 transform ---
  if (p$log2_transform) {
    med_val <- median(as.matrix(mat_long), na.rm = TRUE)
    if (med_val > 100) {
      mat_long <- log2(mat_long + 1)
    }
  }

  # --- Set factor levels ---
  meta_long[[p$timepoint_col]] <- factor(

Companion files

Type	Path	Bytes
Markdown	references/method-notes.md	8,589
R	scripts/01_mnar_detection.R	9,294
R	scripts/02_wilcoxon_de.R	9,149
R	scripts/03_longitudinal_lme.R	9,795
R	scripts/04_trajectory_clustering.R	9,596
R	scripts/05_biomarker_tiering.R	12,960
R	scripts/06_export_results.R	9,635
Markdown	SKILL.md	19,502
JSON	skill.meta.json	2,327