⌂ / Overview / Clinical & Biostatistics / Disease Progression

View companion source

Disease Progression

Reconstruct disease trajectories from longitudinal patient omics.

Overview

Problem. Align irregular patient samples onto one progression axis.

Use when: Longitudinal patient data

Avoid when: Without clinical-outcome validation

Learning goals

Like scRNA pseudotime, but for patients
Validate the axis against independent outcomes

Figures

Tutorial

When to Use This Skill

Use this skill when you have longitudinal patient omics data and want to:

✅ Reconstruct disease progression trajectories from time-series data
✅ Order samples by disease stage (pseudotime) with irregular sampling
✅ Identify biomarkers changing along disease trajectory
✅ Stratify patients as fast vs. slow progressors
✅ Predict clinical outcomes from trajectory position
✅ Validate computational staging against clinical measures

Required data:

Minimum 10 patients with 3+ timepoints each
Omics data: RNA-seq, proteomics, metabolomics, or clinical biomarkers
Metadata: Patient IDs, timepoints (days/months/years), optional outcomes

Primary method: TimeAx multiple trajectory alignment (handles irregular sampling)

Feature identification: Polynomial regression (linear/quadratic/cubic) per the TimeAx paper (Frishberg et al., Nat Commun 2023), with FDR-corrected Q-value filtering. Captures both monotonic and non-monotonic dynamics.

Alternative methods: Linear Mixed Models (regular sampling), Hidden Markov Models (discrete stages)

Installation

R ≥ 4.0 with the TimeAx package (primary trajectory method):

# Install TimeAx from GitHub
install.packages("remotes")
remotes::install_github("amitfrish/TimeAx")

# Required for plotting
install.packages(c("ggplot2", "ggprism"))

# Required for demo dataset (GSE128959 batch correction)
BiocManager::install("sva")

Python ≥ 3.9 for the workflow wrapper and analysis pipeline:

# Core analysis packages
pip install numpy pandas scipy scikit-learn statsmodels lifelines

# Visualization packages
pip install seaborn matplotlib

# PDF report generation (optional)
pip install reportlab

# Optional
pip install hmmlearn        # Hidden Markov Models alternative

For Linear Mixed Models alternative: R packages lme4, lmerTest

License compliance: All packages use permissive licenses (MIT, BSD, Apache 2.0) - commercial AI agent use permitted.

For detailed installation and troubleshooting, see troubleshooting_guide.md

Inputs

Required files:

Data matrix (features × samples)
- CSV/TSV format with features as rows, samples as columns
- Feature types: genes, proteins, metabolites, or clinical biomarkers
- Normalized counts or continuous measurements
- ⚠️ TimeAx requires all-positive values (e.g., log2-normalized, RMA). Do NOT Z-score normalize before TimeAx — it creates negative values that disable the ratio mechanism.
Sample metadata (CSV/TSV)
- Required columns: sample_id, patient_id, timepoint
- Optional: outcome, treatment, batch, clinical covariates
- Timepoints: numeric values (days, months, years from baseline)

Data requirements:

≥10 patients minimum (20+ recommended)
≥3 timepoints per patient minimum
Handles irregular sampling (different timepoints per patient)
Works with cross-sectional + longitudinal cohorts

Outputs

Primary results:

pseudotime_assignments.csv - Disease pseudotime for each sample
trajectory_features.csv - Features changing along trajectory, with polynomial degree, R², and direction
patient_summaries.csv - Per-patient progression statistics
all_feature_statistics.csv - Statistics for all tested features

Analysis objects (for downstream use):

timeax_model.pkl - Complete TimeAx model object
Load with: model = pickle.load(open('timeax_model.pkl', 'rb'))
Required for: Projecting new samples, downstream trajectory analysis

Reports:

analysis_report.pdf - Publication-quality PDF with Introduction, Methods, Results, Conclusions
Requires: pip install reportlab (optional — markdown report generated regardless)
SUMMARY.txt - Plain-text summary report

TimeAx R plots (PNG + SVG, generated in Step 2):

timeax_pseudotime_vs_time.png/.svg - Per-patient pseudotime vs actual time trajectories
timeax_progression_rates.png/.svg - Patient progression rate comparison (fast vs slow)
timeax_seed_dynamics.png/.svg - Seed feature expression trends along pseudotime
timeax_uncertainty.png/.svg - Pseudotime uncertainty distribution

Python plots (PNG + SVG, generated in Step 3):

patient_trajectories_pca.png/.svg - PCA with pseudotime coloring and patient trajectory lines
patient_trajectories_umap.png/.svg - UMAP nonlinear projection
trajectory_heatmap.png/.svg - Feature expression clustermap
trajectory_trends.png/.svg - Polynomial fit trends for top features
pseudotime_vs_stage.png/.svg - Pseudotime vs clinical tumor stage (biological validation)
patient_progression.png/.svg - Per-patient pseudotime spaghetti plot
seed_feature_heatmap.png/.svg - TimeAx seed feature dynamics heatmap

Metadata:

model_metadata.json - Analysis parameters, quality metrics (monotonicity, robustness)

Clarification Questions

🚨 ALWAYS ask Question 1 FIRST. Do not ask about data type, study design, or analysis parameters before the user has answered Question 1.

1. Input Files (ASK THIS FIRST)

Do you have longitudinal patient omics data to analyze?
- If uploaded: Are these your data matrix and sample metadata files?
- Expected: Data matrix (features × samples) + metadata (sample_id, patient_id, timepoint)
Or use example/demo data?
- GSE128959 bladder cancer recurrence (18 patients, 84 samples, 17K genes) — from the TimeAx paper (Frishberg et al. 2023). Requires R + sva package. Downloads ~5MB on first run.

🚨 IF EXAMPLE DATA SELECTED: All parameters are pre-defined (bladder cancer microarray, 18 patients, 4-6 timepoints, tumor recurrence, TimeAx method with ComBat batch correction). DO NOT ask questions 2-6. Proceed directly to Step 1.

Questions 2-6 are ONLY for users providing their own data:

2. Data Type: Bulk RNA-seq, proteomics, metabolomics, clinical biomarkers, or multi-omics?

3. Study Design: Number of patients (min 10, recommend 20+)? Timepoints per patient (min 3)? Sampling pattern (regular/irregular)? Time units and range?

4. Disease Context: Disease type? Treatment status? Available clinical outcomes (survival, relapse, response)? Known clinical staging?

5. Analysis Goals: Pseudotime ordering, patient stratification, biomarker discovery, outcome prediction, or trajectory comparison?

6. Method Preference: TimeAx (recommended), Linear Mixed Models, Hidden Markov Models, or not sure?

Standard Workflow

🚨 MANDATORY: USE SCRIPTS EXACTLY AS SHOWN - DO NOT WRITE INLINE CODE 🚨

Step 1 - Load and preprocess data:

For example/demo data (GSE128959 bladder cancer):

from scripts.load_and_preprocess import load_example_data, load_and_preprocess_data

# Load GSE128959 (downloads and preprocesses via R on first run)
data, metadata = load_example_data()

# Save to files for the standard pipeline
data.to_csv('gse128959_expression.csv')
metadata.to_csv('gse128959_metadata.csv', index=False)

# Run through standard preprocessing
data, metadata, preprocessing_stats = load_and_preprocess_data(
    data_file='gse128959_expression.csv',
    metadata_file='gse128959_metadata.csv',
    data_type='rnaseq',
    min_patients=10,
    min_timepoints=3
)

For your own data:

from scripts.load_and_preprocess import load_and_preprocess_data

data, metadata, preprocessing_stats = load_and_preprocess_data(
    data_file="patient_expression.csv",
    metadata_file="sample_metadata.csv",
    data_type='rnaseq',  # 'rnaseq', 'proteomics', 'metabolomics', 'clinical'
    min_patients=10,
    min_timepoints=3
)

DO NOT write inline data loading or preprocessing code. Just use the script.

✅ VERIFICATION: You MUST see: "✓ Data loaded and preprocessed successfully!"

Step 2 - Run trajectory analysis:

from scripts.run_trajectory_analysis import run_trajectory_analysis

# Run TimeAx alignment and identify trajectory features
results = run_trajectory_analysis(
    data,
    metadata,
    method='timeax',  # 'timeax', 'lmm', 'hmm'
    patient_column='patient_id',
    time_column='timepoint'
)
# Extract: pseudotime, trajectory_features, model, robustness_score

DO NOT write inline TimeAx or trajectory code. Just use the script.

✅ VERIFICATION: You MUST see: "✓ Trajectory analysis completed successfully!"

Step 3 - Generate visualizations:

from scripts.generate_all_plots import generate_all_plots

# Generate all plots (PNG + SVG with graceful fallback)
generate_all_plots(
    data,
    metadata,
    results,
    output_dir='trajectory_results'
)

DO NOT write inline plotting code (plt.savefig, seaborn, etc.). Just use the script.

The script handles PNG + SVG export with graceful fallback for SVG dependencies.

✅ VERIFICATION: You MUST see: "✓ All visualizations generated successfully!"

Step 4 - Export results:

from scripts.export_results import export_all

# Export all results, model object, and metadata
export_all(
    data=data,
    metadata=metadata,
    results=results,
    output_dir='trajectory_results'
)

DO NOT write custom export code. Use export_all().

✅ VERIFICATION: You MUST see: "=== Export Complete ==="

⚠️ CRITICAL - DO NOT:

❌ Write inline data loading code → STOP: Use load_and_preprocess_data()
❌ Write inline TimeAx/trajectory code → STOP: Use run_trajectory_analysis()
❌ Write inline plotting code → STOP: Use generate_all_plots()
❌ Write custom export code → STOP: Use export_all()

⚠️ IF SCRIPTS FAIL - Script Failure Hierarchy:

Fix and Retry (90%) - Install missing package, re-run script
Modify Script (5%) - Edit the script file itself, document changes
Use as Reference (4%) - Read script, adapt approach, cite source
Write from Scratch (1%) - Only if genuinely impossible, explain why

NEVER skip directly to writing inline code without trying the script first.

Detailed Methodology

For comprehensive details on algorithms, parameters, and methods:

TimeAx algorithm: timeax_methodology.md
Alternative methods (LMM, HMM): lmm_hmm_alternatives.md
Method comparison and selection: method_comparison.md
Data preprocessing by type: data_preprocessing_guide.md
Validation framework: validation_framework.md

Quality Control

Quick checklist:

✅ ≥10 patients with ≥3 timepoints each
✅ Within-patient monotonicity >0.5 (good) or 0.3-0.5 (moderate)
✅ Pseudotime correlates with clinical measures (r >0.2, p <0.05)
✅ Trajectory features identified (seed feature fallback if FDR <0.05 yields 0)
✅ Samples don't cluster by batch (check PCA)

Note on robustness: The TimeAx robustness() function (v0.1.1) can produce misleading negative values even on valid data. Use within-patient monotonicity as the primary quality metric instead.

For comprehensive QC guidelines, see validation_framework.md

Common Issues

Error	Cause	Solution
Negative robustness score	Known TimeAx v0.1.1 bug	Normal. The `robustness()` function is unreliable. Use within-patient monotonicity (>0.5 = good) as the primary quality metric instead.
0 trajectory features (FDR <0.05)	FDR too stringent for genome-wide test	Normal with real data. The script automatically falls back to testing TimeAx seed features (reduced FDR burden) and nominal p < 0.05.
Memory error during alignment	Too many features (>20,000)	Reduce to 5000-10000 most variable features before TimeAx. Script does this automatically.
SVG export failed	Missing cairo system library	Normal - PNG still generated. Script handles fallback automatically. DO NOT try to install cairo manually.
Samples cluster by batch not time	Uncorrected batch effects	Run ComBat batch correction before trajectory analysis. Set `batch_correction=True` in preprocessing.
Negative values disable `ratio` mode	Z-score or log-fold-change normalization	TimeAx `ratio=TRUE` requires positive values. Use log2 counts, RMA, or TPM — NOT Z-scores.
"R TimeAx not available"	R or TimeAx R package not installed	STOP: Install R, then run: `Rscript -e 'remotes::install_github("amitfrish/TimeAx")'`. See Installation section.
PDF report not generated	reportlab not installed	Normal. Install with `pip install reportlab`. SUMMARY.txt is always generated as fallback.

For complete troubleshooting, see troubleshooting_guide.md

Suggested Next Steps

After trajectory analysis, consider these downstream analyses:

Functional enrichment → Use functional-enrichment-gprofiler skill
- Input: trajectory_features.csv (top up/down-regulated features)
- Find pathways changing along disease trajectory
Tissue expression analysis → Use tissue-expression-from-degs skill
- Input: trajectory_features.csv
- Identify tissue-specific trajectory markers
Transcription factor activity → Use tf-activity-dorothea skill
- Input: pseudotime_assignments.csv + original expression data
- Find TFs driving disease progression
Survival analysis → Built into clinical validation
- Input: pseudotime_assignments.csv + survival data
- Stratify patients by pseudotime tertiles/quartiles
Project new samples → Use scripts/timeax_inference.py
- Load: timeax_model.pkl
- Stage new patients on trained trajectory

Related Skills

Upstream (data generation):

bulk-rnaseq-counts-to-de-deseq2 - Generate expression data
proteomics-differential-expression - Proteomics quantification
metabolomics-preprocessing - Metabolite data

Downstream (interpretation):

functional-enrichment-gprofiler - Pathway analysis of trajectory features
tissue-expression-from-degs - Tissue-specific markers
tf-activity-dorothea - Transcription factor drivers
grn-pyscenic - Gene regulatory networks along trajectory

Alternative trajectory methods:

pseudotime-monocle - For single-cell RNA-seq trajectories
trajectory-inference-slingshot - Branching trajectories

References

Primary Citations

TimeAx: Frishberg A, van den Munckhof ICL, Ter Horst R, et al. Reconstructing disease dynamics for mechanistic insights and clinical benefit. Nat Commun. 2023;14(1):6940. https://doi.org/10.1038/s41467-023-42354-8
Linear Mixed Models: Bates D, Mächler M, Bolker B, Walker S. Fitting Linear Mixed-Effects Models Using lme4. J Stat Softw. 2015;67(1):1-48.
Disease Trajectories Review: Schmidt AF, Heerspink HJL, Denig P, et al. Disease trajectory browser for exploring temporal, population-wide disease progression patterns. Nat Commun. 2020;11:4952.

Software

Software	Version	License	Commercial Use
TimeAx	≥0.1.0	MIT	✅ Permitted
NumPy	≥1.21	BSD	✅ Permitted
Pandas	≥1.3	BSD	✅ Permitted
scikit-learn	≥1.0	BSD	✅ Permitted
seaborn	≥0.11	BSD	✅ Permitted
matplotlib	≥3.5	BSD	✅ Permitted
ggprism (R)	≥1.0.3	GPL (≥3)	✅ Permitted
sva (R)	≥3.40	Artistic-2.0	✅ Permitted
reportlab	≥3.6	BSD	✅ Permitted

Online Resources

TimeAx GitHub: https://github.com/amitfrish/TimeAx
TimeAx Documentation: https://timeax.readthedocs.io/
Disease Progression Modeling Review: https://www.annualreviews.org/content/journals/10.1146/annurev-biodatasci-110123-041001

Code preview

scripts/clinical_validation.py

"""
Clinical Validation Functions for Disease Progression Trajectories

This module provides functions for validating disease progression trajectories
against clinical measures, outcomes, and staging systems.
"""

import numpy as np
import pandas as pd
from scipy.stats import spearmanr, mannwhitneyu, kruskal
from lifelines import CoxPHFitter, KaplanMeierFitter
from lifelines.statistics import logrank_test
from plotnine import (ggplot, aes, geom_boxplot, geom_step, labs,
                      theme_minimal, facet_wrap)
from plotnine_prism import theme_prism


def correlate_with_clinical_scores(metadata, pseudotime_column='pseudotime',
                                    clinical_columns=None):
    """
    Correlate pseudotime with clinical severity scores.

    Parameters
    ----------
    metadata : pd.DataFrame
        Sample metadata with pseudotime and clinical scores
    pseudotime_column : str
        Column name for pseudotime values
    clinical_columns : list of str, optional
        Clinical score columns to correlate. If None, auto-detect numeric columns.

    Returns
    -------
    pd.DataFrame
        Correlation results with columns: variable, correlation, pvalue
    """
    if clinical_columns is None:
        # Auto-detect numeric columns (excluding pseudotime)
        clinical_columns = metadata.select_dtypes(include=[np.number]).columns
        clinical_columns = [c for c in clinical_columns
                            if c != pseudotime_column and c != 'patient_id']

    results = []

    for col in clinical_columns:
        # Remove missing values
        valid_mask = ~metadata[col].isnull() & ~metadata[pseudotime_column].isnull()
        if valid_mask.sum() < 3:
            continue  # Skip if too few valid pairs

        corr, pval = spearmanr(metadata.loc[valid_mask, pseudotime_column],
                               metadata.loc[valid_mask, col])

        results.append({
            'variable': col,
            'correlation': corr,
            'pvalue': pval,
            'n_samples': valid_mask.sum()
        })

    results_df = pd.DataFrame(results)
    results_df = results_df.sort_values('pvalue')

    # Print summary
    print("Pseudotime vs. Clinical Scores Correlation")
    print("=" * 60)
    for _, row in results_df.iterrows():
        print(f"{row['variable']:30s} r={row['correlation']:6.3f}  p={row['pvalue']:.3e}  n={row['n_samples']}")

    return results_df


def compare_by_clinical_stage(metadata, pseudotime_column='pseudotime',
                               stage_column='clinical_stage',
                               output_file=None):
    """
    Compare pseudotime across discrete clinical stages.

    Parameters
    ----------

scripts/export_results.py

"""
Export all trajectory analysis results and model objects.

This module provides the export_all() function for Step 4 of the standard workflow.
"""

import pandas as pd
import numpy as np
import pickle
import json
import os
from typing import Dict, Optional, List
from datetime import datetime


def export_all(
    data: pd.DataFrame,
    metadata: pd.DataFrame,
    results: Dict,
    output_dir: str = 'trajectory_results',
    include_plots: bool = True
) -> None:
    """
    Export all trajectory analysis results, model objects, and metadata.

    This is the Step 4 function that exports:
    - Pseudotime assignments (CSV)
    - Trajectory features (CSV)
    - Patient summaries (CSV)
    - Model object (pickle) - CRITICAL for downstream use
    - Analysis metadata (JSON)
    - Optional: plots if generated

    Parameters
    ----------
    data : pd.DataFrame
        Preprocessed data matrix (features × samples)
    metadata : pd.DataFrame
        Sample metadata with pseudotime
    results : dict
        Results from run_trajectory_analysis() containing:
        - 'pseudotime': sample pseudotimes
        - 'trajectory_features': significant features
        - 'model': fitted model object
        - 'robustness_score': quality metric
        - 'method': method used
        - 'feature_stats': statistics for all features
    output_dir : str, default='trajectory_results'
        Output directory for all files
    include_plots : bool, default=True
        Whether to copy plot files to export directory

    Returns
    -------
    None
        Saves all outputs to output_dir
    """

    print("\n=== Step 4: Export Results ===\n")

    # Create output directory
    os.makedirs(output_dir, exist_ok=True)

    # Extract results
    pseudotime = results['pseudotime']
    trajectory_features = results['trajectory_features']
    feature_stats = results['feature_stats']
    model = results.get('model')
    robustness_score = results.get('robustness_score')
    monotonicity_score = results.get('monotonicity_score')
    method = results.get('method', 'unknown')

    # 1. Export pseudotime assignments
    print("Exporting pseudotime assignments...")
    pseudotime_df = pd.DataFrame({
        'sample_id': pseudotime.index,
        'pseudotime': pseudotime.values
    })

    # Add metadata columns if available

scripts/generate_all_plots.py

"""
Generate all trajectory visualizations with PNG + SVG export.

This module creates publication-quality plots with graceful SVG fallback.
"""

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from typing import Dict, Optional, List
import os
from sklearn.decomposition import PCA

# Set seaborn theme and Helvetica font (project standard)
sns.set_style("ticks")
plt.rcParams['font.family'] = 'sans-serif'
plt.rcParams['font.sans-serif'] = ['Helvetica']

# Try to import UMAP
try:
    from umap import UMAP
    UMAP_AVAILABLE = True
except ImportError:
    UMAP_AVAILABLE = False


def generate_all_plots(
    data: pd.DataFrame,
    metadata: pd.DataFrame,
    results: Dict,
    output_dir: str = 'trajectory_results'
) -> None:
    """
    Generate all trajectory visualizations (PNG + SVG).

    This is the consolidated Step 3 function that creates:
    - PCA trajectory plot
    - UMAP trajectory plot (if available)
    - Trajectory feature heatmap
    - Feature trend plots

    Parameters
    ----------
    data : pd.DataFrame
        Preprocessed data matrix (features × samples)
    metadata : pd.DataFrame
        Sample metadata
    results : dict
        Results from run_trajectory_analysis() containing:
        - 'pseudotime': sample pseudotimes
        - 'trajectory_features': significant features
        - 'robustness_score': quality metric
    output_dir : str, default='trajectory_results'
        Output directory for plots

    Returns
    -------
    None
        Saves plots to output_dir
    """

    print("\n=== Step 3: Generate Visualizations ===\n")

    # Create output directory
    os.makedirs(output_dir, exist_ok=True)

    # Extract results
    pseudotime = results['pseudotime']
    trajectory_features = results['trajectory_features']

    # Add pseudotime to metadata for plotting
    plot_metadata = metadata.copy()
    plot_metadata['pseudotime'] = pseudotime.values

    # 1. PCA trajectory plot (with patient lines)
    print("Generating PCA trajectory plot...")
    try:
        _plot_pca_trajectory(
            data, plot_metadata,

Companion files

Type	Path	Bytes
Markdown	references/data_preprocessing_guide.md	9,949
Markdown	references/lmm_hmm_alternatives.md	19,544
Markdown	references/method_comparison.md	11,854
Markdown	references/timeax_methodology.md	9,954
Markdown	references/troubleshooting_guide.md	7,456
Markdown	references/validation_framework.md	18,179
Markdown	references/workflow_integration.md	18,785
Python	scripts/clinical_validation.py	17,271
Python	scripts/export_results.py	15,114
Python	scripts/generate_all_plots.py	22,187
Python	scripts/generate_report.py	19,402
Python	scripts/load_and_preprocess.py	11,643
R	scripts/load_gse128959.R	4,464
Python	scripts/load_longitudinal_data.py	8,784
Python	scripts/mock_timeax.py	7,261
Python	scripts/plot_patient_trajectories.py	6,915
Python	scripts/preprocess_features.py	8,822
R	scripts/run_timeax.R	11,197
Python	scripts/run_trajectory_analysis.py	18,202
Python	scripts/timeax_alignment.py	6,508
Python	scripts/timeax_r_wrapper.py	8,763
Markdown	SKILL.md	15,821
JSON	skill.meta.json	4,172