⌂ / Overview / Proteomics & Multi-omics / Multi-omics Integration

View companion source

Multi-omics Integration

Integrate 2+ omics into interpretable latent factors.

Overview

Problem. Each layer has its own scale and noise; naive merging distorts.

Use when: 2+ layers with shared samples

Avoid when: Non-matched samples

Learning goals

A few latent factors explain cross-layer variance
Shared vs layer-specific factors

Figures

Tutorial

Identify latent factors driving variation across 2+ omics layers using MOFA+ (Multi-Omics Factor Analysis). Decomposes multi-omics data into interpretable factors, each capturing shared or view-specific biological signal. Handles missing data across views natively.

When to Use This Skill

Use when you:

✅ Have 2+ omics layers measured on overlapping samples (RNA-seq + proteomics, methylation + mutations, etc.)
✅ Want to find shared sources of variation across omics (not just per-omics analysis)
✅ Need to identify which omics layers contribute to each source of variation
✅ Have incomplete data (not all samples measured in all views) — MOFA handles this
✅ Want factor scores for downstream patient stratification or survival analysis

Don't use for:

❌ Single omics data (use bulk-rnaseq-counts-to-de-deseq2 or bulk-omics-clustering)
❌ Supervised prediction (use lasso-biomarker-panel instead)
❌ Single-cell multi-modal (MOFA2 supports it, but consider scrna-trajectory-inference)
❌ Fewer than 10 samples per view

Runtime: ~5-8 minutes total (CLL example). First run adds ~1-3 min for Python environment setup.

Installation

# Bioconductor packages
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install(c("MOFA2", "MOFAdata", "ComplexHeatmap"))

# CRAN packages
install.packages(c("ggprism", "circlize", "reshape2", "RColorBrewer"))

Package	Version	License	Commercial Use	Installation
MOFA2	≥1.12.0	LGPL (≥3)	✅ Permitted	`BiocManager::install("MOFA2")`
MOFAdata	≥1.8.0	Artistic-2.0	✅ Permitted	`BiocManager::install("MOFAdata")` (example data)
ComplexHeatmap	≥2.18.0	MIT	✅ Permitted	`BiocManager::install("ComplexHeatmap")`
ggprism	≥1.0.3	GPL (≥3)	✅ Permitted	`install.packages("ggprism")`
circlize	≥0.4.15	MIT	✅ Permitted	`install.packages("circlize")`
reshape2	≥1.4.4	MIT	✅ Permitted	`install.packages("reshape2")`
RColorBrewer	≥1.1	Apache-2.0	✅ Permitted	`install.packages("RColorBrewer")`
rmarkdown	≥2.25	GPL-3	✅ Permitted	`install.packages("rmarkdown")` (optional, PDF)

Inputs

Multi-omics data: Named list of matrices (features × samples), one per omics view
Minimum 2 views, any combination of omics types
Samples as columns, features as rows
Missing samples across views OK (MOFA handles incomplete overlap)
Sample metadata (optional): CSV/TSV with sample IDs + clinical variables (for factor-trait associations)
Supported formats: R matrices, CSV/TSV files, or MultiAssayExperiment

Outputs

Analysis objects (RDS):

mofa_model.rds — Complete trained MOFA model for downstream use
Load with: model <- readRDS('mofa_results/mofa_model.rds')
Required for: bulk-omics-clustering (factor-based clustering), lasso-biomarker-panel (feature selection)

CSV results:

factor_values.csv — Sample factor scores (samples × factors)
weights_*.csv — Feature weights per view (features × factors)
variance_explained_per_factor.csv — R² per factor per view
variance_explained_total.csv — Total R² per view
top_features_per_factor.csv — Top 20 features per factor per view

Visualizations (PNG + SVG):

mofa_variance_per_factor — Heatmap: R² per factor per view (signature MOFA plot)
mofa_total_variance — Bar chart: total R² per view
mofa_factor_scatter — Scatter: Factor 1 vs 2 colored by clinical variable
mofa_factor_correlation — Tile: factor-factor correlations
mofa_top_weights — Faceted bar: top feature weights per factor
mofa_factor_heatmap — ComplexHeatmap: factors × samples with annotations
mofa_factor_clinical — Box plots: factor values by clinical groups

Reports:

analysis_report.md — Markdown summary with methods, results, references
analysis_report.pdf — PDF report with embedded figures (requires rmarkdown + LaTeX)

Clarification Questions

Input Files (ASK THIS FIRST): - Do you have multi-omics data matrices to integrate? - Expected: Named list of matrices (features × samples), or CSV files per omics view - Or use example data? CLL blood cancer dataset (200 patients: mRNA, methylation, mutations, drug response)

🚨 IF EXAMPLE DATA SELECTED: Skip questions 3-4. Proceed directly to Step 1.

Analysis Options:
- (If using example data) Number of factors:
- a) 15 factors — standard analysis (recommended)
- b) 5 factors — quick demo (~2 min faster)
- (If using own data) Number of factors:
- a) 15 (recommended starting point)
- b) Custom number
(Own data only) Data types per view:
- Which omics types? (RNA-seq, proteomics, methylation, mutations, metabolomics, drug response, other)
- Are any views binary (0/1)? MOFA uses Bernoulli likelihood for binary data.
(Own data only) Sample metadata:
- Do you have a sample metadata file (CSV/TSV) with clinical variables?
- Variables for factor-trait associations (e.g., disease status, treatment, subtype)?

Standard Workflow

🚨 MANDATORY: USE SCRIPTS EXACTLY AS SHOWN - DO NOT WRITE INLINE CODE 🚨

Step 1 - Load data:

# For CLL example data:
source("scripts/load_example_data.R")
cll <- load_cll_data()

# For user data:
# source("scripts/load_example_data.R")
# cll <- load_user_data(
#   file_paths = list(RNA = "rna.csv", Protein = "protein.csv"),
#   metadata_path = "metadata.csv"
# )

✅ VERIFICATION: "✓ Data loaded successfully!" with per-view dimensions

Step 2 - Run MOFA analysis:

source("scripts/mofa_workflow.R")
model <- run_mofa_analysis(
    data_list = cll$data,
    metadata = cll$metadata,
    n_factors = 15,
    output_dir = "mofa_results"
)

DO NOT write inline MOFA code. Just call run_mofa_analysis().

⏱️ Takes ~2-5 min (+ ~1-3 min extra on first run for Python environment setup via basilisk).

✅ VERIFICATION: "✓ MOFA model trained successfully!" with variance explained summary

Step 3 - Generate visualizations:

source("scripts/mofa_plots.R")
generate_all_plots(model, output_dir = "mofa_results")

DO NOT write inline plotting code (ggsave, ggplot, Heatmap, etc.). Just use the script.

The script handles PNG + SVG export with graceful fallback for SVG dependencies.

✅ VERIFICATION: "✓ All plots generated successfully!" with file count

Step 4 - Export results:

source("scripts/export_results.R")
export_all(model, output_dir = "mofa_results")

DO NOT write custom export code. Use export_all().

✅ VERIFICATION: "=== Export Complete ===" with file list

⚠️ CRITICAL - DO NOT:

❌ Write inline MOFA code → STOP: Use run_mofa_analysis()
❌ Write inline plotting code (ggsave, ggplot, Heatmap, etc.) → STOP: Use generate_all_plots()
❌ Write custom export code → STOP: Use export_all()
❌ Try to install basilisk/reticulate manually → MOFA2 handles Python automatically

⚠️ IF SCRIPTS FAIL - Script Failure Hierarchy:

Fix and Retry (90%) — Install missing package, re-run script
Modify Script (5%) — Edit the script file itself, document changes
Use as Reference (4%) — Read script, adapt approach, cite source
Write from Scratch (1%) — Only if genuinely impossible, explain why

NEVER skip directly to writing inline code without trying the script first.

Common Issues

Error	Cause	Solution
basilisk Python env setup slow	First-time setup of Python backend	Normal — wait 1-3 minutes. Only happens once per R installation.
`run_mofa` hangs at "Training model..."	Model training in progress	Normal — wait 2-5 min. Training is compute-intensive.
`Error in py_call_impl`: Python error	basilisk environment issue	Restart R session, retry. If persistent: `BiocManager::install("MOFA2", force = TRUE)`
Metadata download failed	EBI FTP blocked or offline	Normal fallback. Analysis runs without trait plots. Metadata is optional.
"No convergence"	Too many factors or too few samples	Reduce `n_factors` (try 5-10). Ensure ≥10 samples.
SVG export failed	Missing svglite/cairo	Normal. PNG always generated. `generate_all_plots()` handles fallback automatically.
Memory error	Dataset too large	Filter features to top 5,000 most variable per view before MOFA.

Interpretation Guide

Variance Decomposition (Key MOFA Output)

High R² in one view: Factor captures view-specific variation
High R² across views: Factor captures shared cross-omics signal (most interesting)
Low total R²: MOFA explains little variation in that view — consider adding features or views

Factor Interpretation

Pattern	Meaning
Factor active in mRNA + methylation	Epigenetic regulation of transcription
Factor active in mutations + drug response	Genetic determinants of drug sensitivity
Factor correlates with clinical subtype	Biologically meaningful patient stratification
Factor active in only one view	View-specific technical or biological variation

See: references/mofa-interpretation-guide.md for detailed downstream analysis.

Suggested Next Steps

After running MOFA:

Patient stratification: Use bulk-omics-clustering on factor scores to define molecular subtypes
Biomarker discovery: Use lasso-biomarker-panel on top-weighted features per factor
Pathway enrichment: Use functional-enrichment-from-degs on top mRNA features per factor
Network analysis: Use coexpression-network on factor-associated genes
Survival analysis: Use survival-analysis-clinical with factor scores as covariates

Related Skills

Skill	Relationship
`bulk-omics-clustering`	Downstream: cluster on MOFA factor scores
`lasso-biomarker-panel`	Downstream: select biomarkers from top factor features
`disease-progression-longitudinal`	Complementary: trajectory analysis on factor scores
`coexpression-network`	Downstream: network analysis on factor-associated genes
`functional-enrichment-from-degs`	Downstream: pathway enrichment on top factor features
`bulk-rnaseq-counts-to-de-deseq2`	Upstream: generate DE results as one omics view

References

Argelaguet R, et al. (2020) MOFA+: a statistical framework for comprehensive integration of multi-modal single-cell data. Genome Biology 21:111.
Argelaguet R, et al. (2018) Multi-Omics Factor Analysis—a framework for unsupervised integration of multi-omics data sets. Molecular Systems Biology 14:e8124.
Dietrich S, et al. (2018) Drug-perturbation-based stratification of blood cancer. Journal of Clinical Investigation 128(1):427-445.

Code preview

scripts/export_results.R

# =============================================================================
# Export MOFA+ Analysis Results
# =============================================================================
# Exports factor values, feature weights, variance explained, and analysis
# objects. Generates markdown summary report and optional PDF report.
# =============================================================================

options(repos = c(CRAN = "https://cloud.r-project.org"))

.install_if_missing <- function(pkg, bioc = FALSE) {
    if (!requireNamespace(pkg, quietly = TRUE)) {
        cat("Installing", pkg, "...\n")
        if (bioc) {
            if (!requireNamespace("BiocManager", quietly = TRUE))
                install.packages("BiocManager")
            BiocManager::install(pkg, ask = FALSE, update = FALSE)
        } else {
            install.packages(pkg)
        }
    }
}

#' Export all MOFA analysis results
#'
#' @param model Trained MOFA model object
#' @param output_dir Directory for output files
export_all <- function(model, output_dir = "mofa_results") {

    cat("\n=== Exporting MOFA Analysis Results ===\n\n")

    .install_if_missing("MOFA2", bioc = TRUE)
    .install_if_missing("reshape2")
    library(MOFA2)

    if (!dir.exists(output_dir)) {
        dir.create(output_dir, recursive = TRUE)
    }

    # -------------------------------------------------------------------------
    # 1. Factor values (samples x factors)
    # -------------------------------------------------------------------------
    cat("1. Exporting factor values...\n")
    factors_df <- get_factors(model, as.data.frame = TRUE)
    factors_wide <- reshape2::dcast(factors_df, sample ~ factor, value.var = "value")
    write.csv(factors_wide, file.path(output_dir, "factor_values.csv"), row.names = FALSE)
    cat(sprintf("   %d samples x %d factors\n", nrow(factors_wide), ncol(factors_wide) - 1))
    cat(sprintf("   Saved: %s\n\n", file.path(output_dir, "factor_values.csv")))

    # -------------------------------------------------------------------------
    # 2. Feature weights per view
    # -------------------------------------------------------------------------
    cat("2. Exporting feature weights per view...\n")
    weights_df <- get_weights(model, as.data.frame = TRUE)
    views <- unique(weights_df$view)
    for (v in views) {
        w_sub <- weights_df[weights_df$view == v, ]
        w_wide <- reshape2::dcast(w_sub, feature ~ factor, value.var = "value")
        fname <- sprintf("weights_%s.csv", gsub("[^a-zA-Z0-9]", "_", tolower(v)))
        write.csv(w_wide, file.path(output_dir, fname), row.names = FALSE)
        cat(sprintf("   %s: %d features x %d factors -> %s\n",
                    v, nrow(w_wide), ncol(w_wide) - 1, fname))
    }
    cat("\n")

    # -------------------------------------------------------------------------
    # 3. Variance explained
    # -------------------------------------------------------------------------
    cat("3. Exporting variance explained...\n")
    r2 <- get_variance_explained(model)
    r2_per_factor <- r2$r2_per_factor[[1]]
    r2_total <- r2$r2_total[[1]]

    # Per-factor per-view
    r2_df <- as.data.frame(r2_per_factor)
    r2_df$Factor <- rownames(r2_df)
    write.csv(r2_df, file.path(output_dir, "variance_explained_per_factor.csv"),
              row.names = FALSE)

    # Total per view
    total_df <- data.frame(View = names(r2_total), Total_R2 = as.numeric(r2_total))

scripts/load_example_data.R

# =============================================================================
# Load Example Data for MOFA+ Multi-Omics Integration
# =============================================================================
# Loads the CLL (Chronic Lymphocytic Leukemia) multi-omics dataset
# from the MOFAdata package. 200 patients, 4 omics layers:
#   - Drugs: drug response (310 features)
#   - Methylation: DNA methylation (4248 features)
#   - mRNA: gene expression (5000 features)
#   - Mutations: somatic mutations (69 features, binary)
#
# Also downloads sample metadata from EBI for clinical annotations.
# =============================================================================

options(repos = c(CRAN = "https://cloud.r-project.org"))

.install_if_missing <- function(pkg, bioc = FALSE) {
    if (!requireNamespace(pkg, quietly = TRUE)) {
        cat("Installing", pkg, "...\n")
        if (bioc) {
            if (!requireNamespace("BiocManager", quietly = TRUE))
                install.packages("BiocManager")
            BiocManager::install(pkg, ask = FALSE, update = FALSE)
        } else {
            install.packages(pkg)
        }
    }
}

#' Load CLL multi-omics example data
#'
#' @return list with components:
#'   - data: named list of 4 matrices (Drugs, Methylation, mRNA, Mutations)
#'   - metadata: data.frame of sample annotations (or NULL if download fails)
load_cll_data <- function() {
    cat("\n=== Loading CLL Multi-Omics Example Data ===\n\n")

    # --- Install/load MOFAdata ---
    .install_if_missing("MOFAdata", bioc = TRUE)
    library(MOFAdata)

    # --- Load CLL dataset ---
    cat("Loading CLL multi-omics data (200 patients, 4 omics layers)...\n")
    utils::data("CLL_data", package = "MOFAdata", envir = environment())

    # Validate structure
    stopifnot(is.list(CLL_data))
    stopifnot(length(CLL_data) >= 4)

    # --- Print per-view summary ---
    cat("\nDataset overview:\n")
    all_samples <- unique(unlist(lapply(CLL_data, colnames)))
    cat(sprintf("  Total unique samples: %d\n", length(all_samples)))
    cat("\n  View summaries:\n")
    for (view_name in names(CLL_data)) {
        mat <- CLL_data[[view_name]]
        n_feat <- nrow(mat)
        n_samp <- ncol(mat)
        pct_missing <- round((1 - n_samp / length(all_samples)) * 100, 1)
        cat(sprintf("    %-15s %5d features x %3d samples (%4.1f%% samples missing)\n",
                    view_name, n_feat, n_samp, pct_missing))
    }

    # --- Sample overlap ---
    sample_lists <- lapply(CLL_data, colnames)
    shared <- Reduce(intersect, sample_lists)
    cat(sprintf("\n  Samples with ALL 4 views: %d / %d (%.0f%%)\n",
                length(shared), length(all_samples),
                100 * length(shared) / length(all_samples)))

    # --- Download sample metadata ---
    metadata <- .download_cll_metadata()

    cat("\n✓ Data loaded successfully!\n\n")
    return(list(data = CLL_data, metadata = metadata))
}

#' Download CLL sample metadata from EBI
#' @return data.frame or NULL if download fails
.download_cll_metadata <- function() {
    cat("\nDownloading sample metadata...\n")

scripts/mofa_plots.R

# =============================================================================
# MOFA+ Visualization: Publication-Quality Plots
# =============================================================================
# Generates 7 plots using ggprism::theme_prism() and ComplexHeatmap.
# All plots saved as PNG + SVG with graceful fallback.
# =============================================================================

options(repos = c(CRAN = "https://cloud.r-project.org"))

.install_if_missing <- function(pkg, bioc = FALSE) {
    if (!requireNamespace(pkg, quietly = TRUE)) {
        cat("Installing", pkg, "...\n")
        if (bioc) {
            if (!requireNamespace("BiocManager", quietly = TRUE))
                install.packages("BiocManager")
            BiocManager::install(pkg, ask = FALSE, update = FALSE)
        } else {
            install.packages(pkg)
        }
    }
}

# --- Load required packages ---
.load_plot_packages <- function() {
    .install_if_missing("ggplot2")
    .install_if_missing("ggprism")
    .install_if_missing("reshape2")
    .install_if_missing("RColorBrewer")
    .install_if_missing("ComplexHeatmap", bioc = TRUE)
    .install_if_missing("circlize")
    .install_if_missing("MOFA2", bioc = TRUE)

    library(ggplot2)
    library(ggprism)
    library(reshape2)
    library(RColorBrewer)
    library(ComplexHeatmap)
    library(circlize)
    library(MOFA2)
    library(grid)
}

# --- PNG + SVG save helper ---
.save_plot <- function(plot_obj, base_name, output_dir, width = 8, height = 6, dpi = 300) {
    png_path <- file.path(output_dir, paste0(base_name, ".png"))
    svg_path <- file.path(output_dir, paste0(base_name, ".svg"))

    # Always save PNG
    ggsave(png_path, plot = plot_obj, width = width, height = height, dpi = dpi, device = "png")
    cat(sprintf("   Saved: %s\n", png_path))

    # Try SVG with fallback
    tryCatch({
        ggsave(svg_path, plot = plot_obj, width = width, height = height, device = "svg")
        cat(sprintf("   Saved: %s\n", svg_path))
    }, error = function(e) {
        tryCatch({
            svg(svg_path, width = width, height = height)
            print(plot_obj)
            dev.off()
            cat(sprintf("   Saved: %s\n", svg_path))
        }, error = function(e2) {
            cat("   (SVG export failed)\n")
        })
    })
}

# --- ComplexHeatmap save helper ---
.save_heatmap <- function(ht, base_name, output_dir, width = 10, height = 8) {
    png_path <- file.path(output_dir, paste0(base_name, ".png"))
    svg_path <- file.path(output_dir, paste0(base_name, ".svg"))

    # PNG
    png(png_path, width = width, height = height, units = "in", res = 300)
    draw(ht)
    dev.off()
    cat(sprintf("   Saved: %s\n", png_path))

    # SVG
    tryCatch({

Companion files

Type	Path	Bytes
Markdown	references/mofa-interpretation-guide.md	4,597
R	scripts/export_results.R	13,207
R	scripts/load_example_data.R	6,447
R	scripts/mofa_plots.R	18,418
R	scripts/mofa_workflow.R	6,349
Markdown	SKILL.md	11,142
JSON	skill.meta.json	1,540