View companion source

Co-expression Network

Build a co-expression network; find modules and hub genes.

Overview

Problem. Move from single genes to co-varying modules.

Use when: Seeking modules or hub genes
Avoid when: With very small sample sizes

Learning goals

Figures

Modules & Hub Genes
WGCNA Workflow
When to Use WGCNA
Soft Power & Scale-Free
Reading Module Results
WGCNA Pitfalls

Tutorial

Overview

Build weighted gene co-expression networks to identify modules of coordinately expressed genes and discover hub genes that may be key regulators. This workflow uses WGCNA (Weighted Gene Co-expression Network Analysis) to group genes into modules based on their expression patterns across samples, then correlates these modules with experimental conditions or traits.

Key Concept: Unlike single-gene analysis, WGCNA identifies groups of genes that behave similarly across samples, revealing biological pathways and potential regulatory relationships.

Use Cases: - Identify gene modules associated with experimental conditions

  • Discover hub genes (highly connected genes within modules)
  • Find genes with similar expression patterns to known genes of interest
  • Reduce dimensionality of gene expression data for downstream analysis
  • Generate hypotheses about gene function based on co-expression

Default Prompt: "Build a co-expression network to identify gene modules and hub genes from my RNA-seq data"

When to Use This Skill

Use WGCNA when you want to:

Requirements: - ≥15 samples (20+ recommended for robust results)

  • Normalized expression data (VST, rlog, TPM, or FPKM - NOT raw counts)
  • 5,000-15,000 most variable genes
  • Batch effects removed or corrected

Not suitable for: - Small sample sizes (<15 samples) - consider alternative approaches

  • Raw count data - normalize first using DESeq2 or similar
  • Data with uncorrected batch effects - correct before WGCNA

Installation

Core WGCNA packages:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("WGCNA")

Visualization packages:

install.packages(c("ggplot2", "ggprism"))
BiocManager::install("ComplexHeatmap")

Enrichment analysis (optional):

BiocManager::install(c("clusterProfiler", "org.Hs.eg.db"))  # Human
# BiocManager::install("org.Mm.eg.db")  # Mouse
# BiocManager::install("org.Rn.eg.db")  # Rat
Software Version License Commercial Use Installation
WGCNA ≥1.70 GPL-2+ ✅ Permitted BiocManager::install('WGCNA')
ggplot2 ≥3.3.0 MIT ✅ Permitted install.packages('ggplot2')
ComplexHeatmap ≥2.10.0 MIT ✅ Permitted BiocManager::install('ComplexHeatmap')
clusterProfiler ≥4.0.0 Artistic-2.0 ✅ Permitted BiocManager::install('clusterProfiler')

Inputs

Required:

  1. Normalized expression matrix (CSV/TSV): - Rows: Genes, Columns: Samples - Values: VST, rlog, TPM, or FPKM (NOT raw counts) - 5,000-15,000 most variable genes recommended
  1. Sample metadata (CSV/TSV): - Sample IDs matching expression matrix columns - Traits/conditions for module-trait correlation

Optional: - Differential expression results (to highlight DEGs)

  • Gene annotations for enrichment analysis

Data Requirements: - ≥15 samples (20+ recommended)

  • Batch effects removed or corrected
  • No missing values in expression matrix

Outputs

CSV Files:

  1. wgcna_gene_modules.csv - Gene-module assignments with connectivity metrics
  2. wgcna_hub_genes.csv - Top hub genes per module
  3. wgcna_module_trait_cor.csv - Module-trait correlations with p-values
  4. wgcna_eigengenes.csv - Module eigengene values per sample
  5. wgcna_report.md - Summary report with interpretation

Plots (PNG + SVG):

  1. soft_power_selection.png/.svg - Power selection diagnostic plot
  2. module_dendrogram.png/.svg - Gene dendrogram with module colors
  3. module_trait_correlation.png/.svg - Module-trait heatmap
  4. eigengene_heatmap.png/.svg - Module eigengene expression patterns
  5. hub_genes_barplot.png/.svg - Hub genes by connectivity

Analysis Objects (RDS):

  1. wgcna_network.rds - Complete network object from blockwiseModules - Load with: net <- readRDS('wgcna_network.rds') - Required for: module preservation analysis, advanced network visualization
  2. wgcna_module_colors.rds - Module color assignments per gene - Load with: colors <- readRDS('wgcna_module_colors.rds') - Required for: downstream module-specific analyses
  3. wgcna_expression_matrix.rds - Filtered expression matrix used for analysis - Load with: expr <- readRDS('wgcna_expression_matrix.rds') - Required for: reanalysis, module preservation testing
  4. wgcna_full_results.rds - Complete results object with all components - Load with: results <- readRDS('wgcna_full_results.rds') - Required for: replotting, additional analyses

Key Metrics:

  • module: Module color assignment (grey = unassigned)
  • kWithin: Intramodular connectivity (higher = more connected)
  • MM: Module membership (correlation with eigengene)
  • hub_score: Combined connectivity metric (MM × kWithin)

Clarification Questions

  1. Input Files (ASK THIS FIRST):

    • Do you have specific normalized expression data and sample metadata files to analyze?
    • If uploaded: Are these the expression matrix and metadata you'd like to analyze?
    • Expected formats: CSV or TSV with genes as rows, samples as columns
    • Or use example data? Female mouse liver dataset (135 samples, liver tissue, multiple traits)
  2. What is your normalized expression data format? - VST (variance stabilizing transformation) from DESeq2

    • rlog (regularized log) from DESeq2
    • TPM (transcripts per million)
    • FPKM/RPKM
    • If unsure or raw counts: normalize first using DESeq2
  3. How many samples do you have? - 15-30 samples (minimum for WGCNA, results may be less robust)

    • 30-50 samples (good power for network detection)
    • 50+ samples (excellent power, most reliable results)
  4. What traits/conditions do you want to correlate with modules? - Treatment vs control (binary)

    • Disease status or phenotype
    • Continuous variables (age, dose, time, weight)
    • Multiple traits (all will be tested)
  5. Gene filtering strategy? - Top 5,000 most variable genes (default, recommended)

    • Top 10,000-15,000 genes (for larger datasets)
    • All genes passing expression threshold
    • Pre-filtered gene list (e.g., from DE analysis)

Standard Workflow

🚨 MANDATORY: USE SCRIPTS EXACTLY AS SHOWN - DO NOT WRITE INLINE CODE 🚨

Step 1 - Load data:

library(WGCNA)
allowWGCNAThreads()

source("scripts/load_example_data.R")
wgcna_data <- load_example_wgcna_data()
datExpr <- wgcna_data$datExpr
meta <- wgcna_data$meta

# For your own data:
# source("scripts/prepare_wgcna_data.R")
# data <- prepare_wgcna_data("expression.csv", "metadata.csv", top_n_genes = 5000)
# datExpr <- data$datExpr
# meta <- data$meta

Step 2 - Run WGCNA analysis:

source("scripts/wgcna_workflow.R")
results <- run_wgcna_analysis(
  datExpr,
  meta,
  traits = c("weight_g", "Glucose_mg_dl"),  # Adjust to your traits
  organism = "mouse"  # or "human", "rat", or NULL to skip enrichment
)

DO NOT write inline WGCNA code. Just source the script.

Step 3 - Generate visualizations:

source("scripts/plot_all_wgcna.R")
plot_all_wgcna(results, output_dir = "wgcna_results")

DO NOT write inline plotting code (png, svg, plotDendroAndColors, etc.). Just use the script.

The script handles PNG + SVG export with graceful fallback for SVG dependencies.

Step 4 - Export results:

source("scripts/export_wgcna_results.R")
export_all(results, output_dir = "wgcna_results")

DO NOT write custom export code. Use export_all().

✅ VERIFICATION - You should see:

  • After Step 1: "✓ Successfully loaded female mouse liver dataset"
  • After Step 2: "✓ WGCNA analysis completed successfully!"
  • After Step 3: "✓ All WGCNA plots generated successfully!"
  • After Step 4: "=== Export Complete ==="

⚠️ CRITICAL - DO NOT:

  • Write inline WGCNA codeSTOP: Use source("scripts/wgcna_workflow.R")
  • Write inline plotting code (png, svg, plotDendroAndColors, etc.)STOP: Use plot_all_wgcna()
  • Write custom export codeSTOP: Use export_all()
  • Try to install svglite → script handles SVG fallback automatically
  • ❌ Use absolute paths like /mnt/knowhow/ → use relative paths scripts/
  • ❌ Skip soft power selection → required for scale-free topology
  • ❌ Use raw counts → normalize first with DESeq2 VST or rlog

⚠️ IF SCRIPTS FAIL - Script Failure Hierarchy:

  1. Fix and Retry (90%) - Install missing package, re-run script
  2. Modify Script (5%) - Edit the script file itself, document changes
  3. Use as Reference (4%) - Read script, adapt approach, cite source
  4. Write from Scratch (1%) - Only if genuinely impossible, explain why

NEVER skip directly to writing inline code without trying the script first.

What the scripts provide:


Parameter Customization

When customization is needed:


Common Issues

Issue Cause Solution
Too few samples error <15 samples WGCNA requires ≥15 samples; combine replicates or use alternative methods
Scale-free R² never exceeds 0.75 Batch effects or poor data quality Check for batch effects; try different normalization; inspect PCA
All genes assigned to grey module minModuleSize too large or poor gene filtering Lower minModuleSize to 20-30; increase top_n_genes to 10,000-15,000
No significant module-trait correlations Weak biological signal or incorrect traits Check trait coding (numeric for continuous, 0/1 for binary); try more samples
Soft power recommended is very high (>20) Data not suitable for scale-free network Check normalization; consider signed vs unsigned network
Hub gene identification fails Module colors not provided correctly Ensure module_colors matches gene order in datExpr
Enrichment analysis returns no results Wrong organism or gene ID format Verify organism parameter matches data; convert gene IDs if needed
Memory errors during network construction Too many genes Reduce to 5,000-10,000 most variable genes; increase RAM

Interpretation Guidelines

Module colors: - Each color = distinct co-expression module

  • Grey = genes not assigned to any module
  • Larger modules may represent broader biological processes

Hub genes:

  • High kWithin = highly connected within module
  • High MM = strong correlation with module eigengene
  • Hub genes are candidates for experimental validation

Module-trait correlations:

  • |r| > 0.5 and p < 0.05 = significant association
  • Positive correlation = module genes increase with trait
  • Negative correlation = module genes decrease with trait
  • Focus on modules with strongest associations

Suggested Next Steps

After identifying modules and hub genes:

  1. Functional validation - Validate hub genes experimentally (qPCR, knockdown, overexpression)
  2. Enrichment analysis - Test modules for GO/KEGG enrichment to understand biological processes
  3. Compare with DE results - Overlay DE genes on network to see which modules are enriched
  4. Network visualization - Export to Cytoscape for detailed network visualization
  5. Cross-dataset validation - Test module preservation in independent datasets

Related Skills


References

Documentation:

Example Data:

  • Example Datasets - Public datasets for WGCNA analysis

Key Papers:

  • Key WGCNA Papers - Essential publications
  • Langfelder & Horvath (2008). WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics. doi:10.1186/1471-2105-9-559
  • Zhang & Horvath (2005). A general framework for weighted gene co-expression network analysis. Statistical Applications in Genetics and Molecular Biology. doi:10.2202/1544-6115.1128

Code preview

scripts/build_network.R

# Build co-expression network and detect modules

library(WGCNA)

#' Build co-expression network and detect modules
#'
#' @param datExpr Expression matrix (samples x genes)
#' @param power Soft-thresholding power
#' @param min_module_size Minimum genes per module
#' @param merge_cut_height Height for merging similar modules
#' @return List containing network object, module colors, and module labels
build_network <- function(datExpr, power, min_module_size = 30, merge_cut_height = 0.25) {

  cat("Building network with power =", power, "\n")

  # One-step network construction and module detection
  net <- blockwiseModules(
    datExpr,
    power = power,
    TOMType = "signed",
    networkType = "signed",
    minModuleSize = min_module_size,
    reassignThreshold = 0,
    mergeCutHeight = merge_cut_height,
    numericLabels = TRUE,
    pamRespectsDendro = FALSE,
    saveTOMs = FALSE,
    verbose = 3
  )

  # Convert numeric labels to colors
  module_colors <- labels2colors(net$colors)

  cat("\nModule detection complete:\n")
  cat("Number of modules:", length(unique(module_colors)) - 1, "(excluding grey/unassigned)\n")
  print(table(module_colors))

  return(list(
    net = net,
    module_colors = module_colors,
    module_labels = net$colors
  ))
}

scripts/correlate_modules_traits.R

# Correlate module eigengenes with sample traits

library(WGCNA)
library(ComplexHeatmap)
library(circlize)

#' Correlate module eigengenes with sample traits
#'
#' @param datExpr Expression matrix
#' @param module_colors Module color assignments
#' @param traits Data frame of sample traits (numeric)
#' @param output_file Path to save heatmap
#' @return List with module eigengenes, correlations, and p-values
correlate_modules_traits <- function(datExpr, module_colors, traits,
                                      output_file = "module_trait_correlation.svg") {

  # Calculate module eigengenes
  MEs <- moduleEigengenes(datExpr, colors = module_colors)$eigengenes
  MEs <- orderMEs(MEs)

  # Convert traits to numeric if needed
  traits_numeric <- data.frame(lapply(traits, function(x) {
    if (is.factor(x) || is.character(x)) {
      as.numeric(as.factor(x))
    } else {
      as.numeric(x)
    }
  }))
  rownames(traits_numeric) <- rownames(traits)

  # Calculate correlations
  module_trait_cor <- cor(MEs, traits_numeric, use = "pairwise.complete.obs")
  module_trait_pval <- corPvalueStudent(module_trait_cor, nrow(datExpr))

  # Create text matrix for heatmap (correlation and p-value)
  text_matrix <- matrix(
    paste0(signif(module_trait_cor, 2), "\n(", signif(module_trait_pval, 1), ")"),
    nrow = nrow(module_trait_cor),
    ncol = ncol(module_trait_cor)
  )

  # Create color function
  col_fun <- colorRamp2(c(-1, 0, 1), c("blue", "white", "red"))

  # Create heatmap with ComplexHeatmap
  ht <- Heatmap(
    module_trait_cor,
    name = "Correlation",
    col = col_fun,
    cluster_rows = FALSE,
    cluster_columns = FALSE,
    show_row_names = TRUE,
    show_column_names = TRUE,
    row_names_side = "left",
    column_names_side = "bottom",
    column_title = "Module-Trait Relationships",
    cell_fun = function(j, i, x, y, width, height, fill) {
      grid.text(text_matrix[i, j], x, y, gp = gpar(fontsize = 8))
    },
    heatmap_legend_param = list(
      title = "Correlation",
      at = c(-1, -0.5, 0, 0.5, 1)
    ),
    width = unit(max(8, ncol(traits) * 1.5), "cm"),
    height = unit(max(8, nrow(module_trait_cor) * 0.5), "cm")
  )

  # Save to file
  svg(output_file, width = max(8, ncol(traits) * 1.5), height = max(8, nrow(module_trait_cor) * 0.4))
  draw(ht)
  dev.off()

  cat("Saved:", output_file, "\n")

  # Return results
  results <- data.frame(
    module = rownames(module_trait_cor),
    stringsAsFactors = FALSE
  )

scripts/export_wgcna_results.R

# Export all WGCNA results

library(WGCNA)

#' Export all WGCNA results (including RDS objects for downstream skills)
#'
#' @param results Results object from run_wgcna_analysis()
#' @param output_dir Directory to save results (default: "wgcna_results")
#' @param output_prefix Prefix for output files (default: "wgcna")
export_all <- function(results, output_dir = "wgcna_results", output_prefix = "wgcna") {

  cat("\n=== Exporting WGCNA Results ===\n\n")

  # Create output directory if needed
  if (!dir.exists(output_dir)) {
    dir.create(output_dir, recursive = TRUE)
  }

  # Extract components from results
  gene_info <- results$hub_results$gene_info
  hub_genes <- results$hub_results$hub_genes
  trait_results <- results$trait_results

  # 1. Export gene-module assignments
  csv_path <- file.path(output_dir, paste0(output_prefix, "_gene_modules.csv"))
  write.csv(gene_info, csv_path, row.names = FALSE)
  cat("  Saved:", csv_path, "\n")

  # 2. Export hub genes
  hub_df <- do.call(rbind, lapply(names(hub_genes), function(mod) {
    df <- hub_genes[[mod]]
    df$module <- mod
    df
  }))
  csv_path <- file.path(output_dir, paste0(output_prefix, "_hub_genes.csv"))
  write.csv(hub_df, csv_path, row.names = FALSE)
  cat("  Saved:", csv_path, "\n")

  # 3. Export module-trait correlations
  csv_path <- file.path(output_dir, paste0(output_prefix, "_module_trait_cor.csv"))
  write.csv(trait_results$results, csv_path, row.names = FALSE)
  cat("  Saved:", csv_path, "\n")

  # 4. Export module eigengenes
  csv_path <- file.path(output_dir, paste0(output_prefix, "_eigengenes.csv"))
  write.csv(trait_results$MEs, csv_path)
  cat("  Saved:", csv_path, "\n")

  # 5. Save analysis objects as RDS for downstream skills (CRITICAL)
  cat("\n  Saving analysis objects for downstream use:\n")

  rds_path <- file.path(output_dir, paste0(output_prefix, "_network.rds"))
  saveRDS(results$network, rds_path)
  cat("    • wgcna_network.rds\n")
  cat("      (Load with: net <- readRDS('", basename(rds_path), "'))\n", sep = "")

  rds_path <- file.path(output_dir, paste0(output_prefix, "_module_colors.rds"))
  saveRDS(results$module_colors, rds_path)
  cat("    • wgcna_module_colors.rds\n")
  cat("      (Load with: colors <- readRDS('", basename(rds_path), "'))\n", sep = "")

  rds_path <- file.path(output_dir, paste0(output_prefix, "_expression_matrix.rds"))
  saveRDS(results$datExpr, rds_path)
  cat("    • wgcna_expression_matrix.rds\n")
  cat("      (Load with: expr <- readRDS('", basename(rds_path), "'))\n", sep = "")

  rds_path <- file.path(output_dir, paste0(output_prefix, "_full_results.rds"))
  saveRDS(results, rds_path)
  cat("    • wgcna_full_results.rds (complete results object)\n")
  cat("      (Load with: results <- readRDS('", basename(rds_path), "'))\n", sep = "")

  # 6. Create summary report
  cat("\n  Creating summary report...\n")
  summary_lines <- c(
    "# WGCNA Co-expression Network Analysis Summary\n",
    paste("**Total genes analyzed:**", nrow(gene_info)),
    paste("**Total samples:**", nrow(trait_results$MEs)),
    paste("**Number of modules:**", length(unique(gene_info$module)) - 1, "(excluding grey)\n"),
    "## Module Sizes"
  )

Companion files

TypePathBytes
Textreferences/2008_WGCNA an R package for weighted correlation network analysis.pdf.UNAVAILABLE.txt391
Textreferences/2021_WGCNA Gene Correlation Network Analysis - Bioinformatics Workbook.pdf.UNAVAILABLE.txt401
Markdownreferences/parameter-tuning-guide.md15,005
Markdownreferences/troubleshooting.md10,604
Markdownreferences/wgcna-best-practices.md15,201
Markdownreferences/wgcna-reference.md14,644
Rscripts/build_network.R1,269
Rscripts/correlate_modules_traits.R2,786
Rscripts/export_wgcna_results.R4,480
Rscripts/identify_hub_genes.R1,958
Rscripts/load_example_data.R7,444
Rscripts/module_enrichment.R5,033
Rscripts/pick_soft_power.R2,452
Rscripts/plot_all_wgcna.R4,065
Rscripts/plot_eigengene_heatmap.R1,801
Rscripts/plot_hub_genes.R2,325
Rscripts/plot_module_dendrogram.R1,408
Rscripts/plotting_helpers.R2,930
Rscripts/prepare_wgcna_data.R2,548
Rscripts/wgcna_workflow.R3,586
MarkdownSKILL.md13,975
JSONskill.meta.json4,083