Variant Annotation

Annotate VCF variants with function, clinical significance and pathogenicity.

Overview

Problem. A coordinate means nothing until annotated.

Use when: Have a VCF; need annotation or prioritisation
Avoid when: Mismatched build / annotation

Learning goals

Figures

Variant Annotation Overview
4-Step Annotation Workflow
When to Use
VEP vs SNPEff
Variant Impact Tiers
Pathogenicity & ACMG

Tutorial

Annotate genomic variants in VCF files with functional effects, clinical significance, and pathogenicity predictions.

When to Use This Skill

Use this skill when you have: - ✅ VCF files from variant calling (GATK, bcftools, FreeBayes, etc.) - ✅ Need functional annotation (gene impact, consequence types) - ✅ Need clinical interpretation (pathogenicity, ClinVar, ACMG classification) - ✅ Need variant filtering by consequence, frequency, or pathogenicity - ✅ Working with human, mouse, or 38,000+ other genomes

Use cases: - Clinical diagnostics (identify pathogenic variants in patient samples)

  • Population genetics (annotate with allele frequencies)
  • Cancer genomics (somatic variant annotation with COSMIC)
  • Research variant prioritization (rank by predicted impact)
  • Non-model organism analysis (38,000+ genomes supported)

Don't use for: - ❌ Variant calling (use GATK/bcftools first) → this skill starts with VCF files

  • ❌ Structural variant annotation (limited support, use specialized tools)

Quick Start (Example Data)

Test this skill with example variants:

from scripts.load_example_data import load_clinvar_pathogenic_sample
data = load_clinvar_pathogenic_sample()  # Creates test VCF (~10 variants)
print(f"Example VCF created: {data['vcf_path']}")

What you get:

  • Dataset: 10 ClinVar pathogenic variants (BRCA1/BRCA2)
  • Expected: ~6 HIGH impact, ~8 pathogenic classifications
  • Runtime: VEP ~5 min, SNPEff ~2 min

Next steps: Use this VCF in the Standard Workflow below.

Optional validation test (for developers): python assets/eval/simple_test.py to verify installation.

For your own data: See Inputs and Clarification Questions.

Installation

Choose one annotation tool:

Tool Best For Installation Cache/Database Size
VEP Human clinical, comprehensive annotations conda install -c bioconda ensembl-vep 15-20 GB (one-time)
SNPEff Non-model organisms, quick analysis conda install -c bioconda snpeff 2-5 GB per genome

⚠️ FOR AGENTS: If tools are missing, install them automatically:

# Check and install SNPEff if needed
from scripts.run_snpeff import check_snpeff_installation
is_installed, _, _ = check_snpeff_installation()
if not is_installed:
    from scripts.install_tools import install_snpeff
    install_snpeff()  # Takes ~2-5 min, installs via conda

For VEP:

from scripts.install_tools import install_vep
install_vep()  # Takes ~5-10 min, then install cache separately

Python dependencies:

pip install -r requirements.txt
# Or: pip install pandas numpy pysam cyvcf2 plotnine xlsxwriter

Minimum versions: - Python 3.9+

  • VEP 110+ or SNPEff 5.1+
  • conda (for tool installation)

VEP cache setup (if using VEP):

vep_install -a c -s homo_sapiens -y GRCh38  # Human, ~15-20 GB, ~30-60 min

SNPEff database setup (if using SNPEff):

snpEff download GRCh38.105  # Human, ~2-5 GB, ~10 min

For detailed installation: See references/installation_guide.md

License: VEP (Apache 2.0), SNPEff (LGPLv3) - both permit commercial use ✅

Inputs

Required:

  • VCF file: Variant call format file (bgzipped recommended)
  • Must be valid VCF 4.x format
  • Coordinate-sorted
  • Germline, somatic, or population variants
  • SNVs, indels, or mixed (structural variants have limited support)

Genome reference:

  • Human: GRCh38/hg38 (recommended) or GRCh37/hg19
  • Mouse: GRCm39 or GRCm38
  • Other: 38,000+ genomes available in SNPEff

Optional: - Reference FASTA (for validation)

  • BED file (for region-specific annotation)
  • Custom gene lists (for filtering)

VCF validation: First workflow step checks format, coordinates, and reference alleles.

Outputs

Analysis objects (Pickle):

  • analysis_object.pkl - Complete analysis object for downstream use
  • Load with: import pickle; obj = pickle.load(open('results/analysis_object.pkl', 'rb'))
  • Contains: annotated variants DataFrame, gene summaries, tool metadata
  • Required for: downstream pathway enrichment, protein structure mapping, literature mining skills
  • Size: ~1-10 MB depending on variant count

Annotated variants:

  • annotated.vep.vcf.gz or annotated.snpeff.vcf.gz - Full VCF with annotations in INFO field
  • all_variants.csv - All annotated variants in tabular format

Filtered results:

  • high_impact_variants.csv - HIGH impact variants only
  • moderate_impact_variants.csv - HIGH/MODERATE impact variants
  • rare_variants.csv - Rare variants (AF < 0.01)
  • filtered_high_impact.vcf.gz - Filtered VCF (maintains VCF format for downstream tools)

Gene summaries:

  • gene_summary.csv - Variants aggregated per gene (useful for pathway enrichment)

Visualizations (PNG + SVG, 300 DPI):

  • consequence_distribution.png/.svg - Variant types (missense, stop_gained, etc.)
  • impact_by_chromosome.png/.svg - Impact severity distribution across chromosomes
  • pathogenicity_scores.png/.svg - CADD, REVEL, SIFT score distributions
  • allele_frequency.png/.svg - Population frequency distribution
  • gene_burden.png/.svg - Top genes by variant burden

Reports:

  • summary_report.xlsx - Excel report with multiple sheets (summary statistics, top consequences, top genes, variants)

Clarification Questions

⚠️ CRITICAL: Always ask question #1 first to check if user has provided VCF files.

Before starting, gather:

  1. Input Files (ASK THIS FIRST):

    • Do you have specific VCF file(s) to annotate?
    • If uploaded: Is this the VCF you'd like to annotate?
    • Expected format: VCF/VCF.GZ (variant call format)
    • Or use example data for testing?
    • Use load_clinvar_pathogenic_sample() (~10 variants, known pathogenic)
  2. Organism and Genome Build:

    • What organism? (human, mouse, zebrafish, etc.)
    • Which genome assembly? (GRCh38/hg38 recommended for human)
    • ⚠️ Must match VCF reference - check VCF header: ##reference=
  3. Primary Use Case (determines tool selection):

    • Clinical/medical genetics → VEP (comprehensive clinical databases)
    • Population genetics → Either tool works
    • Non-model organism → SNPEff (38,000+ genomes)
    • Quick analysis → SNPEff (faster setup)
    • See references/tool_selection_guide.md for detailed comparison
  4. Annotation Priorities (select all that apply):

    • Clinical significance (ClinVar, OMIM)
    • Pathogenicity predictions (SIFT, PolyPhen, CADD, REVEL)
    • Population frequencies (gnomAD, 1000 Genomes)
    • Regulatory impacts (ENCODE)
    • Basic consequences only (faster annotation)
  5. Computational Resources:

    • High-performance (32+ GB RAM) → VEP with full cache
    • Standard (16-32 GB) → VEP or SNPEff
    • Limited (<16 GB) → SNPEff recommended
  6. Output Requirements:

    • Variant prioritization? → Apply filtering + ACMG classification
    • Gene-level analysis? → Generate gene summaries
    • Clinical reporting? → Excel export with multiple sheets
    • Statistical analysis? → CSV format for downstream tools

Standard Workflow

🚨 MANDATORY: USE SCRIPTS EXACTLY AS SHOWN - DO NOT WRITE INLINE CODE 🚨

CRITICAL: Use relative paths (scripts/). DO NOT construct absolute paths like /mnt/knowhow/ or /workspace/.

This is a 4-step workflow. Scripts handle all complexity automatically.

Step 1 - Load and validate data:

# Load example data (or use your own VCF file)
from scripts.load_example_data import load_clinvar_pathogenic_sample
data = load_clinvar_pathogenic_sample()
input_vcf = data['vcf_path']

# Validate VCF format
from scripts.validate_vcf import validate_vcf, vcf_summary_stats
results = validate_vcf(input_vcf)
if not results['is_valid']:
    print(f"Errors: {results['errors']}")

stats = vcf_summary_stats(input_vcf)
print(f"Total: {stats['total_variants']}, SNVs: {stats['snvs']}, Indels: {stats['indels']}")

✅ VERIFICATION: You MUST see: "✓ Data loaded successfully"

DO NOT write inline validation code. Use the scripts exactly as shown.


Step 2 - Run annotation:

# Select and run annotation tool
from scripts.select_tool import select_annotation_tool
tool = select_annotation_tool(organism='human', use_case='clinical', resources='standard')

# Check if tool is installed, install if needed
if tool == 'vep':
    from scripts.run_vep import check_vep_installation
    is_installed, _, _ = check_vep_installation()
    if not is_installed:
        print("VEP not found. Installing via conda...")
        from scripts.install_tools import install_vep
        install_vep()
        # Note: After VEP install, you must install cache separately
        print("\n⚠️ IMPORTANT: Now install VEP cache:")
        print("  vep_install -a c -s homo_sapiens -y GRCh38")

elif tool == 'snpeff':
    from scripts.run_snpeff import check_snpeff_installation
    is_installed, _, _ = check_snpeff_installation()
    if not is_installed:
        print("SNPEff not found. Installing via conda...")
        from scripts.install_tools import install_snpeff
        install_snpeff()

# VEP annotation
if tool == 'vep':
    from scripts.run_vep import run_vep
    annotated_vcf = run_vep(input_vcf, "annotated.vep.vcf.gz", genome="GRCh38", everything=True, fork=4)
    from scripts.parse_vep_output import parse_vep_vcf
    df = parse_vep_vcf("annotated.vep.vcf.gz")

# SNPEff annotation
elif tool == 'snpeff':
    from scripts.run_snpeff import run_snpeff
    annotated_vcf = run_snpeff(input_vcf, "annotated.snpeff.vcf.gz", genome="GRCh38.105", canon=True, threads=4)
    from scripts.parse_snpeff_output import parse_snpeff_vcf
    df = parse_snpeff_vcf("annotated.snpeff.vcf.gz")

✅ VERIFICATION: You MUST see: "✓ Annotation completed successfully!"

DO NOT write inline annotation code. Use run_vep() or run_snpeff() as shown.

⚠️ DO NOT fall back to VEP API mode - API mode is 10-100x slower and less full-featured. Always install the tool if missing.


Step 3 - Generate visualizations:

from scripts.filter_variants import filter_by_consequence, filter_by_frequency
from scripts.annotate_genes import annotate_gene_summary
from scripts.plot_variant_distribution import (
    plot_consequence_distribution,
    plot_impact_by_chromosome,
    plot_pathogenicity_scores,
    plot_allele_frequency,
    plot_gene_burden
)

# Filter variants for analysis
high_impact = filter_by_consequence(df, consequence_levels=['HIGH', 'MODERATE'])
rare = filter_by_frequency(df, max_af=0.01)

# Generate gene-level summaries
gene_df = annotate_gene_summary(df)

# Create visualizations (PNG + SVG)
output_dir = "results"
plot_consequence_distribution(df, f"{output_dir}/consequence_distribution.png")
plot_impact_by_chromosome(df, f"{output_dir}/impact_by_chromosome.png")
plot_pathogenicity_scores(df, f"{output_dir}/pathogenicity_scores.png")
plot_allele_frequency(df, f"{output_dir}/allele_frequency.png")
if gene_df is not None:
    plot_gene_burden(gene_df, output_file=f"{output_dir}/gene_burden.png")

✅ VERIFICATION: You MUST see: "Saved: [filename].webp" and "Saved: [filename].svg" for each plot

DO NOT write inline plotting code. Use the plotting functions as shown.


Step 4 - Export results:

from scripts.export_results import export_all

# Export all results in all formats
export_all(
    df=df,
    gene_df=gene_df,
    original_vcf=input_vcf,
    output_dir="results",
    tool_name=tool
)

✅ VERIFICATION: You MUST see: "=== Export Complete ==="

DO NOT write custom export code. Use export_all() as shown.


⚠️ CRITICAL - DO NOT:

  • Write inline validation/annotation/plotting/export codeSTOP: Use the scripts shown above
  • Fall back to VEP API mode when tools are missingSTOP: Use install_tools.py to install via conda
  • ❌ Use absolute paths like /mnt/knowhow/ or /workspace/ → use relative paths scripts/
  • ❌ Skip validation step → catches format errors early
  • ❌ Mix genome builds → VCF and annotation database must match (check VCF header)
  • ❌ Write custom export code → STOP: Use export_all()

If tools are missing: Use install_tools.py to install via conda (~2-10 min). DO NOT use VEP API mode (10-100x slower, less full-featured).

If scripts fail: Install missing dependencies, re-run. Only modify scripts if genuinely broken.

Common Issues

Issue Possible Cause Solution Details
VEP/SNPEff fails to start Tool not installed or not in PATH Use install_tools.py: from scripts.install_tools import install_snpeff; install_snpeff() Takes 2-5 min. DO NOT fall back to VEP API mode (10-100x slower). Verify with which vep or which snpEff
"Cannot find genome" Database not downloaded Run install_vep_cache() or download_snpeff_database() See installation_guide.md
All variants annotated as intergenic Genome build mismatch (hg19 vs hg38) Check VCF header ##reference=, use matching genome build Use bcftools annotate --rename-chrs to convert
No pathogenicity scores Plugins/databases not installed Install VEP plugins or SnpSift databases See vep_best_practices.md
Memory error during annotation Insufficient RAM or large VCF Reduce buffer size (--buffer_size 1000) or split VCF by chromosome Use bcftools view -r chr1 to split
"Invalid VCF format" VCF format errors (tabs, header, etc.) Run validation step first, check tabs and header completeness Use bcftools view -h to inspect header
Chromosome name mismatch VCF uses "chr1" but database uses "1" Convert with bcftools annotate --rename-chrs Create mapping file: chr1 1\nchr2 2\n...
Very slow annotation Not using cache/database (API mode) Ensure VEP cache or SNPEff database installed locally API mode is 10-100x slower

For complete troubleshooting: See references/troubleshooting_guide.md

QC checkpoints:

  • VCF validation: Ti/Tv ratio 2.0-2.1 (genome) or 2.8-3.0 (exome)
  • Annotation completeness: 100% variants annotated
  • Expected HIGH impact: 10-50 per exome (healthy individual)

For detailed QC thresholds: See references/qc_guidelines.md

Suggested Next Steps

After annotating variants:

  1. Variant prioritization → Filter and rank by pathogenicity

    • Use filtering: AF < 0.01, HIGH/MODERATE impact, CADD > 20
    • Apply ACMG classification for clinical reporting
  2. Gene-level analysis → Aggregate variants per gene

    • Identify genes with high variant burden
    • Prepare for pathway enrichment analysis
  3. Clinical reporting → Export prioritized variants

    • Excel format with multiple sheets (variants, genes, summary)
    • Include ACMG classifications and ClinVar annotations
  4. Statistical analysis → Burden testing, association studies

    • Compare variant frequencies between cases and controls
    • Gene-based collapsing analysis

Related analyses: Pathway enrichment from gene lists, protein structure mapping, literature mining

Related Skills

Upstream (run these first):

  • Variant calling workflows (GATK HaplotypeCaller, bcftools call, FreeBayes)
  • Variant quality score recalibration (VQSR)

Downstream (run these after):

  • Pathway enrichment from gene lists
  • Protein structure visualization and impact prediction
  • Clinical report generation

Complementary: - Copy number variant (CNV) annotation

  • Structural variant annotation (specialized tools)
  • Pharmacogenomics annotation (PharmGKB)

References

Primary Citations

Ensembl VEP:

SNPEff:

  • Cingolani P, et al. (2012) A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff. Fly 6(2):80-92. doi:10.4161/fly.19695

ACMG Guidelines:

  • Richards S, et al. (2015) Standards and guidelines for the interpretation of sequence variants. Genetics in Medicine 17(5):405-424. PubMed:25741868

Documentation

Tool Selection & Configuration:

Analysis & Interpretation:

Setup & Troubleshooting:

Code preview

scripts/annotate_genes.py

"""
Gene-Level Annotation Module

This module provides functions for generating gene-level summaries from variant data.
"""

import sys
from collections import defaultdict

try:
    import pandas as pd
    import numpy as np
except ImportError as e:
    print(f"Error: Missing required package: {e}")
    print("Install with: pip install pandas numpy")
    sys.exit(1)


def create_gene_summary(
    df,
    group_by='SYMBOL',
    count_variants=True,
    count_by_consequence=True,
    count_by_impact=True,
    include_most_severe=True,
    aggregate_scores=None,
    aggregation='max'
):
    """
    Create gene-level summary from variant data.

    Parameters
    ----------
    df : pd.DataFrame
        DataFrame with variant annotations
    group_by : str
        Column to group by (default: 'SYMBOL' for gene symbol)
    count_variants : bool
        Count total variants per gene
    count_by_consequence : bool
        Count variants by consequence type
    count_by_impact : bool
        Count variants by impact level
    include_most_severe : bool
        Include most severe variant per gene
    aggregate_scores : list, optional
        List of score columns to aggregate (e.g., ['CADD_PHRED', 'REVEL'])
    aggregation : str
        Aggregation method: 'max', 'mean', 'median' (default: 'max')

    Returns
    -------
    pd.DataFrame
        Gene-level summary DataFrame
    """
    if group_by not in df.columns:
        raise ValueError(f"Column '{group_by}' not found in DataFrame")

    # Remove missing gene names
    df_genes = df[df[group_by].notna()].copy()

    # Initialize summary dict
    summary_data = defaultdict(dict)

    # Group by gene
    for gene, gene_df in df_genes.groupby(group_by):
        gene_summary = {}

        # Total variant count
        if count_variants:
            gene_summary['N_Variants'] = len(gene_df)

        # Count by impact
        if count_by_impact and 'IMPACT' in gene_df.columns:
            for impact in ['HIGH', 'MODERATE', 'LOW', 'MODIFIER']:
                count = len(gene_df[gene_df['IMPACT'] == impact])
                gene_summary[f'N_{impact}_Impact'] = count

        # Count by consequence
        if count_by_consequence and 'Consequence' in gene_df.columns:

scripts/export_results.py

"""
Export Results Module

This module provides functions for exporting annotated variants in multiple formats.
"""

import sys
from pathlib import Path

try:
    import pandas as pd
    import xlsxwriter
    from openpyxl import load_workbook
    from openpyxl.styles import Font, PatternFill, Alignment
except ImportError as e:
    print(f"Error: Missing required package: {e}")
    print("Install with: pip install pandas xlsxwriter openpyxl")
    sys.exit(1)


def export_to_excel(output_file, sheets, format_columns=True, freeze_panes=(1, 2)):
    """
    Export data to Excel with multiple sheets and formatting.

    Parameters
    ----------
    output_file : str
        Output Excel file path
    sheets : dict
        Dictionary mapping sheet names to DataFrames
    format_columns : bool
        Apply conditional formatting (default: True)
    freeze_panes : tuple
        Row and column to freeze (default: (1, 2) for header + 2 columns)
    """
    # Create Excel writer
    with pd.ExcelWriter(output_file, engine='xlsxwriter') as writer:
        workbook = writer.book

        # Define formats
        header_format = workbook.add_format({
            'bold': True,
            'bg_color': '#D9E1F2',
            'border': 1,
            'align': 'center',
            'valign': 'vcenter'
        })

        high_impact_format = workbook.add_format({'bg_color': '#FFC7CE', 'font_color': '#9C0006'})
        moderate_impact_format = workbook.add_format({'bg_color': '#FFEB9C', 'font_color': '#9C6500'})
        pathogenic_format = workbook.add_format({'bg_color': '#FFC7CE', 'font_color': '#9C0006'})

        # Write each sheet
        for sheet_name, df in sheets.items():
            df.to_excel(writer, sheet_name=sheet_name, index=False, startrow=0)
            worksheet = writer.sheets[sheet_name]

            # Format header
            for col_num, column_name in enumerate(df.columns):
                worksheet.write(0, col_num, column_name, header_format)

            # Auto-fit columns
            for col_num, column_name in enumerate(df.columns):
                column_data = df[column_name].astype(str)
                max_length = max(
                    column_data.apply(len).max(),
                    len(column_name)
                ) + 2
                worksheet.set_column(col_num, col_num, min(max_length, 50))

            # Freeze panes
            if freeze_panes:
                worksheet.freeze_panes(freeze_panes[0], freeze_panes[1])

            # Conditional formatting
            if format_columns and 'IMPACT' in df.columns:
                impact_col = df.columns.get_loc('IMPACT')
                worksheet.conditional_format(
                    1, impact_col, len(df), impact_col,
                    {

scripts/filter_variants.py

"""
Variant Filtering Module

This module provides functions for filtering annotated variants by consequence,
frequency, quality, and pathogenicity.
"""

import sys

try:
    import pandas as pd
    import numpy as np
except ImportError as e:
    print(f"Error: Missing required package: {e}")
    print("Install with: pip install pandas numpy")
    sys.exit(1)


# Consequence severity definitions (Sequence Ontology)
HIGH_IMPACT_CONSEQUENCES = {
    'transcript_ablation',
    'splice_acceptor_variant',
    'splice_donor_variant',
    'stop_gained',
    'frameshift_variant',
    'stop_lost',
    'start_lost',
    'transcript_amplification'
}

MODERATE_IMPACT_CONSEQUENCES = {
    'inframe_insertion',
    'inframe_deletion',
    'missense_variant',
    'protein_altering_variant'
}

LOW_IMPACT_CONSEQUENCES = {
    'splice_region_variant',
    'incomplete_terminal_codon_variant',
    'start_retained_variant',
    'stop_retained_variant',
    'synonymous_variant'
}


def filter_by_consequence(df, impact=None, consequence_types=None):
    """
    Filter variants by consequence severity or specific consequence types.

    Parameters
    ----------
    df : pd.DataFrame
        DataFrame with variant annotations
    impact : list, optional
        List of impact levels to keep: ['HIGH', 'MODERATE', 'LOW', 'MODIFIER']
    consequence_types : list, optional
        List of specific consequence types to keep

    Returns
    -------
    pd.DataFrame
        Filtered DataFrame
    """
    filtered = df.copy()

    # Filter by IMPACT
    if impact:
        if 'IMPACT' not in df.columns:
            print("Warning: IMPACT column not found")
            return filtered

        filtered = filtered[filtered['IMPACT'].isin(impact)]

    # Filter by specific consequence types
    if consequence_types:
        if 'Consequence' not in df.columns:
            print("Warning: Consequence column not found")
            return filtered

Companion files

TypePathBytes
Markdownreferences/auto_installation_implementation.md7,458
Markdownreferences/consequence_terms.md9,672
Markdownreferences/filtering_strategies.md19,433
Markdownreferences/installation_guide.md10,441
Markdownreferences/pathogenicity_interpretation.md11,852
Markdownreferences/qc_guidelines.md15,425
Markdownreferences/snpeff_best_practices.md10,422
Markdownreferences/tool_selection_guide.md12,226
Markdownreferences/troubleshooting_guide.md11,097
Markdownreferences/vep_best_practices.md9,351
Filescripts/__pycache__/test_complete_workflow.cpython-311-pytest-7.3.1.pyc12,914
Filescripts/__pycache__/validate_vcf.cpython-311.pyc14,223
Pythonscripts/annotate_genes.py8,175
Pythonscripts/export_results.py15,166
Pythonscripts/filter_variants.py10,601
Pythonscripts/install_tools.py6,309
Pythonscripts/load_example_data.py5,290
Pythonscripts/parse_snpeff_output.py7,642
Pythonscripts/parse_vep_output.py7,013
Pythonscripts/plot_variant_distribution.py11,139
Pythonscripts/prioritize_variants.py9,209
Pythonscripts/run_snpeff.py10,493
Pythonscripts/run_vep.py9,409
Pythonscripts/select_tool.py6,115
Pythonscripts/test_complete_workflow.py10,985
Pythonscripts/test_pickle_load.py3,182
Pythonscripts/validate_vcf.py10,187
Pythonscripts/verify_changes.py6,981
MarkdownSKILL.md17,189
JSONskill.meta.json5,041