⌂ / Overview / Population Genetics & Genomics / Variant Annotation

View companion source

Variant Annotation

Annotate VCF variants with function, clinical significance and pathogenicity.

Overview

Problem. A coordinate means nothing until annotated.

Use when: Have a VCF; need annotation or prioritisation

Avoid when: Mismatched build / annotation

Learning goals

VEP for human clinical, SnpEff for non-model organisms
Germline vs somatic interpretation

Figures

Tutorial

Annotate genomic variants in VCF files with functional effects, clinical significance, and pathogenicity predictions.

When to Use This Skill

Use this skill when you have:

✅ VCF files from variant calling (GATK, bcftools, FreeBayes, etc.)
✅ Need functional annotation (gene impact, consequence types)
✅ Need clinical interpretation (pathogenicity, ClinVar, ACMG classification)
✅ Need variant filtering by consequence, frequency, or pathogenicity
✅ Working with human, mouse, or 38,000+ other genomes

Use cases:

Clinical diagnostics (identify pathogenic variants in patient samples)
Population genetics (annotate with allele frequencies)
Cancer genomics (somatic variant annotation with COSMIC)
Research variant prioritization (rank by predicted impact)
Non-model organism analysis (38,000+ genomes supported)

Don't use for:

❌ Variant calling (use GATK/bcftools first) → this skill starts with VCF files
❌ Structural variant annotation (limited support, use specialized tools)

Quick Start (Example Data)

Test this skill with example variants:

from scripts.load_example_data import load_clinvar_pathogenic_sample
data = load_clinvar_pathogenic_sample()  # Creates test VCF (~10 variants)
print(f"Example VCF created: {data['vcf_path']}")

What you get:

Dataset: 10 ClinVar pathogenic variants (BRCA1/BRCA2)
Expected: ~6 HIGH impact, ~8 pathogenic classifications
Runtime: VEP ~5 min, SNPEff ~2 min

Next steps: Use this VCF in the Standard Workflow below.

Optional validation test (for developers): python assets/eval/simple_test.py to verify installation.

For your own data: See Inputs and Clarification Questions.

Installation

Choose one annotation tool:

Tool	Best For	Installation	Cache/Database Size
VEP	Human clinical, comprehensive annotations	`conda install -c bioconda ensembl-vep`	15-20 GB (one-time)
SNPEff	Non-model organisms, quick analysis	`conda install -c bioconda snpeff`	2-5 GB per genome

⚠️ FOR AGENTS: If tools are missing, install them automatically:

# Check and install SNPEff if needed
from scripts.run_snpeff import check_snpeff_installation
is_installed, _, _ = check_snpeff_installation()
if not is_installed:
    from scripts.install_tools import install_snpeff
    install_snpeff()  # Takes ~2-5 min, installs via conda

For VEP:

from scripts.install_tools import install_vep
install_vep()  # Takes ~5-10 min, then install cache separately

Python dependencies:

pip install -r requirements.txt
# Or: pip install pandas numpy pysam cyvcf2 plotnine xlsxwriter

Minimum versions:

Python 3.9+
VEP 110+ or SNPEff 5.1+
conda (for tool installation)

VEP cache setup (if using VEP):

vep_install -a c -s homo_sapiens -y GRCh38  # Human, ~15-20 GB, ~30-60 min

SNPEff database setup (if using SNPEff):

snpEff download GRCh38.105  # Human, ~2-5 GB, ~10 min

For detailed installation: See references/installation_guide.md

License: VEP (Apache 2.0), SNPEff (LGPLv3) - both permit commercial use ✅

Inputs

Required:

VCF file: Variant call format file (bgzipped recommended)
Must be valid VCF 4.x format
Coordinate-sorted
Germline, somatic, or population variants
SNVs, indels, or mixed (structural variants have limited support)

Genome reference:

Human: GRCh38/hg38 (recommended) or GRCh37/hg19
Mouse: GRCm39 or GRCm38
Other: 38,000+ genomes available in SNPEff

Optional:

Reference FASTA (for validation)
BED file (for region-specific annotation)
Custom gene lists (for filtering)

VCF validation: First workflow step checks format, coordinates, and reference alleles.

Outputs

Analysis objects (Pickle):

analysis_object.pkl - Complete analysis object for downstream use
Load with: import pickle; obj = pickle.load(open('results/analysis_object.pkl', 'rb'))
Contains: annotated variants DataFrame, gene summaries, tool metadata
Required for: downstream pathway enrichment, protein structure mapping, literature mining skills
Size: ~1-10 MB depending on variant count

Annotated variants:

annotated.vep.vcf.gz or annotated.snpeff.vcf.gz - Full VCF with annotations in INFO field
all_variants.csv - All annotated variants in tabular format

Filtered results:

high_impact_variants.csv - HIGH impact variants only
moderate_impact_variants.csv - HIGH/MODERATE impact variants
rare_variants.csv - Rare variants (AF < 0.01)
filtered_high_impact.vcf.gz - Filtered VCF (maintains VCF format for downstream tools)

Gene summaries:

gene_summary.csv - Variants aggregated per gene (useful for pathway enrichment)

Visualizations (PNG + SVG, 300 DPI):

consequence_distribution.png/.svg - Variant types (missense, stop_gained, etc.)
impact_by_chromosome.png/.svg - Impact severity distribution across chromosomes
pathogenicity_scores.png/.svg - CADD, REVEL, SIFT score distributions
allele_frequency.png/.svg - Population frequency distribution
gene_burden.png/.svg - Top genes by variant burden

Reports:

summary_report.xlsx - Excel report with multiple sheets (summary statistics, top consequences, top genes, variants)

Clarification Questions

⚠️ CRITICAL: Always ask question #1 first to check if user has provided VCF files.

Before starting, gather:

Input Files (ASK THIS FIRST):
- Do you have specific VCF file(s) to annotate?
- If uploaded: Is this the VCF you'd like to annotate?
- Expected format: VCF/VCF.GZ (variant call format)
- Or use example data for testing?
- Use load_clinvar_pathogenic_sample() (~10 variants, known pathogenic)
Organism and Genome Build:
- What organism? (human, mouse, zebrafish, etc.)
- Which genome assembly? (GRCh38/hg38 recommended for human)
- ⚠️ Must match VCF reference - check VCF header: ##reference=
Primary Use Case (determines tool selection):
- Clinical/medical genetics → VEP (comprehensive clinical databases)
- Population genetics → Either tool works
- Non-model organism → SNPEff (38,000+ genomes)
- Quick analysis → SNPEff (faster setup)
- See references/tool_selection_guide.md for detailed comparison
Annotation Priorities (select all that apply):
- Clinical significance (ClinVar, OMIM)
- Pathogenicity predictions (SIFT, PolyPhen, CADD, REVEL)
- Population frequencies (gnomAD, 1000 Genomes)
- Regulatory impacts (ENCODE)
- Basic consequences only (faster annotation)
Computational Resources:
- High-performance (32+ GB RAM) → VEP with full cache
- Standard (16-32 GB) → VEP or SNPEff
- Limited (<16 GB) → SNPEff recommended
Output Requirements:
- Variant prioritization? → Apply filtering + ACMG classification
- Gene-level analysis? → Generate gene summaries
- Clinical reporting? → Excel export with multiple sheets
- Statistical analysis? → CSV format for downstream tools

Standard Workflow

🚨 MANDATORY: USE SCRIPTS EXACTLY AS SHOWN - DO NOT WRITE INLINE CODE 🚨

CRITICAL: Use relative paths (scripts/). DO NOT construct absolute paths like /mnt/knowhow/ or /workspace/.

This is a 4-step workflow. Scripts handle all complexity automatically.

Step 1 - Load and validate data:

# Load example data (or use your own VCF file)
from scripts.load_example_data import load_clinvar_pathogenic_sample
data = load_clinvar_pathogenic_sample()
input_vcf = data['vcf_path']

# Validate VCF format
from scripts.validate_vcf import validate_vcf, vcf_summary_stats
results = validate_vcf(input_vcf)
if not results['is_valid']:
    print(f"Errors: {results['errors']}")

stats = vcf_summary_stats(input_vcf)
print(f"Total: {stats['total_variants']}, SNVs: {stats['snvs']}, Indels: {stats['indels']}")

✅ VERIFICATION: You MUST see: "✓ Data loaded successfully"

DO NOT write inline validation code. Use the scripts exactly as shown.

Step 2 - Run annotation:

# Select and run annotation tool
from scripts.select_tool import select_annotation_tool
tool = select_annotation_tool(organism='human', use_case='clinical', resources='standard')

# Check if tool is installed, install if needed
if tool == 'vep':
    from scripts.run_vep import check_vep_installation
    is_installed, _, _ = check_vep_installation()
    if not is_installed:
        print("VEP not found. Installing via conda...")
        from scripts.install_tools import install_vep
        install_vep()
        # Note: After VEP install, you must install cache separately
        print("\n⚠️ IMPORTANT: Now install VEP cache:")
        print("  vep_install -a c -s homo_sapiens -y GRCh38")

elif tool == 'snpeff':
    from scripts.run_snpeff import check_snpeff_installation
    is_installed, _, _ = check_snpeff_installation()
    if not is_installed:
        print("SNPEff not found. Installing via conda...")
        from scripts.install_tools import install_snpeff
        install_snpeff()

# VEP annotation
if tool == 'vep':
    from scripts.run_vep import run_vep
    annotated_vcf = run_vep(input_vcf, "annotated.vep.vcf.gz", genome="GRCh38", everything=True, fork=4)
    from scripts.parse_vep_output import parse_vep_vcf
    df = parse_vep_vcf("annotated.vep.vcf.gz")

# SNPEff annotation
elif tool == 'snpeff':
    from scripts.run_snpeff import run_snpeff
    annotated_vcf = run_snpeff(input_vcf, "annotated.snpeff.vcf.gz", genome="GRCh38.105", canon=True, threads=4)
    from scripts.parse_snpeff_output import parse_snpeff_vcf
    df = parse_snpeff_vcf("annotated.snpeff.vcf.gz")

✅ VERIFICATION: You MUST see: "✓ Annotation completed successfully!"

DO NOT write inline annotation code. Use run_vep() or run_snpeff() as shown.

DO NOT fall back to VEP API mode

API mode is 10-100x slower and less full-featured. Always install the tool if missing.

Step 3 - Generate visualizations:

from scripts.filter_variants import filter_by_consequence, filter_by_frequency
from scripts.annotate_genes import annotate_gene_summary
from scripts.plot_variant_distribution import (
    plot_consequence_distribution,
    plot_impact_by_chromosome,
    plot_pathogenicity_scores,
    plot_allele_frequency,
    plot_gene_burden
)

# Filter variants for analysis
high_impact = filter_by_consequence(df, consequence_levels=['HIGH', 'MODERATE'])
rare = filter_by_frequency(df, max_af=0.01)

# Generate gene-level summaries
gene_df = annotate_gene_summary(df)

# Create visualizations (PNG + SVG)
output_dir = "results"
plot_consequence_distribution(df, f"{output_dir}/consequence_distribution.png")
plot_impact_by_chromosome(df, f"{output_dir}/impact_by_chromosome.png")
plot_pathogenicity_scores(df, f"{output_dir}/pathogenicity_scores.png")
plot_allele_frequency(df, f"{output_dir}/allele_frequency.png")
if gene_df is not None:
    plot_gene_burden(gene_df, output_file=f"{output_dir}/gene_burden.png")

✅ VERIFICATION: You MUST see: "Saved: [filename].webp" and "Saved: [filename].svg" for each plot

DO NOT write inline plotting code. Use the plotting functions as shown.

Step 4 - Export results:

from scripts.export_results import export_all

# Export all results in all formats
export_all(
    df=df,
    gene_df=gene_df,
    original_vcf=input_vcf,
    output_dir="results",
    tool_name=tool
)

✅ VERIFICATION: You MUST see: "=== Export Complete ==="

DO NOT write custom export code. Use export_all() as shown.

⚠️ CRITICAL - DO NOT:

❌ Write inline validation/annotation/plotting/export code → STOP: Use the scripts shown above
❌ Fall back to VEP API mode when tools are missing → STOP: Use install_tools.py to install via conda
❌ Use absolute paths like /mnt/knowhow/ or /workspace/ → use relative paths scripts/
❌ Skip validation step → catches format errors early
❌ Mix genome builds → VCF and annotation database must match (check VCF header)
❌ Write custom export code → STOP: Use export_all()

If tools are missing: Use install_tools.py to install via conda (~2-10 min). DO NOT use VEP API mode (10-100x slower, less full-featured).

If scripts fail: Install missing dependencies, re-run. Only modify scripts if genuinely broken.

Common Issues

Issue	Possible Cause	Solution	Details
VEP/SNPEff fails to start	Tool not installed or not in PATH	Use install_tools.py: `from scripts.install_tools import install_snpeff; install_snpeff()`	Takes 2-5 min. DO NOT fall back to VEP API mode (10-100x slower). Verify with `which vep` or `which snpEff`
"Cannot find genome"	Database not downloaded	Run `install_vep_cache()` or `download_snpeff_database()`	See installation_guide.md
All variants annotated as intergenic	Genome build mismatch (hg19 vs hg38)	Check VCF header `##reference=`, use matching genome build	Use `bcftools annotate --rename-chrs` to convert
No pathogenicity scores	Plugins/databases not installed	Install VEP plugins or SnpSift databases	See vep_best_practices.md
Memory error during annotation	Insufficient RAM or large VCF	Reduce buffer size (`--buffer_size 1000`) or split VCF by chromosome	Use `bcftools view -r chr1` to split
"Invalid VCF format"	VCF format errors (tabs, header, etc.)	Run validation step first, check tabs and header completeness	Use `bcftools view -h` to inspect header
Chromosome name mismatch	VCF uses "chr1" but database uses "1"	Convert with `bcftools annotate --rename-chrs`	Create mapping file: `chr1 1\nchr2 2\n...`
Very slow annotation	Not using cache/database (API mode)	Ensure VEP cache or SNPEff database installed locally	API mode is 10-100x slower

For complete troubleshooting: See references/troubleshooting_guide.md

QC checkpoints:

VCF validation: Ti/Tv ratio 2.0-2.1 (genome) or 2.8-3.0 (exome)
Annotation completeness: 100% variants annotated
Expected HIGH impact: 10-50 per exome (healthy individual)

For detailed QC thresholds: See references/qc_guidelines.md

Suggested Next Steps

After annotating variants:

Variant prioritization → Filter and rank by pathogenicity
- Use filtering: AF < 0.01, HIGH/MODERATE impact, CADD > 20
- Apply ACMG classification for clinical reporting
Gene-level analysis → Aggregate variants per gene
- Identify genes with high variant burden
- Prepare for pathway enrichment analysis
Clinical reporting → Export prioritized variants
- Excel format with multiple sheets (variants, genes, summary)
- Include ACMG classifications and ClinVar annotations
Statistical analysis → Burden testing, association studies
- Compare variant frequencies between cases and controls
- Gene-based collapsing analysis

Related analyses: Pathway enrichment from gene lists, protein structure mapping, literature mining

Related Skills

Upstream (run these first):

Variant calling workflows (GATK HaplotypeCaller, bcftools call, FreeBayes)
Variant quality score recalibration (VQSR)

Downstream (run these after):

Pathway enrichment from gene lists
Protein structure visualization and impact prediction
Clinical report generation

Complementary:

Copy number variant (CNV) annotation
Structural variant annotation (specialized tools)
Pharmacogenomics annotation (PharmGKB)

References

Primary Citations

Ensembl VEP:

McLaren W, et al. (2016) The Ensembl Variant Effect Predictor. Genome Biology 17:122. doi:10.1186/s13059-016-0974-4

SNPEff:

Cingolani P, et al. (2012) A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff. Fly 6(2):80-92. doi:10.4161/fly.19695

ACMG Guidelines:

Richards S, et al. (2015) Standards and guidelines for the interpretation of sequence variants. Genetics in Medicine 17(5):405-424. PubMed:25741868

Documentation

Tool Selection & Configuration:

Tool Selection Guide - VEP vs SNPEff decision matrix, resource requirements
VEP Best Practices - VEP configuration, plugins, optimization
SNPEff Best Practices - SNPEff parameters, databases, statistics

Analysis & Interpretation:

QC Guidelines - Quality control metrics, expected ranges, red flags
Consequence Terms - Sequence Ontology terms reference
Pathogenicity Interpretation - ACMG/AMP guidelines, classification rules
Filtering Strategies - Clinical vs research filtering approaches

Setup & Troubleshooting:

Installation Guide - Detailed VEP/SNPEff installation instructions
Troubleshooting Guide - Common issues, diagnosis steps, solutions

Code preview

scripts/annotate_genes.py

"""
Gene-Level Annotation Module

This module provides functions for generating gene-level summaries from variant data.
"""

import sys
from collections import defaultdict

try:
    import pandas as pd
    import numpy as np
except ImportError as e:
    print(f"Error: Missing required package: {e}")
    print("Install with: pip install pandas numpy")
    sys.exit(1)


def create_gene_summary(
    df,
    group_by='SYMBOL',
    count_variants=True,
    count_by_consequence=True,
    count_by_impact=True,
    include_most_severe=True,
    aggregate_scores=None,
    aggregation='max'
):
    """
    Create gene-level summary from variant data.

    Parameters
    ----------
    df : pd.DataFrame
        DataFrame with variant annotations
    group_by : str
        Column to group by (default: 'SYMBOL' for gene symbol)
    count_variants : bool
        Count total variants per gene
    count_by_consequence : bool
        Count variants by consequence type
    count_by_impact : bool
        Count variants by impact level
    include_most_severe : bool
        Include most severe variant per gene
    aggregate_scores : list, optional
        List of score columns to aggregate (e.g., ['CADD_PHRED', 'REVEL'])
    aggregation : str
        Aggregation method: 'max', 'mean', 'median' (default: 'max')

    Returns
    -------
    pd.DataFrame
        Gene-level summary DataFrame
    """
    if group_by not in df.columns:
        raise ValueError(f"Column '{group_by}' not found in DataFrame")

    # Remove missing gene names
    df_genes = df[df[group_by].notna()].copy()

    # Initialize summary dict
    summary_data = defaultdict(dict)

    # Group by gene
    for gene, gene_df in df_genes.groupby(group_by):
        gene_summary = {}

        # Total variant count
        if count_variants:
            gene_summary['N_Variants'] = len(gene_df)

        # Count by impact
        if count_by_impact and 'IMPACT' in gene_df.columns:
            for impact in ['HIGH', 'MODERATE', 'LOW', 'MODIFIER']:
                count = len(gene_df[gene_df['IMPACT'] == impact])
                gene_summary[f'N_{impact}_Impact'] = count

        # Count by consequence
        if count_by_consequence and 'Consequence' in gene_df.columns:

scripts/export_results.py

"""
Export Results Module

This module provides functions for exporting annotated variants in multiple formats.
"""

import sys
from pathlib import Path

try:
    import pandas as pd
    import xlsxwriter
    from openpyxl import load_workbook
    from openpyxl.styles import Font, PatternFill, Alignment
except ImportError as e:
    print(f"Error: Missing required package: {e}")
    print("Install with: pip install pandas xlsxwriter openpyxl")
    sys.exit(1)


def export_to_excel(output_file, sheets, format_columns=True, freeze_panes=(1, 2)):
    """
    Export data to Excel with multiple sheets and formatting.

    Parameters
    ----------
    output_file : str
        Output Excel file path
    sheets : dict
        Dictionary mapping sheet names to DataFrames
    format_columns : bool
        Apply conditional formatting (default: True)
    freeze_panes : tuple
        Row and column to freeze (default: (1, 2) for header + 2 columns)
    """
    # Create Excel writer
    with pd.ExcelWriter(output_file, engine='xlsxwriter') as writer:
        workbook = writer.book

        # Define formats
        header_format = workbook.add_format({
            'bold': True,
            'bg_color': '#D9E1F2',
            'border': 1,
            'align': 'center',
            'valign': 'vcenter'
        })

        high_impact_format = workbook.add_format({'bg_color': '#FFC7CE', 'font_color': '#9C0006'})
        moderate_impact_format = workbook.add_format({'bg_color': '#FFEB9C', 'font_color': '#9C6500'})
        pathogenic_format = workbook.add_format({'bg_color': '#FFC7CE', 'font_color': '#9C0006'})

        # Write each sheet
        for sheet_name, df in sheets.items():
            df.to_excel(writer, sheet_name=sheet_name, index=False, startrow=0)
            worksheet = writer.sheets[sheet_name]

            # Format header
            for col_num, column_name in enumerate(df.columns):
                worksheet.write(0, col_num, column_name, header_format)

            # Auto-fit columns
            for col_num, column_name in enumerate(df.columns):
                column_data = df[column_name].astype(str)
                max_length = max(
                    column_data.apply(len).max(),
                    len(column_name)
                ) + 2
                worksheet.set_column(col_num, col_num, min(max_length, 50))

            # Freeze panes
            if freeze_panes:
                worksheet.freeze_panes(freeze_panes[0], freeze_panes[1])

            # Conditional formatting
            if format_columns and 'IMPACT' in df.columns:
                impact_col = df.columns.get_loc('IMPACT')
                worksheet.conditional_format(
                    1, impact_col, len(df), impact_col,
                    {

scripts/filter_variants.py

"""
Variant Filtering Module

This module provides functions for filtering annotated variants by consequence,
frequency, quality, and pathogenicity.
"""

import sys

try:
    import pandas as pd
    import numpy as np
except ImportError as e:
    print(f"Error: Missing required package: {e}")
    print("Install with: pip install pandas numpy")
    sys.exit(1)


# Consequence severity definitions (Sequence Ontology)
HIGH_IMPACT_CONSEQUENCES = {
    'transcript_ablation',
    'splice_acceptor_variant',
    'splice_donor_variant',
    'stop_gained',
    'frameshift_variant',
    'stop_lost',
    'start_lost',
    'transcript_amplification'
}

MODERATE_IMPACT_CONSEQUENCES = {
    'inframe_insertion',
    'inframe_deletion',
    'missense_variant',
    'protein_altering_variant'
}

LOW_IMPACT_CONSEQUENCES = {
    'splice_region_variant',
    'incomplete_terminal_codon_variant',
    'start_retained_variant',
    'stop_retained_variant',
    'synonymous_variant'
}


def filter_by_consequence(df, impact=None, consequence_types=None):
    """
    Filter variants by consequence severity or specific consequence types.

    Parameters
    ----------
    df : pd.DataFrame
        DataFrame with variant annotations
    impact : list, optional
        List of impact levels to keep: ['HIGH', 'MODERATE', 'LOW', 'MODIFIER']
    consequence_types : list, optional
        List of specific consequence types to keep

    Returns
    -------
    pd.DataFrame
        Filtered DataFrame
    """
    filtered = df.copy()

    # Filter by IMPACT
    if impact:
        if 'IMPACT' not in df.columns:
            print("Warning: IMPACT column not found")
            return filtered

        filtered = filtered[filtered['IMPACT'].isin(impact)]

    # Filter by specific consequence types
    if consequence_types:
        if 'Consequence' not in df.columns:
            print("Warning: Consequence column not found")
            return filtered

Companion files

Type	Path	Bytes
Markdown	references/auto_installation_implementation.md	7,458
Markdown	references/consequence_terms.md	9,672
Markdown	references/filtering_strategies.md	19,433
Markdown	references/installation_guide.md	10,441
Markdown	references/pathogenicity_interpretation.md	11,852
Markdown	references/qc_guidelines.md	15,425
Markdown	references/snpeff_best_practices.md	10,422
Markdown	references/tool_selection_guide.md	12,226
Markdown	references/troubleshooting_guide.md	11,097
Markdown	references/vep_best_practices.md	9,351
Python	scripts/annotate_genes.py	8,175
Python	scripts/export_results.py	15,166
Python	scripts/filter_variants.py	10,601
Python	scripts/install_tools.py	6,309
Python	scripts/load_example_data.py	5,290
Python	scripts/parse_snpeff_output.py	7,642
Python	scripts/parse_vep_output.py	7,013
Python	scripts/plot_variant_distribution.py	11,139
Python	scripts/prioritize_variants.py	9,209
Python	scripts/run_snpeff.py	10,493
Python	scripts/run_vep.py	9,409
Python	scripts/select_tool.py	6,115
Python	scripts/test_complete_workflow.py	10,985
Python	scripts/test_pickle_load.py	3,182
Python	scripts/validate_vcf.py	10,187
Python	scripts/verify_changes.py	6,981
Markdown	SKILL.md	17,189
JSON	skill.meta.json	5,041