sgRNA Design

Find or design guides — validated-first, then de novo.

Overview

Problem. You need efficient, low-off-target guides to knock out, activate, or inhibit a gene.

Use when: KO / CRISPRa / CRISPRi for a gene
Avoid when: off-target risk would go unassessed

Learning goals

Figures

sgRNA Three-Tiered Approach
Tier Decision Flow
Cas Enzymes & PAM
Target Location by Goal
Design Rules & Caveats

Tutorial

Find or design sgRNAs by prioritizing validated sequences over computational predictions. Always start at Option 1 and descend to the next option only when the current one yields nothing usable (the script table below refers to Options 1–3 as Tiers 1–3). Ported from the Biomni sgRNA_design_guide.md (snap-stanford/Biomni), with the data parsing corrected and the literature step wired to this environment's search tools.

When to use

Inputs

Outputs

Save both deliverables (<GENE>_selected_sgRNAs.csv and <GENE>_sgRNA_summary.md) to the user's results directory.

Bundled resources (work offline; refresh via references/refresh_resources.md)

Scripts

| Script | Purpose |
| --- | --- |
| scripts/search_addgene.py | Tier 1 / Method 1: search the Addgene database (handles the HTML-wrapped IDs and messy species values). |
| scripts/find_crispick_dataset.py | Tier 2 / Step 1: resolve the correct CRISPick download URL for organism+enzyme+application. |
| scripts/select_crispick_sgrnas.py | Tier 2 / Step 3: filter a downloaded CRISPick file to your gene, rank, pick 3–4. |
| scripts/check_design_rules.py | Tier 3: sanity-check a candidate guide (length, GC, TTTT, PAM). |
| scripts/export_results.py | Write the unified CSV + markdown summary. |

Option 1 — Validated sequences (ALWAYS try first)

You MUST complete both Method 1 and Method 2 before considering Option 2. Do not skip Method 2 even if Method 1 finds nothing — many validated guides live only in the literature.

Method 1 — Bundled Addgene database

import sys; sys.path.insert(0, "scripts")
from search_addgene import search_addgene

hits = search_addgene("TP53", species="human", application="knockout")
print(len(hits))   # 0 for TP53 -> still do Method 2 before Option 2

Each hit carries a clean Target Sequence, pubmed_id/pubmed_url, plasmid_id/plasmid_url, and Depositor. Cite the PubMed ID of the original publication in your methods.
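The fields above can be turned into citation links with very little code. The following is a minimal sketch, not the logic in scripts/search_addgene.py: it assumes (hypothetically) that a raw ID cell looks like an HTML anchor such as `<a href="...">12345</a>`, and the helper names `clean_id`, `pubmed_url`, and `addgene_url` are illustrative only. The URL patterns are the usual public PubMed and Addgene record addresses.

```python
import re


def clean_id(raw) -> str:
    """Strip HTML tags and return the first run of digits, or '' if none.

    Assumption: the raw cell wraps the ID in an HTML anchor tag.
    """
    text = re.sub(r"<[^>]+>", "", str(raw))  # drop anything that looks like a tag
    match = re.search(r"\d+", text)
    return match.group(0) if match else ""


def pubmed_url(pmid: str) -> str:
    """Build a PubMed record URL from a cleaned PMID."""
    return f"https://pubmed.ncbi.nlm.nih.gov/{pmid}/" if pmid else ""


def addgene_url(plasmid_id: str) -> str:
    """Build an Addgene plasmid URL from a cleaned plasmid ID."""
    return f"https://www.addgene.org/{plasmid_id}/" if plasmid_id else ""


# Hypothetical raw cell, for illustration only (not taken from the bundled CSV).
raw_cell = '<a href="https://www.addgene.org/12345/">12345</a>'
print(addgene_url(clean_id(raw_cell)))
```

Returning an empty string for a missing ID keeps downstream citation strings free of "PMID None"-style artifacts.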

application accepts intent words and maps them to Addgene's vocabulary: knockout→cut, activation→activate, inhibition/CRISPRi→interfere/RNA targeting.
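The intent-to-vocabulary mapping described above can be pictured as a small lookup table. This is a sketch under the assumption that unknown words are passed through unchanged; the names `APPLICATION_SYNONYMS` and `map_application` are hypothetical and the real mapping in scripts/search_addgene.py may cover more terms.

```python
from __future__ import annotations

# Intent word -> Addgene application terms, as listed in the text above.
APPLICATION_SYNONYMS = {
    "knockout": ["cut"],
    "activation": ["activate"],
    "inhibition": ["interfere", "RNA targeting"],
    "crispri": ["interfere", "RNA targeting"],
}


def map_application(intent: str) -> list[str]:
    """Translate a user intent word into Addgene application terms.

    Assumption: words that are not in the table (for example a term that is
    already in Addgene's vocabulary) are returned unchanged, lowercased.
    """
    key = intent.strip().lower()
    return APPLICATION_SYNONYMS.get(key, [key])


print(map_application("knockout"))  # ['cut']
print(map_application("CRISPRi"))   # ['interfere', 'RNA targeting']
```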

Method 2 — Literature & web search (REQUIRED)

This environment does not expose advanced_web_search_claude; use the available tools instead. Run both for coverage:
- LiteratureSearch: peer-reviewed papers (validated guides, supplements).
- WebSearch: vendor/database hits (GenScript, Horizon, lab protocols).

Query templates (substitute the gene):

"sgRNA" OR "guide RNA" "<GENE>" validated experimental
"CRISPR knockout" "<GENE>" sgRNA sequence validated
"<GENE>" sgRNA "cutting efficiency" OR "on-target"

Scan at least 10–15 results and check supplementary materials. For each validated guide, record the sequence, the citation (PMID/DOI), and the validation details (cell line, cutting efficiency).
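Filling in the query templates and keeping a consistent record per guide is easy to script. The sketch below only builds the query strings and defines a record type; it does not call LiteratureSearch or WebSearch, whose interfaces are not described here. `build_queries` and `ValidatedGuide` are hypothetical names.

```python
from dataclasses import dataclass

# The three query templates from the text, with the gene as a placeholder.
QUERY_TEMPLATES = [
    '"sgRNA" OR "guide RNA" "{gene}" validated experimental',
    '"CRISPR knockout" "{gene}" sgRNA sequence validated',
    '"{gene}" sgRNA "cutting efficiency" OR "on-target"',
]


def build_queries(gene: str) -> list:
    """Substitute the gene symbol into every template."""
    return [template.format(gene=gene) for template in QUERY_TEMPLATES]


@dataclass
class ValidatedGuide:
    """One literature-derived guide with the fields the text asks you to record."""
    sequence: str
    citation: str                 # PMID or DOI
    cell_line: str = ""           # validation detail
    cutting_efficiency: str = ""  # validation detail, as reported by the source


for query in build_queries("TP53"):
    print(query)
```

Keeping the validation details as free text avoids forcing heterogeneous reports (percent indels, fold knockdown, and so on) into one numeric field.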

Decision: if either method yields usable validated guides, select 3–4 and export. Only if both come up empty, go to Option 2.
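The decision rule can be written down as a small function, which makes the "both methods first" requirement hard to skip by accident. This is an illustrative sketch; `choose_tier` and its parameters are hypothetical and not part of the bundled scripts.

```python
def choose_tier(addgene_hits: int, literature_hits: int,
                method2_done: bool, gene_in_crispick: bool) -> str:
    """Return which option to use, following the tiered rules in this guide.

    - Method 2 must have been run, even when Method 1 found nothing.
    - Any usable validated guide keeps you in Option 1.
    - Otherwise use CRISPick (Option 2) if the gene is covered, else Option 3.
    """
    if not method2_done:
        raise ValueError("Complete Method 2 (literature/web search) before deciding")
    if addgene_hits > 0 or literature_hits > 0:
        return "Option 1"
    return "Option 2" if gene_in_crispick else "Option 3"


print(choose_tier(addgene_hits=0, literature_hits=2,
                  method2_done=True, gene_in_crispick=True))  # Option 1
```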


Option 2 — CRISPick precomputed designs

Use when no validated guides exist, you need genome-wide coverage, or you want ranked options.

Step 1 — Resolve the dataset URL

from find_crispick_dataset import find_crispick_dataset
info = find_crispick_dataset("human", cas="SpCas9", application="knockout")
print(info["matches"])   # GRCh38 + GRCh37 dataset URLs
print(info["warning"])   # Cas12a variant warning, if applicable

AsCas12a and enAsCas12a are different enzymes, and guides designed for one may not work with the other. The finder matches the exact enzyme token, so a query for one never returns datasets for the other.
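Exact token matching matters here because "AsCas12a" is a substring of "enAsCas12a", so a naive substring test would return both datasets. The sketch below illustrates the difference; the file names are made up for the example and `matches_enzyme` is a hypothetical helper, not the code in scripts/find_crispick_dataset.py.

```python
import re


def matches_enzyme(dataset_name: str, enzyme: str) -> bool:
    """True only if `enzyme` appears as a whole token in the dataset name.

    Tokens are the runs of letters/digits between separators such as '_' or '.'.
    """
    tokens = re.split(r"[^A-Za-z0-9]+", dataset_name)
    return enzyme in tokens


# Hypothetical dataset names, for illustration only.
names = [
    "example_GRCh38_AsCas12a_CRISPRko.txt.gz",
    "example_GRCh38_enAsCas12a_CRISPRko.txt.gz",
]

substring_hits = [n for n in names if "AsCas12a" in n]             # matches both
token_hits = [n for n in names if matches_enzyme(n, "AsCas12a")]   # matches one
print(len(substring_hits), len(token_hits))  # 2 1
```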

Step 2 — Download & extract (files are 50–700 MB; not bundled)

wget '<URL from Step 1>'
gunzip sgRNA_design_*.txt.gz
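Where wget and gunzip are unavailable, the same two steps can be done with the Python standard library. This is a sketch with hypothetical helper names (`download`, `gunzip`); unlike the gunzip command, it leaves the .gz file in place, and it streams the data so that files of several hundred MB are not held in memory.

```python
import gzip
import shutil
import urllib.request
from pathlib import Path


def download(url: str, dest_dir: str = ".") -> Path:
    """Download `url` into `dest_dir`, skipping the transfer if the file exists."""
    dest = Path(dest_dir) / url.rsplit("/", 1)[-1]
    if not dest.exists():
        urllib.request.urlretrieve(url, dest)
    return dest


def gunzip(gz_path) -> Path:
    """Decompress `name.txt.gz` to `name.txt` next to it and return the new path."""
    gz_path = Path(gz_path)
    out_path = gz_path.with_suffix("")  # drops the trailing ".gz"
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)    # streamed copy, chunk by chunk
    return out_path


# Usage, with the URL resolved in Step 1:
# txt_path = gunzip(download("<URL from Step 1>"))
```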

Step 3 — Filter, rank, select

from select_crispick_sgrnas import select_crispick_sgrnas
picks = select_crispick_sgrnas("sgRNA_design_..._CRISPRko_....txt", "TP53", n=4)

By default, guides are ranked by Combined Rank (lower = better); pass rank_by="on_target" or rank_by="off_target" to prioritize efficiency or specificity. Picks are spread across distinct exons for redundancy. Optional filters: exon=, cut_position_range=, max_target_cut_pct= (knockout). Column names are resolved defensively (both the real CRISPick layout and abbreviated layouts are handled); see references/crispick_file_format.md. If the gene is absent, go to Option 3.
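One way to combine "lowest rank first" with "spread across exons" is a two-pass selection: take the best guide from each exon, then fill any remaining slots with the next-best guides overall. The sketch below works on plain dictionaries and is an assumption about how such a selection could be done, not a copy of select_crispick_sgrnas; `pick_spread` is a hypothetical name and the rows are placeholders, not real CRISPick output.

```python
def pick_spread(rows, n=4, rank_key="Combined Rank", exon_key="Exon Number"):
    """Pick up to n guides, preferring one per exon, ordered by rank (lower = better)."""
    ranked = sorted(rows, key=lambda r: r[rank_key])
    firsts, leftovers, seen_exons = [], [], set()
    for row in ranked:
        if row[exon_key] in seen_exons:
            leftovers.append(row)       # exon already represented
        else:
            seen_exons.add(row[exon_key])
            firsts.append(row)          # best guide for a new exon
    picks = firsts[:n]
    picks += leftovers[: n - len(picks)]  # fill remaining slots by rank
    return sorted(picks, key=lambda r: r[rank_key])


# Placeholder rows for illustration only.
rows = [
    {"id": "g1", "Combined Rank": 1, "Exon Number": 2},
    {"id": "g2", "Combined Rank": 2, "Exon Number": 2},
    {"id": "g3", "Combined Rank": 3, "Exon Number": 3},
    {"id": "g4", "Combined Rank": 4, "Exon Number": 4},
    {"id": "g5", "Combined Rank": 5, "Exon Number": 2},
]
print([r["id"] for r in pick_spread(rows, n=3)])  # ['g1', 'g3', 'g4']
```

With n=3 the second-ranked guide is skipped because it shares exon 2 with the top guide; with n=4 it is brought back to fill the last slot.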


Option 3 — De-novo design (last resort)

Use for genes or organisms not covered above. Follow references/design_rules.md:
- Length: 20 bp (SpCas9/SaCas9; check_design_rules.py also accepts 21 bp for SaCas9), 23–25 bp (Cas12a).
- PAM: SpCas9 NGG and SaCas9 NNGRRT (3' of the protospacer), Cas12a TTTV (5').
- GC 40–60%; avoid TTTT and homopolymer runs >4 nt.
- KO → early exons (first ~50%); CRISPRa → −200 to +1 of TSS; CRISPRi → −50 to +300 of TSS.
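The target-location rules above reduce to simple range checks. The sketch below assumes positions are given as offsets relative to the TSS (negative = upstream) and, for knockouts, that the "first ~50%" refers to the cut position as a percentage of the coding sequence; both are assumptions, and the names `TSS_WINDOWS`, `in_tss_window`, and `in_early_cds` are hypothetical.

```python
# Windows relative to the TSS, taken from the rules above (inclusive bounds).
TSS_WINDOWS = {
    "CRISPRa": (-200, 1),
    "CRISPRi": (-50, 300),
}


def in_tss_window(offset_from_tss: int, application: str) -> bool:
    """True if the guide position falls inside the window for CRISPRa/CRISPRi."""
    lo, hi = TSS_WINDOWS[application]
    return lo <= offset_from_tss <= hi


def in_early_cds(cut_pct: float, max_pct: float = 50.0) -> bool:
    """True if a knockout cut site lies in the first ~50% of the coding sequence.

    Assumption: cut_pct is the cut position as a percentage of CDS length.
    """
    return 0.0 <= cut_pct <= max_pct


print(in_tss_window(-100, "CRISPRa"))  # True
print(in_tss_window(-100, "CRISPRi"))  # False
print(in_early_cds(35.0))              # True
```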

from check_design_rules import check_design_rules, format_report
print(format_report(check_design_rules("GAGGTTGTGAGGCGCTGCCC", "SpCas9", pam="AGG")))

This checks rules only — for real off-target assessment use Cas-OFFinder/CRISPOR or CRISPick ranks.
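As a worked example of the GC rule, the guide used in the call above can be checked by hand. The snippet below is a standalone recomputation using the same formula as `gc_content` in scripts/check_design_rules.py: the sequence has 9 G and 5 C out of 20 bases, so it falls outside the 40–60% target and the GC check would be expected to flag it, while the TTTT check is clean.

```python
guide = "GAGGTTGTGAGGCGCTGCCC"  # the example protospacer from the call above

g_count = guide.count("G")
c_count = guide.count("C")
gc_percent = 100.0 * (g_count + c_count) / len(guide)

print(len(guide), g_count, c_count)   # 20 9 5
print(f"GC = {gc_percent:.0f}%")      # GC = 70%
print(40.0 <= gc_percent <= 60.0)     # False: outside the 40-60% target
print("TTTT" in guide)                # False: no TTTT terminator
```

A guide that fails one rule is not automatically unusable, but it is a reason to prefer alternatives when other candidates exist.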


Export (all tiers)

from export_results import from_addgene, from_crispick, export
unified = from_addgene(hits, application="knockout", enzyme="SpCas9")  # or from_crispick(...)
export(unified, gene="TP53", tier="Option 1 (validated Addgene)",
       outdir="/path/to/results", rationale="Validated guides found via Method 1.")

Universal best practice

Test 3–4 sgRNAs per gene experimentally regardless of predicted scores, and validate edits (Sanger sequencing; TIDE/T7E1 for indels). Prediction scores guide selection but do not replace empirical validation.

Citations & acknowledgments (preserve in user methods)

Scientific caveats

Code preview

scripts/__init__.py

"""sgRNA design skill: helper scripts for the three-tiered workflow."""

scripts/check_design_rules.py

"""
Option 3 helper: sanity-check a candidate sgRNA against the de-novo design rules.

This is a lightweight rule checker (length, GC 40-60%, TTTT terminator, >4 homopolymer runs,
and PAM verification for a given enzyme). It is NOT a genome-wide off-target search — use
Cas-OFFinder / CRISPOR or CRISPick's precomputed off-target ranks for real specificity.
"""

from __future__ import annotations

import re

# Enzyme -> (expected protospacer length range, PAM regex, PAM side).
# PAM regex uses IUPAC: N=ACGT, V=ACG, R=AG.
_IUPAC = {"N": "[ACGT]", "V": "[ACG]", "R": "[AG]", "Y": "[CT]", "W": "[AT]",
          "S": "[GC]", "K": "[GT]", "M": "[AC]"}

ENZYMES = {
    "SpCas9":     {"len": (20, 20), "pam": "NGG",    "side": "3'"},
    "SaCas9":     {"len": (20, 21), "pam": "NNGRRT", "side": "3'"},
    "AsCas12a":   {"len": (23, 25), "pam": "TTTV",   "side": "5'"},
    "enAsCas12a": {"len": (23, 25), "pam": "TTTV",   "side": "5'"},
}


def _pam_to_regex(pam: str) -> str:
    return "".join(_IUPAC.get(b, b) for b in pam.upper())


def gc_content(seq: str) -> float:
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def check_design_rules(protospacer: str, enzyme: str = "SpCas9", pam: str | None = None) -> dict:
    """
    Check one candidate protospacer against the de-novo rules.

    Parameters
    ----------
    protospacer : str
        The guide/protospacer sequence (without the PAM), 5'->3'.
    enzyme : str
        One of SpCas9, SaCas9, AsCas12a, enAsCas12a.
    pam : str, optional
        The observed PAM in the genome flanking the protospacer. If given, it is checked
        against the enzyme's PAM pattern.

    Returns
    -------
    dict: {'passes': bool, 'checks': {name: (ok, detail)}, 'enzyme': ..., 'gc': float}
    """
    seq = protospacer.upper().strip()
    spec = ENZYMES.get(enzyme)
    checks: dict[str, tuple[bool, str]] = {}

    if spec is None:
        return {"passes": False, "enzyme": enzyme, "gc": None,
                "checks": {"enzyme_known": (False, f"Unknown enzyme '{enzyme}'. "
                                            f"Known: {list(ENZYMES)}")}}

    lo, hi = spec["len"]
    checks["length"] = (lo <= len(seq) <= hi,
                        f"{len(seq)} bp (expected {lo}-{hi} for {enzyme})")

    gc = gc_content(seq)
    checks["gc_content"] = (40.0 <= gc <= 60.0, f"{gc:.0f}% (target 40-60%)")

    checks["no_TTTT"] = ("TTTT" not in seq, "TTTT terminator present" if "TTTT" in seq
                         else "no TTTT run")

    homo = re.search(r"(A{5,}|C{5,}|G{5,}|T{5,})", seq)
    checks["no_long_homopolymer"] = (homo is None,
                                     f"{homo.group(0)} run" if homo else "no run >4 nt")

    checks["valid_bases"] = (re.fullmatch(r"[ACGT]+", seq) is not None,
                             "non-ACGT characters present" if not re.fullmatch(r"[ACGT]+", seq)
                             else "all ACGT")
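The preview above stops before the PAM check and the return value. The following standalone sketch shows how those last steps could look, based only on the docstring (an observed PAM matched against the enzyme's IUPAC pattern, and an overall `passes` flag); it is an assumption, not the actual remainder of the file, and `pam_matches` and `overall_pass` are hypothetical names.

```python
import re

# IUPAC codes needed for the PAMs in this guide (NGG, NNGRRT, TTTV).
IUPAC = {"N": "[ACGT]", "V": "[ACG]", "R": "[AG]"}


def pam_matches(observed: str, pattern: str) -> bool:
    """True if the observed PAM fits the IUPAC pattern exactly (same length)."""
    regex = "".join(IUPAC.get(base, base) for base in pattern.upper())
    return re.fullmatch(regex, observed.upper()) is not None


def overall_pass(checks: dict) -> bool:
    """A candidate passes only if every individual (ok, detail) check is ok."""
    return all(ok for ok, _detail in checks.values())


checks = {
    "length": (True, "20 bp (expected 20-20 for SpCas9)"),
    "pam": (pam_matches("AGG", "NGG"), "observed AGG vs pattern NGG"),
}
print(overall_pass(checks))          # True
print(pam_matches("TTTT", "TTTV"))   # False: V excludes T
```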

scripts/export_results.py

"""
Export selected sgRNAs to a unified CSV + a short markdown summary.

Produces the two deliverables the skill returns for a gene:
  <GENE>_selected_sgRNAs.csv   - unified table across whichever tier(s) were used
  <GENE>_sgRNA_summary.md      - which tier was used, why, the picks, and caveats

The unified schema lets validated (Addgene), CRISPick, and de-novo guides sit in one table.
"""

from __future__ import annotations

import os
import pandas as pd

UNIFIED_COLUMNS = [
    "gene", "sgRNA_sequence", "source", "application", "enzyme",
    "rank_or_score", "exon_or_position", "pam", "citation_or_dataset", "notes",
]

# source values: "validated_addgene" | "crispick" | "de_novo"


def from_addgene(df: pd.DataFrame, application: str = "", enzyme: str = "") -> pd.DataFrame:
    """Map a search_addgene() result into the unified schema."""
    rows = []
    for _, r in df.iterrows():
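        # Join only the identifiers that are present, so a missing PMID
        # does not leave a leading "; " in the citation.
        cite_parts = []
        if r.get("pubmed_id"):
            cite_parts.append(f"PMID {r['pubmed_id']}")
        if r.get("plasmid_id"):
            cite_parts.append(f"Addgene #{r['plasmid_id']}")
        cite = "; ".join(cite_parts)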
        rows.append({
            "gene": r.get("Target Gene", ""),
            "sgRNA_sequence": r.get("Target Sequence", ""),
            "source": "validated_addgene",
            "application": application or r.get("Application", ""),
            "enzyme": enzyme or r.get("Cas9 Species", ""),
            "rank_or_score": "validated",
            "exon_or_position": "",
            "pam": "",
            "citation_or_dataset": cite,
            "notes": f"Depositor: {r.get('Depositor','')}; species: {r.get('Target Species','')}",
        })
    return pd.DataFrame(rows, columns=UNIFIED_COLUMNS)


def from_crispick(df: pd.DataFrame, gene: str, dataset_url: str = "",
                  application: str = "", enzyme: str = "") -> pd.DataFrame:
    """Map a select_crispick_sgrnas() result into the unified schema."""
    def col(*names):
        for n in names:
            if n in df.columns:
                return n
        return None

    seq_c = col("sgRNA Sequence", "sgRNA_sequence")
    exon_c = col("Exon Number", "Exon_ID")
    pos_c = col("sgRNA Cut Position (1-based)", "sgRNA 'Cut' Position")
    pam_c = col("PAM Sequence", "PAM")
    ds = dataset_url.rsplit("/", 1)[-1] if dataset_url else "CRISPick"

    rows = []
    for _, r in df.iterrows():
        exon_pos = ""
        if exon_c and pd.notna(r.get(exon_c)):
            exon_pos = f"exon {r[exon_c]}"
        if pos_c and pd.notna(r.get(pos_c)):
            exon_pos = (exon_pos + f" @ {int(r[pos_c])}").strip()
        rows.append({
            "gene": gene,
            "sgRNA_sequence": r.get(seq_c, "") if seq_c else "",
            "source": "crispick",
            "application": application,
            "enzyme": enzyme,
            "rank_or_score": r.get("rank_value", ""),
            "exon_or_position": exon_pos,
            "pam": r.get(pam_c, "") if pam_c else "",
            "citation_or_dataset": ds,
            "notes": "CRISPick precomputed; rank lower=better (or score higher=better)",
        })
    return pd.DataFrame(rows, columns=UNIFIED_COLUMNS)
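The preview ends before the `export` function. As a rough illustration of the two deliverables named in the module docstring, the sketch below writes a CSV and a short markdown summary using only the standard library. It is an assumption about the behaviour, not the real implementation (which uses pandas); `export_sketch` is a hypothetical name and the summary layout is invented for the example.

```python
import csv
from pathlib import Path

UNIFIED_COLUMNS = [
    "gene", "sgRNA_sequence", "source", "application", "enzyme",
    "rank_or_score", "exon_or_position", "pam", "citation_or_dataset", "notes",
]


def export_sketch(rows, gene, tier, outdir, rationale=""):
    """Write <GENE>_selected_sgRNAs.csv and <GENE>_sgRNA_summary.md to outdir."""
    out = Path(outdir)
    out.mkdir(parents=True, exist_ok=True)
    csv_path = out / f"{gene}_selected_sgRNAs.csv"
    md_path = out / f"{gene}_sgRNA_summary.md"

    # Missing columns are written as empty strings; unknown keys are ignored.
    with open(csv_path, "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=UNIFIED_COLUMNS,
                                restval="", extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)

    lines = [f"# sgRNA summary for {gene}", "",
             f"Tier used: {tier}", f"Rationale: {rationale}", "",
             "Selected guides:"]
    lines += [f"- {r.get('sgRNA_sequence', '')} ({r.get('source', '')})" for r in rows]
    md_path.write_text("\n".join(lines) + "\n")
    return csv_path, md_path
```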

Companion files

| Type | Path | Bytes |
| --- | --- | --- |
| Markdown | references/crispick_file_format.md | 4,037 |
| Markdown | references/design_rules.md | 2,167 |
| Markdown | references/refresh_resources.md | 1,399 |
| CSV | references/resource/addgene_grna_sequences.csv | 54,648 |
| Text | references/resource/CRISPick_download_links.txt | 38,053 |
| Python | scripts/__init__.py | 72 |
| Python | scripts/check_design_rules.py | 3,946 |
| Python | scripts/export_results.py | 6,212 |
| Python | scripts/find_crispick_dataset.py | 6,754 |
| Python | scripts/search_addgene.py | 7,031 |
| Python | scripts/select_crispick_sgrnas.py | 7,165 |
| Markdown | SKILL.md | 7,586 |
| JSON | skill.meta.json | 2,457 |