sgRNA Design

Find or design guides — validated-first, then de novo.

Overview

Problem. You need efficient guides with low off-target risk to knock out, activate, or inhibit a gene.

Use when: KO / CRISPRa / CRISPRi for a gene

Avoid when: you do not plan to assess off-target risk

Learning goals

Reuse validated guides before designing de novo
Balance on-target activity and off-target risk

Tutorial

Find or design sgRNAs by prioritizing validated sequences before computational predictions. Always start at Option 1 and only descend to the next tier when the current one yields nothing usable. Ported from the Biomni sgRNA_design_guide.md (snap-stanford/Biomni), with the data parsing corrected and the literature step wired to this environment's search tools.

When to use

"Give me an sgRNA to knock out TP53 in human cells"
"Design CRISPR guides to activate OCT4" / "CRISPRi guides for MYC"
"What guide RNA should I use with SpCas9 / SaCas9 / Cas12a?"
Selecting guides for an arrayed or pooled CRISPR screen

Inputs

Gene symbol (required), e.g. TP53, BRCA1, AAVS1.
Organism (default human), e.g. human/mouse/rat or NCBI TAXID.
Application (default knockout): knockout / activation (CRISPRa) / inhibition (CRISPRi).
Cas enzyme (default SpCas9): SpCas9, SaCas9, AsCas12a, enAsCas12a.
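One way to hold these four inputs and their defaults is a small dataclass. This is only a sketch: the `DesignRequest` class and its validation are hypothetical and not part of the bundled scripts. Only the allowed values and defaults are taken from the list above.

```python
from dataclasses import dataclass

# Allowed values copied from the Inputs list; the class itself is hypothetical.
APPLICATIONS = ("knockout", "activation", "inhibition")
ENZYMES = ("SpCas9", "SaCas9", "AsCas12a", "enAsCas12a")


@dataclass(frozen=True)
class DesignRequest:
    gene: str                       # required, e.g. "TP53"
    organism: str = "human"         # common name or NCBI TAXID
    application: str = "knockout"
    enzyme: str = "SpCas9"

    def __post_init__(self):
        # Reject empty gene symbols and values outside the documented choices.
        if not self.gene.strip():
            raise ValueError("gene symbol is required")
        if self.application not in APPLICATIONS:
            raise ValueError(f"application must be one of {APPLICATIONS}")
        if self.enzyme not in ENZYMES:
            raise ValueError(f"enzyme must be one of {ENZYMES}")


req = DesignRequest("TP53")  # organism, application and enzyme fall back to defaults
print(req)
```

Validating at construction time means a typo such as an unsupported enzyme name fails before any dataset is downloaded.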

Outputs

<GENE>_selected_sgRNAs.csv — unified table of 3–4 recommended guides (sequence, source, rank/score, exon/position, PAM, citation/dataset, notes).
<GENE>_sgRNA_summary.md — which tier was used and why, the picks, and caveats.

Save both to the user's results directory.

Bundled resources (work offline; refresh via `references/refresh_resources.md`)

references/resource/addgene_grna_sequences.csv — 321 validated sgRNAs, 197 genes (Addgene).
references/resource/CRISPick_download_links.txt — 238 CRISPick dataset URLs, 13 organisms.

Scripts

Script	Purpose
`scripts/search_addgene.py`	Tier 1 / Method 1 — search the Addgene database (handles the HTML-wrapped IDs and messy species values).
`scripts/find_crispick_dataset.py`	Tier 2 / Step 1 — resolve the correct CRISPick download URL for organism+enzyme+application.
`scripts/select_crispick_sgrnas.py`	Tier 2 / Step 3 — filter a downloaded CRISPick file to your gene, rank, pick 3–4.
`scripts/check_design_rules.py`	Tier 3 — sanity-check a candidate guide (length, GC, TTTT, PAM).
`scripts/export_results.py`	Write the unified CSV + markdown summary.

Option 1 — Validated sequences (ALWAYS try first)

You MUST complete both Method 1 and Method 2 before considering Option 2. Do not skip Method 2 even if Method 1 finds nothing — many validated guides live only in the literature.

Method 1 — Bundled Addgene database

import sys; sys.path.insert(0, "scripts")
from search_addgene import search_addgene

hits = search_addgene("TP53", species="human", application="knockout")
print(len(hits))   # 0 for TP53 -> still do Method 2 before Option 2

Each hit carries a clean Target Sequence, pubmed_id/pubmed_url, plasmid_id/plasmid_url, and Depositor. Cite the PubMed ID of the original publication in your methods.

The `application` argument accepts intent words and maps them to Addgene's vocabulary: knockout→cut, activation→activate, inhibition/CRISPRi→interfere/RNA targeting.
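The intent-word mapping can be pictured as a small lookup table. The sketch below is an assumption about how such a mapping might look. The names `INTENT_TO_ADDGENE` and `addgene_terms` are hypothetical, and the real `search_addgene.py` may implement this differently.

```python
from __future__ import annotations

# Hypothetical lookup from user intent words to the Addgene labels quoted above.
# Labels are stored lowercase; the real CSV may use different capitalization.
INTENT_TO_ADDGENE = {
    "knockout": {"cut"},
    "activation": {"activate"},
    "inhibition": {"interfere", "rna targeting"},
    "crispri": {"interfere", "rna targeting"},
}


def addgene_terms(application: str) -> set[str]:
    """Return the Addgene application labels matching an intent word."""
    key = application.strip().lower()
    if key not in INTENT_TO_ADDGENE:
        raise ValueError(f"unknown application: {application!r}")
    return INTENT_TO_ADDGENE[key]


print(addgene_terms("knockout"))
```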

Method 2 — Literature & web search (REQUIRED)

This environment does not expose advanced_web_search_claude; use the available tools instead. Run both for coverage:

LiteratureSearch — peer-reviewed papers (validated guides, supplements).
WebSearch — vendor/database hits (GenScript, Horizon, lab protocols).

Query templates (substitute the gene):

"sgRNA" OR "guide RNA" "<GENE>" validated experimental
"CRISPR knockout" "<GENE>" sgRNA sequence validated
"<GENE>" sgRNA "cutting efficiency" OR "on-target"
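Filling in the templates by hand is error-prone when several genes are queried. A minimal sketch of the substitution follows. The helper name `build_queries` is hypothetical, and the resulting strings would still need to be passed to LiteratureSearch and WebSearch separately.

```python
from __future__ import annotations

# The three query templates from the list above, with <GENE> as a format field.
QUERY_TEMPLATES = [
    '"sgRNA" OR "guide RNA" "{gene}" validated experimental',
    '"CRISPR knockout" "{gene}" sgRNA sequence validated',
    '"{gene}" sgRNA "cutting efficiency" OR "on-target"',
]


def build_queries(gene: str) -> list[str]:
    """Substitute a gene symbol into every query template."""
    gene = gene.strip()
    if not gene:
        raise ValueError("gene symbol is required")
    return [template.format(gene=gene) for template in QUERY_TEMPLATES]


for query in build_queries("TP53"):
    print(query)
```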

Scan at least 10 to 15 results and check supplementary materials. For every validated guide, record the sequence, the citation (PMID/DOI), and the validation details (cell line, cutting efficiency).

Decision: if either method yields usable validated guides, select 3–4 and export. Only if both come up empty, go to Option 2.
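The decision rule can be written as a short cascade. This is a sketch under stated assumptions: `choose_tier` is a hypothetical helper, guides are represented as plain list items, and the caller is assumed to have already run both Method 1 and Method 2.

```python
def choose_tier(addgene_hits, literature_hits, crispick_hits, n=4):
    """Return (tier label, up to n guides) following the validated-first order."""
    # Validated guides from either method take priority over any prediction.
    validated = list(addgene_hits) + list(literature_hits)
    if validated:
        return "Option 1 (validated)", validated[:n]
    # Fall back to precomputed designs only when both methods came up empty.
    if crispick_hits:
        return "Option 2 (CRISPick)", list(crispick_hits)[:n]
    # Nothing usable anywhere: the guides must be designed de novo.
    return "Option 3 (de novo)", []


tier, guides = choose_tier([], ["guide_from_paper"], ["crispick_guide"])
print(tier, guides)
```

Because the literature hits are merged with the Addgene hits before the first check, an empty Method 1 result never triggers Option 2 on its own.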

Option 2 — CRISPick precomputed designs

Use when no validated guides exist, you need genome-wide coverage, or you want ranked options.

Step 1 — Resolve the dataset URL

from find_crispick_dataset import find_crispick_dataset
info = find_crispick_dataset("human", cas="SpCas9", application="knockout")
print(info["matches"])   # GRCh38 + GRCh37 dataset URLs
print(info["warning"])   # Cas12a variant warning, if applicable

AsCas12a and enAsCas12a are different enzymes, and guides designed for one may not work with the other. The finder matches the exact enzyme token, so datasets never cross-contaminate.
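Exact-token matching matters because "AsCas12a" is a substring of "enAsCas12a". The sketch below illustrates the difference. `matches_enzyme` is a hypothetical helper, the file name is made up for the example, and `find_crispick_dataset.py` may do this another way.

```python
import re


def matches_enzyme(dataset_name: str, enzyme: str) -> bool:
    """True only if `enzyme` appears as a whole token in the dataset name."""
    # Split on anything that is not a letter or digit (underscores, dots, dashes).
    tokens = re.split(r"[^A-Za-z0-9]+", dataset_name)
    return enzyme in tokens


# Made-up file name, used only to illustrate the substring problem.
name = "sgRNA_design_9606_GRCh38_enAsCas12a_CRISPRko.txt.gz"

naive = "AsCas12a" in name                    # substring test wrongly matches
exact = matches_enzyme(name, "AsCas12a")      # token test rejects the wrong enzyme
print(naive, exact, matches_enzyme(name, "enAsCas12a"))
```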

Step 2 — Download & extract (files are 50–700 MB; not bundled)

wget '<URL from Step 1>'
gunzip sgRNA_design_*.txt.gz
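Where `wget` and `gunzip` are unavailable, the same two steps can be done with the Python standard library. This is a sketch, not part of the bundled scripts: the `download` and `gunzip` helpers are hypothetical. Both stream in chunks, which matters for files of several hundred MB.

```python
import gzip
import shutil
import urllib.request
from pathlib import Path


def download(url: str, dest: Path) -> Path:
    """Stream `url` to `dest` without loading the whole file into memory."""
    with urllib.request.urlopen(url) as resp, open(dest, "wb") as out:
        shutil.copyfileobj(resp, out)
    return dest


def gunzip(src: Path) -> Path:
    """Decompress `src` (*.gz) next to itself and return the new path."""
    dest = src.with_suffix("")  # drops only the trailing ".gz"
    with gzip.open(src, "rb") as fin, open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dest


# Usage (substitute the URL resolved in Step 1):
#   archive = download(url, Path(url.rsplit("/", 1)[-1]))
#   table = gunzip(archive)
```

Unlike the `gunzip` command, this helper leaves the original `.gz` file in place, so delete it manually if disk space is tight.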

Step 3 — Filter, rank, select

from select_crispick_sgrnas import select_crispick_sgrnas
picks = select_crispick_sgrnas("sgRNA_design_..._CRISPRko_....txt", "TP53", n=4)

Guides are ranked by Combined Rank (lower = better) by default; pass rank_by="on_target" or rank_by="off_target" to prioritize efficiency or specificity. Picks are spread across distinct exons for redundancy. Optional filters: exon=, cut_position_range=, max_target_cut_pct= (knockout). Column names are resolved defensively (both the real CRISPick layout and abbreviated layouts are handled); see references/crispick_file_format.md. If the gene is absent from the file, go to Option 3.
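The "rank, then spread across exons" idea can be sketched in plain Python. This is an illustration only: `pick_guides`, the dictionary keys, and the placeholder rows are hypothetical, and `select_crispick_sgrnas.py` works on the real CRISPick columns and may break ties differently.

```python
def pick_guides(rows, n=4, rank_key="combined_rank", exon_key="exon"):
    """Pick up to n guides, lowest rank first, preferring distinct exons."""
    ordered = sorted(rows, key=lambda r: r[rank_key])
    chosen = []        # indices into `ordered`
    seen_exons = set()
    # First pass: take the best-ranked guide from each exon not yet covered.
    for i, row in enumerate(ordered):
        if len(chosen) < n and row[exon_key] not in seen_exons:
            chosen.append(i)
            seen_exons.add(row[exon_key])
    # Second pass: if slots remain, fill them by rank regardless of exon.
    for i in range(len(ordered)):
        if len(chosen) < n and i not in chosen:
            chosen.append(i)
    return [ordered[i] for i in sorted(chosen)]


# Placeholder rows; sequences are omitted on purpose.
rows = [
    {"id": "g1", "combined_rank": 1, "exon": 2},
    {"id": "g2", "combined_rank": 2, "exon": 2},
    {"id": "g3", "combined_rank": 3, "exon": 3},
    {"id": "g4", "combined_rank": 4, "exon": 4},
    {"id": "g5", "combined_rank": 5, "exon": 5},
]
print([r["id"] for r in pick_guides(rows, n=3)])
```

With these rows the second-ranked guide is skipped in favour of guides in other exons, trading a slightly worse rank for independent cut sites.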

Option 3 — De-novo design (last resort)

For genes/organisms not covered above. Follow references/design_rules.md:

Length: 20 bp (SpCas9/SaCas9; check_design_rules.py also accepts 21 bp for SaCas9), 23–25 bp (Cas12a). PAM: SpCas9 NGG, SaCas9 NNGRRT, Cas12a TTTV (5').
GC 40–60%; avoid TTTT and homopolymer runs >4 nt.
KO → early exons (first ~50%); CRISPRa → −200 to +1 of TSS; CRISPRi → −50 to +300 of TSS.
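To show how the length, PAM, and GC rules combine, here is a minimal sketch that scans a DNA string for SpCas9 candidates. It makes several assumptions: `find_spcas9_sites` is a hypothetical helper, it only looks at the forward strand (a real scan would also cover the reverse complement), and it does no off-target assessment.

```python
import re


def gc_percent(seq: str) -> float:
    """GC content of `seq` as a percentage."""
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def find_spcas9_sites(dna: str, length: int = 20):
    """Yield (start, protospacer, pam) for forward-strand NGG sites that pass the rules."""
    dna = dna.upper()
    # A lookahead lets overlapping candidate sites all be reported.
    pattern = re.compile(rf"(?=([ACGT]{{{length}}})([ACGT]GG))")
    for m in pattern.finditer(dna):
        spacer, pam = m.group(1), m.group(2)
        if not 40.0 <= gc_percent(spacer) <= 60.0:
            continue  # outside the 40-60% GC window
        if re.search(r"T{4,}|A{5,}|C{5,}|G{5,}", spacer):
            continue  # TTTT or a homopolymer run longer than 4 nt
        yield m.start(), spacer, pam


# Synthetic sequence built for the example; not a real genomic target.
demo = "CC" + "ACGTACGTACGTACGTACGT" + "TGG" + "A"
print(list(find_spcas9_sites(demo)))
```

Candidates found this way still need the positional rule (early exons or the TSS window) and a proper specificity check before use.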

from check_design_rules import check_design_rules, format_report
print(format_report(check_design_rules("GAGGTTGTGAGGCGCTGCCC", "SpCas9", pam="AGG")))

This checks rules only — for real off-target assessment use Cas-OFFinder/CRISPOR or CRISPick ranks.

Export (all tiers)

from export_results import from_addgene, from_crispick, export
unified = from_addgene(hits, application="knockout", enzyme="SpCas9")  # or from_crispick(...)
export(unified, gene="TP53", tier="Option 1 (validated Addgene)",
       outdir="/path/to/results", rationale="Validated guides found via Method 1.")

Universal best practice

Test 3–4 sgRNAs per gene experimentally regardless of predicted scores, and validate edits (Sanger sequencing; TIDE/T7E1 for indels). Prediction scores guide selection but do not replace empirical validation.

Citations & acknowledgments (preserve in user methods)

Validated guides (Option 1): Addgene (https://www.addgene.org). Cite the PubMed ID of each guide's original publication. Acknowledge: "Validated sgRNA sequences obtained from Addgene."
CRISPick (Option 2): "Guide designs provided by the CRISPick web tool of the GPP at the Broad Institute."
Cas9 (SpCas9, SaCas9): Sanson KR, et al. Nat Commun. 2018;9(1):5416. PMID: 30575746.
Cas12a (AsCas12a, enAsCas12a): DeWeirdt PC, et al. Nat Biotechnol. 2021;39(1):94–104. PMID: 32661438. (Specify which Cas12a variant you used.)

Scientific caveats

Bundled Addgene/CRISPick files are a fixed snapshot (197 genes / 238 datasets); literature search (Method 2) is mandatory precisely because the snapshot is incomplete.
Genome build matters: human defaults to GRCh38 (GRCh37 is also available) and mouse to GRCm38. Match guide coordinates to your reference.
The skill does not perform genome-wide off-target alignment beyond CRISPick's precomputed ranks.

Code preview

scripts/__init__.py

"""sgRNA design skill: helper scripts for the three-tiered workflow."""

scripts/check_design_rules.py

"""
Option 3 helper: sanity-check a candidate sgRNA against the de-novo design rules.

This is a lightweight rule checker (length, GC 40-60%, TTTT terminator, >4 homopolymer runs,
and PAM verification for a given enzyme). It is NOT a genome-wide off-target search — use
Cas-OFFinder / CRISPOR or CRISPick's precomputed off-target ranks for real specificity.
"""

from __future__ import annotations

import re

# Enzyme -> (expected protospacer length range, PAM regex, PAM side).
# PAM regex uses IUPAC: N=ACGT, V=ACG, R=AG.
_IUPAC = {"N": "[ACGT]", "V": "[ACG]", "R": "[AG]", "Y": "[CT]", "W": "[AT]",
          "S": "[GC]", "K": "[GT]", "M": "[AC]"}

ENZYMES = {
    "SpCas9":     {"len": (20, 20), "pam": "NGG",    "side": "3'"},
    "SaCas9":     {"len": (20, 21), "pam": "NNGRRT", "side": "3'"},
    "AsCas12a":   {"len": (23, 25), "pam": "TTTV",   "side": "5'"},
    "enAsCas12a": {"len": (23, 25), "pam": "TTTV",   "side": "5'"},
}


def _pam_to_regex(pam: str) -> str:
    return "".join(_IUPAC.get(b, b) for b in pam.upper())


def gc_content(seq: str) -> float:
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def check_design_rules(protospacer: str, enzyme: str = "SpCas9", pam: str | None = None) -> dict:
    """
    Check one candidate protospacer against the de-novo rules.

    Parameters
    ----------
    protospacer : str
        The guide/protospacer sequence (without the PAM), 5'->3'.
    enzyme : str
        One of SpCas9, SaCas9, AsCas12a, enAsCas12a.
    pam : str, optional
        The observed PAM in the genome flanking the protospacer. If given, it is checked
        against the enzyme's PAM pattern.

    Returns
    -------
    dict: {'passes': bool, 'checks': {name: (ok, detail)}, 'enzyme': ..., 'gc': float}
    """
    seq = protospacer.upper().strip()
    spec = ENZYMES.get(enzyme)
    checks: dict[str, tuple[bool, str]] = {}

    if spec is None:
        return {"passes": False, "enzyme": enzyme, "gc": None,
                "checks": {"enzyme_known": (False, f"Unknown enzyme '{enzyme}'. "
                                            f"Known: {list(ENZYMES)}")}}

    lo, hi = spec["len"]
    checks["length"] = (lo <= len(seq) <= hi,
                        f"{len(seq)} bp (expected {lo}-{hi} for {enzyme})")

    gc = gc_content(seq)
    checks["gc_content"] = (40.0 <= gc <= 60.0, f"{gc:.0f}% (target 40-60%)")

    checks["no_TTTT"] = ("TTTT" not in seq, "TTTT terminator present" if "TTTT" in seq
                         else "no TTTT run")

    homo = re.search(r"(A{5,}|C{5,}|G{5,}|T{5,})", seq)
    checks["no_long_homopolymer"] = (homo is None,
                                     f"{homo.group(0)} run" if homo else "no run >4 nt")

    checks["valid_bases"] = (re.fullmatch(r"[ACGT]+", seq) is not None,
                             "non-ACGT characters present" if not re.fullmatch(r"[ACGT]+", seq)
                             else "all ACGT")

scripts/export_results.py

"""
Export selected sgRNAs to a unified CSV + a short markdown summary.

Produces the two deliverables the skill returns for a gene:
  <GENE>_selected_sgRNAs.csv   - unified table across whichever tier(s) were used
  <GENE>_sgRNA_summary.md      - which tier was used, why, the picks, and caveats

The unified schema lets validated (Addgene), CRISPick, and de-novo guides sit in one table.
"""

from __future__ import annotations

import os
import pandas as pd

UNIFIED_COLUMNS = [
    "gene", "sgRNA_sequence", "source", "application", "enzyme",
    "rank_or_score", "exon_or_position", "pam", "citation_or_dataset", "notes",
]

# source values: "validated_addgene" | "crispick" | "de_novo"


def from_addgene(df: pd.DataFrame, application: str = "", enzyme: str = "") -> pd.DataFrame:
    """Map a search_addgene() result into the unified schema."""
    rows = []
    for _, r in df.iterrows():
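        # Build the citation from whichever IDs are present; pd.notna() skips
        # None/NaN so missing IDs never render as "PMID nan" or a leading "; ".
        parts = []
        if pd.notna(r.get("pubmed_id")) and r.get("pubmed_id") != "":
            parts.append(f"PMID {r['pubmed_id']}")
        if pd.notna(r.get("plasmid_id")) and r.get("plasmid_id") != "":
            parts.append(f"Addgene #{r['plasmid_id']}")
        cite = "; ".join(parts)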
        rows.append({
            "gene": r.get("Target Gene", ""),
            "sgRNA_sequence": r.get("Target Sequence", ""),
            "source": "validated_addgene",
            "application": application or r.get("Application", ""),
            "enzyme": enzyme or r.get("Cas9 Species", ""),
            "rank_or_score": "validated",
            "exon_or_position": "",
            "pam": "",
            "citation_or_dataset": cite,
            "notes": f"Depositor: {r.get('Depositor','')}; species: {r.get('Target Species','')}",
        })
    return pd.DataFrame(rows, columns=UNIFIED_COLUMNS)


def from_crispick(df: pd.DataFrame, gene: str, dataset_url: str = "",
                  application: str = "", enzyme: str = "") -> pd.DataFrame:
    """Map a select_crispick_sgrnas() result into the unified schema."""
    def col(*names):
        for n in names:
            if n in df.columns:
                return n
        return None

    seq_c = col("sgRNA Sequence", "sgRNA_sequence")
    exon_c = col("Exon Number", "Exon_ID")
    pos_c = col("sgRNA Cut Position (1-based)", "sgRNA 'Cut' Position")
    pam_c = col("PAM Sequence", "PAM")
    ds = dataset_url.rsplit("/", 1)[-1] if dataset_url else "CRISPick"

    rows = []
    for _, r in df.iterrows():
        exon_pos = ""
        if exon_c and pd.notna(r.get(exon_c)):
            exon_pos = f"exon {r[exon_c]}"
        if pos_c and pd.notna(r.get(pos_c)):
            exon_pos = (exon_pos + f" @ {int(r[pos_c])}").strip()
        rows.append({
            "gene": gene,
            "sgRNA_sequence": r.get(seq_c, "") if seq_c else "",
            "source": "crispick",
            "application": application,
            "enzyme": enzyme,
            "rank_or_score": r.get("rank_value", ""),
            "exon_or_position": exon_pos,
            "pam": r.get(pam_c, "") if pam_c else "",
            "citation_or_dataset": ds,
            "notes": "CRISPick precomputed; rank lower=better (or score higher=better)",
        })
    return pd.DataFrame(rows, columns=UNIFIED_COLUMNS)

Companion files

Type	Path	Bytes
Markdown	references/crispick_file_format.md	4,037
Markdown	references/design_rules.md	2,167
Markdown	references/refresh_resources.md	1,399
CSV	references/resource/addgene_grna_sequences.csv	54,648
Text	references/resource/CRISPick_download_links.txt	38,053
Python	scripts/__init__.py	72
Python	scripts/check_design_rules.py	3,946
Python	scripts/export_results.py	6,212
Python	scripts/find_crispick_dataset.py	6,754
Python	scripts/search_addgene.py	7,031
Python	scripts/select_crispick_sgrnas.py	7,165
Markdown	SKILL.md	7,586
JSON	skill.meta.json	2,457
