- Dataset: 20M NYC Taxi rows
- SQL grammar: 43 Lark rules
- Headline slice: 0 / 8 CFG failures
- Generated: June 11, 2026
- Cases: 21 labelled + 10 refusal
TL;DR
GPT-5's context-free grammar feature keeps decoding inside a 43-rule Lark grammar at every token, so the SQL it emits is provably SELECT-only ClickHouse SQL, not just prompted to be. On the 8 adversarial prompts designed to break the unconstrained baseline, the baseline failed 7 out of 8 and the constrained path failed 0. On clean prompts, both modes score ~100% and the grammar costs nothing; the separation only appears on the adversarial slice.
The eval runs against ClickHouse's NYC Taxi dataset — 20 million trips, July–September 2015. Correctness is graded by comparing result sets against live reference runs, not SQL strings. The grammar is the only manipulated variable; both modes share the same base system prompt.
The question
GPT-5's Responses API added a context-free grammar (CFG) feature: a custom tool whose format field is a Lark grammar. At every decoding step, LLGuidance masks the logits so that every token that would violate the grammar gets zero probability, and the model can only emit tokens that stay inside the accepted language. The output is grammar-conformant by construction.
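The masking step can be pictured with a toy example. This is a minimal sketch of the general idea, not LLGuidance's implementation: the four-token vocabulary, the `allowed` set, and both function names are made up for illustration.

```typescript
// Toy illustration of grammar-constrained sampling.
// `allowed` stands in for the token ids the grammar accepts at this step.
function maskLogits(logits: number[], allowed: Set<number>): number[] {
  // Disallowed tokens get -Infinity so they end up with probability 0.
  return logits.map((logit, tokenId) => (allowed.has(tokenId) ? logit : -Infinity));
}

function softmax(logits: number[]): number[] {
  const finite = logits.filter((x) => x !== -Infinity);
  if (finite.length === 0) {
    throw new Error("grammar allows no token at this step");
  }
  const max = Math.max(...finite);
  const exps = logits.map((x) => Math.exp(x - max)); // Math.exp(-Infinity) is 0
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / total);
}

// Hypothetical vocabulary: 0 = "SELECT", 1 = "DROP", 2 = "select", 3 = ";"
const logits = [1.0, 3.0, 2.0, 0.5];
// At the start of a statement, assume only "SELECT" is inside the language.
const allowed = new Set<number>([0]);
const probs = softmax(maskLogits(logits, allowed));
```

Even though the unmasked model prefers "DROP" here, all probability mass lands on the one token the grammar accepts.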
The practical question: does grammar-constrained decoding buy anything over a strong unconstrained baseline that is already prompted to emit valid SQL? The answer turns out to be: not on easy prompts, yes on adversarial ones. The grammar's value isn't preventing a capable model from writing bad SQL on standard queries — it's closing the gap when the prompt is specifically designed to induce schema drift.
Pipeline
The route (app/api/query/route.ts) is thin on purpose:
- Zod body validation + in-memory LRU cache check (256-entry, 2015 dataset is static)
- generateSQLConstrained: GPT-5 Responses API with a custom tool whose format is the Lark grammar, plus a cannot_answer function tool; tool_choice: "required"
- Extract SQL from the custom_tool_call output item
- runQuery: ClickHouse Cloud HTTPS / 8443 with server-side guards: max_execution_time 30s, max_result_rows 100 000
- Return { sql, columns, rows, generationMs, executionMs, usage }
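The generation and extraction steps above might look roughly like the sketch below. It builds the request as a plain object and never touches the network. The tool name `taxi_sql`, the placeholder grammar, the `reason` parameter on `cannot_answer`, and the helper `readOutcome` are assumptions for illustration; the `custom` tool shape and the `custom_tool_call` / `function_call` item types follow the Responses API as publicly documented, but should be checked against the current API reference.

```typescript
// Tool definitions as plain objects; no SDK import, so this runs offline.
const grammarTool = {
  type: "custom",
  name: "taxi_sql", // hypothetical tool name
  description: "Emit one SELECT statement over nyc_taxi.",
  // Placeholder grammar; the real one is the 43-rule Lark string.
  format: { type: "grammar", syntax: "lark", definition: 'start: "SELECT"' },
};

const cannotAnswerTool = {
  type: "function",
  name: "cannot_answer",
  description: "Decline when the question cannot be answered from the table.",
  // Assumed parameter schema, not taken from the repo.
  parameters: {
    type: "object",
    properties: { reason: { type: "string" } },
    required: ["reason"],
  },
};

const requestBody = {
  model: "gpt-5", // assumed model id
  input: "How many trips were there in August 2015?",
  tools: [grammarTool, cannotAnswerTool],
  tool_choice: "required",
};

type OutputItem = { type: string; name?: string; input?: string; arguments?: string };
type Outcome =
  | { kind: "sql"; sql: string }
  | { kind: "refused"; reason: string }
  | { kind: "empty" };

// Walk the response output and pick out either the grammar-constrained SQL
// or the refusal call.
function readOutcome(output: OutputItem[]): Outcome {
  for (const item of output) {
    if (item.type === "custom_tool_call" && typeof item.input === "string") {
      return { kind: "sql", sql: item.input };
    }
    if (item.type === "function_call" && item.name === "cannot_answer") {
      const args = JSON.parse(item.arguments ?? "{}") as { reason?: string };
      return { kind: "refused", reason: args.reason ?? "" };
    }
  }
  return { kind: "empty" };
}
```

With `tool_choice: "required"` the model has to call one of the two tools, so every response ends up as either SQL or a refusal.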
The unconstrained path drops the grammar tool and cannot_answer, emitting plain-text SQL with any ```sql fences stripped. Both paths share the same base system prompt — the grammar is the only variable.
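Fence stripping on the unconstrained path could be done along these lines. This is a sketch under assumptions: `stripSqlFences` is a hypothetical helper name, and it only handles a single fenced block wrapping the whole reply.

```typescript
// Built at runtime so this example does not contain a literal code fence.
const FENCE = "`".repeat(3);

// Remove one surrounding code fence (with or without a language tag)
// from a plain-text model reply; leave unfenced replies untouched.
function stripSqlFences(text: string): string {
  const trimmed = text.trim();
  if (!trimmed.startsWith(FENCE)) return trimmed;
  const lines = trimmed.split("\n");
  lines.shift(); // drop the opening fence line, e.g. the one tagged "sql"
  const last = lines[lines.length - 1] ?? "";
  if (lines.length > 0 && last.trim() === FENCE) lines.pop(); // drop the closing fence
  return lines.join("\n").trim();
}

const fenced = FENCE + "sql\nSELECT count() FROM nyc_taxi\n" + FENCE;
const cleaned = stripSqlFences(fenced);
```

The constrained path needs no such step, because the grammar's accepted language contains no fence characters in the first place.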
The grammar
The grammar lives in lib/grammar/taxi.ts as a Lark string. It is the load-bearing security boundary, not the system prompt. What it enforces by construction — not by instruction:
| Surface | What the grammar accepts |
|---|---|
| Statements | SELECT only. DDL, DML, TRUNCATE, GRANT — structurally unrepresentable. |
| Tables | FROM nyc_taxi or FROM default.nyc_taxi. No joins, subqueries, CTEs, or UNION. |
| Columns | 13 whitelisted. Lat/long excluded — a hallucinated column name cannot be emitted. |
| Functions | 7 aggregates (count, sum, avg, min, max, uniq, uniqExact) and 8 date functions (toDate, toHour, toStartOf*, toDayOfWeek, toMonth, toYear). |
| Keywords | UPPERCASE required — lowercase equivalents are outside the accepted language. |
| Shape | No semicolons, no comments, no multi-statement input. |
count() AS trips … ORDER BY trips). An off-whitelist alias parses but fails on ClickHouse; the projection itself stays whitelist-only, and evals 2–3 confirm this never fires in practice.

Explicit whitespace threading. The GPT-5 CFG cookbook cautions against %ignore WS. Every keyword/identifier boundary uses an explicit WS terminal; optional whitespace uses WS? at the rule level rather than a zero-width terminal. This creates LALR(1) shift/reduce conflicts, which is why the local Python validator uses Earley. LLGuidance is not LALR-bound and accepts the same grammar.
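To make the whitespace-threading style concrete, here is a tiny fragment in the same spirit, stored as a TypeScript string the way the real grammar is. It is an illustration only and not an excerpt of the 43-rule grammar: the rule names, the three column names, and the `definedNames` helper are assumptions.

```typescript
// Illustrative Lark fragment; NOT the grammar from lib/grammar/taxi.ts.
// Column names here are assumed examples, not the 13-column whitelist.
const miniTaxiGrammar = String.raw`
start: "SELECT" WS select_list WS "FROM" WS table
select_list: select_item (WS? "," WS? select_item)*
select_item: column | AGG "(" WS? column WS? ")"
table: "nyc_taxi" | "default.nyc_taxi"
column: "fare_amount" | "trip_distance" | "passenger_count"
AGG: "count" | "sum" | "avg"
WS: /[ \t\n]+/
`;

// List the rule and terminal names defined in a grammar string,
// e.g. to check a rule count in a unit test.
function definedNames(grammar: string): string[] {
  const names: string[] = [];
  for (const line of grammar.split("\n")) {
    const match = /^([A-Za-z_][A-Za-z0-9_]*)\s*:/.exec(line);
    if (match !== null && match[1] !== undefined) names.push(match[1]);
  }
  return names;
}

const names = definedNames(miniTaxiGrammar);
```

Mandatory separators use `WS`, optional ones use `WS?` inside the rule, and there is no `%ignore` directive anywhere, so whitespace is always an explicit part of the accepted language.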
Eval design
NL→SQL systems can fail in at least five distinct ways. Each suite measures a different failure mode; the HEADLINE slice is the one that actually discriminates:
Execution correctness
Result set matches a live ClickHouse reference run for each of 21 labelled prompts. Compares result sets, not SQL strings. pass@N ≥ 0.5.
SQL validity
Every constrained output executes on ClickHouse without error. A grammar-accepted query that fails means the grammar is wrong, not the model.
Schema adherence
Generated SQL references only real columns, tables, and functions. Zero violations on the constrained path by construction; the adversarial slice is where the unconstrained baseline drifts.
Refusal handling
10 unanswerable prompts (out-of-domain + phantom columns). Constrained path declines via cannot_answer tool; unconstrained baseline emits degenerate answers.
CFG vs no-CFG head-to-head
Per-case + overall + adversarial-slice tables, each with a schemaClean column. Hard assertion: constrained execRate == 1.0.
HEADLINE slice
The 8 prompts built to break the unconstrained baseline. Baseline failed 7 / 8; constrained path 0 / 8. This is the one axis where the grammar measurably separates.
Cases are tagged easy / medium / hard (Spider/BIRD-style) so pass@N is reported per tier. Each case also has distractor queries — plausible-but-wrong SQL (sum vs avg, fare vs total, wrong date) — that are verified to actually fail on the live dataset. A coincidentally-correct query can't pass.
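Grading by result set rather than by SQL string could be sketched as follows. The repo's actual comparator is not shown in this write-up, so everything here is an assumption: the `Row` shape, the order-insensitive comparison, and rounding floats to six decimals as a crude tolerance.

```typescript
type Cell = string | number | null;
type Row = Cell[];

// Canonical text form of a cell; numbers are rounded so tiny float
// differences between runs do not count as mismatches.
function cellKey(cell: Cell, decimals: number): string {
  if (cell === null) return "null";
  if (typeof cell === "number") return "n:" + cell.toFixed(decimals);
  return "s:" + cell;
}

// Order-insensitive comparison of two result sets.
// Queries whose ORDER BY matters would need an order-sensitive variant.
function sameResultSet(a: Row[], b: Row[], decimals: number = 6): boolean {
  if (a.length !== b.length) return false;
  const canon = (rows: Row[]): string[] =>
    rows.map((row) => row.map((c) => cellKey(c, decimals)).join("\u0001")).sort();
  const left = canon(a);
  const right = canon(b);
  return left.every((row, i) => row === right[i]);
}

// Made-up rows standing in for a reference run, a model run, and a
// distractor that confuses sum with avg.
const reference: Row[] = [[1, 12.5], [2, 9.25]];
const candidate: Row[] = [[2, 9.250000001], [1, 12.5]];
const distractor: Row[] = [[1, 25], [2, 18.5]];

const candidatePasses = sameResultSet(reference, candidate);
const distractorPasses = sameResultSet(reference, distractor);
```

A distractor is only useful if `sameResultSet` returns false for it on the live data, which is the property the suite verifies for each case.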
Results
The headline: on the 8-prompt adversarial slice built to induce schema drift in the unconstrained baseline, the constrained path produced zero failures.
| Mode | execRate | schemaClean | Headline fails | Avg latency |
|---|---|---|---|---|
| CFG constrained | 1.000 | 1.000 | 0 / 8 | 2.7 s |
| Unconstrained baseline | 1.000 | 0.875 | 7 / 8 | 3.3 s |
The unconstrained baseline's execRate stays at 1.0 on clean prompts — GPT-5 writes valid SQL when the question is straightforward. The separation appears in schemaClean on the adversarial slice: the baseline reaches for toUnixTimestamp, dateDiff, and phantom columns that don't exist in the 13-column whitelist. The grammar can't emit those tokens, so the constrained path either produces a valid query or calls cannot_answer.
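A schemaClean-style check for functions might look like the sketch below. It is a heuristic under stated assumptions, not the repo's checker: it treats any identifier followed by `(` as a function call, skips all-uppercase names as SQL keywords, and would be fooled by parentheses inside string literals. The function whitelist is copied from the grammar table above; `pickup_datetime` and `dropoff_datetime` in the usage lines are assumed column names.

```typescript
// Functions named in the grammar table; toStartOf* is handled as a prefix.
const ALLOWED_FUNCTIONS = new Set<string>([
  "count", "sum", "avg", "min", "max", "uniq", "uniqExact",
  "toDate", "toHour", "toDayOfWeek", "toMonth", "toYear",
]);

function isAllowedFunction(name: string): boolean {
  return ALLOWED_FUNCTIONS.has(name) || name.startsWith("toStartOf");
}

// Return every called function that is outside the whitelist.
function offWhitelistFunctions(sql: string): string[] {
  const pattern = /\b([A-Za-z_][A-Za-z0-9_]*)\s*\(/g;
  const found = new Set<string>();
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(sql)) !== null) {
    const name = match[1] ?? "";
    // All-uppercase tokens such as IN ( ... ) are keywords, not functions.
    if (name === name.toUpperCase()) continue;
    if (!isAllowedFunction(name)) found.add(name);
  }
  return Array.from(found);
}

const drift = offWhitelistFunctions(
  "SELECT avg(dateDiff('minute', pickup_datetime, dropoff_datetime)) FROM nyc_taxi"
);
const clean = offWhitelistFunctions(
  "SELECT toStartOfMonth(pickup_datetime), count() FROM nyc_taxi WHERE passenger_count IN (1, 2)"
);
```

On the constrained path a check like this should never fire, since the grammar cannot emit a function name outside the list.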
The reasoning_effort: "low" optimization was eval-confirmed: 2.7 s vs 3.3 s average latency, no accuracy loss on the correctness evals. When the grammar guarantees syntactic validity, the model doesn't need to spend reasoning tokens on self-checking.
Design decisions
Honest gaps
What would need to change before this is a production claim rather than an experiment:
The repo is open: github.com/zacharyr0th/cfg-eval. The live demo runs at cfg-eval.vercel.app/query.