CFG EvalBenchmarkPublished evalJune 11, 20265 min

Dataset: 20M NYC Taxi rows
SQL grammar: 43 Lark rules
Headline slice: 0 / 8 CFG failures

Generated: June 11, 2026
Dataset: NYC Taxi (20M trips)
Cases: 21 labelled + 10 refusal
Grammar rules: 43

TL;DR

GPT-5's context-free grammar feature forces the decode to stay inside a 43-rule Lark grammar at every token — so the SQL it emits is provably valid SELECT-only ClickHouse, not just prompted to be. On the 8 adversarial prompts designed to break the unconstrained baseline, the baseline failed 7 out of 8. The constrained path failed 0. On clean prompts, both modes score ~100% and the grammar costs nothing — the separation only appears on the adversarial slice.

The eval runs against ClickHouse's NYC Taxi dataset — 20 million trips, July–September 2015. Correctness is graded by comparing result sets against live reference runs, not SQL strings. The grammar is the only manipulated variable; both modes share the same base system prompt.

The question

GPT-5's Responses API added a context-free grammar (CFG) feature — a custom tool whose format field is a Lark grammar. At every decoding step, LLGuidance masks the logit distribution — zeroing out every token that would violate the grammar — so the model can only emit tokens that stay inside the accepted language. The output is provably grammar-conformant.

Why "context-free"? Decades-old CS theory, not an OpenAI coinage. A context-free grammar rule applies regardless of what surrounds it — a condition is a column, operator, and value whether it appears in a WHERE or HAVING clause. That property makes CFGs checkable one token at a time: the engine only needs to track which rules are still open.

The practical question: does grammar-constrained decoding buy anything over a strong unconstrained baseline that is already prompted to emit valid SQL? The answer turns out to be: not on easy prompts, yes on adversarial ones. The grammar's value isn't preventing a capable model from writing bad SQL on standard queries — it's closing the gap when the prompt is specifically designed to induce schema drift.

Pipeline

The route (app/api/query/route.ts) is thin on purpose:

Zod body validation + in-memory LRU cache check (256-entry, 2015 dataset is static)
generateSQLConstrained — GPT-5 Responses API with a custom tool whose format is the Lark grammar, plus a cannot_answer function tool; tool_choice: "required"
Extract SQL from the custom_tool_call output item
runQuery — ClickHouse Cloud HTTPS / 8443 with server-side guards: max_execution_time 30s, max_result_rows 100 000
Return { sql, columns, rows, generationMs, executionMs, usage }

The unconstrained path drops the grammar tool and cannot_answer, emitting plain-text SQL with any ```sql fences stripped. Both paths share the same base system prompt — the grammar is the only variable.

The grammar

The grammar lives in lib/grammar/taxi.ts as a Lark string. It is the load-bearing security boundary, not the system prompt. What it enforces by construction — not by instruction:

Surface	What the grammar accepts
Statements	SELECT only. DDL, DML, TRUNCATE, GRANT — structurally unrepresentable.
Tables	FROM nyc_taxi or FROM default.nyc_taxi. No joins, subqueries, CTEs, or UNION.
Columns	13 whitelisted. Lat/long excluded — a hallucinated column name cannot be emitted.
Functions	7 aggregates (count, sum, avg, min, max, uniq, uniqExact) and 8 date functions (toDate, toHour, toStartOf*, toDayOfWeek, toMonth, toYear).
Keywords	UPPERCASE required — lowercase equivalents are outside the accepted language.
Shape	No semicolons, no comments, no multi-statement input.

One deliberate looseness: bare IDENTIFIERs in GROUP BY / ORDER BY / HAVING let projection aliases be referenced (e.g. count() AS trips … ORDER BY trips). An off-whitelist alias parses but fails on ClickHouse — the projection itself stays whitelist-only, and evals 2–3 confirm this never fires in practice.

Explicit whitespace threading. The GPT-5 CFG cookbook cautions against %ignore WS. Every keyword/identifier boundary uses an explicit WS terminal; optional whitespace uses WS? at the rule level rather than a zero-width terminal. This creates LALR-1 shift/reduce conflicts, which is why the local Python validator uses Earley — but LLGuidance is not LALR-bound and accepts the same grammar.

Eval design

NL→SQL systems can fail in at least five distinct ways. Each suite measures a different failure mode; the HEADLINE slice is the one that actually discriminates:

Execution correctness

Result set matches a live ClickHouse reference run for each of 21 labelled prompts. Compares result sets, not SQL strings. pass@N ≥ 0.5.

SQL validity

Every constrained output executes on ClickHouse without error. A grammar-accepted query that fails means the grammar is wrong, not the model.

Schema adherence

Generated SQL references only real columns, tables, and functions. Zero violations on the constrained path by construction; the adversarial slice is where the unconstrained baseline drifts.

Refusal handling

10 unanswerable prompts (out-of-domain + phantom columns). Constrained path declines via cannot_answer tool; unconstrained baseline emits degenerate answers.

CFG vs no-CFG head-to-head

Per-case + overall + adversarial-slice tables, each with a schemaClean column. Hard assertion: constrained execRate == 1.0.

★

HEADLINE slice

The 8 prompts built to break the unconstrained baseline. Baseline failed 7 / 8; constrained path 0 / 8. This is the one axis where the grammar measurably separates.

Cases are tagged easy / medium / hard (Spider/BIRD-style) so pass@N is reported per tier. Each case also has distractor queries — plausible-but-wrong SQL (sum vs avg, fare vs total, wrong date) — that are verified to actually fail on the live dataset. A coincidentally-correct query can't pass.

Results

The headline: on the 8-prompt adversarial slice built to induce schema drift in the unconstrained baseline, the constrained path produced zero failures.

Mode	execRate	schemaClean	Headline fails	Avg latency
CFG constrained	1.000	1.000	0 / 8	2.7 s
Unconstrained baseline	1.000	0.875	7 / 8	3.3 s

The unconstrained baseline's execRate stays at 1.0 on clean prompts — GPT-5 writes valid SQL when the question is straightforward. The separation appears in schemaClean on the adversarial slice: the baseline reaches for toUnixTimestamp, dateDiff, and phantom columns that don't exist in the 13-column whitelist. The grammar can't emit those tokens, so the constrained path either produces a valid query or calls cannot_answer.

The reasoning_effort: "low" optimization was eval-confirmed: 2.7 s vs 3.3 s average latency, no accuracy loss on the correctness evals. When the grammar guarantees syntactic validity, the model doesn't need to spend reasoning tokens on self-checking.

Design decisions

reasoning_effort: "low" on the constrained path

The grammar guarantees syntactic validity, so the model doesn't need to spend reasoning tokens self-checking. Eval-confirmed: constrained path averaged 2.7 s vs 3.3 s unconstrained, with no accuracy loss.

Earley parser, not LALR, for local validation

Explicit whitespace threading creates LALR-1 shift/reduce conflicts. Earley explores both paths. LLGuidance is not LALR-bound, so the local validator must match its acceptance behavior.

cannot_answer tool alongside the grammar

The grammar can accept but cannot refuse — an out-of-scope question would force the decoder into a degenerate query. tool_choice: "required" means the model calls exactly one: either the grammar tool or cannot_answer.

Length-bounded terminals

Every terminal is bounded by content and length — IDENTIFIER: /[A-Za-z_][A-Za-z0-9_]{0,63}/, STRING_LITERAL: /\'([^\'\n]|\'\' ){0,200}\'/, WS: /[ \t\n]{1,64}/. Unbounded payloads drift the decode out of distribution.

Raw openai SDK, not an AI framework

The custom-tool + grammar shape isn't first-class in @ai-sdk/openai yet. The Responses API format is: custom tool with format: { type: "grammar", syntax: "lark", definition: NYC_TAXI_LARK }.

LLM observability via Raindrop

Every /api/query and /api/execute request lands as a trace event via Next's after() — zero user-facing latency. The eval harness logs each trial as nl_to_sql_eval. build-eval-set.ts reads those signals back to fold real failures into the regression set.

Honest gaps

What would need to change before this is a production claim rather than an experiment:

Prompt-injection adversarial set

DROP TABLE, ;--, role-spoofing prompts. The grammar forecloses DDL structurally, but the exact boundary vs prompting alone is unmeasured.

Multi-table support

The grammar pins to one table. Joins and CTEs are a known LLGuidance failure mode flagged in the cookbook.

CI eval cron

A scheduled run per model tier would catch silent regressions automatically. The runbook persists per-run diffs but there's no automated trigger yet.

Historical traces in-app

The Trace sidebar shows the current request. Past traces live only in the external Raindrop dashboard.

The repo is open: github.com/zacharyr0th/cfg-eval. The live demo runs at cfg-eval.vercel.app/query.