Eval-Driven Development: reliable scoring when the judge has opinions
A methodology for reliable scoring on prompt-based AI features, when you can't write criteria upfront and the LLM judge keeps disagreeing with itself.
The way prompts get tested in practice: ship a prompt, read ten outputs, “looks fine,” ship a fix, read ten more, “still looks fine.” Months pass. The prompt has grown long, nobody remembers why half of it’s there, and the latest fix broke something nobody’s noticed yet.
That’s not testing. That’s hope, with a git history.
Eval-Driven Development is the way out. Eval first, code second. Like TDD, except the test runs on an LLM that can disagree with itself between Tuesday and Wednesday. This post is the methodology: when to start, where criteria come from, how to encode them, how to stop the judge from disagreeing with itself, and the order of validation gates that catches what your eyeballs miss.
The problem with vibe-testing
Your prompt is a function. Inputs are stochastic. Outputs are stochastic. The only thing standing between “works” and “silently broken” is whether you can mechanically tell the difference between a good output and a bad one, repeatedly, on the same input, without your judgment drifting between Monday and Friday.
That’s what an eval system is. Not a benchmark you ran once. Not a notebook you spun up after the first customer complaint. A persistent, mechanical, low-variance verdict pipeline that runs on every change to the prompt, the model, the temperature, the surrounding code, the upstream tool feeding the prompt, anything.
If you don’t have one, every prompt change is a gamble. Worse: it’s a gamble where you can’t tell whether you won, because the outputs look “mostly OK” on both sides of the change. You ship. You wait. Maybe the user notices something is off three weeks later. Maybe they don’t, and the regression sits in the long tail until the customer calls.
Teams that ship reliable AI features don’t have better intuition. They have an eval system that catches the change they would have missed.
Eval-Driven Development is what you call it when you treat the eval system as the foundation. Eval criteria authored before the prompt is locked. Criteria growing with every cycle. A validation gate before any merge. Same energy as TDD, with one wrinkle: the LLM is doing the grading, and the LLM is not deterministic. Most of this post is about that wrinkle.
Start day one. Schema-driven.
What teams try first
Build the prompt. Ship to a small user pool. Watch outputs. Write evals “once we know what good looks like.”
Why it doesn’t work
You will never know what good looks like in a way that scales. Your sense of “good” drifts over months. Engineers rotate off the project. The output schema changes. The next engineer has no way to recover the implicit rubric in your head. You ship for nine months, then a new hire is staring at a sprawl of instructions and asking what half of them are still doing, and nobody can answer.
What works
Author evals before the prompt is locked. Day one. Same way TDD writes the test before the function. The eval doesn’t have to be exhaustive on day one. It has to exist.
For prompt-based features specifically: lock the output schema first. Each field of the schema is a target. Each target gets exactly one criterion describing what good extraction looks like for that field. That’s your day-one eval. If the schema has eight fields, you have eight criteria. Done.
The tradeoff
Authoring criteria before you’ve seen any production outputs feels wrong. You’re guessing about what “good” means for fields you haven’t yet pulled real data through. You will write criteria that turn out to be too strict, or miss edge cases, or anchor on the wrong source location in the input. You’ll rewrite them in week three.
That’s fine. The point of day-one criteria isn’t to be permanent. It’s to be a starting point for the eval system to grow from. The alternative (no eval until you “know what good looks like”) guarantees the system never grows, because there’s nothing to grow from.
Scope discipline
On day one, the evals are field-level: does the structured output have the right value in each field for the given input. Other eval types (hallucination, safety, tone, factuality-vs-source) are real, they’re important, they will eventually live in the system. They are not day-one work.
Field-level evals are the cheapest to author against a schema. They use the schema you already wrote. They don’t require a separate ground-truth dataset. They give you a working eval system you can iterate from. Every other eval type can be layered in once the foundation is solid.
Three paths for where criteria come from
A field-level eval needs to know what “right value” means for each schema field. There are three ways to get there; real setups usually combine them.
Path A: Pre-development context from a domain expert
For software-style features (a contract clause classifier, a financial-figure extractor, a meeting-minute summarizer), the engineer can usually author baseline criteria alone. The schema fields are typed. The criteria are mechanical. “This field is the contract’s effective date, formatted as ISO 8601, sourced from the contract preamble, not from later amendments.”
For domain-specific work where the engineer is not the domain expert (legal-judgment classification, medical-coding extraction, regulatory-compliance flags), this falls over fast. The engineer doesn’t know what a domain expert would mark as correct. The criteria they write will be wrong in ways the engineer can’t detect.
Cheapest fix: have the domain expert label expected outputs on a small dataset before development starts. Fifty examples is often enough. Once you have the inputs paired with the expected outputs, you can fit the criteria to match. The criteria become deterministic: the output field matches the expected value when the expert provided one.
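A minimal sketch of that deterministic check, assuming a flat record of field values and expert labels that may cover only some fields. The `LabeledExample` shape and the `scoreAgainstLabel` helper are illustrative names, not part of any library.

```typescript
// Hypothetical shapes: one expert-labeled example and one model output.
type FieldValue = string | number | null;
type FieldRecord = Record<string, FieldValue>;

interface LabeledExample {
  input: string;
  // The expert fills in only the fields they are sure about.
  expected: Partial<FieldRecord>;
}

// Compare a model output against the expert label, field by field.
// Fields the expert left unlabeled are skipped rather than counted as failures.
function scoreAgainstLabel(
  example: LabeledExample,
  output: FieldRecord
): Record<string, boolean> {
  const verdicts: Record<string, boolean> = {};
  for (const field of Object.keys(example.expected)) {
    const want = example.expected[field];
    if (want === undefined) continue;
    verdicts[field] = output[field] === want;
  }
  return verdicts;
}

// Made-up example data for illustration.
const labeledContract: LabeledExample = {
  input: "This Agreement is effective as of 2024-03-01 between the parties.",
  expected: { effective_date: "2024-03-01", counterparty: "Acme GmbH" },
};

const contractVerdicts = scoreAgainstLabel(labeledContract, {
  effective_date: "2024-03-01",
  counterparty: "Acme",
  governing_law: "DE", // not labeled by the expert, so not graded
});
```

Exact equality is the simplest possible match rule. Fields that allow harmless formatting differences would need normalization before the comparison.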
Tradeoff: this requires a domain expert who’ll sit with you for an hour. If you don’t have one and can’t get one, Path A is closed.
Path B: Post-launch HITL corrections
If a feature is already running in production, every user interaction is a potential labeling event. Build a human-in-the-loop correction surface into the product. Users can:
- Mark outputs as good or bad (thumbs up / thumbs down)
- Optionally provide a corrected version (“this field should have been X”)
- Optionally provide textual feedback (“you missed the second installment amount”)
The marking step is the one that matters. The thumbs up/down is the primary signal; textual feedback (when present) adds context but is not required to make Path B work. The engineer does not read individual thumbs-down outputs. Aggregating the marks gives you a labeled dataset: every up-vote is a positive example, every down-vote is a negative one, and every correction is a near-positive example plus a delta.
This becomes a path to grow the eval system over time. Every batch of corrections updates the rubric.
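The aggregation step could look roughly like this. It assumes a hypothetical `FeedbackEvent` emitted by the correction surface; the field names are made up for illustration and your product will have its own.

```typescript
// Hypothetical feedback event emitted by the product's correction surface.
interface FeedbackEvent {
  traceId: string;
  mark: "up" | "down";
  correctedOutput?: string; // optional "this field should have been X"
  comment?: string; // optional free-text feedback
}

interface LabeledTraces {
  positives: string[]; // trace ids marked thumbs-up
  negatives: string[]; // trace ids marked thumbs-down
  corrections: { traceId: string; corrected: string }[];
}

// Turn raw marks into a labeled dataset without anyone reading single outputs.
function aggregateMarks(events: FeedbackEvent[]): LabeledTraces {
  const labeled: LabeledTraces = { positives: [], negatives: [], corrections: [] };
  for (const ev of events) {
    if (ev.mark === "up") labeled.positives.push(ev.traceId);
    else labeled.negatives.push(ev.traceId);
    // A correction carries the intended output, whatever the mark was.
    if (ev.correctedOutput !== undefined) {
      labeled.corrections.push({ traceId: ev.traceId, corrected: ev.correctedOutput });
    }
  }
  return labeled;
}

const labeledFromMarks = aggregateMarks([
  { traceId: "t1", mark: "up" },
  { traceId: "t2", mark: "down", correctedOutput: "billing" },
  { traceId: "t3", mark: "down", comment: "you missed the second installment amount" },
]);
```

Comments are deliberately ignored here: the marks alone are enough to build the dataset, which matches the point that textual feedback is optional.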
Tradeoff: requires product surface area for users to mark outputs. If your feature ships into a downstream automated pipeline where no human ever sees the output, Path B is closed.
Path C: LLM-derived criteria from marked traces
When you have a feature running, no domain expert on call, and a corpus of marked production traces (Path B’s output, or any feedback-labeled dataset), let an LLM derive the eval criteria for you. This is the third path, and the one most teams skip because it isn’t obvious.
The mechanism: feed both the marked-success and marked-failure subsets to an LLM in a single call, clearly labeled. The eval-generator sees what worked AND what didn’t work at the same time, so it doesn’t make a half-guess from one subset alone. It outputs a set of criteria, or a full G-eval rubric, or in some cases a custom LLM-as-judge prompt.
Today, this runs as a single shot. The ideal is to iterate: generate criteria, evaluate them on a holdout subset, keep what survives, and refine the criteria that disagree with the labels. That loop is the methodologically correct version, but it isn’t always practical at first. Single-shot works fine to bootstrap.
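Here is one way the single-shot call could be assembled. This is a sketch under assumptions: the `MarkedTrace` shape, the section headers, and the instruction wording are all hypothetical, and the actual model call is left out.

```typescript
interface MarkedTrace {
  input: string;
  output: string;
  mark: "success" | "failure";
}

// Build ONE prompt that shows both subsets, clearly labeled, so the
// eval-generator sees what worked and what didn't at the same time.
function buildEvalGeneratorPrompt(traces: MarkedTrace[], fieldName: string): string {
  const render = (t: MarkedTrace, i: number): string =>
    `Example ${i + 1}\nINPUT: ${t.input}\nOUTPUT: ${t.output}`;
  const successes = traces.filter((t) => t.mark === "success").map(render);
  const failures = traces.filter((t) => t.mark === "failure").map(render);
  return [
    `You are writing an evaluation criterion for the field "${fieldName}".`,
    "Below are production traces that users marked as successes and as failures.",
    "=== MARKED SUCCESS ===",
    successes.join("\n\n"),
    "=== MARKED FAILURE ===",
    failures.join("\n\n"),
    "Write one criterion that the successes satisfy and the failures violate. " +
      "State where in the input the value comes from and what must NOT be counted.",
  ].join("\n\n");
}

const sampleTraces: MarkedTrace[] = [
  { input: "I was charged twice this month.", output: "billing", mark: "success" },
  { input: "App crashes on login.", output: "account", mark: "failure" },
];
const generatorPrompt = buildEvalGeneratorPrompt(sampleTraces, "issue_category");
```

The string goes to whichever model you use as the eval-generator. Its answer is a draft criterion, which still has to pass the validation gates before it drives anything.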
Path C is the one that compounds. Every new batch of marked traces feeds the next iteration of the eval-generator. The eval system grows itself.
Encoding. Structured output, one criterion per field, G-eval first.
You have a schema. You have criteria. You need to encode the criteria into something the eval pipeline can run.
One criterion per field
Strict invariant. Each Zod / Pydantic schema field maps to exactly one criterion. No criterion spans multiple fields. No field has multiple competing criteria.
The reason is purely arithmetic. The moment you have multiple criteria per field (a correctness criterion AND a hallucination criterion AND a safety criterion on the same field), the score aggregation gets ambiguous. Do you average them? Take the worst? Weight them? Which weight? You will spend more engineering time arguing about the weighting than building the feature. Keep it 1:1 until you can defend the cross-field math, which is later.
If you need to evaluate the same field on multiple dimensions, run them as separate eval passes producing separate scores. Don’t mix dimensions inside one criterion.
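The invariant is cheap to enforce mechanically. A minimal sketch, assuming criteria are plain objects that name their target field (the `FieldCriterion` shape and `checkOneToOne` helper are illustrative):

```typescript
interface FieldCriterion {
  criterionName: string;
  evaluationParameter: string; // the schema field this criterion grades
}

// Return a list of violations of the 1:1 field-to-criterion invariant.
// An empty list means every field has exactly one criterion.
function checkOneToOne(schemaFields: string[], criteria: FieldCriterion[]): string[] {
  const problems: string[] = [];
  const counts: Record<string, number> = {};
  for (const f of schemaFields) counts[f] = 0;

  for (const c of criteria) {
    if (!Object.prototype.hasOwnProperty.call(counts, c.evaluationParameter)) {
      problems.push(`${c.criterionName}: targets unknown field "${c.evaluationParameter}"`);
    } else {
      counts[c.evaluationParameter] += 1;
    }
  }
  for (const f of schemaFields) {
    if (counts[f] === 0) problems.push(`${f}: no criterion`);
    if (counts[f] > 1) problems.push(`${f}: ${counts[f]} competing criteria`);
  }
  return problems;
}

const invariantProblems = checkOneToOne(
  ["issue_category", "urgency"],
  [
    { criterionName: "issue_category-accuracy", evaluationParameter: "issue_category" },
    { criterionName: "issue_category-safety", evaluationParameter: "issue_category" },
    { criterionName: "sentiment-accuracy", evaluationParameter: "sentiment" },
  ]
);
```

Run as a unit test next to the schema, a check like this fails the build when someone adds a field without a criterion or stacks a second criterion on an existing one.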
What a criterion actually looks like
The shape:
// Schema field
issue_category: z.enum(['billing', 'technical', 'account', 'feature_request']).nullable()
  .describe('Primary issue category derived from the ticket body')

// Paired criterion
{
  name: 'issue_category-accuracy',
  criteria:
    'The issue_category correctly classifies the primary complaint in the ticket body. '
    + 'Must be derived from the customer\'s own message, NOT from any quoted prior emails or signature blocks. '
    + 'When the ticket contains multiple complaints, classify by the one with the highest stated urgency. '
    + 'Use null only when no complaint can be identified (auto-response, empty body).',
  evaluationParameter: 'issue_category',
  parameterType: 'OUTPUT',
}
Five things are happening in that criterion text. Positive criterion (what good extraction looks like). Source location (where in the input the field comes from). Format constraint (the enum values). Exclusion rule (don’t pull from quoted emails or signature blocks). Edge-case handling (when to use null).
That’s an anchored rubric. We’ll come back to it.
G-eval as the default encoding
G-eval is a widely used wrapper around LLM-as-judge: a generic prompt template that takes your criterion text, the input, and the structured output, and asks the LLM to grade it according to the criterion. Each criterion in your set runs as its own G-eval call.
Start here. Off-the-shelf libraries support it. The criterion text is the dynamic part you author. The template is fixed and well-trodden. You can iterate on criteria text without rebuilding the eval harness.
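To make “fixed template, dynamic criterion” concrete, here is a rough sketch of what such a template does. This is not the template of any particular library. The wording, the `JudgeCriterion` shape, and the JSON reply format are assumptions for illustration.

```typescript
interface JudgeCriterion {
  name: string;
  criteria: string; // the part you author and iterate on
  evaluationParameter: string; // the output field being graded
}

// Fixed template: only the criterion, the input, and the field value change.
function renderJudgePrompt(
  criterion: JudgeCriterion,
  input: string,
  output: Record<string, unknown>
): string {
  const fieldValue = output[criterion.evaluationParameter] ?? null;
  return [
    "You are grading one field of a structured output against one criterion.",
    `Criterion (${criterion.name}): ${criterion.criteria}`,
    `Field under evaluation: ${criterion.evaluationParameter}`,
    `Field value: ${JSON.stringify(fieldValue)}`,
    "Input:",
    input,
    'Reply with JSON only: {"pass": true | false, "reason": "<one sentence>"}',
  ].join("\n");
}

const judgePrompt = renderJudgePrompt(
  {
    name: "issue_category-accuracy",
    criteria: "The issue_category correctly classifies the primary complaint in the ticket body.",
    evaluationParameter: "issue_category",
  },
  "I was charged twice this month.",
  { issue_category: "billing" }
);
```

Because the criterion is a plain string parameter, rewriting it never touches the harness. That separation is what a custom judge prompt later trades away.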
When to graduate to a custom LLM-as-judge prompt
There’s no specific trigger. It’s preference. The reasons to graduate:
- The generic G-eval template is adding variance to the judge’s reasoning that you can’t control
- You need the judge to apply structured analysis (chain-of-thought, intermediate reasoning) that the generic template doesn’t elicit
- The criterion has grown complex enough that “drop it into the G-eval template” loses fidelity
When you graduate, you write a bespoke judge prompt per criterion that bakes the rubric into the prompt structure instead of injecting it as a dynamic parameter. You give up the template’s generality. You gain control over the judge’s reasoning path.
Code-based checks: rare, valuable when they fit
If a criterion can be expressed as a deterministic check (schema validates, regex matches, field passes a typed assertion, date parses correctly), use code. Skip the LLM entirely. No variance. No judge cost. Runs in microseconds.
These are rare in correctness evals on rich structured outputs, because most fields require some semantic judgment. They show up most often as supporting validators: “the output schema validates” passes before the per-field criteria run. Use code where code suffices. Don’t shoehorn LLM-judge into a problem that has a regex answer.
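Two examples of criteria that need no judge at all, sketched as plain functions. The enum values come from the `issue_category` example above. The date rule assumes the `YYYY-MM-DD` form of ISO 8601, which is one possible reading of “formatted as ISO 8601.”

```typescript
const ISSUE_CATEGORIES = ["billing", "technical", "account", "feature_request"];

// True when the value is null or one of the allowed enum members.
function isValidIssueCategory(value: unknown): boolean {
  return (
    value === null ||
    (typeof value === "string" && ISSUE_CATEGORIES.indexOf(value) !== -1)
  );
}

// True for a calendar date written as YYYY-MM-DD that actually exists.
function isIsoDate(value: unknown): boolean {
  if (typeof value !== "string" || !/^\d{4}-\d{2}-\d{2}$/.test(value)) return false;
  const parsed = new Date(value + "T00:00:00Z");
  // Reject unparseable dates, and dates the runtime silently rolled over
  // (for example February 30th becoming March 1st).
  return !isNaN(parsed.getTime()) && parsed.toISOString().slice(0, 10) === value;
}
```

Checks like these run as a pre-filter: if the output fails them, there is no reason to spend a judge call on that field.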
Making the judge actually reliable
This is the section where most teams have not done the work. You have criteria, you have G-eval encoding, you run it. The scores come back. You compare runs. The scores are different on the same input from yesterday. Now what?
The variance problem, stated plainly
LLM-as-judge is non-deterministic. Same input, same rubric, same model, you get different scores between calls. A pure continuous score from an off-the-shelf judge gives you noise: you can’t tell if today’s 88% and yesterday’s 82% reflect a real change or just LLM drift.
Two failure modes feed the noise:
- Tier ambiguity in the rubric. If the rubric defines “good” with fuzzy edges, the judge slides between tiers between runs. Same output. Different verdict.
- Pure continuous scoring without any deterministic anchor. The judge produces a probability-weighted score. The probability distribution is noisy. The score reflects the distribution, not the underlying quality.
Both failure modes compound. Pure continuous score on a fuzzy rubric is the worst case. That’s where teams give up on LLM-judge and conclude “evals don’t work for us.”
Anchored rubrics
“Anchored” is the lever. A rubric is anchored when the criterion text leaves no room for the judge to drift. Four anchors matter:
- Anchored to the schema field. The criterion names the field it evaluates. One criterion. One field.
- Anchored in the input. The criterion text cites where in the input the field comes from. “Sourced from the contract preamble, not from later amendments.” If the criterion doesn’t say where to look, the judge guesses, and the guess varies between runs.
- Anchored in the output type. The criterion includes the format constraints. ISO 8601. The enum values. Decimal punctuation rules. The format is part of “good.”
- Anchored against the negative case. The criterion names what to NOT count. “Ignore the signature in the cc field.” “Do not count the bank name from the document header.” The exclusion rule is the part naive rubrics miss. It’s also where most of the variance lives, because without it the judge invents its own exclusions, differently each run.
A criterion missing any of the four anchors will have higher variance than one that includes all four. The more explicit and detailed the criterion text, the more consistent the judge.
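One way to keep yourself honest about the four anchors is to author criteria as structured records and only then render them to text. A sketch, with hypothetical field names:

```typescript
interface AnchoredCriterion {
  field: string; // anchor 1: the schema field being graded
  sourceRule: string; // anchor 2: where in the input the value comes from
  formatRule: string; // anchor 3: output type and format constraints
  exclusionRule: string; // anchor 4: what must NOT be counted
}

// List the anchors that are absent or blank in a draft criterion.
function missingAnchors(draft: Partial<AnchoredCriterion>): string[] {
  const required: (keyof AnchoredCriterion)[] = [
    "field",
    "sourceRule",
    "formatRule",
    "exclusionRule",
  ];
  return required.filter((key) => {
    const value = draft[key];
    return value === undefined || value.trim() === "";
  });
}

// Render a complete criterion into the text the judge will see.
function renderCriterionText(c: AnchoredCriterion): string {
  return [`Evaluates the field ${c.field}.`, c.sourceRule, c.formatRule, c.exclusionRule].join(" ");
}

const draftGaps = missingAnchors({
  field: "effective_date",
  sourceRule: "Sourced from the contract preamble.",
  formatRule: "Formatted as ISO 8601.",
});
```

A lint like this can’t tell whether an anchor is well written. It only catches the case where one was never written, which is the common failure for the exclusion rule.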
Binary first, continuous supplementary
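Binary scoring is close to deterministic per criterion when the criterion is anchored. The judge says True or False. Same input, same anchored rubric, and the verdict should come back the same on nearly every run. The variance test described later is how you confirm that.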
Continuous scoring is useful as a supplement. It tells you HOW good a passing output is, or HOW bad a failing output is. The gradient is the gauge. Continuous alone is where variance leaks in. Continuous on top of binary preserves the gauge without paying the variance cost.
The pattern: each criterion has a binary verdict and an optional continuous gauge. The binary drives pass/fail. The continuous gives you a sortable signal for which passing outputs are stronger and which failing outputs are worse. Aggregating the binaries gives you a checklist score that’s stable across runs. Aggregating the continuous gives you a quality gradient that lives alongside.
If you’re using an off-the-shelf LLM-judge that only emits a single continuous score, you’re getting variance with no anchor. Either wrap it in a binary verdict layer, or upgrade to a rubric that emits both.
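A small sketch of the two aggregates living side by side. The `CriterionResult` shape and the 0-to-1 gauge range are assumptions, not something a specific library prescribes.

```typescript
interface CriterionResult {
  criterion: string;
  pass: boolean; // binary verdict: drives pass/fail
  gauge?: number; // optional continuous score in [0, 1]: used for ranking only
}

interface RunSummary {
  checklistScore: number; // share of criteria that passed
  allPassed: boolean;
  meanGauge: number | null; // null when no criterion emitted a gauge
}

function summarizeResults(results: CriterionResult[]): RunSummary {
  const passed = results.filter((r) => r.pass).length;
  const gauges = results
    .map((r) => r.gauge)
    .filter((g): g is number => g !== undefined);
  const meanGauge =
    gauges.length === 0 ? null : gauges.reduce((a, b) => a + b, 0) / gauges.length;
  return {
    checklistScore: results.length === 0 ? 0 : passed / results.length,
    allPassed: passed === results.length,
    meanGauge,
  };
}

const runSummary = summarizeResults([
  { criterion: "issue_category-accuracy", pass: true, gauge: 1 },
  { criterion: "urgency-accuracy", pass: true, gauge: 0.5 },
  { criterion: "customer_id-accuracy", pass: false, gauge: 0 },
]);
```

The gauge never feeds into `checklistScore` or `allPassed`. Keeping the two numbers separate is what stops continuous noise from leaking into the pass/fail decision.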
Validation order. Variance first. Holdout second.
Once you have criteria, encoded with G-eval, with anchored rubrics, you still don’t trust the eval system. The eval can be wrong about itself. Two gates catch the failure modes, in a specific order.
Variance test, first
Run the same eval over the same dataset multiple times. Measure how much the verdict shifts between runs. If a criterion’s verdict flips on the same input between Tuesday and Wednesday, the criterion is unreliable. Fix it before doing anything else.
Variance is the silent killer. Skip this step and every downstream comparison is poisoned. The optimizer can’t separate signal from noise. A/B comparisons of prompt versions become coin flips. Regression detection mis-fires constantly. Iteration loops chase noise instead of real improvements. The team thinks it has an eval system. What it actually has is a dice roll.
When a criterion fails variance: rewrite the rubric (usually missing an anchor), force the criterion to be binary if it was continuous, or in extreme cases swap the judge model to one that’s more deterministic on that rubric shape.
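The measurement itself is simple. Below is a sketch of a per-criterion flip rate: the share of cases where repeated runs disagree with each other. The judge is injected as a plain synchronous function so the example stays self-contained. A real judge call would be asynchronous and would hit a model.

```typescript
// Stand-in for one judge call on one case. `run` is the repetition index.
type BinaryJudge = (caseId: string, run: number) => boolean;

// Share of cases whose verdict was not identical across all runs.
function flipRate(judge: BinaryJudge, caseIds: string[], runs: number): number {
  if (caseIds.length === 0) return 0;
  let unstable = 0;
  for (const id of caseIds) {
    let passes = 0;
    for (let run = 0; run < runs; run++) {
      if (judge(id, run)) passes += 1;
    }
    // Stable means all runs passed or all runs failed.
    if (passes !== 0 && passes !== runs) unstable += 1;
  }
  return unstable / caseIds.length;
}

const variantCases = ["c1", "c2", "c3", "c4"];

// Toy judges for illustration: one consistent, one that flips on case c2.
const stableJudge: BinaryJudge = (id) => id !== "c3";
const noisyJudge: BinaryJudge = (id, run) => (id === "c2" ? run % 2 === 0 : true);

const stableRate = flipRate(stableJudge, variantCases, 5);
const noisyRate = flipRate(noisyJudge, variantCases, 5);
```

Note that `stableJudge` fails case `c3` every time and still counts as stable. The variance test measures consistency, not correctness. Correctness is the holdout’s job.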
Holdout validation, second. On the entire dataset.
After variance is acceptable, you need to know that the criterion actually grades correctly. Run the eval rubric on a labeled holdout. Measure agreement with the ground-truth labels. Accuracy, F1, percentage agreement, Cohen’s kappa, whatever you prefer.
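For binary verdicts, percentage agreement and Cohen’s kappa take a few lines. A sketch, assuming the judge verdicts and ground-truth labels arrive as parallel boolean arrays:

```typescript
// Agreement between judge verdicts and ground-truth labels (both binary).
function judgeAgreement(
  judgeVerdicts: boolean[],
  truthLabels: boolean[]
): { accuracy: number; kappa: number } {
  const n = judgeVerdicts.length;
  if (n === 0 || n !== truthLabels.length) {
    throw new Error("verdicts and labels must be non-empty and the same length");
  }
  let agree = 0;
  let judgeYes = 0;
  let truthYes = 0;
  for (let i = 0; i < n; i++) {
    if (judgeVerdicts[i] === truthLabels[i]) agree += 1;
    if (judgeVerdicts[i]) judgeYes += 1;
    if (truthLabels[i]) truthYes += 1;
  }
  const observed = agree / n;
  // Agreement expected by chance, given how often each side says "pass".
  const expected =
    (judgeYes / n) * (truthYes / n) + (1 - judgeYes / n) * (1 - truthYes / n);
  // Kappa is undefined when expected agreement is 1 (both sides constant);
  // returning 1 in that case is a convention chosen for this sketch.
  const kappa = expected === 1 ? 1 : (observed - expected) / (1 - expected);
  return { accuracy: observed, kappa };
}

const holdoutStats = judgeAgreement(
  [true, true, false, false, true, false, true, false],
  [true, true, false, false, true, false, false, true]
);
// holdoutStats.accuracy is 0.75 and holdoutStats.kappa is 0.5
```

Kappa is worth reporting next to raw agreement when labels are imbalanced: a judge that always says “pass” on a dataset that is 90% passes scores 90% agreement and a kappa of zero.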
The thing that breaks teams: running holdout on the subset you used to derive the criteria. That tests whether the criteria match the data you fit them to. Not whether they generalize.
Holdout must run on the entire dataset. Reason: changing one criterion can fix the failure mode you targeted while breaking a different failure mode on a different slice of data. The fix-A-break-B regression. If you only test on the subset where you wanted A fixed, you ship the break.
Variance gates each criterion. Holdout gates the criteria-set as a whole. Both must pass before the eval system drives a downstream decision (optimizer run, deploy, automatic regression gate).
The order matters
Run holdout on a noisy criterion and the agreement numbers will be incoherent. You’ll think you have a coverage problem when you actually have a variance problem. The same eval rubric will look like it has 78% agreement today and 84% agreement tomorrow because the noise floor is moving under you.
Minimize variance first. Then validate with holdout.
Variance test gates each criterion for stability; holdout validation gates the criteria-set for correctness. Order matters: a noisy criterion makes holdout incoherent.
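The ordering can be encoded directly in the gate, so holdout is never even computed while a criterion is still noisy. A sketch with hypothetical thresholds and names:

```typescript
interface CriterionStability {
  criterion: string;
  flipRate: number; // share of cases whose verdict changed across repeated runs
}

// Hypothetical thresholds; pick values that fit your own tolerance.
interface GateThresholds {
  maxFlipRate: number;
  minAgreement: number;
}

interface GateResult {
  ok: boolean;
  stage: "variance" | "holdout" | "passed";
  detail: string;
}

function runGates(
  stability: CriterionStability[],
  measureHoldoutAgreement: () => number, // only called once variance has passed
  thresholds: GateThresholds
): GateResult {
  // Gate 1: every criterion must be stable on its own.
  const noisy = stability.filter((s) => s.flipRate > thresholds.maxFlipRate);
  if (noisy.length > 0) {
    const names = noisy.map((s) => s.criterion).join(", ");
    return { ok: false, stage: "variance", detail: `unstable criteria: ${names}` };
  }
  // Gate 2: the criteria-set as a whole must agree with the labels.
  const agreementScore = measureHoldoutAgreement();
  if (agreementScore < thresholds.minAgreement) {
    return {
      ok: false,
      stage: "holdout",
      detail: `agreement ${agreementScore} is below ${thresholds.minAgreement}`,
    };
  }
  return { ok: true, stage: "passed", detail: "both gates passed" };
}

const gateThresholds: GateThresholds = { maxFlipRate: 0.05, minAgreement: 0.85 };
let holdoutCalls = 0;
const measureOnce = (): number => {
  holdoutCalls += 1;
  return 0.9;
};

const blockedResult = runGates(
  [{ criterion: "issue_category-accuracy", flipRate: 0.2 }],
  measureOnce,
  gateThresholds
);
const callsWhileBlocked: number = holdoutCalls;
```

Passing the holdout measurement as a function rather than a number is the design choice that matters here. It makes it impossible to read an agreement figure off a criterion that hasn’t passed variance yet.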
Results: what changes when you actually run EDD
Numbers without a benchmark behind them are just storytelling. So instead, the methodology table. What changes when an engineer adopts EDD on a prompt-based feature.
| Aspect | Without EDD (vibe-testing) | With EDD |
|---|---|---|
| When evals exist | After first user complaint, sometimes never | Day one, alongside the schema |
| What gets graded | What the engineer remembers to spot-check | Every field of the structured output, every run |
| Iteration confidence | “Looks the same to me” | Pass/fail verdict per criterion, per change |
| Cross-engineer handoff | New engineer can’t recover the implicit rubric | Rubric is the contract; new engineer reads the criteria |
| Regression detection | Caught by users, weeks later | Caught by the eval on the merge that introduced it |
| Variance handling | Implicit (the engineer’s eyeballs drift between Monday and Friday) | Explicit (variance test runs before holdout) |
| Coverage growth | None (rubric is whatever the engineer remembers) | New failures join the eval set automatically |
The compounding effect is the part most teams underestimate. The eval system that runs Path C (LLM-derived criteria from marked traces) gets sharper every cycle. Failures become criteria. Criteria become regression tests. Regression tests catch the next iteration’s mistakes. Six months in, the eval system is the most valuable asset on the feature, more durable than the prompt itself.
The team without EDD spends those six months re-discovering the same failure modes, because there’s no system that remembers what already broke. Six months of vibe-testing has produced exactly zero durable knowledge. Six months of EDD has produced an eval system the team can hand to a new hire.
Key learnings
Five principles that generalize beyond Mutagent.
- Lock the schema first. The schema is the eval system’s spine. Every criterion targets a field. If the schema is loose, the eval is loose.
- One criterion per field. Strict. Cross-field aggregation is a different problem with different math. Solve it later, with intent, not by accident.
- The criterion text needs four anchors: schema field, input source, output format, negative rule. Missing the negative rule is the most common variance leak. The judge invents exclusions when you don’t specify them.
- Binary as the verdict, continuous as the gauge. Continuous alone is noise. Binary alone is information-lossy. Both together is the only encoding that keeps both stability and signal.
- Variance before holdout. Always. A noisy criterion makes holdout incoherent. Minimize variance first. Then validate.
Mutagent productizes this methodology. The library handles G-eval encoding, anchored rubric authoring, variance testing, and entire-dataset holdout validation out of the box.
Read the docs or npm install -g @mutagent/cli to try it.