The variance floor of LLM-as-judge: what it does to your optimizer
A controlled replication study of three prompt optimizers on FinanceQA 150. The 5.46pp LLM-judge variance floor, and how it shapes acceptance-gate behavior.
Run an LLM-as-judge metric three times on the same prompt over the same dataset. You get three different scores. We measured 0.5767, 0.6007, and 0.6313 on identical input. That spread, 5.46 percentage points wide, is the variance floor that sits under every prompt-optimizer claim made on a free-text Q/A benchmark.
This post is a controlled replication study of three prompt optimizers on a 150-item FinanceQA benchmark. The three are Mutagent, Opik HRPO, and Opik GEPA. Inputs held identical: dataset, execution model, evaluation model, judge rubric, cold-start prompt, optimization budget. The single question we set out to answer: under those identical inputs, how much measurable answer-quality improvement does each optimizer deliver, and at what cost?
The post is organized in five chapters. We start with the variance problem in LLM-as-judge evaluation, which motivates everything that follows. We then describe how the three optimizers under comparison work, the experiment design, the results, and a structural reading of why two of the three optimizers produced 0.00% real uplift on this benchmark. We close with scope and planned follow-up work.
The variance problem in LLM-as-judge evaluation
Three runs, three scores
LLM-as-judge metrics are not deterministic, even at temperature zero. The same prompt evaluated against the same dataset produces different scores across runs. This is not theoretical. We measured it directly inside the benchmark.
During a GEPA run that, as it turned out, never actually mutated the prompt, the same unchanged seed prompt was evaluated against the same FinanceQA 150 dataset three times, by the same judge model at the same temperature.
| Run | Score |
|---|---|
| 1 | 0.5767 |
| 2 | 0.6007 |
| 3 | 0.6313 |
| Range | 5.46 pp on identical input |
| Mean | 0.603 |
| σ (population) | ≈ 2.2 pp |
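The summary rows of the table can be reproduced from the three scores alone. A minimal sketch using only the standard library; the variable names are ours:

```python
from statistics import mean, pstdev

# The three scores reported above for the unchanged seed prompt.
scores = [0.5767, 0.6007, 0.6313]

spread_pp = (max(scores) - min(scores)) * 100  # range, in percentage points
mean_score = mean(scores)
sigma_pp = pstdev(scores) * 100                # population standard deviation

print(f"range = {spread_pp:.2f} pp")  # range = 5.46 pp
print(f"mean  = {mean_score:.3f}")    # mean  = 0.603
print(f"sigma = {sigma_pp:.1f} pp")   # sigma = 2.2 pp
```

With only three runs, the range and σ are rough estimates of the judge's run-to-run variability rather than tight bounds.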
Measuring the variance floor
The 5.46 percentage points of spread on identical input is the variance floor for this benchmark. Any reported optimizer uplift below this floor is statistically indistinguishable from re-evaluation noise on the same prompt.
xychart-beta
title "LLM-judge score on identical input, three runs"
x-axis [Run1, Run2, Run3]
y-axis "Score" 0.55 --> 0.65
bar [0.5767, 0.6007, 0.6313]
line [0.603, 0.603, 0.603]
The horizontal line marks the mean (0.603). The bars span 5.46 percentage points without any change to the underlying prompt. From internal observation across multiple cold-start tuning runs, real mutation improvements on free-text Q/A often fall in the 5 to 15 percentage point range absolute. The variance floor sits inside that range. Any optimizer that compares aggregated scores on free-text Q/A is making decisions across this floor every time the gate fires.
Why a temperature-zero judge is non-deterministic
Temperature controls per-token sampling. Inside a batched evaluation harness, three other sources of variation remain.
- Sampling pool order. The order in which items are submitted to the judge can affect the model’s internal state across a batch.
- Context-window packing. Many providers pack multiple requests into a single context window for throughput. The packing arrangement varies across batches.
- Provider-side routing. The same logical model name can be served by different replicas or different hardware between calls.
Each source contributes a small amount of variance per call. Over 150 items, the variances accumulate into the 5.46 pp spread.
Interpretation rule. Any reported optimizer uplift below 5.46 percentage points on this benchmark is statistically indistinguishable from re-evaluation noise on the same prompt.
Building a non-noisy metric
A controlled comparison requires a metric that is identical across all three optimizers and that gives the judge enough structure to grade reliably. Opik ships with two stock metrics. Both were considered and rejected, for reasons that are structural, not preference.
LevenshteinRatio. A character-level edit distance ratio between the model output and the reference answer. Two semantically equivalent answers to a FinanceQA question can share less than 30 percent character overlap and still both be correct. For example, "The FY2022 net income was $5.43B" and "$5.43 billion" both correctly answer the same question. LevenshteinRatio scores them at roughly 0.10 against each other. The metric is fit for short canonical-answer tasks where the gold is a single number or a single word. It is unfit for free-text Q/A.
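To see why a character-level ratio penalizes equivalent answers, here is a minimal pure-Python sketch. The normalization used (one minus edit distance over the longer string's length) is an assumption for illustration; Opik's LevenshteinRatio may normalize differently, so the exact value can differ from the figure quoted above.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: insertions, deletions, substitutions cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


long_answer = "The FY2022 net income was $5.43B"
short_answer = "$5.43 billion"
ratio = similarity(long_answer, short_answer)
print(f"{ratio:.2f}")
```

Under this normalization the two correct answers land near the bottom of the scale, even though a human grader would mark both as right.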
Stock GEval. Opik’s stock GEval method has the following signature, verbatim from the public source:
def score(self, output: str, **ignored_kwargs) -> ScoreResult: ...
The signature is auditable in the linked source: there is no expected_output parameter on score(). The **ignored_kwargs swallows whatever a benchmark harness tries to pass, including expected_output, reference, and any other gold-answer parameter. The judge LLM sees only the model output and a free-form criterion description. It has no programmatic access to the labeled correct answer.
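The effect of a catch-all keyword parameter is easy to demonstrate in isolation. The class below is a hypothetical stand-in that copies only the shape of the signature; it is not Opik's implementation and calls no judge.

```python
class StockStyleMetric:
    """Hypothetical stand-in mirroring only the signature shape shown above."""

    def score(self, output: str, **ignored_kwargs) -> dict:
        # Nothing in ignored_kwargs is read, so a gold answer passed by the
        # harness can never reach whatever the judge is shown.
        return {
            "judge_sees": {"output": output},
            "dropped": sorted(ignored_kwargs),
        }


result = StockStyleMetric().score(
    output="$5.43 billion",
    expected_output="$5.43B",
    reference="FY2022 10-K",
)
print(result["judge_sees"])  # {'output': '$5.43 billion'}
print(result["dropped"])     # ['expected_output', 'reference']
```

The call succeeds without complaint, which is exactly why the omission is easy to miss in a benchmark harness.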
The consequence: with stock GEval, the optimizer’s improvement gate is asking “is this output plausible to an LLM” rather than “is this output correct”. On any dataset with labeled gold answers, those are categorically different questions.
We replaced both Opik defaults with a custom metric that calls LiteLLM directly with a 10-tier rubric. Same judge model, gemini-3-flash-preview. Same temperature, zero. Same JSON parsing logic across all three optimizers.
The rubric:
- 1.0: Exact match (numeric within 1%, same conclusion)
- 0.9: Near-perfect (within 1%, minor phrasing diff)
- 0.8: Strong with minor gap (within 2%, one metric missing)
- 0.7: Good but incomplete (within 5%, missing 1-2 metrics)
- 0.6: Adequate (within 10%, correct but shallow)
- 0.5: Partially correct (within 20%, directionally right)
- 0.3: Wrong answer (opposite conclusion or wrong number)
- 0.1: Fundamentally wrong (different metric entirely)
- 0.0: No valid answer
CRITICAL: Wrong Yes/No conclusion = max 0.3 regardless of detail.
Each item produces a continuous score in the [0, 1] range plus a reason string explaining the score. The reason field is required by HRPO for its root-cause analysis path; we kept it for parity even where the other optimizers do not consume it.
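A shared parsing step is what keeps the score and reason comparable across optimizers. The sketch below assumes the judge replies with a JSON object holding `score` and `reason` keys; those key names and the helper `parse_judge_reply` are illustrative, not taken from our harness.

```python
import json


def parse_judge_reply(raw: str) -> tuple[float, str]:
    """Parse a judge reply into (score, reason), clamping the score to [0, 1]."""
    payload = json.loads(raw)
    score = min(1.0, max(0.0, float(payload["score"])))
    reason = str(payload.get("reason", "")).strip()
    if not reason:
        # The reason string is required downstream, so fail loudly here.
        raise ValueError("judge reply has no reason string")
    return score, reason


score, reason = parse_judge_reply(
    '{"score": 0.7, "reason": "Within 5%, one metric missing."}'
)
print(score, reason)
```

Rejecting replies without a reason keeps every optimizer on the same footing, including the ones that never read the field.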
Why the percentage thresholds are reliable. A careful reader will ask: how can the judge distinguish “within 1%” from “within 2%” from “within 5%” reliably on free-text financial answers? Asking an LLM to make absolute unit-precise discriminations is a hard problem on its own.
The rubric does not rely on the numerical threshold alone. Each tier ships with anchored examples: a worked sample of a question, a gold answer, and a candidate output, labeled with the tier the candidate deserves and the reasoning that justifies the label. The judge sees the threshold and the worked anchors together, then maps the candidate under evaluation to the closest one. Tier boundaries become graded similarity comparisons (“is this candidate closer to the 0.9 anchor or the 0.8 anchor?”) rather than absolute discriminations (“is this answer literally within 1% or within 2%?”). LLM judges are reliable at the former. They are unreliable at the latter.
The 5.46 pp variance floor we measured earlier in this chapter is the post-calibration floor. It is the floor with per-tier anchors, with the inverted-Yes/No penalty, with the temperature pinned at zero, and with a judge stronger than the executor. Without these calibrations the variance widens substantially, which is generally accepted in LLM-as-judge practice and is why stock metrics that ship without per-tier anchored examples produce noisier improvement signals at every gate. The rubric is one half of why this benchmark is comparable across optimizers. The variance floor we report is the other half.
Why this is the only configuration where comparison is possible. Substituting either Opik default would worsen Opik’s measurable result on this benchmark, not improve it. Handing an optimizer a metric it cannot meaningfully optimize against is unfair in the opposite direction. The calibrated rubric, with anchored tier examples, is what makes the comparison actually controlled.
Optimizers under comparison
Opik HRPO: reflective single-mutation per trial
HRPO is described in Opik’s documentation as the Hierarchical Reflective Prompt Optimizer. Per trial, its loop is:
- Sample reflection examples from prior failures.
- Call the optimizer LLM to propose an improved system prompt.
- Evaluate the new prompt on the full dataset.
- Adopt the new prompt if the sum of scores improves over the baseline.
The mutation surface is restricted to the system role by default, configurable to all roles via the optimize_prompts parameter. The acceptance gate is a continuous sum-of-scores comparison.
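A continuous sum-of-scores gate reduces to one comparison. The sketch below is our simplified reading of that gate, not HRPO's code. It feeds the gate the three mean scores measured earlier on the unchanged seed prompt, converted to sums over 150 items.

```python
N_ITEMS = 150


def accept_by_sum(candidate_sum: float, baseline_sum: float) -> bool:
    """Continuous gate: adopt only if the candidate's summed score is higher."""
    return candidate_sum > baseline_sum


# Mean scores of the SAME unchanged prompt across three runs, as sums.
run_sums = [m * N_ITEMS for m in (0.5767, 0.6007, 0.6313)]

# Treat run 1 as the "baseline" and run 3 as the "candidate".
same_prompt_accepted = accept_by_sum(run_sums[2], run_sums[0])
# Swap the roles and the identical prompt is rejected.
same_prompt_rejected = not accept_by_sum(run_sums[0], run_sums[2])
print(same_prompt_accepted, same_prompt_rejected)  # True True
```

The gate's verdict on an unchanged prompt depends only on which noisy evaluation happened to serve as the baseline.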
Documented design intent, per Opik’s HRPO docs: HRPO is for prompts “that have already gone through a few rounds of manual prompt engineering”. We tested it on a cold-start prompt because Mutagent targets cold-start explicitly, and a general-purpose prompt optimizer is reasonably expected to produce some signal on a bare prompt with clear failure traces. The cold-start framing is a known caveat we revisit in scope.
sequenceDiagram
participant H as HRPO
participant L as Optimizer LLM
participant E as Eval harness (full 150)
participant J as Judge LLM (T=0)
H->>L: Reflect on prior failures
L-->>H: Proposed system prompt
H->>E: Score new prompt
E->>J: Per-item judging
J-->>E: Continuous scores
E-->>H: Sum of scores
Note over H: Gate compares sum-new vs sum-baseline
alt sum-new > sum-baseline
H-->>H: Adopt
else
H-->>H: Reject, return baseline
end
Opik GEPA: minibatch reflection plus Pareto front
GEPA wraps an external research algorithm (GEPA, short for Genetic-Pareto) for Opik. Opik’s docs describe it as suited for “single-turn task…reflection-driven search”. Per trial, its loop is:
- Sample a small minibatch of items.
- Evaluate the current best prompt on the minibatch.
- Generate a mutated candidate via reflection on minibatch failures.
- Adopt the candidate if the sum of scores on the minibatch improves.
- Periodically re-evaluate surviving candidates on a validation set, maintaining a Pareto front of best-on-each-item.
- Return the candidate with the best validation score.
The acceptance gate at the per-trial step is a continuous sum-of-scores comparison on the minibatch. The Pareto-front maintenance is layered on top.
How the minibatch is sampled. GEPA’s reflection_minibatch_size parameter defaults to 3. The minibatch is drawn at random from the dataset. On this 150-item benchmark that is 3 items, or 2% of the data, per trial. The sampler does not balance across passing items and failing items, nor across question categories, nor across difficulty bands. Whether the sampled three items reflect the failure-mode distribution of the full dataset is not enforced; on any given trial they may over-represent easy cases, hard cases, or any one failure mode by chance.
Why this allows regressions. The acceptance gate then compares sum-of-scores on this minibatch only. A mutation candidate that improves the three sampled items but regresses fifty unsampled items is accepted. The gate has no visibility into the unsampled portion of the dataset. GEPA’s Pareto-front maintenance partially compensates by re-evaluating surviving candidates on the validation set periodically, but the per-trial gate that controls each candidate’s acceptance into the front is the minibatch-only comparison. This is the structural mechanism by which a candidate that overfits a small sample can be adopted into the front, then later (correctly) found to be inferior on the full set, but only after it has been propagated.
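The regression mechanism can be shown with a toy example. The per-item scores below are illustrative, not measured data, and `gate_on` is a simplified stand-in for a sum-of-scores gate rather than GEPA's implementation.

```python
def gate_on(indices, candidate, current) -> bool:
    """Accept if the candidate's summed score beats the current prompt on `indices`."""
    return sum(candidate[i] for i in indices) > sum(current[i] for i in indices)


N = 150
current = [0.6] * N        # illustrative per-item scores, not measured data
candidate = list(current)
for i in range(3):         # the candidate improves the 3 sampled items...
    candidate[i] = 0.9
for i in range(3, 53):     # ...and regresses 50 unsampled items
    candidate[i] = 0.3

minibatch = [0, 1, 2]
accepted_on_minibatch = gate_on(minibatch, candidate, current)
accepted_on_full_set = gate_on(range(N), candidate, current)
print(accepted_on_minibatch, accepted_on_full_set)  # True False
```

The same candidate passes the minibatch gate and fails the full-dataset comparison, because the gate never sees the 50 regressed items.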
Both Opik gates compare aggregated continuous LLM-judge scores against each other. The accept/reject decision is a single floating-point comparison.
sequenceDiagram
participant G as GEPA
participant M as Minibatch sampler
participant L as Optimizer LLM
participant E as Eval harness
participant J as Judge LLM (T=0)
G->>M: Sample minibatch (size 3)
M-->>G: 3 items
G->>E: Score current best on minibatch
E->>J: Judge minibatch outputs
J-->>E: Continuous scores
E-->>G: Sum of minibatch scores
G->>L: Reflect on minibatch failures
L-->>G: Mutated candidate
G->>E: Score candidate on minibatch
E-->>G: New sum
Note over G: Gate compares sum-candidate vs sum-current on minibatch
G->>E: Periodic full-valset Pareto-front maintenance
Mutagent: gate at primitive level
Mutagent operates on the same cold-start prompt under the same models with the same calibrated rubric. Its sampling and acceptance behavior differs from a continuous-sum-on-minibatch comparison in four named properties.
Mixed-class sampling, weighted to extremes. Each iteration draws its sample from both the passing class and the failing class, weighted to the extremes of the score distribution: roughly the bottom 20% (worst-performing items, the failure cluster) plus the top 20% (best-performing items, the success cluster). The middle 60% is held in reserve. This is different from minibatch optimizers that draw uniformly from the full dataset: those see a slice that may not contain the failure modes worth fixing, and they cannot tell what success patterns are worth preserving.
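Sampling from the extremes can be sketched as a rank-and-slice step. This is an illustration of the idea under our own naming, not Mutagent's implementation; the scores are synthetic.

```python
def sample_extremes(scores: dict[str, float], percent: int = 20):
    """Split item ids into (failure cluster, success cluster, reserve) by rank."""
    ranked = sorted(scores, key=scores.get)  # worst-scoring items first
    k = len(ranked) * percent // 100
    if k == 0:
        return [], [], ranked
    return ranked[:k], ranked[-k:], ranked[k:-k]


# Synthetic scores for 150 items; real values would come from the judge.
scores = {f"item-{i:03d}": (i % 11) / 10 for i in range(150)}
failures, successes, reserve = sample_extremes(scores)
print(len(failures), len(successes), len(reserve))  # 30 30 90
```

On 150 items the split is 30 failures, 30 successes, and 90 items held in reserve.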
Per-criterion discrete verdicts. Each output field is scored against multiple criteria with both a continuous 0-to-1 score and a discrete pass/fail-class verdict per criterion. Continuous judge noise collapses to discrete verdict noise, which has a substantially lower floor.
Full-dataset evaluation per candidate. Every mutation candidate is scored on all items in the dataset, not on the bottom-and-top slice sampled at iteration start. Per-item noise averages out across N items, and the candidate is judged against the dataset it actually has to perform on, not the slice it was generated from.
Item-level regression check. For a mutation to be adopted, no item may drop from the passing class to the failing class. The gate operates on the discrete pass/fail boundary, which is robust to small continuous-score perturbations.
The combined effect is iteration-loop avoidance with an explicit termination condition. With the success class held in view at sampling time and the full-dataset regression check at gate time, the optimizer cannot oscillate between fixing one failure mode and breaking another. Each iteration’s adopted candidate must clear new failures without sacrificing items that were already working. The iteration goal is explicit: continue until the full-dataset regression check produces no failing items. In practice a small number of iterations is sufficient on a well-defined cold-start dataset.
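The item-level regression check described above can be sketched as a discrete gate. The 0.95 boundary is borrowed from the success threshold quoted in the results and is an assumption about where the pass/fail line sits; the function names are ours.

```python
PASS_THRESHOLD = 0.95  # assumed pass/fail boundary for this sketch


def regressions(before: dict[str, float], after: dict[str, float]) -> list[str]:
    """Items that were passing before the mutation and are failing after it."""
    return [
        item for item, old in before.items()
        if old >= PASS_THRESHOLD and after[item] < PASS_THRESHOLD
    ]


def adopt(before: dict[str, float], after: dict[str, float]) -> bool:
    """Discrete gate: adopt only if no item crosses from pass to fail."""
    return not regressions(before, after)


before = {"q1": 0.97, "q2": 0.60, "q3": 0.99}
jitter = {"q1": 0.96, "q2": 0.62, "q3": 1.00}  # small score noise, no class change
broken = {"q1": 0.96, "q2": 0.98, "q3": 0.70}  # fixes q2 but breaks q3
print(adopt(before, jitter), adopt(before, broken))  # True False
```

Small continuous perturbations leave the verdict unchanged. Only a class change on some item blocks adoption.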
sequenceDiagram
participant S as Sampler
participant M as Mutation generator
participant E as Eval harness, full dataset
participant J as Judge LLM, temp 0
participant G as Gate
Note over S: Iteration N
S->>S: Score full dataset, rank by score
S->>M: Bottom 20 percent failures and top 20 percent successes
M-->>S: Mutation candidate
S->>E: Score candidate on full dataset
E->>J: Per-item judging
J-->>E: Continuous score and per-criterion verdicts
E-->>G: Full-dataset scoring
Note over G: Item-level regression check
alt no item drops to failing class
G-->>S: Adopt, re-rank, continue
else any item regresses
G-->>S: Reject, sample again
end
Note over S: Loop until zero failing items remain
Experiment design
Models, and why
| Role | Model | Provider | Temp |
|---|---|---|---|
| Execution (target LLM whose prompt is optimized) | gemini-2.5-flash-lite | Google AI | 0 |
| Evaluation (LLM-as-judge) | gemini-3-flash-preview | Google AI | 0 |
| Optimization (mutation generation) | gemini-3-flash-preview | Google AI | 0 |
The execution model is the LLM whose prompt is being optimized. We chose gemini-2.5-flash-lite deliberately, and not the strongest available frontier model, because a stronger executor leaves less room for prompt optimization to matter. The weaker the executor, the more headroom for an optimizer to produce a measurable improvement.
The evaluation and optimization roles use gemini-3-flash-preview. It is stronger than the executor, so the judge is capable enough to grade nuanced free-text answers, and the optimizer can generate mutation candidates that read as credible to a careful reader. Temperature is pinned at zero across all three roles, for reproducibility.
The choice of dataset matters as much as the choice of model. Public benchmarks split into two rough classes for the question we are asking. Knowledge benchmarks like MMLU, GPQA, and GSM8K measure what a model has memorized or what reasoning it can do over common training data. Each release of a frontier model eats more of the available headroom. The model is the moving variable on these benchmarks. The prompt is increasingly not. In-context-learning benchmarks like FinanceQA, LegalBench, and RagBench give the model evidence in the prompt context, then ask it to reason over that evidence. The answer is not in the model’s weights. It is in the context. This is where the prompt is the load-bearing variable.
quadrantChart
title Public benchmarks by optimization headroom
x-axis Knowledge in weights --> Knowledge in context
y-axis Saturation high --> Headroom durable
quadrant-1 Prompt-optimizable
quadrant-2 Niche, prompt-optimizable
quadrant-3 Saturated, knowledge-bench
quadrant-4 Knowledge-bench
MMLU: [0.15, 0.2]
GPQA: [0.2, 0.25]
GSM8K: [0.1, 0.2]
FinanceQA: [0.8, 0.8]
LegalBench: [0.75, 0.7]
RagBench: [0.7, 0.7]
Where current optimizer benchmarks live, and why it matters. Opik publishes optimizer results on four benchmarks: Arc (multiple-choice science), GSM8K (grade-school math word problems), RagBench (retrieval-oriented Q/A), and MedHallu (medical hallucination checks). All four are discrete-label tasks where pass/fail is binary. Opik reports dramatic gains on these benchmarks: HRPO takes Arc from 1.69 percent baseline to 92.70 percent final; an evolutionary optimizer takes RagBench from 9.81 percent to 92.00 percent. These results demonstrate that the optimizers produce real signal on tasks of this shape.
The shape matters. On discrete-label tasks the optimizer’s primary job is output-format coercion: getting the model to emit a valid multiple-choice letter or a numeric answer in a parseable format. Once the output shape is established, the model’s pretrained knowledge handles the answers. The acceptance gates calibrated for these tasks operate on a clean win/lose signal where the answer is correct or it is not.
The upstream GEPA paper adds a second category: multi-hop reasoning on knowledge-heavy QA, including HotpotQA and IFEval. On a current frontier model evaluated at temperature zero, baseline scores on saturated knowledge benchmarks like HotpotQA already sit above 90 percent. The available optimization headroom shrinks to roughly 10 percentage points, and the LLM-judge variance floor measured in chapter 1 consumes more than half of that. Reported uplifts of 1 to 5 percentage points on saturated benchmarks are statistically indistinguishable from re-evaluation noise.
Free-text domain Q/A on a deliberately weak execution model is a third regime, distinct from both. The cold-start baseline on FinanceQA sits in the 0.62 to 0.77 range on gemini-2.5-flash-lite (the spread is per-optimizer harness sampling variance). This leaves more than 20 percentage points of headroom for an optimizer to clear the variance floor with margin. The optimizer’s job here is not output-format coercion. It is domain-knowledge specialization: getting the model to reason correctly across SEC filing structures, unit conversions, and conditional financial logic. Acceptance gates calibrated for discrete-label tasks operate on a different signal than this benchmark requires. Whether they transfer is an empirical question, and this benchmark is designed to surface that question.
The four model and dataset choices together define what this benchmark is measuring: prompt-mediated improvement of a fixed model on a domain where the prompt has work to do, in a regime that existing optimizer benchmarks do not directly cover.
Dataset: FinanceQA
FinanceQA (also published as FinanceBench by Patronus AI) is a public benchmark of corporate financial filings. Each item is a question grounded in 10K to 50K tokens of evidence drawn from SEC 10-K and 10-Q reports. Answers span numerical extraction, qualitative analysis, and multi-step reasoning across financial statements. Items cover nine major US companies and five question categories.
| Property | Why it matters |
|---|---|
| Free-text answers | No binary pass/fail shortcut. The optimizer must produce structurally improved prompts to move the score. |
| Long evidence context | Stress-tests prompt efficiency, not just semantic understanding. The prompt has to direct the model through the evidence. |
| Diverse question types | Reduces single-skill overfitting. An optimizer cannot win by tuning to one question shape. |
| Public, citable | Anyone can re-run the benchmark and reproduce the inputs. |
For cross-domain validation we additionally ran a 356-item slice of LegalBench (three contract-NLI tasks, binary entailment classification). That result is in the discussion chapter.
Cold-start prompt: starting from nothing
Each optimizer is given the same starting prompt: structurally empty, with no domain knowledge, no task framing, no examples. The intent is to test whether each system can specialize a useful prompt out of nothing.
For Mutagent and Opik HRPO, the starting prompt is fully bare. In standard API messages format:
messages = [
{"role": "system", "content": "."},
{"role": "user", "content": "{input}"},
]
The system role contains a single period because Opik rejects literally empty system content; the period is the minimum string the framework will accept. The user role contains only the input placeholder. There is no instruction, no persona, no domain context, no constraint on output format. The model is given a question and nothing else.
For Opik GEPA, the starting prompt requires a slightly seeded variant:
messages = [
{"role": "system", "content": "."},
{"role": "user", "content": "Answer the question.\n\n{input}"},
]
The "Answer the question." prefix is a disclosed asymmetry. GEPA’s mutation engine cannot operate on a literal {input} placeholder alone; it requires at least one sentence of seed text that it can mutate. We added the minimum perturbation that makes GEPA functional. This works in GEPA’s favor compared to Mutagent and HRPO, which start from a fully bare prompt. We still measure 0.00% real uplift on GEPA; the seed-text caveat does not save the result.
Why a bare cold-start. A useful prompt-optimization benchmark separates the optimizer’s contribution from any prompt-engineering already baked into the seed. Tests that start from a hand-tuned prompt are measuring mutation refinement, not specialization. Tests that start from nothing are measuring whether the optimizer can build domain knowledge into a prompt where there was none. The bare cold-start is the harder, more diagnostic setting, and it is the setting most relevant to teams who want a system that can produce a useful prompt without a human prompt engineer in the loop.
Quirks encountered on Opik’s cold-start path. Both Opik optimizers ship with safety gates that interact unexpectedly with cold-start framing.
GEPA’s skip_perfect_score parameter (default True with perfect_score = 1.0) skips reflection entirely when all minibatch scores meet or exceed the perfect-score threshold. On a cold-start dataset where the bare prompt happens to score perfectly on a small random minibatch (3 items by default), this path triggers and zero mutations are proposed for that trial. We bypassed it by setting skip_perfect_score=False. This is undocumented in the public Opik API; we found it by reading the upstream gepa package source.
GEPA additionally fails silently if reflection_minibatch_size exceeds max_trials. The reflection engine never fires, no error is raised, and the run consumes wall-clock without producing any candidate. We hit this on a first-pass configuration before identifying the warning.
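Both quirks can be caught before a run with a small pre-flight check. The sketch below works on a plain dictionary whose keys reuse the parameter names mentioned above. The `preflight` helper and the example values are hypothetical, and nothing here calls Opik or the gepa package.

```python
def preflight(config: dict) -> list[str]:
    """Return human-readable problems with a GEPA-style cold-start configuration."""
    problems = []
    if config.get("skip_perfect_score", True):  # the default is reported as True
        problems.append(
            "skip_perfect_score is on: a perfectly scoring minibatch "
            "produces no reflection for that trial"
        )
    if config["reflection_minibatch_size"] > config["max_trials"]:
        problems.append(
            "reflection_minibatch_size exceeds max_trials: "
            "reflection may never fire"
        )
    return problems


# Illustrative values only, not the configuration used in our runs.
risky = {"reflection_minibatch_size": 3, "max_trials": 1}
fixed = {"reflection_minibatch_size": 3, "max_trials": 3, "skip_perfect_score": False}
print(preflight(risky))
print(preflight(fixed))  # []
```

Running a check like this before spending budget turns a silent no-op into an explicit message.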
HRPO’s documented design intent is for prompts that have already gone through manual prompt engineering. The cold-start framing tests it outside its design envelope. We document this as a deliberate choice, not a defect of HRPO’s: a general-purpose prompt optimizer should still produce some signal on a bare prompt with clear failure traces, and Mutagent targets cold-start explicitly. The result speaks to whether a controlled comparison on cold-start is informative for teams choosing an optimizer.
Budget, controls, hypothesis
This benchmark holds every controllable input identical across the three optimizers and varies only the optimizer architecture itself. The intent is a measurement we can defend against the questions a careful reader will ask first: was the comparison fair, were the metrics the same, was the budget the same.
| Variable | Held identical across all three? |
|---|---|
| Dataset | ✓ |
| Execution model | ✓ |
| Evaluation model | ✓ |
| Evaluation rubric | ✓ (10-tier, calibrated) |
| Starting prompt | ✓ (modulo a disclosed GEPA seed-text caveat) |
| Optimization budget | ✓ (1 iteration / 1 trial) |
Hypothesis. An optimizer whose acceptance gate compares aggregated continuous LLM-judge scores will be unable to reliably distinguish real mutation signal from evaluation noise on free-text Q/A, because the variance floor measured in chapter 1 sits inside the typical mutation-signal range. An optimizer whose gate operates on a different signal will not have this exposure.
What would falsify the hypothesis: an optimizer with a continuous sum-of-scores gate producing measurable uplift above the variance floor on this benchmark. What would support it: continuous-sum gates returning the seed prompt unchanged, and a different-gate optimizer producing measurable uplift well above the floor, on the same inputs.
Results
Side-by-side
All three optimizers were run on the same FinanceQA 150 benchmark, with the same models, the same calibrated 10-tier rubric, and a single-iteration / single-trial budget. LLM call counts and costs were measured directly via per-call instrumentation. No estimates.
| Dimension | Mutagent | Opik HRPO | Opik GEPA |
|---|---|---|---|
| Cold-start G-Eval | 0.7733 | 0.6187 | 0.6320 |
| Final G-Eval | 0.8633 | 0.6187 | 0.6320 |
| Real uplift (relative) | +11.6% | 0.00% | 0.00% |
| Real uplift (absolute) | +9.0 pp | 0 pp | 0 pp |
| Above 5.46 pp variance floor? | Yes | At floor by definition | At floor by definition |
| Mutations generated | 5 | 2 | 3 |
| Mutations adopted as final | 5 | 0 | 0 |
| Final prompt = cold-start? | No | Yes | Yes |
| LLM calls (1 iter / trial) | ≈ 700 | 908 | 2,626 |
| Cost per run | $2.86 | $0.6553 | $1.9226 |
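The uplift rows follow directly from the cold-start and final scores in the table. A minimal sketch; the helper name `uplift` is ours:

```python
FLOOR_PP = 5.46  # variance floor measured on identical input

# (cold-start score, final score) per optimizer, from the table above.
runs = {
    "Mutagent": (0.7733, 0.8633),
    "Opik HRPO": (0.6187, 0.6187),
    "Opik GEPA": (0.6320, 0.6320),
}


def uplift(cold: float, final: float) -> tuple[float, float]:
    """Return (absolute uplift in pp, relative uplift in percent)."""
    return (final - cold) * 100, (final - cold) / cold * 100


for name, (cold, final) in runs.items():
    abs_pp, rel_pct = uplift(cold, final)
    print(f"{name}: {abs_pp:+.1f} pp, {rel_pct:+.1f}%, above floor: {abs_pp > FLOOR_PP}")
```

The floor is expressed in percentage points, so it is the absolute uplift, not the relative one, that is compared against it.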
xychart-beta
title "Real uplift (absolute pp) with 5.46pp variance floor"
x-axis [Mutagent, "Opik HRPO", "Opik GEPA"]
y-axis "Absolute uplift (pp)" -1 --> 11
bar [9.0, 0, 0]
line [5.46, 5.46, 5.46]
The bars show absolute percentage-point uplift per optimizer. The horizontal line at 5.46 pp marks the LLM-judge variance floor. Mutagent’s bar at 9.0 pp clears the floor with margin. The two Opik optimizers sit at 0 pp: each returned its seed prompt unchanged, so there is no uplift to measure against the floor.
Cold-start divergence note. Cold-start scores diverge across the three optimizers despite the identical prompt and execution model because each optimizer’s evaluation harness samples and orchestrates execution calls differently. Each per-optimizer cold-start is treated as that optimizer’s own baseline for relative-uplift comparison. The LLM-call counts and costs are directly comparable because they measure the same 150-item workload.
Per-optimizer detail
Mutagent. Five mutations were adopted into the final prompt. Zero items dropped to a worse score across the run. Fourteen items that had previously been below the 0.95 success threshold crossed it. The +9.0 pp absolute uplift (+11.6% relative) sits above the 5.46 pp variance floor with margin.
Opik HRPO. Two mutation candidates were generated during the trial. Both were rejected by HRPO’s own acceptance gate, on its own evaluation of the held-out dataset. The best candidate scored 0.5567 against the baseline 0.6187. HRPO returned the cold-start prompt unchanged. Stop reason in the HRPO log: “No improvement”. 908 LLM calls, $0.6553.
Opik GEPA. Three mutation candidates were generated across three trials. The Pareto-front aggregate score climbed during the run, from 0.6320 to 0.7220 to 0.7380. This is not a deployable result. The Pareto aggregate is the upper-bound score across multiple candidates’ best per-item performance. It is not the score of any single prompt that an operator could ship. GEPA’s selected single best candidate, the prompt that gets returned as “OPTIMIZED PROMPT”, never beat the cold-start. Final prompt: identical to the cold-start. 2,626 LLM calls, $1.9226.
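The gap between a Pareto aggregate and a shippable prompt is easiest to see on a toy example. The scores below are illustrative, not measured data, and the aggregation shown (mean of the best per-item score across prompts) is a simplified reading of the upper bound described above; GEPA's exact computation may differ.

```python
# Illustrative per-item scores for a seed and two candidates (not measured data).
per_item = {
    "seed":        [0.8, 0.6, 0.7, 0.5],
    "candidate_a": [0.9, 0.3, 0.6, 0.4],
    "candidate_b": [0.4, 0.8, 0.5, 0.6],
}


def mean(xs):
    return sum(xs) / len(xs)


single_prompt_means = {name: mean(s) for name, s in per_item.items()}
# Best score on each item across ALL prompts: an upper bound, not a prompt.
pareto_aggregate = mean([max(col) for col in zip(*per_item.values())])
best_single = max(single_prompt_means, key=single_prompt_means.get)
print(best_single, round(single_prompt_means[best_single], 3), round(pareto_aggregate, 3))
# seed 0.65 0.75
```

The aggregate rises to 0.75 even though neither candidate beats the seed's 0.65, so the returned prompt is still the seed.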
Cost-per-percentage-point of measured uplift
Cost-per-percentage-point of measured uplift is the most operationally meaningful efficiency metric for prompt optimization. It folds the LLM-call budget and the result together.
| Optimizer | Cost / pp real uplift |
|---|---|
| Mutagent | ≈ $0.32 / pp |
| Opik HRPO | undefined (no measurable uplift) |
| Opik GEPA | undefined (no measurable uplift) |
A defined cost-per-pp implies the optimizer produced a deployable deliverable that justifies the spend. An undefined cost-per-pp means the spend purchased a re-evaluation of the seed prompt, not an optimization.
GEPA’s cost is approximately 2.9 times HRPO’s at the same null result. The differential is structural: GEPA’s per-trial loop performs minibatch evaluation, mutation evaluation, and periodic Pareto-front re-evaluation against the validation set. HRPO’s per-trial loop performs one baseline pass and one mutation evaluation. Same null outcome, different per-trial overhead.
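The efficiency figures can be recomputed from the cost and uplift rows of the results table. This sketch divides by the absolute uplift in percentage points; dividing by the relative figure (11.6) instead would give roughly $0.25, so the choice of denominator matters. The helper name `cost_per_pp` is ours.

```python
cost_usd = {"Mutagent": 2.86, "Opik HRPO": 0.6553, "Opik GEPA": 1.9226}
absolute_uplift_pp = {"Mutagent": 9.0, "Opik HRPO": 0.0, "Opik GEPA": 0.0}


def cost_per_pp(cost: float, uplift_pp: float):
    """Dollars per absolute percentage point; None when there is no uplift."""
    return cost / uplift_pp if uplift_pp > 0 else None


per_pp = {
    name: cost_per_pp(cost_usd[name], absolute_uplift_pp[name])
    for name in cost_usd
}
gepa_vs_hrpo = cost_usd["Opik GEPA"] / cost_usd["Opik HRPO"]

print({k: (round(v, 2) if v is not None else None) for k, v in per_pp.items()})
print(round(gepa_vs_hrpo, 1))  # 2.9
```

Returning `None` rather than infinity for the zero-uplift runs mirrors the "undefined" entries in the table.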
Discussion
Why continuous-sum gates produce 0.00%
The results, read against the variance floor, point to a structural property rather than a parameter-tuning issue.
Both HRPO and GEPA generated mutation candidates. Both rejected those candidates on their own acceptance gates. The gates are continuous sum-of-scores comparisons. The scores being compared are LLM-judge outputs with a 5.46 pp variance floor on identical input.
A real mutation that improves a subset of items and slightly regresses another can produce no aggregate improvement signal above the floor. The gate cannot tell that mutation apart from evaluation noise on the unchanged seed. It is correctly conservative: it preserves the seed when it cannot resolve the difference. The cost is that real signal gets rejected alongside noise.
A second pattern is visible in the rejected candidates themselves. The mutations HRPO produced on this benchmark were generic instruction-style additions: a “financial data extraction expert” system-role prefix and a related variant. These framings did not surface or address the specific structural failure modes present in the dataset, including unit conversion across millions and billions, balance-sheet versus cash-flow disambiguation, and conditional-prompt handling for items where requested values are unavailable. Even setting aside the variance floor, a generic prefix is unlikely to produce per-item improvement on a benchmark whose failures are structural rather than tonal. GEPA’s mutations had a similar character. The optimizer can only propose what its reflection step surfaces from the failures it can see; with a small minibatch and continuous-aggregate signal, the structural failure modes are obscured before reflection runs. The two pressures compound: mutation candidates are generic at generation, then the gate cannot resolve them above noise at evaluation. Either pressure alone would impair the run; together they produce 0.00% by construction.
Per-optimizer drill-down: how each gate fails this benchmark. The two pressures interact differently with each optimizer’s specific gate.
HRPO, evidence-based. HRPO’s documented acceptance gate is a greedy improvement comparison: the new prompt is adopted if its sum-of-scores on the held-out evaluation exceeds the baseline. Full-dataset evaluation is a strength of HRPO’s gate: per-item noise averages out across 150 items rather than across a small sample. The structural limit is the candidate cadence. HRPO produces one or two candidates per trial, and the gate fires once per candidate. With a 5.46 pp variance floor, each candidate has to clear the baseline by roughly that margin to be distinguishable from re-evaluation noise on the unchanged seed. In our run HRPO produced two candidates, one scoring 0.5567 and the other a related variant, against the 0.6187 baseline. Both fell below the baseline, well outside the floor in the wrong direction (0.5567 is 6.2 pp under 0.6187), and were correctly rejected. To clear the floor in the right direction in a single shot, a generic instruction-style prefix would need to produce a roughly +5.46 pp aggregate jump on first generation. Without per-failure analysis surfacing the specific structural failure modes, a jump of this magnitude is empirically rare on a cold-start free-text benchmark.
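The resolution argument above can be sketched in a few lines. The sketch takes the three seed re-evaluation scores reported earlier in this post, treats their spread as the floor, and classifies a candidate's delta against it. The helper name `classify_delta` and its three-way outcome are illustrative assumptions; they are not part of HRPO or any Opik API.

```python
# Illustrative sketch only: `classify_delta` is a hypothetical helper,
# not an Opik or HRPO API.

# Three evaluations of the unchanged seed prompt, as reported in this post.
SEED_RUNS = [0.5767, 0.6007, 0.6313]

# Variance floor taken as the spread (max - min) of those runs.
FLOOR = max(SEED_RUNS) - min(SEED_RUNS)


def classify_delta(candidate: float, baseline: float, floor: float = FLOOR) -> str:
    """Classify a candidate score against the baseline and the noise floor."""
    delta = candidate - baseline
    if abs(delta) < floor:
        # Indistinguishable from re-evaluating the unchanged seed.
        return "unresolved"
    return "better" if delta > 0 else "worse"


if __name__ == "__main__":
    print(f"floor = {FLOOR * 100:.2f} pp")
    # HRPO candidate score vs baseline, both taken from the run above.
    print(classify_delta(0.5567, 0.6187))
    # Hypothetical candidate with a small gain that sits inside the floor.
    print(classify_delta(0.6400, 0.6187))
```

A single-shot gate that only adopts on "better" needs a candidate to land more than one floor-width above the baseline, which is the magnitude the paragraph above describes as rare for a generic prefix.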
GEPA, evidence-based. GEPA’s per-trial acceptance gate compares sum-of-scores on the same minibatch the mutation was derived from. Default reflection_minibatch_size = 3. The standard error of a mean score scales as 1/√N, so the noise on a 3-item minibatch mean is substantially higher than the 5.46 pp full-dataset floor we measured: by a factor of roughly √(150/3) ≈ 7 if per-item judge noise is independent. The gate becomes dominated by minibatch sampling variance: on any given minibatch, the same mutation can win or lose depending on which 3 items were sampled. Pareto-front maintenance partially compensates by re-evaluating surviving candidates on the validation set, but the per-trial gate that controls each candidate’s adoption into the front runs on minibatch-only signal. In our run GEPA’s Pareto-front aggregate climbed from 0.6320 to 0.7220 to 0.7380 across three trials. The climb is real as a per-item upper bound across multiple candidates, but no single candidate beat the seed on the full dataset. The per-trial gate selects which candidates to maintain from a noisier signal than the floor we measured at full-dataset scale, which is why the climb does not translate to a deployable winner.
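A toy example may help show how a front aggregate can climb while no single candidate beats the seed. It assumes the aggregate is the mean of the best per-item score achieved by any candidate on the front, which is our reading of "per-item upper bound" and not a statement about GEPA's internals. All scores and candidate names below are invented for illustration.

```python
# Illustrative sketch with made-up scores; not GEPA's implementation.
from statistics import mean

# Per-item scores for the seed and two hypothetical candidates on 4 items.
SCORES = {
    "seed": [0.8, 0.4, 0.6, 0.6],
    "cand_a": [0.4, 0.9, 0.5, 0.4],
    "cand_b": [0.5, 0.3, 0.9, 0.6],
}


def front_aggregate(scores: dict) -> float:
    """Mean of the best score reached on each item by any candidate."""
    per_item_best = [max(column) for column in zip(*scores.values())]
    return mean(per_item_best)


def best_single(scores: dict) -> tuple:
    """Name and mean score of the best individual candidate."""
    name = max(scores, key=lambda key: mean(scores[key]))
    return name, mean(scores[name])


if __name__ == "__main__":
    print("front aggregate:", round(front_aggregate(SCORES), 3))
    print("best single candidate:", best_single(SCORES))
```

In this toy data the aggregate is well above every individual mean, yet the seed remains the best single prompt. That is the same shape as the climb described above: the aggregate measures what the candidates cover jointly, not what any one of them would deliver if deployed.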
The gates are working as designed in both cases. The design is calibrated for a different evaluation regime: discrete-label tasks where pass/fail is binary and the variance floor on small samples is much lower. Opik’s own published optimizer benchmarks (Arc, GSM8K, RagBench, MedHallu) are exactly this regime. On free-text Q/A scored by an LLM judge with a continuous rubric, both gates collapse under the variance floor we measured.
stateDiagram-v2
[*] --> Evaluate
Evaluate --> AggregateScore : sum continuous scores
AggregateScore --> Compare
Compare --> InsideFloor : delta below 5.46 pp
Compare --> AboveFloor : delta at or above 5.46 pp
InsideFloor --> Reject : cannot resolve signal vs noise
AboveFloor --> Decide
Decide --> Adopt
Decide --> Reject
Adopt --> [*]
Reject --> [*]
| Gate type | Rejects regressions reliably? | Accepts wins-without-regression reliably? |
|---|---|---|
| Continuous sum-of-scores (HRPO, GEPA) | Sometimes | Sometimes |
| Per-criterion discrete + item-level regression (Mutagent) | Yes, above the floor | Yes, above the floor |
This is a structural property of continuous-aggregate gates on free-text Q/A scored by an LLM judge. It is independent of parameter tuning. We ran HRPO and GEPA under their best session-discovered configurations: parallel concurrency tuned, configurable acceptance gates relaxed, all available mutation surfaces unlocked. The structural constraint produces the same 0.00% result the default configuration would.
Mutagent’s gate combines per-criterion discrete verdicts, full-dataset evaluation per candidate, and an item-level regression check on the discrete pass/fail boundary. These three properties together move the gate signal above the LLM-judge variance floor, whereas continuous-sum gates operate at the floor. The pipeline that produces mutation candidates is out of scope for this report.
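To make the three properties concrete, here is a minimal sketch of what a gate of this shape could look like. This post does not publish Mutagent's gate logic, so every name and rule below is an assumption: an item is taken to pass only when all of its rubric criteria pass, and a candidate is adopted only when at least one item flips from fail to pass and none flips from pass to fail.

```python
# Hypothetical sketch: the post names the three properties but not the
# implementation, so every name and rule below is an assumption.
from typing import Dict, List

# Item id -> per-criterion pass/fail verdicts over the full dataset.
Verdicts = Dict[str, List[bool]]


def item_passes(criteria: List[bool]) -> bool:
    """Assumed rule: an item passes only if every rubric criterion passes."""
    return all(criteria)


def regression_gate(baseline: Verdicts, candidate: Verdicts) -> dict:
    """Compare discrete per-item outcomes instead of summed continuous scores."""
    gained = [
        item for item in baseline
        if not item_passes(baseline[item]) and item_passes(candidate[item])
    ]
    regressed = [
        item for item in baseline
        if item_passes(baseline[item]) and not item_passes(candidate[item])
    ]
    return {
        "gained": gained,
        "regressed": regressed,
        # Adopt only on at least one gain and zero item-level regressions.
        "adopt": bool(gained) and not regressed,
    }


if __name__ == "__main__":
    baseline = {"q1": [True, True], "q2": [True, False], "q3": [False, False]}
    candidate = {"q1": [True, True], "q2": [True, True], "q3": [False, True]}
    print(regression_gate(baseline, candidate))
```

The design point is that the decision depends on items crossing a discrete boundary, so small continuous score drift on items that stay on the same side of that boundary does not move the gate.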
We walk through the diagnose-before-you-mutate primitive separately, where the same constraint shapes how a benchmark should classify failures before scoring optimizer uplift. The replication study here is the receipt; the methodology behind it is the longer story.
Cross-domain check: LegalBench
The pattern shown on FinanceQA generalizes when we change the dataset. We ran the same Mutagent configuration on a 356-item slice of LegalBench: three contract-NLI tasks requiring binary entailment classification of NDA clauses. Different domain (law vs finance), different output format (Yes/No vs free-text), different difficulty profile (short clauses vs 10K-token filings).
| Metric | Value |
|---|---|
| Cold-start G-Eval | 0.8778 |
| After 1 iteration | 0.9213 |
| Real uplift (relative) | +4.96% |
| Items crossing ≥ 0.95 success threshold | +24 / 356 |
| Items regressing | 0 |
The lower uplift compared to FinanceQA’s +11.6% is consistent with the higher cold-start baseline: there is less headroom for improvement when the cold start is already at 0.878. Same optimizer, same models, same rubric. The pattern transfers across dataset structure and across domain.
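For readers checking the LegalBench table, the arithmetic behind the uplift row is short. The sketch below uses only the figures from the table; the variable names are ours.

```python
# Figures taken from the LegalBench table above; variable names are illustrative.
cold_start = 0.8778
after_one_iteration = 0.9213
items_crossing = 24
total_items = 356

# Absolute change in score, expressed in percentage points.
absolute_pp = (after_one_iteration - cold_start) * 100

# Relative uplift: the change as a fraction of the cold-start score.
relative_pct = (after_one_iteration - cold_start) / cold_start * 100

# Share of the slice that newly crossed the success threshold.
crossing_share_pct = items_crossing / total_items * 100

if __name__ == "__main__":
    print(f"absolute: {absolute_pp:.2f} pp")
    print(f"relative: {relative_pct:.2f} %")
    print(f"items crossing: {crossing_share_pct:.2f} % of the slice")
```

The table reports the relative figure. The absolute change is smaller in percentage points, which is worth keeping in mind when comparing it against the 5.46 pp floor measured on FinanceQA.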
Scope
Several methodological choices in this study deserve explicit framing for any reader planning their own replication.
No held-out test set. All three optimizers were evaluated on the same 150 items they “optimized” against. All three had equal opportunity to overfit. The relative comparison is fair. The absolute uplift numbers may be optimistic.
Single provider, single model family. All execution, evaluation, and optimization calls use Google Gemini models. The LLM-judge variance floor is provider-dependent. Results may differ on Anthropic, OpenAI, or AWS Bedrock model variants.
Single iteration, single trial. Multi-iteration regimes are out of scope for this report. HRPO and GEPA may recover some uplift over multiple trials. Mutagent’s multi-iteration uplift saturates within 2 to 3 iterations on internal runs.
Cold-start framing tests HRPO outside its documented use case. Opik’s HRPO docs describe it as a tool for prompts that have already gone through manual prompt engineering. The 0.00% measured here is in part a scoping artefact for HRPO.
Any optimizer claim on a free-text Q/A benchmark sits on top of an LLM-as-judge variance distribution. The reader’s question is whether the reported uplift sits above that distribution, or is the distribution.