02 · Evaluate & test · Coming soon

Evaluator Agent

Prove every change beats baseline before it ships

Evaluator Agent is the gate between the Mutation Agent's candidates and your repository. It re-runs the full eval suite, scores against held-out production traces, and runs adversarial probes for safety regressions.

A candidate that passes the Evaluator comes with a signed evidence pack — eval deltas, latency, cost, and a side-by-side diff — so the human reviewer sees the receipts, not just the prompt.
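As an illustration of what a signed evidence pack could look like, here is a minimal sketch. The page does not say how the Evaluator Agent signs packs, so the scheme (HMAC-SHA256 over canonical JSON), the field names, and the helpers `sign_evidence_pack` and `verify_evidence_pack` are all assumptions made for this example.

```python
import hashlib
import hmac
import json


def sign_evidence_pack(pack: dict, key: bytes) -> str:
    """Return a hex HMAC-SHA256 over a canonical JSON encoding of the pack.

    Assumption: the real signing scheme is not documented; HMAC is used
    here only because it is available in the standard library.
    """
    # Sorted keys and fixed separators give the same bytes for equal packs.
    canonical = json.dumps(pack, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hmac.new(key, canonical, hashlib.sha256).hexdigest()


def verify_evidence_pack(pack: dict, key: bytes, signature: str) -> bool:
    """Check a signature in constant time, so reviewers can detect tampering."""
    expected = sign_evidence_pack(pack, key)
    return hmac.compare_digest(expected, signature)


# Hypothetical pack with the four kinds of evidence named above:
# eval deltas, latency, cost, and a side-by-side diff.
pack = {
    "candidate_id": "cand-001",
    "eval_deltas": {"accuracy": 0.03, "safety": 0.0},
    "latency_ms_delta": -12,
    "cost_usd_delta": 0.0004,
    "diff": "- old prompt line\n+ new prompt line",
}
key = b"demo-key-not-for-production"
signature = sign_evidence_pack(pack, key)
```

Any change to the pack after signing, such as an edited latency figure, makes `verify_evidence_pack` return `False`.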

What it does

  • Re-runs the full eval suite against held-out traces
  • Runs adversarial probes for safety and prompt-injection regressions
  • Produces a side-by-side diff and score-delta report
  • Signs the evidence pack so the PR is reviewable in minutes
  • Auto-rejects any candidate that regresses on a blocked criterion
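
The gating behavior in the list above can be sketched roughly as follows. This is not the product's implementation: the `Verdict` type, the `evaluate_candidate` function, the `tolerance` parameter, and the convention that a higher score is better are all assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Verdict:
    passed: bool
    deltas: dict
    reasons: list = field(default_factory=list)


def evaluate_candidate(baseline: dict, candidate: dict,
                       blocked_criteria: set, tolerance: float = 0.0) -> Verdict:
    """Compare per-criterion scores (assumed: higher is better).

    A candidate is rejected if it regresses on any blocked criterion,
    or if it fails to improve on at least one criterion.
    """
    # Score delta per criterion: positive means the candidate is better.
    deltas = {name: candidate[name] - baseline[name] for name in baseline}
    reasons = []

    # Hard gate: any regression on a blocked criterion rejects the candidate.
    for name in sorted(blocked_criteria):
        if deltas[name] < -tolerance:
            reasons.append(f"regression on blocked criterion '{name}': {deltas[name]:+.3f}")

    # Soft gate: the candidate has to beat baseline somewhere.
    if not reasons and not any(d > tolerance for d in deltas.values()):
        reasons.append("no criterion improved over baseline")

    return Verdict(passed=not reasons, deltas=deltas, reasons=reasons)


# Hypothetical scores on a held-out suite.
baseline = {"accuracy": 0.80, "safety": 0.95}
good = {"accuracy": 0.85, "safety": 0.95}
bad = {"accuracy": 0.90, "safety": 0.90}

good_verdict = evaluate_candidate(baseline, good, blocked_criteria={"safety"})
bad_verdict = evaluate_candidate(baseline, bad, blocked_criteria={"safety"})
```

Here `bad` scores higher on accuracy but is still rejected, because the regression on the blocked `safety` criterion overrides gains elsewhere.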

Inputs

  • Mutation Agent candidates
  • Held-out eval suite
  • Safety probes

Outputs

  • Signed evidence pack
  • Pass/fail verdict per candidate
  • Pull request body

Works with

  • Braintrust
  • Phoenix
  • MLflow
  • GitHub

Get early access to Evaluator Agent

Join the early-access list and we will reach out the moment this agent ships.
