02 · Evaluate & test

Experiment Agent

Compare candidates side by side

Experiment Agent answers the question every team asks before they ship: which prompt, which model, on which dataset, against which evals? It runs the full matrix in parallel and surfaces a single comparison view ranked by your criteria.

Use it to pick a baseline before optimization, validate a prompt mutation against a held-out set, or pressure-test a model swap.

What it does

  • Run any prompt × any dataset × any model matrix
  • Score candidates side by side against your eval criteria
  • Surface cost, latency, and accuracy in one view
  • Export comparison reports for stakeholders
  • Support Claude, GPT, Gemini, Llama, and local models through plug-ins
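
To make the matrix idea concrete, here is a minimal Python sketch of a prompt × model × dataset run with side-by-side ranking. It is an illustration only, not Experiment Agent's actual API: `Result`, `run_cell`, `run_matrix`, and `fake_call_model` are hypothetical names, the exact-match scoring stands in for your own eval criteria, and the cost and latency figures are made up by the stub.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Result:
    prompt: str
    model: str
    dataset: str
    accuracy: float
    cost_usd: float
    latency_s: float


def run_cell(prompt, model, dataset, call_model):
    """Run one prompt/model pair over one dataset and aggregate its metrics."""
    name, rows = dataset
    correct, cost, latency = 0, 0.0, 0.0
    for question, expected in rows:
        answer, call_cost, call_latency = call_model(model, prompt, question)
        correct += int(answer == expected)  # exact-match eval, for illustration
        cost += call_cost
        latency += call_latency
    return Result(prompt, model, name, correct / len(rows), cost, latency / len(rows))


def run_matrix(prompts, models, datasets, call_model, max_workers=8):
    """Run every prompt x model x dataset cell in parallel.

    Results are ranked by accuracy (highest first), then by total cost.
    """
    cells = list(product(prompts, models, datasets.items()))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(lambda cell: run_cell(*cell, call_model), cells))
    return sorted(results, key=lambda r: (-r.accuracy, r.cost_usd))


def fake_call_model(model, prompt, question):
    """Hypothetical stand-in for a provider call: (answer, cost, latency)."""
    answer = question.upper() if "upper" in prompt else question
    cost = 0.002 if model == "large" else 0.001
    return answer, cost, 0.1
```

In practice you would replace `fake_call_model` with a function that calls your provider and swap the exact-match check for whatever eval you rank on; the ranking key is likewise just one possible choice.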

Inputs

  • Prompt candidates
  • Eval datasets
  • Model provider credentials

Outputs

  • Comparison report
  • Cost and latency breakdown
  • Per-eval score deltas
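
As a rough illustration of what a per-eval score delta means, the sketch below subtracts a baseline's scores from a candidate's, eval by eval. The function name `score_deltas` and the dictionary layout are assumptions made for this example, not the format of the actual report.

```python
def score_deltas(baseline, candidate):
    """Return candidate minus baseline for each eval present in both runs."""
    return {
        name: round(candidate[name] - baseline[name], 4)
        for name in baseline
        if name in candidate
    }
```

A positive delta means the candidate beat the baseline on that eval; evals that appear in only one of the two runs are skipped rather than guessed at.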

Works with

Braintrust
Phoenix
W&B
MLflow

Try Experiment Agent today

Install the CLI and run this agent against your own evals in under five minutes.

See an experiment run