02 · Evaluate & test

Experiment Agent

Compare candidates side by side

Experiment Agent answers the question every team asks before they ship: which prompt, which model, on which dataset, against which evals? It runs the full matrix in parallel and surfaces a single comparison view ranked by your criteria.

Use it to pick a baseline before optimization, validate a prompt mutation against a held-out set, or pressure-test a model swap.

What it does

Run any matrix of prompts × datasets × models
Score side by side against your eval criteria
Surface cost, latency, and accuracy in one view
Export comparison reports for stakeholders
Support Claude, GPT, Gemini, Llama, and local models through plug-ins
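
To make the matrix idea concrete, here is a minimal Python sketch of the general technique: enumerate every prompt × dataset × model combination, evaluate the cells in parallel, and rank the results. It is not Experiment Agent's API. Every name in it (`Result`, `run_matrix`, `fake_evaluate`, the model labels, and the numbers) is a hypothetical stand-in, and the ranking rule (accuracy first, then cost, then latency) is only one possible choice of criteria.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Result:
    prompt: str
    dataset: str
    model: str
    accuracy: float   # fraction of eval cases passed, 0..1
    cost_usd: float   # total spend for the cell
    latency_s: float  # mean seconds per case


def run_cell(cell, evaluate):
    """Run one (prompt, dataset, model) combination via the supplied callable."""
    prompt, dataset, model = cell
    accuracy, cost_usd, latency_s = evaluate(prompt, dataset, model)
    return Result(prompt, dataset, model, accuracy, cost_usd, latency_s)


def run_matrix(prompts, datasets, models, evaluate, max_workers=8):
    """Evaluate every combination in parallel and return the ranked results."""
    cells = list(product(prompts, datasets, models))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(lambda c: run_cell(c, evaluate), cells))
    # Rank: highest accuracy first, then cheapest, then fastest.
    return sorted(results, key=lambda r: (-r.accuracy, r.cost_usd, r.latency_s))


def fake_evaluate(prompt, dataset, model):
    """Hypothetical stand-in for a real model call plus scoring."""
    table = {
        ("v1", "small-model"): (0.80, 0.10, 0.5),
        ("v1", "large-model"): (0.90, 0.40, 1.2),
        ("v2", "small-model"): (0.90, 0.12, 0.6),
        ("v2", "large-model"): (0.95, 0.45, 1.3),
    }
    return table[(prompt, model)]


ranked = run_matrix(
    ["v1", "v2"], ["held-out"], ["small-model", "large-model"], fake_evaluate
)
for r in ranked:
    print(r.prompt, r.model, r.accuracy, r.cost_usd, r.latency_s)
```

In this sketch the two cells tied at 0.90 accuracy are separated by cost, so the cheaper small-model run ranks above the large-model one. Swapping the sort key is all it takes to rank by a different criterion.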

Inputs

Prompt candidates
Eval datasets
Model provider credentials

Outputs

Comparison report
Cost and latency breakdown
Per-eval score deltas
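
As an illustration of what a per-eval score delta means, the Python sketch below subtracts a baseline's scores from a candidate's, eval by eval, and flags regressions. The function name, eval names, and scores are all hypothetical. The report the agent actually produces may be structured differently.

```python
def score_deltas(baseline, candidate):
    """Return candidate minus baseline for every eval present in both runs."""
    shared = baseline.keys() & candidate.keys()
    # Round to 4 places so float noise does not show up in the report.
    return {name: round(candidate[name] - baseline[name], 4) for name in sorted(shared)}


# Hypothetical scores for two prompt candidates on the same dataset.
baseline = {"exact_match": 0.82, "groundedness": 0.91, "tone": 0.75}
candidate = {"exact_match": 0.86, "groundedness": 0.88, "tone": 0.75}

deltas = score_deltas(baseline, candidate)
regressions = [name for name, delta in deltas.items() if delta < 0]

print(deltas)
print(regressions)
```

A negative delta marks an eval where the candidate does worse than the baseline, which is the signal to check before accepting a prompt mutation or a model swap.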

Works with

Braintrust
Phoenix
W&B
MLflow

Try Experiment Agent today

Install the CLI and run this agent against your own evals in under five minutes.

See an experiment run
