02 · Evaluate & test
Experiment Agent
Compare candidates side by side
Experiment Agent answers the question every team asks before they ship: which prompt, which model, on which dataset, against which evals? It runs the full matrix in parallel and surfaces a single comparison view ranked by your criteria.
Use it to pick a baseline before optimization, validate a prompt mutation against a held-out set, or pressure-test a model swap.
What it does
- Run any matrix of prompts × datasets × models
- Score side by side against your eval criteria
- Surface cost, latency, and accuracy in one view
- Export comparison reports for stakeholders
- Support Claude, GPT, Gemini, Llama, and local models through plug-ins
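To make the matrix idea concrete, here is a minimal Python sketch of a prompt × model × dataset run that is scored and ranked. It is not Experiment Agent's API: `run_cell`, `run_matrix`, `fake_model`, the exact-match scoring, and the sample data are all hypothetical stand-ins chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Result:
    prompt: str
    model: str
    dataset: str
    accuracy: float


def run_cell(prompt, model, dataset_name, rows, call_model):
    """Score one (prompt, model, dataset) cell by exact match (assumed metric)."""
    hits = sum(call_model(model, prompt.format(input=x)) == y for x, y in rows)
    return Result(prompt, model, dataset_name, hits / len(rows))


def run_matrix(prompts, models, datasets, call_model):
    """Run every combination in parallel and rank by accuracy, best first."""
    cells = list(product(prompts, models, datasets.items()))
    with ThreadPoolExecutor() as pool:
        futures = [
            pool.submit(run_cell, p, m, name, rows, call_model)
            for p, m, (name, rows) in cells
        ]
        results = [f.result() for f in futures]
    return sorted(results, key=lambda r: r.accuracy, reverse=True)


def fake_model(model, text):
    # Hypothetical stand-in for a real provider call.
    return text.upper() if model == "model-a" else text


prompts = ["{input}", "say {input}"]
models = ["model-a", "model-b"]
datasets = {"upper": [("hi", "HI"), ("ok", "OK")]}  # non-empty rows assumed

ranked = run_matrix(prompts, models, datasets, fake_model)
for r in ranked:
    print(f"{r.accuracy:.2f}  {r.model}  {r.prompt!r}  {r.dataset}")
```

The sketch ranks on accuracy alone. A real comparison would also record cost and latency for each cell so that all three can be weighed in one view.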
Inputs
- Prompt candidates
- Eval datasets
- Model provider credentials
Outputs
- Comparison report
- Cost and latency breakdown
- Per-eval score deltas
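As a rough illustration of a per-eval score delta, the sketch below subtracts a baseline run's score from a candidate's score for each eval the two share. The function name `score_deltas`, the eval names, and the scores are hypothetical, and the actual report format may differ.

```python
def score_deltas(baseline, candidate):
    """Return candidate minus baseline for every eval present in both runs."""
    return {
        name: round(candidate[name] - baseline[name], 4)
        for name in baseline
        if name in candidate
    }


# Made-up scores for illustration only.
baseline = {"accuracy": 0.80, "faithfulness": 0.90}
candidate = {"accuracy": 0.85, "faithfulness": 0.88}

deltas = score_deltas(baseline, candidate)
for name, delta in deltas.items():
    print(f"{name}: {delta:+.2f}")
```

A positive delta means the candidate beat the baseline on that eval, and a negative delta flags a regression to check before shipping.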
Try Experiment Agent today
Install the CLI and run this agent against your own evals in under five minutes.