02 · Evaluate & test
Dataset Agent
Build the dataset your evals run against
Dataset Agent owns the data that everything else in the Mutagent loop scores against. It mines production traces for the hard cases, clusters them by failure mode, fills the long tail with synthetic edge cases, and routes whatever still needs a human to the right reviewer.
Because datasets drift the moment your traffic does, Dataset Agent keeps the set alive: new failure modes get pulled in continuously, stale items get retired, and every change is versioned alongside the prompt it was scored against. The Evaluator never runs against yesterday's reality.
What it does
- Mines production traces for hard cases and rare failure modes
- Clusters traces by semantic similarity to find coverage gaps
- Generates synthetic edge cases to fill the long tail
- Routes ambiguous items to human reviewers with the right context
- Versions every dataset change next to the prompts it scored against
Inputs
- Production traces
- Existing eval cases
- Reviewer feedback
- Synthetic-generation policy
Outputs
- Curated eval dataset
- Coverage report
- Per-cluster labelled samples
Works with
Try Dataset Agent today
Install the CLI and run this agent against your own evals in under five minutes.
Try Dataset Agent