02 · Evaluate & test

Dataset Agent

Build the dataset your evals run against

Dataset Agent owns the data that everything else in the Mutagent loop scores against. It mines production traces for the hard cases, clusters them by failure mode, fills the long tail with synthetic edge cases, and routes whatever still needs a human to the right reviewer.

Because datasets drift the moment your traffic does, Dataset Agent keeps the set alive: new failure modes get pulled in continuously, stale items get retired, and every change is versioned alongside the prompt it was scored against. The Evaluator never runs against yesterday's reality.
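As a rough illustration of what versioning a dataset change alongside its prompt could look like, here is a minimal Python sketch. It is not Mutagent's actual storage format or API: the `snapshot` and `diff` helpers, the item shape (`id`, `input`), and the truncated SHA-256 version ids are all assumptions made for this example.

```python
import hashlib
import json


def content_hash(obj):
    """Stable short hash of any JSON-serializable object (hypothetical version id)."""
    payload = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def snapshot(items, prompt_text):
    """Record which dataset contents were scored against which prompt."""
    # Sort by id so the version does not depend on item order.
    ordered = sorted(items, key=lambda item: item["id"])
    return {
        "dataset_version": content_hash(ordered),
        "prompt_version": content_hash(prompt_text),
        "item_ids": [item["id"] for item in ordered],
    }


def diff(old, new):
    """List items pulled in and items retired between two snapshots."""
    old_ids, new_ids = set(old["item_ids"]), set(new["item_ids"])
    return {"added": sorted(new_ids - old_ids), "retired": sorted(old_ids - new_ids)}


prompt = "You are a support assistant."
v1 = snapshot([{"id": "a", "input": "x"}, {"id": "b", "input": "y"}], prompt)
v2 = snapshot([{"id": "b", "input": "y"}, {"id": "c", "input": "z"}], prompt)
changes = diff(v1, v2)  # item "c" added, item "a" retired
```

Hashing the sorted contents means two snapshots with the same items always share a version id, so a score can be traced back to the exact dataset and prompt pair that produced it.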

What it does

  • Mines production traces for hard cases and rare failure modes
  • Clusters traces by semantic similarity to find coverage gaps
  • Generates synthetic edge cases to fill the long tail
  • Routes ambiguous items to human reviewers with the right context
  • Versions every dataset change next to the prompts it scored against
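To make the clustering and coverage-gap idea above concrete, the following Python sketch groups traces by similarity and flags clusters that no existing eval case resembles. It is an illustration under stated assumptions, not the agent's implementation: the bag-of-words `embed` function is a toy stand-in for a real embedding model, and the greedy clustering, the 0.5 threshold, and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def cluster(texts, threshold=0.5):
    """Greedy single pass: join the first cluster whose seed is similar enough."""
    clusters = []  # list of (seed_vector, member_texts)
    for text in texts:
        vec = embed(text)
        for seed, members in clusters:
            if cosine(seed, vec) >= threshold:
                members.append(text)
                break
        else:
            clusters.append((vec, [text]))
    return clusters


def coverage_gaps(traces, eval_cases, threshold=0.5):
    """Return trace clusters that no existing eval case is similar to."""
    case_vecs = [embed(case) for case in eval_cases]
    gaps = []
    for seed, members in cluster(traces, threshold):
        if not any(cosine(seed, cv) >= threshold for cv in case_vecs):
            gaps.append(members)
    return gaps


traces = [
    "refund request for order 123",
    "refund request for order 456",
    "reset my password please",
    "please reset my password",
]
eval_cases = ["refund request for order 999"]
gaps = coverage_gaps(traces, eval_cases)  # only the password cluster is uncovered
```

A production system would likely use learned embeddings and a more robust clustering method; the point here is only the shape of the computation: cluster the traces, then compare each cluster against what the eval set already covers.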

Inputs

  • Production traces
  • Existing eval cases
  • Reviewer feedback
  • Synthetic-generation policy

Outputs

  • Curated eval dataset
  • Coverage report
  • Per-cluster labelled samples

Works with

  • Langfuse
  • LangSmith
  • Braintrust
  • Phoenix

Try Dataset Agent today

Install the CLI and run this agent against your own evals in under five minutes.
