Agent catalog

A team of agents for your AI agents.

Six specialized Mutagent agents across four lifecycle phases. Run them à la carte, or let the full loop run end to end. Click any agent to see exactly what it does, what it consumes, and what it produces.

Build

Turn an idea into a running, instrumented agent.

01 · Build

Spec Agent

From idea to working architecture

Describe what you need and get the right questions, a recommended architecture, a node graph, and a framework-agnostic spec with eval criteria baked in.

01 · Build

Build Agent

Scaffold and connect in minutes

Install the CLI, log in, and map your existing prompts and traces automatically. Connect to Langfuse, LangSmith, or raw OpenTelemetry. No YAML, no infrastructure changes.

Evaluate & test

Build the eval set, score every candidate, ship the winners.

02 · Evaluate & test

Dataset Agent

Build the dataset your evals run against

Curate dataset items from production traces, expert annotations, and synthetic generation. Get the hard cases labelled before changes ship. Maintain the set as the agent and your traffic evolve. Stale datasets are why "tests pass but quality drops".

02 · Evaluate & test

Evaluator Agent

Coming soon

Prove every change beats baseline before it ships

Score every candidate prompt against your evals and held-out production traces. Only candidates that beat baseline without regressing reach the merge button.
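As a rough illustration of the "beat baseline without regressing" rule, here is a minimal sketch. It assumes each candidate is summarized as one score per eval, higher is better; the function name `beats_baseline` and the `tolerance` parameter are hypothetical and not part of any Mutagent API.

```python
from typing import Dict


def beats_baseline(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.0,
) -> bool:
    """Return True if the candidate improves at least one eval and regresses none.

    Hypothetical helper: scores are per-eval numbers where higher is better.
    """
    if baseline.keys() != candidate.keys():
        raise ValueError("baseline and candidate must cover the same evals")

    improved = False
    for name, base_score in baseline.items():
        delta = candidate[name] - base_score
        if delta < -tolerance:
            # Any regression beyond the tolerance disqualifies the candidate.
            return False
        if delta > tolerance:
            improved = True
    return improved
```

A candidate that raises one eval while lowering another is rejected, which is the stricter reading of "without regressing" assumed here.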

02 · Evaluate & test

Experiment Agent

Compare candidates side by side

Compare any prompt against any dataset with any model. Score Claude vs GPT vs Gemini side by side against your eval criteria. Know exactly where you stand before optimization changes anything.

02 · Evaluate & test

Deploy Agent

Coming soon

Ship the winning prompt safely

Stage rollouts, canary on a slice of traffic, watch the key evals live, and roll back automatically if quality slips.
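The staged rollout with automatic rollback can be pictured roughly as follows. This is a sketch under stated assumptions, not the Deploy Agent's implementation: `run_canary`, the traffic fractions, and the single `min_score` threshold are all hypothetical.

```python
from typing import Iterable, Sequence


def run_canary(
    stages: Sequence[float],
    scores_per_stage: Iterable[float],
    min_score: float,
) -> float:
    """Walk through traffic fractions and return the final fraction served.

    Hypothetical sketch: each stage serves `fraction` of traffic and observes
    one eval score. A score below `min_score` rolls back to 0.0.
    """
    current = 0.0
    for fraction, score in zip(stages, scores_per_stage):
        if score < min_score:
            # Quality slipped at this stage: roll back completely.
            return 0.0
        current = fraction
    return current
```

For example, stages of 5%, 25%, and 100% with a healthy score at each step end at full traffic, while a single low score at any stage returns the rollout to zero.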

Improve

Watch production, diagnose drift, mutate, validate, ship.

03 · Improve

Incident Agent

Coming soon

Maintain reliability in production

Define acceptance thresholds once and monitor production automatically. Trigger optimization when drift is detected, validate candidates, and deploy improvements. You get a notification, not a ticket.
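One simple way to think about "define a threshold once, then detect drift" is a rolling mean over recent eval scores. The sketch below assumes that approach; `detect_drift`, the window size, and the scores are illustrative, and the actual detection method is not specified here.

```python
from collections import deque
from typing import Deque, Iterable, List


def detect_drift(
    scores: Iterable[float],
    threshold: float,
    window: int = 3,
) -> List[int]:
    """Return the indices where the rolling mean of the last `window` scores
    falls below `threshold`.

    Hypothetical helper: a rolling mean is one assumed drift signal among many.
    """
    recent: Deque[float] = deque(maxlen=window)
    alerts: List[int] = []
    for index, score in enumerate(scores):
        recent.append(score)
        # Only evaluate once the window is full, to avoid noisy early alerts.
        if len(recent) == window and sum(recent) / window < threshold:
            alerts.append(index)
    return alerts
```

Each returned index marks a point where an optimization run could be triggered.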

03 · Improve

Diagnose Agent

Find exactly what is failing

Run your entire eval set, group failures by semantic similarity, and trace each cluster back to a specific phrase or structural gap in the prompt. Ranked hypotheses, not a wall of logs.
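Grouping failures by semantic similarity can be sketched with a greedy clustering pass over embeddings. The example assumes each failing eval item already has an embedding vector; `cluster_failures`, the 0.9 similarity cutoff, and the greedy strategy are illustrative assumptions rather than the Diagnose Agent's actual method.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cluster_failures(
    embeddings: Dict[str, Sequence[float]],
    min_similarity: float = 0.9,
) -> List[List[str]]:
    """Greedily group failure ids whose embeddings are close to a cluster's
    first member, then return clusters ordered largest first."""
    clusters: List[Tuple[Sequence[float], List[str]]] = []
    for failure_id, vector in embeddings.items():
        for representative, members in clusters:
            if cosine(representative, vector) >= min_similarity:
                members.append(failure_id)
                break
        else:
            # No existing cluster is similar enough: start a new one.
            clusters.append((vector, [failure_id]))
    groups = [members for _, members in clusters]
    return sorted(groups, key=len, reverse=True)
```

Ordering clusters by size gives a crude ranking: the largest group of similar failures is the first candidate to trace back to the prompt.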

03 · Improve

Mutation Agent

Implement the change, validated, before it ships

Given a strategy from Optimize Agent, instruct the AI coding agent to mutate the prompt, tool, code, or config. Validate each candidate against the Evaluator Agent. Ship the candidates that beat baseline and abandon the rest.

03 · Improve

Auto Engineer Agent

Coming soon

Close the loop without humans in the loop

When Incident Agent flags drift, Auto Engineer kicks off Diagnose, Mutate, Evaluate, and Deploy automatically. You get notified, not paged.
