03 · Improve

Diagnose Agent

Find exactly what is failing

Diagnose Agent turns a sea of red eval results into a short, ranked list of root causes. It clusters failures by semantic similarity, traces each cluster back to a specific phrase or structural gap in the prompt (such as a missing tool description), and ranks the causes by impact and confidence so you know what to fix first.

No more eyeballing logs at 2 a.m. Every diagnosis comes with sample traces, the failing eval cases, and a confidence score.

What it does

  • Runs your full eval suite against any prompt version
  • Clusters failures by semantic similarity
  • Traces each cluster back to a specific prompt phrase or missing tool description
  • Ranks root causes by impact and confidence
  • Surfaces representative sample traces for each cluster
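
To make the clustering step concrete, here is a minimal sketch in Python. It is not Diagnose Agent's implementation: the function names, the 0.5 threshold, and the failure messages are all hypothetical, and a bag-of-words vector stands in for whatever embedding model a real system would use.

```python
import math
import re
from collections import Counter


def embed(text):
    # ASSUMPTION: a bag-of-words count vector stands in for a real embedding model.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def cluster_failures(messages, threshold=0.5):
    # Greedy single-pass clustering: each message joins the first cluster
    # whose seed message it resembles, otherwise it starts a new cluster.
    clusters = []  # list of (seed_vector, member_messages)
    for msg in messages:
        vec = embed(msg)
        for seed, members in clusters:
            if cosine(seed, vec) >= threshold:
                members.append(msg)
                break
        else:
            clusters.append((vec, [msg]))
    # Largest clusters first, as a rough proxy for impact.
    return sorted((members for _, members in clusters), key=len, reverse=True)


# Hypothetical failure messages, invented for illustration.
failures = [
    "tool call missing required argument city",
    "tool call missing required argument date",
    "answer cites a refund policy that does not exist",
    "tool call missing required argument user id",
]
clusters = cluster_failures(failures)
for members in clusters:
    print(len(members), "failure(s), e.g.:", members[0])
```

The three "missing required argument" failures land in one cluster and the refund-policy failure in another, which is the kind of grouping that points at a single missing tool description rather than three separate bugs.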

Inputs

  • Eval dataset
  • Production traces
  • Current prompts and tool definitions

Outputs

  • Ranked root-cause list
  • Failure clusters with sample traces
  • Impact estimates
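
As a rough picture of what a ranked root-cause list could look like in code, the sketch below models each root cause as a small record and sorts by impact times confidence. The `RootCause` type, its fields, the scoring formula, and the example values are all assumptions made for illustration; the actual output schema and ranking logic are not specified here.

```python
from dataclasses import dataclass, field


@dataclass
class RootCause:
    # Hypothetical record shape, not the product's actual schema.
    summary: str
    failing_cases: int   # eval cases in this failure cluster
    total_cases: int     # eval cases in the whole suite
    confidence: float    # 0.0 to 1.0
    sample_traces: list = field(default_factory=list)

    @property
    def impact(self):
        # ASSUMPTION: impact is the share of the suite this cause breaks.
        return self.failing_cases / self.total_cases


def rank(causes):
    # ASSUMPTION: score = impact * confidence; a real ranker may weigh these differently.
    return sorted(causes, key=lambda c: c.impact * c.confidence, reverse=True)


# Made-up example values.
causes = [
    RootCause("Tool description omits required 'date' argument", 12, 200, 0.9),
    RootCause("Ambiguous phrase 'be brief' truncates answers", 30, 200, 0.4),
    RootCause("No instruction for refund questions", 5, 200, 0.95),
]
ranked = rank(causes)
for cause in ranked:
    print(f"{cause.impact:.1%} impact, {cause.confidence:.0%} confidence: {cause.summary}")
```

Multiplying the two numbers keeps a widespread but uncertain cause from being buried under a near-certain but rare one, and vice versa.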

Works with

  • Langfuse
  • LangSmith
  • Datadog
  • Braintrust

Try Diagnose Agent today

Install the CLI and run this agent against your own evals in under five minutes.

See a sample diagnosis