AI Agent Observability Tools: What to Trace, Score, and Fix

The Problem With Every "Best Tools" List You've Read

Search for AI agent observability tools and you'll find a strange pattern: Braintrust's "5 best tools" article ranks Braintrust first. Arize's comparison puts Arize on top. MLflow's "Top 5" crowns MLflow. None of them disclose the conflict. Meanwhile, Reddit threads asking for an "honest tier list" of agent observability tools climb the same search results, which tells you exactly how much readers trust the vendor listicles.

This article is also written by a vendor. We build Catalyst, an observability and improvement platform for AI agents. So instead of a ranked list, here's a different deal: we'll give you the evaluation criteria first, score eight tools against them (including ours), date every price, and name the situations where our product is the wrong choice. You can apply the criteria to any tool, including ones we don't mention.

First, the definition. AI agent observability is the practice of capturing the full decision path of an agent in production — every model call, tool call, and session — scoring those runs against quality criteria, and using what you find to fix recurring failures. Capture, score, fix. Most tools handle the first step. Fewer handle the second. Almost none handle the third.

That three-part framing is the spine of this comparison, because the gap between "I can see my traces" and "I know what to fix this week" is where most teams actually struggle.

Why AI Agent Observability Is Not LLM Call Logging

LLM observability grew up around a simple shape: one request, one response, one log record. Latency, token counts, cost, maybe the prompt and completion. For a single-shot completion app — summarize this document, classify this ticket — that record is the whole story.

Agents break the shape. An agent run is a loop: the model reasons, calls a tool, reads the result, reasons again, calls another tool, and eventually produces an answer. The failure you care about almost never lives in one model call. It lives in the path: the third step of a loop where the model passed a malformed ID to a lookup tool, got an empty result, and confidently invented the rest.

Four structural differences matter when you evaluate agent observability tools:

Traces are trees, not rows. An agent run needs a hierarchy: an AGENT span at the root, with LLM, TOOL, CHAIN, and RETRIEVER spans nested beneath it. A flat list of LLM calls strips out the structure that explains why the run went sideways.

Tool calls need full argument capture. When an agent fails, the bug is usually in the JSON arguments the model passed to a tool, or in how it handled the tool's result. A capture layer that stores tool names and latency but drops the arguments and results can't show you the failure. Look for tools that record the complete payload: names, IDs, arguments, and results.

Sessions group what users experienced. A multi-turn conversation spans many traces. Without grouping on a session ID, you reconstruct a user's bad experience by hand, trace by trace. Span attributes like session.id exist precisely so the tooling can group runs into the conversation a human actually had.

Agent identity must survive deploys. You shipped a prompt change on Tuesday. Is the agent better or worse? Answering that requires a stable agent ID that persists across versions and deploys, so dashboards and analysis can compare the same agent over time.
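
To make the tree-versus-rows distinction concrete, here is a minimal TypeScript sketch that rebuilds a hierarchy from flat spans and flags TOOL spans that lost their payload. The field names (spanId, parentId, toolArgs, toolResult) and both helper functions are illustrative assumptions, not any vendor's schema.

```typescript
// Illustrative span shape. Field names are assumptions for this sketch,
// not the schema of any specific tool.
type SpanKind = "AGENT" | "LLM" | "TOOL" | "CHAIN" | "RETRIEVER";

interface FlatSpan {
  spanId: string;
  parentId: string | null; // null marks the root of a run
  kind: SpanKind;
  label: string;
  sessionId: string; // groups traces into one conversation
  agentId: string; // stays the same across deploys
  toolArgs?: unknown; // full JSON arguments, TOOL spans only
  toolResult?: unknown; // full result payload, TOOL spans only
}

interface SpanNode extends FlatSpan {
  children: SpanNode[];
}

// Rebuild the hierarchy from parent pointers. A span whose parent is
// missing is kept as a root so nothing is silently dropped.
function buildTraceTree(spans: FlatSpan[]): SpanNode[] {
  const byId = new Map<string, SpanNode>();
  for (const s of spans) byId.set(s.spanId, { ...s, children: [] });
  const roots: SpanNode[] = [];
  byId.forEach((node) => {
    const parentNode = node.parentId === null ? undefined : byId.get(node.parentId);
    if (parentNode) parentNode.children.push(node);
    else roots.push(node);
  });
  return roots;
}

// IDs of TOOL spans that recorded a name but no arguments or no result.
function toolSpansMissingPayload(spans: FlatSpan[]): string[] {
  return spans
    .filter((s) => s.kind === "TOOL" && (s.toolArgs === undefined || s.toolResult === undefined))
    .map((s) => s.spanId);
}

const shared = { sessionId: "session-001", agentId: "support-agent" };
const exampleSpans: FlatSpan[] = [
  { ...shared, spanId: "a1", parentId: null, kind: "AGENT", label: "support run" },
  { ...shared, spanId: "l1", parentId: "a1", kind: "LLM", label: "plan" },
  { ...shared, spanId: "t1", parentId: "a1", kind: "TOOL", label: "lookup_order",
    toolArgs: { orderId: "A-17" }, toolResult: { items: [] } },
  { ...shared, spanId: "t2", parentId: "a1", kind: "TOOL", label: "search_docs" },
];

const roots = buildTraceTree(exampleSpans);
console.log(roots[0].children.map((c) => `${c.kind} ${c.label}`));
console.log(toolSpansMissingPayload(exampleSpans));
```

A capture layer that only ever produces spans like t2 (a name, no arguments, no result) is the "name + latency only" case: the tree renders, but the failure is invisible.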

Figure 1: Flat Call Log vs Agent Trace Tree

One more thing worth knowing before you evaluate vendors: the standards underneath all this are in motion. OpenTelemetry's GenAI semantic conventions — both the client (LLM call) span conventions and the agent and framework span conventions (invoke_agent, execute_tool, gen_ai.agent.*) — are still marked as in development, not stable. OpenInference, the span convention maintained by Arize, remains the de facto standard for rich LLM spans, and spans that follow it can be read by any OpenInference-aware viewer. Tools that speak these open formats keep your exit door open. Tools that don't, don't.
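
As a rough illustration of what "speaking both formats" involves, the sketch below adds OTel GenAI-style attributes next to OpenInference ones. The attribute keys are written as we understand the two conventions at the time of writing. Because the GenAI conventions are still in development, treat every key and the kind-to-operation mapping as assumptions to check against the current specs, not as a reference.

```typescript
// OpenInference span kinds mentioned in this article.
type OpenInferenceKind = "AGENT" | "LLM" | "TOOL" | "CHAIN" | "RETRIEVER";

// Assumed, lossy mapping to OTel GenAI operation names. CHAIN and RETRIEVER
// have no direct counterpart in this sketch, so they map to undefined.
function toGenAiOperation(kind: OpenInferenceKind): string | undefined {
  switch (kind) {
    case "AGENT":
      return "invoke_agent";
    case "TOOL":
      return "execute_tool";
    case "LLM":
      return "chat";
    default:
      return undefined;
  }
}

// Copy the attributes and add GenAI-style keys next to the OpenInference
// ones, so either kind of viewer has something to read.
function addGenAiAttributes(attrs: Record<string, string>): Record<string, string> {
  const out: Record<string, string> = { ...attrs };
  const op = toGenAiOperation(attrs["openinference.span.kind"] as OpenInferenceKind);
  if (op !== undefined) out["gen_ai.operation.name"] = op;
  if (attrs["session.id"] !== undefined) out["gen_ai.conversation.id"] = attrs["session.id"];
  if (attrs["tool.name"] !== undefined) out["gen_ai.tool.name"] = attrs["tool.name"];
  return out;
}

console.log(
  addGenAiAttributes({
    "openinference.span.kind": "TOOL",
    "session.id": "session-001",
    "tool.name": "lookup_order",
  }),
);
```

The point is less the specific keys than the property they buy you: spans in an open shape can be translated by a small function instead of a migration project.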

The Evaluation Criteria That Actually Matter

Here is the checklist we'd use to evaluate any AI agent observability tool — ours included. It groups into the three layers from the introduction, plus the operational concerns that show up after you've committed.

Criterion	What to verify	Failing looks like
Trace hierarchy	Nested AGENT/LLM/TOOL spans	Flat call list
Tool-call capture	Full JSON args + results	Name + latency only
Sessions + identity	session.id, stable agent.id	Per-trace silos
Open formats	OTel / OpenInference output	Proprietary export
Framework coverage	Your stack auto-instrumented	Hand-written spans
Eval support	Offline + online scoring	Logs without scores
Cost attribution	Per agent and session	Per request only
Cross-run analysis	Ranked recurring failures	Manual trace reading

A few of these deserve elaboration.

Trace hierarchy depth and tool-call capture are table stakes, but verify them with a real agent, not a demo. Run a multi-step loop through the tool and open the worst trace. Can you see the exact arguments the model passed on step three? If the answer is "we show the tool name and duration," you'll be exporting data to a notebook every time something breaks.

Session grouping and stable agent identity sound like bookkeeping until the first incident. The teams that debug fastest are the ones who can pull up "everything the support agent did in session X" in one view, and compare this deploy's behavior against last week's under a stable agent identity.

Eval support splits into offline and online evaluation. Offline evaluation runs a dataset (typically built from captured production traffic) through your models and scores the outputs. Online evaluation scores live production outputs continuously, usually with sample-rate controls. Tools differ widely here: some treat evals as the product, some bolt on a scoring API, some leave it to you. LLM-as-judge scoring against plain-language rubrics has become the common pattern; Catalyst's version requires the rubric to reference the model response and scores on a 1–10 range by default. If a vendor claims both offline and online evaluation, ask which is shipping today — for Catalyst, offline evals are available now and online evaluation is still listed as coming soon.
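
Two pieces of the online path are easy to sketch: the sample-rate control and validation of the judge's score. The code below hashes the trace ID so the same trace always gets the same sampling decision, and rejects judge output outside the 1 to 10 range. The function names and the hash-based approach are our illustration, assuming trace IDs are already random-looking. This is not a description of how any tool named here implements sampling, and the judge call itself is left out.

```typescript
// FNV-1a (32-bit) hash of the trace ID, scaled to a bucket in [0, 1).
function sampleBucket(traceId: string): number {
  let h = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < traceId.length; i++) {
    h ^= traceId.charCodeAt(i);
    h = Math.imul(h, 0x01000193); // FNV prime, 32-bit multiply
  }
  return (h >>> 0) / 0x100000000;
}

// Deterministic sampling: a given trace ID always yields the same answer
// for a given rate, so retries and replays do not change what gets scored.
function shouldScore(traceId: string, sampleRate: number): boolean {
  return sampleBucket(traceId) < sampleRate;
}

// Pull the first number out of a judge reply and accept it only if it falls
// inside the rubric's range. Returns null for anything unusable.
function parseJudgeScore(raw: string, min = 1, max = 10): number | null {
  const match = raw.match(/-?\d+(\.\d+)?/);
  if (match === null) return null;
  const value = Number(match[0]);
  return value >= min && value <= max ? value : null;
}

const sampleTraceId = "4bf92f3577b34da6a3ce929d0e0e4736";
console.log(shouldScore(sampleTraceId, 0.1));
console.log(parseJudgeScore("Score: 7"));
console.log(parseJudgeScore("I would rate this 11"));
```

Rejecting out-of-range scores rather than clamping them is a deliberate choice in this sketch: a judge that returns 11 on a 10-point rubric has usually misread the rubric, and that is worth counting separately.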

Cost and token attribution needs to resolve to the level you budget at. Per-request token counts are universal; what you actually want is cost per agent and per session over time, so you can see that the planning agent got 40% more expensive after last week's prompt change. Per-agent dashboards that track run counts, error rate, latency, token usage, and cost over time are the bar here.
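
The roll-up described above is simple arithmetic once each run carries an agent ID and a session ID. A minimal sketch, with a hypothetical RunCostRecord shape and made-up dollar figures chosen to reproduce the 40% example:

```typescript
// Hypothetical per-run cost record; the field names are assumptions.
interface RunCostRecord {
  agentId: string;
  sessionId: string;
  week: string; // a period label, for example an ISO week
  costUsd: number;
}

// Sum cost under whatever key you budget at (agent, session, agent + week).
function sumCostBy(
  runs: RunCostRecord[],
  keyOf: (run: RunCostRecord) => string,
): Map<string, number> {
  const totals = new Map<string, number>();
  for (const run of runs) {
    const key = keyOf(run);
    totals.set(key, (totals.get(key) ?? 0) + run.costUsd);
  }
  return totals;
}

// Percentage change between two periods. A zero baseline yields Infinity,
// which callers should handle explicitly.
function percentChange(before: number, after: number): number {
  return ((after - before) * 100) / before;
}

const exampleRuns: RunCostRecord[] = [
  { agentId: "planner", sessionId: "s1", week: "2026-W23", costUsd: 4 },
  { agentId: "planner", sessionId: "s2", week: "2026-W23", costUsd: 6 },
  { agentId: "planner", sessionId: "s3", week: "2026-W24", costUsd: 5.5 },
  { agentId: "planner", sessionId: "s3", week: "2026-W24", costUsd: 8.5 },
  { agentId: "writer", sessionId: "s3", week: "2026-W24", costUsd: 2 },
];

const weekly = sumCostBy(exampleRuns, (r) => `${r.agentId}|${r.week}`);
const change = percentChange(
  weekly.get("planner|2026-W23") ?? 0,
  weekly.get("planner|2026-W24") ?? 0,
);
console.log(`planner cost change week over week: ${change}%`);
console.log(sumCostBy(exampleRuns, (r) => r.sessionId));
```

None of this works with per-request token counts alone: without a stable agent ID on every record, the "before" and "after" buckets cannot be matched up.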

Framework coverage is a compatibility question, not a feature count. The integration list that matters is yours: if you're on LangGraph, the OpenAI Agents SDK, or the Vercel AI SDK, the tool should capture those runs without you hand-writing spans for every step. Check the vendor's documented integrations against your actual stack, and treat manual instrumentation as the escape hatch, not the plan.

Automated cross-run analysis is the criterion that separates inspection tools from improvement tools, and it's the one no vendor listicle currently covers. The question to ask: when an agent fails the same way forty times this week, does the tool tell you, or do you find out by reading forty traces?
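
At its simplest, "the tool tells you" means grouping failures by a normalized signature and ranking the groups, with the trace IDs kept as evidence. The sketch below is a deliberately naive version of that idea, with hypothetical FailedRun and RankedFinding shapes. It is our illustration of the criterion, not a description of how any product in this article performs its analysis.

```typescript
// Hypothetical record of one failed run.
interface FailedRun {
  traceId: string;
  tool: string;
  errorMessage: string;
}

interface RankedFinding {
  signature: string;
  count: number;
  traceIds: string[]; // evidence a reader can click into
}

// Replace volatile details (quoted strings, numbers) so that failures that
// differ only in an ID or a timeout value share one signature.
function failureSignature(run: FailedRun): string {
  const normalized = run.errorMessage
    .toLowerCase()
    .replace(/"[^"]*"/g, '"<str>"')
    .replace(/\d+/g, "<n>");
  return `${run.tool}: ${normalized}`;
}

// Group by signature, then rank the groups by how often they occur.
function rankFailures(runs: FailedRun[]): RankedFinding[] {
  const groups = new Map<string, string[]>();
  for (const run of runs) {
    const signature = failureSignature(run);
    const ids = groups.get(signature) ?? [];
    ids.push(run.traceId);
    groups.set(signature, ids);
  }
  const findings: RankedFinding[] = [];
  groups.forEach((traceIds, signature) => {
    findings.push({ signature, count: traceIds.length, traceIds });
  });
  return findings.sort((a, b) => b.count - a.count);
}

const exampleFailures: FailedRun[] = [
  { traceId: "t1", tool: "lookup_order", errorMessage: "Order 123 not found" },
  { traceId: "t2", tool: "search_docs", errorMessage: "Timeout after 30s" },
  { traceId: "t3", tool: "lookup_order", errorMessage: "Order 456 not found" },
];

console.log(rankFailures(exampleFailures));
```

String normalization only catches failures that throw an error with similar text. Failures that never raise an error need analysis of the run's behavior, which is where purpose-built cross-run tooling earns its place.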

The Five Categories of Agent Observability Tools

Brands churn; categories are stable. Map a tool to its category and you know its trade-offs before the sales call.

Tracing-Only Tools

What they do: capture spans, render trace trees, give you filters. This is the developer-experience layer, and during development it's genuinely what you need. The trap is treating it as an end state. Trace viewers scale to the volume a human can read, which production traffic exceeds in the first hour. If a tool's story stops at "inspect your traces," it's a debugging tool, not an observability practice.

Eval-First Platforms

These platforms organize around datasets, experiments, and scores rather than traces. Braintrust is the clearest example: a free Starter tier (1 GB of processed data, 10,000 scores, 14-day retention) and a Pro plan at $249/month, with the strongest CI/CD-style eval workflow in the group. LangSmith straddles this category and the next: its eval tooling covers offline and online scoring, with dashboards for tokens, latency percentiles, error rates, and cost, plus alerting through webhooks or PagerDuty.

Choose eval-first when your bottleneck is quality discipline: you can't tell whether prompt v12 beat v11. The trade-off is that capture depth tends to be shallower or tied to the vendor's SDK, and the eval-first workflow assumes someone curates datasets and reviews experiments.

APM Extensions

Datadog, New Relic, and Dynatrace extended their application monitoring into LLM territory. The pitch is consolidation: your infra alerts, dashboards, and on-call workflow already live there. Datadog's LLM Observability bills only on LLM spans (tool, workflow, and retrieval spans are free), with 40,000 LLM spans included per month before on-demand billing.

Choose an APM extension when your organization already runs on the APM and the AI workload is one service among many. The weakness is agent-native depth: session-centric debugging, eval workflows, and improvement loops are not where APM vendors live, and span-based billing on top of APM pricing gets hard to predict for chatty agents.

Open-Source Stacks

This is the self-hosted path, and in 2026 it requires more honesty than the listicles offer:

Langfuse remains the strongest general-purpose open-source option: MIT-licensed, free to self-host, with cloud tiers from a free Hobby plan (50k units/month) through Core at $29/month, Pro at $199/month, and Enterprise at $2,499/month. ClickHouse acquired Langfuse on January 16, 2026; the project stayed MIT-licensed, the cloud kept operating, and ClickHouse publicly committed to expanding the roadmap rather than disrupting it.
Arize Phoenix is open source, built on OpenTelemetry, powered by OpenInference, and runs locally or self-hosted — excellent for notebook workflows, tracing, and LLM-as-judge evals without sending data to anyone.
MLflow is Apache-2.0 under Linux Foundation governance and positions itself as a complete platform spanning tracing, evaluation, and experiment tracking.
Helicone deserves a candid flag that no ranking page leads with. After Mintlify acquired it (announced March 3, 2026), the cloud product entered maintenance mode: security updates and bug fixes continue, but there is no new SaaS feature development. The Apache-2.0 code, proxy-based capture, and self-hosting remain, but it is hard to recommend for new production adoption.

Choose open source for data control and budget. Price the ops honestly: you're running an ingestion pipeline, a database, and a UI, and you're on call for them.

Improvement-Loop Platforms

The newest category extends capture and scoring with a layer that analyzes traces across runs and tells you what to fix. This is where Catalyst sits, paired with Halo, an open-source, MIT-licensed analysis engine that reads OpenTelemetry-compatible spans, decomposes them to find systemic failure modes across many runs, and returns ranked findings that cite the specific trace IDs behind each one.

Choose this category when your agents generate more traces than your team will ever read, and "we look at traces when something breaks" has stopped working. Skip it if you're still pre-production — a local trace viewer is faster feedback while you're iterating on a prototype.

Comparison: Eight Tools Scored Against the Criteria

No tool wins every column, which is exactly why the listicle format fails. Pick your category first, then check the criteria that are binding for you.

Tool	Category	Strongest at	Watch out for
Catalyst	Improvement-loop	Capture + cross-run analysis	Online evals pending
LangSmith	Eval-first / tracing	LangChain-native evals	Trace overage costs
Langfuse	Open-source stack	Self-hosting, maturity	New owner (ClickHouse)
Braintrust	Eval-first	CI-style eval workflow	Capture depth, price
Arize Phoenix	Open-source stack	Local OTel tracing + evals	You run the ops
Helicone	Open-source / proxy	One-line proxy capture	Maintenance mode
Datadog LLM Obs.	APM extension	One pane with infra	Agent-native depth
OpenTelemetry-only	DIY standard	Neutrality, $0 license	Build everything

Assessments as of June 2026; categories and trade-offs are detailed in the section above.

Pricing deserves its own table, because the units are a trap. The same agent workload is billed as traces by LangSmith, units by Langfuse, LLM spans by Datadog, requests by Helicone, and gigabytes by Braintrust. A single agent run with tools and retrieval can emit dozens of spans, so two tools with similar sticker prices can produce bills an order of magnitude apart.

Tool	Billing unit	Free tier	Entry paid tier
Catalyst	OTel spans	1M spans/mo	Starter: 10M spans/mo
LangSmith	Traces	5K traces/mo	$39/seat/mo
Langfuse	Units	50K units/mo	$29/mo
Braintrust	GB + scores	1 GB, 10K scores	$249/mo
Arize Phoenix	None (OSS)	Self-host free	n/a
Helicone	Requests	10K requests/mo	$79/mo
Datadog LLM Obs.	LLM spans	40K LLM spans/mo	On-demand
OpenTelemetry-only	None (DIY)	Free	Your infra cost

Prices and allowances as of June 2026; Helicone cloud is in maintenance mode following the Mintlify acquisition.

Figure 2: Entry Paid Plan Pricing - Hosted agent observability tools, USD per month, as of June 2026

LangSmith's numbers illustrate how fast included volumes evaporate: the free Developer tier includes 5,000 traces a month, and the $39-per-seat Plus tier includes 10,000 base traces with overage at $2.50 per 1,000 (14-day retention; 400-day retention traces cost $5.00 per 1,000). A few hundred iterations on an agent during development is enough to burn through an allowance like that.

Figure 3: Free Tier Included Volume, in Each Tool's Native Billing Unit - Deliberately apples-to-oranges: a trace, a unit, a request, and an LLM span are different things (June 2026)

The chart above is deliberately apples-to-oranges — that's the point. A LangSmith trace, a Langfuse unit, a Helicone request, and a Datadog LLM span are different things, so compare allowances only after estimating how many of each unit your agent emits per run.
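
The estimate is a few multiplications. The sketch below converts one workload into three of the billing units from the table and applies the LangSmith Plus overage figures quoted above (10,000 included traces, $2.50 per 1,000 beyond that). The workload numbers are invented. "One trace per run" and fractional per-1,000 overage billing are our simplifying assumptions, and seat fees are ignored.

```typescript
// Hypothetical workload shape for one agent.
interface WorkloadShape {
  runsPerMonth: number;
  spansPerRun: number; // all spans: agent, LLM, tool, retrieval
  llmSpansPerRun: number; // the subset that are LLM calls
}

// The same workload expressed in three different billing units.
function monthlyUnits(w: WorkloadShape): { traces: number; allSpans: number; llmSpans: number } {
  return {
    traces: w.runsPerMonth, // assumption: one trace per run
    allSpans: w.runsPerMonth * w.spansPerRun,
    llmSpans: w.runsPerMonth * w.llmSpansPerRun,
  };
}

// Overage cost for usage above an included allowance, priced per 1,000 units.
function overageUsd(used: number, included: number, pricePerThousand: number): number {
  const extra = Math.max(0, used - included);
  return (extra / 1000) * pricePerThousand;
}

const workload: WorkloadShape = { runsPerMonth: 20_000, spansPerRun: 24, llmSpansPerRun: 6 };
const units = monthlyUnits(workload);

console.log(units);
console.log("Trace overage at $2.50 per 1,000:", overageUsd(units.traces, 10_000, 2.5));
console.log("LLM spans beyond a 40,000 allowance:", Math.max(0, units.llmSpans - 40_000));
console.log("Fits a 1M-span allowance:", units.allSpans <= 1_000_000);
```

The same 20,000 runs show up as 20,000 traces, 120,000 LLM spans, or 480,000 total spans depending on who is counting, which is why allowances cannot be compared by their headline numbers.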

Three more candid notes the table can't hold:

OpenTelemetry-only (instrument with OTel, ship spans to your own backend) is free and standards-aligned, but the GenAI agent-span conventions are still marked as in development, and you're building storage, visualization, evals, and analysis yourself. It's a floor, not a product.
Migration is cheaper than it looks. If you're on LangSmith, its OpenTelemetry spans can be bridged into another backend without re-instrumenting, and Catalyst accepts them directly. Langfuse SDK traffic redirects with a three-variable env change.
Catalyst's weak spots, since we promised: online evaluation is still listed as coming soon, and there's no self-hosted version of the full platform. If your mandate is "everything on our own hardware today," Langfuse or Phoenix fits better (Halo itself is open source and self-hostable; the hosted dashboard is not).

The Capture Layer in Practice: Catalyst Tracing

This section shows our own product, because criteria like "tool-call argument capture" are abstract until you see what instrumentation actually looks like. Judge the ergonomics for yourself.

Catalyst Tracing emits OpenInference-shaped OpenTelemetry spans over OTLP/HTTP. Captured spans carry the full message history (user, system, assistant, tool, and tool-result messages) plus tool calls with names, IDs, JSON arguments and results, model metadata, token usage including prompt-cache counts, agent identity, and error status on failed spans.

Setup is a single call before you construct your clients. In TypeScript you pass the SDK modules to patch; in Python, setup() auto-detects installed packages, with per-integration extras like pip install 'inference-catalyst-tracing[openai]'.

import { setup } from "@inference/tracing";
import OpenAI from "openai";

// Call setup() before constructing any client so the OpenAI module is patched.
const tracing = await setup({
  modules: { openai: OpenAI },
});

const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// This call is captured automatically as an LLM span.
const response = await client.chat.completions.create({
  model: "gpt-4o-mini",
  messages: [{ role: "user", content: "Reply with just the word hello." }],
  max_tokens: 16,
});

console.log(response.choices[0]?.message.content);
// Flush batched spans before the script exits.
await tracing.shutdown();

That gets you automatic LLM spans with messages, model name, invocation parameters, and token counts. The agent-level criteria (session grouping, stable identity) come from wrapping the run in an agent span:

import { agentSpan, setup } from "@inference/tracing";
import OpenAI from "openai";

const tracing = await setup({ modules: { openai: OpenAI } });
const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

await agentSpan(
  {
    // Identity and session attributes that the Agents dashboard groups on.
    agentId: "hello-agent",
    agentName: "Hello Agent",
    sessionId: "session-001",
  },
  async (span) => {
    span.setInput("Reply with just the word hello.");
    // The LLM span for this call is nested beneath the agent span.
    const response = await client.chat.completions.create({
      model: "gpt-4o-mini",
      messages: [{ role: "user", content: "Reply with just the word hello." }],
      max_tokens: 16,
    });
    span.setOutput(response.choices[0]?.message.content ?? "");
  },
);

// Flush batched spans before the script exits.
await tracing.shutdown();

The agentId, agentName, and sessionId attributes are what the Agents dashboard groups on, and every span underneath inherits them.

On the framework-coverage criterion, Catalyst has 16 documented integration guides: OpenAI, Anthropic, LangChain, LangGraph, LangSmith, OpenAI Agents, LiveKit Agents, ElevenLabs Agents, PI AI, Cursor SDK, Vercel AI SDK, Claude Agent SDK, Claude Code SDK, Pydantic AI, plus manual spans and agent identity. If wiring this by hand isn't appealing, the inf instrument --mode tracing command launches a coding agent that scans your codebase, installs the SDK with the right extras, and wires setup() into your entrypoint.

One operational caveat applies to every SDK-based tool in this article: spans are batched and exported in the background, so process shape matters. Short-lived scripts need a shutdown() call before exit, long-lived services should call it on SIGTERM, and serverless runtimes need a force-flush per invocation or spans get dropped when the process freezes.
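
The serverless and long-lived cases can be sketched against a generic handle. The SpanExporterHandle interface below is hypothetical: the examples above only show that Catalyst's tracing object has shutdown(), and forceFlush() is an assumption borrowed from the usual OpenTelemetry shape, so check your SDK for the real method name. The signal registrar is injected to keep the sketch runtime-neutral.

```typescript
// Hypothetical minimal handle; forceFlush() is an assumed method name.
interface SpanExporterHandle {
  forceFlush(): Promise<void>;
  shutdown(): Promise<void>;
}

// Serverless: flush after every invocation, including when the handler
// throws, so spans are exported before the runtime freezes the process.
function withFlush<A, R>(
  handle: SpanExporterHandle,
  handler: (arg: A) => Promise<R>,
): (arg: A) => Promise<R> {
  return async (arg: A) => {
    try {
      return await handler(arg);
    } finally {
      await handle.forceFlush();
    }
  };
}

// Long-lived service: shut down once when a termination signal arrives.
// `onSignal` registers the callback with whatever the runtime provides.
function shutdownOnSignal(
  handle: SpanExporterHandle,
  onSignal: (callback: () => void) => void,
): void {
  let started = false;
  onSignal(() => {
    if (started) return; // ignore repeated signals
    started = true;
    void handle.shutdown();
  });
}
```

In Node, the registrar would typically be (callback) => process.once("SIGTERM", callback). Putting the flush in a finally block matters because the failed invocations are usually the ones whose spans you most want to keep.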

If you want to kick the tires, the Catalyst tracing quickstart is the install path.

From Inspecting Traces to Fixing Failures

Here's the uncomfortable math of production agents: a modestly busy agent produces thousands of traces a week. Nobody reads them. Dashboards tell you that the error rate moved; they don't tell you why, and they certainly don't tell you about the failure pattern that never throws an error — the agent that calls the same search tool four times with slightly different phrasing because the first result was never used.

This is the gap cross-run analysis closes. Halo reads a window of OpenTelemetry-compatible spans, decomposes them across many runs, and returns a ranked list of systemic failure modes, each citing the specific trace IDs it came from, so a finding is a claim you can verify by clicking into the evidence, not a vibe. The default analysis prompt looks for recurring failures, notable tool-call patterns, wasted or redundant work, slow or high-cost paths, and concrete recommended fixes, and you can ask it pointed questions instead ("which tool calls return empty results most often?").
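
The silent failure described above, the same search tool called repeatedly with slightly different phrasing, is the kind of pattern that can be flagged mechanically. The sketch below uses token-overlap (Jaccard) similarity between query strings within a single run. It is our own illustrative heuristic, with a hypothetical ToolCallRecord shape and an arbitrary 0.5 threshold. It is not how Halo works internally.

```typescript
// Hypothetical record of one tool call within a run.
interface ToolCallRecord {
  traceId: string;
  tool: string;
  query: string;
}

// Lowercased word tokens of a query.
function tokenSet(text: string): Set<string> {
  return new Set(text.toLowerCase().split(/\W+/).filter((t) => t.length > 0));
}

// Jaccard similarity: shared tokens divided by total distinct tokens.
function jaccard(a: Set<string>, b: Set<string>): number {
  if (a.size === 0 && b.size === 0) return 1;
  let sharedCount = 0;
  a.forEach((t) => {
    if (b.has(t)) sharedCount++;
  });
  return sharedCount / (a.size + b.size - sharedCount);
}

// Count calls that nearly repeat an earlier call to the same tool.
function redundantCalls(calls: ToolCallRecord[], threshold = 0.5): number {
  let redundant = 0;
  for (let i = 1; i < calls.length; i++) {
    const current = tokenSet(calls[i].query);
    for (let j = 0; j < i; j++) {
      const sameTool = calls[j].tool === calls[i].tool;
      if (sameTool && jaccard(current, tokenSet(calls[j].query)) >= threshold) {
        redundant++;
        break; // one earlier near-duplicate is enough
      }
    }
  }
  return redundant;
}

const exampleCalls: ToolCallRecord[] = [
  { traceId: "run-1", tool: "search", query: "refund policy for annual plans" },
  { traceId: "run-1", tool: "search", query: "annual plans refund policy" },
  { traceId: "run-1", tool: "search", query: "refund policy annual plans details" },
  { traceId: "run-1", tool: "search", query: "what is the refund policy for annual plans" },
  { traceId: "run-1", tool: "lookup_order", query: "order 991" },
];

console.log(redundantCalls(exampleCalls));
```

Run over a week of traces and summed per agent, a count like this surfaces wasted work that no error-rate dashboard will show, because nothing in the run ever fails.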

The production pattern is scheduling: recurring runs at hourly, daily, weekly, or monthly cadence, with a lookback window that pre-fills to match (a daily schedule reviews 24 hours; any single window caps at 30 days), each producing a report in the analysis history. That turns "did the Tuesday prompt change regress anything?" from an investigation into a standing report. You can run Halo on your traces on demand or on a schedule from the dashboard.

Two honest qualifiers. First, the findings feed a human or a coding agent; nothing auto-merges fixes. Second, you don't have to take the hosted version on faith: the engine is MIT-licensed, and pip install halo-engine pointed at a JSONL trace file runs the same analysis self-hosted.

Security and Compliance for Trace Data

A trace store contains whatever your agent saw: customer messages, retrieved documents, and every JSON argument passed to every tool. It's a production data store with production data in it, and the listicles barely mention it.

Questions to ask any vendor — including us:

Secret handling. Are credentials and sensitive values stripped at capture? Catalyst's documented posture: secrets and similar sensitive values are stripped where possible, and request data is not used for model training by default.
Encryption and retention. Data should be encrypted in transit and at rest, with retention matched to operational need rather than "forever by default"; custom and no-retention policies should be available for stricter environments — Catalyst's documented data-retention posture covers each of these.
Compliance attestations by tier. Read the fine print on which plan actually carries the certification. Langfuse, for example, attaches SOC 2 and ISO 27001 reports and HIPAA availability to its $199/month Pro tier, not its cheaper plans.
Self-hosting's hidden clause. Running the stack yourself moves the compliance burden to you. That's sometimes the goal — just price it.
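
For the secret-handling question, it helps to know what capture-time stripping looks like so you can ask vendors where theirs runs. A minimal sketch that redacts values under sensitive-looking keys in tool arguments before export. The key list and the [REDACTED] marker are our assumptions, not Catalyst's or anyone else's implementation, and key-name matching will not catch secrets embedded in free text.

```typescript
// Keys whose values should never reach the trace store. The list is an
// illustrative assumption; extend it for your own payloads.
const SENSITIVE_KEY = /^(api[_-]?key|authorization|password|secret|access[_-]?token)$/i;

// Recursively copy a JSON-like value, replacing values under sensitive keys.
// The input is not mutated.
function redact(value: unknown): unknown {
  if (Array.isArray(value)) return value.map((item) => redact(item));
  if (value !== null && typeof value === "object") {
    const out: Record<string, unknown> = {};
    for (const [key, inner] of Object.entries(value as Record<string, unknown>)) {
      out[key] = SENSITIVE_KEY.test(key) ? "[REDACTED]" : redact(inner);
    }
    return out;
  }
  return value;
}

const toolArguments = {
  query: "latest invoice",
  api_key: "sk-live-123",
  nested: { Authorization: "Bearer abc", items: [{ password: "hunter2", id: 7 }] },
};

console.log(JSON.stringify(redact(toolArguments), null, 2));
```

Where this runs matters as much as what it matches: redaction in the SDK means the secret never leaves your process, while redaction at ingestion means it crossed the network first.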

Which Tool Should You Choose?

When choosing between AI agent observability tools, match the scenario, not the brand:

Prototyping solo or pre-production: run Arize Phoenix locally or use any generous free tier. You need a trace viewer and fast iteration, not a platform.
Committed to LangChain/LangGraph with an eval-curious team: LangSmith is the path of least resistance; budget for trace overages once real iteration starts.
Hard data-control mandate: self-host Langfuse or MLflow and staff the ops; both are genuinely open source.
Organization already lives in Datadog: extend it with LLM Observability and accept thinner agent-native workflows; model the per-LLM-span billing against your agents' chattiness first.
Quality discipline is the bottleneck: Braintrust's eval-first workflow is the strongest, at eval-platform prices.
Production agents at volume, and trace-reading has stopped scaling: this is the improvement-loop case. Catalyst for capture and scoring, Halo for ranked failure modes with trace-ID evidence. If you need online evals or full self-hosting today, see the wrong-fit notes above.

Whatever you choose, insist on OpenTelemetry/OpenInference-compatible output. The standards are still settling, vendors are getting acquired — two of the eight tools in this article changed hands in the first quarter of 2026 — and the only durable insurance is owning spans in an open format.

The Bar Is Higher Than "View Your Traces"

The honest summary of this market: capture is commoditizing, scoring is maturing, and fixing is the frontier. Most AI agent observability tools you'll find ranked on the SERP will show you a trace tree. The criteria that will actually separate them for your team are tool-call argument capture, session and identity grouping, eval support that matches your workflow, billing units that survive contact with a chatty agent, and whether anything in the stack tells you what to fix without a human reading traces all day.

If you're evaluating this category for a production agent stack, we'll happily be one of the vendors you grill. Bring the criteria table from this article.

Agent Observability: A Practical Guide — implementation depth on the trace trees, conventions, and capture patterns this comparison scores
LLM Tracing: What to Capture in Production Agents — a field-by-field guide to what belongs in a trace
LLM Evaluation Tools Comparison — a deeper dive on the eval-first category