Agent Observability: A Practical Guide for Production AI Teams

What Agent Observability Actually Means (and Why It's Harder Than LLM Observability)

A production agent rarely fails loudly. It returns HTTP 200, the latency graph looks normal, and somewhere inside a 14-step loop it passed a malformed argument to a tool, got back an empty result, and confidently summarized data it never retrieved. Your logs show a successful request. Your user got a wrong answer. This is the problem agent observability exists to solve, and it's why AI observability has become a separate discipline from the application monitoring you already run.

Agent observability is the practice of capturing the full execution of an AI agent: every LLM call with its complete message history, every tool call with its arguments and results, every framework step, and the session and agent identity that tie runs together. The point is to be able to explain any single run and find patterns across thousands of them.

It builds on a narrower practice. LLM observability is the monitoring of individual model calls: the prompt that went in, the completion that came out, token counts, latency, cost, and errors for each request. If your product is a single prompt-and-response flow, LLM observability is most of what you need, and we cover that layer in depth in our guide to LLM observability in production.

Agents break the per-call model in four ways:

Multi-step loops. One user request fans out into dozens of model and tool calls. A flat list of per-call logs with no hierarchy is unreadable; you need the tree.
Tool calls. Most agent failures live in tool arguments and tool results, not in the model output itself. A tool that times out, returns an empty array, or receives hallucinated arguments usually still produces a "successful" request.
Non-determinism. The same input takes a different path tomorrow. You can't reproduce your way to a diagnosis; you have to compare many recorded runs.
Stateful sessions. Conversations span multiple runs. Quality problems often emerge across turns: context decays, instructions get forgotten. Per-call metrics never see any of it.

Dimension	LLM observability	Agent observability
Unit of analysis	One model call	One run (trace tree)
Failure surface	Prompt, completion, errors	Tools, loops, sessions
Key question	"What did this call do?"	"Why did this run fail?"
Data required	Prompt, output, tokens	Full tree + identity
Debugging mode	Inspect a request	Compare many runs

The rest of this guide is the implementation playbook: the data model behind AI agent observability, exactly what to capture, how to instrument your stack without rewriting it, the production practices that keep trace data safe and useful, and how to turn a pile of traces into a repeatable loop that finds and fixes recurring failure modes.

The Data Model: Spans, Trace Trees, Sessions, and Agent Identity

Four concepts carry the entire data model. Get these right and every tool in the ecosystem speaks your language, whether that's an open standard, a vendor dashboard, or an analysis engine.

A span is one unit of work: a single LLM call, one tool execution, a retrieval. It has a start time, an end time, attributes, and a status. Under the hood it's an ordinary OpenTelemetry span; everything AI-specific is carried in attribute conventions layered on top.

A trace is the tree of spans for one run, connected by parent-child relationships. For an agent, that means an outer agent span with model calls, tool calls, and retrieval steps nested beneath it.

A session groups multiple runs that belong to the same conversation or multi-turn workflow, carried by a session.id attribute on the spans.

Agent identity is the stable agent.id (plus a human-readable agent.name) that lets a dashboard or an analysis engine group every execution of the same agent across deploys, renames, and environments.
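To make the four concepts concrete, here is a minimal sketch of the data model in plain Python. The Span class and the group_by helper are hypothetical names made up for illustration, not part of any SDK; the only things taken from the text are the parent-child relationship and the agent.id and session.id attributes.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """One unit of work: an LLM call, a tool execution, a retrieval."""
    span_id: str
    trace_id: str               # every span in one run shares a trace_id
    parent_id: Optional[str]    # None marks the root (outer agent) span
    name: str
    start_ms: int
    end_ms: int
    status: str = "OK"          # "OK" or "ERROR"
    attributes: dict = field(default_factory=dict)


def group_by(spans, attribute):
    """Group trace IDs by an attribute found on root spans."""
    groups = {}
    for span in spans:
        if span.parent_id is None and attribute in span.attributes:
            groups.setdefault(span.attributes[attribute], set()).add(span.trace_id)
    return groups


identity = {"agent.id": "support-triage-agent", "session.id": "ticket-1842"}
spans = [
    Span("a1", "t1", None, "support.run", 0, 900, attributes=dict(identity)),
    Span("l1", "t1", "a1", "llm.call", 10, 400),
    Span("a2", "t2", None, "support.run", 0, 700, attributes=dict(identity)),
]

sessions = group_by(spans, "session.id")  # two runs, one conversation
agents = group_by(spans, "agent.id")      # two runs, one agent
```

Two separate traces (t1 and t2) collapse into one session and one agent, which is the grouping a dashboard or analysis engine relies on.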

OpenTelemetry GenAI vs OpenInference: How the Standards Fit Together

Two convention sets define what goes on those spans, and the confusion between them is mostly artificial: both are attribute vocabularies applied to standard OpenTelemetry spans, transported over OTLP. You are not choosing a wire protocol; you're choosing which dictionary your spans use.

The OpenTelemetry GenAI semantic conventions (gen_ai.* attributes) are the official OTel track. The whole convention set, including the client spans that describe individual LLM calls and the agent and framework spans, still carries Development (experimental) status. Instrumentations that already emit v1.36 or earlier keep that version as their default, and setting OTEL_SEMCONV_STABILITY_OPT_IN=gen_ai_latest_experimental opts in to the newest format. The agent conventions define operations like create_agent and invoke_agent, with span names such as invoke_agent {gen_ai.agent.name}.

OpenInference is the complementary, more application-shaped convention set maintained by Arize. It requires an openinference.span.kind attribute on every span and defines ten kinds — LLM, EMBEDDING, CHAIN, RETRIEVER, RERANKER, TOOL, AGENT, GUARDRAIL, EVALUATOR, and PROMPT — along with flattened attributes for full message lists and tool calls. In practice, OpenInference is richer for LLM-app payloads today, while the GenAI conventions are the long-term OTel standard whose agent coverage is still settling.
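A quick way to see that the two sets are dictionaries rather than protocols is to describe the same model call in both. The sketch below uses attribute names as we understand the two public specs at the time of writing; both are still evolving, so check the current versions before relying on any key. The call record and its token counts are invented for illustration.

```python
# One model call, described in two attribute vocabularies.
call = {
    "model": "gpt-4o-mini",
    "prompt_tokens": 42,       # illustrative numbers, not real usage
    "completion_tokens": 7,
    "messages": [{"role": "user", "content": "Reply with just the word hello."}],
}


def to_gen_ai(call):
    """OpenTelemetry GenAI-style attributes (gen_ai.*)."""
    return {
        "gen_ai.operation.name": "chat",
        "gen_ai.request.model": call["model"],
        "gen_ai.usage.input_tokens": call["prompt_tokens"],
        "gen_ai.usage.output_tokens": call["completion_tokens"],
    }


def to_openinference(call):
    """OpenInference-style attributes, with the message list flattened by index."""
    attrs = {
        "openinference.span.kind": "LLM",
        "llm.model_name": call["model"],
        "llm.token_count.prompt": call["prompt_tokens"],
        "llm.token_count.completion": call["completion_tokens"],
        "llm.token_count.total": call["prompt_tokens"] + call["completion_tokens"],
    }
    for index, message in enumerate(call["messages"]):
        prefix = f"llm.input_messages.{index}.message"
        attrs[f"{prefix}.role"] = message["role"]
        attrs[f"{prefix}.content"] = message["content"]
    return attrs


gen_ai_attrs = to_gen_ai(call)
openinference_attrs = to_openinference(call)
```

Either dictionary would be attached to an ordinary OpenTelemetry span and exported over OTLP; only the keys differ.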

Catalyst Tracing, the instrumentation layer we'll use for the examples in this guide, emits OpenInference-shaped OpenTelemetry spans over OTLP/HTTP, with wire-format attribute values byte-identical to the upstream OpenInference spec — so any OpenInference-aware viewer can render the same spans without configuration.

What a Real Agent Trace Tree Looks Like

Here's the shape of a single support-agent run, annotated with the attributes that matter:

Figure 1: Anatomy of an agent trace tree

The outer AGENT span carries identity and session attributes; every child span inherits them. The LLM spans hold the full message history, model name, and token counts. The TOOL spans hold the tool name, the exact JSON arguments the model produced, and the result the tool returned. That last part is the difference between knowing that a run failed and knowing why.
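If you want to see that shape without a dashboard, a tree can be rebuilt from flat span records using nothing but the parent pointers. This is a rough sketch: the span records, their names, and the render_tree helper are hypothetical, and a real export would carry many more fields.

```python
from collections import defaultdict

# Hypothetical flat records for one run: (span_id, parent_id, kind, name).
SPANS = [
    ("s1", None, "AGENT", "support.run"),
    ("s2", "s1", "LLM", "plan"),
    ("s3", "s1", "TOOL", "lookup_order"),
    ("s4", "s1", "RETRIEVER", "kb.search"),
    ("s5", "s1", "LLM", "answer"),
]


def render_tree(spans):
    """Render spans as an indented tree, following parent_id links."""
    children = defaultdict(list)
    for span_id, parent_id, kind, name in spans:
        children[parent_id].append((span_id, kind, name))

    lines = []

    def walk(parent_id, depth):
        for span_id, kind, name in children[parent_id]:
            lines.append("  " * depth + f"{kind} {name}")
            walk(span_id, depth + 1)

    walk(None, 0)  # roots are the spans with no parent
    return "\n".join(lines)


tree = render_tree(SPANS)
print(tree)
```

The output is one AGENT row with the model, tool, and retrieval steps indented beneath it, in the order they were recorded.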

Six span kinds cover essentially everything you'll author or read; the attribute and span-kind reference documents the full constant list.

Span kind	Use it for	Key attributes
`AGENT`	Outer agent run	`agent.id`, `session.id`
`LLM`	One model call	`llm.model_name`, messages, tokens
`TOOL`	One tool invocation	`tool.name`, `tool_call.id`
`CHAIN`	Non-LLM step (router, planner)	input/output values
`RETRIEVER`	Vector search / doc lookup	query, results
`EMBEDDING`	Direct embedding call	`llm.model_name`, tokens

Patched provider SDKs emit LLM spans automatically with required attributes filled in; you author the rest only for work the SDK can't see.

What to Capture at Each Layer

A trace is only as useful as the data on its spans. Teams that capture timing but skip payloads end up with a beautiful tree of empty boxes. Think in four layers: the LLM call, the tool call, the framework step, and the run or session that wraps them all.

Layer	Capture	Why it matters
LLM call	Full messages, model, params, finish reason, tokens	Reconstruct what the model saw
Tool call	Name, ID, JSON args, result	Most failures live here
Framework step	Chain/graph/agent spans	Attribute failures to a step
Run / session	`agent.id`, `session.id`, `user.id`, errors	Group, filter, and compare runs

Instrumented provider SDKs handle the first layer for you: each patched call emits an LLM span with input messages, output messages, model name, invocation parameters, finish reason, and token counts, with no changes at the call site.

Two of these items get skipped most often, and both are expensive mistakes.

Full message history. You cannot debug a prompt you didn't record. When an agent goes off the rails at step nine, the question is always "what did the model actually see?": the accumulated system prompt, prior turns, tool results, and injected context. Truncated or sampled message capture turns that question into guesswork.

Tool results, not just tool calls. Many "the model is dumb" bugs are actually "the tool returned garbage" bugs. An empty search result, a stale record, an error string the model dutifully treated as data — without the tool's output on the span, you'll blame the wrong component every time. Captured spans should carry tool names, IDs, JSON arguments, and the results that came back, plus exception status and error details when a step fails.
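One way to catch both mistakes early is a small audit over exported span attributes. The sketch below assumes spans arrive as a kind plus an attribute dictionary, and it uses input.value and output.value as the payload keys; the REQUIRED table and the audit_span helper are our own illustration, not an SDK feature.

```python
import json

# Payload attributes we expect per span kind (an illustrative checklist).
REQUIRED = {
    "LLM": ["llm.model_name", "input.value", "output.value"],
    "TOOL": ["tool.name", "input.value", "output.value"],
}


def audit_span(kind, attributes):
    """Return a list of human-readable problems for one span."""
    problems = [
        f"missing {key}" for key in REQUIRED.get(kind, []) if key not in attributes
    ]
    if kind == "TOOL" and "output.value" in attributes:
        raw = attributes["output.value"]
        try:
            result = json.loads(raw)   # tool results are often JSON strings
        except (TypeError, ValueError):
            result = raw               # keep non-JSON output as-is
        if result in ([], {}, "", None):
            problems.append("tool returned an empty result")
    return problems


empty_tool = audit_span(
    "TOOL",
    {"tool.name": "search_orders", "input.value": '{"order_id": "1842"}', "output.value": "[]"},
)
bare_llm = audit_span("LLM", {"llm.model_name": "gpt-4o-mini"})
```

Running a check like this over a sample of traces before scaling up tells you whether you are collecting payloads or just a tree of empty boxes.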

Token counts deserve a special mention. Capturing prompt, completion, total, and prompt-cache token counts per span is what makes cost attribution possible later: per agent, per session, per step.
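The arithmetic behind cost attribution is simple once the counts are on the span. Here is a minimal sketch, assuming a flat list of span records and placeholder prices; the model name, the prices, and the record shape are all invented, and prompt-cache tokens are left out to keep the example short.

```python
from collections import defaultdict

# Placeholder prices in USD per million tokens; substitute your provider's real rates.
PRICES = {"example-model": {"prompt": 1.00, "completion": 2.00}}

spans = [
    {"agent.id": "refund-review-agent", "session.id": "c-1",
     "model": "example-model", "prompt": 1000, "completion": 500},
    {"agent.id": "refund-review-agent", "session.id": "c-2",
     "model": "example-model", "prompt": 2000, "completion": 0},
    {"agent.id": "billing-agent", "session.id": "c-3",
     "model": "example-model", "prompt": 500, "completion": 250},
]


def span_cost(span):
    """Cost of one LLM span from its token counts and model price."""
    price = PRICES[span["model"]]
    tokens_cost = span["prompt"] * price["prompt"] + span["completion"] * price["completion"]
    return tokens_cost / 1_000_000


def rollup(spans, key):
    """Sum span costs by any grouping attribute (agent.id, session.id, ...)."""
    totals = defaultdict(float)
    for span in spans:
        totals[span[key]] += span_cost(span)
    return dict(totals)


cost_by_agent = rollup(spans, "agent.id")
cost_by_session = rollup(spans, "session.id")
```

The same rollup function answers the per-agent and per-session questions; only the grouping key changes.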

Instrumenting Your Stack

Implementation is less work than most teams expect. The whole rollout reduces to five steps:

Install a tracing SDK in your app.
Configure the export endpoint and token via environment variables.
Initialize tracing before your LLM clients are constructed.
Wrap each logical agent run in an agent span with stable identity.
Verify the trace tree in your dashboard or CLI.

Drop-In Setup with Catalyst Tracing

Catalyst ships two tracing SDKs with the same shape: @inference/tracing for TypeScript and inference-catalyst-tracing for Python 3.10+, sharing one setup() entry point, one env-var contract, and the same OpenInference attribute conventions. Here's the Python path end to end — the tracing quickstart covers TypeScript and every other variation:

# Install with the extras you use:
#   pip install 'inference-catalyst-tracing[openai]'
#
# Configure export before your app starts:
#   export CATALYST_OTLP_ENDPOINT="https://telemetry.inference.net"
#   export CATALYST_OTLP_TOKEN="<your-token>"   # from https://inference.net/dashboard/api-keys/
#   export CATALYST_SERVICE_NAME="checkout-agent"

import os

from inference_catalyst_tracing import setup
from openai import OpenAI

tracing = setup()
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Reply with just the word hello."}],
    max_tokens=16,
)

print(response.choices[0].message.content)
tracing.shutdown()

Three details matter here. The install string uses per-integration extras (openai, anthropic, langchain, langgraph, and so on, or all), so you only pull in what you use. The three env vars — CATALYST_OTLP_ENDPOINT pointed at https://telemetry.inference.net, CATALYST_OTLP_TOKEN, and a stable CATALYST_SERVICE_NAME per deployed service — are the entire export configuration. And setup() must run before your SDK clients are constructed, because instrumentation patches the client libraries themselves.
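Because the three variables are the entire export configuration, a preflight check at the top of your entrypoint can turn a silent "no traces" into an immediate error. This is a sketch of our own, assuming only the three variable names given above; missing_tracing_env is a hypothetical helper, and we are not claiming the SDK lacks its own validation.

```python
import os

REQUIRED_VARS = (
    "CATALYST_OTLP_ENDPOINT",
    "CATALYST_OTLP_TOKEN",
    "CATALYST_SERVICE_NAME",
)


def missing_tracing_env(env=None):
    """Return the names of required variables that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


# Example with an explicit mapping; pass nothing to check the real environment.
example_env = {
    "CATALYST_OTLP_ENDPOINT": "https://telemetry.inference.net",
    "CATALYST_OTLP_TOKEN": "",  # blank counts as missing
    "CATALYST_SERVICE_NAME": "checkout-agent",
}
missing = missing_tracing_env(example_env)
```

In a real service you would call missing_tracing_env() with no arguments before setup() and exit with a clear message if the list is not empty.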

If you'd rather not wire it by hand, inf instrument --mode tracing launches a coding agent (Claude Code, OpenCode, or Codex) that scans your codebase, installs the right packages, wires setup() into your entrypoint, and adds stable service and agent identity for you. Either way, verification is one command: inf trace list --range 1h, then inf trace get <trace-id> --view tree to walk the tree you just captured. From there, inf span list --trace-id <trace-id> --kind LLM narrows the tree to the model calls, and inf span get <trace-id> <span-id> --view io prints the exact input and output captured for one span — a quick way to confirm payload capture without opening the dashboard.

Auto-Instrumentation for Providers and Frameworks

In Python, setup() auto-detects every supported package you have installed. In TypeScript you pass modules explicitly — setup({ modules: { openai: OpenAI } }) — and can set autoInstrument: false when a production service should patch only the SDKs you list.

Coverage spans the stack most teams actually run: OpenAI, Anthropic, LangChain, LangGraph, a LangSmith bridge, Langfuse, Pydantic AI, OpenAI Agents, LiveKit Agents, ElevenLabs Agents, PI AI, Cursor SDK, Claude Agent SDK, and Claude Code SDK. The Vercel AI SDK is the one special case: it has its own experimental_telemetry option, so instead of patching, you pass createAISdkTelemetrySettings() into each generateText or streamText call.

Auto-instrumentation gets you LLM spans. The step that makes them agent observability is wrapping each logical run in an agent span, so the model and tool calls nest under one row with identity and session attached:

import { agentSpan, setup } from "@inference/tracing";
import OpenAI from "openai";

const tracing = await setup({ modules: { openai: OpenAI } });
const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

await agentSpan(
  {
    agentId: "refund-review-agent",
    agentName: "Refund Review Agent",
    spanName: "refund-review.run",
    sessionId: "conversation-refund-1842",
    userId: "user_8675309",
    role: "refunds",
    system: "openai",
  },
  async (span) => {
    const input = "Review refund request #1842.";
    span.setInput(input);
    const response = await client.responses.create({
      model: "gpt-4o-mini",
      input,
    });
    span.setOutput(response.output_text);
  },
);

await tracing.shutdown();

Set agentId, agentName, and sessionId once on the outer span; child spans created inside it inherit the identity.

Manual Spans for Everything Else

Real agents do work no SDK ever sees: tools you execute yourself, retrieval against your own store, routing logic, validation passes, CLI subprocesses. Manual spans put that work into the same tree, using the span kinds from the data-model section — TOOL, CHAIN, RETRIEVER, EMBEDDING.

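from types import SimpleNamespace

from inference_catalyst_tracing import (
    SpanKindValues,
    agent_span,
    manual_span,
    setup,
)

tracing = setup()


# Placeholder helpers standing in for your own retrieval and agent logic.
def retrieve(query):
    return [{"id": "policy-12", "text": "Refunds are allowed within 30 days."}]


def run_refund_review(docs):
    return SimpleNamespace(summary=f"Approved under {docs[0]['id']}.")


query = "refund policy for request #1842"

with agent_span(
    tracing.tracer,
    agent_id="refund-review-agent",
    span_name="refund-review.run",
) as span:
    span.set_input("Review refund request #1842")

    # manual_span authors TOOL / CHAIN / RETRIEVER / EMBEDDING spans.
    # It is opened inside the agent span so it nests under that run.
    with manual_span(
        tracing.tracer,
        name="rag.retrieve",
        span_kind=SpanKindValues.RETRIEVER,
        input={"query": query, "k": 8},
    ) as retrieval:
        docs = retrieve(query)
        retrieval.set_output(docs)

    decision = run_refund_review(docs)
    span.set_output(decision.summary)

tracing.shutdown()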

With the agent span as the parent, auto-captured LLM spans and manual tool spans land side by side, and the trace tree finally reflects everything the run did — not just the model calls.

AI Agent Observability Best Practices for Production

Instrumentation gets traces flowing. These five practices are what separate a demo from an observability setup you can run for a year without regretting it.

Group runs into sessions. Pass a conversation-stable session.id on the outer agent span — a chat thread ID, a ticket ID, whatever your calling code already has. Set it once; everything underneath inherits it, and your dashboard gains a "show me this whole conversation" view instead of disconnected runs.

Keep agent identity stable across deploys. Group-by-agent only works if agent.id survives renames, redeploys, and environment changes. Use readable slugs like support-triage-agent or billing-agent-prod; avoid display names with version suffixes, random UUIDs per process, or user-specific values. Catalyst reads agent.id first and falls back to agent.name only when no stable ID exists — relying on the fallback is how you end up with one agent shown as five. The agent identity guide covers multi-agent setups with agent.role.

Strip secrets before export. Full message capture is the point, and it means prompts, tool arguments, and tool results can carry API keys, tokens, and customer PII. Treat payload capture as a deliberate decision: redact known secret patterns at the boundary of your tools, keep credentials out of prompts in the first place, and review what a typical trace contains before you scale up retention.

Respect the flush lifecycle. Spans are batched and exported in the background, so the process shape dictates how you flush. Short-lived scripts call shutdown() before exit. Long-lived services call setup() once per process, memoize it, and call shutdown() only on SIGTERM. Serverless and edge runtimes freeze the process between invocations, so flush each invocation with forceFlush() and never tear down the provider a warm start will reuse. Most "we're missing traces" reports trace back to this paragraph.

Attribute cost where it's incurred. Because each span carries the model name and token counts, cost rolls up naturally: per step, per run, per session, per agent. The Agents overview turns that into run counts, error rate, latency, token usage, and cost over time without extra work on your side. When finance asks which agent doubled the bill last week, the answer is a filter, not a forensic project.
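Of the five practices, secret stripping is the one that most benefits from code at the tool boundary. The sketch below is a starting point under stated assumptions: the three patterns are illustrative and far from complete, and redact is a hypothetical helper you would call on tool arguments and results before passing them to a span's input or output setter. It does not cover payloads that auto-instrumentation captures on its own.

```python
import re

# Illustrative patterns only; extend with the credential formats your systems use.
SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9_-]{16,}"),                           # API-key-shaped strings
    re.compile(r"(?i)bearer\s+[A-Za-z0-9._~+/-]+=*"),               # bearer tokens
    re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),  # email addresses
]


def redact(value):
    """Recursively replace secret-looking substrings in strings, lists, and dicts."""
    if isinstance(value, str):
        for pattern in SECRET_PATTERNS:
            value = pattern.sub("[REDACTED]", value)
        return value
    if isinstance(value, dict):
        return {key: redact(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [redact(item) for item in value]
    return value  # numbers, booleans, None pass through unchanged


tool_args = {
    "headers": {"Authorization": "Bearer abc.def"},
    "customer": "jane@example.com",
    "limit": 3,
}
safe_args = redact(tool_args)
```

Pattern-based redaction only catches what you anticipated, which is why keeping credentials out of prompts in the first place remains the stronger control.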

From Traces to Evals: Scoring Quality with LLM-as-a-Judge

Traces and metrics tell you what happened. They cannot tell you whether the output was any good — an agent can be fast, cheap, error-free, and wrong. Evals close that gap, and the trace data you're now collecting is what feeds them.

The mechanism is rubric-based scoring with an LLM as the judge. A rubric is a plain-English description of a quality dimension — accuracy, helpfulness, format compliance — scored on a numeric scale; you can write one by hand, start from a template, or have one generated from your existing data. At eval time, the judge model receives the rubric, the conversation context, and the output to score, and returns a judgment with a numerical score in the rubric's range. Two practical rules: use the smartest model available as the judge, since a weak judge can't reliably distinguish quality, and remember judge calls are full LLM inferences with real cost. The mechanics are documented in how LLM-as-a-judge works.
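The judging step itself is small enough to sketch. What follows is one possible shape, not Catalyst's implementation: the prompt layout, the JSON reply format, and the 1 to 5 default range are assumptions, and fake_judge is a stand-in for a real model call so the example runs offline.

```python
import json


def build_judge_prompt(rubric, context, output, low=1, high=5):
    """Assemble the three inputs the judge needs, plus the score range."""
    return (
        f"Rubric: {rubric}\n"
        f"Score range: {low} to {high}\n"
        f"Conversation context:\n{context}\n"
        f"Output to score:\n{output}\n"
        'Respond with JSON: {"score": <integer>, "reasoning": "<one sentence>"}'
    )


def parse_judgment(raw, low=1, high=5):
    """Parse the judge's JSON reply and reject scores outside the rubric's range."""
    judgment = json.loads(raw)
    score = int(judgment["score"])
    if not low <= score <= high:
        raise ValueError(f"score {score} is outside {low}..{high}")
    return score, judgment.get("reasoning", "")


def fake_judge(prompt):
    """Stand-in for a real judge-model call."""
    return '{"score": 2, "reasoning": "Summarizes data the tool never returned."}'


prompt = build_judge_prompt(
    rubric="Accuracy: every claim is supported by the tool results.",
    context="User asked for the status of refund #1842. Tool returned [].",
    output="Your refund was approved yesterday.",
)
score, reasoning = parse_judgment(fake_judge(prompt))
```

Validating the score against the rubric's range matters in practice, because a judge is still a model and will occasionally return something off-scale.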

Where does eval data come from? Your traffic. Catalyst lets you filter captured production inferences — by task, model, status — and save the slice as an eval dataset, with a zero-overlap rule keeping eval and training data separate. Offline evals, available today, run that dataset through one or more models and compare scores: the standard way to test a prompt change or a model swap against real inputs before shipping it.
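The filter-then-save flow, including a zero-overlap check, can be approximated in a few lines. This sketch is our own and says nothing about how Catalyst enforces the rule internally: the record shape, the exact-match fingerprint, and the build_eval_dataset helper are all assumptions.

```python
import hashlib
import json


def fingerprint(messages):
    """Stable hash of a message list, used to detect exact duplicates."""
    canonical = json.dumps(messages, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_eval_dataset(inferences, training_fingerprints, *, task, status="success"):
    """Filter inferences by task and status, skipping anything seen in training."""
    selected = []
    seen = set(training_fingerprints)
    for record in inferences:
        if record["task"] != task or record["status"] != status:
            continue
        key = fingerprint(record["messages"])
        if key in seen:
            continue  # already in training data, or already selected
        seen.add(key)
        selected.append(record)
    return selected


def user_message(text):
    return [{"role": "user", "content": text}]


inferences = [
    {"task": "refund-review", "status": "success", "messages": user_message("Review refund #1842")},
    {"task": "refund-review", "status": "success", "messages": user_message("Review refund #1842")},
    {"task": "refund-review", "status": "success", "messages": user_message("Review refund #2001")},
    {"task": "billing", "status": "success", "messages": user_message("Explain invoice 77")},
]
training = {fingerprint(user_message("Review refund #2001"))}

eval_set = build_eval_dataset(inferences, training, task="refund-review")
```

Exact-match hashing is the simplest overlap test; near-duplicate inputs would need a fuzzier comparison.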

One honest caveat: online evaluation — scoring live production traffic continuously with sample-rate controls — is coming to Catalyst but isn't shipped yet. Today, continuous quality coverage on production traffic comes from a different direction: scheduled analysis of the traces themselves. That's the last piece of the loop.

Closing the Loop: Finding Recurring Failure Modes with Halo

Everything to this point — trace trees, captured payloads, sessions, evals — still leaves a gap. Inspecting traces one at a time finds the bug in that trace. It does not tell you that the same retrieval step returns empty results in 4% of runs, or that one tool's schema quietly drifted last Tuesday. Recurring failure modes live across runs, and humans don't read five hundred trace trees.

This is what Halo (Hierarchical Agent Loop Optimization) is for. It's an open-source, MIT-licensed engine that reads OpenTelemetry-compatible spans, decomposes traces across many runs to find systemic failure modes, and writes up concrete fixes. Each finding cites the specific trace IDs it came from, so you can click from a claim straight into the evidence. You can run it self-hosted (pip install halo-engine, point it at a JSONL trace file) or hosted inside the Catalyst Agents tab, where the Analysis sub-tab runs the same engine against traces you've already collected.

The working loop looks like this: pick an agent and a time window (a focused window beats a firehose), give Halo a prompt — the default asks it to identify anomalies, errors, inefficiencies, and opportunities to improve reliability, latency, cost, and tool usage, citing trace IDs for every key finding — then read the ranked report, apply the fixes, and re-run on a fresh window to confirm the issue is gone. You can also ask it pointed questions: "Which tool calls return empty results most often?" The reports are written to be pasted directly into a coding agent as a work brief, which is exactly how we use them.
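To show what a cross-run question like that reduces to, here is a hand-rolled aggregation over tool spans. It is emphatically not Halo's algorithm, just an illustration of why findings should carry trace IDs; the span records and the empty_result_report helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical tool spans collected across several runs.
tool_spans = [
    {"trace_id": "t1", "tool": "search_orders", "result": [{"id": 1842}]},
    {"trace_id": "t2", "tool": "search_orders", "result": []},
    {"trace_id": "t3", "tool": "search_orders", "result": []},
    {"trace_id": "t3", "tool": "get_policy", "result": {"days": 30}},
]


def empty_result_report(spans):
    """Rank tools by how often they return nothing, citing the affected traces."""
    calls = defaultdict(int)
    empty_traces = defaultdict(list)
    for span in spans:
        calls[span["tool"]] += 1
        if not span["result"]:
            empty_traces[span["tool"]].append(span["trace_id"])
    report = [
        {
            "tool": tool,
            "calls": count,
            "empty_rate": len(empty_traces[tool]) / count,
            "trace_ids": empty_traces[tool],
        }
        for tool, count in calls.items()
    ]
    return sorted(report, key=lambda row: row["empty_rate"], reverse=True)


report = empty_result_report(tool_spans)
```

Each row pairs a rate with the trace IDs behind it, so a claim like "search_orders returns nothing in two of three calls" can be checked against the evidence.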

Run analysis on a schedule and it stops being an investigation tool and becomes a regression net. Trace-to-fix is a workflow you can stand up this week, not a platform migration.

Schedules run hourly, daily, weekly, or monthly, with the lookback window pre-filled to match the cadence — a daily schedule reviews the last 24 hours — and any single window capped at 30 days. Each run lands a report in the analysis history; the analyze your traces guide walks through both the on-demand and scheduled flows. If you live in the terminal instead, the same engine is reachable as inf halo: kick off a one-off run, create a schedule, and read the report markdown without leaving your shell.

Figure 2: Halo Schedule Cadences and Default Lookback Windows - Lookback pre-fills to match the cadence; any single window caps at 30 days (720h)
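The cadence-to-lookback rule is easy to express as a lookup plus a cap. Treat the sketch below as a guess at the shape rather than a description of the product: only the daily default (24 hours) and the 720-hour cap come from the text, the other defaults are assumed to equal one cadence period, and whether an oversized window is clamped or rejected is also an assumption.

```python
MAX_LOOKBACK_HOURS = 30 * 24  # the 30-day (720h) cap on any single window

# Only the daily value is stated in the text; the rest are assumed to be
# the length of one cadence period.
DEFAULT_LOOKBACK_HOURS = {
    "hourly": 1,
    "daily": 24,
    "weekly": 7 * 24,
    "monthly": 30 * 24,
}


def lookback_hours(cadence, override=None):
    """Return the lookback window for a schedule, clamped to the cap."""
    hours = DEFAULT_LOOKBACK_HOURS[cadence] if override is None else override
    if hours <= 0:
        raise ValueError("lookback must be positive")
    return min(hours, MAX_LOOKBACK_HOURS)


daily_default = lookback_hours("daily")
clamped = lookback_hours("daily", override=1000)
```

Matching the window to the cadence means consecutive runs tile the timeline without gaps or heavy overlap, which keeps each report focused.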

A 30-Day Rollout Plan

You don't need a quarter-long project, and you shouldn't try to instrument everything at once: one production agent, fully traced end to end, teaches you more in a week than partial coverage of ten agents does in a month. Here's the four-week version, one focus per week.

Week	Focus	Actions	Done when
1	Instrument	Install SDK, `setup()`, agent spans, sessions	Trace trees visible per agent
2	Baseline	Review error rate, latency, tokens, cost	Dashboard answers cost questions
3	Evaluate	Write rubrics, build dataset from traffic, run offline evals	Quality scores on real inputs
4	Automate	Schedule recurring trace analysis, fix and re-run	Ranked findings arrive on schedule

By day 30 you have a defensible baseline: a production agent emitting full trace trees with stable identity, dashboards answering cost and error questions in seconds, a rubric-based eval suite built from your own traffic gating prompt changes, and a scheduled analysis run filing ranked failure-mode reports before your users file them. The loop (trace, analyze, fix, re-run) keeps compounding from there.

Conclusion

Agent observability isn't a dashboard you buy; it's a loop you operate. The data model is four ideas (spans, trace trees, sessions, agent identity) on open standards. Capture is a checklist (full messages, tool arguments and results, tokens, errors). Instrumentation is a few lines at your entrypoint plus an agent span around each run. And the payoff comes from what you do with the traces: evals that score quality against rubrics, and cross-run analysis that surfaces recurring failure modes with the trace IDs to prove them.

The fastest way to start is to get one agent's traces flowing today and let the rest of the rollout follow.

Trace your first agent run in minutes

Install the Catalyst tracing SDK and call setup() before your LLM clients are constructed. You get the full trace tree: agent, LLM, and tool spans with cost, latency, and token usage.

Start tracing

LLM Observability: Monitoring Production Deployments — the per-call layer of the same practice, covered in depth
LLM Evaluation Tools Comparison — if you're choosing an eval stack to pair with your tracing
Introducing Catalyst — the platform context behind the tracing, evals, and Halo loop

References

OpenTelemetry GenAI semantic conventions — https://opentelemetry.io/docs/specs/semconv/gen-ai/
OpenTelemetry GenAI agent span conventions — https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-agent-spans/
OpenInference specification (Arize) — https://github.com/Arize-ai/openinference
Halo (Hierarchical Agent Loop Optimization), open source — https://github.com/context-labs/halo
Catalyst Tracing documentation — https://docs.inference.net/integrations/traces/overview