LLM Tracing: What to Capture in Production Agents

What LLM Tracing Is (and Why Logs Aren't Enough)

Your agent failed in production last night. The request log shows a 200, a timestamp, and a latency number. What it doesn't show: the model called the same tool six times in a row, got an empty result every time, and burned 40,000 tokens before giving up. LLM tracing exists to make that visible, but only if you capture the right fields. Plenty of teams turn on tracing, ship it, and still can't answer basic questions because their traces are missing tool arguments, token counts, or the message history that drove each decision.

This guide is the capture checklist: a field-by-field walk through what a production agent trace must contain, grounded in OpenTelemetry and OpenInference conventions so it stays useful no matter which backend you use. It's one layer of a broader LLM observability practice, but tracing is the layer everything else builds on.

Here's the definition worth keeping: an LLM trace is a tree of spans that records every step an LLM application takes to serve one request — each model call, tool invocation, retrieval, and framework step — along with its timing, inputs, outputs, and metadata. The trace tells you not just that the app responded, but why it did what it did along the way.

Request logs can't do this because they're flat: one record per HTTP call, with no relationship between the six model calls that made up a single agent run. Traditional APM spans get you the tree shape but not the substance: they capture timing and status codes, and agent failures almost never live in status codes. The model that hallucinated a tool argument returned a 200. The loop that burned your token budget was six successful calls. Agent failures are content failures, so the trace has to preserve content: messages, arguments, results, and usage.

One warning before the checklist: in vanilla OpenTelemetry GenAI instrumentation, message content capture is opt-in and off by default, because messages can carry user data. Teams flip on tracing, never flip on content capture, and end up with beautiful timing trees that can't answer a single "why" question. Decide deliberately what content you capture and how you protect it. More on that below.

Anatomy of an Agent Trace

Agent tracing makes more sense walked through than defined, so take one concrete run. A user asks a support agent: "Where is my order?" Here's everything that happens, and the span each step should emit:

1. The request enters your agent. A root AGENT span opens, carrying the run-level input, the agent's identity, and the session it belongs to.
2. The agent calls the model to decide what to do. That's LLM span #1, with the full message history, model name, and token counts. The model responds with a tool call: lookup_order(order_id="ABC-123").
3. Your code executes the lookup. That's a TOOL span carrying the tool name, the JSON arguments, and the result it returned.
4. The agent fetches shipping policy docs to ground its answer. That's a RETRIEVER span: query in, documents out.
5. The agent calls the model again to compose the reply. LLM span #2, same capture as the first.
6. The root span closes with the run's final output.

Figure 1: Anatomy of an Agent Trace

One request, one trace, one tree. The tree shape is what makes the loop case debuggable: when the model calls lookup_order six times, you see six TOOL spans in sequence under one parent, each with its own arguments and result, instead of six interleaved log lines.
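
To make the tree shape concrete, here is a minimal sketch of that run as plain Python data. The `Span` class, the span names, and the `render` helper are hypothetical stand-ins for illustration; real spans come from your instrumentation, not from hand-built objects.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """Hypothetical in-memory span: a kind, a name, and child spans."""
    kind: str
    name: str
    children: list["Span"] = field(default_factory=list)


def render(span: Span, depth: int = 0) -> list[str]:
    """Return one indented line per span, parents before children."""
    lines = [f"{'  ' * depth}{span.kind} {span.name}"]
    for child in span.children:
        lines.extend(render(child, depth + 1))
    return lines


# The "Where is my order?" run: one root AGENT span with four children.
trace = Span("AGENT", "support-agent", [
    Span("LLM", "decide"),
    Span("TOOL", "lookup_order"),
    Span("RETRIEVER", "shipping_policy"),
    Span("LLM", "compose_reply"),
])

print("\n".join(render(trace)))
```

In the loop case, the same rendering would show six TOOL lines in a row under the one AGENT parent, which is the pattern a flat log hides.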

Each node in the tree declares a span kind, a fixed vocabulary that tells any viewer how to render it. OpenInference, the convention this article leans on, defines ten kinds in total; the six you'll touch daily are AGENT, LLM, TOOL, CHAIN, RETRIEVER, and EMBEDDING. A TOOL span gets an arguments-and-result panel, an LLM span gets a message view with token counts, a RETRIEVER span gets a query and document list. Here's what each kind must carry:

Span kind	Wraps	Must capture	Key wire attributes
AGENT	The whole run	Identity, run IO	`agent.id`, `session.id`, `input.value`
LLM	One model call	Messages, model, usage	`llm.model_name`, `llm.token_count.*`
TOOL	One tool execution	Name, args, result	`tool.name`, `tool_call.id`
CHAIN	Router/validator step	Step IO	`input.value`, `output.value`
RETRIEVER	Vector/doc lookup	Query, documents	`input.value`, `output.value`
EMBEDDING	Embedding call	Model, tokens	`llm.model_name`, token counts

Every span also carries openinference.span.kind plus start time, end time, and status.

Parenting comes free if you instrument in the right order: spans created inside an active agent span automatically nest under it and inherit its identity. The most common first-instrumentation mistake is the opposite: auto-captured LLM spans with no agent wrapper, which land as orphan rows with no run, agent, or conversation to group them by.
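
One way to catch the orphan problem early is to scan exported spans for LLM spans with no AGENT ancestor. The sketch below assumes spans are available as plain dicts with hypothetical `span_id`, `parent_id`, and `kind` keys; your backend's export format will likely differ.

```python
def find_orphan_llm_spans(spans: list[dict]) -> list[str]:
    """Return IDs of LLM spans that have no AGENT span among their ancestors."""
    by_id = {span["span_id"]: span for span in spans}

    def has_agent_ancestor(span: dict) -> bool:
        parent_id = span.get("parent_id")
        seen = set()  # guards against malformed, cyclic parent links
        while parent_id in by_id and parent_id not in seen:
            seen.add(parent_id)
            parent = by_id[parent_id]
            if parent["kind"] == "AGENT":
                return True
            parent_id = parent.get("parent_id")
        return False

    return [
        span["span_id"]
        for span in spans
        if span["kind"] == "LLM" and not has_agent_ancestor(span)
    ]


spans = [
    {"span_id": "a1", "parent_id": None, "kind": "AGENT"},
    {"span_id": "l1", "parent_id": "a1", "kind": "LLM"},
    {"span_id": "l2", "parent_id": None, "kind": "LLM"},  # no agent wrapper
]
print(find_orphan_llm_spans(spans))
```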

What to Capture on Every LLM Call

The LLM span is where most of the debugging value lives, and most of the capture mistakes happen. Field by field, with the question each one answers:

Full message history. Every message the model saw — system, user, assistant, and tool messages — recorded as indexed attributes like llm.input_messages.0.message.role and llm.input_messages.0.message.content, with output messages mirrored the same way. This answers the most fundamental question: what did the model actually see? Bad outputs are usually explained by the input: a stale system prompt, a tool result that came back malformed, a context window that silently dropped the one message that mattered.
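
Auto-instrumentation normally does this flattening for you, but seeing it spelled out helps when reading raw attributes. A minimal sketch, assuming chat messages shaped as role/content dicts; the `flatten_messages` helper is illustrative and not part of any SDK.

```python
def flatten_messages(messages: list[dict], prefix: str = "llm.input_messages") -> dict[str, str]:
    """Flatten a chat message list into indexed span attributes."""
    attrs: dict[str, str] = {}
    for i, message in enumerate(messages):
        attrs[f"{prefix}.{i}.message.role"] = message["role"]
        # Assistant messages that only carry tool calls can have no content.
        attrs[f"{prefix}.{i}.message.content"] = message.get("content") or ""
    return attrs


messages = [
    {"role": "system", "content": "You are a support agent."},
    {"role": "user", "content": "Where is my order? Order ABC-123."},
]
for key, value in flatten_messages(messages).items():
    print(key, "=", value)
```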

Model and parameter metadata. The model name as the API echoed it back (llm.model_name), the provider, and the invocation parameters as JSON (llm.invocation_parameters: temperature, max_tokens, and friends). This answers: did the deploy change the model or its settings? When quality drops on Tuesday, the first thing to rule out is that Tuesday's release quietly swapped models or bumped temperature.

Token counts, per span. Prompt, completion, and total counts, plus the detail fields: cache reads and writes (llm.token_count.prompt_details.cache_read / cache_write) and reasoning tokens (llm.token_count.completion_details.reasoning). This answers: why did costs spike? An aggregate bill tells you costs doubled; per-span counts tell you it was the planning step in one agent whose prompt grew 30 messages long. Cache details tell you whether your prompt caching is actually hitting.
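
As a rough illustration of what per-span counts buy you, the sketch below sums tokens by step and computes a cache hit ratio. It assumes spans exported as dicts with hypothetical `name` and `attributes` keys, and it assumes the prompt count includes cached tokens, which varies by provider.

```python
from collections import defaultdict


def tokens_by_step(spans: list[dict]) -> dict[str, int]:
    """Sum prompt + completion tokens per span name, largest first."""
    totals: dict[str, int] = defaultdict(int)
    for span in spans:
        attrs = span["attributes"]
        totals[span["name"]] += (
            attrs.get("llm.token_count.prompt", 0)
            + attrs.get("llm.token_count.completion", 0)
        )
    return dict(sorted(totals.items(), key=lambda item: item[1], reverse=True))


def cache_hit_ratio(attrs: dict) -> float:
    """Fraction of prompt tokens served from cache (0.0 when there is no prompt)."""
    prompt = attrs.get("llm.token_count.prompt", 0)
    cached = attrs.get("llm.token_count.prompt_details.cache_read", 0)
    return cached / prompt if prompt else 0.0


spans = [
    {"name": "plan", "attributes": {
        "llm.token_count.prompt": 9000,
        "llm.token_count.completion": 400,
        "llm.token_count.prompt_details.cache_read": 4500,
    }},
    {"name": "reply", "attributes": {
        "llm.token_count.prompt": 1200,
        "llm.token_count.completion": 300,
    }},
    {"name": "plan", "attributes": {
        "llm.token_count.prompt": 11000,
        "llm.token_count.completion": 600,
    }},
]
print(tokens_by_step(spans))
print(cache_hit_ratio(spans[0]["attributes"]))
```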

Latency and streaming. Span duration comes free with any tracing. For streamed calls, capture the streaming flag and time to first chunk — OpenTelemetry's conventions define gen_ai.response.time_to_first_chunk for exactly this. Total duration answers throughput questions; time-to-first-chunk answers the "feels slow" complaints.
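
If your instrumentation does not record time to first chunk, it can be measured by wrapping the stream. This is a minimal sketch with a hypothetical `timed_stream` helper and a fake stream standing in for a provider response; the caller captures `started_at` just before issuing the request so that network wait is included.

```python
import time
from typing import Iterable, Iterator


def timed_stream(chunks: Iterable[str], timings: dict, started_at: float) -> Iterator[str]:
    """Yield chunks unchanged while recording time to first chunk and total duration."""
    for chunk in chunks:
        if "time_to_first_chunk" not in timings:
            timings["time_to_first_chunk"] = time.perf_counter() - started_at
        yield chunk
    timings["duration"] = time.perf_counter() - started_at


def fake_stream() -> Iterator[str]:
    """Stand-in for a streamed model response."""
    time.sleep(0.01)  # simulated wait before the first chunk arrives
    yield "Your order"
    yield " is in transit."


timings: dict = {}
started_at = time.perf_counter()
text = "".join(timed_stream(fake_stream(), timings, started_at))
print(text, timings)
```

The two numbers would then be set as span attributes alongside the streaming flag.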

Finish reason and errors. The provider's finish reason (llm.finish_reason) distinguishes a natural stop from a length cutoff from a tool-call handoff, and failed spans should carry exception status and error details. This answers: did the model finish, or did we truncate it? Truncation bugs hide here. Outputs that look mysteriously incomplete are usually length finish reasons nobody looked at.
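
A quick audit for silent truncation might look like the sketch below. It assumes spans exported as dicts with hypothetical `span_id` and `attributes` keys. The finish-reason strings are provider-specific: `length` is what OpenAI's Chat Completions API reports and `max_tokens` is Anthropic's equivalent, so extend the set for other providers.

```python
# Provider-specific values that mean "the output hit a token limit".
TRUNCATION_REASONS = {"length", "max_tokens"}


def truncated_spans(spans: list[dict]) -> list[str]:
    """Return IDs of LLM spans whose output was cut off by a token limit."""
    return [
        span["span_id"]
        for span in spans
        if span["attributes"].get("llm.finish_reason") in TRUNCATION_REASONS
    ]


spans = [
    {"span_id": "s1", "attributes": {"llm.finish_reason": "stop"}},
    {"span_id": "s2", "attributes": {"llm.finish_reason": "length"}},
    {"span_id": "s3", "attributes": {"llm.finish_reason": "tool_calls"}},
]
print(truncated_spans(spans))
```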

The tool calls the model requested. When the model emits a tool call, record the function name, arguments, and call ID on the LLM span itself (...tool_calls.0.tool_call.function.name, .arguments, .id). This is the model's decision, distinct from your code's execution of it, and you need both sides to debug a disagreement between them.

The full checklist in one place, with the wire-format keys from the attribute reference:

Capture	Wire attribute(s)	Question it answers
Message history	`llm.input_messages.{i}.*`, `llm.output_messages.{i}.*`	What did the model see?
Model name	`llm.model_name`	Did the deploy swap models?
Parameters	`llm.invocation_parameters`	Did settings change?
Token counts	`llm.token_count.prompt` / `.completion` / `.total`	Why did costs spike?
Cache usage	`llm.token_count.prompt_details.cache_read` / `.cache_write`	Is prompt caching hitting?
Reasoning tokens	`llm.token_count.completion_details.reasoning`	Where do thinking tokens go?
Finish reason	`llm.finish_reason`	Did we truncate the output?
Streaming + TTFC	`llm.streaming`, `gen_ai.response.time_to_first_chunk`	Why does it feel slow?
Requested tool calls	`...tool_calls.{j}.tool_call.function.name` / `.arguments` / `.id`	What did the model decide?
Errors	exception status, error details	What failed, exactly?

Tool Calls and Custom Spans

If you capture only one thing beyond the LLM calls, capture tool inputs and outputs. The classic undebuggable failure is the agent loop: the model calls the same tool repeatedly and won't move on. With LLM spans alone you see six near-identical model calls and no cause. With TOOL spans you see the cause immediately — the tool kept returning an empty result, and the model kept retrying because nothing in its context told it the lookup would never succeed.
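
Once TOOL spans exist, the loop is also easy to detect mechanically. A minimal sketch, assuming the TOOL spans of one run are available in order as dicts with hypothetical `tool_name` and `arguments` keys:

```python
import json


def longest_repeat(tool_spans: list[dict]) -> tuple:
    """Return (tool name, count) for the longest run of consecutive identical calls."""
    best = (None, 0)
    previous_key, run = None, 0
    for span in tool_spans:
        # Same tool and same arguments (key order ignored) counts as a repeat.
        key = (span["tool_name"], json.dumps(span["arguments"], sort_keys=True))
        run = run + 1 if key == previous_key else 1
        previous_key = key
        if run > best[1]:
            best = (span["tool_name"], run)
    return best


looping_run = [
    {"tool_name": "lookup_order", "arguments": {"order_id": "ABC-123"}, "result": []}
    for _ in range(6)
]
print(longest_repeat(looping_run))
```

A threshold on the count (three identical calls in a row, say) is a reasonable starting point for an alert, though the right number depends on the tool.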

A TOOL span needs the tool name, the call ID, the arguments as structured input, and the result as output. The call ID is the thread that ties decision to execution: the LLM span records that the model requested lookup_order with call ID abc123, and the TOOL span with the matching tool_call.id records what actually happened when your code ran it. When the model passes garbage arguments, the LLM span shows it. When the tool returns garbage, the TOOL span shows it.
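
The join on call ID can be sketched in a few lines. The dict shapes here are assumptions for illustration: requested calls as the model emitted them, and TOOL spans keyed by `tool_call.id`.

```python
def pair_requests_with_executions(requested_calls: list[dict], tool_spans: list[dict]) -> list[tuple]:
    """Pair each tool call the model requested with the TOOL span that ran it, if any."""
    executed = {span["tool_call.id"]: span for span in tool_spans}
    return [(call, executed.get(call["id"])) for call in requested_calls]


requested = [
    {"id": "abc123", "name": "lookup_order", "arguments": '{"order_id": "ABC-123"}'},
    {"id": "def456", "name": "lookup_order", "arguments": '{"order_id": "ABC-124"}'},
]
tool_spans = [
    {"tool_call.id": "abc123", "tool.name": "lookup_order", "output": []},
]

for call, span in pair_requests_with_executions(requested, tool_spans):
    status = "never executed" if span is None else f"returned {span['output']!r}"
    print(call["id"], call["name"], status)
```

A request with no matching execution points at your dispatch code; an execution with a bad output points at the tool.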

Auto-instrumentation can only see SDK calls, though. Your retrieval step, your router, your validation pass, your subprocess — the tracing SDK never sees them unless you wrap them. That's what manual spans are for: RETRIEVER spans for vector search (query in, documents out), CHAIN spans for routing and post-processing logic, EMBEDDING spans for direct embedding calls. Here's a run with an agent wrapper, an auto-captured LLM call, and manually wrapped tool and retrieval steps:

import {
  SpanKindValues,
  agentSpan,
  manualSpan,
  setup,
} from "@inference/tracing";
import OpenAI from "openai";

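// conversationId, lookupOrder, and retrieve are application code defined
// elsewhere; they are not part of the tracing SDK.
const tracing = await setup({ modules: { openai: OpenAI } });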
const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

await agentSpan(
  {
    agentId: "support-agent",
    agentName: "Support Agent",
    sessionId: conversationId, // stable per conversation, not per process
  },
  async (span) => {
    span.setInput("Where is my order? Order ABC-123.");

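    // Auto-captured LLM span: the patched OpenAI client emits it on its own,
    // with messages, model, token counts, and any requested tool calls.
    // (Tool definitions are omitted here for brevity; pass `tools` so the
    // model can actually request lookup_order.)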
    const decision = await client.chat.completions.create({
      model: "gpt-4o-mini",
      messages: [
        { role: "user", content: "Where is my order? Order ABC-123." },
      ],
    });

    // TOOL span: your code executes the lookup the model asked for.
    // tool_call.id ties this execution back to the model's request.
    const order = await manualSpan(
      {
        spanName: "lookup_order.tool",
        spanKind: SpanKindValues.TOOL,
        toolName: "lookup_order",
        toolCallId: decision.choices[0]?.message.tool_calls?.[0]?.id,
        input: { order_id: "ABC-123" },
      },
      async (toolSpan) => {
        const result = await lookupOrder("ABC-123");
        toolSpan.setOutput(result);
        return result;
      },
    );

    // RETRIEVER span: query in, documents out.
    await manualSpan(
      {
        spanName: "vector_store.search",
        spanKind: SpanKindValues.RETRIEVER,
        input: { query: "shipping policy", k: 8 },
      },
      async (ragSpan) => {
        const docs = await retrieve("shipping policy");
        ragSpan.setOutput(docs);
      },
    );

    span.setOutput(`Order ${order.id} is in transit.`);
  },
);

await tracing.shutdown();

The goal is a tree that reflects everything the run did, not just the model calls. Any step that could explain a failure deserves a span.

Sessions, Users, and Agent Identity

Everything so far makes a single trace debuggable. The grouping metadata is what makes ten thousand traces analyzable, and it's the part most instrumentation skips.

Session ID. A multi-turn conversation produces one trace per turn. Setting session.id to a stable conversation identifier groups those traces into one session, so you can replay the whole exchange in order. Pass it per conversation from your calling code. A session ID set once at process startup merges every user's conversation into one meaningless blob.
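
What a backend does with that field can be sketched as a group-and-sort. The trace dicts and the `start_time` key are hypothetical; the point is that one stable `session.id` per conversation is all the grouping needs.

```python
from collections import defaultdict


def group_by_session(traces: list[dict]) -> dict[str, list[dict]]:
    """Group traces by session.id and order each session's turns by start time."""
    sessions: dict[str, list[dict]] = defaultdict(list)
    for trace in traces:
        sessions[trace["session.id"]].append(trace)
    for turns in sessions.values():
        turns.sort(key=lambda trace: trace["start_time"])
    return dict(sessions)


traces = [
    {"trace_id": "t2", "session.id": "conversation-4812", "start_time": 20},
    {"trace_id": "t1", "session.id": "conversation-4812", "start_time": 10},
    {"trace_id": "t3", "session.id": "conversation-9001", "start_time": 15},
]
for session_id, turns in group_by_session(traces).items():
    print(session_id, [turn["trace_id"] for turn in turns])
```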

User ID. Setting user.id lets you pull up every trace for one user. When a customer writes in with "the assistant gave me wrong information yesterday," this field is the difference between finding that conversation in seconds and grepping timestamps.

Stable agent identity. Dashboards group agent runs by agent.id, falling back to agent.name when no stable ID exists. The ID needs to survive deploys, renames, and environment changes: support-triage-agent is a good ID; a random UUID per process or a name like SupportAgent v2 is not, because every release shatters your trend lines and the regression you're hunting becomes invisible across the rename. Keep the display name human-readable and changeable; keep the ID boring and permanent. In multi-agent systems, give each agent its own ID and a role like triage or refunds.
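
A lint-style check can catch unstable IDs before they ship. The rules below are one possible house convention (lowercase hyphenated slugs, no UUIDs, no version markers), not a requirement of any tracing spec.

```python
import re

UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}", re.IGNORECASE
)
VERSION_RE = re.compile(r"(^|[-_\s])v\d+($|[-_.\s])", re.IGNORECASE)
SLUG_RE = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")


def agent_id_problems(agent_id: str) -> list[str]:
    """Return reasons an agent ID is unlikely to stay stable across releases."""
    problems = []
    if UUID_RE.search(agent_id):
        problems.append("contains a UUID")
    if VERSION_RE.search(agent_id):
        problems.append("contains a version marker")
    if not SLUG_RE.match(agent_id):
        problems.append("not a lowercase hyphenated slug")
    return problems


for candidate in ["support-triage-agent", "SupportAgent v2"]:
    print(candidate, agent_id_problems(candidate))
```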

Set all of this once on the outer agent span; children inherit it. The agent identity guide covers the conventions in detail.

Standards That Keep Traces Portable

Instrumentation is an investment, and the way to protect it is to emit spans in a format more than one backend understands. Two convention sets matter, and the OpenTelemetry semantic conventions for LLM workloads deserve the most careful reading, because their status is widely misunderstood.

OpenTelemetry GenAI semantic conventions

OpenTelemetry's GenAI conventions define the official attribute vocabulary for model calls: gen_ai.operation.name and gen_ai.provider.name are required, gen_ai.request.model is conditionally required when available, and gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.request.temperature, and gen_ai.response.finish_reasons sit in the recommended set, with span names following the pattern {operation} {model}. A separate convention covers agent and framework spans.

Two things to know before betting on them. First, the conventions are still in development status — experimental, not stable — as of mid-2026, with migrations gated behind the OTEL_SEMCONV_STABILITY_OPT_IN environment variable. Attribute names can still change. Second, as noted earlier, content capture (gen_ai.input.messages, gen_ai.output.messages, gen_ai.system_instructions) is opt-in by design, and instrumentations are told not to capture it by default. If you adopt OTel GenAI instrumentation and skip the opt-in, your traces will have token counts but no messages.

OpenInference

OpenInference is an OpenTelemetry-compatible convention maintained by Arize, purpose-built for LLM applications: it defines the span kinds from earlier, requires openinference.span.kind on every span, and flattens structured data like message lists into the indexed attributes this article has been quoting. It's the convention most LLM tracing tools render richly today, which makes it the practical choice while OTel GenAI stabilizes. The two aren't enemies: OpenInference rides on ordinary OpenTelemetry spans and exports over OTLP like everything else.

The practical guidance: emit OTLP, pick the convention your backend actually renders, and prefer instrumentation that doesn't lock you in. Catalyst's tracing SDKs, for example, emit OpenInference attributes byte-identical to the upstream spec, so any OpenInference-aware viewer can read the spans without configuration.
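
Because both conventions are just attribute names on ordinary spans, moving between them is mostly a renaming exercise. The mapping below is a hand-written illustration covering a few of the attributes named in this article, not an official translation table, and the OTel names may still change while the conventions are in development.

```python
# Illustrative OpenInference -> OTel GenAI attribute mapping (incomplete by design).
OPENINFERENCE_TO_GENAI = {
    "llm.model_name": "gen_ai.response.model",
    "llm.token_count.prompt": "gen_ai.usage.input_tokens",
    "llm.token_count.completion": "gen_ai.usage.output_tokens",
}


def to_genai(attrs: dict) -> dict:
    """Rename the mapped keys and leave every other attribute untouched."""
    out = {OPENINFERENCE_TO_GENAI.get(key, key): value for key, value in attrs.items()}
    # OTel models finish reasons as a list, so wrap the single value.
    if "llm.finish_reason" in out:
        out["gen_ai.response.finish_reasons"] = [out.pop("llm.finish_reason")]
    return out


print(to_genai({
    "llm.model_name": "gpt-4o-mini",
    "llm.token_count.prompt": 120,
    "llm.token_count.completion": 45,
    "llm.finish_reason": "stop",
    "session.id": "conversation-4812",
}))
```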

Bridging from LangSmith or Langfuse

Already instrumented? You probably don't need to start over. LangSmith supports OpenTelemetry export, so apps instrumented with traceable can route spans through an OTel tracer provider by setting LANGSMITH_TRACING_MODE=otel; Catalyst's LangSmith bridge picks those up directly. Langfuse's SDKs are OpenTelemetry-native, and a Langfuse-instrumented app can point at a different OTLP-speaking backend with environment variables alone. The Langfuse bridge is exactly that — three env vars, no code changes:

# Keep your Langfuse tracing code. Point the SDK at Catalyst instead:
export LANGFUSE_HOST="https://telemetry.inference.net"
export LANGFUSE_PUBLIC_KEY="pk-catalyst"           # compatibility only; value unused
export LANGFUSE_SECRET_KEY="<your-catalyst-api-key>"

Capturing All of This in Practice

The checklist above sounds like a lot of instrumentation work. It mostly isn't, because the LLM-span fields — messages, model, parameters, tokens, finish reasons, requested tool calls — are exactly what auto-instrumentation captures by patching the SDKs you already use. With Catalyst Tracing, the setup is three environment variables and one call:

# Install:
#   pip install 'inference-catalyst-tracing[openai]'
#
# Configure export before the app starts:
#   export CATALYST_OTLP_ENDPOINT="https://telemetry.inference.net"
#   export CATALYST_OTLP_TOKEN="<your-token>"   # from inference.net/dashboard/api-keys
#   export CATALYST_SERVICE_NAME="support-agent"

import os

from inference_catalyst_tracing import agent_span, setup
from openai import OpenAI

tracing = setup()  # auto-detects installed SDKs and patches them
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

with agent_span(
    tracing.tracer,
    agent_id="support-agent",
    agent_name="Support Agent",
    session_id="conversation-4812",
) as span:
    span.set_input("Where is my order? Order ABC-123.")
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "user", "content": "Where is my order? Order ABC-123."}
        ],
    )
    span.set_output(response.choices[0].message.content or "")

tracing.shutdown()  # flush batched spans before the process exits

setup() has to run before your clients are constructed; in Python it auto-detects installed packages, while TypeScript takes the modules to patch explicitly. Auto-instrumentation covers OpenAI, Anthropic, the Vercel AI SDK, LangChain, LangGraph, Pydantic AI, OpenAI Agents, Claude Agent SDK, and more; the supported integrations list is the current inventory. The agent span and any custom tool spans are the part you add by hand, and the earlier sections covered why they're worth it.

Two production concerns to settle before rollout:

Flushing. Spans are batched and exported in the background, so a process that exits or freezes before the batch flushes silently drops them. Scripts should call shutdown() before exit; long-lived services call setup() once per process and shutdown() on SIGTERM; serverless functions flush per invocation with forceFlush() because the runtime freezes the process between invocations. This is the number-one cause of "I instrumented everything and see nothing."
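
For a long-lived Python service, the SIGTERM case can be sketched with the standard library alone. The `shutdown` callable stands in for whatever flush method your tracing SDK exposes; the helper names are hypothetical.

```python
import signal
import sys
from typing import Callable


def make_sigterm_handler(shutdown: Callable[[], None]):
    """Build a signal handler that flushes pending spans, then exits cleanly."""
    def handler(signum, frame):
        shutdown()   # flush batched spans before the process goes away
        sys.exit(0)
    return handler


def install_flush_on_sigterm(shutdown: Callable[[], None]) -> None:
    """Register the handler; call once from the main thread at startup."""
    signal.signal(signal.SIGTERM, make_sigterm_handler(shutdown))


# Usage, assuming `tracing` is the object returned by your SDK's setup call:
#   install_flush_on_sigterm(tracing.shutdown)
```

Serverless runtimes need a different shape, a flush at the end of each invocation, because no termination signal arrives before the process is frozen.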

Secrets and retention. Traces capture content, and content can include things that shouldn't live in a dashboard. Keep credentials and API keys out of span attributes and tool arguments where you control them, and know your backend's posture: Catalyst doesn't use request data for model training by default, strips secrets where possible, encrypts data in transit and at rest, and supports custom retention policies down to no-retention handling. Whatever backend you choose, retention should match operational need. Traces are production data, not exhaust.
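
Where you control the capture path, a small scrubber applied to tool arguments before they become span attributes covers the obvious cases. This is a sketch under assumptions: the key pattern is illustrative and deliberately narrow, and it only catches secrets stored under telltale key names, not ones embedded in free text.

```python
import re

# Key names that usually hold credentials; extend for your own payloads.
SENSITIVE_KEY = re.compile(
    r"(api[_-]?key|secret|password|authorization|access[_-]?token)", re.IGNORECASE
)


def redact(value):
    """Recursively replace values stored under sensitive-looking keys."""
    if isinstance(value, dict):
        return {
            key: "[REDACTED]" if SENSITIVE_KEY.search(key) else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value


tool_arguments = {
    "order_id": "ABC-123",
    "carrier": {"name": "acme", "api_key": "sk-live-123"},
    "headers": [{"Authorization": "Bearer abc"}],
}
print(redact(tool_arguments))
```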

If you'd rather not wire any of this by hand, the quickstart includes an agent-assisted install that scans your codebase and wires up setup() for you.

From Inspecting Traces to Fixing Failures

Everything above makes individual runs inspectable. But production agents rarely fail once — they fail in patterns, and patterns hide across hundreds of traces no human is going to read one by one.

That's the gap cross-run analysis fills. Halo, an open-source analysis engine, reads OpenTelemetry-compatible spans across many runs, decomposes them, and surfaces recurring failure modes. Each finding comes with a concrete recommended fix and the trace IDs behind it, so you can click from "the lookup tool returns empty results for international orders" straight into the runs that prove it. It runs self-hosted via pip install halo-engine, or hosted inside the Catalyst Agents dashboard, on demand or on a schedule against the traces you're already collecting. The workflow is a loop: analyze your traces, apply the fixes, capture a new window, run it again.

And this is where capture quality stops being an abstraction. An analysis engine can't reason about tool loops it can't see, cost spikes without token counts, or cross-run patterns without stable agent IDs. Every field in this article's checklist is something analysis — human or automated — will eventually ask for.
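
To show how captured fields feed that kind of analysis, here is a deliberately simple cross-run check. It is not how Halo works; it is a hand-rolled sketch that assumes TOOL spans exported as dicts with hypothetical `trace_id`, `tool_name`, and `result` keys, and it only looks for one pattern: tools that return empty results in multiple runs.

```python
from collections import defaultdict


def empty_result_patterns(tool_spans: list[dict], min_runs: int = 2) -> dict[str, list[str]]:
    """Map tool name -> trace IDs where it returned an empty result,
    keeping only tools that did so in at least `min_runs` distinct traces."""
    traces_by_tool: dict[str, set] = defaultdict(set)
    for span in tool_spans:
        if span["result"] in (None, "", [], {}):
            traces_by_tool[span["tool_name"]].add(span["trace_id"])
    return {
        tool: sorted(trace_ids)
        for tool, trace_ids in traces_by_tool.items()
        if len(trace_ids) >= min_runs
    }


tool_spans = [
    {"trace_id": "t1", "tool_name": "lookup_order", "result": []},
    {"trace_id": "t2", "tool_name": "lookup_order", "result": []},
    {"trace_id": "t3", "tool_name": "lookup_order", "result": [{"id": "ABC-123"}]},
    {"trace_id": "t1", "tool_name": "issue_refund", "result": None},
]
print(empty_result_patterns(tool_spans))
```

Without the tool result on the span, even this trivial check has nothing to read.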

Conclusion

LLM tracing comes down to capturing, for every production run, the things you'll want at 2 a.m.: the full message history (what did the model see?), model and parameter metadata (did the deploy change something?), per-span token counts (why did costs spike?), tool arguments and results (why did it loop?), session and user IDs (what happened in this conversation?), and a stable agent identity (is this week worse than last week?). Emit it all as OpenTelemetry spans with conventions your tools understand, flush correctly for your process shape, and treat the captured content with the care production data deserves. Instrument once, and every debugging session afterward starts from evidence instead of guesswork.

LLM Observability: Monitoring Production Deployments — the monitoring layer that surrounds tracing
LLM Evaluation Tools Comparison — what to run on your traces once they're captured
LLM Cost Optimization — acting on the token counts your traces surface