Claude Agent SDK Tracing and Evaluation in Production (2026)

The Claude Agent SDK (formerly Claude Code SDK, renamed September 2025) has excellent documentation for building agents, and very little for running them in production. Once an agent handles real traffic, you need tracing: which tools each run called, what every step cost, why run #4,812 burned 40 turns, and which prompt change broke delegation. This guide covers that missing layer for the claude-agent-sdk Python package (v0.2.95) and @anthropic-ai/claude-agent-sdk for TypeScript (v0.3.170), as of June 2026.

Tracing a Claude Agent SDK run means recording one span per query() call, opened before the init message and closed on the ResultMessage, with child spans for each LLM step and tool call, correlated by session_id, tool-use IDs, and parent_tool_use_id. It is agent tracing applied to the SDK's specific message surface.

By the end you'll be able to trace every run, debug failures straight from ResultMessage subtypes, and turn real traffic into eval datasets and regression gates.


What the Claude Agent SDK Emits: The Message Stream

A query() call yields a typed message stream: an init SystemMessage, alternating AssistantMessage/UserMessage pairs for each model turn and tool call, optional stream, rate-limit, and task events, and a terminal ResultMessage carrying cost, usage, and outcome. Everything you can trace lives in that stream, so the field names below are the reference backbone for the rest of this article.

One architecture fact first. The SDK spawns the Claude Code CLI as a child process and talks to it over a local pipe. The agent loop, the tools, and all built-in telemetry live in the CLI; the SDK itself produces no telemetry. That's why naively auto-instrumenting the Anthropic client in your process captures nothing. The model calls happen in the subprocess. We'll deal with the consequences in the next section.

The init `SystemMessage`: where every trace gets its `session_id`

The first message of every run is a SystemMessage with subtype == "init". Its data dict carries session_id, model, tools, mcp_servers, permissionMode, and claude_code_version. Capture all of these as span attributes immediately. session_id is the correlation key for everything that follows, and claude_code_version lets you bisect regressions after an SDK upgrade.

`AssistantMessage`, tool calls, and `message_id` deduplication

Each model step arrives as an AssistantMessage whose content is a list of blocks: TextBlock, ThinkingBlock, and ToolUseBlock {id, name, input}. The matching ToolResultBlock {tool_use_id, content, is_error} comes back inside a UserMessage, which also exposes the result as tool_use_result. Server-side tools like web_search use ServerToolUseBlock/ServerToolResultBlock instead. The API executes those, so no client-side result comes back.

The fields that matter for tracing sit on the message itself: model, usage, message_id, stop_reason, parent_tool_use_id, and error. One trap to flag now: parallel tool calls produce multiple assistant messages that share a single message_id with identical usage. Sum usage without deduplicating and you double-count tokens (full treatment in the cost section).

`ResultMessage`: subtypes, cost, and `duration_ms` vs `duration_api_ms`

Every query() call terminates with a ResultMessage. Its key fields: subtype, is_error, num_turns, session_id, total_cost_usd, usage, model_usage, result, structured_output, permission_denials, errors, and api_error_status (HTTP status of a failing API call, emitted by CLI v2.1.110+).

The pair worth watching is duration_ms versus duration_api_ms: wall clock versus time actually spent in API calls. The gap is tool execution plus permission waits, which makes it a first-class latency signal. And one gotcha to internalize: is_error can be True while subtype is "success" when the failure happened at the API level. Check both, plus api_error_status.

Subtype	Meaning	What to check
`success`	Run completed — but `is_error` can still be `True` on API-level failures	`api_error_status` (429/500/529) whenever `is_error` is `True`
`error_max_turns`	Hit the `max_turns` cap — usually a loop or an underpowered model	`num_turns` vs peer runs, and the trajectory (which tools repeated)
`error_during_execution`	The run failed mid-execution	`errors`, tool failures (`ToolResultBlock.is_error`), `AssistantMessage.error`
`error_max_budget_usd`	The cost breaker tripped	`total_cost_usd` vs the `max_budget_usd` you set
`error_max_structured_output_retries`	Output never matched the `output_format` schema	The schema itself — too strict, or the model needs guidance
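A minimal sketch of that three-field check, written against plain arguments rather than the SDK's ResultMessage so it runs anywhere. The function name and the outcome labels are illustrative assumptions, not SDK API; the status set is simply the three codes listed in the table.

```python
from typing import Optional

# HTTP statuses the table above lists for API-level failures.
API_FAILURE_STATUSES = {429, 500, 529}


def classify_result(subtype: str, is_error: bool,
                    api_error_status: Optional[int] = None) -> str:
    """Collapse subtype, is_error, and api_error_status into one label.

    The labels are hypothetical; pick names that suit your alerting.
    """
    if subtype == "success":
        if not is_error:
            return "ok"
        # "success" with is_error=True: the failure happened at the API level.
        if api_error_status in API_FAILURE_STATUSES:
            return f"api_error_{api_error_status}"
        return "api_error_unknown"
    # Every other subtype already names its failure class.
    return subtype
```

Recording the returned label as a span attribute lets you alert on the combination instead of on the subtype alone.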

The long tail of the stream: StreamEvent (only with include_partial_messages=True), RateLimitEvent, the TaskStarted/TaskProgress/TaskNotification trio for background tasks and subagents, HookEventMessage (with include_hook_events=True), and the TypeScript-only SDKCompactBoundaryMessage for context compaction. The table below maps the main message types.

Message type	When emitted	Tracing-relevant fields
`SystemMessage` (init)	First message of every run	`session_id`, `model`, `tools`, `permissionMode`, `claude_code_version`
`AssistantMessage`	Each model step	`content` blocks, `model`, `usage`, `message_id`, `stop_reason`, `parent_tool_use_id`, `error`
`UserMessage`	Tool results returned	`tool_use_result`, `parent_tool_use_id`
`ResultMessage`	Terminal — once per `query()`	`subtype`, `is_error`, `num_turns`, `duration_ms` / `duration_api_ms`, `total_cost_usd`, `usage`, `model_usage`, `api_error_status`
`StreamEvent`	Only with `include_partial_messages=True`	Raw API stream `event`, `parent_tool_use_id`
`RateLimitEvent`	On rate-limit status changes	`status`, `resets_at`, `utilization`
`TaskStarted` / `TaskProgress` / `TaskNotification`	Background tasks and subagents	`task_id`, `tool_use_id`, `usage`, `status`
`HookEventMessage`	Only with `include_hook_events=True`	`hook_event_name`, `data`

Where Claude Agent SDK Traces Should Start and End

One span per `query()` call, sessions as trace groups

The rule: start the AGENT span when you invoke query(), before the init message arrives, and end it when the ResultMessage lands. Not at process exit. The result message carries everything the span needs to close cleanly: outcome subtype, turn count, durations, and cost.

from claude_agent_sdk import query, ClaudeAgentOptions, ResultMessage, SystemMessage
from opentelemetry import trace  # any OTel-compatible tracer works here

tracer = trace.get_tracer("claude-agent")


async def traced_query(prompt: str, options: ClaudeAgentOptions) -> None:
    # Open the AGENT span BEFORE iterating — the init message arrives first,
    # and the span should cover the whole run, not start mid-stream.
    with tracer.start_as_current_span("agent.query") as span:
        async for message in query(prompt=prompt, options=options):
            if isinstance(message, SystemMessage) and message.subtype == "init":
                # Run metadata lives in the init message's data dict.
                span.set_attribute("session_id", message.data["session_id"])
                span.set_attribute("model", message.data["model"])
                span.set_attribute("tools", ",".join(message.data["tools"]))

            elif isinstance(message, ResultMessage):
                # Terminal message: outcome, cost, and durations close the span.
                span.set_attribute("subtype", message.subtype)
                span.set_attribute("is_error", message.is_error)
                span.set_attribute("num_turns", message.num_turns)
                span.set_attribute("duration_ms", message.duration_ms)
                span.set_attribute("duration_api_ms", message.duration_api_ms)
                if message.total_cost_usd is not None:
                    span.set_attribute("total_cost_usd", message.total_cost_usd)

Sessions are the layer above. A session is a trace group keyed by session_id across multiple query() traces: resume="<session-id>" continues an existing session in a new call, fork_session=True resumes into a fresh session ID for branching, and continue_conversation=True picks up the most recent session. For a multi-turn ClaudeSDKClient, cut one trace per receive_response() cycle and link them by session_id. Cost follows the same boundary: it's scoped per query() call and never accumulates across a resumed session. We'll handle that properly later.

If local JSONL transcripts aren't durable enough for you, the session_store option mirrors them to an external backend (batched or eager flushing); mirror failures arrive as non-fatal MirrorErrorMessage events worth recording on the trace.

Why in-process auto-instrumentation misses the subprocess

Because the agent loop runs in the CLI subprocess, wrapping the Anthropic Python client in your process traces exactly nothing. You have three real options: consume the message stream yourself, register hooks that fire in your process, or have the CLI export OTLP directly. The next sections cover all three. My short version: the OTel route is the least code, but if you ever want evals you'll end up consuming the stream anyway.

One nice property if you already run OpenTelemetry: the SDK propagates W3C trace context. If an OTel span is active when you call query(), the SDK injects TRACEPARENT into the subprocess, so the CLI's own spans parent under your application span. TRACEPARENT is even forwarded into Bash commands the agent runs.
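If you want to confirm the propagation from inside a script the agent runs, the value follows the W3C Trace Context traceparent layout: version, trace ID, parent span ID, and flags, separated by hyphens. Below is a small parser sketch for the version-00 layout only. The names parse_traceparent and TraceParent are made up for this example, and reading the value from an environment variable called TRACEPARENT follows the description above rather than anything verified here.

```python
import re
from typing import NamedTuple, Optional

# Version-00 layout: 2 hex chars, 32 hex chars, 16 hex chars, 2 hex chars.
_TRACEPARENT = re.compile(
    r"^(?P<version>[0-9a-f]{2})-(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<span_id>[0-9a-f]{16})-(?P<flags>[0-9a-f]{2})$"
)


class TraceParent(NamedTuple):
    trace_id: str
    span_id: str
    sampled: bool


def parse_traceparent(value: str) -> Optional[TraceParent]:
    """Return the parsed header, or None if it is malformed."""
    match = _TRACEPARENT.match(value.strip())
    if match is None or match["version"] == "ff":
        return None
    # All-zero trace or span IDs are invalid per the spec.
    if set(match["trace_id"]) == {"0"} or set(match["span_id"]) == {"0"}:
        return None
    sampled = bool(int(match["flags"], 16) & 0x01)
    return TraceParent(match["trace_id"], match["span_id"], sampled)
```

Inside a Bash-invoked Python script, `parse_traceparent(os.environ.get("TRACEPARENT", ""))` would give you the trace ID to log next to the script's own output.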

Tracing Tool Calls and Subagents

Tool spans from `ToolUseBlock.id` to `ToolResultBlock`

Tool-call spans are a matching exercise. Open a TOOL child span for every ToolUseBlock.id you see in assistant content; close it when the ToolResultBlock (arriving inside a UserMessage) with the matching tool_use_id shows up. Record the tool name, a size-capped copy of input, and is_error from the result.

Tool names follow a taxonomy worth encoding in your span attributes: built-ins (Bash, Read, Edit, Glob, Grep, WebSearch, and friends), MCP tools named mcp__<server>__<tool>, and the Agent tool (a.k.a. Task) that spawns subagents. That last one gets special handling below.
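One way to encode that taxonomy is a small classifier whose result you attach to each TOOL span. This is a sketch under assumptions: the ToolKind type and the kind labels are invented here, anything that is neither an MCP name nor Agent/Task is treated as a built-in, and the split assumes the server name itself contains no double underscore.

```python
from typing import NamedTuple, Optional


class ToolKind(NamedTuple):
    kind: str              # "mcp", "subagent", "builtin", or "unknown"
    server: Optional[str]  # MCP server name, else None
    tool: str              # bare tool name


def classify_tool(name: str) -> ToolKind:
    if name.startswith("mcp__"):
        # mcp__<server>__<tool>: split on the first "__" after the prefix so
        # a tool name that itself contains "__" stays intact.
        server, sep, tool = name[len("mcp__"):].partition("__")
        if sep and server and tool:
            return ToolKind("mcp", server, tool)
        return ToolKind("unknown", None, name)  # malformed MCP-style name
    if name in ("Agent", "Task"):
        return ToolKind("subagent", None, name)
    return ToolKind("builtin", None, name)
```

Storing kind and server as separate span attributes makes it cheap to ask questions such as "which MCP server produces the most tool errors" later.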

from claude_agent_sdk import AssistantMessage, ToolResultBlock, ToolUseBlock, UserMessage
from opentelemetry import trace
from opentelemetry.trace import Span

# Open spans keyed by ToolUseBlock.id: a tool span stays open until the
# ToolResultBlock with the matching tool_use_id arrives.
open_tool_spans: dict[str, Span] = {}


def on_message(message, tracer, agent_span) -> None:
    # Subagent nesting: messages emitted from inside a subagent's context
    # carry parent_tool_use_id == the Agent/Task ToolUseBlock.id that
    # spawned them. Parent their spans under that tool span.
    parent = agent_span
    if getattr(message, "parent_tool_use_id", None) in open_tool_spans:
        parent = open_tool_spans[message.parent_tool_use_id]

    if isinstance(message, AssistantMessage):
        for block in message.content:
            if isinstance(block, ToolUseBlock):
                # OTel's start_span takes a context, not a parent span, so
                # wrap the parent with set_span_in_context.
                span = tracer.start_span(
                    f"tool.{block.name}",
                    context=trace.set_span_in_context(parent),
                )
                span.set_attribute("tool.input", str(block.input)[:2000])
                # Special case: name == "Agent" is the Task tool; the
                # subagent's whole run will nest under this span.
                open_tool_spans[block.id] = span

    elif isinstance(message, UserMessage) and isinstance(message.content, list):
        # Tool results come back inside UserMessages.
        for block in message.content:
            if isinstance(block, ToolResultBlock):
                span = open_tool_spans.pop(block.tool_use_id, None)
                if span is not None:
                    span.set_attribute("is_error", bool(block.is_error))
                    span.end()

Don't forget that permission waits are real latency. can_use_tool callbacks and permission prompts sit between tool-use and tool-result, and the CLI's own trace model gives them a dedicated claude_code.tool.blocked_on_user child span. Denials accumulate in ResultMessage.permission_denials, which becomes a debugging signal later.

Nesting subagents with `parent_tool_use_id` and `agent_id`

Subagents are where most homegrown tracers fall apart. The first correlation key is parent_tool_use_id: as the official overview puts it, messages from within a subagent's context include a parent_tool_use_id field, equal to the ToolUseBlock.id of the Agent call that spawned it. Nest a child AGENT span under that tool span and the hierarchy assembles itself.

The second key is agent_id (with agent_type) on tool-lifecycle hooks. The SDK's own docstring is blunt about why: when multiple subagents run in parallel, their tool-lifecycle hooks interleave over the same control channel, and agent_id is the only reliable way to attribute each tool call to the correct subagent. The GitHub issues asking how to untangle parallel subagent output suggest plenty of teams learned this the hard way.

Two more facts shape the span model. Subagents have isolated context windows and return only their final result to the parent, and their token usage rolls up into the parent's usage and model_usage (a cost wrinkle we'll pick up in the cost section). Background subagents surface as TaskStarted/TaskProgress/TaskNotification messages keyed by task_id plus tool_use_id; attach those as span events or child spans.
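To make the agent_id point concrete, here is a sketch that untangles interleaved tool-lifecycle hook inputs. Hook inputs are modeled as plain dicts with a tool_name and an optional agent_id, matching the fields named above; the "main" label for hooks fired outside any subagent is an assumption of this example.

```python
from collections import defaultdict

MAIN_AGENT = "main"  # hypothetical label for hooks fired outside any subagent


def group_by_agent(hook_inputs: list[dict]) -> dict[str, list[str]]:
    """Return tool names per agent, preserving arrival order within each agent."""
    by_agent: dict[str, list[str]] = defaultdict(list)
    for hook_input in hook_inputs:
        agent = hook_input.get("agent_id") or MAIN_AGENT
        by_agent[agent].append(hook_input["tool_name"])
    return dict(by_agent)


# Two parallel subagents whose hooks arrive interleaved on one channel.
interleaved = [
    {"tool_name": "Grep", "agent_id": "a1"},
    {"tool_name": "Read", "agent_id": "a2"},
    {"tool_name": "Read", "agent_id": "a1"},
    {"tool_name": "Bash"},  # fired by the main agent: no agent_id
    {"tool_name": "Edit", "agent_id": "a2"},
]
per_agent = group_by_agent(interleaved)
```

Arrival order alone would suggest a single Grep, Read, Read, Bash, Edit trajectory; keyed by agent_id, each subagent's own sequence is recoverable.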

Hooks as the in-process tracing surface

Hooks run in your process, not the CLI's, which makes them the cleanest place to write spans directly. A PreToolUse → PostToolUse bracket gives you precise tool timing, free of the transport overhead baked into message-stream timestamps.

The verified 2026 hook list (from types.py; ignore stale blog posts with invented names): PreToolUse, PostToolUse, PostToolUseFailure, UserPromptSubmit, Stop, SubagentStop, PreCompact, Notification, SubagentStart, and PermissionRequest. Every hook input carries session_id and transcript_path. Hold onto that transcript path; it's the raw material for eval datasets later.

import time

from claude_agent_sdk import ClaudeAgentOptions, HookMatcher

tool_starts: dict[str, float] = {}  # keyed by tool_use_id


def record_event(name: str, **attrs) -> None:
    """Replace with your span/event writer (OTel, DB, collector)."""
    print(name, attrs)


# Hook callback signature: async fn(input, tool_use_id, context).
# Hooks run in YOUR process — write spans to OTel/DB/collector directly.
async def on_pre_tool(input, tool_use_id, context):
    tool_starts[tool_use_id] = time.monotonic()
    # agent_id/agent_type are present only when the hook fires inside a
    # Task-spawned subagent — the only reliable attribution when parallel
    # subagents interleave hooks over the same control channel.
    record_event("tool_start", tool=input["tool_name"],
                 agent_id=input.get("agent_id"), agent_type=input.get("agent_type"))
    return {}


async def on_post_tool(input, tool_use_id, context):
    duration = time.monotonic() - tool_starts.pop(tool_use_id, time.monotonic())
    record_event("tool_end", tool=input["tool_name"], duration_s=duration,
                 response=str(input.get("tool_response"))[:2000])
    return {}


async def on_tool_failure(input, tool_use_id, context):
    record_event("tool_error", tool=input["tool_name"], error=input.get("error"))
    return {}


options = ClaudeAgentOptions(
    hooks={
        "PreToolUse": [HookMatcher(hooks=[on_pre_tool])],
        "PostToolUse": [HookMatcher(hooks=[on_post_tool])],
        "PostToolUseFailure": [HookMatcher(hooks=[on_tool_failure])],
    }
)

Tracking Cost and Token Usage Across Runs

Why `total_cost_usd` is an estimate (and how to dedupe parallel tool calls)

total_cost_usd is a client-side estimate computed from a price table bundled into the SDK at build time. The official cost docs warn it drifts when pricing changes or the model is unrecognized. Never bill users from it. The authoritative source is the Usage & Cost API in the Console.

The dedup rule from earlier now matters: parallel tool calls emit multiple assistant messages sharing one message_id with identical usage, so you must track seen IDs before summing. Rarely, same-ID messages differ in output_tokens; take the highest, or prefer the usage on the result message. Watch the casing too: usage keys are snake_case (input_tokens, cache_read_input_tokens) while model_usage keys are camelCase (inputTokens, costUSD). The per-model model_usage map is what saves you when subagents run on a different model than the main agent.

from collections import defaultdict

from claude_agent_sdk import AssistantMessage, ResultMessage

seen_message_ids: set[str] = set()
token_totals: dict[str, int] = defaultdict(int)
session_costs: dict[str, float] = defaultdict(float)


def on_message(message) -> None:
    if isinstance(message, AssistantMessage) and message.usage:
        # Parallel tool calls emit multiple assistant messages that SHARE one
        # message_id with identical usage — sum without deduplicating and you
        # double-count tokens.
        if message.message_id in seen_message_ids:
            return
        seen_message_ids.add(message.message_id)
        for key in ("input_tokens", "output_tokens",
                    "cache_creation_input_tokens", "cache_read_input_tokens"):
            token_totals[key] += message.usage.get(key, 0)

    elif isinstance(message, ResultMessage):
        # Cost scope is per query() call. Resumed sessions get no cumulative
        # total — accumulate it yourself, keyed by session_id.
        if message.total_cost_usd is not None:
            session_costs[message.session_id] += message.total_cost_usd

        # model_usage is per-model and uses camelCase keys — what you need
        # when subagents run on a different model than the main agent.
        for model, usage in (message.model_usage or {}).items():
            print(f"{model}: ${usage['costUSD']:.4f} "
                  f"({usage['inputTokens']} in / {usage['outputTokens']} out)")
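The handler above keeps the first message it sees for each message_id. For the rare case where same-ID messages differ in output_tokens, a variant that keeps the highest value could look like the sketch below. It takes plain (message_id, usage) pairs so it runs without the SDK; dedupe_usage is a name invented for this example.

```python
TOKEN_KEYS = ("input_tokens", "output_tokens",
              "cache_creation_input_tokens", "cache_read_input_tokens")


def dedupe_usage(steps: list[tuple[str, dict]]) -> dict[str, int]:
    """Sum usage once per message_id, keeping the copy with the most output tokens.

    steps: (message_id, usage) pairs in arrival order.
    """
    best: dict[str, dict] = {}
    for message_id, usage in steps:
        kept = best.get(message_id)
        if kept is None or usage.get("output_tokens", 0) > kept.get("output_tokens", 0):
            best[message_id] = usage

    totals = {key: 0 for key in TOKEN_KEYS}
    for usage in best.values():
        for key in TOKEN_KEYS:
            totals[key] += usage.get(key, 0)
    return totals


steps = [
    ("msg_1", {"input_tokens": 100, "output_tokens": 40}),
    ("msg_1", {"input_tokens": 100, "output_tokens": 55}),  # same ID, higher output
    ("msg_2", {"input_tokens": 200, "output_tokens": 10,
               "cache_read_input_tokens": 900}),
]
totals = dedupe_usage(steps)
```

The trade-off is that totals are only final once the run ends, so this variant fits a close-of-span summary better than a live counter.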

Per-call vs per-session cost scoping

Cost is scoped per query() call. Resumed sessions get no cumulative total; you accumulate total_cost_usd across calls yourself, keyed by session_id. Error results still carry usage and total_cost_usd for the tokens burned before failure; count them, because failed runs are often your most expensive ones.

For a hard ceiling, set max_budget_usd in ClaudeAgentOptions; it kills the run with subtype error_max_budget_usd. One billing note worth a calendar entry: starting June 15, 2026, Agent SDK usage on subscription plans draws from a separate monthly Agent SDK credit (API-key usage is unaffected).

Why does token-class accuracy matter so much? Because Anthropic prices input-token classes an order of magnitude apart: cache reads cost roughly a tenth of base input. Any pipeline that prices all input tokens at the base rate will wildly overestimate the cost of cache-heavy agent loops, which is exactly the failure mode of the built-in OTel exporter, covered next.

Figure: Anthropic input-token pricing by class (Claude Sonnet 4.6, USD per million input tokens). Cache reads cost 10x less than base input.
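A worked example of the gap, as a sketch. The rates below are placeholders chosen so that cache reads cost a tenth of base input, as described above; the cache-write rate in particular is an assumption, and none of these numbers should be read as an authoritative price table. The usage keys are the snake_case ones from the message stream, where input_tokens counts only uncached input.

```python
# Illustrative USD-per-million-token rates. Placeholders, not a price table.
RATES = {
    "input_tokens": 3.00,                 # base (uncached) input
    "cache_creation_input_tokens": 3.75,  # cache write: assumed for this example
    "cache_read_input_tokens": 0.30,      # a tenth of base input
}


def class_aware_input_cost(usage: dict) -> float:
    """Price each input-token class at its own rate."""
    return sum(usage.get(key, 0) * rate for key, rate in RATES.items()) / 1_000_000


def naive_input_cost(usage: dict) -> float:
    """Price every input token at the base rate (what a breakdown-blind backend does)."""
    total_tokens = sum(usage.get(key, 0) for key in RATES)
    return total_tokens * RATES["input_tokens"] / 1_000_000


# A cache-heavy agent loop: almost all input is served from cache.
usage = {
    "input_tokens": 2_000,
    "cache_creation_input_tokens": 10_000,
    "cache_read_input_tokens": 1_000_000,
}
accurate = class_aware_input_cost(usage)  # about $0.34
naive = naive_input_cost(usage)           # about $3.04
```

Under these placeholder rates the naive figure is almost nine times the class-aware one, and the ratio grows with the share of cache reads.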

The Built-In OpenTelemetry Export — and Its Limits

Enabling beta traces (`CLAUDE_CODE_ENABLE_TELEMETRY`, span names)

The built-in observability path: the CLI can export OTLP itself. Metrics and logs are gated by one flag; trace spans need the enhanced-telemetry beta flag too:

CLAUDE_CODE_ENABLE_TELEMETRY=1
CLAUDE_CODE_ENHANCED_TELEMETRY_BETA=1   # gates trace spans
OTEL_TRACES_EXPORTER=otlp
OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
OTEL_EXPORTER_OTLP_ENDPOINT=https://your-collector:4318
OTEL_EXPORTER_OTLP_HEADERS="Authorization=Bearer <token>"
OTEL_TRACES_EXPORT_INTERVAL=1000        # short-lived processes: flush fast

Pass these via process env or options.env (TypeScript gotcha: env replaces the process environment in TS, so spread ...process.env; Python merges). The span names you'll see: claude_code.interaction for one agent-loop turn, claude_code.llm_request per API call, and claude_code.tool per tool invocation with claude_code.tool.blocked_on_user and claude_code.tool.execution children; subagent spans nest under the parent's tool span, and session.id rides along as an attribute.

Two operational warnings. Never configure the console exporter through the SDK. Stdout is the SDK's message channel and you will corrupt the stream. And short-lived processes must lower the OTEL_*_EXPORT_INTERVAL values (traces/logs default to 5s flushes) or they'll drop spans on exit.
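Since these variables usually end up in options.env, it can help to build them in one place and guard against the console exporter there. The sketch below uses only the variable names listed above plus the standard OTEL_*_EXPORTER variables; the helper names telemetry_env and assert_no_console_exporter are invented for this example.

```python
def telemetry_env(endpoint: str, token: str, flush_ms: int = 1000) -> dict[str, str]:
    """Build the env block that turns on the CLI's beta trace export."""
    if not endpoint.startswith(("http://", "https://")):
        raise ValueError(f"endpoint must be an http(s) URL, got {endpoint!r}")
    return {
        "CLAUDE_CODE_ENABLE_TELEMETRY": "1",
        "CLAUDE_CODE_ENHANCED_TELEMETRY_BETA": "1",  # gates trace spans
        "OTEL_TRACES_EXPORTER": "otlp",
        "OTEL_EXPORTER_OTLP_PROTOCOL": "http/protobuf",
        "OTEL_EXPORTER_OTLP_ENDPOINT": endpoint,
        "OTEL_EXPORTER_OTLP_HEADERS": f"Authorization=Bearer {token}",
        "OTEL_TRACES_EXPORT_INTERVAL": str(flush_ms),  # flush fast on short runs
    }


def assert_no_console_exporter(env: dict[str, str]) -> None:
    """Refuse any OTEL_*_EXPORTER value that includes 'console'.

    Stdout is the SDK's message channel, so console output would corrupt it.
    """
    for key, value in env.items():
        if key.startswith("OTEL_") and key.endswith("_EXPORTER"):
            exporters = [part.strip() for part in value.split(",")]
            if "console" in exporters:
                raise ValueError(f"{key} must not use the console exporter")


env = telemetry_env("https://your-collector:4318", "example-token")
assert_no_console_exporter(env)
```

In Python the result can be passed as the env option directly, since it merges with the process environment; in TypeScript you would spread process.env first, as noted above.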

What the OTel exporter can't give you (content opt-ins, cache-token gap #673)

Default telemetry is structural only: durations, models, tool names, token counts. Prompt and tool content require explicit opt-ins: OTEL_LOG_USER_PROMPTS=1, OTEL_LOG_TOOL_DETAILS=1, OTEL_LOG_TOOL_CONTENT=1 (60KB cap), OTEL_LOG_RAW_API_BODIES=1. The consequence that matters for this article: without those flags, the OTel export cannot reconstruct inputs and outputs, so it can't feed an eval dataset.

Then there's issue #673: the exporter omits the prompt-caching token breakdown. Any OTLP backend that prices input tokens at the full base rate will badly overestimate cost for cache-heavy agents. The price chart above shows the magnitude. Stream-level tracing reads usage directly off messages and doesn't have this problem.

Also remember traces are beta; the docs state span names and attributes may change between releases, and claude_code.* names are not OTel GenAI semantic-convention names (invoke_agent, execute_tool), which matters if your backend expects semconv. (Managed Agents, the hosted alternative where Anthropic runs the loop, ships its own Console session and tracing views. That's a different product from self-hosted SDK tracing.)

Approach	Setup	Content capture	Subagent attribution	Cost accuracy	Eval path
Built-in OTel export (beta)	Env flags (`CLAUDE_CODE_ENABLE_TELEMETRY=1` + enhanced-telemetry beta)	Structural-only by default; prompts/tool I/O are opt-in flags	Subagent spans nest under the parent's tool span	Cache-token gap (#673) inflates downstream cost estimates	None — spans land in your backend, no dataset flow
Message stream + hooks (DIY)	Code in your process	Full — you choose what to record	`parent_tool_use_id` + `agent_id`, but you implement it	Accurate — usage read off messages, deduped by `message_id`	Manual export from traces to datasets
Vendor wrappers (Langfuse, MLflow, Catalyst)	One-line instrumentor / wrapper	Per vendor	Per vendor (Catalyst: AGENT/LLM/TOOL span hierarchy)	Stream-level readers avoid #673	Catalyst: datasets-from-traffic → rubric evals

Debugging Failed Claude Agent SDK Runs

When a run fails, work this checklist in order:

1. Branch on ResultMessage.subtype: error_max_turns, error_during_execution, error_max_budget_usd, error_max_structured_output_retries.
2. Check is_error plus api_error_status even when the subtype is "success".
3. Inspect AssistantMessage.error: rate_limit, billing_error, invalid_request, authentication_failed, server_error.
4. Scan tool failures: ToolResultBlock.is_error and PostToolUseFailure hooks.
5. Review permission_denials; blocked agents silently degrade output.
6. Check RateLimitEvent status (rejected or allowed_warning).
7. Look for compaction boundaries shortly before the failure (context loss).
8. Compare duration_ms vs duration_api_ms for stalls, and num_turns against peer runs for loops.
9. Watch for a missing ResultMessage entirely; run a timeout watchdog.

API errors, rate limits, and `api_error_status`

The subtype table earlier gives each branch its meaning; the practical readings are these. error_max_turns usually means a loop or an underpowered model. Inspect the trajectory, not just the count. A "success" subtype with is_error=True and api_error_status of 429, 500, or 529 is an API failure dressed as success; alert on the combination, not the subtype alone. Mid-stream, AssistantMessage.error names the failure class before the run even terminates. And RateLimitEvent carries resets_at and utilization, which is enough to implement real backoff instead of blind retries.
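A sketch of reset-aware backoff. It assumes resets_at is a Unix timestamp in seconds, which this guide does not confirm, so check the event's actual representation before relying on it. The status strings are the two named in the checklist; the cap and the jitter parameter are choices made for this example.

```python
from typing import Optional


def backoff_seconds(status: str, resets_at: Optional[float], now: float,
                    max_wait: float = 300.0, jitter: float = 0.0) -> float:
    """How long to wait before retrying, given the latest rate-limit event.

    ASSUMPTION: resets_at and now are Unix timestamps in seconds.
    """
    if status != "rejected" or resets_at is None:
        # "allowed_warning" means requests still go through: no wait needed.
        return 0.0
    wait = max(0.0, resets_at - now)
    # Jitter spreads out retries from workers that were rejected together.
    return min(max_wait, wait + jitter)
```

A caller would pass `now=time.time()` and something like `jitter=random.uniform(0, 1)`; keeping both as parameters is what makes the function deterministic under test.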

Silent failures: permission denials, compaction, missing `ResultMessage`

The failures that don't throw are worse. Permission denials accumulate quietly while the agent routes around blocked tools and produces a degraded answer. permission_denials is both a debugging signal and an eval signal ("the agent tried something it shouldn't"). Compaction is the other classic: a PreCompact hook firing (or a TS compact_boundary message) right before things went strange usually means the context the agent needed got summarized away. Record both as span events.

Then the known SDK failure modes worth instrumenting for: #30333, where ResultMessage is never emitted after long runs with background subagents, so a watchdog timeout on every trace is non-negotiable; #425, where the internal queue fills if you stop consuming the stream after the parent result, which blocks SDK MCP tool bridging; and #627, subagents stopping with no stated reason.
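One way to build that watchdog is to wrap the message stream in an idle timeout, so a run that goes silent raises instead of hanging. This sketch is generic over any async iterator and does not import the SDK; StreamStalled, with_watchdog, and the fake stream used for the demo are all names invented here. It guards against silence between messages rather than total run time, which is a design choice: a long but chatty run is left alone.

```python
import asyncio
from typing import AsyncIterator, TypeVar

T = TypeVar("T")


class StreamStalled(Exception):
    """Raised when the stream goes quiet for longer than idle_timeout."""


async def with_watchdog(stream: AsyncIterator[T],
                        idle_timeout: float) -> AsyncIterator[T]:
    """Re-yield messages, raising StreamStalled if none arrives in time."""
    iterator = stream.__aiter__()
    while True:
        try:
            message = await asyncio.wait_for(iterator.__anext__(), idle_timeout)
        except StopAsyncIteration:
            return  # stream ended normally
        except asyncio.TimeoutError:
            raise StreamStalled(f"no message for {idle_timeout}s") from None
        yield message


async def fake_stream(delays: list[float]) -> AsyncIterator[int]:
    """Stand-in for a message stream: yields 0, 1, 2... after each delay."""
    for index, delay in enumerate(delays):
        await asyncio.sleep(delay)
        yield index


async def collect(delays: list[float], idle_timeout: float) -> list:
    seen: list = []
    try:
        async for message in with_watchdog(fake_stream(delays), idle_timeout):
            seen.append(message)
    except StreamStalled:
        seen.append("stalled")  # in a tracer: mark the span as errored and end it
    return seen
```

In a real tracer you would wrap the query() stream the same way and, on StreamStalled, close the AGENT span with an explicit "missing result" status so the trace does not stay open forever.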

When the stream itself isn't enough, drop down a level: the stderr callback in ClaudeAgentOptions surfaces CLI-level logs, and the TypeScript SDK adds a debugFile option for the same purpose.

If you traced runs as described in the earlier sections, every signal on this list is already a span attribute or event. Debugging becomes a trace query, not a log grep.

Turning Traces into Eval Datasets

Building eval datasets from production failures

Production traces are the highest-value eval source you have. They capture real inputs and real tool behavior, plus the failure modes synthetic test cases never anticipate. Start with around 50 production failure cases and grow the set every time you hit a new bug.

A trace-derived eval case needs three things: the input (prompt plus options), the expected behavior (not just the expected final text), and the observed trajectory. You already have the raw material twice over: your captured spans, and the session transcripts the SDK writes as JSONL on disk, reachable via transcript_path on every hook input or the get_session_messages() helper. Anthropic's own engineering guidance says to build "a representative test set for programmatic evaluations", but ships no tooling for it. This is that tooling.

import json

from claude_agent_sdk import get_session_messages


def trace_to_eval_case(result, tool_names: list[str], path: str = "evals.jsonl") -> None:
    """Append a failed run to the eval dataset. Failures and edge cases
    first — grow the set on every production bug."""
    # The full transcript is replayable: get_session_messages() reads the
    # session JSONL the SDK already wrote to disk.
    messages = get_session_messages(result.session_id)
    first_user_prompt = next(
        (m.content for m in messages
         if getattr(m, "content", None) and type(m).__name__ == "UserMessage"),
        None,  # no user message found: keep the case, leave the input empty
    )

    case = {
        "input": first_user_prompt,
        "session_id": result.session_id,
        "failure": result.subtype,                # e.g. "error_max_turns"
        "expected_behaviors": [                   # written by a human at triage
            "completes within 15 turns",
            "uses Grep before Read on large repos",
        ],
        "trajectory": {                           # what actually happened
            "num_turns": result.num_turns,
            "tools_used": tool_names,             # from your TOOL spans
            "total_cost_usd": result.total_cost_usd,
            "permission_denials": len(result.permission_denials or []),
        },
    }
    with open(path, "a") as f:
        f.write(json.dumps(case) + "\n")

Agent evaluation metrics: tool selection, turns, cost per success

Final-answer grading misses most of what goes wrong in agents, so score the trajectory. The dimensions that matter, each mapping to data you already captured: correct tool selection (TOOL span names per run), tool-argument validity (recorded input payloads), turn efficiency (num_turns vs peers), cost per success (total_cost_usd joined with outcome), permission-denial rate (permission_denials), subagent delegation correctness (the parent_tool_use_id hierarchy), and recovery after tool error (is_error tool spans followed by success). LLM agent evaluation built only on final outputs would miss every one of these.
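A sketch of how a few of those dimensions fall out of per-run records. Each run is a plain dict holding fields copied from its ResultMessage; the record shape and the function name are assumptions of this example. Cost per success here divides the cost of all runs, failures included, by the number of successes, since failed runs are spend you still paid for.

```python
def trajectory_metrics(runs: list[dict]) -> dict[str, float]:
    """Aggregate structural metrics over a batch of run records."""
    if not runs:
        raise ValueError("need at least one run")
    # A run counts as a success only if subtype AND is_error both say so.
    successes = [r for r in runs if r["subtype"] == "success" and not r["is_error"]]
    total_cost = sum(r.get("total_cost_usd") or 0.0 for r in runs)
    denied = [r for r in runs if r.get("permission_denials", 0) > 0]
    return {
        "success_rate": len(successes) / len(runs),
        "cost_per_success": total_cost / len(successes) if successes else float("inf"),
        "mean_turns": sum(r["num_turns"] for r in runs) / len(runs),
        "denial_rate": len(denied) / len(runs),
    }


runs = [
    {"subtype": "success", "is_error": False, "num_turns": 6,
     "total_cost_usd": 0.25, "permission_denials": 0},
    {"subtype": "success", "is_error": False, "num_turns": 10,
     "total_cost_usd": 0.25, "permission_denials": 1},
    {"subtype": "error_max_turns", "is_error": True, "num_turns": 40,
     "total_cost_usd": 1.5, "permission_denials": 0},
    {"subtype": "success", "is_error": True, "num_turns": 4,
     "total_cost_usd": 0.0, "permission_denials": 0},
]
metrics = trajectory_metrics(runs)
```

In this sample the single max-turns failure accounts for three quarters of the spend, which is why cost per success lands at $1.00 even though each successful run cost $0.25.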

Layered scoring and LLM-as-judge rubrics

Score in layers, cheapest first: deterministic checks (output format, tool-call validity, budget caps) catch the unambiguous failures; heuristics handle the mid-tier; LLM-as-judge handles nuance; and a human calibration pass on a 1–2% sample keeps the judge honest. For judge rubrics, two framings cover most needs: "direct" rubrics grade output against written standards, while "adherence" rubrics compare output to a reference response. Direct rubrics suit open-ended agent work; adherence rubrics suit replay-based regression checks, which brings us to the last operational piece.
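The cheapest-first ordering can be sketched as a short-circuiting pipeline. This version collapses the scheme to two layers (deterministic checks, then a judge) to stay short; the judge is any callable returning a score between 0 and 1, standing in for an LLM call. The check functions, record fields, and pass mark are all hypothetical.

```python
from typing import Callable, Optional

# A deterministic check returns a failure reason, or None when the run passes.
Check = Callable[[dict], Optional[str]]


def within_turn_budget(run: dict) -> Optional[str]:
    if run["num_turns"] > run["max_turns_expected"]:
        return f"took {run['num_turns']} turns (expected <= {run['max_turns_expected']})"
    return None


def no_permission_denials(run: dict) -> Optional[str]:
    return "permission denials recorded" if run["permission_denials"] else None


def score_run(run: dict, checks: list[Check],
              judge: Callable[[dict], float], pass_mark: float = 0.7) -> dict:
    # Layer 1: cheap deterministic checks. The first failure short-circuits,
    # so the judge is never invoked for unambiguous failures.
    for check in checks:
        reason = check(run)
        if reason is not None:
            return {"passed": False, "layer": "deterministic", "reason": reason}
    # Layer 2: the judge (an LLM call in practice; any callable here).
    score = judge(run)
    return {"passed": score >= pass_mark, "layer": "judge", "score": score}
```

Recording which layer produced the verdict is useful on its own: a rising share of judge-layer failures suggests a deterministic check is missing.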

Regression Checks Before SDK, Prompt, or Tool Changes

The Claude Agent SDK ships multiple releases per week (Python v0.2.95 is one of dozens in the first half of 2026), and the docs themselves warn that beta trace span names may change between releases. Your prompt and tool definitions churn even faster. Every one of those changes can regress agent behavior without throwing a single error.

The gate is four steps:

1. Pin a dataset of real traces: inputs plus expected behaviors, built per the previous section.
2. Replay it through the new SDK version, prompt, or tool configuration.
3. Compare judge scores plus structural metrics (turns, cost, latency, tool-failure rate) against the baseline.
4. Gate the deploy on no-regression.

Two options make replays safe: max_budget_usd and max_turns cap the blast radius of a misbehaving candidate config, and fork_session=True lets you branch a real session into a replay without disturbing the original. The structural metrics in step 3 come free if you traced runs as described earlier. The regression gate is just a query over two sets of spans.
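The comparison and gating steps can be sketched as a function over two metric summaries, one for the baseline and one for the candidate. The metric names and tolerances below are hypothetical defaults; tune them to the noise you actually see between identical replays.

```python
# Allowed relative increase per lower-is-better metric (hypothetical defaults).
TOLERANCES = {
    "mean_turns": 0.10,
    "cost_per_success": 0.10,
    "p95_latency_ms": 0.15,
    "tool_failure_rate": 0.0,
}


def regressions(baseline: dict, candidate: dict,
                tolerances: dict = TOLERANCES,
                max_score_drop: float = 0.02) -> list[str]:
    """Return the names of metrics that regressed beyond tolerance."""
    failed = []
    for metric, tolerance in tolerances.items():
        limit = baseline[metric] * (1 + tolerance)
        if candidate[metric] > limit:
            failed.append(metric)
    # Judge score is higher-is-better, so check for a drop instead.
    if candidate["judge_score"] < baseline["judge_score"] - max_score_drop:
        failed.append("judge_score")
    return failed


baseline = {"mean_turns": 10.0, "cost_per_success": 0.40,
            "p95_latency_ms": 20000, "tool_failure_rate": 0.05,
            "judge_score": 0.90}
candidate = dict(baseline, mean_turns=12.0, judge_score=0.85)
failed = regressions(baseline, candidate)
```

In CI, a non-empty list would translate to a failing exit code, for example `sys.exit(1 if failed else 0)`, with the list itself printed as the reason.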

Tracing Claude Agent SDK Runs in Catalyst

Everything in the first three sections — the span-per-query() model, tool-call spans, subagent nesting — is what Catalyst's Claude Agent SDK integration implements for you. In Python, install inference-catalyst-tracing[claude-agent-sdk] and call setup() before importing the SDK; it auto-patches with no code changes, emitting an AGENT span per query loop with nested LLM and tool spans. TypeScript uses an explicit wrapClaudeAgentSdkQuery(query) wrapper (ESM namespace bindings can't be patched in place).

# pip install 'inference-catalyst-tracing[claude-agent-sdk]'
#
# Env vars:
#   CATALYST_OTLP_ENDPOINT=https://telemetry.inference.net
#   CATALYST_OTLP_TOKEN=<your api key>
#   CATALYST_SERVICE_NAME=<stable service id>

from inference_catalyst_tracing import setup, tracing

# Call setup() BEFORE importing claude_agent_sdk — it auto-patches the SDK
# at import time, so importing first means nothing gets instrumented.
setup(service_name="claude-agent")

from claude_agent_sdk import query  # noqa: E402  (import after setup is the point)

# ... run your agent: every query() now emits an AGENT span with nested
# LLM and TOOL spans, no further code changes ...

tracing.shutdown()  # flush spans on exit (short-lived processes)

# TypeScript variant — ESM namespace bindings can't be patched in place,
# so wrap query explicitly:
#
#   import { wrapClaudeAgentSdkQuery } from "@inference/tracing";
#   const tracedQuery = wrapClaudeAgentSdkQuery(query);

Because Catalyst reads usage off the message stream, it isn't subject to the OTel exporter's cache-token gap (#673). And the trace-to-eval path is productized: build datasets from captured traffic, run rubric-based LLM-as-judge evals (direct or adherence), and compare configurations side by side. Optional agent_span() identity groups runs across deploys, and Halo reads your traces to propose fixes with citations to specific trace IDs.


FAQ

Does the Claude Agent SDK have built-in tracing?

Yes, in beta. Set CLAUDE_CODE_ENABLE_TELEMETRY=1 plus CLAUDE_CODE_ENHANCED_TELEMETRY_BETA=1 and configure OTLP exporters, and the CLI exports claude_code.* spans. Default output is structural only (content capture requires opt-in flags), and the docs warn span names may change between releases.

How do I get the session ID from a Claude Agent SDK run?

The first message of every run is a SystemMessage with subtype == "init", and message.data["session_id"] carries the session ID. It's then repeated on every subsequent message and on the terminal ResultMessage, so any message you hold can correlate back to its session.

Is `total_cost_usd` accurate for billing?

No. It's a client-side estimate from a price table bundled at SDK build time, and it drifts when pricing changes or a model isn't recognized. Use the Usage & Cost API for billing, and dedupe parallel tool calls by message_id whenever you sum token usage yourself.

How do I trace parallel subagents?

Messages from a subagent's context carry parent_tool_use_id, matching the ToolUseBlock.id of the Agent tool call that spawned it; nest spans under that. When parallel subagents interleave, the agent_id field on tool-lifecycle hooks is the only reliable per-subagent attribution.