Claude Agent SDK: The Production Guide to Tracing, Subagents, and Evaluation

What Is the Claude Agent SDK?

The Claude Agent SDK is Anthropic's library for building autonomous AI agents in Python and TypeScript. It packages the same agent loop, built-in tools, and context management that power Claude Code, so your agent can read files, run commands, search the web, and edit code without you implementing tool execution yourself. If you've seen references to the Claude Code SDK, that's the same project: Anthropic renamed it to the Claude Agent SDK.

The quickstart takes about ten minutes. You install the package, set an API key, and query() starts fixing bugs in your repo. The distance between that demo and an agent you can trust in production is where most teams stall, and it's the part Anthropic's reference docs don't cover: what happens inside a forty-turn session, why a subagent came back empty-handed, whether last week's prompt change made the agent worse. This guide assumes the quickstart already works for you and covers everything after it.

A quick orientation before the deep material. The Claude Platform now offers three distinct ways to build agents, and most of the confusion around "what is the Claude Agent SDK" is really confusion about which surface does what.

Dimension	Agent SDK	Messages API	Managed Agents
Runs in	Your process	Your process	Anthropic infrastructure
Interface	Python / TypeScript library	HTTP API + client SDKs	REST API
Tool loop	SDK handles it	You implement it	Anthropic runs it
Works on	Your filesystem	Whatever you wire up	Managed sandbox
Session state	JSONL on your disk	None (stateless)	Hosted event log
Custom tools	In-process functions	Your tool executor	You execute, return results
Best for	Production automation in your infra	Full control of the loop	Hosted, long-running sessions

The Claude Code CLI sits alongside these as the interactive interface to the same harness: use the CLI for daily development and one-off tasks, and the Agent SDK for CI/CD pipelines, custom applications, and production automation. Choose the Messages API when you want to own the tool loop yourself, Managed Agents when you want Anthropic to host both the loop and a sandbox, and the Agent SDK when the agent should run inside your own process, against your own filesystem and services.

One billing note worth knowing: starting June 15, 2026, Agent SDK and claude -p usage on Claude subscription plans draws from a monthly Agent SDK credit that's separate from interactive usage limits. Production deployments on API keys bill per token as before.

Core Architecture: The Agent Loop, Sessions, Permissions, and Hooks

Four mechanisms define how the SDK behaves in production: the agent loop, session persistence, the permission system, and hooks. Get these right and the later sections (tracing, evals) have something solid to observe.

The agent loop and query()

Everything starts with query(), an async generator. You pass a prompt and options; the SDK runs the full agentic loop (model turn, tool calls, tool results, repeat) and streams every intermediate message back to your code, ending with a result message. Install is one line per language: pip install claude-agent-sdk for Python 3.10+, or npm install @anthropic-ai/claude-agent-sdk for TypeScript, which bundles a native Claude Code binary so there's nothing else to install.

import asyncio

from claude_agent_sdk import ClaudeAgentOptions, query


async def main():
    # query() is an async generator: it streams every message in the
    # agent loop (reasoning, tool calls, tool results, final result).
    async for message in query(
        prompt="Find all TODO comments and create a summary",
        options=ClaudeAgentOptions(allowed_tools=["Read", "Glob", "Grep"]),
    ):
        # The final ResultMessage carries the run's outcome.
        if hasattr(message, "result"):
            print(message.result)


asyncio.run(main())

import { query } from "@anthropic-ai/claude-agent-sdk";

// Same loop in TypeScript. Note the camelCase options: allowedTools
// here, allowed_tools in Python.
for await (const message of query({
  prompt: "Find all TODO comments and create a summary",
  options: { allowedTools: ["Read", "Glob", "Grep"] },
})) {
  if ("result" in message) console.log(message.result);
}

The two SDKs expose the same surface with one naming difference that bites people: Python options are snake_case (allowed_tools, permission_mode) while TypeScript uses camelCase (allowedTools, permissionMode). Built-in tools cover most agent work out of the box: Read, Write, Edit, Bash, Glob, Grep, WebSearch, WebFetch, Monitor for watching background scripts, and AskUserQuestion for clarifying questions. Authentication runs through ANTHROPIC_API_KEY by default, with first-class paths for Amazon Bedrock, Google Vertex AI, Claude Platform on AWS, and Azure.

Figure 1: One query() call through the Claude Agent SDK

Sessions: continue, resume, and fork

Every query() call accumulates a session: your prompt, every tool call and result, every response. The SDK writes it to disk automatically as JSONL under ~/.claude/projects/<encoded-cwd>/, where the directory name is your absolute working directory with non-alphanumeric characters replaced by hyphens. That encoding matters operationally. If a resume call returns a fresh session instead of your history, the most common cause is a mismatched working directory: the SDK is looking in the wrong folder.
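When a resume comes back empty, it helps to compute the directory the SDK would be expected to look in and compare it with what is on disk. The sketch below follows the encoding rule as described above; it is not the SDK's own implementation. It assumes "non-alphanumeric" means anything outside ASCII letters and digits, and encode_cwd and session_dir are hypothetical helper names.

```python
import re
from pathlib import Path


def encode_cwd(abs_path: str) -> str:
    """Replace every non-alphanumeric character with a hyphen.

    Assumption: "alphanumeric" means ASCII letters and digits only.
    """
    return re.sub(r"[^A-Za-z0-9]", "-", abs_path)


def session_dir(abs_cwd: str, home: str = "~") -> Path:
    """Directory where sessions for abs_cwd would be expected to live."""
    return Path(home).expanduser() / ".claude" / "projects" / encode_cwd(abs_cwd)


print(encode_cwd("/home/dev/my_app"))  # -home-dev-my-app
```

Comparing session_dir(os.getcwd()) in the process that created the session with the same call in the process that tries to resume it is a quick way to confirm or rule out the mismatched-directory cause.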

You have three ways back into a session. Continue picks up the most recent session in the current directory, with no ID tracking. Resume takes a specific session ID, which you read from the session_id field on the result message. Fork (fork_session=True in Python, forkSession: true in TypeScript, combined with resume) creates a new session seeded with a copy of the original's history, so you can explore an alternative approach without disturbing the original.

For multi-turn conversations in one process, Python provides ClaudeSDKClient, which tracks the session internally across client.query() calls. TypeScript instead passes continue: true on each subsequent call. If you maintain older TypeScript code, note that the experimental V2 session API (createSession() with send/stream) was removed in version 0.3.142; query() plus session options is the supported path.

Two production caveats. Sessions persist the conversation, not the filesystem, so reverting an agent's file edits is a separate concern (file checkpointing). And session files are local to the machine that created them: to resume on another host you either ship the JSONL file to the same encoded path or use a SessionStore adapter, though capturing results as application state and starting fresh is usually more reliable.

Permissions: the evaluation order that actually runs

Permission behavior is where production agents most often surprise their authors. When Claude requests a tool, the SDK evaluates in a fixed order:

1. Hooks run first and can deny the call outright.
2. Deny rules are checked next. A bare name like Bash removes the tool from Claude's context entirely; a scoped pattern like Bash(rm *) blocks matching calls in every mode, including bypassPermissions.
3. Ask rules route matching calls to your canUseTool callback for confirmation.
4. The permission mode applies: bypassPermissions approves everything that reaches this step, acceptEdits approves file operations, plan never auto-approves writes.
5. Allow rules (allowed_tools) approve listed tools.
6. The canUseTool callback decides anything left, unless the mode is dontAsk, which denies instead of prompting.

Two footguns fall straight out of that order. First, allowed_tools does not constrain bypassPermissions: unlisted tools simply fall through to the mode, which approves them, so allowed_tools=["Read"] with bypassPermissions still approves Bash and Write. Use disallowed_tools for hard blocks. Second, subagents inherit bypassPermissions, acceptEdits, and auto from the parent and cannot override them per subagent, which means a loosely-permissioned parent silently grants its subagents the same access.
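The first footgun is easier to see in a toy model of the evaluation order. The function below is a simplified sketch in plain Python, not SDK code: it omits ask rules, treats acceptEdits as covering only Edit and Write (an assumption), and matches scoped deny patterns with shell-style globbing. The decide function and its parameters are hypothetical names.

```python
from fnmatch import fnmatchcase


def decide(tool, call="", *, mode="default", allowed=(), deny=(), hook_denies=False):
    """Toy model of the permission order described above (ask rules omitted)."""
    # Hooks run first and can deny outright.
    if hook_denies:
        return "deny (hook)"
    # Deny rules: a bare name like "Bash" or a scoped pattern like "Bash(rm *)".
    for rule in deny:
        name, _, pattern = rule.partition("(")
        if name == tool and (not pattern or fnmatchcase(call, pattern.rstrip(")"))):
            return "deny (rule)"
    # The permission mode applies before allow rules are consulted.
    if mode == "bypassPermissions":
        return "allow (mode)"
    if mode == "acceptEdits" and tool in ("Edit", "Write"):
        return "allow (mode)"
    # Allow rules approve listed tools.
    if tool in allowed:
        return "allow (rule)"
    # Anything left goes to the callback, unless dontAsk denies instead.
    return "deny (dontAsk)" if mode == "dontAsk" else "ask (canUseTool)"


# allowed=["Read"] does not constrain bypassPermissions:
print(decide("Bash", "ls", mode="bypassPermissions", allowed=["Read"]))  # allow (mode)
# A scoped deny rule holds even in bypassPermissions:
print(decide("Bash", "rm -rf build", mode="bypassPermissions", deny=["Bash(rm *)"]))  # deny (rule)
# The locked-down pattern: explicit list plus dontAsk.
print(decide("Bash", "ls", mode="dontAsk", allowed=["Read"]))  # deny (dontAsk)
```

Because the mode check sits ahead of the allow list, the list only matters in modes that let a call fall through to it.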

The locked-down pattern for headless production agents is an explicit tool list plus a hard-deny default: allowedTools with permissionMode: "dontAsk". You can also change mode mid-session with set_permission_mode() (Python) or setPermissionMode() (TypeScript), useful for starting restrictive and loosening once the agent's plan looks sane.

Hooks: lifecycle control points

Hooks are callback functions the SDK runs at lifecycle points: PreToolUse, PostToolUse, Stop, SessionStart, SessionEnd, UserPromptSubmit, and others. You register them with a matcher, for example HookMatcher(matcher="Edit|Write", hooks=[log_file_change]) to audit every file modification. Because hooks run before everything else in the permission chain, they're the right place for guarantees that must hold in every mode: blocking reads of credential files, stamping audit logs, or emitting custom telemetry.
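As an illustration of the credential-file guarantee, here is a sketch of a PreToolUse callback that denies reads of secret-looking files. It rests on assumptions to verify against the SDK's hooks reference: that the callback receives (input_data, tool_use_id, context), that the Read tool's input carries a file_path key, and that a deny is expressed through a hookSpecificOutput dictionary with permissionDecision set to "deny". BLOCKED_NAMES and block_secret_reads are hypothetical names.

```python
import asyncio
from pathlib import PurePosixPath

# Hypothetical deny list; adjust to the credential files your project uses.
BLOCKED_NAMES = {".env", "credentials", "id_rsa"}


async def block_secret_reads(input_data, tool_use_id, context):
    """Deny the call when the target file looks like a secret; else no opinion."""
    # Assumption: the Read tool passes its target as tool_input["file_path"].
    path = input_data.get("tool_input", {}).get("file_path", "")
    name = PurePosixPath(path).name
    if name in BLOCKED_NAMES or name.startswith(".env."):
        # Assumption: this is the deny shape the SDK expects from a hook.
        return {
            "hookSpecificOutput": {
                "hookEventName": "PreToolUse",
                "permissionDecision": "deny",
                "permissionDecisionReason": f"Reading {name} is blocked by policy",
            }
        }
    return {}  # empty result: let the rest of the permission chain decide


# Registration would look roughly like (assumed options shape):
#   hooks={"PreToolUse": [HookMatcher(matcher="Read", hooks=[block_secret_reads])]}
decision = asyncio.run(
    block_secret_reads({"tool_name": "Read", "tool_input": {"file_path": "/app/.env"}}, None, None)
)
print(decision["hookSpecificOutput"]["permissionDecision"])  # deny
```

A matcher on Read alone leaves other routes open, such as Bash running cat on the same file, so a real guard would likely pair this with scoped deny rules or a second hook that inspects shell commands.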

Subagents in Practice

Subagents are separate agent instances your main agent spawns for focused subtasks. They earn their complexity in four situations: you need context isolation (a research subagent can read fifty files without bloating the parent's context), parallelism (independent subagents run concurrently, so the work finishes in the time of the slowest one), specialized instructions (a migration expert prompt that would be noise in the main agent), or tool restrictions (a reviewer that can read but never write).

You define subagents three ways: programmatically through the agents option (recommended for SDK applications), as markdown files in .claude/agents/, or not at all, since Claude can always invoke the built-in general-purpose subagent. Programmatic definitions use AgentDefinition:

import asyncio

from claude_agent_sdk import (
    AgentDefinition,
    ClaudeAgentOptions,
    ToolUseBlock,
    query,
)


async def main():
    async for message in query(
        prompt="Review the authentication module for security issues",
        options=ClaudeAgentOptions(
            # Include "Agent" so subagent invocations are auto-approved.
            allowed_tools=["Read", "Grep", "Glob", "Agent"],
            agents={
                "code-reviewer": AgentDefinition(
                    # description tells Claude WHEN to delegate
                    description="Expert code review specialist. Use for "
                    "quality, security, and maintainability reviews.",
                    # prompt is the subagent's own system prompt
                    prompt="You are a code review specialist. Identify "
                    "security vulnerabilities and suggest improvements.",
                    # read-only tool set: this subagent can never write
                    tools=["Read", "Grep", "Glob"],
                    model="sonnet",
                ),
                "test-runner": AgentDefinition(
                    description="Runs and analyzes test suites.",
                    prompt="You run tests and analyze the results.",
                    # Bash access lets this one execute test commands
                    tools=["Bash", "Read", "Grep"],
                ),
            },
        ),
    ):
        # Detect subagent invocations. Match both names: older SDK
        # versions emitted "Task", current versions emit "Agent".
        if hasattr(message, "content") and message.content:
            for block in message.content:
                if isinstance(block, ToolUseBlock) and block.name in (
                    "Task",
                    "Agent",
                ):
                    print(f"Subagent invoked: {block.input.get('subagent_type')}")

        # Messages from inside a subagent carry parent_tool_use_id.
        if getattr(message, "parent_tool_use_id", None):
            print("  (running inside subagent)")

        if hasattr(message, "result"):
            print(message.result)


asyncio.run(main())

The description field does the routing: Claude reads it to decide when to delegate, so write it like an instruction ("Use for quality and security reviews"), and mention the subagent by name in your prompt when you need guaranteed invocation. The full field set gives you per-subagent control over model, effort, and limits:

Field	Required	What it does
`description`	Yes	When Claude should delegate to it
`prompt`	Yes	The subagent's system prompt
`tools`	No	Allowed tools (omit = inherit all)
`disallowedTools`	No	Tools removed from its set
`model`	No	`'fable'`, `'opus'`, `'sonnet'`, `'haiku'`, `'inherit'`, or full ID
`skills`	No	Skills preloaded at startup
`maxTurns`	No	Turn cap before it stops
`background`	No	Run as non-blocking background task
`effort`	No	Reasoning effort level
`permissionMode`	No	Mode inside this subagent

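Unlike ClaudeAgentOptions, which is snake_case, the Python AgentDefinition keeps camelCase for these field names (disallowedTools, maxTurns, permissionMode) to match the wire format.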

Claude invokes subagents through the Agent tool, so include "Agent" in allowedTools or the invocation falls through to your permission callback. One compatibility wrinkle: the tool was renamed from Task to Agent in Claude Code v2.1.63, and current SDK releases still emit Task in the system:init tools list and in permission-denial records even though tool_use blocks say Agent. Detection code should match both names.

What a subagent actually inherits

The single most important fact about subagents: the only channel from parent to subagent is the prompt string in the Agent tool call. A subagent's context starts fresh.

The subagent receives	The subagent does not receive
Its own system prompt + the Agent tool's prompt	The parent's conversation history
Project CLAUDE.md (via setting sources)	The parent's tool results
Tool definitions (inherited or the `tools` subset)	The parent's system prompt
Skills listed in `AgentDefinition.skills`	Preloaded skill content not listed there

The design consequence is that anything the subagent needs (file paths, error messages, decisions already made) must be written into that prompt explicitly. On the way back, only the subagent's final message returns to the parent, verbatim as the Agent tool result, though the parent may then summarize it in its own response.
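Where your own code composes the delegation brief (for example, when you embed it in the main prompt and ask Claude to pass it along), a small builder keeps the handoff explicit. This is a minimal sketch; build_handoff_prompt and its section labels are hypothetical and not part of the SDK.

```python
def build_handoff_prompt(task, files, constraints=(), known_facts=()):
    """Assemble a self-contained brief for a subagent that sees nothing else."""
    lines = [f"Task: {task}", "", "Files in scope:"]
    lines += [f"- {path}" for path in files]
    if known_facts:
        # Decisions and errors the parent already knows about.
        lines += ["", "Already established:"]
        lines += [f"- {fact}" for fact in known_facts]
    if constraints:
        lines += ["", "Constraints:"]
        lines += [f"- {rule}" for rule in constraints]
    return "\n".join(lines)


brief = build_handoff_prompt(
    task="Review the login flow for security issues",
    files=["src/auth/login.py", "src/auth/session.py"],
    constraints=["Read-only: do not modify files"],
    known_facts=["Tests fail with KeyError: 'user_id' in session.py"],
)
print(brief)
```

The point of the structure is that every section answers a question the subagent cannot ask the parent: what to do, where to look, what is already known, and what is off limits.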

Common handoff failure modes

These are the failures you'll actually debug, and most of them are visible in traces once you instrument (next section):

Starved prompts. The parent delegates "review the changes" without saying which files. The subagent, which can't see the parent's conversation, re-explores from scratch or reviews the wrong thing. Fix: treat the Agent prompt like an API contract and pass paths and constraints explicitly.
Summarized-away results. The subagent produces a detailed report; the parent compresses it to two sentences. If you need subagent output verbatim in the final response, say so in the main query's system prompt.
Silent non-delegation. Claude does the work itself instead of delegating. The usual causes are Agent missing from allowedTools or a vague description; explicit naming in the prompt bypasses matching entirely.
Rename drift. Monitoring code that checks block.name == "Task" stops seeing subagent invocations on current SDKs. Match both Task and Agent.
Permission inheritance surprises. A parent in bypassPermissions hands every subagent full autonomous access, regardless of what you intended for that subagent.
Lost subagent state. A subagent that stopped partway is not gone: its transcript persists independently of the main conversation and survives that conversation's compaction. The Agent tool result contains an agentId: trailer; resume the same session and reference that ID to continue the subagent where it stopped. Transcripts are cleaned up after cleanupPeriodDays (30 days by default).

Subagents also can't spawn their own subagents, so don't put Agent in a subagent's tool list. When a job needs dozens or hundreds of coordinated agents rather than a few delegations per turn, the TypeScript SDK's Workflow tool (v0.3.149+) moves orchestration into a script that runs outside the conversation context.

Skills: Packaging Reusable Capabilities

Skills extend your agent with capabilities Claude invokes on its own when the task matches. Each skill is a directory containing a SKILL.md file with YAML frontmatter and instructions, and unlike subagents there is no programmatic registration: skills are filesystem artifacts, full stop.
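Since a skill is just a directory on disk, scaffolding one takes a few lines of standard-library Python. The sketch below writes a SKILL.md with name and description frontmatter under a project's .claude/skills/ directory; scaffold_skill is a hypothetical helper, and the example skill content is invented for illustration.

```python
import tempfile
from pathlib import Path


def scaffold_skill(project_root, name, description, instructions):
    """Create <project_root>/.claude/skills/<name>/SKILL.md and return its path."""
    skill_dir = Path(project_root) / ".claude" / "skills" / name
    skill_dir.mkdir(parents=True, exist_ok=True)
    skill_file = skill_dir / "SKILL.md"
    # Simplification: values are written unquoted, so a description that
    # contains a colon would need YAML quoting.
    skill_file.write_text(
        f"---\nname: {name}\ndescription: {description}\n---\n\n{instructions}\n",
        encoding="utf-8",
    )
    return skill_file


with tempfile.TemporaryDirectory() as root:
    path = scaffold_skill(
        root,
        name="changelog",
        description="Use when asked to write or update release notes",
        instructions="Group entries under Added, Changed, and Fixed.",
    )
    print(path.read_text(encoding="utf-8"))
```

The description line is the part that sits in context at startup, so it is worth writing as a trigger condition rather than a summary.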

Discovery runs through setting sources. With default options the SDK loads skills from ~/.claude/skills/ (user) and .claude/skills/ in your working directory and its parents (project). If you set setting_sources explicitly, include "user" or "project" or skills silently won't load, which is the most common "skills not found" cause. The skills option then filters what's available: "all" enables everything discovered, a list of names enables only those, and [] disables skills entirely.

from claude_agent_sdk import ClaudeAgentOptions

# Skills load from the filesystem via setting sources:
#   "user"    -> ~/.claude/skills/
#   "project" -> <cwd>/.claude/skills/ (and parents up to the repo root)
options = ClaudeAgentOptions(
    cwd="/path/to/project",  # project containing .claude/skills/
    setting_sources=["user", "project"],  # required for skill discovery
    skills="all",  # enable every discovered skill
    allowed_tools=["Read", "Write", "Bash"],
)

# Or enable only specific skills by name (SKILL.md name field or
# directory name; "plugin:skill" for plugin-provided skills):
options_filtered = ClaudeAgentOptions(
    setting_sources=["user", "project"],
    skills=["pdf", "docx"],
)

The loading model is progressive: skill descriptions sit in context at startup, and Claude pulls in a skill's full content only when the task calls for it. That makes skills cheap to carry and a good home for instructions too bulky for a system prompt. Two SDK-specific gotchas: the allowed-tools frontmatter field only works in the Claude Code CLI and is ignored by the SDK, so control tool access through allowedTools; and the skills option is a context filter, not a sandbox, meaning unlisted skill files are still readable from disk by Read or Bash.

Skills vs tools vs subagents

Teams new to the SDK regularly reach for the wrong primitive. The decision comes down to what you're packaging: knowledge, an action, or a delegated worker.

Dimension	Skill	Tool	Subagent
What it is	Instructions + resources	A callable action	Delegated agent instance
Defined as	`SKILL.md` on disk	Built-in / MCP server	`AgentDefinition` or file
Context cost	Description only, until used	Schema in context	Fresh, isolated context
Invoked by	Model, when relevant	Model, per call	Model, via Agent tool
Best for	Reusable know-how	Discrete operations	Isolated or parallel work

A useful heuristic: if you're writing instructions, it's a skill. If you're writing a function the model should call, it's a tool (or an MCP server). If the work needs its own context window, its own system prompt, or its own tool restrictions, it's a subagent.

Tracing Claude Agent SDK Apps in Production

A single query() call can expand into dozens of model turns, tool calls, and subagent runs. When something goes wrong (the agent loops, picks the wrong tool, or burns tens of thousands of tokens re-reading files), application logs give you fragments. What you need is the full trace tree: every LLM turn with its messages, every tool call with its arguments and result, token counts per step, and runs grouped by session. That's standard OpenTelemetry-based tracing, applied to agents; if the span vocabulary is new, the field-by-field guide to LLM tracing and the agent observability walkthrough cover the concepts.

The worked example here uses Catalyst's Claude Agent SDK traces integration, which is purpose-built for the SDK's query loop. It emits an AGENT span for each query, with nested LLM turns, tool-use and tool-result data, and the final assistant output captured on the span.

Instrumenting in Python and TypeScript

Install the tracing package alongside the SDK: pip install 'inference-catalyst-tracing[claude-agent-sdk]' for Python, or bun add @inference/tracing @anthropic-ai/claude-agent-sdk for TypeScript. Then point it at Catalyst with three environment variables: CATALYST_OTLP_ENDPOINT="https://telemetry.inference.net", CATALYST_OTLP_TOKEN (your API key), and a stable CATALYST_SERVICE_NAME.

The two languages instrument differently, and the difference is structural. In Python, calling setup() before you import claude_agent_sdk patches query() in place, so the rest of your code doesn't change:

import asyncio

from inference_catalyst_tracing import setup

tracing = setup(service_name="claude-agent")

from claude_agent_sdk import ClaudeAgentOptions, query

async def main() -> None:
    options = ClaudeAgentOptions(
        max_turns=4,
        allowed_tools=["Bash"],
        permission_mode="bypassPermissions",
    )
    async for message in query(
        prompt=(
            "Count files matching *.md under the current directory tree. "
            "Use the Bash tool. Reply with just the integer."
        ),
        options=options,
    ):
        print(message)

asyncio.run(main())
tracing.shutdown()

TypeScript ESM namespace bindings can't be safely patched in place, so the integration uses an explicit wrapper instead:

import { query } from "@anthropic-ai/claude-agent-sdk";
import { setup, wrapClaudeAgentSdkQuery } from "@inference/tracing";

const tracing = await setup({ serviceName: "claude-agent" });
const tracedQuery = wrapClaudeAgentSdkQuery(query);

const stream = tracedQuery({
  prompt:
    "Count the number of files matching *.md under the current directory tree. " +
    "Use the Bash tool. Reply with just the integer.",
  options: {
    maxTurns: 4,
    allowedTools: ["Bash"],
    permissionMode: "bypassPermissions",
  },
});

for await (const message of stream) {
  console.log(message);
}

await tracing.shutdown();

Stable identity and session grouping

Auto-instrumentation gets you spans; identity makes them navigable. Wrapping the product-level query in agent_span() / agentSpan() adds an outer AGENT span carrying agent.id, agent.name, and session.id, which is what the Agents dashboard groups on. Pass your application's conversation ID as the session ID and every run of "the PR review agent for conversation 101" lands in one place:

import asyncio

from inference_catalyst_tracing import agent_span, setup

# Call setup() before importing claude_agent_sdk so query() is patched.
tracing = setup(service_name="claude-agent")

from claude_agent_sdk import ClaudeAgentOptions, query

prompt = "Review the current diff and list risky changes."

async def traced_review() -> None:
    options = ClaudeAgentOptions(max_turns=4)

    # The outer AGENT span carries the identity the dashboard groups on.
    with agent_span(
        tracing.tracer,
        agent_id="claude-review-agent",
        agent_name="Claude Review Agent",
        span_name="claude-review.run",
        session_id="conversation-pr-review-101",
        agent_role="code-review",
        system="anthropic",
    ) as span:
        span.set_input(prompt)
        output = []
        async for message in query(prompt=prompt, options=options):
            output.append(str(message))
        span.set_output("\n".join(output))

asyncio.run(traced_review())
tracing.shutdown()  # flush batched spans before the process exits

Figure 2: What a traced Claude Agent SDK run looks like

One lifecycle rule keeps spans from silently disappearing: spans export in batches, so a process that exits before flushing drops them. Short-lived scripts call shutdown() before exit; long-lived services call setup() once per process and shutdown() on SIGTERM; serverless handlers memoize setup() and call forceFlush() per invocation, since the provider must survive for the next warm invocation. The tracing quickstart documents each pattern with full server and handler examples.
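For the long-lived service case, the flush-on-SIGTERM rule can be wired with the standard library alone. This is a minimal sketch, assuming shutdown is whatever callable flushes your tracer (for example, the shutdown() method on the object setup() returns); install_sigterm_flush is a hypothetical helper name.

```python
import signal
import sys


def install_sigterm_flush(shutdown):
    """Register a SIGTERM handler that flushes tracing and then exits."""

    def handler(signum, frame):
        shutdown()   # export any spans still sitting in the batch
        sys.exit(0)  # then let the process terminate normally

    # signal.signal must be called from the main thread.
    signal.signal(signal.SIGTERM, handler)
    return handler
```

In a service you would call install_sigterm_flush(tracing.shutdown) once at startup, right after setup(). Returning the handler makes it possible to exercise the flush path in a unit test without sending a real signal.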

With instrumentation in place, the subagent failure modes from earlier stop being guesswork: a starved handoff shows up as an Agent tool call whose prompt argument is visibly missing context, and a runaway loop shows up as a span tree with far more LLM turns than the task should need.
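The runaway-loop signal can be checked mechanically once traces are exported. The sketch below assumes a simplified span shape (nested dictionaries with kind and children keys), which is a hypothetical stand-in and not Catalyst's actual export schema; count_spans and looks_runaway are likewise illustrative names.

```python
def count_spans(span, kind):
    """Count spans of the given kind in a nested span tree."""
    own = 1 if span.get("kind") == kind else 0
    return own + sum(count_spans(child, kind) for child in span.get("children", []))


def looks_runaway(root, max_llm_turns):
    """Flag a run whose LLM turn count exceeds what the task should need."""
    return count_spans(root, "LLM") > max_llm_turns


trace = {
    "kind": "AGENT",
    "children": [
        {"kind": "LLM", "children": [{"kind": "TOOL", "children": []}]},
        {"kind": "LLM", "children": [{"kind": "TOOL", "children": []}]},
        {"kind": "LLM", "children": []},
    ],
}
print(count_spans(trace, "LLM"))  # 3
print(looks_runaway(trace, max_llm_turns=2))  # True
```

The threshold is task-specific: a file-counting agent that needs more than a handful of turns is suspicious, while a refactoring agent may legitimately need dozens.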

Evaluating Agent Quality

Tracing tells you what happened. Evaluation tells you whether it was any good, and whether your next change makes it better or worse.

The most useful eval data comes from production. Once your traffic is captured, you can filter inferences by model, task, status, or date range and save the slice as an eval dataset. Curate deliberately: eval datasets should be small, stable, and challenging, weighted toward the hard cases where you're not sure the agent gets it right. Keep them strictly separate from any training data; the zero-overlap rule exists because a model that trained on your eval examples makes the eval meaningless.
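The zero-overlap rule is easy to enforce with a check that runs before each eval. A minimal sketch, assuming both datasets can be reduced to lists of prompt strings and that lowercasing plus whitespace collapsing is an adequate notion of "same example"; fingerprint and find_overlap are hypothetical helper names.

```python
import hashlib


def fingerprint(text):
    """Hash a prompt after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_overlap(eval_prompts, train_prompts):
    """Return the eval prompts that also appear in the training set."""
    seen = {fingerprint(p) for p in train_prompts}
    return [p for p in eval_prompts if fingerprint(p) in seen]


leaks = find_overlap(
    eval_prompts=["Fix the  login bug", "Add tests for the parser"],
    train_prompts=["fix the login bug"],
)
print(leaks)  # ['Fix the  login bug']
```

Exact-match hashing catches copies and near-copies that differ only in case or spacing. Paraphrased duplicates need a fuzzier comparison, which is outside the scope of this sketch.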

Scoring runs through rubric-based LLM-as-a-judge evaluation. You describe what "good" looks like in plain English and attach a numeric scale; a judge model reads the rubric, the conversation context, and the output, and returns a scored judgment. Rubrics come in two forms: direct rubrics grade the output against your criteria alone, while adherence rubrics grade how closely it matches a reference response. Two practices keep judge scores meaningful. Use the smartest available model as the judge, since a weaker judge can't reliably distinguish quality differences in a stronger model's output. And validate the rubric before trusting it: the judge faithfully scores whatever criteria you wrote, including bad ones. The mechanics are documented in how LLM-as-a-judge scoring works.
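To make the moving parts concrete, here is a sketch of the two pieces you would own if you ran a direct rubric yourself: assembling the judge prompt and parsing the score out of the reply. It does not call any model and is not the platform's implementation. The prompt wording, the SCORE: line convention, and the 1 to 5 scale are all assumptions chosen for illustration.

```python
import re
from typing import Optional


def build_judge_prompt(rubric, context, output, low=1, high=5):
    """Combine rubric, conversation context, and output into one judge prompt."""
    return (
        "You are grading an AI agent's output against a rubric.\n\n"
        f"Rubric:\n{rubric}\n\n"
        f"Conversation context:\n{context}\n\n"
        f"Output to grade:\n{output}\n\n"
        "Explain your reasoning, then end with a line 'SCORE: <n>' "
        f"where n is an integer from {low} to {high}."
    )


def parse_score(reply, low=1, high=5) -> Optional[int]:
    """Return the last in-range SCORE line, or None if the reply has none."""
    matches = re.findall(r"^SCORE:\s*(\d+)\s*$", reply, flags=re.MULTILINE)
    if not matches:
        return None
    score = int(matches[-1])
    return score if low <= score <= high else None


print(parse_score("The summary lists every TODO and groups them by file.\nSCORE: 4"))  # 4
```

Returning None for a missing or out-of-range score, rather than guessing, keeps malformed judge replies visible in the results. An adherence rubric would add a reference response as a fourth section of the prompt.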

Today this runs as offline evaluation: you select a dataset and candidate models, outputs are re-generated, and the judge scores them, which answers "how would model X perform on my data?" Online evaluation, which scores live production outputs continuously with sample-rate controls, is coming to the platform. For an Agent SDK app, the practical loop looks like this:

Figure 3: The production improvement loop

Trace production runs, promote interesting failures into the eval dataset, score candidate fixes (a prompt change, a model swap, a new subagent split) against the rubric before shipping, and let the next round of traces tell you whether production agrees.

From Traces to Fixes: Scheduled Failure-Mode Analysis with Halo

Inspecting traces one at a time finds individual bugs. It doesn't find the pattern where your agent quietly misuses a tool in a small fraction of runs, or the regression that only shows up across a day of traffic. That cross-run layer is what Halo automates.

Halo reads an agent's traces over a time window and produces a markdown report flagging anomalies, errors, and inefficiencies, with opportunities to improve reliability, latency, cost, and tool usage. The report is written so you can paste it straight into a coding agent to apply the fixes. Each run analyzes one agent over one window; schedules fire recurring runs hourly, daily, weekly, or monthly, which is how regressions get caught without anyone remembering to look.

From the Halo CLI reference, a one-off analysis is four commands:

# 1. Find the agent you want to analyze
inf halo agent list

# 2. Start a run for that agent over the last 24h
inf halo run create --agent-uuid <agent-uuid>

# 3. Wait for it to finish (use the run id printed by step 2)
inf halo run poll <run-id>

# 4. Read the report (the assistant message in the conversation)
inf halo conversation get <conversation-id>

By default a run looks back 24 hours and caps analysis at 10,000 spans, with subagent recursion analyzed up to three levels deep. The output closes the loop from the previous section: failure modes Halo surfaces become eval-dataset entries, and the trace IDs in the report are the evidence trail for each one.

If your agents are already traced, this layer is the difference between noticing failures and being told about them.

Production Checklist

Before a Claude Agent SDK app takes real traffic, work through this list.

Track cost per run. Every result message carries total_cost_usd; log it per session and alert on outliers. Set max_turns so a confused agent can't loop indefinitely.
Handle result subtypes. Branch on the result's subtype: success, error_max_turns, or error_max_budget_usd. The recovery path for limit errors is resuming the same session with a higher limit, not starting over.
Harden permissions. Headless agents get allowedTools plus permissionMode: "dontAsk". Use scoped deny rules like Bash(rm *) for operations that must never run; they hold in every mode.
Strip secrets at the boundary. Add a PreToolUse hook that blocks reads of .env and credential paths; hooks run before every other permission check, so the guarantee survives even bypassPermissions.
Decide session retention. Transcripts are JSONL on disk and contain everything the agent saw. Treat them like logs with sensitive data: scope filesystem access, and remember subagent transcripts are cleaned up after 30 days by default.
Match tracing lifecycle to deployment shape. Memoize setup() once per process; flush on SIGTERM for services and per invocation on serverless.
Gate releases on evals. Don't ship prompt or model changes on vibes. Score them against your eval dataset first, and keep a scheduled Halo run watching for the regressions evals didn't predict.
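The cost-tracking and result-subtype items can share one small decision function. This is a sketch under stated assumptions: RunResult is a hypothetical stand-in carrying the three result-message fields the checklist mentions (subtype, session_id, total_cost_usd), and the 1.00 USD alert threshold and the action labels are arbitrary example values.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """Stand-in for the fields read off the SDK's final result message."""
    subtype: str
    session_id: str
    total_cost_usd: float


LIMIT_ERRORS = {"error_max_turns", "error_max_budget_usd"}


def plan_next_step(result, cost_alert_usd=1.00):
    """Map a finished run to a follow-up action and a cost alert flag."""
    if result.subtype == "success":
        action = "done"
    elif result.subtype in LIMIT_ERRORS:
        # Limit errors: resume the same session with a higher limit.
        action = "resume_with_higher_limit"
    else:
        action = "investigate"
    return {
        "action": action,
        "session_id": result.session_id,
        "cost_alert": result.total_cost_usd > cost_alert_usd,
    }


print(plan_next_step(RunResult("error_max_turns", "sess-123", 1.42)))
# {'action': 'resume_with_higher_limit', 'session_id': 'sess-123', 'cost_alert': True}
```

Carrying session_id through the return value is what makes the resume path possible: the caller has the ID it needs without re-reading the message stream.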

FAQ

Is the Claude Agent SDK the same as the Claude Code SDK? Yes. Anthropic renamed the Claude Code SDK to the Claude Agent SDK; the packages are claude-agent-sdk on PyPI and @anthropic-ai/claude-agent-sdk on npm.

Does the Claude Agent SDK support both Python and TypeScript? Yes. Python requires 3.10 or later; the TypeScript package bundles a native Claude Code binary, so no separate Claude Code install is needed.

Can I run it on Bedrock or Vertex AI? Yes. Set CLAUDE_CODE_USE_BEDROCK=1 with AWS credentials, or CLAUDE_CODE_USE_VERTEX=1 with Google Cloud credentials; Claude Platform on AWS and Azure are also supported.

How do I see what my agent actually did in production? Instrument it with an OpenTelemetry-based tracing SDK. With the Catalyst integration, each query loop becomes an AGENT span with nested LLM turns, tool calls and results, and token counts, grouped by agent and session in the dashboard.

Do subagents share the parent's conversation? No. A subagent starts with a fresh context, and the Agent tool's prompt string is the only thing the parent passes in. Include any files, errors, or decisions the subagent needs directly in that prompt.

Wrapping Up

The Claude Agent SDK hands you a production-grade harness: the loop, the tools, sessions, permissions, and delegation are solved problems. Reliability is the part you still own, and it comes from the loop this guide laid out: instrument every run, turn traces into eval datasets, score changes before they ship, and let scheduled analysis catch what slips through.

The fastest way to start is to make your agent visible. Instrumentation is a setup() call and three environment variables, and your next query() shows up as a full trace tree.

LLM Tracing — a field-by-field reference for the spans, messages, and token counts behind the tracing section
LLM Observability: Monitoring Production Deployments — metrics, tracing, evals, and cost monitoring for production LLM apps
LLM Evaluation Tools Comparison — a side-by-side look at nine popular LLM evaluation tools