Jun 19, 2026
MCP Server Observability: Logging and Tracing Tool Calls
Inference Research
What MCP Observability Actually Means
Your agent calls a tool through MCP, the call fails, and a user gets a wrong answer. You open the MCP server's logs and find a single line: a tools/call request arrived at 14:32 and returned an error. That's it. Why did the agent pick that tool? What arguments did the model actually generate? Did the agent retry, give up, or hallucinate around the failure? The server has no idea, and by design it never will.
MCP observability is the practice of collecting and correlating logs, metrics, and traces from both sides of the Model Context Protocol: the servers that execute tool calls and the agents that make them. Done right, it lets you answer four questions about any tool call: what was invoked, with what arguments, what it returned, and why the agent called it in the first place.
That last question is what makes MCP server observability different from ordinary API monitoring. The caller is a non-deterministic language model, the conversation that explains its choice lives in a different process than the execution, and many failures never look like failures at the transport layer. The MCP specification directs servers to report tool execution errors inside a successful JSON-RPC response, with isError: true set on the result, so that the model can read the error and self-correct. Monitor HTTP status codes alone and most of your real tool failures are invisible.
This guide covers both halves of the problem: the logging machinery the MCP spec gives servers, the tracing conventions that capture tool calls on the agent side, a debugging workflow that connects them, and how to turn accumulated traces into systematic fixes rather than one-off patches.
Where Observability Data Lives in the MCP Architecture
A quick refresher on the architecture, because it dictates where your data can come from. MCP follows a client-host-server design built on JSON-RPC. The host application — Claude Desktop, an IDE, or your own agent service — creates and manages one or more clients, and each client maintains a 1:1 stateful session with exactly one server. Servers expose tools, resources, and prompts. If you're new to the protocol itself, our Model Context Protocol explainer covers the fundamentals.
One of MCP's core design principles is isolation: servers cannot read the conversation history and cannot see into other servers. The full conversation stays with the host. That principle is good for security and terrible for server-side debugging. It means a server log is structurally incomplete: the "why" behind every request it receives lives in a process it cannot observe.
Transport choice decides where server logs physically go. With stdio transport, the server runs as a subprocess of the client: it may write UTF-8 log text to stderr, which the client is free to capture, forward, or ignore, and it must never write anything to stdout that isn't a valid MCP message. A stray print() or console.log() to stdout doesn't just pollute logs; it corrupts the protocol stream. Streamable HTTP servers, by contrast, are normal network services. They log like any other app, and they can issue an Mcp-Session-Id header at initialization that clients must echo on subsequent requests, which gives you a session key for correlating activity.
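A minimal sketch of stdio-safe logging, assuming a Python server built on the standard `logging` module. The logger name `mcp.server` and the JSON field names are illustrative choices, not anything the spec prescribes. The only point it demonstrates is that every log line goes to stderr, so stdout carries nothing but MCP messages.

```python
import json
import logging
import sys


class JsonLineFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname.lower(),
            "logger": record.name,
            "message": record.getMessage(),
        })


def configure_stdio_logging(name: str = "mcp.server") -> logging.Logger:
    """Route all log output to stderr; stdout stays reserved for MCP."""
    logger = logging.getLogger(name)
    logger.handlers.clear()
    handler = logging.StreamHandler(sys.stderr)  # never sys.stdout
    handler.setFormatter(JsonLineFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # stop root handlers from echoing elsewhere
    return logger


log = configure_stdio_logging()
log.info("tools/call received: %s", "query_database")
```

Setting `propagate = False` is a defensive choice: a root handler configured elsewhere in the process cannot then re-emit the same records to stdout.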
Figure 1: Where MCP Observability Data Lives
So each side sees a different slice of reality. The server sees its own execution in full detail. The host sees the model's reasoning, the arguments, the result handling, and the whole session. Here's how the two slices compare on the questions you actually ask during an incident:
| Question | Server logs | Agent traces |
|---|---|---|
| Was the tool called? | ✅ | ✅ |
| Arguments as generated | ✅ | ✅ |
| Why the agent called it | ❌ | ✅ |
| Inside tool execution | ✅ | ❌* |
| What agent did next | ❌ | ✅ |
| Retries across session | Partial | ✅ |
| Token/cost of run | ❌ | ✅ |
| Cross-server picture | ❌ | ✅ |
Notes: "❌*" becomes ✅ when the server participates in trace-context propagation (params._meta). "Partial" means an HTTP server can count repeat calls in its own session but cannot see why they happened.
The rest of this guide works through each column: first what the spec gives you server-side, then how to capture the agent side, then how to join them.
MCP Logging: What the Spec Gives You Server-Side
MCP ships a real logging utility in the specification. Not a convention: an actual protocol feature with capability negotiation, severity levels, and structured payloads. Most guides to MCP server logging skip it entirely. Here's how it works and where it stops.
Log Levels and logging/setLevel
A server that wants to emit log notifications must declare the logging capability during initialization. Severity follows RFC 5424, the syslog standard, with eight levels.
| Level | Meaning | Example use |
|---|---|---|
| debug | Detailed debug info | Function entry/exit |
| info | General information | Operation progress |
| notice | Normal but significant | Config change |
| warning | Warning condition | Deprecated feature used |
| error | Error condition | Tool call failed |
| critical | Critical condition | Component unavailable |
| alert | Act immediately | Data corruption detected |
| emergency | System unusable | Complete failure |
Verbosity is client-controlled: the client sends a logging/setLevel request naming the minimum level it wants, and the server suppresses everything below it. If a client sends an invalid level, the spec prescribes a standard JSON-RPC -32602 Invalid params error.
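The level negotiation described above can be sketched in a few lines of Python. `LogGate` is a hypothetical helper, not part of any MCP SDK, and the default minimum of `info` is an assumption made for the example.

```python
# RFC 5424 severities, least to most severe, as listed in the table above.
LEVELS = ["debug", "info", "notice", "warning",
          "error", "critical", "alert", "emergency"]
INVALID_PARAMS = -32602  # standard JSON-RPC "Invalid params" code


class LogGate:
    """Tracks the client's minimum level and filters outgoing entries."""

    def __init__(self, minimum: str = "info") -> None:
        self.minimum = minimum

    def handle_set_level(self, request: dict) -> dict:
        """Answer a logging/setLevel request with a result or an error."""
        level = request.get("params", {}).get("level")
        if level not in LEVELS:
            return {"jsonrpc": "2.0", "id": request.get("id"),
                    "error": {"code": INVALID_PARAMS,
                              "message": f"Invalid log level: {level!r}"}}
        self.minimum = level
        return {"jsonrpc": "2.0", "id": request.get("id"), "result": {}}

    def should_emit(self, level: str) -> bool:
        """True when `level` is at or above the client's minimum."""
        return LEVELS.index(level) >= LEVELS.index(self.minimum)


gate = LogGate()
ok = gate.handle_set_level({"jsonrpc": "2.0", "id": 1,
                            "method": "logging/setLevel",
                            "params": {"level": "warning"}})
bad = gate.handle_set_level({"jsonrpc": "2.0", "id": 2,
                             "method": "logging/setLevel",
                             "params": {"level": "verbose"}})
```

An invalid level leaves the previous threshold untouched, so a buggy client cannot accidentally silence or flood the channel.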
Sending Structured Log Notifications
Servers emit log entries as notifications/message notifications. Each carries a level, an optional logger name, and a data field that accepts any JSON-serializable payload, so structured logging is the default, not an add-on. Here are all three protocol pieces together:
// 1. Server declares the logging capability during initialization
{
  "capabilities": {
    "logging": {}
  }
}
// 2. Client sets the minimum level it wants to receive
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "logging/setLevel",
  "params": {
    "level": "info"
  }
}
// 3. Server emits a structured log entry as a notification
{
  "jsonrpc": "2.0",
  "method": "notifications/message",
  "params": {
    "level": "error",
    "logger": "database-tool",
    "data": {
      "error": "Query failed schema validation",
      "tool": "query_database",
      "field": "filter.created_at",
      "reason": "expected ISO 8601 timestamp"
    }
  }
}
Note what this machinery is and isn't. These notifications travel over the MCP session to the connected client. They do not go to a file, a collector, or a log store. If the host ignores them (and many do), they evaporate. Treat notifications/message as a debugging and transparency channel for the client, and keep conventional logging for durability: stderr for stdio servers, your normal logging stack for HTTP servers. The current spec version, 2025-11-25, explicitly clarifies that stdio servers may use stderr for all types of logging, not just errors.
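One way to act on that advice is to write every entry to a durable sink before notifying the client. The sketch below assumes a `send` callable that stands in for whatever your server uses to push a JSON-RPC message over the session; both `emit_log` and `send` are hypothetical names.

```python
import io
import json
import sys
from typing import Any, Callable, TextIO


def build_log_notification(level: str, logger: str, data: Any) -> dict:
    """Build a notifications/message payload (no id: it is a notification)."""
    return {
        "jsonrpc": "2.0",
        "method": "notifications/message",
        "params": {"level": level, "logger": logger, "data": data},
    }


def emit_log(level: str, logger: str, data: Any,
             send: Callable[[dict], None],
             durable: TextIO = sys.stderr) -> dict:
    """Write the entry to a durable sink first, then notify the client."""
    message = build_log_notification(level, logger, data)
    durable.write(json.dumps(message["params"]) + "\n")
    send(message)  # hypothetical transport hook; the host may ignore it
    return message


# Usage with in-memory stand-ins for the session and the durable sink.
sent: list = []
sink = io.StringIO()
emit_log("error", "database-tool",
         {"tool": "query_database", "reason": "expected ISO 8601 timestamp"},
         send=sent.append, durable=sink)
```

Writing to the durable sink first means the record survives even if the session has already dropped when `send` is called.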
Audit Logging and Compliance
If agents take real actions through your MCP server — querying customer data, sending messages, moving money — you need an audit trail independent of whatever the client chooses to retain. MCP audit logging and monitoring means recording, for every tools/call: which tool, which session, when, with what arguments, how long execution took, and the outcome.
The spec has hard requirements here. Log messages must not contain credentials or secrets, personally identifying information, or internal system details that could aid an attacker. Servers should also rate-limit log emissions and strip sensitive fields from payloads before they leave the process. The arguments field is the dangerous one: tool arguments routinely carry API keys, customer identifiers, and file contents. Decide per-tool what gets recorded verbatim, what gets hashed, and what gets dropped.
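A per-tool policy of this kind might look like the following sketch. The policy table, the field names (`customer_id`, `api_key`), and the hashing key are all hypothetical; the keyed hash is used because a plain hash of a short identifier is easy to reverse by guessing.

```python
import hashlib
import hmac
import json

AUDIT_HASH_KEY = b"replace-with-a-managed-secret"  # hypothetical key

# Hypothetical policy: any field not listed for a tool is dropped.
ARGUMENT_POLICY = {
    "query_database": {"table": "verbatim", "filter": "verbatim",
                       "customer_id": "hash", "api_key": "drop"},
}


def redact_arguments(tool: str, arguments: dict) -> dict:
    """Apply the per-tool policy; unknown tools and fields are dropped."""
    policy = ARGUMENT_POLICY.get(tool, {})
    safe = {}
    for field, value in arguments.items():
        action = policy.get(field, "drop")  # default-deny
        if action == "verbatim":
            safe[field] = value
        elif action == "hash":
            payload = json.dumps(value, sort_keys=True).encode("utf-8")
            digest = hmac.new(AUDIT_HASH_KEY, payload, hashlib.sha256).hexdigest()
            safe[field] = f"hmac-sha256:{digest[:16]}"
    return safe


def audit_record(tool: str, session_id: str, arguments: dict,
                 duration_ms: float, is_error: bool, timestamp: str) -> dict:
    """One audit entry per tools/call, with arguments already redacted."""
    return {
        "tool": tool,
        "session_id": session_id,
        "timestamp": timestamp,
        "arguments": redact_arguments(tool, arguments),
        "duration_ms": duration_ms,
        "outcome": "error" if is_error else "ok",
    }


record = audit_record(
    "query_database", "sess-1",
    {"table": "orders", "customer_id": "cus_42", "api_key": "sk-secret"},
    duration_ms=18.5, is_error=True, timestamp="2026-06-19T14:32:00Z",
)
```

Default-deny is the deliberate design choice here: a new argument added to a tool stays out of the audit log until someone decides how it should be recorded.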
What Server Logs Cannot Tell You
Even a perfectly instrumented server hits a wall. It can tell you that query_database was called with a malformed filter at 14:32 and failed schema validation. It cannot tell you that the model had already failed the same call twice with different malformed filters, that the conversation context was about an entirely different table, or that after the third failure the agent confidently answered the user from stale memory. All of that is host-side knowledge, and the architecture guarantees the server never sees it.
Server logs answer "what did my server do." For "what did my agent do, and why," you need the other half.
MCP Tracing: Capturing Tool Calls in the Agent's Trace
The agent side is where MCP tracing earns its keep: every tool call captured as a span inside the full execution trace, nested under the LLM calls that triggered it and the agent run that contains them.
The OpenTelemetry Semantic Conventions for MCP
OpenTelemetry now publishes official semantic conventions for MCP. Spans are named {mcp.method.name} {target}, so a tool call to a weather tool becomes tools/call get_weather. The core attributes: mcp.method.name is required; gen_ai.tool.name identifies the tool; gen_ai.operation.name is set to execute_tool for tool calls; mcp.session.id and network.transport (pipe for stdio, tcp for HTTP) round out the context.
Two attributes are deliberately opt-in: gen_ai.tool.call.arguments and gen_ai.tool.call.result. The conventions treat payloads as sensitive by default, for the same reason the logging spec bans secrets in log messages. One honesty note: everything MCP-specific in these conventions is marked Development status, so attribute names may still shift.
That caveat shouldn't stop you. For debugging agents, the payloads are usually the point — a span that says tools/call query_database failed, without the arguments, tells you almost nothing about why. The practical posture is to capture arguments and results into your spans deliberately, with secret-stripping and retention controls, rather than skip them and debug blind.
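To make the attribute set concrete, here is a sketch that builds the span name and attributes as plain Python data, with payload capture behind an explicit flag. `tool_call_span` is a hypothetical helper rather than an OpenTelemetry API, and because the conventions are still in Development status the attribute names may change.

```python
import json
from typing import Any, Optional


def tool_call_span(tool_name: str, session_id: str, transport: str,
                   arguments: Optional[dict] = None, result: Any = None,
                   capture_payloads: bool = False) -> tuple:
    """Return (span_name, attributes) for one tools/call."""
    method = "tools/call"
    attributes = {
        "mcp.method.name": method,
        "gen_ai.operation.name": "execute_tool",
        "gen_ai.tool.name": tool_name,
        "mcp.session.id": session_id,
        # pipe for stdio, tcp for HTTP, as described above
        "network.transport": "pipe" if transport == "stdio" else "tcp",
    }
    if capture_payloads:  # opt-in: redact before serializing in real use
        attributes["gen_ai.tool.call.arguments"] = json.dumps(arguments)
        attributes["gen_ai.tool.call.result"] = json.dumps(result)
    return f"{method} {tool_name}", attributes


name, attrs = tool_call_span("get_weather", "sess-1", "stdio")
name_full, attrs_full = tool_call_span(
    "get_weather", "sess-1", "http",
    arguments={"city": "Oslo"}, result={"temp_c": 4},
    capture_payloads=True,
)
```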
Tool Calls as Spans with Arguments and Results
What should a genuinely useful tool-call span carry? In OpenInference shape (the convention Catalyst Tracing emits), a TOOL span records the tool name, the tool call ID, the JSON arguments, and the result, nested under the agent span alongside LLM spans that carry full message history and token counts. The span kind vocabulary is small and fixed: AGENT for the run, LLM for model calls, TOOL for tool executions, plus CHAIN, RETRIEVER, and EMBEDDING for everything else an agent does.
This is the data that answers the questions server logs can't: the arguments exactly as the model produced them, the error message the model saw, and what it generated next.
Propagating Trace Context Across the Client-Server Boundary
To stitch both halves into one distributed trace, the OTel conventions recommend injecting W3C trace context into the MCP request's params._meta field: the client adds traceparent (and optionally tracestate), and an instrumented server extracts it and uses it as the remote parent for its own spans.
// Client injects W3C trace context into params._meta; an instrumented
// server extracts it and uses it as the remote parent span context.
// Note: this propagation format is Development status and may change
// once the MCP spec publishes official guidance.
{
  "jsonrpc": "2.0",
  "id": 42,
  "method": "tools/call",
  "params": {
    "name": "query_database",
    "arguments": {
      "table": "orders",
      "filter": { "status": "refund_requested" }
    },
    "_meta": {
      "traceparent": "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01",
      "tracestate": "vendor=catalyst"
    }
  }
}
With propagation in place, the server's internal spans (database queries, downstream API calls) appear as children of the agent's tool-call span. The conventions note this propagation format is likely to change as MCP standardizes official guidance, so treat it as useful today with a migration expected. If you control both client and server, it's worth wiring now; the debugging payoff is immediate.
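The inject and extract steps can be sketched with the standard library alone. In a real deployment an OpenTelemetry propagator would normally handle this; the function names below are hypothetical and exist only to show the shape of the `traceparent` value and where it lives in the request.

```python
import re
import secrets
from typing import Optional

# W3C traceparent: version-traceid-parentid-flags, all lowercase hex.
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """Generate a version-00 traceparent with random IDs."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def inject_trace_context(request: dict, traceparent: str) -> dict:
    """Client side: return a copy with traceparent in params._meta."""
    params = dict(request.get("params", {}))
    meta = dict(params.get("_meta", {}))
    meta["traceparent"] = traceparent
    params["_meta"] = meta
    return {**request, "params": params}


def extract_trace_context(request: dict) -> Optional[dict]:
    """Server side: parse the remote parent, or None if absent/invalid."""
    value = request.get("params", {}).get("_meta", {}).get("traceparent", "")
    match = TRACEPARENT_RE.match(value)
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None  # values the W3C spec defines as invalid
    return {"trace_id": trace_id, "parent_span_id": parent_id,
            "sampled": bool(int(flags, 16) & 1)}


call = {"jsonrpc": "2.0", "id": 42, "method": "tools/call",
        "params": {"name": "query_database", "arguments": {"table": "orders"}}}
traced = inject_trace_context(
    call, "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01")
parent = extract_trace_context(traced)
```

Returning `None` on a missing or malformed header lets the server fall back to starting a fresh root span instead of failing the tool call.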
Debugging MCP Servers: From Failed Tool Call to Root Cause
Observability data is only as good as the workflow it feeds. For MCP implementations, the effective pattern pairs one local tool with production traces.
MCP Inspector for Local Debugging
MCP Inspector is the official interactive testing tool for MCP servers: a React web UI plus a Node.js proxy that connects to any server over stdio, SSE, or streamable HTTP. It runs via npx with no install, serving its UI at localhost:6274. The Tools panel reads each tool's JSON schema, generates an input form, and shows you the exact JSON response — including notifications your server emits along the way.
Inspector is the right tool for "does this work at all" questions: capability negotiation, schema validity, a tool's happy path, log notification behavior. It is one request at a time, driven by hand, on your machine.
Production Traces for Real Traffic
What Inspector can't answer is "why did 3% of tools/call requests fail yesterday between 2 and 4 PM." That's a traces question, and the loop looks like this:
Figure 2: From Failed Tool Call to Root Cause
Start from the anomaly: an alert on error rate or latency. Find an affected trace and walk the tree: which tool, called by which agent, in which session, with what arguments. Classify the failure. Protocol errors (unknown tool, invalid request shape) point at client or configuration bugs. Tool execution errors with isError: true point at the server or at arguments the server rejected. Repeated malformed arguments across different sessions usually mean the model is misreading your tool's schema. That's a tool-description problem, not a code problem. Then reproduce locally: paste the captured arguments into Inspector, watch the server's behavior up close, fix, and verify the fix in the next window of traces.
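The classification step is mechanical enough to automate. A minimal sketch, assuming you have the raw JSON-RPC response for each call; the category labels are illustrative.

```python
def classify_response(response: dict) -> str:
    """Sort a tools/call response into one of three populations."""
    if "error" in response:
        # JSON-RPC error object: unknown tool, invalid request shape, ...
        return "protocol_error"
    if response.get("result", {}).get("isError") is True:
        # Transport-level success carrying a tool execution failure.
        return "tool_execution_error"
    return "ok"


unknown_tool = {"jsonrpc": "2.0", "id": 3,
                "error": {"code": -32602, "message": "Unknown tool: qery_database"}}
rejected_args = {"jsonrpc": "2.0", "id": 4,
                 "result": {"isError": True,
                            "content": [{"type": "text",
                                         "text": "filter.created_at: expected ISO 8601 timestamp"}]}}
success = {"jsonrpc": "2.0", "id": 5,
           "result": {"content": [{"type": "text", "text": "12 rows"}]}}

labels = [classify_response(r) for r in (unknown_tool, rejected_args, success)]
```

Only the first of these three would show up as a failure to anything watching for JSON-RPC errors, which is why the second population needs its own counter.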
The two tools cover each other's blind spots. Inspector gives you control and visibility into one call; traces give you the statistical picture and the real arguments models actually generate, which are reliably weirder than anything you'd type into a form.
Instrumenting an MCP-Using Agent with Catalyst Tracing
Here's what the agent half looks like in practice with Catalyst Tracing. Install the SDK (@inference/tracing for TypeScript, inference-catalyst-tracing for Python), point three environment variables at Catalyst (CATALYST_OTLP_ENDPOINT, CATALYST_OTLP_TOKEN, CATALYST_SERVICE_NAME), and call setup() once before your clients are constructed. The tracing SDK quickstart walks through the full setup.
If your agent runs on a supported framework, that's most of the job. Catalyst ships trace integrations for the Claude Agent SDK, OpenAI Agents SDK, LangGraph, LangChain, Pydantic AI, the Vercel AI SDK, and other documented runtimes; the OpenAI Agents integration, for instance, captures agent runs, tools, and handoffs automatically. For the Claude Agent SDK (an MCP host runtime), each query loop lands as an AGENT span; in TypeScript you wrap query() explicitly:
import { query } from "@anthropic-ai/claude-agent-sdk";
import { setup, wrapClaudeAgentSdkQuery } from "@inference/tracing";

const tracing = await setup({ serviceName: "mcp-host-agent" });
const tracedQuery = wrapClaudeAgentSdkQuery(query);
const stream = tracedQuery({
  prompt: "Count the number of files matching *.md under the current directory tree. Reply with just the integer.",
  options: {
    maxTurns: 4,
    allowedTools: ["Bash"],
    permissionMode: "bypassPermissions",
  },
});
for await (const message of stream) {
  console.log(message);
}
await tracing.shutdown();
If you've built a custom agent loop that calls MCP servers directly, MCP isn't a patchable SDK, so you author the tool spans yourself with manual spans. Wrap the run in an agent_span carrying your agent and session identity, and wrap each MCP tool execution in a TOOL span with the tool name, call ID, arguments, and result:
from inference_catalyst_tracing import SpanKindValues, agent_span, manual_span, setup

tracing = setup()

def execute_mcp_tool(name: str, args: dict, call_id: str) -> dict:
    """Wrap one MCP tools/call execution in a TOOL span.

    `name`, `args`, and `call_id` come straight from the model's tool call;
    `call_mcp_tool` is your MCP client's tools/call round trip.
    """
    with manual_span(
        tracing.tracer,
        name=f"{name}.tool",
        span_kind=SpanKindValues.TOOL,
        tool_name=name,
        tool_call_id=call_id,
        input=args,
    ) as span:
        result = call_mcp_tool(name, args)  # your MCP client session
        span.set_output(result)
        return result

# Wrap the whole run so TOOL spans nest under one AGENT span
with agent_span(
    tracing.tracer,
    agent_id="support-agent",
    agent_name="Support Agent",
    span_name="support-agent.run",
    session_id="conversation-ticket-123",
    user_id="user_8675309",
    agent_role="support",
) as span:
    span.set_input({"ticket": "ticket_123"})
    result = execute_mcp_tool(
        "query_database",
        {"table": "orders", "filter": {"status": "refund_requested"}},
        "call_001",
    )
    span.set_output(result)
Either way, the dashboard shows the same thing: a trace tree with the agent run at the root, LLM spans with full message history and token usage, and TOOL spans with arguments and results, grouped into sessions by the agent identity you set, with run counts, error rate, latency, and cost rolled up per agent. When you'd rather not leave the terminal, inf trace get <trace-id> --view tree renders the same trace from the CLI.
Monitoring MCP Servers in Production
With both halves instrumented, MCP server monitoring becomes a question of which signals to watch. The useful set is narrower and more specific than a generic service dashboard.
Metrics That Matter
| Signal | What it catches | Where it lives |
|---|---|---|
| Protocol error rate | Client/config bugs | Server logs, traces |
| Tool error rate (isError) | Failing executions | Server logs, traces |
| Per-tool latency | Slow tools | Server logs, traces |
| Call volume anomalies | Loops, abuse | Server metrics |
| Malformed-argument rate | Schema confusion | Server logs, traces |
| Session churn (HTTP) | Reconnect storms | Server logs |
| Token/cost per session | Runaway agents | Agent traces |
The split between the first two rows is the part teams miss. Protocol-level errors and tool execution errors are different populations with different causes, and the spec's guidance to surface validation failures as isError: true results means your most common failures look like successes to transport-level monitoring. Count them separately, per tool.
Malformed-argument rate deserves its own line because it's the signal that catches agents misusing tools. A server that rejects 15 differently-shaped filters for the same tool isn't broken — it has a schema or description the model can't follow. For broader context on monitoring LLM systems beyond MCP, see our guide to LLM observability.
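Counting the populations separately, per tool, takes very little code once each call has been labeled. The record shape below (`tool`, `outcome`, `malformed_arguments`) is an assumption for illustration; adapt it to whatever your logs or spans actually carry.

```python
from collections import defaultdict


def per_tool_rates(records: list) -> dict:
    """Aggregate protocol, tool-execution, and malformed-argument rates."""
    totals = defaultdict(lambda: {"calls": 0, "protocol_error": 0,
                                  "tool_execution_error": 0,
                                  "malformed_arguments": 0})
    for record in records:
        row = totals[record["tool"]]
        row["calls"] += 1
        if record["outcome"] in ("protocol_error", "tool_execution_error"):
            row[record["outcome"]] += 1
        if record.get("malformed_arguments"):
            row["malformed_arguments"] += 1
    return {
        tool: {
            "calls": row["calls"],
            "protocol_error_rate": row["protocol_error"] / row["calls"],
            "tool_error_rate": row["tool_execution_error"] / row["calls"],
            "malformed_argument_rate": row["malformed_arguments"] / row["calls"],
        }
        for tool, row in totals.items()
    }


rates = per_tool_rates([
    {"tool": "query_database", "outcome": "ok"},
    {"tool": "query_database", "outcome": "protocol_error"},
    {"tool": "query_database", "outcome": "tool_execution_error", "malformed_arguments": True},
    {"tool": "query_database", "outcome": "tool_execution_error", "malformed_arguments": True},
    {"tool": "get_weather", "outcome": "ok"},
])
```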
Handling Sensitive Tool Payloads
Production monitoring means retaining payloads, and payloads are where the risk lives. The ground rules come from both specs: MCP prohibits credentials, PII, and internal system details in log messages, and OpenTelemetry makes tool arguments and results opt-in attributes precisely because they're sensitive.
In practice: strip or hash known secret fields before they leave the process, set retention windows that match your compliance posture rather than defaulting to forever, and prefer capturing full payloads on the agent side, where access control already governs who can see traces, over scattering them across server logs with looser access.
From Traces to Fixes: Detecting Recurring Tool Failures
Everything so far helps you debug one failure. The bigger payoff is finding the failures you don't know about: the patterns that only exist across runs. A server that's flaky for one in fifty calls, a tool whose schema quietly drifted from its description, an agent that calls a search tool four times with near-identical queries in every session: no single trace looks alarming, and the pattern is invisible until you analyze traffic in aggregate.
This is what Halo does. It's an open-source engine that reads OpenTelemetry-compatible spans, decomposes traces across many runs to find systemic failure modes, and returns a ranked list of findings, each citing the specific trace IDs that prove it, so you can click from finding to evidence. You can run it self-hosted (pip install halo-engine, pointed at a JSONL trace file) or hosted inside Catalyst, where the Agents tab's Analysis view runs the same engine on traces you've already collected. For production agents, the higher-leverage move is scheduling it: hourly, daily, weekly, or monthly reviews of recent traffic, with any single window capped at 30 days.
The prompts that work best for MCP traffic are specific: "Which tool calls return empty results most often?" "Find repeated malformed-argument failures and group them by tool." "Where does the agent retry a tool with the same arguments that just failed?"
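To give a sense of what the last question means in data terms, here is a hand-rolled sketch of the pattern. It is not how Halo works internally; it assumes a flat list of tool spans in time order with hypothetical `session_id`, `tool`, `arguments`, and `is_error` fields.

```python
import json
from collections import Counter


def find_identical_retries(spans: list) -> list:
    """Count calls that repeat the arguments of the failure just before them."""
    last_failed = {}  # session_id -> (tool, canonical args) of a failed last call
    retries = Counter()
    for span in spans:
        key = (span["tool"], json.dumps(span["arguments"], sort_keys=True))
        session = span["session_id"]
        if last_failed.get(session) == key:
            retries[span["tool"]] += 1
        if span["is_error"]:
            last_failed[session] = key
        else:
            last_failed.pop(session, None)
    return retries.most_common()


findings = find_identical_retries([
    {"session_id": "s1", "tool": "query_database", "arguments": {"a": 1}, "is_error": True},
    {"session_id": "s1", "tool": "query_database", "arguments": {"a": 1}, "is_error": True},
    {"session_id": "s1", "tool": "query_database", "arguments": {"a": 1}, "is_error": False},
    {"session_id": "s2", "tool": "search", "arguments": {"q": "x"}, "is_error": True},
    {"session_id": "s2", "tool": "search", "arguments": {"q": "y"}, "is_error": False},
])
```

Serializing arguments with `sort_keys=True` makes two calls compare equal even when the model emits the same keys in a different order.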
There's a satisfying loop available here, too. Inference ships its own hosted MCP server at mcp.inference.net/mcp, which exposes traces, spans, and Halo reports as MCP tools — so your coding assistant can call get_halo_conversation, pull the latest findings, and apply the suggested fixes without leaving the editor. Your MCP observability data, delivered back through MCP. The Inference MCP server docs cover the two-minute client setup.
Finding a failure in one trace is debugging. Finding the same failure across two hundred traces, ranked by impact, with the evidence attached, is how tool reliability actually improves.
Minimum Viable MCP Observability: A Reference Checklist
If you're standing this up from scratch, here's the floor: the setup that lets you answer "what happened" for any tool call in production.
| Layer | Item | Why |
|---|---|---|
| Server | Declare logging capability | Spec-compliant client logging |
| Server | Structured data payloads | Machine-parseable entries |
| Server | Strip secrets from logs | Spec requirement |
| Server | stderr / app-log durability | Notifications evaporate |
| Server | Audit every tools/call | Compliance trail |
| Agent | Tracing SDK setup() | Every run traced |
| Agent | TOOL spans with payloads | The missing "why" |
| Agent | Session + agent identity | Grouping, rollups |
| Ops | Split error populations | isError hides in 200s |
| Ops | Retention + access controls | Sensitive payloads |
| Ops | Scheduled cross-run analysis | Recurring failures surface |
Everything on this list traces back to a section above. The server rows keep you spec-compliant and auditable; the agent rows give you the "why" no server can see; the operations rows turn the data into fixes.
Conclusion
MCP observability splits cleanly in two. Server-side, the spec gives you a real logging utility (capability negotiation, eight RFC 5424 levels, structured notifications/message payloads), plus stderr and your normal logging stack for durability, and hard rules about what never goes in a log. Agent-side, tracing captures what servers structurally cannot: the arguments as the model produced them, the reasoning around the call, and everything the agent did next. Trace-context propagation through params._meta joins the halves, and cross-run analysis turns the accumulated record into ranked, evidence-backed fixes.
The agent side is the half most teams are missing, and it's also the fastest to add: install the tracing SDK, call setup() before your clients, and your next agent run arrives as a full trace tree with every tool call, argument, and token count attached.
Related Reading
- What is Model Context Protocol? — the protocol fundamentals beneath this guide
- LLM Observability: Monitoring Production Deployments — the wider observability stack around MCP
- LLM Evaluation Tools Comparison — choosing the evaluation and monitoring tooling layer