    Jun 14, 2026

    AI Agent Monitoring: Metrics, Traces & Failure Modes

    Inference Research

    What AI Agent Monitoring Means in Production

    Picture a support agent that takes a refund request, loops through forty LLM calls, burns several dollars of tokens, and finally tells the customer something wrong. Every one of those requests returned HTTP 200. Your APM dashboard is green. Nothing paged.

    That gap is why AI agent monitoring exists as its own discipline. AI agent monitoring is the continuous measurement of a production agent's latency, cost, tool-call health, loop behavior, and output quality, using the agent's own execution traces as the primary data source. The trace is the raw material: full message history, tool arguments, token counts, session grouping. Every metric worth having is derived from it.

    This guide is the operational version of that idea. You'll get the core metric set and the exact trace signals behind each number, a deep look at tool-call monitoring, a catalog of agent failure modes grounded in published research, a way to catch quality regressions that metrics miss, and a day-one alerting baseline you can actually set this week.

    Read time: 11 minutes


    Why Traditional APM Falls Short for AI Agents

    Traditional APM rests on three assumptions: execution is deterministic, so the same input takes the same path; the request is the unit of analysis; and failure shows up as a non-200 status, an exception, or a timeout. Agents break all three.

    First, agents are non-deterministic. The same input produces a different execution tree on every run: more steps or fewer, different tool choices, sometimes a different answer. There is no single "golden path" to baseline against, which is why request-centric dashboards struggle to say anything useful about an agent.

    Second, the unit of analysis is wrong. An agent run is a session spanning many LLM calls and tool calls, often across minutes. Per-request latency and error rate tell you almost nothing about whether the session accomplished anything. LLM monitoring at the individual-call level is necessary but not sufficient; you need session-level rollups.

    Third, and most important: agent failures return HTTP 200. A hallucinated tool argument, a runaway loop, a confidently wrong answer: all of these look like successes to status-code monitoring. The model responded, the HTTP layer was healthy, and the task still failed.

    Finally, cost is a first-class production signal for agents. Token spend per session can vary by an order of magnitude between a clean run and a retry-amplified mess, and classic APM has no native concept of it.

    What carries over from APM is the discipline: percentiles instead of means, error budgets, alert hygiene. What changes is the data source. For agents, everything worth measuring comes out of the execution trace.

    The Core Metric Set for AI Agent Monitoring

    Strip away vendor framing and there are six families of agent metrics that matter. Each one maps to specific fields in the trace, which is what makes them measurable rather than aspirational. If you want the deeper anatomy of those fields, see our field-by-field guide to LLM tracing.

    | Metric | What it tells you | Trace signal | Watch for |
    |---|---|---|---|
    | Per-step latency | Which step regressed | LLM/TOOL span duration | p95 by step |
    | End-to-end latency | What users feel | AGENT span duration | p95 by agent |
    | Cost per session | Spend drift | Token counts × pricing, by session.id | Outlier sessions |
    | Tool-call error rate | Tool health | TOOL span status, by tool.name | Step-changes |
    | Loop counters | Runaway runs | agent.llm_call_count, agent.tool_call_count | Outlier runs |
    | Task completion rate | Did it work | Outcome labels on spans | Downward trend |
    | Finish-reason mix | Truncation | llm.finish_reason | Rising length share |

    Token-count and cache attributes, tool attributes, and finish reason are emitted automatically by the instrumented SDKs; outcome labels come from LLM-judge classification (section 6).

    A few notes on how to read these in practice.

    Latency needs two altitudes. Per-step latency (the duration of individual LLM and tool spans) tells you which step regressed; end-to-end latency (the duration of the outer agent span) tells you what the user felt. Track p50 and p95 for both, segmented by model and by step. An aggregate p95 across all steps will hide a slow retriever behind fast LLM calls.
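
As a minimal sketch of the two-altitude view, the snippet below groups span durations by step and reports p50 and p95 per group, then compares that with one aggregate p95. The span records are plain dicts with hypothetical `step` and `duration_ms` fields standing in for whatever your trace export provides, and nearest-rank is just one reasonable percentile definition.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def latency_by_step(spans):
    """Group span durations by step name and return {step: (p50, p95)}."""
    durations = defaultdict(list)
    for span in spans:
        durations[span["step"]].append(span["duration_ms"])
    return {step: (percentile(d, 50), percentile(d, 95)) for step, d in durations.items()}


# Hypothetical run history: many fast LLM calls, a few slow retriever calls.
spans = [{"step": "llm.plan", "duration_ms": 300}] * 40
spans += [{"step": "tool.retrieve", "duration_ms": 4000}] * 2

per_step = latency_by_step(spans)
aggregate_p95 = percentile([s["duration_ms"] for s in spans], 95)
```

With this sample data the aggregate p95 stays at the LLM latency, while the per-step view puts the 4-second retriever in plain sight.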

    Cost only makes sense per session. Token counts live on each LLM span: prompt, completion, and total, plus cache read/write detail when the provider reports it. Grouped by session ID and multiplied by model pricing, they become cost per session, which is the number that actually drifts when something goes wrong. A rising cache-write share with flat cache reads is an early sign your prompts stopped reusing context.
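
The rollup itself is small. Here is a rough sketch, assuming LLM spans exported as dicts with hypothetical `session_id`, `model`, `prompt_tokens`, and `completion_tokens` fields; the prices are placeholders, not any provider's real rates.

```python
from collections import defaultdict

# Hypothetical per-million-token prices in USD; substitute your provider's real rates.
PRICING = {"example-model": {"prompt": 1.00, "completion": 4.00}}


def cost_per_session(llm_spans, pricing=PRICING):
    """Sum token cost for each session from per-span prompt and completion counts."""
    totals = defaultdict(float)
    for span in llm_spans:
        rates = pricing[span["model"]]
        cost = (
            span["prompt_tokens"] * rates["prompt"]
            + span["completion_tokens"] * rates["completion"]
        ) / 1_000_000
        totals[span["session_id"]] += cost
    return dict(totals)


llm_spans = [
    {"session_id": "s-1", "model": "example-model", "prompt_tokens": 1_000, "completion_tokens": 500},
    {"session_id": "s-1", "model": "example-model", "prompt_tokens": 2_000, "completion_tokens": 500},
    {"session_id": "s-2", "model": "example-model", "prompt_tokens": 50_000, "completion_tokens": 10_000},
]

costs = cost_per_session(llm_spans)
```

Cache read and write tokens are left out to keep the sketch short; if your provider bills them at different rates, they would be extra terms in the same sum.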

    Loop counters are your runaway detector. Per-run counters for LLM calls and tool calls (agent.llm_call_count, agent.tool_call_count) turn "the agent seems stuck" into a thresholdable number.

    Completion rate requires labels, not status codes. Whether a task was completed, partial, failed, or abandoned is a judgment about the output, not the transport. Section six covers how to produce those labels on live traffic.

    Finish reasons are an underrated canary. LLM spans record the provider's finish reason. A rising share of length finishes means outputs are being truncated, usually a context problem upstream of any visible quality complaint.
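
Turning the finish-reason mix into a number takes a few lines. This sketch assumes each LLM span is available as a dict with a hypothetical `finish_reason` field holding the provider's value.

```python
from collections import Counter


def finish_reason_share(llm_spans, reason="length"):
    """Fraction of LLM spans whose finish reason equals `reason` (0.0 if there are no spans)."""
    counts = Counter(span["finish_reason"] for span in llm_spans)
    total = sum(counts.values())
    return counts[reason] / total if total else 0.0


# Hypothetical window: 18 normal completions, 2 truncated ones.
window = [{"finish_reason": "stop"}] * 18 + [{"finish_reason": "length"}] * 2
length_share = finish_reason_share(window)
```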

    Traces Are the Foundation of LLM Monitoring and Observability

    Every metric above is a query over trace data. That ordering matters: teams that start with dashboards and bolt on tracing later end up with charts they can't drill into. Instrument first, and the metrics fall out.

    What a Complete Agent Trace Contains

    A useful agent trace has three layers. LLM spans capture each model call with full input and output message history, the model name, invocation parameters, finish reason, and token counts. Tool spans capture each tool invocation with its name, call ID, and arguments. And an outer agent span groups the whole run under a stable agent ID and session ID, which is what lets you compute anything per-session.

    Figure 1: Anatomy of an Agent Trace — the outer AGENT span carries identity and end-to-end duration; LLM spans carry message history, token counts, and finish reasons; TOOL spans carry the tool name and arguments, linked to the requesting LLM span by tool_call.id.

    On the standards side, OpenTelemetry's GenAI semantic conventions define agent invocation spans, per-call chat spans, and tool execution spans, along with latency and token-usage metrics. They're still under active development, though, so pin your instrumentation versions. The OpenInference conventions, maintained by Arize, are the more established attribute schema for LLM and agent spans. Catalyst's tracing SDKs emit OpenInference attributes byte-identically, so any OpenInference-aware viewer renders the spans without configuration.

    Instrumenting an Agent in a Few Lines

    Here's the full setup for a Python agent using OpenAI, from install to grouped spans. The same pattern covers Anthropic, LangChain, LangGraph, OpenAI Agents, Pydantic AI, and the other supported integrations via install extras.

    # Install:  pip install 'inference-catalyst-tracing[openai]'
    #
    # Configure export before the app starts:
    #   export CATALYST_OTLP_ENDPOINT="https://telemetry.inference.net"
    #   export CATALYST_OTLP_TOKEN="<your-token>"      # from the dashboard API keys page
    #   export CATALYST_SERVICE_NAME="support-agent"
    
    import os
    
    from inference_catalyst_tracing import agent_span, setup
    from openai import OpenAI
    
    tracing = setup()  # call before constructing clients; auto-detects installed SDKs
    client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    
    with agent_span(
        tracing.tracer,
        agent_id="support-agent",
        agent_name="Support Agent",
        session_id="session-4271",
    ) as span:
        span.set_input("Customer asks: where is my refund for order #1842?")
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": "Where is my refund for order #1842?"}],
            max_tokens=16,
        )
        span.set_output(response.choices[0].message.content or "")
    
    tracing.shutdown()  # short-lived script: flush batched spans before exit

    Three things to notice. setup() runs before the OpenAI client is constructed and auto-detects installed packages, so the SDK patches the client and LLM spans appear with no further code. The export configuration is three environment variables pointing at the telemetry endpoint. And the agent_span wrapper is what attaches agent.id, agent.name, and session.id to everything inside it. Without it, you get orphan LLM spans and no session grouping. The tracing quickstart walks through the same flow for TypeScript, and the agent identity guide covers choosing stable IDs so runs group correctly.

    One operational gotcha: spans are batched and exported in the background, so short-lived processes must flush before exit, and serverless runtimes need a per-invocation flush rather than a shutdown. Silent span loss looks exactly like "the agent never ran," and it's the first thing to rule out when traces go missing.

    Monitoring Tool Calls

    Tool calls are where agents touch the real world, and they're the most under-monitored part of the loop. Most guides give them a bullet point. They deserve their own alert rules.

    Malformed Arguments and Schema Drift

    There are two distinct failure shapes here, and they need different responses.

    Malformed arguments are a model problem: the LLM emits tool arguments that fail JSON parsing or schema validation. The evidence is in the trace, since LLM spans record each requested tool call, including the function name and the raw argument string. Track validation failures as a rate per tool, because a single tool with a complex schema usually dominates.

    Schema drift is a deployment problem: the tool's schema changed but the prompt still describes the old shape, or vice versa. The signature is a step-change in one tool's validation failure rate immediately after a deploy, while every other tool stays flat. The fix is in your prompt or schema, not the model. The span attribute reference lists the exact wire keys if you're building these queries yourself.
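
To make the two failure shapes concrete, here is a minimal sketch that validates raw argument strings and computes a failure rate per tool, so a before/after-deploy comparison exposes the step-change. The schemas, tool names, and sample calls are all hypothetical, and the type check is a stand-in for a real JSON Schema validator.

```python
import json
from collections import defaultdict

# Hypothetical schemas: required argument names and their expected types per tool.
SCHEMAS = {
    "lookup_order": {"order_id": str},
    "issue_refund": {"order_id": str, "amount_cents": int},
}


def is_valid(tool_name, raw_args):
    """True if raw_args parses as a JSON object containing the tool's required fields."""
    try:
        args = json.loads(raw_args)
    except json.JSONDecodeError:
        return False
    if not isinstance(args, dict):
        return False
    return all(isinstance(args.get(key), typ) for key, typ in SCHEMAS[tool_name].items())


def failure_rate_by_tool(calls):
    """calls: iterable of (tool_name, raw_args) pairs. Returns {tool: failure rate}."""
    seen, failed = defaultdict(int), defaultdict(int)
    for tool_name, raw_args in calls:
        seen[tool_name] += 1
        if not is_valid(tool_name, raw_args):
            failed[tool_name] += 1
    return {tool: failed[tool] / seen[tool] for tool in seen}


before_deploy = [
    ("lookup_order", '{"order_id": "1842"}'),
    ("issue_refund", '{"order_id": "1842", "amount_cents": 500}'),
]
after_deploy = [
    ("lookup_order", '{"order_id": "1842"}'),
    ("issue_refund", '{"order_id": "1842", "amount": 5.0}'),  # prompt still describes the old field
    ("issue_refund", '{"order_id": "1842"'),                   # truncated JSON
]

rates_before = failure_rate_by_tool(before_deploy)
rates_after = failure_rate_by_tool(after_deploy)
```

One tool jumping while the other stays flat is the drift signature described above.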

    Timeouts get misread, too. A slow tool shows up as a long tool-span duration on one step. A stuck loop shows up as many tool spans of normal duration. Conflating them sends you to the wrong fix: one is a dependency problem, the other is a prompt or termination problem.
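
A rough way to keep the two apart is to look at the shape of a run's tool spans rather than its total duration. The thresholds below are illustrative assumptions, not recommendations, and `duration_ms` is a hypothetical field name.

```python
def classify_tool_activity(tool_spans, slow_ms=5_000, max_calls=10):
    """Label one run's tool spans as 'slow_tool', 'stuck_loop', 'both', or 'ok'."""
    has_slow_call = any(span["duration_ms"] > slow_ms for span in tool_spans)
    too_many_calls = len(tool_spans) > max_calls
    if has_slow_call and too_many_calls:
        return "both"
    if has_slow_call:
        return "slow_tool"
    if too_many_calls:
        return "stuck_loop"
    return "ok"


# Hypothetical runs: one long call versus many calls of normal duration.
slow_run = [{"duration_ms": 200}, {"duration_ms": 42_000}]
loop_run = [{"duration_ms": 180}] * 25

slow_label = classify_tool_activity(slow_run)
loop_label = classify_tool_activity(loop_run)
```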

    Attributing Failures to the Right Step

    The question that matters during an incident is: did the model produce bad arguments, or did the tool itself fail? The trace answers it. The tool call ID links the LLM span that requested the call to the tool span that executed it, so you can walk from the model's request to the execution result and see which side broke.
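
That walk can be sketched as a join on the call ID. The record shapes here are assumptions for illustration: `requested_calls` stands for tool calls pulled from LLM spans with a precomputed `args_valid` flag, and `tool_spans` for executed TOOL spans with a `status` field.

```python
def attribute_failures(requested_calls, tool_spans):
    """Join requested calls to executed tool spans by call ID and name the failing side."""
    executed = {span["tool_call_id"]: span for span in tool_spans}
    verdicts = {}
    for call in requested_calls:
        span = executed.get(call["id"])
        if not call["args_valid"]:
            verdicts[call["id"]] = "model: bad arguments"
        elif span is None:
            verdicts[call["id"]] = "unknown: no tool span recorded"
        elif span["status"] == "error":
            verdicts[call["id"]] = "tool: execution failed"
        else:
            verdicts[call["id"]] = "ok"
    return verdicts


requested_calls = [
    {"id": "call_1", "tool": "lookup_order", "args_valid": True},
    {"id": "call_2", "tool": "issue_refund", "args_valid": False},
    {"id": "call_3", "tool": "lookup_order", "args_valid": True},
]
tool_spans = [
    {"tool_call_id": "call_1", "status": "error"},
    {"tool_call_id": "call_2", "status": "error"},
]

verdicts = attribute_failures(requested_calls, tool_spans)
```

The "unknown" verdict is worth keeping as its own bucket, since a missing tool span usually means the execution side was never instrumented.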

    For that link to exist, in-process tool execution has to be wrapped in a span. Auto-instrumentation captures the model's side automatically; the execution side of custom tools is yours to wrap with manual spans.

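    from typing import Callable
    
    from inference_catalyst_tracing import SpanKindValues, manual_span, setup
    
    # Hypothetical registry mapping tool names to callables; populate it with your own tools.
    TOOLS: dict[str, Callable[..., dict]] = {}
    
    tracing = setup()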
    
    def execute_tool(name: str, args: dict, call_id: str) -> dict:
        # Wrapping the execution side means failures, arguments, and duration
        # attach to this step. tool_call_id links this TOOL span back to the
        # LLM span that requested the call.
        with manual_span(
            tracing.tracer,
            name=f"{name}.tool",
            span_kind=SpanKindValues.TOOL,
            tool_name=name,
            tool_call_id=call_id,
            input=args,
        ) as span:
            result = TOOLS[name](**args)
            span.set_output(result)
            return result

    With both sides instrumented, triage becomes a couple of queries instead of log spelunking:

    # Browse recent traces
    inf trace list --range 1h
    
    # Find all TOOL spans for one tool name
    inf span list --kind TOOL --metadata "tool.name=lookup_order"
    
    # Find expensive LLM spans, sorted by cost
    inf span list --kind LLM --filter "cost_total>0.05" --sort cost_total --order desc
    
    # Find traces from one conversation
    inf trace list --metadata "session.id=conversation-ticket-123"
    
    # Open a trace tree or timeline
    inf trace get <trace-id> --view tree
    inf trace get <trace-id> --view timeline

    A Catalog of AI Agent Failure Modes

    Naming failure modes is half of monitoring them; an alert is just a named failure mode with a threshold. The best public grounding here is MAST, the Multi-Agent System Failure Taxonomy, built from analysis of 150 execution traces across 7 agent frameworks and since expanded to a dataset of over 1,600 annotated traces. It identifies 14 distinct failure modes in 3 categories: system design issues, inter-agent misalignment, and task verification.

    The frequencies are instructive. Step repetition, where the agent re-tries the same action with minor variations, is the single most common failure mode at 15.7% of observed failures, followed by reasoning-action mismatch at 13.2%.

    Figure 2: Most Frequent Agent Failure Modes — share of failures across 1,642 traces (MAST, arXiv 2503.13657).

    In production monitoring terms, five failure modes account for most pages. Each has a distinct trace signature and a distinct first response:

    | Failure mode | Symptom | Trace evidence | First response |
    |---|---|---|---|
    | Step-repetition loop | Run never ends | High agent.llm_call_count; near-identical tool calls | Step/turn budget |
    | Context overflow | Truncated, drifting output | Prompt tokens climb per step; length finishes | Trim or summarize history |
    | Hallucinated tool inputs | Tool errors spike | Arguments fail validation; nonexistent IDs | Tighten schemas |
    | Cascading retries | Cost and latency spike | Nested retries of one failing step | Single retry budget |
    | Prompt regression | Quality drops post-deploy | Outcome labels shift at a deploy boundary | Roll back; add eval gate |

    Failure-mode framing grounded in the MAST taxonomy (14 modes, 3 categories, from analysis across 7 agent frameworks).

    Two of these deserve a closer look.

    Context overflow is sneaky because it degrades before it breaks. Prompt token counts climb step over step as history accumulates, then outputs start truncating (length finish reasons), and then, if your stack compacts context, instructions silently fall out of the window and the agent starts violating constraints it was given ten steps ago. Watch prompt-token growth per step, not just totals.
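
Watching growth per step can be as simple as differencing the prompt-token counts of one run. The sketch below also extrapolates how many more steps fit in the window at the average growth rate; the linear projection and the sample numbers are assumptions for illustration.

```python
def prompt_token_growth(prompt_tokens):
    """Per-step increase in prompt tokens for one run, given counts in step order."""
    return [later - earlier for earlier, later in zip(prompt_tokens, prompt_tokens[1:])]


def steps_until_limit(prompt_tokens, context_limit):
    """Estimate how many more steps fit before the window fills, using average growth.

    Returns None when there are too few steps or the prompt is not growing.
    """
    growth = prompt_token_growth(prompt_tokens)
    if not growth:
        return None
    average = sum(growth) / len(growth)
    if average <= 0:
        return None
    return max(0, int((context_limit - prompt_tokens[-1]) // average))


# Hypothetical run: prompt tokens recorded on four consecutive LLM spans.
run_prompt_tokens = [1200, 2900, 4700, 6600]

growth = prompt_token_growth(run_prompt_tokens)
remaining_steps = steps_until_limit(run_prompt_tokens, context_limit=16_000)
```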

    Cascading retries are an interaction failure: a flaky tool times out, the step retries, the orchestrator retries the step, and a 2-second hiccup becomes a 90-second, triple-cost run. The signature is cost and latency spiking together while task completion stays flat. The fix is retry budgets at one layer, not every layer.
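
One way to hold retries at a single layer is a budget object shared across the whole run, so every retry anywhere draws from the same allowance. This is a minimal sketch with hypothetical names (`RetryBudget`, `call_with_budget`), not a prescribed design.

```python
class RetryBudget:
    """A single shared allowance of retries for one agent run."""

    def __init__(self, max_retries):
        self.remaining = max_retries

    def try_spend(self):
        """Consume one retry if any remain; return whether the retry is allowed."""
        if self.remaining <= 0:
            return False
        self.remaining -= 1
        return True


def call_with_budget(fn, budget):
    """Call fn, retrying on timeout only while the run-wide budget allows."""
    while True:
        try:
            return fn()
        except TimeoutError:
            if not budget.try_spend():
                raise


# Hypothetical flaky tool: times out twice, then succeeds.
attempts = {"count": 0}


def flaky_lookup():
    attempts["count"] += 1
    if attempts["count"] <= 2:
        raise TimeoutError("tool timed out")
    return {"status": "refunded"}


budget = RetryBudget(max_retries=2)
result = call_with_budget(flaky_lookup, budget)
```

Because the orchestrator and the step share one `budget`, a failing tool can cost at most `max_retries` extra calls per run, no matter how many layers wrap it.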

    From Dashboards to Detection: Scoring Production Traffic

    Everything so far measures the machinery. None of it catches the failure mode that hurts most: the agent gets faster, cheaper, and worse. A prompt tweak that makes answers subtly less accurate moves no infrastructure metric at all.

    Catching that requires judging outputs, not transport. The mechanism is continuous classification: define a plain-language classifier once ("was the task completed?", "did the user get frustrated?") and have an LLM judge run it against a sample of production spans. Labels get written back onto the trace, where they're queryable next to latency and cost. Catalyst calls these signals, and the pattern matters more than the branding: quality becomes a time series, not a vibe.

    Practical configuration notes from the documented feature set: classifiers come in binary and enumerated-label forms (2–10 labels), sampling is deterministic with presets from 10% to 100% so you can control judge cost on high-volume agents, and a task-outcome template with completed/partial/failed/abandoned labels gives you the task completion rate from section two. Test a classifier against recent spans before activating it, and backfill historical windows once you trust the labels.
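
If you are building something similar yourself, deterministic sampling is commonly done by hashing a stable ID into a bucket. The sketch below shows that general technique; it is an assumption about how one might implement it, not a description of Catalyst's internals.

```python
import hashlib


def in_sample(span_id, rate):
    """Deterministically decide whether a span is sampled at `rate` (between 0.0 and 1.0)."""
    digest = hashlib.sha256(span_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # stable bucket in 0..9999
    return bucket < round(rate * 10_000)


# Hypothetical span IDs; the same ID always gets the same decision.
span_ids = [f"span-{i}" for i in range(10_000)]
sampled_10 = {s for s in span_ids if in_sample(s, 0.10)}
sampled_50 = {s for s in span_ids if in_sample(s, 0.50)}
```

A useful property of this scheme is that raising the rate only adds spans: everything judged at 10% is still judged at 50%, so earlier labels stay comparable.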

    Two habits make these labels useful rather than decorative. First, start with one classifier (task outcome) and resist adding five more until the first one has survived contact with real traffic; a mislabeled quality metric is worse than none. Second, treat the label distribution as the alerting surface: a failed-outcome share that doubles week over week is a regression, even when every latency and cost chart looks healthy.
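
As a sketch of alerting on the label distribution, the check below compares the failed-outcome share across two windows. The doubling factor mirrors the example above; the label lists are made-up data.

```python
from collections import Counter


def outcome_share(labels, outcome="failed"):
    """Fraction of labels equal to `outcome` (0.0 for an empty window)."""
    return Counter(labels)[outcome] / len(labels) if labels else 0.0


def share_regressed(previous, current, outcome="failed", factor=2.0):
    """True when the outcome's share grew by at least `factor` between two windows."""
    before = outcome_share(previous, outcome)
    after = outcome_share(current, outcome)
    if before == 0:
        return after > 0
    return after >= factor * before


# Hypothetical task-outcome labels for two consecutive weeks.
last_week = ["completed"] * 90 + ["partial"] * 6 + ["failed"] * 4
this_week = ["completed"] * 84 + ["partial"] * 7 + ["failed"] * 9

regressed = share_regressed(last_week, this_week)
```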

    Online classification pairs with offline evals rather than replacing them: labels catch regressions on live traffic, and a curated eval set gates changes before deploy. Our guide to regression testing for LLM apps covers the offline half.

    Closing the Loop: From Monitoring Signals to Systematic Fixes

    An alert tells you something broke at 2 p.m. It doesn't tell you that the same root cause has been quietly degrading a slice of your runs for three weeks. Getting from incident response to systematic improvement means analyzing failures across runs, and that's the step almost no monitoring guide covers.

    This is what Halo was built for. Halo (Hierarchical Agent Loop Optimization) is an open-source engine that reads OpenTelemetry-compatible spans, decomposes them across many runs to find systemic failure modes, and writes up concrete fixes with citations back to the specific trace IDs that exhibit each problem. You can self-host it (pip install halo-engine and point it at a JSONL trace file) or run the hosted version inside the Catalyst Agents dashboard against traces you've already collected.

    The high-leverage move for production agents is scheduling it. Hosted Halo runs hourly, daily, weekly, or monthly, with a lookback window that defaults to the cadence and caps at 30 days. A daily run over the last 24 hours of traces, reviewed like you'd review error-budget burn, turns failure analysis from an incident activity into a habit.

    Figure 3: The Agent Monitoring Loop — traces feed metrics and quality labels, alerts catch acute breakage, scheduled cross-run analysis ranks recurring failure modes with trace-ID citations, and each fix is verified against the next window of traffic.

    The loop closes when findings change something: a prompt edit, a tool schema fix, a retry budget. Ship the change, capture a fresh window, and check whether the failure mode disappears from the next report. Findings that cite trace IDs make this verifiable, because you can open the exact runs that motivated the fix.

    If you're already collecting traces, this is the shortest path from "we have dashboards" to "we know our top three failure modes and they're shrinking."


    A Practical Alerting Baseline

    Nobody publishes starting thresholds, so here are ours. To be clear about what these are: recommended starting points based on the failure modes above, not industry standards. Your traffic will recalibrate them. The point is to have tripwires on day one instead of waiting two weeks for "enough data."

    | Signal | Day-one alert | Tune after two weeks |
    |---|---|---|
    | Tool-call error rate | >5% over 15 min | Per-tool baselines |
    | End-to-end latency | p95 > 2× expected | Percentile from your p95 |
    | Cost per session | >5× median estimate | Outlier detection on actuals |
    | LLM calls per run | >3× designed steps | Per-agent run budget |
    | length finish share | >5% of LLM calls | Trend alert on share |
    | Failed-outcome share | — (no labels yet) | Alert once labels trend |

    These are recommended starting points, not industry standards — replace the static values with thresholds computed from your own two-week distributions.

    Day-one rules are deliberately coarse and absolute, because you have no baseline yet. After two weeks of traffic, switch the latency and cost alerts to percentile-based thresholds computed from your own distributions, set per-tool error-rate baselines instead of a global one, and add quality alerts on outcome-label shares once your classifiers have accumulated enough labels to trend.
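
The day-one rules are simple enough to express as a handful of comparisons. This sketch encodes the static thresholds from the table; the `WindowStats` fields are hypothetical, and how you aggregate each signal over the alert window is a choice the table leaves open.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Aggregated signals for one alert window (field names are illustrative)."""
    tool_error_rate: float        # fraction of TOOL spans that errored in the last 15 min
    p95_latency_s: float          # observed end-to-end p95
    expected_latency_s: float     # your own expectation before a baseline exists
    cost_per_session: float       # representative session cost for the window
    median_cost_estimate: float   # your up-front estimate of a typical session
    llm_calls_per_run: int
    designed_steps: int
    length_finish_share: float    # fraction of LLM calls finishing with "length"


def day_one_alerts(stats):
    """Return the names of the day-one rules that fire for this window."""
    rules = {
        "tool_call_error_rate": stats.tool_error_rate > 0.05,
        "end_to_end_latency": stats.p95_latency_s > 2 * stats.expected_latency_s,
        "cost_per_session": stats.cost_per_session > 5 * stats.median_cost_estimate,
        "llm_calls_per_run": stats.llm_calls_per_run > 3 * stats.designed_steps,
        "length_finish_share": stats.length_finish_share > 0.05,
    }
    return [name for name, fired in rules.items() if fired]


stats = WindowStats(
    tool_error_rate=0.08,
    p95_latency_s=12.0,
    expected_latency_s=5.0,
    cost_per_session=0.10,
    median_cost_estimate=0.03,
    llm_calls_per_run=9,
    designed_steps=4,
    length_finish_share=0.02,
)

fired = day_one_alerts(stats)
```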

    Just as important is what not to alert on: single-run anomalies (one expensive session is an investigation, not a page), mean latency (it hides every tail problem agents actually have), and raw token counts without session grouping (volume changes look like incidents). Alert fatigue kills agent monitoring faster than missing signals do, because every false page erodes trust in the genuinely weird failures this guide is about.

    Start With the Trace

    AI agent monitoring has an order of operations: instrument traces first, derive the metric set from them, put dedicated eyes on tool calls, name your failure modes, classify quality on live traffic, and schedule cross-run analysis so findings turn into fixes. Each step builds on the trace data the first one captures, which is why instrumentation is the only genuinely blocking step.

    It's also the fastest one. The tracing SDK is a package install and a setup() call before your clients, and the full trace tree of agent, LLM, and tool spans starts streaming immediately, token counts and cost included.

