    Jun 17, 2026

    AI Agent Evaluation: How to Score Real Agent Runs

    Inference Research

    The Evaluation That Matters Happens on Your Own Traffic

    Your agent clears 70% on a public benchmark, the demo goes well, and the rollout still turns into a support-ticket generator. Nothing about that sequence is surprising. Benchmark scores measure capability on someone else's tasks, with someone else's tools and someone else's users. They say very little about what your agent did for the customer who asked for a refund at 2 a.m.

    AI agent evaluation is the practice of scoring agent runs against explicit criteria: the final output, the tools the agent selected, the arguments it passed, and the path it took to get there. It runs in two regimes: offline, against curated datasets, and online, against live production traffic. The goal is a number you can trust attached to behavior you can inspect.

    This guide skips the "what is an agent" recap and walks the hands-on path: what you can actually score in a run, the traces you need to capture first, how to write rubrics an LLM judge can score reliably, when to use offline versus online evals, and how to turn scores into fixes instead of dashboards. The frame for all of it is a loop: trace, score, find the recurring failures, fix them, re-score.

    Why Benchmark Scores Don't Predict Production Behavior

    Start with the most cited number in agent reliability. On τ-bench, Sierra's tool-agent-user benchmark, the best-performing GPT-4o function-calling agent scores above 60% average task success on a single attempt. Ask it to succeed at the same task eight times in a row (the pass^8 metric) and the rate collapses to under 25%. Same model, same tasks, same harness. The only thing that changed was asking for consistency.

    The arithmetic behind that collapse is unforgiving. If your agent succeeds at a task 90% of the time and attempts are independent, the probability it succeeds eight consecutive times is 0.9^8, about 43%. A score that looks like an A on one run is worse than a coin flip across eight. Production is the retry regime: the same class of request arrives hundreds of times, and users remember the failures, not the average.
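
    The same arithmetic as a few lines of Python, as a minimal sketch. The function name pass_hat_k is ours, and the model assumes every attempt is independent with the same success rate, which real traffic only approximates.

```python
def pass_hat_k(p: float, k: int) -> float:
    """Chance of k successes in a row, assuming each attempt is independent."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if k < 1:
        raise ValueError("k must be at least 1")
    return p ** k


# Single-run success of 90%, checked at a few values of k.
for k in (1, 2, 4, 8):
    print(f"pass^{k} = {pass_hat_k(0.9, k):.2f}")
```

    At k = 8 the result is about 0.43, the figure quoted above.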

    Figure 1: Reliability decays exponentially across consecutive runs. pass^k for an agent with 90% single-run success (pass^k = 0.9^k)

    Benchmarks have a second problem: some of them leak. The Agentic Benchmark Checklist study audited major agent benchmarks for validity and found that a do-nothing agent, one that returns immediately without taking a single action, scores 38% on the airline partition of τ-bench, because tasks that don't change database state and require no specific output pass by default. A benchmark where doing nothing passes more than a third of the tasks cannot tell you whether your agent works.

    And even a perfectly constructed benchmark measures the wrong distribution. Your agent runs against your tool schemas, your data, your users' phrasing, and your cost of failure. None of that appears in anyone's published test set.

    The conclusion isn't that benchmarks are useless. They're a reasonable first filter for picking a base model. But the evaluation that determines whether your agent ships and stays shipped runs on your own traces. Anthropic's engineering team makes the same point from the other direction: their guidance is to start agent evals from 20 to 50 real failure cases pulled from actual usage, not from a synthetic task suite.

    What You Can Score in an Agent Run

    "Did the agent do a good job?" decomposes into layers, and each layer needs different data from the run and a different kind of scorer. Most teams only score the first layer. The interesting failures live in the other four, which is also where AI agent evaluation metrics get specific enough to act on.

    Layer             | What it measures               | Trace must contain         | Best scorer
    ------------------+--------------------------------+----------------------------+-------------
    Final outcome     | Task completed, grounded       | Full message history       | LLM judge
    Tool selection    | Right tool chosen              | Available tool specs       | LLM judge
    Tool arguments    | Valid IDs, types, enums        | Calls with arguments       | Code checks
    Trajectory        | No loops, wasted steps         | Spans with tokens, latency | Judge + code
    Errors & recovery | Failures surface, retries work | Errors, retries, statuses  | Judge + code

    Final outcome

    The top layer: did the run accomplish the task? For a support agent, was the question answered and grounded in the retrieved policy? For an extraction agent, is the JSON complete and correctly typed? Outcome scoring is where LLM judges earn their keep, because "task completed" is usually a judgment call over the whole conversation rather than a string match.

    Tool selection

    Given the tools available, did the agent pick the right one? Scoring this requires the full tool specification in the trace — every tool the agent could have called, not just the ones it did. Without the available-tool list, you're grading "did the agent call any tool," which is a much weaker question.

    Tool-call arguments

    The agent picked the right tool and passed it garbage: an order ID that doesn't exist, an enum value it invented, a date in the wrong format. Argument correctness is the layer most amenable to deterministic checks (schema validation, ID lookups, type assertions), which run on every span at essentially no cost. Save the judge for the layers that need judgment.
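
    A minimal sketch of what those deterministic checks could look like for a single tool call. Everything here is hypothetical: the argument names (order_id, status, since), the in-memory ID set standing in for a real lookup, and the allowed enum values.

```python
from datetime import date

# Hypothetical stand-ins: a real check would query your order store and tool schema.
KNOWN_ORDER_IDS = {"ABC-123", "XYZ-789"}
ALLOWED_STATUSES = {"shipped", "pending", "refunded"}


def check_arguments(args: dict) -> list[str]:
    """Return a list of problems with one tool call's arguments (empty means valid)."""
    problems = []

    order_id = args.get("order_id")
    if not isinstance(order_id, str):
        problems.append("order_id must be a string")
    elif order_id not in KNOWN_ORDER_IDS:
        problems.append(f"unknown order_id: {order_id}")

    status = args.get("status")
    if status is not None and status not in ALLOWED_STATUSES:
        problems.append(f"invalid status: {status}")

    since = args.get("since")
    if since is not None:
        try:
            date.fromisoformat(since)  # expects an ISO 8601 date such as YYYY-MM-DD
        except (TypeError, ValueError):
            problems.append(f"since is not an ISO date: {since}")

    return problems


good = {"order_id": "ABC-123", "status": "shipped", "since": "2026-03-01"}
bad = {"order_id": "NOPE-000", "status": "teleported", "since": "03/01/2026"}
print(check_arguments(good))
print(check_arguments(bad))
```

    Checks like these need no model call, so they can run on every tool span rather than a sample.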

    Trajectory and step efficiency

    Two runs can both succeed while one takes four steps and the other takes nineteen. Trajectory scoring looks at the path: redundant tool calls repeating identical work, loops where the agent re-plans without new information, steps that burn tokens and latency without advancing the task. These failures never show up in outcome scores because the run eventually succeeds. Your invoice notices anyway.
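
    The code-checkable half of trajectory scoring can be sketched as follows, assuming each tool call is logged as a dict with a name and an arguments mapping. That shape and the max_steps budget are illustrative assumptions; loops where the agent re-plans without new information still need a judge.

```python
import json
from collections import Counter


def trajectory_report(tool_calls: list[dict], max_steps: int = 8) -> dict:
    """Count steps and flag tool calls repeated with identical arguments."""
    # Serialize arguments with sorted keys so key order cannot hide a duplicate.
    keys = [
        (call["name"], json.dumps(call["arguments"], sort_keys=True))
        for call in tool_calls
    ]
    counts = Counter(keys)
    duplicates = {
        f"{name}({args})": n for (name, args), n in counts.items() if n > 1
    }
    return {
        "steps": len(tool_calls),
        "over_budget": len(tool_calls) > max_steps,
        "duplicates": duplicates,
    }


calls = [
    {"name": "lookup_order", "arguments": {"order_id": "ABC-123"}},
    {"name": "search", "arguments": {"q": "refund", "limit": 5}},
    {"name": "search", "arguments": {"limit": 5, "q": "refund"}},  # same call, reordered keys
]
report = trajectory_report(calls)
print(report)
```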

    Errors and recovery

    The layer that separates resilient agents from lucky ones. When a tool call failed, did the error surface, and did the agent retry sensibly or plow ahead on a fabricated result? The canonical trap is the silent failure: a query breaks underneath a span that still reports ok, and the agent confidently builds on nothing. Real trace analysis on a production agent surfaced exactly this — queries failing without bubbling up while span statuses stayed green.
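
    One way to hunt for silent failures in code is to compare a span's reported status with what its output actually contains. The sketch below assumes a hypothetical span shape (a status string plus an output that is either a dict or a JSON string) and a guessed list of error keys; adjust both to whatever your tools really return.

```python
import json

# Assumed markers of a failed tool result; your tools may use different keys.
ERROR_KEYS = ("error", "exception", "traceback")


def is_silent_failure(span: dict) -> bool:
    """True when a span reports ok but its output carries an error payload."""
    if span.get("status") != "ok":
        return False  # the failure is already visible in the status
    output = span.get("output")
    if isinstance(output, str):
        try:
            output = json.loads(output)
        except json.JSONDecodeError:
            return False  # free text: leave it to a judge
    if not isinstance(output, dict):
        return False
    return any(output.get(key) for key in ERROR_KEYS)


spans = [
    {"span_id": "s1", "status": "ok", "output": '{"rows": [1, 2]}'},
    {"span_id": "s2", "status": "ok", "output": '{"error": "query timed out", "rows": []}'},
    {"span_id": "s3", "status": "error", "output": '{"error": "connection refused"}'},
]
flagged = [s["span_id"] for s in spans if is_silent_failure(s)]
print(flagged)
```

    Only s2 is flagged: s3 failed loudly, which is the behavior you want.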

    Across all five layers, the scorer toolkit is the same trio Anthropic describes: code-based graders for anything deterministic, model-based graders for judgment calls, and humans for calibration. The skill is matching the layer to the cheapest scorer that can actually measure it.

    First, Capture Traces Worth Scoring

    Everything above assumes one thing: a complete record of the run. You cannot score tool selection without the tool specs, can't score arguments you didn't log, and can't score recovery if errors vanish into a retry wrapper. Scoring quality is bounded by trace quality.

    A scoreable trace contains the full message history, every tool call with its arguments and results, errors and retries, token counts and latency per step, and session grouping that ties multi-turn conversations together. In practice that means OpenTelemetry-style span trees, where each model call, tool call, and custom operation is a span with its inputs and outputs attached, viewable as a tree, a timeline, or a readable conversation thread.
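
    Before investing in scorers, it can help to check which layers a given trace can support at all. This sketch maps each scoring layer to the fields it needs; the field names are hypothetical and should be replaced with the attribute names your tracing setup actually emits.

```python
# Hypothetical field names, one tuple of requirements per scoring layer.
LAYER_REQUIREMENTS = {
    "final_outcome": ("messages",),
    "tool_selection": ("messages", "available_tools"),
    "tool_arguments": ("tool_calls",),
    "trajectory": ("tool_calls", "token_counts", "latency_ms"),
    "errors_and_recovery": ("tool_calls", "errors", "retries"),
}


def missing_by_layer(trace: dict) -> dict[str, list[str]]:
    """Map each layer to the fields it lacks; an empty list means scoreable."""
    # Presence is checked, not truthiness: an empty errors list is valid data.
    return {
        layer: [field for field in fields if field not in trace]
        for layer, fields in LAYER_REQUIREMENTS.items()
    }


trace = {"messages": ["Where is order ABC-123?"], "tool_calls": [], "errors": []}
gaps = missing_by_layer(trace)
for layer, missing in gaps.items():
    print(layer, "ok" if not missing else f"missing {missing}")
```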

    This is where Catalyst Tracing fits. Install the SDK (@inference/tracing for TypeScript, inference-catalyst-tracing for Python), call setup() before your clients are constructed, and instrumented SDK calls are captured automatically. Integrations cover OpenAI, Anthropic, the Vercel AI SDK, LangChain, LangGraph, Pydantic AI, OpenAI Agents, the Claude Agent SDK, and several other agent frameworks, with manual spans as the escape hatch for custom work.

    The piece teams skip, and regret skipping, is agent identity. Wrapping each run in an agent span with a stable agent_id and a per-conversation session_id is what turns a pile of disconnected traces into scoreable units: runs group into agents, traces group into sessions, and your evals get a denominator.

    Here's what that looks like around an OpenAI Agents SDK run:

    import json
    
    from inference_catalyst_tracing import agent_span, setup
    from agents import Agent, Runner, function_tool
    
    tracing = setup()
    
    
    @function_tool
    def lookup_order(order_id: str) -> str:
        """Look up an order by ID."""
        return json.dumps({"order_id": order_id, "status": "shipped", "total": 42.5})
    
    
    support_agent = Agent(
        name="SupportAgent",
        instructions="Use tools to help customers with order questions.",
        tools=[lookup_order],
        model="gpt-4o-mini",
    )
    
    user_message = "Where is order ABC-123?"
    with agent_span(
        tracing.tracer,
        agent_id="support-agent",            # stable ID everything groups on
        agent_name="Support Agent",
        span_name="support-agent.run",
        session_id="conversation-order-abc-123",  # your conversation key
        user_id="user_8675309",
        agent_role="support",
        system="openai",
    ) as span:
        span.set_input(user_message)
        # Any custom attribute becomes filterable from the dashboard and CLI.
        span.set_attribute("organization.id", "org_42")
        result = Runner.run_sync(support_agent, user_message, max_turns=4)
        span.set_output(str(result.final_output))
    
    tracing.shutdown()

    The agent_id stays constant across deploys and renames; the session_id is your conversation key, like a chat or thread ID. Set them once at the top-level span and every layer of scoring downstream (per-agent metrics, session-level judgments, cross-run analysis) inherits clean grouping.

    If you'd rather see your first trace before reading further, the quickstart takes a few minutes.

    Writing Rubrics an LLM Judge Can Actually Score

    With traces flowing, you need criteria. A rubric is a plain-English description of one quality dimension, scored numerically. It's what "good" means for your use case, written down precisely enough that a model can apply it the same way every time.

    There are three ways to get one: generate it with AI from an existing dataset (the generator reads your inputs and outputs and proposes relevant dimensions), start from a template for common dimensions like accuracy or format compliance, or write your own from scratch. Generated rubrics and templates are starting points. Review them before trusting them.

    The mechanics that separate working rubrics from noise:

    • Be task-specific. "Is this response accurate?" produces mush. "Does the extracted JSON contain all required fields with correct types?" produces a score you can act on.
    • Describe every score level. A 1–10 scale (the default range) needs the qualities of a 3 and a 7 spelled out, or scores drift between runs. An alternative is composite criteria that sum to the max: 3 points for clarity, 3 for correctness, and so on.
    • Match the range to the question. Pass/fail dimensions do better on a 1–3 scale than a 1–10 one.
    • Pick the right comparison mode. A direct rubric judges the output on its own; an adherence rubric judges it against a reference response from your dataset.

    The scoring mechanism itself is LLM-as-a-judge: the judge model receives the rubric, the conversation context, and the output to score, and returns a number in the rubric's range. Use the smartest model you can get as the judge; one weaker than the models it evaluates can't reliably distinguish quality. And budget for it: judge calls are full LLM inferences with real cost, which is why sampling matters once you score live traffic.
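
    To make the mechanics concrete, here is a sketch of a rubric as data, a prompt builder, and a strict parser for the judge's reply. The Rubric class, the prompt wording, and the example levels are illustrative assumptions, not Catalyst's rubric format, and the actual model call is left out.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rubric:
    name: str
    question: str
    levels: dict[int, str]  # score -> what an output must show to earn it

    @property
    def score_range(self) -> tuple[int, int]:
        return min(self.levels), max(self.levels)


def build_judge_prompt(rubric: Rubric, conversation: str, output: str) -> str:
    """Assemble the rubric, the context, and the output into one judge prompt."""
    low, high = rubric.score_range
    levels = "\n".join(
        f"  {score}: {text}" for score, text in sorted(rubric.levels.items())
    )
    return (
        f"Rubric: {rubric.name}\n"
        f"Question: {rubric.question}\n"
        f"Score levels ({low} to {high}):\n{levels}\n\n"
        f"Conversation:\n{conversation}\n\n"
        f"Output to score:\n{output}\n\n"
        f"Reply with a single integer from {low} to {high}."
    )


def parse_score(reply: str, rubric: Rubric) -> int:
    """Extract the first integer in the reply and reject out-of-range scores."""
    match = re.search(r"-?\d+", reply)
    if match is None:
        raise ValueError(f"no score found in reply: {reply!r}")
    score = int(match.group())
    low, high = rubric.score_range
    if not low <= score <= high:
        raise ValueError(f"score {score} outside {low}..{high}")
    return score


json_rubric = Rubric(
    name="json-completeness",
    question="Does the extracted JSON contain all required fields with correct types?",
    levels={
        1: "Required fields are missing or the JSON does not parse.",
        2: "All required fields are present but at least one has the wrong type.",
        3: "All required fields are present with correct types.",
    },
)
prompt = build_judge_prompt(json_rubric, "user: extract the invoice", '{"total": "42.5"}')
print(prompt)
print(parse_score("Score: 2", json_rubric))
```

    Rejecting out-of-range replies instead of clamping them keeps a malformed judge response from turning into a plausible-looking score.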

    Where judge models fail

    Judges inherit the failure modes of the models they're built on, and the research record is specific about three:

    • Position bias. In pairwise comparisons, judges favor responses by slot rather than content.
    • Verbosity bias. Longer outputs score higher across judge families, independent of quality.
    • Self-preference bias. Judges rate outputs that resemble their own writing more favorably, measured at roughly a 10% higher win rate for GPT-4 judging its own outputs.

    There's a fourth failure mode that isn't the judge's fault: a vague rubric. The judge doesn't decide what "good" means; it faithfully scores against whatever criteria you wrote, including bad ones.

    The mitigations follow directly. Score single outputs against explicit rubrics instead of running pairwise comparisons where position and verbosity bias bite hardest. Write score levels concrete enough that a different judge model would land in the same place. Spot-check a sample of judge scores against human review before you gate anything on them, and re-validate the rubric itself when scores stop matching your intuition.
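
    The spot-check can be as simple as comparing paired scores on a small sample. A sketch, with made-up scores and an arbitrary tolerance of one point; how much disagreement is acceptable is a call you have to make for your own rubric.

```python
def agreement_report(judge: list[int], human: list[int], tolerance: int = 1) -> dict:
    """Compare judge scores with human scores given to the same outputs."""
    if not judge or len(judge) != len(human):
        raise ValueError("need two non-empty score lists of equal length")
    signed = [j - h for j, h in zip(judge, human)]
    n = len(signed)
    return {
        "mean_abs_diff": sum(abs(d) for d in signed) / n,
        "within_tolerance": sum(abs(d) <= tolerance for d in signed) / n,
        # Positive bias means the judge scores higher than the humans do.
        "judge_bias": sum(signed) / n,
    }


judge_scores = [8, 7, 9, 4, 6]  # illustrative values only
human_scores = [7, 7, 6, 4, 5]
report = agreement_report(judge_scores, human_scores)
print(report)
```

    A consistently positive judge_bias on long outputs is one way verbosity bias can show up in practice.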

    Offline Evals vs Online Evals

    The same rubric serves two different questions, and conflating them is one of the most common AI agent evaluation mistakes.

    Offline evals run a curated dataset through one or more candidate models and score the regenerated outputs. The question they answer is "how would this model or prompt perform on these inputs?", which is the right question before a change ships. This is the regime where AI agent testing lives: regression-gating a prompt edit, comparing a model swap, checking a new tool description against the cases that broke last month.

    Online evals score what production actually did: the real outputs users saw, sampled continuously from live traffic. The question is "how is my agent behaving right now?" Drift, model-provider updates, and the failure modes you didn't anticipate only show up here.

    Dimension         | Offline evals          | Online evals
    ------------------+------------------------+--------------------------
    What's scored     | Regenerated outputs    | Real production outputs
    Data source       | Curated eval dataset   | Live traffic, sampled
    When it runs      | Before changes ship    | Continuously
    Question answered | "How would X perform?" | "How is it behaving now?"
    Cost control      | Bounded dataset size   | Sample rate
    Fits best         | Gating changes         | Catching drift

    You need both, in sequence: offline to gate changes before they ship, online to catch what the gate missed.
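
    An offline gate reduces to a comparison between two score lists over the same dataset. The sketch below is one simple policy, with an arbitrary max_drop threshold and made-up scores; it is an assumption about how you might gate, not a built-in Catalyst feature.

```python
from statistics import mean


def gate_change(baseline: list[float], candidate: list[float], max_drop: float = 0.25) -> dict:
    """Pass the change only if the mean score drops by no more than max_drop."""
    if not baseline or len(baseline) != len(candidate):
        raise ValueError("score lists must be non-empty and cover the same samples")
    delta = mean(candidate) - mean(baseline)
    return {"delta": delta, "passed": delta >= -max_drop}


current_prompt = [7, 8, 6, 9]  # illustrative rubric scores per sample
edited_prompt = [7, 6, 5, 9]
result = gate_change(current_prompt, edited_prompt)
print(result)
```

    A mean-only gate can hide a single collapsed sample behind several small gains, so it is worth pairing with a per-sample look.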

    Building eval datasets from real traffic

    The strongest argument for capturing traffic through a gateway or tracing layer is that your eval dataset builds itself. Tag your LLM calls with a task header so calls group by objective (document-summary, ticket-classifier) rather than by model or prompt. Then filter captured traffic by task, model, status code, and date range, and save the slice as an eval dataset.
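
    In code, that filter is a handful of comparisons over captured records. The record fields below (task, model, status_code, timestamp) are hypothetical names for illustration; the dashboard and CLI do this for you, so treat this as a sketch of the logic rather than an API.

```python
from datetime import datetime


def select_slice(records, task, model=None, status_code=200, start=None, end=None):
    """Filter captured calls by task, model, status code, and date range."""
    selected = []
    for record in records:
        if record["task"] != task:
            continue
        if model is not None and record["model"] != model:
            continue
        if status_code is not None and record["status_code"] != status_code:
            continue
        timestamp = datetime.fromisoformat(record["timestamp"])
        if start is not None and timestamp < start:
            continue
        if end is not None and timestamp >= end:  # end is exclusive
            continue
        selected.append(record)
    return selected


records = [
    {"task": "ticket-classifier", "model": "gpt-4o-mini", "status_code": 200, "timestamp": "2026-06-10T09:00:00"},
    {"task": "ticket-classifier", "model": "gpt-4o-mini", "status_code": 500, "timestamp": "2026-06-11T09:00:00"},
    {"task": "document-summary", "model": "gpt-4o-mini", "status_code": 200, "timestamp": "2026-06-11T10:00:00"},
    {"task": "ticket-classifier", "model": "gpt-4o-mini", "status_code": 200, "timestamp": "2026-05-01T09:00:00"},
]
june = select_slice(
    records,
    task="ticket-classifier",
    start=datetime(2026, 6, 1),
    end=datetime(2026, 7, 1),
)
print(len(june))
```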

    Curate it like a benchmark, because that's what it is: small, stable, and challenging. Pick the hard cases — the ones where you're not sure the agent gets it right — and don't churn them, or you lose the ability to compare across time. Keep eval data strictly separate from training data; if a model trains on your eval examples, the eval is meaningless.

    For continuous scoring of live traffic, Signals run plain-language classifiers against an agent's spans as they arrive. A signal is either a binary check ("did the user get frustrated?") or an enumerated label ("task outcome: completed / partial / failed / abandoned"), applied by an LLM judge to a deterministic sample of traffic: 10%, 25%, 50%, or everything. You can test a signal against 1–100 recent spans before activating it, and backfill historical spans once you trust it.

    One honest caveat on the rubric side: offline rubric evals are available in Catalyst today, while running rubrics directly against captured production outputs is coming soon. Today's eval runs regenerate outputs with the models you select.
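
    For intuition about what a deterministic sample means, here is one common way to build one: hash a stable ID into a number between 0 and 1 and compare it with the sample rate. This is a generic technique and an assumption on our part, not a description of how Signals select spans internally.

```python
import hashlib


def in_sample(span_id: str, rate: float) -> bool:
    """Decide membership from the ID alone, so the answer never changes between runs."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    digest = hashlib.sha256(span_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate


span_ids = [f"span-{i}" for i in range(1000)]
sampled_10 = {s for s in span_ids if in_sample(s, 0.10)}
sampled_25 = {s for s in span_ids if in_sample(s, 0.25)}
print(len(sampled_10), len(sampled_25))
```

    Because each span's bucket is fixed, raising the rate only adds spans: everything scored at 10% is still scored at 25%, which keeps historical comparisons intact.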

    Comparing Models With the Same Rubric

    Once a rubric and dataset exist, multi-model comparison stops being a research project. The run is a cross-product: every sample goes through every candidate, and every output gets scored by the judge. Ten samples across three models means 30 generated outputs and, with a single rubric, 30 judge calls, producing per-sample scores for each model. Candidates can be different base models, different providers, or your own fine-tuned checkpoints. The identical setup compares prompt versions instead of models.
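
    The cross-product is easy to see in code. In this sketch, generate and judge are placeholder callables standing in for real model and judge calls, and the model names are invented; the point is only the shape of the loop and the call counts.

```python
from itertools import product


def run_comparison(samples, models, generate, judge):
    """Score every sample under every model; returns model -> per-sample scores."""
    scores = {model: [] for model in models}
    for sample, model in product(samples, models):
        output = generate(model, sample)
        scores[model].append(judge(sample, output))
    return scores


calls = {"generate": 0, "judge": 0}


def fake_generate(model: str, sample: str) -> str:
    calls["generate"] += 1
    return f"{model} answer to {sample}"


def fake_judge(sample: str, output: str) -> int:
    calls["judge"] += 1
    return len(output) % 10 + 1  # placeholder score in 1..10, not a real judgment


samples = [f"sample-{i}" for i in range(10)]
models = ["model-a", "model-b", "model-c"]  # hypothetical candidates
scores = run_comparison(samples, models, fake_generate, fake_judge)
print(calls)
```

    Each additional rubric multiplies the judge calls again, which is the main reason to keep the eval dataset small.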

    Read the results in two passes. The aggregate view (side-by-side plots and a full scores table) tells you which candidate wins on average and whether models trade off across rubric dimensions, like one being more accurate while another nails tone. The per-sample view tells you why: individual cases where one model collapses reveal quirks that averages bury.
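
    Both passes fit in a few lines once scores are laid out per model and per sample. The scores and model names below are invented for illustration.

```python
from statistics import mean


def largest_gaps(scores: dict[str, list[float]], top_n: int = 3) -> list[tuple]:
    """Rank samples by the spread between the best and worst model score."""
    models = list(scores)
    rows = []
    for index in range(len(scores[models[0]])):
        per_model = {model: scores[model][index] for model in models}
        gap = max(per_model.values()) - min(per_model.values())
        rows.append((gap, index, per_model))
    rows.sort(key=lambda row: (-row[0], row[1]))  # biggest gap first, ties by index
    return rows[:top_n]


scores = {
    "model-a": [8, 7, 9, 8],
    "model-b": [8, 7, 2, 9],
}
averages = {model: mean(values) for model, values in scores.items()}
worst = largest_gaps(scores, top_n=1)
print(averages)
print(worst)
```

    Here the averages show model-a ahead, but only the per-sample view shows that the whole difference comes from one sample where model-b collapses.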

    The output is a decision, not a leaderboard: which model serves production, whether no off-the-shelf option scores well enough and fine-tuning is warranted, or whether the rubric itself needs work because the scores contradict what your eyes tell you. Rubrics support versioning for exactly that last case: iterate on the criteria, re-run, and compare how rubric versions score the same data.

    From Scores to Fixes: Finding Recurring Failure Modes

    Scores tell you something is wrong. They are conspicuously silent about what to change. A 6.2 average on tool-call correctness doesn't tell you that the search tool's date parameter gets formatted wrong every time a user says "last quarter."

    The missing step is cross-run analysis: reading many traces together and asking what fails repeatedly. Done by hand, this is afternoons of scrolling span trees. The structural problem is that traces are enormous. Feed a week of them to a general-purpose model and it either blows past its context window or overfits to the error in a single trace instead of generalizing to the systemic problem behind it.

    Halo (Hierarchical Agent Loop Optimization) is an open-source engine built for exactly this gap. It's an RLM, a recursive language model: instead of stuffing spans into one prompt, it holds your traces as a variable in a code environment and programmatically decomposes them, recursively calling sub-models over slices. It reads OpenTelemetry-compatible spans, identifies systemic failure modes across many runs, and returns ranked findings with concrete fixes, each citing the specific trace IDs it came from so you can click through and verify before acting.

    On inference.net's own GTM agent, a Halo run surfaced tools called with invalid inputs, queries failing silently while their spans reported ok, and duplicate tool calls repeating identical work, each with a recommended fix. Those are precisely the layer-2-through-5 failures from earlier in this article, found without anyone writing a rubric for them first.
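
    For a sense of what a ranked, trace-cited finding looks like, here is a deliberately naive version of cross-run grouping. It is not how Halo works: it only counts failures that are already labeled as errors, and the trace shape is a hypothetical one chosen for the example.

```python
from collections import defaultdict


def recurring_failures(traces: list[dict], min_traces: int = 2) -> list[dict]:
    """Group labeled tool errors across traces and rank them by how many traces they hit."""
    seen = defaultdict(set)  # (tool, error) -> IDs of traces where it occurred
    for trace in traces:
        for call in trace["tool_calls"]:
            if call.get("error"):
                seen[(call["tool"], call["error"])].add(trace["trace_id"])
    findings = [
        {"tool": tool, "error": error, "trace_ids": sorted(ids)}
        for (tool, error), ids in seen.items()
        if len(ids) >= min_traces
    ]
    findings.sort(key=lambda f: (-len(f["trace_ids"]), f["tool"], f["error"]))
    return findings


traces = [
    {"trace_id": "t1", "tool_calls": [{"tool": "search", "error": "invalid date format"}]},
    {"trace_id": "t2", "tool_calls": [
        {"tool": "search", "error": "invalid date format"},
        {"tool": "lookup_order", "error": None},
    ]},
    {"trace_id": "t3", "tool_calls": [{"tool": "lookup_order", "error": "timeout"}]},
]
findings = recurring_failures(traces)
print(findings)
```

    Silent failures and wasted steps never appear in a grouping like this, because nothing in the trace labels them as errors.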

    You can run Halo on demand against a time window, or schedule it hourly, daily, weekly, or monthly so it reviews recent traces automatically. Separate schedules can hunt different failure classes, one watching cost regressions while another watches reliability. Self-hosters can pip install halo-engine and point it at an exported JSONL trace file; the hosted version runs the same MIT-licensed engine against traces you've already collected.

    This closes the loop the article opened with:

    Figure 2: The agent evaluation loop — traces feed rubric evals and signals, Halo turns scored runs into ranked trace-cited findings, and fixes feed the next window of traces

    Evals flag the regression, Halo locates the mechanism, you ship the fix, and the next scored window confirms it. Trace, score, analyze, fix, repeat.

    A One-Week Starter Workflow

    Here's how to evaluate AI agents starting from zero instrumentation, in seven steps:

    1. Instrument your agent. Run inf instrument --mode tracing from your project root (the CLI scans your codebase, installs the tracing SDK, and wires in setup() and agent identity), or do the manual SDK setup if you prefer to see every line.
    2. Verify traces are flowing. Check inf trace list --range 1h or the dashboard's Traces tab after exercising your agent.
    3. Write three rubrics or signals. One for final outcome, one for tool-call correctness, one for efficiency. Plain English, explicit score levels.
    4. Turn on signals at a low sample rate. Let them label live traffic for a full week, long enough to catch the input variety a single day misses.
    5. Build an eval dataset from the week's hard cases. Filter to the failures and near-misses your signals flagged, and save them as a small, stable eval set.
    6. Run an offline comparison. Your current model versus one candidate, same rubric, same dataset. Read the per-sample breakdown, not just the averages.
    7. Schedule Halo daily and triage. Read the ranked findings, ship the top fix, and watch next week's scores confirm or refute it.

    The terminal side of that workflow fits in a few commands:

    # Install the Inference CLI and authenticate (your browser opens to log in)
    npm install -g @inference/cli && inf auth login
    
    # From your project root: scan the codebase, install the tracing SDK,
    # and wire setup() plus agent identity into your entrypoint
    cd /path/to/your/project && inf instrument --mode tracing
    
    # Run your app normally, then confirm traces are arriving
    inf trace list --range 1h
    
    # Open a trace tree or timeline
    inf trace get <trace-id> --view tree
    inf trace get <trace-id> --view timeline
    
    # Search spans and inspect captured inputs/outputs
    inf span list --trace-id <trace-id> --kind LLM
    inf span get <trace-id> <span-id> --view io

    By the end of the week you have something no benchmark provides: scores attached to your agent's real behavior, an eval dataset that gets harder as your traffic gets weirder, and a ranked list of what to fix next.

    Score Real Runs, Not Hypotheticals

    Benchmarks answer one question, which base model to start from, and then their usefulness ends. Every AI agent evaluation question that matters afterward runs through your own traces: what to score at each layer of a run, criteria a judge can apply consistently, offline gates before changes ship, online signals after, and cross-run analysis that turns low scores into named, fixable failure modes.

    The loop compounds. Every scored week sharpens the eval dataset, every Halo report shortens the fix list, and the agent that emerges is one you can change without holding your breath.

    You can run this entire loop on traffic you're already serving.
