    Jun 11, 2026

    AgentOps: What It Does, Pricing, and Alternatives (2026)

    Inference Research

    What Is AgentOps? The Product vs. the Practice

    Search for "agentops" and you get two different things wearing the same name. The first result is a product: AgentOps, the monitoring platform at agentops.ai, with a Python SDK on GitHub and a dashboard full of session replays. A few results down, IBM, AWS, and Red Hat are writing about "AgentOps" as a discipline: the operational practice of running AI agents in production, the way DevOps is the practice of running software. Neither cohort acknowledges the other, which leaves you piecing together the difference from a GitHub README and an enterprise think piece.

    This article clears that up first, then does what nothing on the current results page does: it evaluates the AgentOps product hands-on (what it does well, what it costs, where its limits show up) and compares the credible alternatives for teams monitoring agents in production.

    AgentOps is two things: a monitoring platform for AI agents (agentops.ai) that provides session replay, debugging, and cost tracking through a Python SDK, and an emerging operational discipline, short for "agent operations," covering how teams deploy, manage, and continuously improve autonomous AI agents in production.

    AgentOps the product

    The product is a developer platform for building and monitoring AI agents. Its core features are session replay (visual tracking of LLM calls, tool calls, and multi-agent interactions), time-travel debugging that lets you rewind and replay agent runs, an audit trail of logs, errors, and prompt-injection attempts, and cost tracking that follows token spend across 400+ LLMs. The SDK is MIT-licensed, has around 5.6k GitHub stars, and takes two steps to set up: pip install agentops, then agentops.init() with your API key.
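    A minimal sketch of that setup step follows. It assumes the key is stored in an AGENTOPS_API_KEY environment variable; the helper name init_agentops is hypothetical, and the only SDK call used is agentops.init(api_key=...). The guard clauses let the same code run in environments where monitoring is not configured.

```python
import os


def init_agentops() -> bool:
    """Start AgentOps tracking if both a key and the SDK are available.

    Returns True when agentops.init() was called, False otherwise.
    """
    api_key = os.environ.get("AGENTOPS_API_KEY")  # assumed variable name
    if not api_key:
        return False  # no key configured: run without monitoring
    try:
        import agentops  # installed with: pip install agentops
    except ImportError:
        return False  # SDK not installed in this environment
    agentops.init(api_key=api_key)
    return True
```

    Call init_agentops() once at process start, before any agents or LLM clients run, so their calls fall inside the recorded session.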

    AgentOps the practice

    The discipline borrows its shape from DevOps and MLOps: agents act autonomously, chain tasks, and behave non-deterministically, so they need their own operational layer for deployment, observability, and continuous improvement. IBM, AWS, and Red Hat all publish definitional guides on this version of AgentOps, and IBM Research notably built its own AgentOps tooling on OpenTelemetry standards.

    The practical consequence: when a vendor says "agentops tooling," check which meaning they're selling. The rest of this article uses "AgentOps" for the product and "agent operations" for the practice.

    Read time: 11 minutes


    What the AgentOps Platform Does Well

    An honest evaluation starts with credit where it's due, because the product earns real adoption for real reasons.

    Session replay is a genuinely useful debugging primitive. The dashboard organizes everything around sessions: a list of recorded runs with execution times, and a session waterfall that lays out LLM calls, action events, tool calls, and errors on a timeline. LLM calls render as chat history, so you can read the exact prompts and completions that produced a behavior. For a developer staring at a misbehaving agent, "show me exactly what it did, step by step" is the first question, and AgentOps answers it well.

    The SDK covers a lot of frameworks with almost no setup. Two lines of code get you instrumented, and the integration list spans OpenAI Agents SDK, CrewAI, AG2 (AutoGen), LangChain, Camel AI, Cohere, Anthropic, Mistral, LiteLLM, LlamaIndex, Llama Stack, and SwarmZero AI. If your team is on CrewAI or AutoGen, where observability options used to be thin, AgentOps was often the first tool that gave you any visibility at all.

    Cost tracking comes built in. Token counts and spend tracking across 400+ LLMs mean you can see which agents are burning budget without wiring up a separate metering layer.

    As observability for AI agents goes, AgentOps is a solid first layer: fast to install, broad framework support, and a session-centric view that makes individual runs legible. The question is what happens when you move from debugging a demo to operating a production fleet.

    AgentOps Pricing and Where the Limits Start to Bite

    AgentOps publishes three tiers on its homepage. (The standalone pricing page returned a 404 when we checked in June 2026, so the numbers below come from the homepage itself.)

    Plan        Price        Included                   Notable gates
    ----------  -----------  -------------------------  -------------------------
    Basic       Free         5,000 events/mo            Core features only
    Pro         From $40/mo  Higher volumes, retention  Full details via sales
    Enterprise  Custom       SLA, SSO, self-hosting     SOC-2, HIPAA, NIST AI RMF

    Source: agentops.ai homepage, June 2026.

    Two limits are worth doing the math on before you commit.

    The free tier's 5,000 events per month is smaller than it sounds. An "event" is each tracked thing: an LLM call, a tool call, an action. A single agent run that makes a handful of model calls and a few tool calls can emit a dozen or more events, so 5,000 events can translate to a few hundred agent runs. A production agent handling real traffic exhausts that in days, which is the point: the free tier is for evaluation, not operation.
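    The arithmetic is easy to run against your own traffic. The sketch below uses the published 5,000-event quota; the 12 events per run and 100 runs per day are illustrative assumptions, not measurements.

```python
FREE_TIER_EVENTS = 5_000  # published Basic tier quota, events per month


def runs_per_month(event_quota: int, events_per_run: int) -> int:
    """Whole agent runs that fit inside a monthly event quota."""
    if events_per_run <= 0:
        raise ValueError("events_per_run must be positive")
    return event_quota // events_per_run


def days_until_exhausted(event_quota: int, events_per_run: int, runs_per_day: int) -> float:
    """Days of traffic the quota covers at a steady daily run rate."""
    if runs_per_day <= 0:
        raise ValueError("runs_per_day must be positive")
    return runs_per_month(event_quota, events_per_run) / runs_per_day


# Assumed workload: 12 events per run, 100 runs per day.
print(runs_per_month(FREE_TIER_EVENTS, 12))             # 416
print(days_until_exhausted(FREE_TIER_EVENTS, 12, 100))  # 4.16
```

    Swap in your own per-run event count (LLM calls plus tool calls plus actions) to see how quickly the quota runs out.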

    Compliance is gated to Enterprise. SOC-2, HIPAA, and NIST AI RMF support, along with self-hosting on AWS, GCP, or Azure, custom SSO, and SLAs, all live on the custom-priced Enterprise tier. If your security team requires SOC 2 from vendors that hold your prompt and completion data (and prompts routinely contain customer data), the effective price of AgentOps is "talk to sales," not $40 a month.

    Where Teams Outgrow AgentOps

    None of the limits above are bugs. They follow from what the product is: a session-inspection tool. The ceiling shows up in three places.

    Single-session inspection vs. cross-run analysis

    The AgentOps mental model is one session at a time. The waterfall, the chat-history view, the time-travel replay: each answers "what happened in this run?" That's the right question during development. In production, the question changes to "what keeps happening across the last five thousand runs?" Answering that by replaying sessions one at a time is archaeology, not analysis.

    Production failure modes are statistical. An agent that occasionally calls the wrong tool, or retries itself into a loop only when a specific upstream API times out, doesn't look broken in any single replay. You need tooling that reads across runs and surfaces the pattern.

    Figure 1: Single-session inspection vs. cross-run analysis — session replay makes one run legible; cross-run analysis reads thousands of traces and returns ranked, recurring failure modes with trace IDs as evidence.
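    To make the contrast concrete, here is a deliberately simple sketch of reading across runs: group failed steps by a (tool, error) signature and rank the groups by how often they recur, keeping trace IDs as evidence. The record fields and sample data are hypothetical, and real analysis engines do far more than count.

```python
from collections import defaultdict

# Hypothetical, simplified records: one dict per failed step in a trace.
traces = [
    {"trace_id": "t-001", "tool": "search_orders", "error": "timeout"},
    {"trace_id": "t-002", "tool": "refund", "error": "wrong_tool"},
    {"trace_id": "t-003", "tool": "search_orders", "error": "timeout"},
    {"trace_id": "t-004", "tool": "search_orders", "error": "timeout"},
    {"trace_id": "t-005", "tool": "refund", "error": "wrong_tool"},
    {"trace_id": "t-006", "tool": "lookup_user", "error": "bad_args"},
]


def rank_failure_modes(records):
    """Group failures by (tool, error) and rank by recurrence count."""
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["tool"], rec["error"])].append(rec["trace_id"])
    # Most frequent first; ties broken by signature for stable output.
    ranked = sorted(groups.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return [
        {"tool": tool, "error": error, "count": len(ids), "trace_ids": ids}
        for (tool, error), ids in ranked
    ]


for finding in rank_failure_modes(traces):
    print(finding["count"], finding["tool"], finding["error"], finding["trace_ids"])
```

    No single record here looks alarming on its own; the search_orders timeout only stands out once the three occurrences are counted together.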

    Eval depth

    The AgentOps homepage markets replay, debugging, audit trails, and cost tracking. What it doesn't center is evaluation: scoring outputs against quality rubrics, building eval sets from captured production traffic, and gating prompt or model changes on the results. Teams that ship agent changes weekly need those workflows, and they typically bolt on a second tool to get them.

    Open standards and data ownership

    AgentOps documentation centers on its own SDK and event model rather than open span conventions. Meanwhile the industry is converging on OpenTelemetry: the GenAI semantic conventions for LLM client spans exited experimental status in early 2026, though agent and framework span conventions are still maturing. The pragmatic standard today is OpenInference, an open span-attribute convention for AI workloads maintained by Arize, layered on OTel. Any OpenInference-aware backend can read traces in that shape. Capture your telemetry in an open shape and you can switch vendors, run multiple backends, or feed the same spans into your own analysis. Capture it in a proprietary event model and your history is stuck where it was written.
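    For a sense of what "an open shape" means in practice, the sketch below builds the flat attribute dictionary an LLM span might carry. The attribute names follow our reading of the OpenInference conventions and should be checked against the current spec; the helper function itself is hypothetical and uses plain dictionaries rather than any tracing SDK.

```python
def llm_span_attributes(model, input_messages, output_text,
                        prompt_tokens, completion_tokens, session_id):
    """Build OpenInference-style attributes for one LLM call (illustrative)."""
    attrs = {
        "openinference.span.kind": "LLM",
        "llm.model_name": model,
        "llm.token_count.prompt": prompt_tokens,
        "llm.token_count.completion": completion_tokens,
        "llm.token_count.total": prompt_tokens + completion_tokens,
        "session.id": session_id,
    }
    # Message history is flattened into indexed keys, one pair per message.
    for i, msg in enumerate(input_messages):
        attrs[f"llm.input_messages.{i}.message.role"] = msg["role"]
        attrs[f"llm.input_messages.{i}.message.content"] = msg["content"]
    attrs["llm.output_messages.0.message.role"] = "assistant"
    attrs["llm.output_messages.0.message.content"] = output_text
    return attrs


example = llm_span_attributes(
    model="gpt-4o-mini",
    input_messages=[{"role": "user", "content": "Reply with just the word hello."}],
    output_text="hello",
    prompt_tokens=14,
    completion_tokens=1,
    session_id="session-123",  # hypothetical identifier
)
for key in sorted(example):
    print(key, "=", example[key])
```

    Because the keys are plain strings on a standard span, any backend that understands the convention can read them without knowing which SDK wrote them.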

    How to Evaluate an AgentOps Alternative

    Rather than ranking seventeen tools, here are the six criteria production teams actually use. Apply them to any candidate, including the ones below.

    1. Open trace format. Spans should be OpenTelemetry spans, exported over OTLP and following an open convention like OpenInference, so your data stays portable by default.

    2. Capture fidelity. A trace is only as useful as what's in it: full message history (system, user, assistant, and tool messages), tool calls with their JSON arguments and results, model metadata, and token usage per span. If you can't see the exact tool arguments that preceded a failure, you can't diagnose it.

    3. Session and agent grouping. Multi-agent systems need stable agent identity: a consistent agent.id and session.id that survive deploys, so runs group into meaningful histories instead of anonymous trace lists.

    4. Eval workflows. Look for rubric-based, LLM-as-judge scoring with both offline and online evaluation paths: offline evals on datasets built from captured traffic now, and continuous scoring of live outputs once the vendor supports it.

    5. Cost tracking at the span level. Token usage and cost should attach to the same spans as your traces, not live in a separate dashboard.

    6. Security posture. Check which tier SOC 2 lands on, whether the vendor strips secrets from captured data, and what data retention controls you get. AgentOps gates compliance to Enterprise; some alternatives ship it lower in the pricing ladder.

    Notice what isn't on the list: dashboard polish, framework logos, GitHub stars. Those are how tools get adopted. The six above are how they survive a year of production use.
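    Criteria 2 and 3 can be checked mechanically against a sample of exported spans. The sketch below assumes a simplified span record with hypothetical field names (kind, messages, tool_name, and so on); map them to whatever your candidate tool actually exports.

```python
# Fields each span kind should carry (criterion 2: capture fidelity).
REQUIRED_BY_KIND = {
    "LLM": ["messages", "model", "prompt_tokens", "completion_tokens"],
    "TOOL": ["tool_name", "arguments", "result"],
}
# Stable identity every span should carry (criterion 3: grouping).
GROUPING_KEYS = ["agent.id", "session.id"]


def missing_fields(span: dict) -> list:
    """Return the required fields that are absent or empty on a span."""
    required = GROUPING_KEYS + REQUIRED_BY_KIND.get(span.get("kind"), [])
    # Empty dicts and zero counts are valid values, so only None, "", and []
    # count as missing.
    return [key for key in required if span.get(key) in (None, "", [])]


tool_span = {
    "kind": "TOOL",
    "agent.id": "checkout-agent",
    "session.id": "session-123",
    "tool_name": "search_orders",
    "arguments": {"order_id": "A-17"},
    # "result" was never captured
}
print(missing_fields(tool_span))  # ['result']
```

    Running a check like this over a few hundred spans from a trial period shows quickly whether a tool drops tool results or loses agent identity.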

    AgentOps Alternatives Compared

    Four alternatives cover the realistic decision space: Catalyst (inference.net), LangSmith, Langfuse, and Arize Phoenix.

    Tool                      Trace format        Evals              Cross-run analysis  Entry pricing
    ------------------------  ------------------  -----------------  ------------------  ------------------
    AgentOps                  Own event model     Not core focus     No                  Free, 5k events/mo
    Catalyst (inference.net)  OpenInference/OTel  Rubric LLM-judge*  Yes (Halo)          Free, 1M spans/mo
    LangSmith                 Proprietary         Offline + online   No                  $39/seat/mo
    Langfuse                  Open source (MIT)   Yes                No                  Free, 50k units/mo
    Arize Phoenix             OpenInference/OTel  Yes                No                  Free (self-host)

    Assumptions: entry pricing is the lowest published tier as of June 2026. *Catalyst offline evals are available today; online evaluation is listed as coming soon.

    Figure 2: Entry Paid-Tier Price by Tool — lowest published monthly price, June 2026. Catalyst and Phoenix show $0 because a production-usable free path exists (Catalyst's free tier includes 1M spans/month; Phoenix is self-hosted open source).

    Catalyst (inference.net)

    Catalyst Tracing is a first-party capture layer built on open standards: TypeScript and Python SDKs that emit OpenInference-shaped OpenTelemetry spans over OTLP/HTTP. Auto-instrumentation covers 16+ documented integrations (OpenAI, Anthropic, the Vercel AI SDK, LangChain, LangGraph, Pydantic AI, OpenAI Agents, Claude Agent SDK, Claude Code SDK, Cursor SDK, LiveKit, ElevenLabs, and more), plus manual spans for custom steps. Captured spans carry full message history, tool calls with JSON arguments and results, model metadata, and token usage.

    The agents dashboard groups runs per agent with overview metrics (run counts, error rate, latency, token usage, cost over time), per-session conversation views, and pre-filtered traces. Evals are rubric-based with an LLM judge: offline evaluation on datasets built from captured production traffic is available today, and online evaluation is listed as coming soon. The free tier includes 1M OTel spans per month, which is enough to instrument a real production agent before paying anything. And it's the only option here with a built-in cross-run analysis engine (Halo, covered below). Choose it if you want open-format capture plus an improvement loop, not just a viewer.

    LangSmith

    LangSmith is the obvious choice if you're committed to LangChain or LangGraph. It's built by the same team, and the integration is the deepest available. It's proprietary, with tracing, offline and online LLM-as-judge evaluation, custom dashboards, and alerting. Pricing is per-seat: the Plus plan runs $39 per user per month with 10,000 base traces included, then $2.50 per 1,000 traces of overage at 14-day retention (or $5.00 per 1,000 for 400-day retention). The per-seat model means cost scales with your engineering headcount, not just your traffic, which is worth modeling before a large team adopts it.
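    A rough model of that pricing shows how headcount and traffic each move the bill. The prices are the ones quoted above; treating the 10,000 included traces as a per-account allowance and billing overage pro rata are our assumptions, so confirm both against the vendor's current terms.

```python
SEAT_PRICE = 39.00        # USD per user per month (Plus plan, as quoted above)
INCLUDED_TRACES = 10_000  # base traces included (assumed per account)
OVERAGE_PER_1K = {"14d": 2.50, "400d": 5.00}  # USD per 1,000 traces by retention


def langsmith_monthly_cost(seats: int, traces: int, retention: str = "14d") -> float:
    """Estimate a monthly bill from seat count and trace volume."""
    overage_traces = max(0, traces - INCLUDED_TRACES)
    overage = overage_traces / 1_000 * OVERAGE_PER_1K[retention]
    return seats * SEAT_PRICE + overage


print(langsmith_monthly_cost(seats=5, traces=10_000))    # 195.0
print(langsmith_monthly_cost(seats=20, traces=10_000))   # 780.0
print(langsmith_monthly_cost(seats=5, traces=110_000))   # 445.0
```

    Under these assumptions, quadrupling the team with flat traffic costs more than adding 100,000 traces with a flat team, which is the headcount effect worth modeling.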

    Langfuse

    Langfuse is the open-source default: MIT-licensed with self-hosting as a first-class path, and a cloud offering priced on units (a unit is one trace, observation, or score). Cloud tiers run from a free Hobby plan at 50k units per month through Core at $29/month (100k units) and Pro at $199/month, with overage at $8 per 100k units. ClickHouse acquired Langfuse in January 2026; the core remains MIT-licensed and the cloud continues as a standalone service. Choose it if self-hosting your observability stack is a requirement and you're prepared to operate it.

    Arize Phoenix

    Phoenix is the standards-bearer: open source (ELv2), built directly on OpenTelemetry, and powered by OpenInference instrumentation, the same convention Catalyst emits. It runs anywhere: locally, in a notebook, in Docker or Kubernetes, with tracing, LLM-as-judge evals, and experiments. The tradeoff is operational: self-hosted Phoenix means you run the infrastructure, the retention, and the upgrades, and there's no managed layer analyzing your traces for you. Choose it for eval-heavy research workflows or when data can't leave your environment.

    Beyond Monitoring: Cross-Run Analysis With Halo

    Every tool above will show you a trace. The differentiating question is what happens after you've collected a hundred thousand of them.

    Halo (Hierarchical Agent Loop Optimization) is an open-source, MIT-licensed engine built for exactly this step. It reads OpenTelemetry-compatible spans, decomposes them to find systemic failure modes across many runs, and returns a ranked list of findings. Each finding cites the specific trace IDs it came from, so you can click straight from a finding into the evidence.

    It runs two ways. Self-hosted: pip install halo-engine and point it at a JSONL trace file. Hosted: it's built into the Catalyst dashboard, where you can analyze your traces on demand or on a schedule (hourly, daily, weekly, or monthly, with the lookback window matched to the cadence and capped at 30 days per run). Each run produces a report in an analysis history, so you can track whether last month's top failure mode actually went away.

    This is the structural difference from session replay. Replay shows you one failure, and finding it depends on knowing which session to open. Cross-run analysis tells you "this tool-selection error recurs across N runs, here are the trace IDs, and here's the recommended fix." It turns "inspect sessions" into "fix systemic failures."

    If your agents are already in production, this is the loop that compounds: capture everything, let the analysis rank what's broken, fix the top item, repeat.

    Find out why your agents fail

    Halo reads your production traces and returns a ranked list of failure modes, with the trace IDs to prove it. It works across all your traffic instead of one trace at a time.

    Migrating From AgentOps in Under an Hour

    Catalyst's instrumentation is a drop-in at the same place agentops.init() lives, and there are two ways to do the swap.

    Path 1: let a coding agent do it. The inference CLI ships an AI-assisted installer. inf instrument --mode tracing launches your choice of Claude Code, OpenCode, or Codex, scans your codebase for LLM clients and agent frameworks, installs the tracing SDK with the right per-integration extras, wires setup() into your entrypoint, and adds stable service and agent identity. You review the diff before it applies anything.

    # 1. Install the Inference CLI and authenticate (opens your browser)
    npm install -g @inference/cli && inf auth login
    
    # 2. From your project root, run instrumentation in tracing mode.
    #    Launches a coding agent (Claude Code, OpenCode, or Codex) that scans
    #    for LLM clients, installs the SDK with the right extras, and wires
    #    setup() into your entrypoint. You review the diff before it applies.
    cd /path/to/your/project && inf instrument --mode tracing
    
    # 3. Run your app as usual, then verify spans are flowing
    inf trace list --range 1h
    
    # 4. Drill into a trace
    inf trace get <trace-id> --view tree

    Path 2: wire it manually. Install the Python package with the extras matching your stack (openai, anthropic, langchain, langgraph, openai-agents, and others are available), set three environment variables, and call setup() before your clients are constructed.

    # Install: pip install 'inference-catalyst-tracing[openai]'
    #
    # Configure export before your app starts:
    #   export CATALYST_OTLP_ENDPOINT="https://telemetry.inference.net"
    #   export CATALYST_OTLP_TOKEN="<your-token>"   # API key from the dashboard
    #   export CATALYST_SERVICE_NAME="checkout-agent"
    
    import os
    
    from inference_catalyst_tracing import setup
    from openai import OpenAI
    
    tracing = setup()  # auto-detects installed packages; call before constructing clients
    client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Reply with just the word hello."}],
        max_tokens=16,
    )
    
    print(response.choices[0].message.content)
    tracing.shutdown()  # short-lived process: flush batched spans before exit

    One lifecycle note: spans batch and export in the background, so short-lived scripts should call shutdown() before exit, and serverless handlers should flush per invocation instead. Once spans are flowing, verify from the terminal with inf trace list --range 1h, then drill into any trace with inf trace get <trace-id> --view tree.
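    The per-invocation flush pattern can be sketched generically. The decorator below runs a flush callable after every handler call, whether it succeeds or raises. The flush function here is a stand-in: this article does not name the SDK's per-invocation flush method, so substitute the real call from the SDK documentation.

```python
import functools


def flush_after(flush):
    """Decorator factory: call `flush` after every handler invocation."""
    def decorator(handler):
        @functools.wraps(handler)
        def wrapper(*args, **kwargs):
            try:
                return handler(*args, **kwargs)
            finally:
                flush()  # runs on success and when the handler raises
        return wrapper
    return decorator


# Stand-in for the tracing SDK's flush call (hypothetical).
flushed = []


def fake_flush():
    flushed.append("flush")


@flush_after(fake_flush)
def handler(event, context=None):
    """Example serverless-style handler."""
    return {"status": 200, "echo": event}


print(handler({"q": "hi"}))
print(flushed)
```

    Putting the flush in a finally block matters most for failed invocations, since those are the traces you are most likely to need.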

    If part of your team is also running Langfuse, you don't have to migrate everything at once. Catalyst accepts Langfuse SDK traffic directly: point Langfuse traces into Catalyst by changing three environment variables.

    FAQ

    Is AgentOps free? There's a free Basic tier that includes up to 5,000 events per month. Paid plans start at $40/month for Pro, and Enterprise (SOC-2, HIPAA, self-hosting, SSO) is custom-priced.

    Is AgentOps open source? The SDK is. It's MIT-licensed on GitHub with around 5.6k stars. The hosted platform and dashboard are a commercial product built on top of it.

    What's the difference between AgentOps and MLOps? MLOps operationalizes machine learning models: training, deployment, drift monitoring. Agent operations (the discipline sense of "AgentOps") operationalizes autonomous agents, systems that use models to take actions, chain tasks, and behave non-deterministically. That demands trace-level observability and continuous improvement loops that MLOps tooling wasn't built for.

    What's the best AgentOps alternative? It depends on which criterion binds first: Catalyst for open-format capture with built-in cross-run analysis and evals, LangSmith if you're all-in on LangChain, Langfuse if you need to self-host, Phoenix for eval-heavy research workflows. The comparison table above summarizes each tool's trace format, evals, cross-run analysis, and entry pricing.

    Choosing Your Next Agent Monitoring Layer

    AgentOps is a reasonable first monitoring layer: fast setup, broad framework support, and session replay that makes individual runs easy to read. The structural limits arrive with production scale. You get single-session inspection where you need cross-run analysis, thin eval workflows, compliance gated to Enterprise pricing, and a proprietary event model while the industry consolidates on OpenTelemetry and OpenInference.

    Whichever direction you choose, instrument in an open format so the choice stays reversible. If you want to see what your agents are actually doing in production, and what keeps going wrong across runs, the fastest path is to start capturing OpenInference-shaped traces today.

    Trace your first agent run in minutes

    Install the Catalyst tracing SDK and call setup() before your clients. You get the full trace tree: agent, LLM, and tool spans with cost, latency, and token usage.
