    Jun 16, 2026

    Arize Phoenix Alternatives for Agent Observability (2026)

    Inference Research

    The Real Decision Isn't Phoenix vs. Another Trace Format

    Most "Arize Phoenix alternatives" pages start by tearing Phoenix down. That's a strange way to open, because Phoenix has earned its position. It's the most widely adopted open-source AI observability platform. It ships tracing, evals, datasets, experiments, and a prompt playground in one self-hostable package, and it's still moving: version 17.4.0 shipped in June 2026. Arize also created and maintains OpenInference, the OpenTelemetry semantic conventions that much of the LLM observability ecosystem now speaks.

    That last fact changes how you should think about alternatives. Because OpenInference spans are portable, leaving Phoenix doesn't mean throwing away your instrumentation. The real question isn't "Phoenix versus a different trace format." It's "what layers do you want on top of OpenInference traces": a self-hosted viewer and eval library you operate yourself, or a managed platform that adds rubric-based evals, gateway-level cost tracking, and automated cross-run failure analysis.

    This guide gives Phoenix honest credit, lays out the specific reasons teams outgrow it, compares six alternatives as they stand in June 2026 (including two that were acquired this year), and walks through a migration path that reuses most of what you've already built.

    Read time: 13 minutes


    What Arize Phoenix Does Well — and Where Teams Hit the Ceiling

    Arize Phoenix is an open-source AI observability platform for tracing, evaluating, and experimenting on LLM applications. It accepts traces over OpenTelemetry, ships auto-instrumentation for popular frameworks and providers, and bundles evals, versioned datasets, experiments, and prompt management under the Elastic License 2.0.

    Phoenix's strengths are real and worth naming:

    • The easiest self-host in the category. Phoenix runs as a single Docker container, with SQLite as the default store and PostgreSQL recommended for production. There's a Helm chart for Kubernetes, and it runs anywhere from a laptop to a Jupyter notebook to a cloud deployment. Arize's own comparison FAQ points out that Langfuse, by contrast, requires you to separately operate ClickHouse, Redis, and S3-compatible storage. That's vendor positioning, sure, but the single-container point checks out.
    • It created the standard. OpenInference defines ten span kinds (LLM, EMBEDDING, RETRIEVER, RERANKER, TOOL, CHAIN, AGENT, GUARDRAIL, EVALUATOR, and PROMPT), with openinference.span.kind required on every span. If you've instrumented with Phoenix, you've been writing portable telemetry all along.
    • Broad instrumentation. Auto-instrumentation covers LangChain, LangGraph, LlamaIndex, DSPy, the OpenAI Agents SDK, the Vercel AI SDK, CrewAI, and the major model providers, across Python, TypeScript, and Java.
    • Evals are first-class and free. The arize-phoenix-evals library shipped its latest major release in May 2026, and response evals, retrieval evals, datasets, experiments, and the prompt playground are all open source.
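
    To make the span-kind point concrete, here is a minimal sketch of a check that a span carries a valid openinference.span.kind. The ten kinds and the attribute key come from the list above; the function name check_span_kind and the plain-dict view of a span's attributes are illustrative assumptions, not part of Phoenix or any OpenInference tooling.

```python
from typing import Optional

# The ten span kinds named in the article.
OPENINFERENCE_SPAN_KINDS = frozenset({
    "LLM", "EMBEDDING", "RETRIEVER", "RERANKER", "TOOL",
    "CHAIN", "AGENT", "GUARDRAIL", "EVALUATOR", "PROMPT",
})

SPAN_KIND_KEY = "openinference.span.kind"


def check_span_kind(attributes: dict) -> Optional[str]:
    """Return a problem description, or None when the span kind is valid.

    `attributes` is a hypothetical plain-dict view of one span's attributes.
    """
    kind = attributes.get(SPAN_KIND_KEY)
    if kind is None:
        return f"missing required attribute {SPAN_KIND_KEY!r}"
    if kind not in OPENINFERENCE_SPAN_KINDS:
        return f"unknown span kind {kind!r}"
    return None


print(check_span_kind({SPAN_KIND_KEY: "TOOL"}))  # None
print(check_span_kind({"llm.model_name": "gpt-4o-mini"}))
```

    A check like this is cheap to run against an exported batch of spans before pointing the same instrumentation at a different backend.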

    The Specific Reasons Teams Outgrow Phoenix

    So why are you reading an alternatives page? In practice, four ceilings show up repeatedly.

    1. The ops burden grows with your span volume. Self-hosting Phoenix means owning the container, the Postgres instance, the backups, and the upgrades. Open-source Phoenix has no documented support for sharding or distributed Postgres. Arize's guidance for deployments past 200 million spans is aggressive retention policies, regular pruning of low-value spans, scaling up database resources, and horizontally scaling Phoenix containers, all of it still backed by a single Postgres instance. Agent workloads get there faster than you'd think: one user action can fan out into dozens of spans.

    2. Retention and access controls are your problem. Phoenix's default retention policy is infinite (PHOENIX_DEFAULT_RETENTION_POLICY_DAYS=0), which means disk growth is unmanaged until you configure time-based or count-based cleanup. Meanwhile, the compliance features enterprises ask for (SOC 2 reports, HIPAA, uptime SLAs, configurable retention with support behind it) live in Arize AX's Enterprise tier, not in open-source Phoenix.

    3. The single-trace inspection ceiling. Phoenix is genuinely good at "open this trace and walk the tree." What it doesn't offer is the production workflow that follows: finding the failure mode that recurs across five hundred runs, ranked by impact, with evidence. Even competing vendors that credit Phoenix's notebook workflows and LLM-as-judge evals point at the span-tree-first UX straining on very long agent runs and the absence of SQL-style querying across traces. Phoenix's new PXI Agent helps you debug a trace interactively, but there's no scheduled, cross-run failure analysis.

    4. Eval scale. Phoenix evals run beautifully from notebooks and scripts against your instance. Scoring production traffic continuously, with managed judge infrastructure and results wired into dashboards, is where teams end up shopping for a platform.
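
    The fan-out point in the first item is easy to quantify. The sketch below estimates how long an unpruned deployment takes to reach the 200 million span mark; the traffic figures are hypothetical inputs chosen for illustration, not measurements from any real system.

```python
import math

SPAN_THRESHOLD = 200_000_000  # the 200 million span mark cited above


def days_to_threshold(actions_per_day: int, spans_per_action: int,
                      threshold: int = SPAN_THRESHOLD) -> int:
    """Whole days until cumulative spans reach `threshold`, assuming no pruning."""
    spans_per_day = actions_per_day * spans_per_action
    if spans_per_day <= 0:
        raise ValueError("traffic must be positive")
    return math.ceil(threshold / spans_per_day)


# Hypothetical workload: 50,000 user actions a day, 40 spans per action.
print(days_to_threshold(50_000, 40))  # 2,000,000 spans/day -> 100
```

    Under those assumed numbers the threshold arrives in about three months, which is why retention settings deserve attention at setup time rather than after the disk fills.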

    One more nuance: Phoenix's Elastic License 2.0 is free for internal self-hosting but restricts offering Phoenix as a hosted service — mostly relevant if you're a platform team reselling observability.

    None of this means Phoenix is fading. It shipped a release this month. It means Phoenix occupies a specific point in the design space: self-operated, single-trace-centric, notebook-eval-native. Production agent teams often need a different point.

    Retention is a useful lens on the trade: self-hosted Phoenix gives you unlimited retention if you manage the disk, while the managed platforms meter it. Here's what the entry paid tiers actually keep:

    Figure 1: Default Retention on Entry Paid Tiers - Days of trace data access (June 2026)

    What to Look For in a Phoenix Replacement

    Before comparing tools, fix the criteria. Six dimensions separate the contenders.

    OpenInference and OTel compatibility. Your instrumentation is an investment. A replacement that speaks OpenInference natively, or at minimum ingests OTLP and maps the attributes, protects it. Catalyst emits OpenInference wire keys byte-identical to upstream; LangSmith's OTel endpoint maps both gen_ai.* and OpenInference attributes; Langfuse accepts OTLP.

    Full agent execution capture. An agent trace is more than LLM calls. You want full message history including tool and tool-result messages, tool calls with names, IDs, and JSON arguments, model metadata and invocation parameters, token usage, and errors, plus stable agent identity and session grouping so runs roll up by agent rather than landing as orphan spans. For a deeper treatment, see our guide to what production agent traces should capture.

    Eval depth. Can you define what "good" means in plain language and score outputs against it? That should work offline on curated datasets and, on mature platforms, online against production traffic.

    Cost attribution. Token counts on spans are a start; per-request cost capture at the gateway level, aggregated by model and task, is what finance asks for.

    Security posture. SOC 2, secret handling, encryption, and retention policies you configure rather than build.

    Analysis automation. The dimension most comparisons skip entirely: does the platform read your traces for you and tell you what keeps going wrong, or does it only render them?
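
    The cost attribution dimension above comes down to a roll-up: per-request costs summed by model and task. A minimal sketch of that aggregation follows; the RequestCost record shape, the task labels, and the dollar figures are hypothetical, and a real gateway would supply its own fields.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestCost:
    """One gateway-level cost record; the field names are illustrative."""
    model: str
    task: str
    cost_usd: float


def cost_by_model_and_task(records):
    """Sum per-request cost into {(model, task): total_usd}."""
    totals = defaultdict(float)
    for record in records:
        totals[(record.model, record.task)] += record.cost_usd
    return dict(totals)


# Hypothetical records for illustration only.
records = [
    RequestCost("gpt-4o-mini", "triage", 0.25),
    RequestCost("gpt-4o-mini", "triage", 0.50),
    RequestCost("gpt-4o-mini", "summarize", 1.00),
]
for key, total in sorted(cost_by_model_and_task(records).items()):
    print(key, total)
```

    Token counts on spans can feed the same roll-up, but only if a price table is applied somewhere; capturing cost per request at the gateway removes that step.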

    Arize Phoenix Alternatives Compared

    Here's how the field looks in June 2026, with Phoenix as the baseline.

    Tool          | License / model   | Self-host           | Evals                | Standout / caveat
    --------------|-------------------|---------------------|----------------------|-------------------------
    Arize Phoenix | ELv2 OSS          | Single container    | Library, offline     | Baseline; you run ops
    Catalyst      | Managed platform  | No (managed)        | Rubric judge*        | Halo cross-run analysis
    Langfuse      | MIT OSS + cloud   | Multi-service stack | Online judge         | ClickHouse-owned (2026)
    LangSmith     | Proprietary       | Enterprise-only     | Online + offline     | Deepest LangChain fit
    Braintrust    | Proprietary       | Enterprise          | Eval-first, CI gates | Observability secondary
    Helicone      | OSS + cloud proxy | Yes                 | Basic                | Maintenance mode (2026)
    Arize AX      | Proprietary SaaS  | Enterprise option   | Online evals         | 25k–50k span tiers

    Assumptions: Catalyst evals run offline on datasets today; online evals are coming soon. Pricing details for each tool are covered in the sections below. "Self-host" reflects the production-grade documented path as of June 2026.

    Free tiers are where most evaluations start, and the included volumes vary by two orders of magnitude. Each vendor meters a different unit (spans, traces, units, or requests), so read the bars as directional rather than equivalent:

    Figure 2: Free-Tier Monthly Volume Included - Spans, traces, units, or requests included at $0 (June 2026)

    Catalyst (inference.net)

    Catalyst Tracing is built on the same standard Phoenix created: spans are emitted as OpenInference-shaped OpenTelemetry over OTLP/HTTP, with wire keys byte-identical to upstream OpenInference. Two SDKs, @inference/tracing for TypeScript and inference-catalyst-tracing for Python, share one setup() entry point, and the docs cover sixteen integrations including LangChain, LangGraph, the OpenAI Agents SDK, Anthropic, the Vercel AI SDK, and Pydantic AI. On top of the traces sit the layers Phoenix leaves to you: rubric-based evals scored by an LLM judge (offline today, online coming soon), gateway-level cost capture per request, and Halo, an open-source engine that analyzes traces across runs and ranks systemic failure modes with cited trace IDs. The platform is SOC 2 Type II compliant and strips API keys and credentials from traces automatically, and the free tier includes 1M spans per month, forty times Arize AX's free allotment.

    Langfuse vs Phoenix

    The Langfuse vs Phoenix question comes up constantly because both are open source, and the honest answer is that they trade places depending on what you value. Langfuse is MIT-licensed where Phoenix is ELv2, and Langfuse Cloud's entry tiers are cheap: a free Hobby tier with 50k units per month, Core at $29, Pro at $199. Its eval story is strong: LLM-as-a-judge on live production traces, code evaluators, and human annotation queues. But self-hosting Langfuse means operating a web container, a worker, PostgreSQL, ClickHouse, Redis, and S3-compatible storage — a heavier footprint than Phoenix's single container, not a lighter one. Langfuse was acquired by ClickHouse in January 2026; the core project remains MIT and self-hostable, and Langfuse Cloud continues as a standalone service. Both vendors publish FAQ pages arguing against each other (Arize leads with self-host simplicity, Langfuse with license purity), and both read as positioning.

    LangSmith

    LangSmith is the proprietary, deepest-integration choice for LangChain and LangGraph shops: tracing, offline and online LLM-as-judge evaluation, dashboards, and alerting, with zero-config capture for LangChain applications. It ingests OpenTelemetry at its OTel endpoint and maps OpenInference attributes, so Phoenix-style instrumentation isn't stranded. The economics are the watch-item: Developer is free with 5,000 base traces a month, Plus is $39 per seat with 10,000 included, and beyond that you pay $2.50 per 1,000 base traces with 14-day retention, or $5.00 per 1,000 for 400-day retention. Self-hosting is Enterprise-only. Per-trace billing punishes chatty agents.
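
    The per-trace arithmetic is worth running before committing. The sketch below uses only the Plus prices quoted above ($39 per seat, 10,000 included traces, $2.50 or $5.00 per 1,000 beyond that). It assumes the included traces are counted once per workspace rather than per seat and that a single retention rate applies to all overage; both are simplifications to check against the vendor's current pricing page. The workload is hypothetical.

```python
def plus_monthly_cost(traces: int, seats: int = 1,
                      extended_retention: bool = False) -> float:
    """Rough monthly bill in USD under the simplifying assumptions above."""
    seat_price = 39.00
    included_traces = 10_000          # assumed to be per workspace
    rate_per_1000 = 5.00 if extended_retention else 2.50
    overage = max(0, traces - included_traces)
    return seats * seat_price + (overage / 1000) * rate_per_1000


# Hypothetical team: 3 seats, 1,000,000 traces in a month.
print(plus_monthly_cost(1_000_000, seats=3))                           # 2592.0
print(plus_monthly_cost(1_000_000, seats=3, extended_retention=True))  # 5067.0
```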

    Braintrust

    Braintrust approaches the problem from the eval side: experiments against datasets, side-by-side prompt and model comparison, CI/CD gates that block deploys on regression, plus Loop, an AI agent that generates prompts, scorers, and datasets. The Starter tier is free with 1 GB of processed data ($4/GB after), and Pro runs $249 per month. If your bottleneck is regression testing rather than production observability, it's a serious option. Just know that observability is the second act here, not the headline.

    Helicone

    Helicone earns a caveat rather than a recommendation. Its proxy-based model (change your API base URL, get logging) is the lowest-friction integration in the category, with a free tier of 10,000 requests a month. But Helicone was acquired by Mintlify in March 2026 and is in maintenance mode: security patches, bug fixes, and new model support continue, while active feature development has ended. For a team migrating off a platform, migrating onto one in maintenance mode is hard to justify.

    Arize AX: Phoenix's Managed Sibling

    The path Arize would prefer you take is upgrading to Arize AX, and for some teams it's right: same vendor, managed infrastructure, online evals, and the Alyx agent. The catch is the meter. AX Free includes 25,000 spans a month with 15-day retention; AX Pro at $50 a month includes 50,000 spans with 30-day retention; everything bigger is an Enterprise conversation. An agent product doing real traffic can blow through 50,000 spans in a day. Langfuse's competing FAQ also points out, fairly, that self-hosted Phoenix has no feature parity with AX Cloud — the OSS and managed products are genuinely different tiers.

    Same Standard, Different Layers: OpenInference Beyond Phoenix

    The most useful mental model for this decision: the instrumentation layer and the platform layer are separate, and Phoenix happens to bundle them.

    Figure 3: Same Standard, Different Layers - OpenInference instrumentation is shared on both paths; the platform layers above it differ.

    OpenInference is a specification, not a product. Ten span kinds, defined attribute keys, OTLP as transport. Catalyst's SDKs implement that specification directly: the wire keys (openinference.span.kind, session.id, agent.id, llm.model_name, and the rest) are byte-identical to upstream OpenInference. The knowledge you built instrumenting for Phoenix (what a CHAIN span is, how sessions group, where tool arguments live) transfers without translation.

    Mechanically, the swap is small. Both Catalyst SDKs initialize with a single setup() call before your clients are constructed; the Python package auto-detects installed SDKs, and export is configured through three environment variables. The Catalyst tracing quickstart shows the same three-layer model Phoenix users already know: automatic SDK spans, an agent span for identity, manual spans for custom steps.

    import os
    
    from inference_catalyst_tracing import agent_span, setup
    from openai import OpenAI
    
    tracing = setup()
    client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    
    with agent_span(
        tracing.tracer,
        agent_id="hello-agent",
        agent_name="Hello Agent",
        session_id="session-001",
    ) as span:
        span.set_input("Reply with just the word hello.")
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": "Reply with just the word hello."}],
            max_tokens=16,
        )
        span.set_output(response.choices[0].message.content or "")
    
    tracing.shutdown()

    The agent_span wrapper carries agent.id, agent.name, and session.id so the Agents dashboard groups runs by agent and conversation. These are the same attributes OpenInference defines, doing the same job they did in Phoenix.
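
    To see why stable identity matters, consider what a grouping step has to do with spans that lack it. The sketch below is an illustration of the idea, not the dashboard's implementation: it buckets span attribute dicts by agent.id and session.id and sets aside anything missing either key. The function name and the sample data are hypothetical.

```python
from collections import defaultdict


def group_runs(spans):
    """Group span attribute dicts by (agent.id, session.id).

    Returns (grouped, orphans); spans missing either key are orphans.
    """
    grouped = defaultdict(list)
    orphans = []
    for attrs in spans:
        agent_id = attrs.get("agent.id")
        session_id = attrs.get("session.id")
        if agent_id is None or session_id is None:
            orphans.append(attrs)
        else:
            grouped[(agent_id, session_id)].append(attrs)
    return dict(grouped), orphans


# Hypothetical spans: two share an identity, one has none.
spans = [
    {"agent.id": "hello-agent", "session.id": "session-001", "name": "run-1"},
    {"agent.id": "hello-agent", "session.id": "session-001", "name": "run-2"},
    {"name": "stray-llm-call"},
]
grouped, orphans = group_runs(spans)
print(len(grouped[("hello-agent", "session-001")]), len(orphans))  # 2 1
```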

    Evals Beyond Notebook Scripts

    Phoenix's eval workflow reflects its origins: a Python library you drive from notebooks and scripts, with strong templates for response and retrieval evaluation. That's a good workflow for research. The production question is different: how do you keep scoring quality after you ship, without a data scientist in the loop for every run?

    Catalyst's answer is rubric-based evaluation. You describe a quality dimension in plain English and an LLM judge applies it to every output. Every rubric includes the {{ eval_model_response }} variable and scores on a 1–10 scale by default. Rubrics come in two forms: direct rubrics grade the output against the rubric alone, while adherence rubrics grade against a reference response. Eval datasets come from captured production traffic or uploaded JSONL, which closes the loop between what users actually send and what you test against. The mechanics are covered in the docs on writing a rubric.
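
    As a rough picture of what a direct rubric involves, the sketch below checks that a rubric references the {{ eval_model_response }} variable, substitutes a response into it, and validates a judge score against the default 1–10 range. The helper names are hypothetical, the tolerance for whitespace inside the braces is an assumption, and on the platform the rendering and judging happen server-side rather than in your code.

```python
import re

# Matches the rubric variable named in the article; whitespace tolerance is assumed.
RESPONSE_VAR = re.compile(r"\{\{\s*eval_model_response\s*\}\}")


def render_rubric(rubric: str, response: str) -> str:
    """Substitute the model response into the rubric text."""
    if not RESPONSE_VAR.search(rubric):
        raise ValueError("rubric must reference {{ eval_model_response }}")
    # A callable replacement avoids backslash-escape surprises in `response`.
    return RESPONSE_VAR.sub(lambda _match: response, rubric)


def validate_score(score: int, low: int = 1, high: int = 10) -> int:
    """Reject judge scores outside the rubric's scale."""
    if not low <= score <= high:
        raise ValueError(f"score {score} outside {low}-{high}")
    return score


rubric = (
    "Rate how directly the answer addresses the customer's question.\n"
    "Answer: {{ eval_model_response }}"
)
print(render_rubric(rubric, "Your order ships Tuesday."))
```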

    One honest caveat: Catalyst's offline evals are available today, and online evals (scoring live traffic as it flows) are coming soon. If continuous online judging is your hard requirement this quarter, Langfuse and LangSmith both offer it now.

    From Inspecting Traces to Fixing Recurring Failures

    Every tool in this comparison renders trace trees. The differentiating question is what happens after the four-hundredth trace lands, because nobody reads four hundred trees.

    This is the gap Halo fills. Halo (Hierarchical Agent Loop Optimization) is an open-source, RLM-based engine (pip install halo-engine, MIT licensed) that reads OpenTelemetry-compatible spans, decomposes them to find systemic failure modes across many runs, and writes up concrete fixes with citations back to specific trace IDs. You can run it self-hosted against a JSONL trace export, or hosted inside the Catalyst Agents dashboard, where the Analysis tab runs the same engine against traces you've already collected.

    The hosted version also runs on a schedule: hourly, daily, weekly, or monthly, with the lookback window pre-filled to match the cadence and capped at 30 days per run. Each run produces a ranked report in the analysis history, which turns regression detection from "someone noticed" into a standing process. The docs on running Halo on your traces walk through both modes. The rest of the dashboard supports the workflow: a per-agent workspace with overview metrics (runs, error rate, latency, tokens, cost), sessions, and filtered traces.
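
    The shape of a cross-run report can be shown with a toy aggregation. This is deliberately not Halo's method, which the article describes as an RLM-based engine that decomposes spans; it is a plain group-and-count over a hypothetical list of run records, included only to show what "ranked failure modes with cited trace IDs" looks like as data.

```python
from collections import defaultdict


def rank_failure_modes(runs, max_citations=3):
    """Rank error signatures by frequency, citing a few trace IDs for each.

    `runs` is an iterable of dicts with 'trace_id' and 'error' (None when OK).
    """
    by_error = defaultdict(list)
    for run in runs:
        if run["error"] is not None:
            by_error[run["error"]].append(run["trace_id"])
    # Most frequent first; ties broken alphabetically for stable output.
    ranked = sorted(by_error.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return [
        {"failure": error, "count": len(ids), "trace_ids": ids[:max_citations]}
        for error, ids in ranked
    ]


# Hypothetical run records for illustration only.
runs = [
    {"trace_id": "t1", "error": "tool_timeout"},
    {"trace_id": "t2", "error": None},
    {"trace_id": "t3", "error": "bad_json_args"},
    {"trace_id": "t4", "error": "tool_timeout"},
]
for row in rank_failure_modes(runs):
    print(row)
```

    Exact-match grouping like this breaks down quickly on real agents, where the same underlying problem surfaces under many different error strings; that gap is what a dedicated analysis engine is meant to close.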

    This is the layer Phoenix's single-trace view can't offer, and it's the strongest reason teams that like Phoenix still move.

    Migrating a Phoenix-Instrumented Agent to Catalyst

    Because both ends speak OpenInference, migration is closer to a re-point than a rewrite. The whole path:

    1. Install the SDK with your integration extras. pip install 'inference-catalyst-tracing[langgraph]' or whichever extras match your stack; available extras include openai, anthropic, langchain, langgraph, openai-agents, pydantic-ai, and more.
    2. Set three environment variables. CATALYST_OTLP_ENDPOINT=https://telemetry.inference.net, CATALYST_OTLP_TOKEN (an API key from the dashboard), and a stable CATALYST_SERVICE_NAME.
    3. Replace your Phoenix registration with setup(). Where Phoenix had you register a tracer provider and attach OpenInference instrumentors, Catalyst's setup() does both in one call, and Python auto-detects installed packages.
    4. Wrap top-level runs in agent_span. Stable agent.id and session.id make runs group in the Agents dashboard and become analyzable by Halo.
    5. Verify. Open the dashboard's Traces tab, or stay in the terminal: inf trace list --range 1h, then inf trace get <trace-id> --view tree.
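
    Step 2 is the one most likely to fail silently in a new environment, so a preflight check can save a debugging session. The sketch below only inspects the three variable names listed above; the helper name is hypothetical, and whether setup() performs its own validation is not something this article states.

```python
import os

# The three variables named in step 2.
REQUIRED_VARS = (
    "CATALYST_OTLP_ENDPOINT",
    "CATALYST_OTLP_TOKEN",
    "CATALYST_SERVICE_NAME",
)


def missing_catalyst_env(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Example with an explicit mapping instead of the real environment.
example_env = {
    "CATALYST_OTLP_ENDPOINT": "https://telemetry.inference.net",
    "CATALYST_SERVICE_NAME": "counter-graph",
}
print(missing_catalyst_env(example_env))  # ['CATALYST_OTLP_TOKEN']
```

    Calling missing_catalyst_env() with no argument at process start, before setup(), gives a clear error message in place of an empty Traces tab.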

    One operational note: spans are batched, so match your flush pattern to your process shape. Call shutdown() before exit for scripts, shutdown() on SIGTERM for services, and a per-invocation force_flush() for serverless. If you'd rather not do any of this by hand, inf instrument --mode tracing launches a coding agent (Claude Code, OpenCode, or Codex) that scans your codebase and wires it up for review.
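
    For the long-running service case, the SIGTERM pattern might look like the sketch below. It uses a stand-in object so the example runs on its own; the only thing assumed about the real tracing handle is the shutdown() method shown in the code samples in this article. The serverless force_flush() call is left out because its exact signature is not given here.

```python
import signal
import sys


class StubTracing:
    """Stand-in for the object returned by setup(); illustration only."""

    def __init__(self):
        self.shutdown_calls = 0

    def shutdown(self):
        self.shutdown_calls += 1


def make_sigterm_handler(tracing):
    """Build a handler that flushes batched spans, then exits cleanly."""
    def handler(signum, frame):
        tracing.shutdown()
        sys.exit(0)
    return handler


def install_sigterm_hook(tracing):
    """Register the handler; call once from the main thread at startup."""
    signal.signal(signal.SIGTERM, make_sigterm_handler(tracing))


# In a service you would pass the real handle instead of the stub:
#     tracing = setup()
#     install_sigterm_hook(tracing)
```

    Exiting from inside the handler keeps the orchestrator's stop request from being swallowed while still giving the final batch a chance to export.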

    LangChain and LangGraph

    Most Phoenix deployments instrument LangChain or LangGraph through OpenInference instrumentors. Catalyst uses the same callback path: a compiled graph emits a graph-level CHAIN span with node-level CHAIN spans parented underneath, so your trace shape survives the move. The LangGraph tracing guide has the full pattern; here's the core of it:

    from inference_catalyst_tracing import agent_span, setup
    from langgraph.graph import END, START, StateGraph
    from typing_extensions import TypedDict
    
    class GraphState(TypedDict):
        count: int
    
    tracing = setup(service_name="counter-graph")
    
    def increment(state: GraphState) -> GraphState:
        return {"count": state["count"] + 1}
    
    def double(state: GraphState) -> GraphState:
        return {"count": state["count"] * 2}
    
    builder = StateGraph(GraphState)
    builder.add_node("increment", increment)
    builder.add_node("double", double)
    builder.add_edge(START, "increment")
    builder.add_edge("increment", "double")
    builder.add_edge("double", END)
    
    graph = builder.compile()
    with agent_span(
        tracing.tracer,
        agent_id="counter-graph-agent",
        agent_name="Counter Graph Agent",
        span_name="counter-graph.run",
        session_id="conversation-counter-1",
        agent_role="workflow",
        system="langgraph",
    ) as span:
        input_state = {"count": 1}
        span.set_input(input_state)
        result = graph.invoke(input_state)
        span.set_output(result)
        print(result)
    
    tracing.shutdown()

    OpenAI Agents SDK

    For the OpenAI Agents SDK, the post-migration code is shorter than what it replaces. After setup(), the agent run, its tool calls, handoffs, and nested model calls are all captured automatically. No per-run wrapper is required, though you can add agent_span for stable identity. The OpenAI Agents tracing guide covers handoffs and sessions; the minimal version:

    from inference_catalyst_tracing import setup
    from agents import Agent, Runner
    
    tracing = setup()
    
    support_agent = Agent(
        name="SupportAgent",
        instructions="Help customers with order questions.",
        model="gpt-4o-mini",
    )
    
    # No agent_span needed: the run and its nested model calls are captured
    # automatically.
    result = Runner.run_sync(support_agent, "Where is order ABC-123?")
    print(result.final_output)
    
    tracing.shutdown()

    Running a mixed estate? Existing LangSmith instrumentation bridges over with LANGSMITH_TRACING=true and LANGSMITH_TRACING_MODE=otel, and existing Langfuse SDK code forwards by changing three LANGFUSE_* environment variables, with no code changes in either case.

    Decision Framework: Stay, Switch, or Layer

    There's no universal winner; there's a right answer per situation.

    Your situation                  | Best fit        | Why
    --------------------------------|-----------------|----------------------------
    Data can't leave network        | Stay on Phoenix | Single-container self-host
    MIT license is a must           | Langfuse        | Full self-host parity
    All-in on LangGraph             | LangSmith       | Native framework capture
    Eval-gated CI bottleneck        | Braintrust      | Experiments + deploy gates
    Already proxying Helicone       | Re-evaluate     | Maintenance mode
    Want Phoenix's vendor, managed  | Arize AX        | Same family, span caps
    Agent volume + failure analysis | Catalyst        | Halo, evals, 1M free spans

    Stay on self-hosted Phoenix when your traces can't leave your network, you have the ops capacity to own Postgres and retention, and your volumes sit comfortably inside a single database. The notebook-driven eval workflow is a feature, not a bug, for research-heavy teams.

    Pick Langfuse if MIT licensing and full-parity self-hosting are your top criteria and you're prepared to operate the ClickHouse-backed stack. Pick LangSmith if your application is LangChain/LangGraph end to end and per-trace billing fits your volumes. Pick Braintrust if eval-driven CI is the bottleneck and observability is secondary. Skip Helicone for new builds while it's in maintenance mode. Pick Arize AX if you want Phoenix's vendor with managed infrastructure and your span volumes genuinely fit 50k-span tiers.

    Pick managed-plus-analysis (Catalyst) when you're running agents at production volume and the jobs to be done are bigger than viewing traces: cost attribution per request, rubric evals on data sampled from real traffic, and a scheduled engine that tells you what keeps failing. The free tier (1M spans a month) absorbs a real workload while you decide.

    The Standard Is Shared. Choose Your Layers.

    Phoenix earned the ecosystem it created, and OpenInference is the best thing about it, precisely because it makes Phoenix optional. Every serious entry on this list of Arize Phoenix alternatives reads the telemetry you already emit. Your spans, your span kinds, and your instrumentation habits are portable. The decision in front of you is which layers should sit on top of them: a stack you operate, or a platform that also does the analysis.

    If you want to test the portability claim, it's an afternoon of work: the same setup() pattern, the same attributes, and your first traces visible before the coffee goes cold.
