Best Langfuse Alternatives for Agent Observability, Tracing, and Evals

The Short Version

Langfuse is one of the most popular tools for LLM observability, and for good reason: it logs LLM calls, manages prompts, and runs evaluations, all from an open-source codebase you can self-host. But teams shipping multi-step agents keep running into the same wall. Logging individual LLM calls is not the same as understanding what an agent did, and searching for Langfuse alternatives usually starts the day an agent fails in production and nobody can explain why.

The best Langfuse alternatives fall into four categories: agent-native tracing platforms (Catalyst), eval-first tools (Braintrust, LangSmith), APM extensions (Datadog LLM Observability), and DIY OpenTelemetry pipelines. Two more sit between categories: Arize Phoenix straddles agent-native tracing and open-source self-hosting, and Helicone's gateway logging sits closest to plain monitoring. The right pick depends on whether you need agent-level observability, evaluation workflows, ops consolidation, or full control.

This guide compares all seven options with June 2026 pricing, explains what each billing model actually costs at agent volumes, covers when staying on Langfuse is the right call, and walks through a migration path that forwards your existing Langfuse traces to a new backend by changing three environment variables. No SDK rewrite required to evaluate the switch.

Read time: 13 minutes

Why Teams Look for Langfuse Alternatives

Nobody leaves a tool that's working. The search for an alternative starts when the shape of your workload changes, and for most teams that change is the move from single LLM calls to agents.

Agent workloads vs LLM-call-level observability

Langfuse's data model is observation-centric. A trace contains observations (spans, events, and generations) plus scores attached to them. That model is a clean fit for the apps Langfuse grew up with: a RAG pipeline that makes one or two model calls, a chat feature with a prompt you iterate on weekly. You log the call, inspect the prompt and completion, attach a score, ship.

Agents stress every part of that model. A single agent run might span thirty steps: LLM calls, tool calls with JSON arguments and results, framework steps, retries, and sub-agent handoffs. Debugging it requires the full execution tree (every message, every tool argument, every error), not a list of model calls. Runs need to group into sessions, and the agent itself needs a stable identity so that "support-agent v2 deployed Tuesday" and "support-agent v3 deployed Friday" show up as the same agent with comparable history, not two unrelated streams.

There's a deeper limitation that applies to most of this category, Langfuse included: trace viewers show you one run at a time. When an agent fails in 4% of runs for three different reasons, no amount of scrolling individual traces tells you what the three reasons are or which one matters most. That cross-run question is the one production agent teams actually need answered, and we'll come back to it.

Pricing pressure at agent volumes

Langfuse Cloud bills per unit, and a unit is any tracing data point: traces, observations, and scores all count. A single logged LLM call costs about two units (the trace plus one generation), while an agent run with thirty observations costs about thirty-one, roughly fifteen times as many. The dashboards don't change, but the bill does. We'll break down the actual numbers, including what self-hosting really costs, in the pricing section below.

The ClickHouse acquisition

In January 2026, ClickHouse acquired Langfuse. The announcement was unambiguous: nothing changes for users, Langfuse remains 100% open source under its MIT license, and cloud customers keep their existing SLAs. Langfuse has kept shipping since the deal: a dedicated Japan cloud region, experiments runnable from CI, and a full launch week in May 2026.

Still, an acquisition is a structural change in who owns your observability vendor's roadmap, and re-evaluating dependencies after one is ordinary engineering diligence, not panic. If you were already feeling the agent-shaped limits above, the acquisition is simply the natural moment to look around.

What to Look For in a Langfuse Alternative

Before comparing tools, fix the criteria. These eight separate the contenders from the brochures:

Open span conventions. Traces should be OpenTelemetry-based, ideally following OpenInference semantic conventions, so your instrumentation outlives any single vendor and the next migration is an exporter change, not a rewrite.
Full execution capture. Message history (user, system, assistant, tool), tool calls with their JSON arguments and results, framework steps, model metadata, and errors, not just prompt and completion.
Session grouping and agent identity. Runs must group by session and by a stable agent identity across deploys, or your history fragments every time you rename or redeploy an agent.
Eval support, offline and online. You need both: scored runs against fixed datasets before release, and rubric-based scoring of production traffic after.
Cost tracking. Token usage and per-call cost should roll up by agent, session, and model without spreadsheet work.
Self-hosting terms. "Open source" spans a wide range: MIT lets you do nearly anything, Elastic License 2.0 restricts offering the software as a managed service, and closed-source tools self-host only on enterprise contracts.
Security posture. SOC 2 / ISO 27001 reports, retention controls, and data residency options if you handle regulated data.
Cross-run analysis. Does the platform find recurring failure modes across hundreds of runs, or does it only store traces for you to inspect one by one? This is the rarest capability on the list.

Weight them for your situation, but be suspicious of any tool that can't answer the first three — those are table stakes for agent observability.
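To make "full execution capture" and "stable agent identity" concrete, here is a minimal sketch of the data shape involved. The Span record, its field names, and the sample data are illustrative assumptions, not any vendor's schema; real OpenTelemetry spans carry the same parent/child links plus attributes.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """Hypothetical, simplified span record (not a vendor schema)."""
    span_id: str
    parent_id: Optional[str]
    name: str
    kind: str  # e.g. "agent", "llm", "tool"
    attributes: dict = field(default_factory=dict)


def build_tree(spans):
    """Index spans by parent id so a run can be walked as a tree."""
    children = defaultdict(list)
    for span in spans:
        children[span.parent_id].append(span)
    return children


def render(children, parent_id=None, depth=0):
    """Return indented lines, one per span, in execution-tree order."""
    lines = []
    for span in children.get(parent_id, []):
        lines.append("  " * depth + f"{span.kind}: {span.name}")
        lines.extend(render(children, span.span_id, depth + 1))
    return lines


def runs_by_agent(spans):
    """Group root agent spans by a stable agent id so deploys stay comparable."""
    grouped = defaultdict(list)
    for span in spans:
        if span.parent_id is None and span.kind == "agent":
            grouped[span.attributes.get("agent.id", "unknown")].append(span.span_id)
    return dict(grouped)


# Two runs of the same agent from different deploys (made-up sample data).
spans = [
    Span("a1", None, "support-agent", "agent", {"agent.id": "support-agent", "version": "v2"}),
    Span("l1", "a1", "plan", "llm"),
    Span("t1", "a1", "lookup_order", "tool", {"arguments": '{"order_id": "123"}'}),
    Span("a2", None, "support-agent", "agent", {"agent.id": "support-agent", "version": "v3"}),
]

tree_lines = render(build_tree(spans))
print("\n".join(tree_lines))
print(runs_by_agent(spans))
```

Because both root spans share one agent id, the v2 and v3 runs land in the same group even though the version attribute differs. That is the property the third criterion asks for.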

The Best Langfuse Alternatives, by Category

Vendor listicles rank brands. Categories are more useful, because the categories disagree about what observability is for.

Figure 1: Langfuse Alternatives by Category — keyed to the need that drives the switch. Phoenix straddles agent-native tracing and open-source self-hosting; Helicone's gateway logging sits closest to the monitoring use case.

Catalyst (inference.net): agent-native tracing plus cross-run analysis

Catalyst treats the agent, not the LLM call, as the unit of observability. Its first-party TypeScript and Python SDKs initialize with a single setup() call and emit OpenInference-shaped OpenTelemetry spans over OTLP, so the data stays portable. Auto-instrumentation covers OpenAI, Anthropic, the Vercel AI SDK, LangChain, LangGraph, OpenAI Agents, Claude Agent SDK, Claude Code, Pydantic AI, LiveKit Agents, ElevenLabs, Cursor SDK, and PI AI, and it ingests traces from Langfuse and LangSmith SDKs directly.

Captured spans include full message history, tool calls with JSON arguments and results, and token usage down to prompt-cache counts. Wrapping a run in an agentSpan gives it a stable agent ID and session, which is what makes runs comparable across deploys. Two things distinguish it from everything else in this list: Halo, an analysis engine that reads traces across many runs and ranks recurring failure modes (covered in its own section below), and a Langfuse drop-in mode that accepts traffic from your existing Langfuse SDK without code changes.

Honest limits: it's a younger platform than Langfuse, prompt management is not its center of gravity, and the Langfuse forwarding integration is v1: scores and evaluations don't forward yet.

Catalyst publishes a side-by-side comparison if you want the feature-level detail.

LangSmith: if you live in LangChain

LangSmith is LangChain's own observability and eval platform, and inside that ecosystem nothing else matches its integration depth: LangGraph runs land as properly shaped traces with zero ceremony. Its eval tooling, datasets, and annotation queues are mature. The trade-offs are structural: it's closed source, self-hosting is enterprise-only, and pricing combines $39 per seat per month with per-trace charges, so costs scale with both team size and traffic. The standing debate between the two (open-source self-hosting versus ecosystem depth) is covered in our in-depth Langfuse vs LangSmith comparison; the short version is that the choice comes down to how committed you are to LangChain.

Braintrust: the eval-first loop

Braintrust starts from evaluation rather than observability: experiments, scorers, playgrounds, and CI gates that block a merge when quality drops. Teams that treat prompt changes like code changes love it. Production observability is the secondary feature, and retention shows it: 14 days on the free tier, 30 days on the $249/month Pro plan. That makes long-horizon debugging hard. Pick it when your bottleneck is pre-release quality, not production diagnosis.
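The CI-gate idea is simple enough to sketch in a few lines. The example below is a generic illustration, not Braintrust's API: the scorer, the dataset, the baseline, and the tolerance are all hypothetical, and the model call is replaced by a canned lookup so the sketch runs on its own.

```python
from statistics import mean


def exact_match(expected: str, actual: str) -> float:
    """Toy scorer: 1.0 when the normalized strings match, else 0.0."""
    return 1.0 if expected.strip().lower() == actual.strip().lower() else 0.0


def gate(dataset, predict, baseline: float, tolerance: float = 0.02):
    """Score every example and decide whether the change may merge.

    Returns (passed, score). The gate fails when the mean score drops more
    than `tolerance` below the stored baseline.
    """
    scores = [exact_match(row["expected"], predict(row["input"])) for row in dataset]
    score = mean(scores)
    return score >= baseline - tolerance, score


# Hypothetical fixed dataset and a stand-in for the model call under test.
dataset = [
    {"input": "2+2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
    {"input": "opposite of hot", "expected": "cold"},
    {"input": "first month", "expected": "January"},
]
canned = {
    "2+2": "4",
    "capital of France": "paris",
    "opposite of hot": "warm",
    "first month": "January",
}

passed, score = gate(dataset, canned.get, baseline=0.90)
print(f"score={score:.2f} passed={passed}")
# In CI, a failing gate would end the job with a non-zero exit code,
# for example sys.exit(0 if passed else 1).
```

Three of the four canned answers match, so the mean score is 0.75 and the gate fails against a 0.90 baseline. A real setup would swap in model-graded or task-specific scorers and store the baseline from the last accepted run.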

Arize Phoenix: the open-source, OTel-native swap

Phoenix is the most direct answer to "I want open-source LLM observability with agent-shaped tracing." It self-hosts via Docker or Kubernetes, bundles tracing, evals, datasets, and experiments, and is the home of the OpenInference conventions themselves. Two caveats. First, the license is Elastic License 2.0 rather than MIT, and ELv2 restricts offering the software as a managed service. Second, like Langfuse, it's an inspect-one-trace-at-a-time tool: you operate it, and you do the cross-run analysis yourself.

Helicone: gateway-first request logging

Helicone takes the proxy route: change your SDK's base URL and every request gets logged with cost and latency. Setup really is one line, and for cost monitoring across providers it's excellent. But a proxy sees requests, not your agent's internal structure. There's no tool-call tree and no framework steps, so it can't replace trace-level debugging for multi-step agents. Plans run from a free 10,000 requests/month tier through Pro at $79/month to Team at $799/month with SOC 2 and HIPAA.

Datadog LLM Observability: the APM extension

If your company already runs on Datadog, its LLM Observability product (now marketed under Agent Observability) puts agent traces next to your infrastructure metrics, with billing at $160/month (annual) for the first 100k LLM spans and $3.50 per additional 10k. Consolidation is the whole pitch, and for ops teams it's a real one. What you give up are the LLM-native workflows (evals, datasets, prompt iteration), and span-based billing gets expensive when chatty agents emit thousands of spans per session.

Raw OpenTelemetry: the DIY route

Everything in this article ultimately speaks OpenTelemetry, so you can skip vendors entirely: instrument with OTel, export to Jaeger, Tempo, or ClickHouse, and pay nothing in license fees. The catch is that generic trace viewers don't render conversations, tool-call payloads, or token costs, and you'll build eval plumbing, cost rollups, and session views yourself on conventions that are still maturing. It's the right call for platform teams with strong opinions and spare capacity, and a tar pit for everyone else.
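For a sense of what "build it yourself" means, here is a sketch of just one of those pieces, a cost rollup over exported LLM spans. The attribute names follow the OpenInference conventions (llm.model_name, llm.token_count.prompt, llm.token_count.completion, session.id); the model names and prices are placeholders invented for illustration, and the input is assumed to be a list of attribute dicts already pulled from your trace store.

```python
from collections import defaultdict

# Placeholder prices in USD per 1M tokens (input, output). Made up for
# illustration; substitute your providers' real rates.
PRICES = {"model-a": (1.00, 4.00), "model-b": (0.10, 0.40)}


def span_cost(attrs: dict) -> float:
    """Cost of one LLM span from its token-count attributes."""
    in_price, out_price = PRICES[attrs["llm.model_name"]]
    return (
        attrs["llm.token_count.prompt"] * in_price
        + attrs["llm.token_count.completion"] * out_price
    ) / 1_000_000


def rollup(spans):
    """Sum cost per (session id, model name) pair."""
    totals = defaultdict(float)
    for attrs in spans:
        key = (attrs["session.id"], attrs["llm.model_name"])
        totals[key] += span_cost(attrs)
    return dict(totals)


# Hypothetical exported spans: two calls in one session, one in another.
llm_spans = [
    {"session.id": "s1", "llm.model_name": "model-a",
     "llm.token_count.prompt": 1000, "llm.token_count.completion": 500},
    {"session.id": "s1", "llm.model_name": "model-a",
     "llm.token_count.prompt": 2000, "llm.token_count.completion": 0},
    {"session.id": "s2", "llm.model_name": "model-b",
     "llm.token_count.prompt": 10000, "llm.token_count.completion": 1000},
]

totals = rollup(llm_spans)
for (session, model), cost in sorted(totals.items()):
    print(f"{session} {model} ${cost:.4f}")
```

This covers one rollup dimension with a hard-coded price table. Keeping prices current, handling cached tokens, and adding per-agent views are the parts that turn a sketch like this into ongoing maintenance.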

Comparison Table

The table below condenses the criteria from earlier. Billing model, hosting terms, and eval support are where the seven options genuinely differ.

Tool	Category	Billing unit	Self-host / license	Evals
Catalyst	Agent-native tracing	Spans	Managed	✅ Offline + online
LangSmith	Eval + tracing	Seats + traces	Enterprise only	✅ Mature
Braintrust	Eval-first	GB + scores	Enterprise only	✅ Core focus
Arize Phoenix	Open-source tracing	Free (self-host)	✅ ELv2	✅ Built in
Helicone	Gateway logging	Requests	Enterprise only	⚠️ Limited
Datadog LLM Obs.	APM extension	LLM spans	❌ SaaS	⚠️ Limited
Raw OpenTelemetry	DIY pipeline	Infra cost	✅ Apache 2.0	❌ Build yourself
Langfuse (baseline)	LLM observability	Units*	✅ MIT	✅ Built in

Assumptions: a Langfuse unit is any tracing data point — traces, observations, and scores all count. Catalyst includes Halo cross-run failure analysis; no other row has an equivalent. "Enterprise only" = self-hosting available only on enterprise contracts.

Langfuse Pricing and Hosting, Honestly

Since Langfuse pricing is half the reason people land on articles like this one, here are the actual numbers as of June 2026.

Langfuse Cloud has four tiers: Hobby is free with 50k units/month, 30-day retention, and 2 users; Core is $29/month with 100k units and 90-day retention; Pro is $199/month and adds 3-year retention, SOC 2 and ISO 27001 reports, and a HIPAA BAA, with a $300/month Teams add-on for enterprise SSO and fine-grained RBAC; Enterprise is $2,499/month with audit logs, SCIM, and an uptime SLA. Paid tiers include unlimited users with no per-seat charge. Overage starts at $8 per 100k units and steps down to $7, $6.50, and $6 at higher volumes.

Figure 2: Langfuse Cloud Overage Pricing - USD per 100k units beyond plan allowance, by monthly volume band

The number that matters is the unit definition: every trace, every observation, and every score is a unit. A simple LLM app logging one generation per request consumes about two units per request. An agent whose runs contain thirty observations consumes about thirty-one units per run. Same traffic, more than an order of magnitude more units. Unit-based billing is fair in the sense that you pay for what you send; it just means agent adoption quietly multiplies your observability bill.
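The arithmetic is easy to run against your own traffic. The sketch below uses the unit definition and the Core plan figures quoted above (100k included units, first overage band at $8 per 100k). The monthly run count is hypothetical, and pricing all overage at the first-band rate with prorated billing is an assumption that yields an upper bound, since higher bands are cheaper.

```python
def units_per_run(observations: int, scores: int = 0) -> int:
    """One unit for the trace itself, plus one per observation and per score."""
    return 1 + observations + scores


def monthly_units(runs_per_month: int, observations: int, scores: int = 0) -> int:
    """Total billable units for a month of identical runs."""
    return runs_per_month * units_per_run(observations, scores)


def overage_upper_bound(units: int, included: int = 100_000,
                        rate_per_100k: float = 8.0) -> float:
    """Overage in USD, priced entirely at the first-band rate.

    Assumes prorated billing; real bills may be lower because higher
    volume bands are cheaper.
    """
    extra = max(0, units - included)
    return extra / 100_000 * rate_per_100k


runs = 50_000  # hypothetical monthly traffic
llm_app = monthly_units(runs, observations=1)
agent = monthly_units(runs, observations=30)

print(llm_app, agent, agent / llm_app)
print(overage_upper_bound(llm_app), overage_upper_bound(agent))
```

With these numbers the single-call app lands exactly on Core's 100k included units, while the agent sends 1.55M units for the same 50,000 runs: 15.5 times as many, and up to $116 of monthly overage on top of the plan price.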

Self-hosting is Langfuse's trump card: the MIT-licensed open-source version includes all core platform features (observability, evals, prompt management, datasets) with no usage limits. Your cost is infrastructure and operations (it's a ClickHouse-backed deployment you run and upgrade). A self-hosted enterprise edition with RBAC, retention policies, and audit logs is custom-priced, additive to a ClickHouse commercial plan.

How the alternatives bill is at least as important as what they charge, because the units aren't comparable. LangSmith bills per seat plus per trace, with retention as the price lever: $2.50 per 1,000 base traces kept 14 days, $5.00 per 1,000 kept 400 days (see our LangSmith trace billing breakdown for worked examples). Braintrust bills on processed gigabytes plus scores. Helicone bills per request. Datadog bills per LLM span. Catalyst's free tier includes 1M spans/month with no per-seat charges, with paid plans at $25/month (10M spans) and $250/month (50M spans). Phoenix and raw OTel are free software plus your infrastructure.

Tool	Free tier	Entry paid plan	Billing unit	Retention (entry)
Langfuse	50k units/mo	$29/mo	Units	90 days
Catalyst	1M spans/mo	$25/mo	Spans	—
LangSmith	5k traces/mo	$39/seat/mo	Seats + traces	14 days
Braintrust	1 GB + 10k scores	$249/mo	GB + scores	30 days
Helicone	10k requests/mo	$79/mo	Requests	1 month
Datadog LLM Obs.	—	$160/mo*	LLM spans	Extended costs extra

Assumptions: prices as of June 2026. Datadog: $160/mo on annual billing ($200 monthly) covers the first 100k LLM spans, then $3.50 per 10k additional (annual). LangSmith base-trace retention is 14 days; 400-day extended traces cost $5.00/1k. Langfuse overage starts at $8/100k units. Catalyst Starter ($25/mo) includes 10M spans; retention not published on the pricing page. Phoenix and raw OTel are free software plus your own infrastructure.

Figure 3: Entry Paid Plan, Monthly Price - Cheapest paid tier per platform, June 2026 (billing bases differ)

Run the projection against your own traffic shape before deciding anything. A seat-billed tool is cheap for two engineers and expensive for twenty; a span-billed tool is the reverse.
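A projection like that fits in a short script. The sketch below uses only the list prices quoted in this section, with several simplifying assumptions: one trace per run, no included trace allowance on LangSmith's paid plan, Datadog overage billed per started block of 10k LLM spans, and every Catalyst span counted against the tier limit. The team sizes and traffic figures are hypothetical.

```python
import math


def langsmith_monthly(seats: int, traces: int) -> float:
    """Seats at $39 plus base traces at $2.50 per 1,000 (no allowance assumed)."""
    return seats * 39.0 + traces / 1_000 * 2.50


def datadog_monthly(llm_spans: int) -> float:
    """$160 covers the first 100k LLM spans, then $3.50 per additional 10k."""
    extra = max(0, llm_spans - 100_000)
    return 160.0 + math.ceil(extra / 10_000) * 3.50


def catalyst_monthly(spans: int):
    """Cheapest published tier that fits; None above the listed tiers."""
    for limit, price in ((1_000_000, 0.0), (10_000_000, 25.0), (50_000_000, 250.0)):
        if spans <= limit:
            return price
    return None


def project(seats: int, runs: int, spans_per_run: int, llm_spans_per_run: int) -> dict:
    """Monthly list-price estimate per tool for one traffic shape."""
    return {
        "langsmith": langsmith_monthly(seats, runs),
        "datadog": datadog_monthly(runs * llm_spans_per_run),
        "catalyst": catalyst_monthly(runs * spans_per_run),
    }


# Same traffic, two team sizes (hypothetical figures).
small_team = project(seats=2, runs=100_000, spans_per_run=30, llm_spans_per_run=10)
large_team = project(seats=20, runs=100_000, spans_per_run=30, llm_spans_per_run=10)

print(small_team)
print(large_team)
```

Only the seat-billed line moves when the team grows; only the span-billed lines move when agents get chattier. Treat the output as a rough comparison of billing shapes, not a quote, and re-check each vendor's current pricing page before relying on it.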

When Langfuse Is Still the Right Choice

An honest comparison admits the incumbent wins sometimes. Stay on Langfuse when:

MIT self-hosting is a hard requirement. No alternative on this list self-hosts a full-featured platform under a license this permissive for free.
Prompt management is the center of your workflow. Versioned prompts with deployment workflows remain a Langfuse strength the agent-native tools don't prioritize.
Your app is genuinely LLM-call-shaped. A RAG Q&A feature or a classification endpoint doesn't need agent-level capture; Langfuse's model fits it precisely.
Data residency drives the decision. Langfuse Cloud's regions now include a dedicated Japan deployment alongside the EU.

And when each category wins instead: agent-native platforms when multi-step agents are in production and debugging them is your bottleneck; eval-first tools when pre-release quality gates matter more than production diagnosis; the APM extension when consolidating on existing ops tooling outweighs LLM-native workflows; DIY OpenTelemetry when you have platform engineers and conventions of your own.

Migrating Without Losing History

The biggest hidden cost of switching observability vendors is usually re-instrumentation. This is the part where evaluating a Langfuse alternative normally costs a sprint, and the part Catalyst reduces to an environment change. Here's the full path:

Step 1: Flip three environment variables. Catalyst exposes endpoints compatible with the Langfuse SDK's ingestion and OTel routes, so your existing Langfuse instrumentation sends its traces to Catalyst without an SDK swap:

export LANGFUSE_HOST="https://telemetry.inference.net"
export LANGFUSE_PUBLIC_KEY="pk-catalyst"
export LANGFUSE_SECRET_KEY="<your-catalyst-api-key>"

Your traces, generations, spans, usage, costs, inputs, outputs, metadata, users, sessions, and tags all convert to Catalyst spans automatically. Two caveats: the integration is v1, so Langfuse scores and evaluations are not ingested yet, and forwarding applies to traffic from the switch onward, so historical data stays in your Langfuse instance.
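Since the whole switch lives in three variables, a small preflight check can catch a half-applied change before you start looking for traces that never arrive. The helper below is a hypothetical convenience, not part of either SDK; the variable names and host are the ones from the export lines above.

```python
import os

# Values taken from the export lines above; the check itself is a
# hypothetical helper, not part of the Langfuse or Catalyst SDKs.
EXPECTED_HOST = "https://telemetry.inference.net"
REQUIRED = ("LANGFUSE_HOST", "LANGFUSE_PUBLIC_KEY", "LANGFUSE_SECRET_KEY")


def check_forwarding_env(env) -> list:
    """Return a list of human-readable problems; an empty list means ready."""
    problems = [f"{name} is not set" for name in REQUIRED if not env.get(name)]
    host = env.get("LANGFUSE_HOST", "")
    if host and host.rstrip("/") != EXPECTED_HOST:
        problems.append(f"LANGFUSE_HOST points at {host}, not {EXPECTED_HOST}")
    if env.get("LANGFUSE_SECRET_KEY", "").startswith("<"):
        problems.append("LANGFUSE_SECRET_KEY still holds the placeholder value")
    return problems


if __name__ == "__main__":
    for problem in check_forwarding_env(os.environ):
        print("warning:", problem)
```

Run it in the same shell or container as the application, because environment variables set in one process are not visible to another.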

Step 2: Mark your agents. Catalyst doesn't assume every trace is an agent. Set agent.name (and optionally a stable agent.id) on the top-level span of each agent so its runs group in the Agents dashboard and become available for analysis:

from langfuse import observe
from opentelemetry import trace

@observe(as_type="agent", name="deep-research")
def run_agent(question: str) -> str:
    # Mark this top-level span as an agent. Set ONLY here, not on children.
    span = trace.get_current_span()
    span.set_attribute("agent.name", "deep-research")
    span.set_attribute("agent.id", "deep-research-v1")  # stable id; optional
    # ... your agent loop, tool calls, and nested generations ...
    return "answer"

Step 3: Move to the first-party SDK when you want more. The env-var mode is the evaluation on-ramp. For a fully shaped trace tree (explicit agent spans, manual tool and retriever spans, session IDs), follow the tracing SDK quickstart and call setup() before your clients are constructed:

# Install: pip install 'inference-catalyst-tracing[openai]'
import os

from inference_catalyst_tracing import agent_span, setup
from openai import OpenAI

tracing = setup()
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

with agent_span(
    tracing.tracer,
    agent_id="hello-agent",
    agent_name="Hello Agent",
    session_id="session-001",
) as span:
    span.set_input("Reply with just the word hello.")
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Reply with just the word hello."}],
        max_tokens=16,
    )
    span.set_output(response.choices[0].message.content or "")

tracing.shutdown()

Python installs take per-integration extras, like pip install 'inference-catalyst-tracing[openai,langchain]'.

Step 4 (optional): Add the Gateway. Pointing your SDK at the Catalyst Gateway base URL (see the Gateway quickstart) captures every LLM request with cost and latency at roughly 10ms of overhead, and you can filter and save the captured traffic to build datasets for evals or fine-tuning:

import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.inference.net/v1",
    api_key=os.environ["INFERENCE_API_KEY"],
    default_headers={
        "x-inference-provider-api-key": os.environ["OPENAI_API_KEY"],
        "x-inference-provider": "openai",
    },
)

response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user", "content": "Hello, world!"}],
)

print(response.choices[0].message.content)

Total migration cost for the evaluation: three environment variables and one attribute on your top-level span.

Beyond Inspecting Traces: Cross-Run Analysis with Halo

Every tool in this article will store your traces and render a tree view. That's the category's ceiling, and it leaves the hardest question manual: what keeps going wrong across all of these runs?

Halo (Hierarchical Agent Loop Optimization) is Catalyst's answer, and it's open source under MIT. It reads OpenTelemetry-compatible spans, decomposes them across many runs to find systemic failure modes, and writes up ranked findings. Each finding cites the specific trace IDs it came from, so you can click straight into the evidence. You can self-host it (pip install halo-engine) or run it hosted inside the Catalyst dashboard, on demand or on a schedule from hourly to monthly with windows up to 30 days.

That scheduling detail is the point. A weekly report that says "your refund agent's top failure mode is the lookup tool returning empty results, here are 14 example traces" turns observability from a debugging archive into an improvement loop: trace, analyze, fix, re-run, confirm.
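To show the shape of the cross-run question, as opposed to the one-trace-at-a-time view, here is a deliberately naive sketch. It is not how Halo works internally: it just counts failed runs by a coarse (step, error) signature and keeps example trace IDs as evidence. The run records and their keys are hypothetical.

```python
from collections import defaultdict


def rank_failure_modes(runs, max_examples=3):
    """Group failed runs by a coarse signature and rank by frequency.

    Each run is a dict with hypothetical keys: trace_id, ok, failed_step, error.
    """
    groups = defaultdict(list)
    for run in runs:
        if not run["ok"]:
            groups[(run["failed_step"], run["error"])].append(run["trace_id"])
    ranked = sorted(groups.items(), key=lambda item: len(item[1]), reverse=True)
    return [
        {
            "signature": signature,
            "count": len(trace_ids),
            "share_of_runs": len(trace_ids) / len(runs),
            "examples": trace_ids[:max_examples],
        }
        for signature, trace_ids in ranked
    ]


# Made-up run summaries; in practice these would be derived from stored traces.
runs = [
    {"trace_id": "t1", "ok": True, "failed_step": None, "error": None},
    {"trace_id": "t2", "ok": False, "failed_step": "lookup_tool", "error": "empty_result"},
    {"trace_id": "t3", "ok": False, "failed_step": "lookup_tool", "error": "empty_result"},
    {"trace_id": "t4", "ok": False, "failed_step": "refund_tool", "error": "timeout"},
    {"trace_id": "t5", "ok": True, "failed_step": None, "error": None},
]

findings = rank_failure_modes(runs)
for finding in findings:
    print(finding)
```

Exact-match signatures like these only catch failures that already carry a clean error label. Recognizing that differently worded failures share one root cause is the part a frequency count cannot do.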

Which Alternative Fits Your Stack

A quick decision guide, organized by framework:

LangChain / LangGraph. LangSmith is the native default and a fine one. Catalyst instruments both frameworks too and is worth a look if you want agent grouping and cross-run analysis without per-seat pricing.
OpenAI Agents SDK. Catalyst has a dedicated integration; Phoenix also supports it via OpenInference.
Claude Agent SDK / Claude Code. Catalyst ships integrations for both, including tracing Claude Code subprocess calls.
Vercel AI SDK. Catalyst traces generateText, streamText, and tool loops; Phoenix covers it as well.
Pydantic AI. Catalyst's Python SDK has a dedicated integration.
Already on Datadog for everything else. Datadog LLM Observability, with eyes open about span billing.
Hard open-source requirement. Phoenix under ELv2, or staying on MIT-licensed Langfuse, are the defensible picks.

FAQ

Is Langfuse still open source after the ClickHouse acquisition? Yes. ClickHouse's January 2026 announcement states Langfuse remains 100% open source under its existing MIT license, and that nothing changes for users.

How much does Langfuse cost? Langfuse Cloud runs from a free Hobby tier (50k units/month) through Core at $29/month, Pro at $199/month, and Enterprise at $2,499/month, with overage from $8 per 100k units. Self-hosting the MIT open-source version is free apart from infrastructure.

What counts as a billable unit in Langfuse? Any tracing data point sent to the platform: traces, observations (spans, events, and generations), and scores all count. This is why agent workloads, which emit many observations per run, consume units quickly.

What's the best open-source Langfuse alternative? Arize Phoenix is the strongest open-source alternative: self-hosted tracing, evals, and datasets under the Elastic License 2.0. Note the license difference: Langfuse itself is MIT.

Can I try an alternative without rewriting my tracing code? With Catalyst, yes: point your existing Langfuse SDK at Catalyst by changing three environment variables, and your spans stream into its dashboard with no code changes.

The Bottom Line

Don't pick a brand; pick the category that matches your bottleneck. Agent-native tracing when production agents are failing and you can't see why. Eval-first when quality gates are the problem. APM when consolidation rules, DIY when you have the platform team for it. And Langfuse itself when your app is LLM-call-shaped and MIT self-hosting matters.

Whatever you choose, the cheapest evaluation is the one that doesn't require a rewrite. Since Catalyst ingests your existing Langfuse traces through an env-var change, you can see your own production agents in it this afternoon and judge with your own data.
