    Jun 16, 2026

    Langfuse vs LangSmith: Which LLM Observability Tool Fits Your Team?

    Inference Research

    TL;DR: The Verdict

    If you're weighing Langfuse vs LangSmith, the two LLM observability tools most teams shortlist, here's the short answer: choose Langfuse if you need open source, self-hosting, or framework-agnostic OpenTelemetry instrumentation. Choose LangSmith if your stack is built on LangChain or LangGraph and you want zero-setup tracing, native alerting, and a managed agent runtime. Both store and inspect traces well. The deciding factors are who runs the infrastructure, what framework you use, and how the billing math lands at your volume.

    A bit more precisely:

    • Choose Langfuse if self-hosting is a requirement, your stack spans multiple frameworks, or you want MIT-licensed code you can read and run. Self-hosting is free with near-full feature parity.
    • Choose LangSmith if you're committed to LangChain/LangGraph and value the zero-setup tracing, the alerting (webhooks plus PagerDuty), and the option to deploy agents on the same platform.
    • Choose neither alone if the reason you're shopping is that your agent keeps failing in production. Both tools show you individual traces; neither tells you what keeps going wrong across thousands of runs. That's a different category of tool, and we'll get to it.

    Every price in this article was verified against the live pricing pages on June 11, 2026. That matters more than usual here: several comparisons still ranking for this query cite LangSmith's pre-2026 trace prices, which are now off by 5x.

    Read time: 14 minutes


    Langfuse vs LangSmith at a Glance

    Before the details, the whole comparison in one table:

    | Dimension        | Langfuse            | LangSmith          |
    |------------------|---------------------|--------------------|
    | Open source      | MIT core            | No                 |
    | Self-hosting     | Free, full parity   | Enterprise only    |
    | Pricing model    | Per unit            | Per trace + seat   |
    | Free tier        | 50k units/mo        | 5k traces/mo       |
    | Framework fit    | Framework-agnostic  | LangChain-native   |
    | OpenTelemetry    | Native SDKs         | Ingest + export    |
    | Evals            | Judge, code, online | 30+ templates*     |
    | Alerting         | Webhooks only       | Native + PagerDuty |
    | Agent deployment | None                | Managed runtime    |

    * Template counts are vendor-reported (LangChain claims 30+ vs 8 in Langfuse). Prices and limits verified June 11, 2026.

    Most LangSmith vs Langfuse pages on the web are written by one of the two vendors, so here's a useful credibility check: each vendor's own comparison page concedes the other's strengths. Langfuse's page admits LangSmith wins on LangGraph deployment, zero-setup tracing for LangChain apps, and the community Prompt Hub. LangChain's page leans hard on alerting and eval automation, the places Langfuse is noticeably thinner. When the vendors agree on where the lines are, you can mostly trust those lines. Where they spin is in what those differences are worth. That's what the rest of this article works out.

    Core Capabilities Head-to-Head

    Tracing Depth

    Both platforms capture the things that matter: nested traces of LLM calls, tool invocations, retrieval steps, token usage, latency, and cost. The difference is the instrumentation model, not what you can see afterward.

    Langfuse structures everything as hierarchical observations (spans, generations, and events) emitted by OpenTelemetry-native SDKs, and claims integrations with more than 80 frameworks. If your stack mixes a LangGraph agent with a raw OpenAI client and a Vercel AI SDK frontend, Langfuse treats them all the same way.
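
    To make the hierarchy concrete, here is a minimal sketch of a trace as a tree of observations. The `Observation` class, its fields, and the span names are hypothetical and for illustration only; this is not the Langfuse SDK's data model, just the shape the paragraph above describes.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Observation:
    """One node in a trace tree (illustrative, not a Langfuse SDK class)."""
    kind: str  # "span", "generation", or "event"
    name: str
    children: List["Observation"] = field(default_factory=list)

    def walk(self) -> Iterator["Observation"]:
        """Yield this observation and every descendant, depth first."""
        yield self
        for child in self.children:
            yield from child.walk()


# Whatever framework produced them, the steps reduce to the same three kinds.
root = Observation("span", "agent_run", [
    Observation("span", "retrieve_docs", [
        Observation("generation", "rewrite_query"),
    ]),
    Observation("generation", "answer"),
    Observation("event", "user_feedback_received"),
])

kinds = [obs.kind for obs in root.walk()]
print(kinds)
```

    Because every node has the same shape, a backend can render, count, or bill a trace without knowing which framework emitted it.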

    LangSmith's tracing is built around run trees that mirror LangChain's execution model. Inside LangChain or LangGraph, tracing is genuinely zero-setup: set environment variables and every chain, tool call, and agent step appears with no code changes, a strength even Langfuse's comparison page concedes. Outside the framework, LangSmith now ingests standard OpenTelemetry traces at its OTel endpoint and maps GenAI semantic conventions, OpenInference, and TraceLoop attributes.

    Verdict: parity for most applications. Your instrumentation philosophy (framework-native run trees versus framework-neutral OTel spans) decides which feels better, not trace quality. For a deeper look at what belongs in a production trace in the first place, see our guide to what a production trace should capture.

    Evals

    This is where vendor pages diverge most from current reality, in both directions.

    Langfuse ships LLM-as-a-judge evaluators, deterministic code evaluators, human annotation queues, datasets, and experiments, and supports both online scoring of live production traces and offline evaluation before deploys. Two of those pieces are recent: code evaluators written as Python or TypeScript functions directly in the UI shipped in May 2026, and LLM-as-a-judge evaluators became manageable through the public API in April 2026.

    LangSmith counters with a larger evaluator library, which LangChain pegs at more than 30 templates (trajectory evaluation and prompt-injection detection included) against 8 built-in templates in Langfuse, plus automation rules that route low-quality runs to human review and promote high-confidence runs into datasets. Treat the template counts as vendor numbers, but the automation-rules gap is real.

    Worth flagging: LangChain's comparison page claims online deterministic evals are still "on the roadmap" for Langfuse. Langfuse's own documentation describes online scoring of production traces as shipped. Vendor comparison pages age badly; this one already has.

    Prompt Management, Datasets, and Dashboards

    Both platforms version prompts and link them to traces. LangSmith adds the Prompt Hub, a community prompt repository. Langfuse promoted Experiments to a top-level feature alongside Datasets in April 2026, and its v4 architecture, in preview on Langfuse Cloud since March 10, 2026, substantially speeds up chart loading and trace browsing on large projects. If you evaluated Langfuse before v4 and found the dashboards sluggish at scale, that critique is dated.

    Two asymmetries to know about. Alerting: LangSmith offers configurable alerts on run count, latency, and feedback scores, with webhook delivery and a PagerDuty integration; Langfuse has no native alerting and relies on webhook APIs and third-party wiring. Raw data access: Langfuse exposes its ClickHouse store for direct SQL over your traces, which LangSmith's managed SaaS doesn't match. Pick your poison: pager integration or SQL access.

    Open Source and Self-Hosting: Where Langfuse Wins

    Langfuse is open source with an MIT-licensed core, and you can self-host it for free. The repo has roughly 28,900 GitHub stars and was still receiving commits as of June 2026. Langfuse describes self-hosting as having full feature parity with its cloud product; in practice that means near-full parity, because exactly three features sit behind the paid Enterprise Edition license: Organization Creators, the Instance Management API, and UI Customization.

    LangSmith is proprietary. There is no open-source LangSmith, and self-hosted or hybrid deployment is available only on the Enterprise plan, behind a sales conversation. If your compliance posture requires traces to stay in your VPC and you don't want an enterprise contract, the decision is already made.

    One development worth addressing directly: ClickHouse acquired Langfuse (the deal was announced January 16, 2026), and the open-source project remains actively maintained. For most teams this is reassuring rather than alarming: Langfuse already ran on ClickHouse, and the acquirer's business depends on the database, not on squeezing observability licenses. But if your governance review asks "who owns this project," the answer changed this year, and you should know that going in.

    What Self-Hosting Actually Costs

    "Self-host for free" describes the license, not the platform. A production Langfuse deployment runs six moving parts: the Langfuse web container, an async worker, PostgreSQL for transactional data, ClickHouse for the trace store, Redis or Valkey for queues and caching, and S3-compatible blob storage. Docker Compose covers testing and low-scale use; production means Kubernetes with Helm or the Terraform modules for AWS, Azure, or GCP.

    At meaningful trace volume, expect a few hundred dollars a month in cloud infrastructure, plus the part that actually hurts: a fraction of an engineer who owns upgrades, scaling, and backups. That cost is invisible on the pricing page and very visible in your on-call rotation. Self-hosting is the right call when data residency demands it or when your volume makes cloud pricing painful. It is rarely the right call just to save the $29/month Core subscription.

    Framework Gravity: LangSmith Inside LangChain and LangGraph

    If your team is all-in on LangChain or LangGraph, LangSmith's pull is real. Tracing turns on with environment variables, every framework abstraction maps cleanly onto the run tree, and the prompt playground and eval tooling were designed around the same primitives your code uses. Third-party survey data suggests roughly 84% of LangSmith users work with LangChain. Directionally, LangSmith is the LangChain observability layer.

    LangSmith also does something Langfuse doesn't attempt: deployment. The platform includes a managed runtime for stateful agents with task queues, persistence, and one-click GitHub deploys. If you want your observability vendor to also run your agents, only one of these tools offers that.

    Now the lock-in question, which most comparisons get wrong. The lazy version says closed source means your data is captive. It isn't: LangSmith ingests OpenTelemetry traces, and its SDKs can export OTel spans to other backends via LANGSMITH_OTEL_ENABLED, or bypass LangSmith entirely with LANGSMITH_OTEL_ONLY=true. Your traces can leave whenever you want.
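
    As a rough sketch, the two exit modes described above come down to a pair of environment variables. The helper name, the mode labels, the "true" value for LANGSMITH_OTEL_ENABLED, and the choice to set both flags together in OTel-only mode are assumptions made for illustration; check the LangSmith documentation for the exact combination your SDK version expects.

```python
import os


def langsmith_otel_env(mode: str) -> dict:
    """Return env vars for a hypothetical 'dual' or 'otel_only' export mode."""
    if mode == "dual":
        # Keep LangSmith as a backend and also emit OTel spans (assumed value).
        return {"LANGSMITH_OTEL_ENABLED": "true"}
    if mode == "otel_only":
        # Emit OTel spans and bypass LangSmith (setting both is an assumption).
        return {"LANGSMITH_OTEL_ENABLED": "true", "LANGSMITH_OTEL_ONLY": "true"}
    raise ValueError(f"unknown mode: {mode!r}")


# Apply the chosen mode before the LangSmith SDK is initialised.
os.environ.update(langsmith_otel_env("dual"))
```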

    The real gravity is different: it's the accumulation of convenience. Zero-setup tracing only works inside LangChain. Your dashboards, automation rules, and eval datasets accrete around run-tree semantics. And the per-trace, per-seat billing model follows you as your team and volume grow. If you later migrate frameworks (say, to a custom agent loop or the OpenAI Agents SDK), the zero-setup advantage evaporates and you're instrumenting manually against a proprietary backend anyway. Teams confident in LangChain long-term can discount this heavily. Teams that hedge frameworks should weight it.

    Langfuse Pricing vs LangSmith Pricing: The Math Both Pages Obscure

    The pricing pages are accurate and individually clear. What they obscure is comparability, because the two products bill different units.

    LangSmith bills per trace, plus per seat. A trace is one execution of your application (an agent run, an evaluator, or a playground session) no matter how many LLM calls happen inside it. Your trace count therefore depends on where your instrumentation starts new top-level runs. Evaluation runs generate billable traces too; the trace definition explicitly covers evaluator executions. And free Developer accounts are capped at 5,000 traces a month until a credit card is on file.

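    | Plan       | Price    | Included traces | Seats     |
    |------------|----------|-----------------|-----------|
    | Developer  | $0/seat  | 5k base/mo      | 1 max     |
    | Plus       | $39/seat | 10k base/mo     | Unlimited |
    | Enterprise | Custom   | Custom          | Custom    |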

    Trace overage: base $2.50/1k (14-day retention), extended $5.00/1k (400-day), base-to-extended upgrade +$2.50/1k. Evaluation runs bill as traces; Developer accounts are capped at 5k traces/month until a card is added. Verified June 11, 2026.

    If you've read a "LangSmith pricing 2025" guide citing $0.50 per 1k base traces, throw it out: the live page lists base traces (14-day retention) at $2.50 per 1k and extended traces (400-day retention) at $5.00 per 1k, with base-to-extended upgrades costing another $2.50 per 1k. LangSmith's invoice can also carry non-observability meters (deployment runs, deployment uptime, Fleet runs, and Engine compute) that are out of scope here but worth knowing exist. For the full mechanics, including the verbosity traps, see our full LangSmith pricing breakdown.
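
    The LangSmith side of the bill can be sketched in a few lines. This is a simplified estimate, assuming the Plus list prices quoted above and treating every trace beyond the included 10k as either base or extended; the function name is ours, and it ignores the non-observability meters.

```python
def langsmith_monthly_cost(traces: int, seats: int, extended: bool = False) -> float:
    """Estimate a LangSmith Plus bill in USD from the list prices quoted above."""
    seat_price = 39.00         # per seat per month on Plus
    included_traces = 10_000   # base traces included per month on Plus
    rate_per_1k = 5.00 if extended else 2.50
    billable = max(0, traces - included_traces)
    return seats * seat_price + billable / 1_000 * rate_per_1k


# Example: 5 seats and 100,000 traces a month.
print(langsmith_monthly_cost(100_000, seats=5))
print(langsmith_monthly_cost(100_000, seats=5, extended=True))
```

    Note that seats and traces are independent terms: adding engineers raises the bill even when traffic is flat.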

    Langfuse bills per unit, with unlimited users on paid tiers. A unit is any tracing data point: a trace, an observation (a span, event, or generation), or a score. That definition is easy to skim past and it dominates the bill: one agent execution that produces a trace, 14 observations, and 2 evaluation scores consumes 17 units, not 1.

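    | Plan       | Price     | Included units | Data access | Users     |
    |------------|-----------|----------------|-------------|-----------|
    | Hobby      | $0        | 50k/mo         | 30 days     | 2         |
    | Core       | $29/mo    | 100k/mo        | 90 days     | Unlimited |
    | Pro        | $199/mo   | 100k/mo        | 3 years     | Unlimited |
    | Enterprise | $2,499/mo | 100k/mo        | 3 years     | Unlimited |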

    A unit is any tracing data point: a trace, an observation (span, event, generation), or a score. Overage: $8 per 100k units, lower with volume. Pro offers a $300/mo Teams add-on. Self-hosting is free (open source). Verified June 11, 2026.
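
    A minimal sketch of the Langfuse arithmetic, assuming the flat $8 per 100k overage rate with no volume discount applied. The function names are ours, not part of any Langfuse API.

```python
def langfuse_units(traces: int, observations_per_trace: int, scores_per_trace: int) -> int:
    """Each trace, each observation, and each score is one billable unit."""
    return traces * (1 + observations_per_trace + scores_per_trace)


def langfuse_monthly_cost(units: int, base_fee: float = 29.00) -> float:
    """Estimate a Langfuse Cloud bill in USD (default base fee is Core)."""
    included_units = 100_000
    overage_per_100k = 8.00
    billable = max(0, units - included_units)
    return base_fee + billable / 100_000 * overage_per_100k


# Example: 100,000 traces, each with 14 observations and 2 scores.
units = langfuse_units(100_000, observations_per_trace=14, scores_per_trace=2)
print(units, langfuse_monthly_cost(units))
```

    Passing `base_fee=199.00` prices the same usage on Pro. The multiplier inside `langfuse_units` is the term to watch, since it grows with every span you add to a run.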

    So LangSmith's cost scales with how many runs you start; Langfuse's cost scales with how deep each run is. Neither maps cleanly onto business traffic, which is why you should run your own shape through the math. Here's ours.

    What 100K Traces a Month Costs

    Assumptions: a 5-engineer team; a production agent whose average trace contains 14 observations and 2 scores (about 17 Langfuse units, or roughly 15 spans); default retention on each plan.

    • LangSmith Plus, base retention: seats 5 × $39 = $195; traces (100,000 − 10,000 included) × $2.50/1k = $225. Total ≈ $420/month at 14-day retention.
    • LangSmith Plus, extended retention: seats $195; traces 90,000 × $5.00/1k = $450. Total ≈ $645/month at 400-day retention.
    • Langfuse Core: 100K traces × 17 = 1.7M units; overage (1.7M − 100k included) = 1.6M units × $8/100k = $128; plus $29 base. Total ≈ $157/month at 90-day data access, or about $327 on Pro ($199 base + $128 overage) if you need 3-year access.
    • Same volume, shallow traces (~5 units each, e.g. simple RAG calls): 500k units, of which 400k are overage, so $32 overage + $29 base. Total ≈ $61/month on Core.

    That last pair is the insight the "Langfuse is cheaper" takes skip: Langfuse's advantage at 100K traces ranges from 2.7x to 7x depending entirely on how chatty your traces are.
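
    One way to see how much trace depth matters is to solve for the depth at which the two bills meet. The sketch below is our own derivation from the list prices above (LangSmith Plus at base retention versus Langfuse Core, no volume discounts), not a figure either vendor publishes.

```python
def breakeven_units_per_trace(traces: int, seats: int) -> float:
    """Units per trace at which Langfuse Core costs the same as LangSmith Plus (base)."""
    langsmith = seats * 39.00 + max(0, traces - 10_000) / 1_000 * 2.50
    # Solve 29 + (traces * u - 100_000) / 100_000 * 8 == langsmith for u.
    # Valid only when traces * u exceeds the 100k included units.
    return ((langsmith - 29.00) * 100_000 / 8.00 + 100_000) / traces


# 5 seats at 100,000 traces a month, matching the assumptions above.
print(round(breakeven_units_per_trace(100_000, seats=5), 1))
```

    Under these assumptions the crossover sits near 50 units per trace at 100K traces a month, so the 17-unit agent modeled here is well inside Langfuse's favorable range, while a much chattier agent would not be.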

    What 1M Traces a Month Costs

    Same assumptions, ten times the volume:

    • LangSmith Plus, base: seats $195; traces 990,000 × $2.50/1k = $2,475. Total ≈ $2,670/month at 14-day retention.
    • LangSmith Plus, extended: seats $195; traces 990,000 × $5.00/1k = $4,950. Total ≈ $5,145/month at 400-day retention.
    • Langfuse Core, deep traces: 17M units; overage 16.9M × $8/100k = $1,352; plus $29 base. Total ≈ $1,381/month before the volume discounts Langfuse advertises kick in.
    • Langfuse self-hosted: license $0; realistically a few hundred dollars of infrastructure plus the ops time discussed above.
    • Reference point, Catalyst: the same workload is about 15M spans, which lands in Catalyst's $250/month Growth tier (50M spans included); 100K traces/month (~1.5M spans) fits the $25 Starter tier, with no per-seat charges.
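
    | Scenario       | LangSmith base | LangSmith ext. | Langfuse Core | Catalyst |
    |----------------|----------------|----------------|---------------|----------|
    | 100K traces/mo | ~$420          | ~$645          | ~$157         | $25      |
    | 1M traces/mo   | ~$2,670        | ~$5,145        | ~$1,381       | $250     |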

    Assumptions: 5 engineers (LangSmith seats 5 × $39; Langfuse and Catalyst have no per-seat charge); ~17 Langfuse units and ~15 spans per trace; Langfuse volume discounts not applied. Retention differs: LangSmith base 14 days, extended 400 days, Langfuse Core 90 days. All list prices verified June 11, 2026; totals derived arithmetically in the surrounding prose.

    Figure 1: Monthly Cost at Volume (5-seat team, ~17 units / ~15 spans per trace, list prices verified June 11, 2026)

    Caveat on fairness: these totals buy different retention windows. LangSmith base keeps traces for 14 days, Langfuse Core for 90 days, and LangSmith extended for 400 days. If you're building eval datasets from production traces, short retention is a real cost, not a footnote.

    Interop and Migration Paths: OpenTelemetry Is Your Exit

    The healthiest way to make this decision is to make it reversible. Instrument once on open standards (OpenTelemetry spans with GenAI or OpenInference semantic conventions) and treat the backend as swappable. Both vendors now support this posture: LangSmith ingests OTel and exports it, and Langfuse's SDKs are OTel-native.
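
    A minimal, dependency-free sketch of what "instrument once" looks like in practice: span attributes keyed by the OpenTelemetry GenAI semantic conventions, and a backend chosen purely by the standard OTLP endpoint variable. The attribute keys are the ones we understand the GenAI conventions to define at the time of writing; the helper names and the example hostname are hypothetical, and a real application would hand these values to an OpenTelemetry SDK rather than plain dicts.

```python
import os


def genai_span_attributes(model: str, input_tokens: int, output_tokens: int) -> dict:
    """Describe one LLM call with vendor-neutral GenAI attribute keys."""
    return {
        "gen_ai.operation.name": "chat",
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }


def exporter_endpoint(default: str = "http://localhost:4318") -> str:
    """Read the standard OTLP endpoint variable; the backend is only config."""
    return os.environ.get("OTEL_EXPORTER_OTLP_ENDPOINT", default)


# Swapping backends changes one variable, not the instrumentation.
os.environ["OTEL_EXPORTER_OTLP_ENDPOINT"] = "https://backend-a.example/otel"
print(exporter_endpoint(), genai_span_attributes("example-model", 120, 45))
```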

    That also means you can trial an alternative without re-instrumenting. Two concrete paths:

    Already on LangSmith? Catalyst bridges LangSmith OpenTelemetry spans into Catalyst: set LANGSMITH_TRACING=true and LANGSMITH_TRACING_MODE=otel, and your existing traceable functions flow through the Catalyst tracer provider. If you leave the mode unset, Catalyst defaults LangSmith to hybrid mode: both backends receive spans, so you compare side by side without giving anything up.

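    import os
    
    from inference_catalyst_tracing import setup
    from langsmith import Client, traceable
    
    # Route LangSmith runs out as OpenTelemetry spans (see the paragraph above).
    os.environ["LANGSMITH_TRACING"] = "true"
    os.environ["LANGSMITH_TRACING_MODE"] = "otel"
    
    # Register the Catalyst tracer provider before creating the LangSmith client.
    tracing = setup(service_name="langsmith-worker")
    client = Client()
    
    # Existing @traceable functions stay exactly as they are.
    @traceable(name="answer_question", run_type="tool", client=client, enabled=True)
    def answer_question(question: str) -> str:
        return question.upper()
    
    print(answer_question("hi"))
    # Flush buffered LangSmith runs, then shut down the Catalyst tracer.
    client.flush()
    tracing.shutdown()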

    Already on Langfuse? Keep your tracing code and change three environment variables: pointing the Langfuse SDK at Catalyst ingests your traces, generations, spans, usage, costs, users, sessions, and tags with no SDK swap. One honest caveat: this integration is v1, and Langfuse scores and evaluations are not ingested yet.

    export LANGFUSE_HOST="https://telemetry.inference.net"
    export LANGFUSE_PUBLIC_KEY="pk-catalyst"
    export LANGFUSE_SECRET_KEY="<your-catalyst-api-key>"

    Either way, the strategic point stands: with OTel as the contract, "Langfuse vs LangSmith" stops being a marriage and becomes a configuration choice you can revisit with an env var.

    The Third Option: When Your Problem Is Fixing Failures, Not Storing Traces

    Most lists of LangSmith alternatives or Langfuse alternatives just swap one trace store for another. Step back from the feature tables and ask why you're shopping for observability at all. If the answer is "our agent keeps failing in production and we don't know why," there's an uncomfortable truth: a trace viewer doesn't solve that. Langfuse and LangSmith are both excellent at capturing runs and letting you inspect one trace at a time. Neither will read ten thousand traces and tell you the three things that keep breaking.

    That's the gap Halo targets. Halo is an open-source, RLM-based analysis engine that reads OpenTelemetry-compatible spans, decomposes them across many runs to find systemic failure modes, and writes up ranked findings with concrete fixes, each citing the specific trace IDs that prove it. You can run it on demand or on a schedule, hosted inside Catalyst's Agents dashboard, or self-host it with pip install halo-engine against a JSONL trace file.
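
    To show the difference between inspecting one trace and analyzing many, here is a deliberately naive sketch of cross-run aggregation over a JSONL span file. It is not Halo's algorithm or input schema: Halo is described above as an RLM-based engine, while this is a plain frequency count, and the field names (`trace_id`, `name`, `status`, `error`) are hypothetical.

```python
import json
from collections import defaultdict


def rank_failure_modes(jsonl_lines, top_n=3):
    """Group error spans by (span name, error) and rank by distinct failing traces."""
    failing = defaultdict(set)
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        span = json.loads(line)
        if span.get("status") != "error":
            continue
        key = (span.get("name", "unknown"), span.get("error", "unknown"))
        failing[key].add(span["trace_id"])
    # Most widespread failure first; ties broken alphabetically for stability.
    ranked = sorted(failing.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return [
        {"span": name, "error": err, "trace_ids": sorted(ids)}
        for (name, err), ids in ranked[:top_n]
    ]


sample = [
    '{"trace_id": "t1", "name": "search_tool", "status": "error", "error": "timeout"}',
    '{"trace_id": "t2", "name": "search_tool", "status": "error", "error": "timeout"}',
    '{"trace_id": "t2", "name": "llm_call", "status": "ok"}',
    '{"trace_id": "t3", "name": "parse_json", "status": "error", "error": "invalid json"}',
]
findings = rank_failure_modes(sample)
print(findings[0])
```

    Even this crude version returns findings that cite the trace IDs behind them, which is the property that makes a cross-run report checkable.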

    The substrate is Catalyst Tracing, which captures OpenInference-shaped spans on OpenTelemetry: full message history, tool calls with their JSON arguments and results, token usage, and errors. The free tier includes 1M spans a month with no per-seat charges, and because of the bridges in the previous section, it runs alongside whichever incumbent you're evaluating. You don't have to pick the third option to trial it; you just have to flip env vars on traffic you're already tracing.

    To be clear about scope: if what you need is a mature trace store with evals and prompt management, Langfuse and LangSmith are both strong answers and you should pick between them on the criteria above. The third option matters when the comparison you should be making isn't tool vs tool, but trace storage vs an improvement loop.

    Decision Framework: Match the Tool to Your Team

    Work down this list and stop at the first line that describes you:

    1. Compliance requires traces to stay in your infrastructure, and you won't sign an enterprise contract → Langfuse, self-hosted. It's the only free, full-parity self-host option here.
    2. You're all-in on LangChain/LangGraph and plan to stay → LangSmith. Zero-setup tracing plus native alerting and the deployment runtime compound inside the ecosystem.
    3. Your stack spans multiple frameworks, or you hedge framework bets → Langfuse or Catalyst; both are OTel-native and framework-neutral.
    4. Your traces are deep (agents with many steps) and volume is high → run the unit math first; Langfuse's per-unit billing multiplies with trace depth, and per-span pricing may beat it.
    5. You need 400-day retention for eval datasets → price LangSmith extended traces against Langfuse Pro's 3-year access at your real volume; the answer flips with scale.
    6. Your on-call rotation needs paging when feedback scores drop → LangSmith is the only one with native alerting and PagerDuty.
    7. Your actual problem is recurring production failures you can't diagnose → add Catalyst + Halo alongside whichever you pick, and let cross-run analysis find the patterns.
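
    The "stop at the first line that describes you" rule is a first-match lookup, which can be sketched as an ordered rule list. The `Team` fields and the `recommend` function are hypothetical names that paraphrase the seven lines above; the ordering, not the code, carries the opinion.

```python
from dataclasses import dataclass


@dataclass
class Team:
    """Flags paraphrasing the seven decision lines (illustrative names)."""
    must_self_host_without_enterprise_contract: bool = False
    all_in_on_langchain: bool = False
    multi_framework: bool = False
    deep_traces_high_volume: bool = False
    needs_400_day_retention: bool = False
    needs_native_paging: bool = False
    undiagnosed_recurring_failures: bool = False


# Order matters: the first matching rule wins.
RULES = [
    ("must_self_host_without_enterprise_contract", "Langfuse, self-hosted"),
    ("all_in_on_langchain", "LangSmith"),
    ("multi_framework", "Langfuse or Catalyst"),
    ("deep_traces_high_volume", "Run the unit math first"),
    ("needs_400_day_retention", "Price LangSmith extended against Langfuse Pro"),
    ("needs_native_paging", "LangSmith"),
    ("undiagnosed_recurring_failures", "Add Catalyst + Halo alongside your pick"),
]


def recommend(team: Team) -> str:
    """Return the recommendation for the first rule the team matches."""
    for attribute, recommendation in RULES:
        if getattr(team, attribute):
            return recommendation
    return "No rule matched; compare on the criteria above"


print(recommend(Team(all_in_on_langchain=True, needs_native_paging=True)))
```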

    Figure 2: Which LLM Observability Tool Fits Your Team?

    FAQ

    Is Langfuse really open source? Yes. The core is MIT-licensed with roughly 28,900 GitHub stars and active maintenance, and self-hosting is free. Three features (Organization Creators, the Instance Management API, and UI Customization) require the paid Enterprise Edition license.

    Can you self-host LangSmith? Only on the Enterprise plan, which offers hybrid and self-hosted deployment under custom pricing. There is no open-source or free self-hosted LangSmith.

    Does LangSmith work without LangChain? Yes. LangSmith ingests OpenTelemetry traces and maps GenAI, OpenInference, and TraceLoop semantic conventions, so non-LangChain apps can send standard OTel spans. You lose the zero-setup experience, which is most of the reason LangChain teams choose it.

    Which is cheaper, Langfuse or LangSmith? At our modeled volumes, Langfuse Cloud comes out cheaper: roughly $157 vs $420/month at 100K deep traces, and roughly $1,381 vs $2,670 at 1M. But the gap depends on trace depth (Langfuse bills every observation as a unit) and on retention needs (LangSmith base is 14 days vs Langfuse Core's 90). Shallow-trace, low-seat teams see a much smaller difference.

    Did the ClickHouse acquisition change Langfuse? Ownership changed (ClickHouse announced the acquisition January 16, 2026), but the open-source project remains actively maintained, and the months since brought v4's performance rework and the May 2026 evaluator launches.

    Conclusion

    Langfuse vs LangSmith isn't a question with one winner. It's a sorting function: self-hosting and framework neutrality sort you toward Langfuse; LangChain depth, alerting, and managed agent deployment sort you toward LangSmith; and the pricing math at your trace shape settles ties. The one mistake to avoid is treating the choice as permanent. Instrument on OpenTelemetry and every backend, including these two, stays an env var away.

    And if you're shopping because production keeps breaking, remember that storing traces is the start, not the fix. The Catalyst tracing SDK runs alongside LangSmith or Langfuse, so you can compare all three on your own traffic this week, and let Halo tell you what your traces have been trying to say.
