Jun 14, 2026
OpenLLMetry Alternatives: From LLM Spans to Agent Analysis
Inference Research
The Question OpenLLMetry Cannot Answer
You added Traceloop.init() to your LLM app, and spans started flowing. Token counts, model names, latency — all of it lands somewhere. Now the harder question: what do you actually do with those spans?
Search for OpenLLMetry and almost everything you'll find answers the same question: what is it? The GitHub README, Traceloop's docs, and the knowledge-base pages from Dynatrace, New Relic, and Splunk all explain that it instruments your LLM calls and exports OpenTelemetry spans. None of them help you decide whether that's enough.
Here's the framing that actually matters. LLM observability has three layers:
- Instrumentation: code that emits spans from your app
- Backend: somewhere to store, view, and query those spans
- Analysis: something that tells you what's failing and why
OpenLLMetry covers exactly one of these. It's a good instrumentation library, but instrumentation was never the hard part. The hard part is what happens after: whether your backend understands agent behavior, and whether anything in your stack can look across a thousand runs and tell you the three things that keep going wrong.
This article covers what OpenLLMetry actually is (including the OpenLLMetry-vs-Traceloop confusion), the gap instrumentation cannot fill, four ways to architect LLM tracing, and an honest comparison of the alternatives — including when sticking with OpenLLMetry is the right call.
What Is OpenLLMetry, Exactly?
OpenLLMetry is an open-source instrumentation library built on top of OpenTelemetry that captures telemetry from LLM applications. It's built and maintained by Traceloop, licensed Apache 2.0, and has around 7.2k stars on GitHub. It emits LLM-specific spans (prompts, completions, token usage, model parameters) that any OpenTelemetry-compatible backend can ingest.
The project is actively maintained: the latest release is 0.61.0, shipped May 31, 2026. Coverage is broad. On the provider side it instruments OpenAI and Azure OpenAI, Anthropic, Google Gemini, Cohere, Bedrock, Mistral, Groq, HuggingFace, Ollama, and Replicate. It also covers vector databases (Chroma, Pinecone, Qdrant, Weaviate, LanceDB, Milvus, Marqo) and frameworks (LangChain, LlamaIndex, LangGraph, CrewAI, Haystack, OpenAI Agents). SDKs exist for Python, TypeScript/JavaScript, Go, and Ruby.
Setup is genuinely simple, which is a big part of the appeal. You initialize the SDK once, and the instrumentations patch your installed clients. For multi-step apps, decorators mark the structure: @workflow, @task, @agent, and @tool.
```python
import os

from openai import OpenAI
from traceloop.sdk import Traceloop
from traceloop.sdk.decorators import task, workflow

# One init call patches the installed clients.
# Use disable_batch=True in local dev so spans flush immediately.
Traceloop.init()

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])


@task(name="summarize_ticket")
def summarize_ticket(ticket_text: str) -> str:
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Summarize: {ticket_text}"}],
    )
    return completion.choices[0].message.content


@workflow(name="support_triage")
def triage(ticket_text: str) -> str:
    # Spans: a "workflow" span wrapping a "task" span wrapping the LLM call.
    return summarize_ticket(ticket_text)
```

So far, so good. The confusion starts with the name on the package.
OpenLLMetry vs. Traceloop: Library or Platform?
This is the question searchers keep asking, so here is the direct answer. OpenLLMetry is the open-source SDK: free, Apache 2.0, runs in your process. Traceloop is the company that maintains it, and also a commercial observability platform that happens to be one of the places OpenLLMetry can send spans. The docs list 30+ export destinations, including Datadog, Dynatrace, New Relic, Grafana, Honeycomb, Splunk, Azure Application Insights, and Google Cloud, plus Traceloop's own platform.
The business model makes the distinction concrete. Traceloop's platform has a Free Forever tier with up to 50,000 spans per month, up to 5 seats, and 24-hour data retention; beyond that, it's a custom-priced Enterprise plan with SOC 2 and an on-prem option. Read that retention number again: one day. The library is free forever. Keeping your spans — and doing anything with them — is the product.
The Semantic Conventions Problem
There's a subtler issue worth understanding before you build on OpenLLMetry's span format: the "standard" it implements is still moving.
OpenLLMetry emits attributes like gen_ai.system, gen_ai.request.model, gen_ai.prompt and gen_ai.completion content arrays, and token counts under gen_ai.usage.*, plus its own extensions: traceloop.span.kind (workflow, task, agent, or tool), traceloop.workflow.name, and traceloop.entity.name.
Meanwhile, the upstream OpenTelemetry GenAI semantic conventions, the official standard OpenLLMetry aims to track, are still marked Development status as of June 2026. The required attributes are just gen_ai.operation.name and gen_ai.provider.name, and message content (gen_ai.input.messages, gen_ai.output.messages) is opt-in: the spec says instrumentations SHOULD NOT capture it by default. The agent and framework span conventions are in the same Development state. The OpenTelemetry GenAI working group has been iterating on these since April 2024, and most of them remain experimental.
The churn is visible inside OpenLLMetry itself. The gen_ai.prompt and gen_ai.completion attributes it has emitted for years are now deprecated upstream in favor of gen_ai.input.messages and gen_ai.output.messages. There's an open issue (#3515) tracking exactly this. OpenLLMetry's 0.60.0 release started aligning its instrumentations with semconv v1.40.0. If you build dashboards or queries against today's attribute names, plan to migrate them.
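One way to soften that migration is to normalize attribute names in a single place before any dashboard or query reads them. The sketch below is illustrative only. It assumes the legacy content arrives as flattened, indexed keys such as `gen_ai.prompt.0.role` and `gen_ai.prompt.0.content` (an assumption about the layout; check what your exporter actually emits), and it folds them into one JSON string under the newer key. The names `LEGACY_TO_UPSTREAM` and `normalize_attributes` are hypothetical, and the upstream message schema is richer than the simplified role/content pairs used here.

```python
import json
import re
from collections import defaultdict

# Hypothetical mapping for this sketch: legacy prefix -> newer upstream key.
LEGACY_TO_UPSTREAM = {
    "gen_ai.prompt": "gen_ai.input.messages",
    "gen_ai.completion": "gen_ai.output.messages",
}

# Assumed legacy layout: "<prefix>.<index>.<field>", e.g. "gen_ai.prompt.0.role".
_INDEXED = re.compile(r"^(gen_ai\.(?:prompt|completion))\.(\d+)\.(\w+)$")


def normalize_attributes(attrs: dict) -> dict:
    """Return a copy of attrs with legacy indexed keys folded into JSON arrays."""
    out = {}
    grouped = defaultdict(lambda: defaultdict(dict))
    for key, value in attrs.items():
        match = _INDEXED.match(key)
        if match is None:
            out[key] = value  # Not a legacy content key: pass through untouched.
            continue
        prefix, index, field = match.group(1), int(match.group(2)), match.group(3)
        grouped[prefix][index][field] = value
    for prefix, by_index in grouped.items():
        # Preserve message order by sorting on the numeric index.
        messages = [by_index[i] for i in sorted(by_index)]
        out[LEGACY_TO_UPSTREAM[prefix]] = json.dumps(messages)
    return out


legacy = {
    "gen_ai.request.model": "gpt-4o-mini",
    "gen_ai.prompt.0.role": "system",
    "gen_ai.prompt.0.content": "You answer in one short sentence.",
    "gen_ai.prompt.1.role": "user",
    "gen_ai.prompt.1.content": "Summarize order ABC-123.",
    "gen_ai.completion.0.role": "assistant",
    "gen_ai.completion.0.content": "Order ABC-123 shipped.",
}
normalized = normalize_attributes(legacy)
```

Running a shim like this at query time (or in a collector-side processor) means a dashboard can target one set of names while the instrumentation catches up.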
There is a third convention in this picture: OpenInference, maintained by Arize. It defines ten span kinds (LLM, EMBEDDING, RETRIEVER, RERANKER, TOOL, CHAIN, AGENT, GUARDRAIL, EVALUATOR, PROMPT), requires openinference.span.kind on every span, and captures full message content by default using indexed attributes like llm.input_messages.0.message.role. It was designed for AI workloads from the start, which is why agent-native backends tend to prefer it. Catalyst Tracing, for example, emits OpenInference wire keys byte-identical to the upstream spec; the full list is in the span attributes Catalyst emits.
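To make the indexed-attribute style concrete, here is a minimal sketch of how a chat history could be flattened into OpenInference-shaped keys. The helper name `to_openinference_attributes` is hypothetical, and the sketch covers only role and content; real instrumentation records more fields (tool calls, token counts, invocation parameters).

```python
def to_openinference_attributes(messages: list[dict], model: str) -> dict:
    """Flatten chat messages into indexed, OpenInference-style span attributes."""
    attrs = {
        "openinference.span.kind": "LLM",  # Required on every span.
        "llm.model_name": model,
    }
    for i, message in enumerate(messages):
        prefix = f"llm.input_messages.{i}.message"
        attrs[f"{prefix}.role"] = message["role"]
        attrs[f"{prefix}.content"] = message["content"]
    return attrs


attrs = to_openinference_attributes(
    [
        {"role": "system", "content": "You answer in one short sentence."},
        {"role": "user", "content": "Summarize order ABC-123."},
    ],
    model="gpt-4o-mini",
)
```

Because every message gets its own flat key, a backend can rebuild the full conversation without parsing a JSON blob, which is part of why the shape suits span-kind-aware trace views.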
| Convention | Status | Message content | Structure |
|---|---|---|---|
| OpenLLMetry (gen_ai.* + traceloop.*) | Migrating | Captured (legacy keys) | 4 entity kinds |
| OTel GenAI semconv | Development | Opt-in only | 2 required attrs |
| OpenInference | Shipping spec | Default | 10 span kinds |
Notes: OpenLLMetry's gen_ai.prompt/gen_ai.completion keys are deprecated upstream in favor of gen_ai.input.messages/gen_ai.output.messages (issue #3515); release 0.60.0 began aligning with semconv v1.40.0. OTel GenAI requires only gen_ai.operation.name and gen_ai.provider.name. OpenInference requires openinference.span.kind on every span and captures full message history via indexed attributes.
The takeaway: span shape is not a detail. It determines what your backend can show you — which brings us to the actual gap.
The Gap Instrumentation Cannot Fill
Send OpenLLMetry spans to a generic OpenTelemetry backend and you'll get exactly what that backend was built for: a flame graph. Spans with durations, nested by parent ID, colored by service. For a payment API, that's observability. For an agent, it's a list of anonymous boxes.
Debugging an agent requires the backend to understand what the spans are. Concretely, you need:
- Span-kind-aware trace trees. An LLM call, a tool execution, and a retrieval step are different things and need to render differently: the model's messages for one, the tool's JSON arguments for another.
- Full message history. When an agent goes off the rails on turn 7, you need every system, user, assistant, and tool message that led there. This is where the upstream conventions bite: content capture is opt-in, so a default-configured OTel pipeline may not have recorded the prompt at all.
- Tool-call fidelity. Tool names, call IDs, full JSON arguments, and the results that came back. As a contrast: Catalyst's auto-instrumented spans capture input and output messages, model name, invocation parameters, finish reason, token counts including prompt-cache reads and writes, and tool calls with names, IDs, and JSON arguments.
- Session grouping. A conversation is many traces. Without a session ID convention your backend treats each turn as an unrelated event.
- Stable agent identity. You need every run of the checkout agent grouped together across deploys and versions. Catalyst's dashboard, for instance, groups executions by agent.id first and falls back to agent.name, which is why the docs insist on a stable agent identity rather than per-process UUIDs.
- Cost attribution. Token counts per span, rolled up per agent, per session, per user.
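The grouping requirements in that list can be sketched in a few lines. This is a toy rollup over already-exported spans, not any backend's implementation. It assumes each span is a dict with an `attributes` map, that sessions are tagged with `session.id`, and that token totals live under `llm.token_count.total`. The helper names `agent_key` and `rollup_tokens` are hypothetical.

```python
from collections import defaultdict


def agent_key(attrs: dict) -> str:
    """Prefer the stable agent.id; fall back to agent.name, then a catch-all."""
    return attrs.get("agent.id") or attrs.get("agent.name") or "unknown-agent"


def rollup_tokens(spans: list[dict]) -> dict:
    """Sum token counts per (agent, session) pair."""
    totals = defaultdict(int)
    for span in spans:
        attrs = span["attributes"]
        session = attrs.get("session.id", "no-session")
        totals[(agent_key(attrs), session)] += attrs.get("llm.token_count.total", 0)
    return dict(totals)


spans = [
    {"trace_id": "t1", "attributes": {"agent.id": "checkout-agent", "session.id": "s1", "llm.token_count.total": 120}},
    {"trace_id": "t2", "attributes": {"agent.id": "checkout-agent", "session.id": "s1", "llm.token_count.total": 80}},
    {"trace_id": "t3", "attributes": {"agent.name": "refund-agent", "session.id": "s2", "llm.token_count.total": 50}},
]
totals = rollup_tokens(spans)
```

Without a session or agent attribute on the span, every trace falls into the catch-all bucket, which is the "unrelated events" failure described above.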
Figure 1: The Three Layers of LLM Observability
The APM vendors know this, which is why their OpenLLMetry content exists at all. Dynatrace publishes a "What is OpenLLMetry?" knowledge-base page plus ingestion docs, New Relic has an integration guide, and Splunk has announced OpenLLMetry tie-ins. Each positions its platform as the destination for your spans. Pricing reflects the bolt-on nature: Datadog's LLM Observability bills on LLM spans specifically, with a free allotment of 40,000 LLM spans per month and on-demand charges beyond it. Storage is solved. Understanding is the upsell, and analysis mostly doesn't exist.
Four Ways to Architect LLM Tracing
If you're evaluating OpenLLMetry alternatives, you're really choosing one of four architectures:
- OpenLLMetry + a generic APM backend (Datadog, Dynatrace, New Relic, Grafana). Right when your organization already lives in one of these tools and LLM traces must sit next to existing dashboards and alerting. You accept the weakest agent-level reading of your data and add-on pricing.
- OpenLLMetry + Traceloop's platform. The native pairing. You get an LLM-aware monitoring and evaluation dashboard from the people who wrote the instrumentation. The free tier's 50,000 spans and 24-hour retention are for kicking tires; production means a custom Enterprise contract.
- OpenInference-shaped instrumentation + an agent-native backend (Catalyst Tracing; Arize Phoenix sits in this family too). You swap the instrumentation layer for one whose span shape was designed for agents (full message content by default, explicit agent and session semantics), paired with a backend that reads it natively, with evals and cross-run analysis layered on top.
- Raw OTel DIY (vanilla OpenTelemetry SDK + collector + ClickHouse/Grafana). Maximum control and data ownership, no vendor. You also inherit everything: span-kind-aware views, session grouping, content capture (off by default in the upstream conventions), evals, and analysis are all yours to build.
| Path | Strongest when | What you give up |
|---|---|---|
| OpenLLMetry + generic APM | Org lives in the APM | Agent-aware views, analysis |
| OpenLLMetry + Traceloop | Native pairing, fast start | Retention on free tier |
| OpenInference + agent-native backend | Multi-step agents in production | Existing APM co-location |
| Raw OTel DIY | Total data ownership | Your build time, every layer |
Notes: APM path adds separate LLM-span billing (e.g., Datadog). Traceloop free tier is 50,000 spans/month with 24-hour retention. DIY inherits the opt-in content-capture default of the upstream conventions.
Paths 1 and 2 keep OpenLLMetry and choose a backend. Path 3 replaces the pair. Path 4 replaces the vendor with your weekends. The rest of this article compares the concrete options, then walks through what path 3 looks like in practice.
OpenLLMetry Alternatives Compared
"Alternatives" hides two different questions: what backend should receive my OpenLLMetry spans, and should I be using different instrumentation entirely? The table covers both kinds of candidates.
Langfuse is an open-source backend that can sit on either side of the swap. Its SDKs are OpenTelemetry-native (Python v3+ and JS v4 are both built on the OTel client), and it accepts OTLP traces directly on its /api/public/otel endpoint, so it can receive OpenLLMetry spans or run its own instrumentation. The core is MIT-licensed and self-hostable; Langfuse Cloud runs from a free Hobby tier (50k units/month) through Core at $29/month and Pro at $199/month. Worth knowing: ClickHouse acquired Langfuse in January 2026, with the open-source project continuing under its existing license.
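Pointing an existing OTLP exporter at that endpoint is mostly a matter of two standard OpenTelemetry environment variables. The sketch below builds them in Python. It assumes Langfuse authenticates OTLP requests with HTTP Basic auth over the public and secret key pair, and that the cloud host is `https://cloud.langfuse.com`; both are assumptions to verify against the Langfuse docs. The helper name and the example keys are placeholders.

```python
import base64


def langfuse_otlp_env(public_key: str, secret_key: str, host: str) -> dict:
    """Build standard OTel exporter env vars for an OTLP endpoint with Basic auth."""
    token = base64.b64encode(f"{public_key}:{secret_key}".encode()).decode()
    return {
        "OTEL_EXPORTER_OTLP_ENDPOINT": f"{host.rstrip('/')}/api/public/otel",
        "OTEL_EXPORTER_OTLP_HEADERS": f"Authorization=Basic {token}",
    }


# Placeholder credentials; the host is an assumption for this sketch.
env = langfuse_otlp_env("pk-lf-example", "sk-lf-example", "https://cloud.langfuse.com")
```

With these set before `Traceloop.init()` or any other OTel-based instrumentation starts, the exporter configuration lives outside the application code.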
LangSmith is LangChain's proprietary platform, and the deepest integration if you're on LangChain or LangGraph, with tracing, offline and online LLM-as-judge evaluation, dashboards, and alerting. It added OpenTelemetry export in March 2026. Pricing combines seats and volume: a free Developer tier with 5,000 base traces a month, then Plus at $39 per seat per month with 10,000 base traces included, then $2.50 per 1,000 base traces (14-day retention) or $5.00 per 1,000 extended traces (400-day retention).
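Seat-plus-volume pricing is easier to reason about with the arithmetic written out. The sketch below uses only the Plus-tier numbers quoted above. It assumes the 10,000 included traces apply once per workspace rather than per seat, and that overage is billed pro rata rather than in 1,000-trace blocks; both are assumptions to check against the current price sheet.

```python
def langsmith_plus_monthly_cost(
    seats: int,
    base_traces: int,
    included: int = 10_000,   # Base traces bundled with the plan (assumed per workspace).
    per_seat: float = 39.0,   # USD per seat per month.
    per_1k: float = 2.50,     # USD per 1,000 base traces beyond the allotment.
) -> float:
    """Estimate a monthly bill for seats plus base-trace overage."""
    overage = max(0, base_traces - included)
    return seats * per_seat + (overage / 1000) * per_1k


# Three seats and 250,000 base traces: 117 in seats + 600 in overage.
cost = langsmith_plus_monthly_cost(seats=3, base_traces=250_000)
```

At that volume the trace overage is roughly five times the seat cost, so trace count, not headcount, drives the bill.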
Arize Phoenix is Arize's open-source observability platform — free to self-host, built on OpenTelemetry and the OpenInference conventions Arize maintains, with tracing, evaluation, and experimentation built in. The managed offering (Arize AX) starts at a free 25,000-span tier and a $50/month Pro plan.
Catalyst Tracing (inference.net) takes the OpenInference path end to end: drop-in setup() instrumentation that emits OpenInference-shaped OTel spans, full message and tool-call capture by default, agent and session identity built into the dashboard, rubric-based evals, and Halo for cross-run analysis (more on that below). The free tier includes 1M OTEL spans per month with 50 GB of span data; paid plans start at $25/month for 10M spans.
Generic APM (Datadog as the exemplar) makes sense when LLM traces are a minority concern inside a larger estate — see path 1 above.
| Tool | Span shape | Agent & sessions | Evals & analysis | Free tier |
|---|---|---|---|---|
| Catalyst Tracing | OpenInference on OTel | Identity + sessions | Rubric evals + Halo | 1M spans/mo |
| Traceloop | OpenLLMetry native | Workflow entities | Eval dashboard | 50k spans, 24h |
| Langfuse | OTel-native + OTLP | Sessions | Evals | 50k units/mo |
| LangSmith | Proprietary + OTel | Sessions | Online/offline evals | 5k traces/mo |
| Arize Phoenix | OpenInference on OTel | Sessions | Evals | Self-host free |
| Datadog LLM Obs. | OTel ingest | APM-level | Dashboards only | 40k LLM spans/mo |
Notes: each vendor counts a different free-tier unit (spans, units, traces). Phoenix's managed offering (Arize AX) has a 25,000-span free tier. LangSmith is proprietary and added OTel export in March 2026; cross-run automated failure analysis (ranked findings with trace-ID citations) is a Halo capability the other rows do not list.
Free tiers vary wildly, and because each vendor counts a different unit (spans, units, traces), the headline numbers are not directly comparable.
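For the tiers that are counted in spans, a rough budget check is simple arithmetic. The sketch below uses the span allotments quoted in this article and a hypothetical workload (runs per day and spans per run are made-up inputs). Datadog counts LLM spans specifically, so treating every span as an LLM span overstates usage there.

```python
# Monthly free-tier allotments quoted in this article, span-counted tiers only.
FREE_TIER_SPANS = {
    "Catalyst Tracing": 1_000_000,
    "Traceloop": 50_000,
    "Datadog LLM Observability": 40_000,  # Counts LLM spans only.
    "Arize AX": 25_000,
}


def monthly_spans(runs_per_day: int, spans_per_run: int, days: int = 30) -> int:
    """Estimate spans emitted per month for a steady workload."""
    return runs_per_day * spans_per_run * days


def tiers_that_fit(spans: int) -> list[str]:
    """Return the free tiers whose monthly allotment covers the estimate."""
    return [name for name, limit in FREE_TIER_SPANS.items() if spans <= limit]


# Hypothetical workload: 200 agent runs a day at 8 spans each.
estimate = monthly_spans(runs_per_day=200, spans_per_run=8)
fits = tiers_that_fit(estimate)
```

Multi-step agents raise the spans-per-run figure quickly, so it is worth measuring on real traces before relying on any of these allotments.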

Entry paid pricing is closer, with one structural difference: LangSmith's $39 is per seat, while the others are flat.

Migrating From OpenLLMetry to Catalyst Tracing
If you decide the OpenInference path fits, the migration is smaller than it sounds, because the integration pattern is the same one you already use: initialize once, before your clients are constructed.
The Python SDK is inference-catalyst-tracing (TypeScript: @inference/tracing), installed with per-integration extras: pip install 'inference-catalyst-tracing[openai]' for an OpenAI app, or stack extras like [openai,anthropic,langchain]. Calling setup() patches the installed SDKs it detects. Export is three environment variables: CATALYST_OTLP_ENDPOINT=https://telemetry.inference.net, CATALYST_OTLP_TOKEN with your API key, and a stable CATALYST_SERVICE_NAME. The tracing quickstart covers both languages.
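Since export depends entirely on those three variables, a startup check can catch a misconfigured deploy before spans silently go nowhere. This is a minimal sketch using only the variable names given above; the helper `missing_catalyst_env` is hypothetical and not part of the SDK.

```python
import os

REQUIRED = ("CATALYST_OTLP_ENDPOINT", "CATALYST_OTLP_TOKEN", "CATALYST_SERVICE_NAME")


def missing_catalyst_env(env=None) -> list[str]:
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED if not env.get(name)]


# Example with an explicit mapping: the token is empty, the service name absent.
missing = missing_catalyst_env(
    {"CATALYST_OTLP_ENDPOINT": "https://telemetry.inference.net", "CATALYST_OTLP_TOKEN": ""}
)
```

Calling it with no argument reads the real process environment, so it can sit directly above the `setup()` call in an entrypoint.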
Here's the before and after for an OpenAI app:
```python
import os

# Before (OpenLLMetry):
# from traceloop.sdk import Traceloop
# Traceloop.init()

# After (Catalyst). Install: pip install 'inference-catalyst-tracing[openai]'
# Env: CATALYST_OTLP_ENDPOINT=https://telemetry.inference.net
#      CATALYST_OTLP_TOKEN=<your API key>
from inference_catalyst_tracing import setup
from openai import OpenAI

tracing = setup(service_name="checkout-agent")
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Your application code does not change. The span now carries the full
# message history, tool calls with JSON arguments, finish reason, and
# token counts (including cache details).
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You answer in one short sentence."},
        {"role": "user", "content": "Summarize order ABC-123."},
    ],
    max_tokens=80,
)
print(response.choices[0].message.content)

tracing.shutdown()
```

From that one swap, every instrumented call captures input and output messages, model and invocation parameters, finish reason, token counts including cache details, and tool calls with their JSON arguments.
Framework coverage runs wider than the OpenAI example: documented integrations include Anthropic, Vercel AI SDK, LangChain, LangGraph, Pydantic AI, OpenAI Agents, LiveKit Agents, ElevenLabs, Claude Agent SDK, and Claude Code SDK, alongside manual spans and agent identity. Two bridges matter for teams with existing instrumentation: LangSmith traceable functions keep working with LANGSMITH_TRACING_MODE=otel routing their spans into the Catalyst provider, and existing Langfuse SDK code can be pointed at Catalyst by changing three environment variables via the Langfuse bridge.
For everything auto-instrumentation can't see, there are two escape hatches. agent_span() wraps an agent run as the parent span, carrying agentId, agentName, and sessionId so runs group correctly, with child SDK spans nesting automatically. manual_span() authors TOOL, CHAIN, RETRIEVER, and EMBEDDING spans for custom steps, parenting under the active agent span.
Two operational notes. First, flushing: short-lived scripts should call shutdown() before exit, and serverless functions should call force_flush() per invocation so frozen processes don't drop batched spans. Second, if you'd rather not do any of this by hand, inf instrument --mode tracing launches a coding agent that scans your codebase, installs the right packages, and wires setup() into your entrypoint.
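The serverless flushing advice can be packaged as a decorator so no handler forgets it. The sketch below is generic: it assumes `force_flush()` is a method on the handle returned by `setup()` (the article's example only shows `shutdown()` on that handle), and it uses a stand-in class so the snippet runs without the SDK. `flush_after` and `FakeTracing` are hypothetical names.

```python
import functools


def flush_after(tracing):
    """Wrap a handler so batched spans are flushed after every invocation."""
    def decorator(handler):
        @functools.wraps(handler)
        def wrapper(*args, **kwargs):
            try:
                return handler(*args, **kwargs)
            finally:
                # Runs on success and on exceptions, before the process can freeze.
                tracing.force_flush()
        return wrapper
    return decorator


class FakeTracing:
    """Stand-in for the tracing handle; counts flushes for the demo."""

    def __init__(self):
        self.flushes = 0

    def force_flush(self):
        self.flushes += 1


tracing = FakeTracing()


@flush_after(tracing)
def handler(event):
    if event.get("fail"):
        raise ValueError("boom")
    return {"status": 200}


result = handler({})
```

Putting the flush in a `finally` block matters: the invocations that raise are usually the ones whose spans you most want to keep.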
From Spans to Fixes: Cross-Run Analysis with Halo
Everything to this point, OpenLLMetry included, answers "what happened in this trace?" The production question is different: "what keeps happening across all of them?" No generic OTel backend answers it, and this is the layer where backend choice stops being about storage entirely.
Halo (Hierarchical Agent Loop Optimization) is an open-source, RLM-based analysis engine: pip install halo-engine, MIT-licensed, at github.com/context-labs/halo. It reads OpenTelemetry-compatible spans, decomposes them to find systemic failure modes across many runs, and writes up concrete fixes with citations back to specific trace IDs. You can run it self-hosted against a JSONL trace file, or hosted inside the Catalyst dashboard.
The hosted version lives in the Agents tab's Analysis view. Point it at an agent and a time window and it returns a ranked list of findings (recurring failures, suspicious tool-call patterns, wasted or redundant work, slow or high-cost paths), each citing the trace IDs it came from, so you can click from a finding straight into the trace tree that produced it. You can also ask it pointed questions ("which tool calls return empty results most often?"). For production agents, scheduled runs review recent traffic automatically on an hourly, daily, weekly, or monthly cadence, with lookback windows capped at 30 days. The walkthrough is in running Halo on your traces.
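To show what a cross-run question looks like mechanically, here is a toy version of "which tool calls return empty results most often?" written as plain aggregation. It is not how Halo works internally (Halo is described above as an RLM-based engine); it only illustrates the output shape of a ranked finding that cites trace IDs. The input records and the function name are made up for this sketch.

```python
from collections import defaultdict


def rank_empty_tool_results(tool_calls: list[dict]) -> list[dict]:
    """Rank tools by how often they return an empty result, citing trace IDs."""
    stats = defaultdict(lambda: {"calls": 0, "empty": 0, "trace_ids": []})
    for call in tool_calls:
        entry = stats[call["tool"]]
        entry["calls"] += 1
        if not call["result"]:  # None, "", [] and {} all count as empty.
            entry["empty"] += 1
            entry["trace_ids"].append(call["trace_id"])
    findings = [
        {"tool": tool, "empty_rate": s["empty"] / s["calls"], "trace_ids": s["trace_ids"]}
        for tool, s in stats.items()
        if s["empty"]
    ]
    return sorted(findings, key=lambda f: f["empty_rate"], reverse=True)


calls = [
    {"trace_id": "t1", "tool": "search_orders", "result": []},
    {"trace_id": "t2", "tool": "search_orders", "result": [{"id": "ABC-123"}]},
    {"trace_id": "t3", "tool": "lookup_refund", "result": None},
    {"trace_id": "t4", "tool": "lookup_refund", "result": ""},
    {"trace_id": "t5", "tool": "get_user", "result": {"id": 7}},
]
findings = rank_empty_tool_results(calls)
```

A hand-written query like this answers one question you already thought to ask. The case for an analysis engine is the questions you did not.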
The workflow this enables is a loop, not a dashboard: trace, analyze, fix, re-run, confirm the finding disappeared. That's the difference between observability as a viewer and observability as an improvement system.
If your agents are already in production, this is the layer worth evaluating backends on: every vendor can store spans, few can tell you what's wrong with them.
Adding Evals on Top of Traces
The second post-storage layer is evaluation. Metric dashboards chart latency and token counts, but they cannot tell you whether the agent's answers were any good. That requires scoring outputs.
The trace-native approach: rubric-based LLM-as-a-judge scoring. In Catalyst, a rubric is a set of plain-English quality dimensions scored numerically by a judge model. Every rubric includes an {{ eval_model_response }} placeholder, and the default scale is 1 to 10. Offline evaluation, available today, runs a dataset (with inputs typically built from captured production traffic) through the models you select and scores the outputs; online evaluation, which will judge live production traffic as it flows through, is the direction the platform is heading. The distinction is covered in offline and online evals.
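The rubric mechanics are easy to picture in code. The sketch below shows one plausible way a rubric with the `{{ eval_model_response }}` placeholder could be turned into a judge prompt and how a returned score could be checked against the default 1 to 10 scale. The rendering and validation helpers are assumptions for illustration, not Catalyst's implementation, and the rubric text is invented.

```python
PLACEHOLDER = "{{ eval_model_response }}"


def render_rubric(rubric: str, response: str) -> str:
    """Substitute the model's output into the rubric to form the judge prompt."""
    if PLACEHOLDER not in rubric:
        raise ValueError(f"rubric must contain {PLACEHOLDER}")
    return rubric.replace(PLACEHOLDER, response)


def validate_score(score: int, low: int = 1, high: int = 10) -> int:
    """Reject judge scores that fall outside the rubric's scale."""
    if not low <= score <= high:
        raise ValueError(f"score {score} is outside {low}-{high}")
    return score


# Invented rubric for the sketch: two plain-English quality dimensions.
rubric = (
    "Score the response from 1 to 10 for accuracy and brevity.\n\n"
    "Response:\n{{ eval_model_response }}"
)
prompt = render_rubric(rubric, "Order ABC-123 shipped.")
```

Validating the score range is worth keeping even in a sketch: judge models occasionally answer off-scale, and a silent 0 or 11 skews any average built on top.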
Mapped back to the alternatives: OpenLLMetry itself has no eval story; it's instrumentation. Traceloop's platform includes an evaluation dashboard, LangSmith treats offline and online LLM-as-judge evaluation as a core feature, and Langfuse offers LLM-as-a-judge evaluators, annotation queues, and custom scores. Generic APM backends are the odd ones out: spans arrive, get stored, and nothing scores them.
When OpenLLMetry Is Still the Right Choice
An honest taxonomy includes the cases where you should keep what you have.
Keep OpenLLMetry when:
- Your organization already lives in an APM. If Datadog or Dynatrace dashboards and alerting are how your company operates, LLM spans flowing into the same place has real value, and the vendors explicitly support the path.
- Your workload is polyglot or not agentic. OpenLLMetry ships Go and Ruby SDKs; if your LLM calls happen in a Go service, it may be the only practical option. And for single-shot completion calls (classify, extract, summarize), token and latency metrics cover most of what goes wrong.
- Compliance pins your telemetry. If traces must land in an approved, audited backend, instrumentation that exports there wins by default.
Reconsider when:
- Your application is a multi-step agent with tool calls, where session grouping and stable agent identity determine whether you can debug at all.
- You need full message content by default. Remember, the upstream conventions leave it opt-in.
- You need evals on production traffic or cross-run failure analysis, neither of which a span store provides.
- Retention limits bite: a 24-hour free window is not enough history to see a pattern.
The library is good engineering. The question was never the library. It's whether the place you send its spans understands agents, and whether anything analyzes what arrives.
FAQ
What is the difference between OpenLLMetry and Traceloop?
OpenLLMetry is the open-source instrumentation SDK (Apache 2.0, maintained by Traceloop). Traceloop is the company behind it and a commercial observability platform that is one of 30+ destinations OpenLLMetry can export spans to. Using the library does not require the platform.
Is OpenLLMetry free?
The library is. It's Apache 2.0, with no usage limits. The backends you send spans to are not: Traceloop's platform is free up to 50,000 spans per month with 24-hour retention, then custom Enterprise pricing, and every other backend has its own pricing model.
Do OpenTelemetry GenAI spans capture prompts and completions?
Not by default. The OTel GenAI semantic conventions make message content opt-in — instrumentations SHOULD NOT capture it unless you enable it. OpenInference-shaped instrumentation captures full message content by default.
What backends does OpenLLMetry support?
30+ destinations, including Traceloop, Datadog, Dynatrace, New Relic, Grafana, Honeycomb, Splunk, Azure Application Insights, and Google Cloud. Any OTLP-compatible endpoint works.
Can I switch from OpenLLMetry without re-instrumenting everything?
Mostly, yes. Catalyst's setup() follows the same init-once pattern as Traceloop.init(), with auto-instrumentation for the major SDKs and frameworks. If parts of your stack use LangSmith or Langfuse instrumentation, bridges route those spans in without code changes.
Conclusion
OpenLLMetry solved instrumentation: spans come out of your app with one init call, in a format dozens of backends accept. But instrumentation is the first third of the problem. Choose your architecture by the other two: a backend that renders agents as agents (message history, tool calls, sessions, identity), and an analysis layer that reads across runs instead of one trace at a time. That last layer (evals on production traffic, ranked failure modes with trace IDs attached) is where span storage turns into a system that actually improves your agent.
If you want to see what OpenInference-shaped tracing looks like on your own stack, the setup is one package and one function call.
Related Reading
- LLM Tracing: a field-by-field guide — a deep dive on the span fields and conventions this article summarizes
- Agent Observability Guide — the conceptual companion on trace trees, OTel, and OpenInference
- AI Agent Observability Tools — a criteria-first comparison across the wider tool landscape