    Jun 18, 2026

    OpenInference and OpenTelemetry for LLM Tracing: A Practical Guide

    Inference Research

    Why Raw OpenTelemetry Spans Aren't Enough for LLM Apps

    Wire up standard OpenTelemetry instrumentation around an LLM-powered app and you'll get back something like POST api.openai.com 200 1.4s. That is technically a trace. It tells you a network request happened, how long it took, and that it succeeded. It tells you nothing about what was asked, what the model answered, which model ran, how many tokens it burned, or which tool the model decided to call.

    This is the gap OpenInference exists to fill. LLM tracing on open standards has two layers: OpenTelemetry supplies the transport and plumbing, and a set of semantic conventions supplies the meaning. OpenInference is the convention set with the longest track record for LLM and agent spans: an Apache 2.0 project of conventions and instrumentation plugins, complementary to OpenTelemetry, that works with any OTel-compatible backend. A second effort, the OpenTelemetry GenAI semantic conventions, is developing inside the OTel project itself.

    This guide explains how the two layers fit together in plain English: what OpenInference spans actually look like attribute by attribute, how the OTel GenAI conventions compare and how to choose, what a fully traced agent run looks like as a trace tree, the practical paths to instrumenting your stack, and what standards-shaped traces unlock once you have them. At the end there's a checklist to verify your tracing is actually compliant.

    Read time: 12 minutes


    The Two-Layer Model: OpenTelemetry Is Transport, Semantic Conventions Are Meaning

    OpenTelemetry gives you the machinery of distributed tracing. A span records one timed operation: a trace ID, a span ID, a parent span ID, start and end timestamps, and a status. Spans nest by parent ID to form a trace, the full tree of work behind one request. Context propagation carries the active span across function and process boundaries so children attach to the right parent, even through async code. A tracer provider batches finished spans and exports them over OTLP, the vendor-neutral wire protocol, optionally through a collector that routes and fans out telemetry to one or more backends.
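
    To make the span fields and context propagation concrete, here is a minimal sketch in plain Python. It is a toy model, not the OpenTelemetry SDK: the Span class, start_span helper, and FINISHED list are hypothetical names invented for illustration, and a context variable stands in for OTel's context propagation.

```python
import contextvars
import secrets
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Optional

# Toy stand-ins: NOT the OpenTelemetry SDK, just the same ideas in miniature.
_active_span = contextvars.ContextVar("active_span", default=None)
FINISHED = []  # stands in for the batch an exporter would ship over OTLP


@dataclass
class Span:
    name: str
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]
    start_ns: int = field(default_factory=time.time_ns)
    end_ns: Optional[int] = None
    status: str = "UNSET"


@contextmanager
def start_span(name):
    # Whatever span is active right now becomes the parent of the new one.
    parent = _active_span.get()
    span = Span(
        name=name,
        trace_id=parent.trace_id if parent else secrets.token_hex(16),
        span_id=secrets.token_hex(8),
        parent_span_id=parent.span_id if parent else None,
    )
    token = _active_span.set(span)
    try:
        yield span
        span.status = "OK"
    except Exception:
        span.status = "ERROR"
        raise
    finally:
        span.end_ns = time.time_ns()
        _active_span.reset(token)  # restore the previous active span
        FINISHED.append(span)


with start_span("agent.run") as root:
    with start_span("llm.call") as child:
        pass  # the model call would happen here
```

    The inner span picks up its trace ID and parent ID without either being passed explicitly, which is the behavior the real SDK provides across function and async boundaries.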

    None of that machinery knows what an LLM is. Wrap an OpenAI call in a generic HTTP span and OTel will faithfully record host, method, status code, and duration. The prompt, the completion, the model name, the token usage, the tool calls the model requested: all of it is invisible, because nothing in the span data model has a slot for it. The span structure is there; the vocabulary is missing.

    Semantic conventions are that vocabulary. They're nothing more exotic than agreed-upon attribute names attached to ordinary spans. An OpenInference span is an OpenTelemetry span (same data model, same export path, same collectors) with attributes like llm.model_name and llm.token_count.prompt that any OpenInference-aware backend knows how to read. That shared vocabulary is what lets a backend render a conversation view, compute per-call cost, and group agent runs without custom parsing per app.
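
    As a small illustration of why shared attribute names matter, the sketch below prices a call from nothing but the OpenInference keys on the span. The price table and the model name "example-model" are hypothetical placeholders, not real rates, and llm_span_cost is a name made up for this example.

```python
from decimal import Decimal

# HYPOTHETICAL prices in USD per million tokens, keyed by a made-up model name.
# Substitute your provider's published rates.
PRICE_PER_MILLION = {
    "example-model": {"prompt": Decimal("1.00"), "completion": Decimal("2.00")},
}


def llm_span_cost(attributes):
    """Price one LLM span using only OpenInference attribute names."""
    price = PRICE_PER_MILLION[attributes["llm.model_name"]]
    prompt_cost = attributes["llm.token_count.prompt"] * price["prompt"]
    completion_cost = attributes["llm.token_count.completion"] * price["completion"]
    return (prompt_cost + completion_cost) / Decimal(1_000_000)


attrs = {
    "llm.model_name": "example-model",
    "llm.token_count.prompt": 212,
    "llm.token_count.completion": 18,
}
cost = llm_span_cost(attrs)  # a Decimal amount in USD
```

    Because the function reads convention keys rather than app-specific fields, the same code would work on spans from any instrumented framework.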

    Figure 1: The Two-Layer Model of LLM Tracing — OpenTelemetry handles span creation, batching, OTLP export, and fan-out; semantic conventions are the attribute vocabulary that makes the spans meaningful.

    Keep this split in mind, because it dissolves most of the confusion in this space. Questions about exporters, collectors, sampling, and propagation are OpenTelemetry questions, answered the same way for LLM apps as for any service. Questions about what an LLM span should say are semantic-convention questions, and there are currently two serious answers.

    OpenInference Explained: Span Kinds and Attributes That Make LLM Spans Readable

    OpenInference is a set of OpenTelemetry-compatible semantic conventions plus instrumentation libraries for tracing AI applications. It's maintained by Arize under an Apache 2.0 license, natively supported by Arize Phoenix, and consumable by any OpenTelemetry-compatible backend. It has been shipping since 2023, before the OTel project began its own GenAI conventions.

    The spec rests on two ideas: every span declares a kind, and each kind carries a known set of attributes.

    Span kinds

    The attribute openinference.span.kind is required on every OpenInference span and identifies what kind of work the span wraps. The spec defines ten values: LLM, EMBEDDING, CHAIN, RETRIEVER, RERANKER, TOOL, AGENT, GUARDRAIL, EVALUATOR, and PROMPT. Six of them do almost all the work in a typical app:

    Span kind  | Wraps                     | Key attributes
    -----------+---------------------------+------------------------------------------------------
    LLM        | One model call            | llm.model_name, llm.input_messages, llm.token_count.*
    TOOL       | A tool execution          | tool.name, input.value, output.value
    AGENT      | A whole agent run         | agent.id, session.id, input.value
    CHAIN      | A non-LLM pipeline step   | input.value, output.value
    RETRIEVER  | A document/vector lookup  | input.value (query), retrieval.documents
    EMBEDDING  | An embedding call         | embedding.model_name, embedding.text

    The remaining four — RERANKER, GUARDRAIL, EVALUATOR, and PROMPT — cover reranking retrieved documents, safety checks, scoring passes, and prompt-template resolution.

    An annotated LLM span

    Attributes are where the actual payload lives. Messages are structured lists: llm.input_messages and llm.output_messages, each entry carrying message.role and message.content, plus message.tool_calls when the model requests a tool, with tool_call.function.name and tool_call.function.arguments as a JSON string. Model metadata rides in llm.model_name and llm.invocation_parameters. Token usage lands in llm.token_count.prompt, llm.token_count.completion, and llm.token_count.total. Generic operations use input.value and output.value, and session.id plus user.id group spans into conversations and users.

    Here's what that looks like on a real chat call that triggered a tool:

    {
      "name": "ChatCompletion",
      "trace_id": "af1d6701f1b4e8e0b1d2a7c93f3e9a44",
      "span_id": "9c2b1de40a6f8c31",
      "parent_span_id": "4b7a90cd2e8f1a02",          // nests under the AGENT span
      "attributes": {
        "openinference.span.kind": "LLM",             // required on every span
    
        // -- the conversation going in --
        "llm.input_messages.0.message.role": "system",
        "llm.input_messages.0.message.content": "You are a support agent. Use tools to look up orders.",
        "llm.input_messages.1.message.role": "user",
        "llm.input_messages.1.message.content": "Where's my order #1842?",
    
        // -- the model's reply: a tool call, not text --
        "llm.output_messages.0.message.role": "assistant",
        "llm.output_messages.0.message.tool_calls.0.tool_call.function.name": "lookup_order",
        "llm.output_messages.0.message.tool_calls.0.tool_call.function.arguments": "{\"order_id\": \"1842\"}",
    
        // -- model metadata --
        "llm.model_name": "gpt-4o-mini",
        "llm.invocation_parameters": "{\"temperature\": 0.2, \"max_tokens\": 512}",
    
        // -- the token bill --
        "llm.token_count.prompt": 212,
        "llm.token_count.completion": 18,
        "llm.token_count.total": 230,
    
        // -- grouping --
        "session.id": "session-7f3a"
      }
    }

    Notice that everything a debugging session needs is on the span itself: the exact conversation state going in, the tool call coming out, the model, the parameters, and the token bill. That's the difference between a trace you can act on and POST api.openai.com 200.
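
    The indexed keys in that span follow a regular pattern, so producing them from a structured message list is mechanical. The sketch below is one way to do it, assuming a simple input shape (dicts with role, content, and tool_calls entries holding name and arguments) chosen for illustration; flatten_messages is a hypothetical helper, not part of any OpenInference library.

```python
import json


def flatten_messages(prefix, messages):
    """Flatten a message list into dotted, indexed attribute keys."""
    attrs = {}
    for i, msg in enumerate(messages):
        base = f"{prefix}.{i}.message"
        attrs[f"{base}.role"] = msg["role"]
        if msg.get("content") is not None:
            attrs[f"{base}.content"] = msg["content"]
        for j, call in enumerate(msg.get("tool_calls", [])):
            call_base = f"{base}.tool_calls.{j}.tool_call.function"
            attrs[f"{call_base}.name"] = call["name"]
            # Arguments travel as a JSON string, as in the span above.
            attrs[f"{call_base}.arguments"] = json.dumps(call["arguments"])
    return attrs


output_attrs = flatten_messages(
    "llm.output_messages",
    [
        {
            "role": "assistant",
            "tool_calls": [
                {"name": "lookup_order", "arguments": {"order_id": "1842"}}
            ],
        }
    ],
)
```

    In practice an instrumentor does this for you; the sketch only shows that the wire format is a flat key-value map, which is what lets it ride on ordinary span attributes.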

    OpenInference is not a paper spec, either. Instrumentation libraries exist for Python, TypeScript/JavaScript, and Java, with Python coverage the broadest: OpenAI, Anthropic, Bedrock, LangChain, LlamaIndex, DSPy, CrewAI, OpenAI Agents, PydanticAI, Smolagents, and more. Releases are active: the core instrumentation package and the OpenAI instrumentor both shipped new versions in the first week of June 2026.

    OpenInference vs. the OpenTelemetry GenAI Semantic Conventions

    The OTel project has its own answer to the vocabulary problem: the GenAI semantic conventions, living under the gen_ai.* namespace. The namespace started in 2024 and shipped its first versions in 2025, a year-plus behind OpenInference's 2023 start.

    The gen_ai conventions are broader in signal types and narrower in span semantics. They define spans for model calls and agent/framework operations, plus metrics, events, and conventions for MCP and specific providers like Anthropic, Azure AI Inference, AWS Bedrock, and OpenAI. A gen_ai inference span is named {gen_ai.operation.name} {gen_ai.request.model} (chat gpt-4o-mini, say) and must carry gen_ai.operation.name and gen_ai.provider.name, with token usage in gen_ai.usage.input_tokens and gen_ai.usage.output_tokens and response details in gen_ai.response.*.

    Two differences actually drive decisions, and most comparisons skim past both.

    Stability. The gen_ai conventions are marked Development status, meaning experimental, as of June 2026. Attribute names can still change between versions. The spec manages this with an opt-in: instrumentations on v1.36.0 or earlier keep emitting their current format by default, and you set OTEL_SEMCONV_STABILITY_OPT_IN=gen_ai_latest_experimental to get the newest attribute versions. There's no committed timeline for a stable release. OpenInference doesn't carry an experimental label; its conventions are versioned by Arize and have been stable in practice across years of instrumentor releases.

    Message capture. By default, gen_ai instrumentations record no prompt or completion content. Capturing messages on-span (gen_ai.input.messages, gen_ai.output.messages, gen_ai.system_instructions) is opt-in, with formal JSON schemas, or you can reference externally stored content. OpenInference records structured messages on the span by default. Teams adopting gen_ai are routinely surprised to find traces with no prompts in them; teams with strict data-residency rules sometimes want exactly that.

    Dimension          | OpenInference                   | OTel GenAI
    -------------------+---------------------------------+----------------------------
    Namespace          | llm.*, openinference.span.kind  | gen_ai.*
    Shipping since     | 2023                            | 2024–2025
    Status             | Versioned, stable in practice   | Development (experimental)
    Messages on spans  | Default                         | Opt-in
    Span typing        | 10 span kinds                   | Per-operation names
    Signal types       | Traces                          | Traces, metrics, events
    MCP conventions    | No                              | Yes
    Governance         | Arize-led OSS                   | OTel SIG

    Both are attribute vocabularies on ordinary OpenTelemetry spans: same OTLP export, same collectors, same backends. "Default"/"Opt-in" refers to whether prompt and completion content is recorded on spans without extra configuration.

    Figure 2: Convention Coverage Today — 0 = not covered, 1 = opt-in, 2 = covered by default.

    How to choose

    The honest answer is that this is a low-stakes choice dressed up as a high-stakes one. Both convention sets are attributes on ordinary OTel spans, exported over the same OTLP, through the same collectors. The namespaces are converging, and some OpenInference instrumentations already emit both attribute sets for compatibility. Translating one vocabulary to the other is mechanical.

    A pragmatic split:

    • You want full payloads visible today (prompts, tool arguments, retrieved documents) for debugging and analysis: use OpenInference. Its instrumentors capture rich content by default, and its span kinds cover reranking, guardrails, and evaluators, which gen_ai doesn't type.
    • You're a platform team standardizing all telemetry, including metrics and events, under OTel governance, and you can tolerate attribute churn: track gen_ai, pin versions deliberately, and use the stability opt-in.
    • Either way, you're on OpenTelemetry. The cost of picking "wrong" is an attribute mapping, not a re-instrumentation.
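
    To show what "an attribute mapping" means in practice, here is a rough sketch that translates the OpenInference keys discussed in this guide into their gen_ai counterparts. It covers only the attributes named above, the pairing of llm.model_name with gen_ai.request.model is an approximation, and to_gen_ai is a hypothetical helper rather than an existing library function.

```python
# Only the pairs named in this guide; a real mapping would cover more keys.
OPENINFERENCE_TO_GEN_AI = {
    "llm.model_name": "gen_ai.request.model",
    "llm.token_count.prompt": "gen_ai.usage.input_tokens",
    "llm.token_count.completion": "gen_ai.usage.output_tokens",
}


def to_gen_ai(attributes, operation, provider):
    """Return (span_name, attributes) in gen_ai vocabulary."""
    translated = {
        # The two attributes a gen_ai inference span must carry.
        "gen_ai.operation.name": operation,
        "gen_ai.provider.name": provider,
    }
    for old_key, new_key in OPENINFERENCE_TO_GEN_AI.items():
        if old_key in attributes:
            translated[new_key] = attributes[old_key]
    model = translated.get("gen_ai.request.model", "")
    span_name = f"{operation} {model}".strip()
    return span_name, translated


name, gen_ai_attrs = to_gen_ai(
    {
        "llm.model_name": "gpt-4o-mini",
        "llm.token_count.prompt": 212,
        "llm.token_count.completion": 18,
    },
    operation="chat",
    provider="openai",
)
```

    Message content is left out on purpose: the two conventions structure it differently, so that part of a translation needs more than a key rename.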

    Anatomy of a Fully Traced Agent Run

    Single LLM calls are the easy case. The value of LLM tracing shows up when one user request fans out into a loop of model calls and tool executions, and you need to see the whole thing as one tree.

    Take a support agent handling "Where's my order #1842?". A fully traced run looks like this:

    Figure 3: Anatomy of a Fully Traced Agent Run — solid lines are parent-child span relationships; dotted lines show the logical data flow between the LLM's tool request, the tool execution, and the final answer.

    Reading from the root:

    1. The AGENT span wraps the whole run. It carries the agent's identity (agent.id, agent.name) plus session.id for the conversation and the run's top-level input.value and output.value.
    2. The first LLM span holds the full message array in, and an output message whose message.tool_calls shows the model requesting lookup_order with arguments {"order_id": "1842"}.
    3. The TOOL span records the actual execution: tool name, the arguments as input, the result as output, and its own duration and status, so you can tell a slow tool from a slow model.
    4. The second LLM span takes the tool result back in and produces the final answer, with its own token counts.

    The structural glue is OpenTelemetry context propagation: spans created while the AGENT span is active automatically become its children. The semantic glue is OpenInference attributes at every level. For a deeper treatment of what's worth recording at each level, see our guide to what to capture in production agents.
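
    The four-span run described above can be sketched as a flat list of spans linked by parent ID, then rendered as an indented tree. The span names and short IDs here are made up for illustration, and the sketch assumes both LLM calls and the tool execution are direct children of the AGENT span.

```python
def render_tree(spans):
    """Return an indented text tree from a flat list of span dicts."""
    children = {}
    for span in spans:
        children.setdefault(span["parent_span_id"], []).append(span)

    lines = []

    def walk(parent_id, depth):
        for span in children.get(parent_id, []):
            lines.append("  " * depth + f'{span["kind"]} {span["name"]}')
            walk(span["span_id"], depth + 1)

    walk(None, 0)  # roots are the spans with no parent
    return "\n".join(lines)


# Span names and IDs below are invented for this example.
run = [
    {"span_id": "a1", "parent_span_id": None, "kind": "AGENT", "name": "support-agent.run"},
    {"span_id": "b1", "parent_span_id": "a1", "kind": "LLM", "name": "ChatCompletion"},
    {"span_id": "b2", "parent_span_id": "a1", "kind": "TOOL", "name": "lookup_order"},
    {"span_id": "b3", "parent_span_id": "a1", "kind": "LLM", "name": "ChatCompletion"},
]

tree = render_tree(run)
print(tree)
```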

    Session grouping and agent identity

    Two attributes make agent traces queryable across runs rather than just viewable one at a time.

    session.id groups traces into conversations. Set it once on the outer agent span and every span underneath inherits it, so a five-turn conversation reads as one session rather than five disconnected trees.

    agent.id gives the agent itself a stable identity. This is the one teams get wrong: the ID should survive deploys, environment changes, and display-name changes. Use something like support-triage-agent, not a random UUID per process and not a pretty name like SupportAgent v2. Identity attributes set on the AGENT span are copied onto child LLM and TOOL spans, so every span in the system is attributable to the agent that produced it. In Catalyst, the Agents dashboard groups runs by exactly these attributes, and the same pattern holds for any backend that understands the conventions. The agent identity guide covers choosing good IDs and wiring them in.
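
    The inheritance described here can be modeled as a walk up the parent chain. The sketch below is a post-hoc illustration over already-collected span dicts, not how any particular SDK implements it; stamp_identity and the span dict shape are assumptions made for the example.

```python
INHERITED_KEYS = ("session.id", "agent.id")


def stamp_identity(spans):
    """Copy identity attributes onto each span from its nearest ancestor."""
    by_id = {s["span_id"]: s for s in spans}

    def lookup(span, key):
        # Walk upward until some ancestor (or the span itself) has the key.
        while span is not None:
            if key in span["attributes"]:
                return span["attributes"][key]
            span = by_id.get(span["parent_span_id"])
        return None

    for span in spans:
        for key in INHERITED_KEYS:
            value = lookup(span, key)
            if value is not None:
                span["attributes"].setdefault(key, value)
    return spans


spans = stamp_identity([
    {"span_id": "a1", "parent_span_id": None,
     "attributes": {"openinference.span.kind": "AGENT",
                    "agent.id": "support-triage-agent",
                    "session.id": "session-7f3a"}},
    {"span_id": "b1", "parent_span_id": "a1",
     "attributes": {"openinference.span.kind": "LLM"}},
    {"span_id": "b2", "parent_span_id": "a1",
     "attributes": {"openinference.span.kind": "TOOL"}},
])
```

    After stamping, a query such as "all TOOL spans for support-triage-agent" needs no join back to the root span.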

    Practical Instrumentation Paths

    Knowing the conventions is half the job. Getting compliant spans out of your actual stack is the other half, and there are three paths, in order of preference.

    Auto-instrumentation

    For supported SDKs and frameworks, you don't hand-author spans. An instrumentor patches the client and emits convention-shaped spans for every call. The OpenInference ecosystem ships instrumentors across providers and frameworks, and OpenInference-native platforms package the same idea as a single setup call.

    Here's the end-to-end flow using the Catalyst tracing SDKs (@inference/tracing for TypeScript, inference-catalyst-tracing for Python), which emit OpenInference-shaped OpenTelemetry over OTLP/HTTP, with wire keys byte-identical to the upstream conventions:

    1. Install the tracing package plus the integrations you use: pip install 'inference-catalyst-tracing[openai]', with extras for anthropic, langchain, langgraph, openai-agents, pydantic-ai, and more.
    2. Configure export with three env vars: CATALYST_OTLP_ENDPOINT, CATALYST_OTLP_TOKEN, and a stable CATALYST_SERVICE_NAME.
    3. Call setup() before constructing clients. In Python it auto-detects installed packages; in TypeScript you pass the SDK modules to patch. Initialization order matters: clients built before setup() runs go untraced.
    4. Wrap the run in an agent span so calls group under a stable identity (previous section).
    5. Run your app and look at the trace. Each model call appears as an LLM span with messages, model, parameters, finish reason, and token counts filled in.

    import os
    
    from inference_catalyst_tracing import setup
    from openai import OpenAI
    
    tracing = setup()
    client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Reply with just the word hello."}],
        max_tokens=16,
    )
    
    print(response.choices[0].message.content)
    tracing.shutdown()

    The TypeScript shape is the same, with the SDK modules passed explicitly:

    import { setup } from "@inference/tracing";
    import OpenAI from "openai";
    
    const tracing = await setup({
      modules: { openai: OpenAI },
    });
    
    const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
    
    const response = await client.chat.completions.create({
      model: "gpt-4o-mini",
      messages: [{ role: "user", content: "Reply with just the word hello." }],
      max_tokens: 16,
    });
    
    console.log(response.choices[0]?.message.content);
    await tracing.shutdown();

    One framework-specific wrinkle: the Vercel AI SDK exposes its own experimental_telemetry option, so there's nothing to patch. You generate telemetry settings and pass them into each generateText or streamText call instead.

    Manual spans

    Auto-instrumentation covers model calls, but agent runs contain steps that aren't SDK calls: a tool you execute yourself, a retrieval against your own index, a router, a validation pass. For those you author spans directly, choosing the right span kind. The SDKs expose this as a manual span helper that takes a TOOL, CHAIN, RETRIEVER, or EMBEDDING kind plus input and output.

    from inference_catalyst_tracing import (
        SpanKindValues,
        agent_span,
        manual_span,
        setup,
    )
    
    tracing = setup()
    
    # --- your own (non-SDK) application code ---
    query = "open refund tickets for order #1842"
    
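    def run_refund_review():
        # Placeholder for your agent loop: LLM calls made here are auto-captured.
        # It should return an object with a .summary attribute (read below).
        ...
    
    def retrieve(q):
        # Placeholder for your vector search; should return the matched documents.
        ...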
    
    # The AGENT span: the parent row the whole run nests under.
    with agent_span(
        tracing.tracer,
        agent_id="refund-review-agent",
        span_name="refund-review.run",
    ) as span:
        span.set_input("Review refund request #1842")
        decision = run_refund_review()
        span.set_output(decision.summary)
    
    # manual_span authors TOOL / CHAIN / RETRIEVER / EMBEDDING spans.
    with manual_span(
        tracing.tracer,
        name="rag.retrieve",
        span_kind=SpanKindValues.RETRIEVER,
        input={"query": query, "k": 8},
    ) as span:
        docs = retrieve(query)
        span.set_output(docs)
    
    tracing.shutdown()

    One operational gotcha worth its own sentence: spans are batched and exported in the background, so short-lived processes must call shutdown() before exit, and serverless functions should flush per invocation with forceFlush() instead. Otherwise spans silently vanish when the runtime freezes.
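
    Why un-flushed spans vanish is easier to see with a toy model. The sketch below is not the Catalyst SDK or the OpenTelemetry batch processor (which also exports on a timer); ToyBatchExporter and handle_invocation are hypothetical names that only illustrate spans sitting in memory until a flush moves them out.

```python
class ToyBatchExporter:
    """Buffers finished spans and 'sends' them only when a batch fills or on flush."""

    def __init__(self, batch_size=10):
        self.batch_size = batch_size
        self.buffer = []  # spans held in process memory
        self.sent = []    # spans that actually left the process

    def on_end(self, span):
        self.buffer.append(span)
        if len(self.buffer) >= self.batch_size:
            self.force_flush()

    def force_flush(self):
        self.sent.extend(self.buffer)
        self.buffer.clear()


def handle_invocation(event, exporter):
    """Serverless-style handler: flush in finally so spans survive a freeze."""
    try:
        exporter.on_end({"name": "llm.call", "event": event})
        return "ok"
    finally:
        exporter.force_flush()


exporter = ToyBatchExporter(batch_size=10)
exporter.on_end({"name": "llm.call"})
pending_before_flush = len(exporter.buffer)  # still only in memory
result = handle_invocation({"order_id": "1842"}, exporter)
```

    If the process exited or the runtime froze while pending_before_flush was nonzero, those spans would never reach a backend, which is the failure mode the shutdown and per-invocation flush calls guard against.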

    Bridging from LangSmith or Langfuse

    Already instrumented with a vendor SDK? Standards make bridges cheap, because everyone ultimately speaks OTel. LangSmith can emit its spans as OpenTelemetry: set LANGSMITH_TRACING_MODE=otel and your traceable functions flow into an OTel tracer provider alongside everything else. Langfuse instrumentation can be redirected wholesale with three environment variables, no code changes:

    # --- Langfuse -> Catalyst: redirect existing Langfuse instrumentation, no code changes ---
    export LANGFUSE_HOST="https://telemetry.inference.net"
    export LANGFUSE_PUBLIC_KEY="pk-catalyst"          # compatibility only; value never used
    export LANGFUSE_SECRET_KEY="<your-catalyst-api-key>"
    
    # --- LangSmith -> OTel: emit LangSmith spans as OpenTelemetry ---
    export LANGSMITH_TRACING="true"
    export LANGSMITH_TRACING_MODE="otel"

    The LangSmith and Langfuse bridge guides cover both flows end to end. The deeper point: bridges like these only exist because the span format underneath is standardized.

    Avoiding Lock-In: What Standards-Shaped Traces Buy You

    The strategic payoff of doing all this on open conventions instead of a proprietary SDK comes down to three capabilities.

    You can switch backends by changing an endpoint. OTLP export means the destination is configuration, not code. Re-pointing CATALYST_OTLP_ENDPOINT, or any OTLP endpoint variable, moves your traces; your instrumentation doesn't know or care what's listening.

    You can forward the same traces to multiple places. An OpenTelemetry collector can fan out one span stream to several backends at once (a migration trial, a data warehouse, a compliance archive) without touching application code.

    Your instrumentation outlives your tooling. Because OpenInference wire keys are a published convention, any OpenInference-aware viewer renders the same spans without configuration. The work you put into agent spans, tool spans, and session IDs is an asset that transfers, not a vendor relationship to unwind. The bridges in the previous section are this property in action. The same logic protects you in the gen_ai direction too, since both vocabularies ride identical OTel plumbing.

    Beyond Viewing Traces: Automated Cross-Run Analysis

    There's a second payoff to standardized span semantics that gets much less airtime: traces stop being something you only look at and become something software can reason over.

    A human debugging one bad run reads one trace tree. Finding patterns (the same tool failing the same way across hundreds of runs, a planning loop that wastes tokens on every third request) requires analyzing spans in bulk, and that's only tractable when every span labels itself the same way. An analysis engine can find "recurring tool-call failures" precisely because every tool execution is marked openinference.span.kind = TOOL and every requested call carries tool_call.function.name, regardless of which framework produced it.
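
    A bulk query of that kind can be very short once spans label themselves consistently. The following is a minimal sketch, assuming spans have been exported as dicts with a status field and an attributes map; that shape and the recurring_tool_failures helper are illustrative assumptions, and this is not how Halo is implemented.

```python
from collections import Counter


def recurring_tool_failures(spans, min_count=2):
    """Count failed TOOL spans by tool name and keep the repeat offenders."""
    failures = Counter(
        s["attributes"].get("tool.name", s["name"])
        for s in spans
        if s["attributes"].get("openinference.span.kind") == "TOOL"
        and s.get("status") == "ERROR"
    )
    return {name: n for name, n in failures.items() if n >= min_count}


def tool_span(tool, status):
    return {
        "name": tool,
        "status": status,
        "attributes": {"openinference.span.kind": "TOOL", "tool.name": tool},
    }


# Made-up spans standing in for several collected runs.
collected = [
    tool_span("lookup_order", "ERROR"),
    tool_span("lookup_order", "ERROR"),
    tool_span("lookup_order", "OK"),
    tool_span("issue_refund", "ERROR"),
    {"name": "ChatCompletion", "status": "ERROR",
     "attributes": {"openinference.span.kind": "LLM"}},
]

repeat_offenders = recurring_tool_failures(collected)
```

    Nothing in the function knows which framework produced the spans; it depends only on the span kind and tool name attributes.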

    This is exactly how Halo works. Halo (Hierarchical Agent Loop Optimization) is an open-source, RLM-based engine that reads OpenTelemetry-compatible spans, decomposes them to find systemic failure modes across many runs, and writes up concrete fixes with citations back to the specific trace IDs that show each problem. You can run it self-hosted with pip install halo-engine, or hosted inside Catalyst, where the Agents dashboard runs it on your collected traces on demand or on a schedule. That workflow (trace, analyze, fix, re-run) turns observability from a debugging aid into an improvement loop.

    None of that works on POST api.openai.com 200. Standards-shaped spans are what make traces analyzable, and analyzable traces are what make agents improvable.

    Checklist: Is Your Tracing Actually OpenInference/OTel Compliant?

    Pull up a recent trace, sample a few spans, and verify:

    • Every span declares openinference.span.kind: no untyped spans in the tree.
    • LLM spans carry llm.model_name and token counts in llm.token_count.prompt / .completion / .total. Without these, no cost accounting.
    • Messages are structured, not blobs: llm.input_messages.{i}.message.role and .content, not one giant string.
    • Tool calls are explicit: tool_call.function.name and tool_call.function.arguments on the requesting LLM span, and a TOOL span for the execution itself.
    • AGENT spans carry a stable agent.id and a session.id: IDs that survive deploys, not per-process UUIDs.
    • Export is OTLP with endpoint and auth in configuration, not hardcoded into the app.
    • Spans actually flush: shutdown() on exit for scripts, forceFlush() per invocation in serverless.

    And the four mistakes that cause most non-compliance: constructing SDK clients before instrumentation initializes, minting a new agent ID per process, losing spans to un-flushed serverless invocations, and assuming gen_ai instrumentation captures message content by default when it's opt-in. The full attribute reference lists every wire key worth checking against.
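
    The attribute-level items in the checklist can be sketched as an automated check over exported spans. This is a rough starting point rather than a full validator: check_span and the span dict shape are assumptions for illustration, and it covers only the keys named in this guide, not tool calls, export configuration, or flushing.

```python
REQUIRED_LLM_KEYS = (
    "llm.model_name",
    "llm.token_count.prompt",
    "llm.token_count.completion",
    "llm.token_count.total",
)


def check_span(span):
    """Return a list of checklist violations for one span (empty means it passes)."""
    attrs = span.get("attributes", {})
    problems = []
    kind = attrs.get("openinference.span.kind")
    if kind is None:
        problems.append("missing openinference.span.kind")
    if kind == "LLM":
        problems += [f"missing {key}" for key in REQUIRED_LLM_KEYS if key not in attrs]
        has_structured_messages = any(
            key.startswith("llm.input_messages.") and key.endswith(".message.role")
            for key in attrs
        )
        if not has_structured_messages:
            problems.append("no structured llm.input_messages")
    if kind == "AGENT":
        problems += [f"missing {key}" for key in ("agent.id", "session.id") if key not in attrs]
    return problems


good_llm = {"attributes": {
    "openinference.span.kind": "LLM",
    "llm.input_messages.0.message.role": "user",
    "llm.input_messages.0.message.content": "Where's my order #1842?",
    "llm.model_name": "gpt-4o-mini",
    "llm.token_count.prompt": 212,
    "llm.token_count.completion": 18,
    "llm.token_count.total": 230,
}}
thin_llm = {"attributes": {"openinference.span.kind": "LLM", "llm.model_name": "gpt-4o-mini"}}
untyped = {"attributes": {"http.method": "POST"}}

report = {name: check_span(s) for name, s in
          [("good_llm", good_llm), ("thin_llm", thin_llm), ("untyped", untyped)]}
```

    Running a check like this against a sample of recent spans catches untyped spans and missing token counts before they show up as gaps in a dashboard.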

    Get the two layers right, with OpenTelemetry moving the spans and OpenInference making them mean something, and everything downstream gets easier: debugging, cost tracking, backend portability, and automated analysis all read from the same standardized record. If you want that without hand-authoring spans, the fastest route is an SDK that emits the conventions for you.

    Trace your first agent run in minutes

    Install the Catalyst tracing SDK and call setup() before your clients. You get the full trace tree: agent, LLM, and tool spans with cost, latency, and token usage.
