    Jun 10, 2026

    LLM Gateway: What It Is and How to Build One (2026)

    Inference Research

    What Is an LLM Gateway?

    Your team started with one OpenAI integration. Then someone added Anthropic for the long-context work, Gemini for the cheap classification calls, and a Groq endpoint because latency mattered for one feature. Now you have four SDKs, four retry implementations, four billing dashboards, and no good answer when finance asks which feature blew up the bill last Tuesday.

    An LLM gateway is a proxy layer that sits between your application and your model providers. It centralizes routing, authentication, failover, cost tracking, and observability behind a single endpoint, so every LLM request in your stack flows through one place you control instead of N places you don't.

    In 2026, the gateway has stopped being an optional add-on. It's the single point in your architecture where cost, reliability, and governance can actually be enforced. Provider APIs won't do it for you, and application code is the wrong place to try.

    This guide covers what a gateway does, how LLM routing and fallbacks work under the hood, the cost and observability layers that justify the whole exercise, and an honest build-vs-buy framework based on what's available in 2026. It's written for engineers who already run multiple providers in production, or who are about to.

    Read time: 17 minutes


    What an LLM Gateway Does

    Strip away vendor branding and every serious gateway has the same four planes:

    1. Data plane. The proxy itself: terminate the request, authenticate the caller, attach the right provider credentials, dispatch to the upstream provider, and stream the response back without buffering.
    2. Routing and policy plane. The decision layer: which provider and model serves this request, what happens when it fails, which budgets and rate limits apply.
    3. Capture plane. Asynchronous recording of everything the request reveals: payloads, token counts, cost, latency. This must never block the hot path.
    4. Control plane. Where humans interact with the system: dashboards, API key management, and downstream workflows like dataset building and evals.

    Most "what is an LLM gateway" content stops at "it routes requests to multiple providers." That undersells the capture and control planes, which in practice are where most of the long-term value accumulates. Routing saves you an outage; capture gives you the data to improve your product every week.

    The policy plane is also where teams hang everything else that needs a chokepoint: request guardrails, PII redaction, allowed-model lists per team, and environment separation. None of those are reliable as application-level conventions. They become reliable when the only path to a provider runs through a component that enforces them.

    LLM Gateway vs API Gateway

    If you already run Kong or an ALB, it's fair to ask why you need another proxy. The answer is that LLM traffic breaks the assumptions classic API gateways are built on.

    | Concern | API gateway | LLM gateway |
    |---|---|---|
    | Unit of cost | Per request | Per token (input, output, cached) |
    | Latency metric | Response time | Time to first token, tokens per second |
    | Routing input | Path, host, headers | Model, cost, deployment health, task |
    | Payload handling | Opaque pass-through | Prompts and completions captured for analysis |
    | Failure handling | Retry the same backend | Retry, then fall back across models and providers |
    | Rate limiting | Requests per second | Tokens and spend per key, team, or task |

    The deepest difference is measurement. An API gateway sees a request and a response; an LLM gateway has to understand tokens and streams. Time to first token and tokens per second only exist at the stream level. Application code can time them, but only if every call site is instrumented the same way; a proxy sitting in the request path measures them once, for all traffic. A standard API gateway sits in the path too, but it doesn't parse SSE streams or token usage, so the numbers that determine your user experience and your bill never get recorded.
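
As a rough illustration of stream-level measurement from the proxy's position, the sketch below wraps an upstream stream, forwards each chunk immediately, and reports time to first token and tokens per second when the stream ends. The names (`meterStream`, `StreamMetrics`, the `Chunk` shape) are hypothetical, and it assumes the proxy already knows the token count of each chunk; a real gateway would derive that from the provider's usage data or a tokenizer.

```typescript
// Hypothetical sketch: these names are illustrative, not any gateway's API.
interface Chunk {
  text: string;
  tokens: number; // assumption: token count per chunk is already known
}

interface StreamMetrics {
  ttftMs: number | null;          // request start to first chunk
  outputTokens: number;           // tokens seen across all chunks
  tokensPerSecond: number | null; // rate measured after the first chunk
}

// Forwards every chunk as soon as it arrives and reports metrics once the
// stream ends, including when the consumer stops reading early.
async function* meterStream(
  upstream: AsyncIterable<Chunk>,
  onDone: (metrics: StreamMetrics) => void,
  now: () => number = Date.now,
): AsyncGenerator<Chunk> {
  const startedAt = now();
  let firstAt: number | null = null;
  let lastAt = startedAt;
  let firstChunkTokens = 0;
  let outputTokens = 0;

  try {
    for await (const chunk of upstream) {
      const t = now();
      if (firstAt === null) {
        firstAt = t;
        firstChunkTokens = chunk.tokens;
      }
      lastAt = t;
      outputTokens += chunk.tokens;
      yield chunk; // pass through without buffering
    }
  } finally {
    const generationMs = firstAt === null ? 0 : lastAt - firstAt;
    onDone({
      ttftMs: firstAt === null ? null : firstAt - startedAt,
      outputTokens,
      tokensPerSecond:
        generationMs > 0
          ? ((outputTokens - firstChunkTokens) / generationMs) * 1000
          : null,
    });
  }
}
```

The clock is injected so the measurement can be tested deterministically. The rate excludes the first chunk's tokens because the time spent producing them is already counted in TTFT.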

    Transparent Proxies vs Translating Gateways

    Gateways split into two architectural families, and the choice shapes everything downstream.

    A transparent proxy forwards your requests as-is. You keep your existing provider API keys and your existing request format; the gateway captures and forwards. Catalyst Gateway works this way: you point your SDK at https://api.inference.net/v1, add a couple of headers, and every request is captured with cost, latency, and full request/response payloads, with roughly 10ms of added overhead.

    A translating gateway exposes one API shape, usually OpenAI-compatible, and converts each request to the upstream provider's native format. LiteLLM is the reference open-source implementation, exposing more than 100 providers through the OpenAI format as both a Python SDK and a deployable proxy server. OpenRouter is the managed version of the same idea, aggregating 400+ models behind a single API.

    Translation buys you provider portability at the cost of a thicker abstraction: the gateway must keep pace with every provider API change, and provider-specific features can lag or leak. Transparency buys you fidelity (the provider sees exactly what you sent) at the cost of your application still speaking each provider's dialect.
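
To make the translation cost concrete, here is a minimal sketch of what a translating gateway does for one provider pair: converting an OpenAI-shaped chat request into the shape of Anthropic's Messages API, which takes the system prompt as a top-level field and requires `max_tokens`. The types are simplified to text-only messages, and `toAnthropicRequest`, the default token limit, and the model identifiers in the usage note are assumptions for illustration, not any gateway's implementation.

```typescript
// Simplified, text-only request shapes. Tool calls, images, and streaming
// options are omitted; a real translator has to handle all of them.
type OpenAIMessage = {
  role: "system" | "user" | "assistant";
  content: string;
};

interface OpenAIChatRequest {
  model: string;
  messages: OpenAIMessage[];
  max_tokens?: number;
}

interface AnthropicMessagesRequest {
  model: string;
  system?: string; // system prompt is a top-level field, not a message
  messages: { role: "user" | "assistant"; content: string }[];
  max_tokens: number; // required upstream, optional in the OpenAI shape
}

// Hypothetical helper: upstreamModel and defaultMaxTokens are assumptions.
function toAnthropicRequest(
  req: OpenAIChatRequest,
  upstreamModel: string,
  defaultMaxTokens = 1024,
): AnthropicMessagesRequest {
  // Collapse all system messages into the single top-level field.
  const system = req.messages
    .filter((m) => m.role === "system")
    .map((m) => m.content)
    .join("\n\n");

  const messages = req.messages
    .filter((m) => m.role !== "system")
    .map((m) => ({
      role: m.role as "user" | "assistant",
      content: m.content,
    }));

  const out: AnthropicMessagesRequest = {
    model: upstreamModel,
    messages,
    max_tokens: req.max_tokens ?? defaultMaxTokens,
  };
  if (system.length > 0) out.system = system;
  return out;
}
```

Calling `toAnthropicRequest(request, "some-upstream-model-id")` yields a body for the provider's native endpoint. Even this small case involves two policy decisions the caller never sees (how to merge system messages, which token limit to assume), which is the "thicker abstraction" trade-off in miniature.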


    Gateway vs Direct Provider Integrations

    Direct integration is the default, and for a while it's fine. The cost shows up as provider count grows, because everything multiplies.

    With direct integrations, each provider means another SDK to version-bump, another authentication scheme to vault, another retry-and-timeout implementation that subtly disagrees with the others, and another billing export to reconcile at the end of the month. Cross-provider failover doesn't exist, because no single component sees more than one provider. Cost attribution requires every team to tag requests the same way in five different codebases, which is to say it doesn't happen.

    A gateway converts that linear integration tax into a constant. One endpoint, one auth handshake, one retry policy, one place where every request is recorded automatically in a single schema, without touching request logic. The capture point matters as much as the routing. More on that in the observability section.

    Honesty requires the other half: you might not need one. If you run a single provider at modest volume, with no compliance pressure and no on-call pain, a gateway is premature infrastructure. The tipping points that change the answer are concrete: the second provider, the first surprise bill, the first provider outage that takes your product down with it, and the first incident where nobody can say which prompt version caused the regression.

    If you've hit two of those four, you're past the point where direct integrations are cheaper.

    There's also a migration argument that's easy to miss. Once a gateway fronts your traffic, swapping providers stops being a code change and becomes a routing change. Teams that negotiated better pricing, or needed to move a workload off a deprecated model, do it at the gateway in an afternoon instead of coordinating releases across every service that calls an LLM.


    LLM Routing, Fallback, and Failover

    LLM routing is the part of the gateway that decides, per request, which provider, model, or deployment does the work. It's also the part most articles hand-wave. Let's be specific.

    Model Routing Strategies

    The cleanest public taxonomy of model routing strategies is LiteLLM's router, which ships six: simple-shuffle (the default weighted pick), latency-based-routing, usage-based-routing-v2 (TPM/RPM-aware, coordinated through Redis), least-busy, cost-based-routing, and fully custom strategies via CustomRoutingStrategyBase.

    | Strategy | What it optimizes | When to use |
    |---|---|---|
    | simple-shuffle (default) | Weighted random pick across deployments | Default; stateless and predictable |
    | latency-based-routing | Routes to the lowest-latency deployment | Latency-sensitive, user-facing traffic |
    | usage-based-routing-v2 | Respects TPM/RPM limits via Redis-tracked usage | Multiple rate-limited deployments near capacity |
    | least-busy | Fewest in-flight requests | Spiky concurrency |
    | cost-based-routing | Picks the lowest-cost deployment | Cost-first batch workloads |
    | Custom (CustomRoutingStrategyBase) | Whatever you encode | Task-tier routing, compliance routing |

    Whatever strategy you choose, it's only as good as the health data feeding it. Latency-based routing needs recent latency samples per deployment; usage-based routing needs accurate token counters; cost-based routing needs current pricing tables. Stale inputs turn a smart router into a random one, so treat the router's data sources as production dependencies with their own monitoring.

    Two practical notes from production use. First, weighted shuffle is the default for a reason: it's stateless, predictable, and doesn't require a Redis deployment to coordinate usage counters. Start there and graduate to latency- or usage-based routing when you have evidence you need it.
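
For intuition, a weighted random pick fits in a few lines. This is a generic sketch of the technique, not LiteLLM's implementation; `Deployment`, `pickWeighted`, and the deployment ids are hypothetical, and the random source is injected so the behavior can be tested.

```typescript
// Hypothetical shape: an id plus a relative traffic weight.
interface Deployment {
  id: string;
  weight: number;
}

// Picks one deployment with probability proportional to its weight.
// Stateless: no counters, no shared store, nothing to coordinate.
function pickWeighted(
  deployments: Deployment[],
  random: () => number = Math.random,
): Deployment {
  const total = deployments.reduce((sum, d) => sum + d.weight, 0);
  if (deployments.length === 0 || total <= 0) {
    throw new Error("no routable deployments");
  }
  let threshold = random() * total;
  for (const d of deployments) {
    threshold -= d.weight;
    if (threshold < 0) return d;
  }
  // Only reachable through floating-point drift; fall back to the last entry.
  return deployments[deployments.length - 1];
}
```

With weights of 99 and 1, the second deployment receives roughly 1% of traffic, which is also a cheap way to keep a secondary provider warm.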

    Second, the highest-leverage form of model routing isn't load balancing at all. It's task-tier routing. Send the cheap, fast model everything by default, and escalate to the expensive model only when the task demands it. The gateway is the natural place to encode that policy, because it's the one component that sees every request and its cost.
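
A task-tier policy can be as small as the sketch below: default to the cheap model and escalate on an explicit task list or an oversized prompt. The policy shape, the escalation criteria, and the model names in the usage note are all assumptions for illustration; real policies often add signals such as past eval scores per task.

```typescript
// Hypothetical policy shape for task-tier routing.
interface TierPolicy {
  defaultModel: string;           // cheap, fast model used for everything by default
  escalatedModel: string;         // expensive model reserved for demanding work
  escalateTaskIds: string[];      // tasks known to need the stronger model
  maxDefaultPromptTokens: number; // prompts above this size are escalated
}

// Returns the model that should serve this request under the policy.
function chooseModel(
  taskId: string | undefined,
  promptTokens: number,
  policy: TierPolicy,
): string {
  const taskNeedsEscalation =
    taskId !== undefined && policy.escalateTaskIds.includes(taskId);
  const promptTooLarge = promptTokens > policy.maxDefaultPromptTokens;
  return taskNeedsEscalation || promptTooLarge
    ? policy.escalatedModel
    : policy.defaultModel;
}
```

For example, with `escalateTaskIds: ["contract-review"]`, a request tagged `contract-review` goes to the escalated model while an untagged classification call stays on the default. Because the rule lives in one function at the gateway, changing the tier boundary is a policy edit rather than a change in every calling service.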

    Retries, Cooldowns, and Fallback Chains

    Failure handling in a gateway is an escalation ladder, not a single retry loop:

    1. Retry the same deployment. Transient errors get retried in place, with exponential backoff for rate-limit errors.
    2. Re-pick within the model group. If retries fail, the router selects a different deployment of the same model using its existing weights.
    3. Cool down the failing deployment. A deployment that keeps failing gets suspended automatically so traffic stops hitting it.
    4. Fall back across model groups. When a whole model group is unhealthy, requests move to a designated fallback model.
    5. Degrade gracefully. When everything is down, return a cached or reduced response rather than an error, if the product allows it.

    LiteLLM implements steps one through four concretely: configurable num_retries with exponential backoff on rate limits, re-picking deployments within the model group before cross-group fallbacks, automatic cooldowns once a deployment exceeds allowed_fails (default: three failures per minute), and per-deployment timeouts with fallback escalation.

    Here's what that looks like as configuration:

    # LiteLLM proxy config: a primary model, a fallback chain, retries, and cooldowns.
    model_list:
      - model_name: zephyr-beta # primary: a self-hosted deployment
        litellm_params:
          model: huggingface/HuggingFaceH4/zephyr-7b-beta
          api_base: http://0.0.0.0:8001
      - model_name: gpt-3.5-turbo # fallback: a hosted provider
        litellm_params:
          model: gpt-3.5-turbo
          api_key: <my-openai-key>
    
    litellm_settings:
      num_retries: 3 # retry the call 3 times on each model_name
      request_timeout: 10 # seconds before a request counts as failed
      fallbacks: [{"zephyr-beta": ["gpt-3.5-turbo"]}] # tried in order, after retries are exhausted
      allowed_fails: 3 # cooldown a model if it fails more than this per minute
      cooldown_time: 30 # seconds the failing model stays out of rotation

    The structure to notice: reliability is configured declaratively at the router, not scattered through application code. When you change your fallback model, you change one config block, not eleven call sites.
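
To show what a router does with settings like these, the sketch below implements steps one through four of the escalation ladder in simplified form. It is not LiteLLM's code: `createRouter`, `RouterConfig`, and the callback signatures are hypothetical, deployments are tried in list order rather than by weight, and backoff is applied to every error rather than only rate limits.

```typescript
// Hypothetical sketch of the escalation ladder; names are illustrative.
interface Deployment {
  id: string;
  group: string; // model group this deployment belongs to
}

interface RouterConfig {
  numRetries: number;                  // extra attempts per deployment
  allowedFails: number;                // failures per minute before cooldown
  cooldownMs: number;                  // how long a failing deployment sits out
  fallbacks: Record<string, string[]>; // model group -> ordered fallback groups
}

type CallFn = (d: Deployment, prompt: string) => Promise<string>;

function createRouter(
  deployments: Deployment[],
  cfg: RouterConfig,
  call: CallFn,
  now: () => number = Date.now,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
) {
  const recentFailures = new Map<string, number[]>();
  const cooledUntil = new Map<string, number>();

  // Deployments in a group that are not currently cooling down.
  function healthy(group: string): Deployment[] {
    const t = now();
    return deployments.filter(
      (d) => d.group === group && (cooledUntil.get(d.id) ?? 0) <= t,
    );
  }

  function recordFailure(d: Deployment): void {
    const t = now();
    const recent = (recentFailures.get(d.id) ?? []).filter(
      (ts) => t - ts < 60000,
    );
    recent.push(t);
    recentFailures.set(d.id, recent);
    // Step 3: too many failures within a minute suspends the deployment.
    if (recent.length > cfg.allowedFails) {
      cooledUntil.set(d.id, t + cfg.cooldownMs);
    }
  }

  async function complete(group: string, prompt: string): Promise<string> {
    const groups = [group, ...(cfg.fallbacks[group] ?? [])];
    for (const g of groups) {                // Step 4: fall back across groups
      for (const d of healthy(g)) {          // Step 2: re-pick within the group
        for (let attempt = 0; attempt <= cfg.numRetries; attempt++) {
          try {
            return await call(d, prompt);    // Step 1: retry the same deployment
          } catch {
            recordFailure(d);
            if (attempt < cfg.numRetries) await sleep(100 * 2 ** attempt);
          }
        }
      }
    }
    // Step 5 (graceful degradation) is a product decision and is left out.
    throw new Error(`all deployments failed for model group "${group}"`);
  }

  return { complete };
}
```

The point of the sketch is the ordering: retries are exhausted on one deployment before the router moves sideways within the group, and only then across groups. The cooldown state is what stops the next request from paying for the same retries again.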

    One subtlety that separates toy gateways from production ones: fallbacks interact with streaming. If the model fails after streaming half a response to the user, the gateway can't transparently re-run the request against a fallback without the application's cooperation. Decide upfront whether mid-stream failures retry silently (and eat the duplicate cost), surface as errors, or degrade. Then test that path before an outage forces the issue.

    The Gateway Is Your New Single Point of Failure

    The uncomfortable corollary of centralizing all LLM traffic: the gateway becomes the thing that takes you down.

    This isn't hypothetical. OpenRouter, one of the largest managed gateways, had two outages in February 2026 (38 minutes on February 17, 35 minutes on February 19) when a third-party caching dependency failed, dropped database connections, and cascaded into authentication failures across the platform. They've since added circuit breakers and fallback caching. The lesson isn't that OpenRouter is bad at operations; it's that even teams whose entire business is gateway reliability get bitten by their own dependencies.

    Design for it from day one:

    • Keep a bypass path. Your application should be able to talk to providers directly if the gateway is down. Transparent proxies make this trivial, because you still hold your own provider keys and your requests are already in the provider's native format.
    • Health-check the gateway itself, not just the providers behind it.
    • Treat gateway dependencies (cache, database, auth) as part of your availability math, because they are.
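
A bypass path for a transparent proxy can be sketched as a small config switch. The gateway URL and header names come from this article; `https://api.openai.com/v1` is OpenAI's own endpoint. Everything else (`createHealthTracker`, `resolveClientConfig`, the consecutive-failure threshold) is a hypothetical illustration, and how failures get reported to the tracker is left to the application.

```typescript
// Hypothetical sketch of a gateway bypass for an OpenAI-compatible client.
interface ClientConfig {
  baseURL: string;
  apiKey: string;
  defaultHeaders: Record<string, string>;
}

interface Keys {
  gatewayKey: string;  // project key that authenticates to the gateway
  providerKey: string; // your own provider key, which makes the bypass possible
}

const GATEWAY_URL = "https://api.inference.net/v1";
const DIRECT_URL = "https://api.openai.com/v1";

// Tracks the gateway's own health, separately from provider health.
function createHealthTracker(failureThreshold: number) {
  let consecutiveFailures = 0;
  return {
    recordSuccess(): void {
      consecutiveFailures = 0;
    },
    recordFailure(): void {
      consecutiveFailures += 1;
    },
    isHealthy(): boolean {
      return consecutiveFailures < failureThreshold;
    },
  };
}

// Routes through the gateway when it is healthy, otherwise straight to the provider.
function resolveClientConfig(gatewayHealthy: boolean, keys: Keys): ClientConfig {
  if (gatewayHealthy) {
    return {
      baseURL: GATEWAY_URL,
      apiKey: keys.gatewayKey,
      defaultHeaders: {
        "x-inference-provider-api-key": keys.providerKey,
        "x-inference-provider": "openai",
      },
    };
  }
  // Bypass: same request format, provider key used directly. Capture is lost
  // for the duration, so the bypass should be alerted on, not silent.
  return { baseURL: DIRECT_URL, apiKey: keys.providerKey, defaultHeaders: {} };
}
```

The returned object has the fields an OpenAI SDK client constructor accepts (`baseURL`, `apiKey`, `defaultHeaders`), so the switch stays in one place instead of at every call site.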

    Cost Controls That Actually Work

    Application-level cost tracking fails for predictable reasons. Client-side token estimates drift from what providers actually bill. Attribution requires every feature team to tag requests consistently, forever. And by the time the invoice arrives, the spend already happened.

    The gateway fixes all three, because it's a chokepoint. Every request passes through it, so every request can be measured, attributed, and capped in one place.

    The mechanisms that matter:

    • Per-key budgets and virtual keys. LiteLLM's proxy issues virtual keys with spend tracking per project and per user, plus rate limiting on top. A team that exhausts its budget gets stopped at the gateway, not discovered in the invoice.
    • True per-request cost capture. Catalyst records cost per call and aggregate spend from the actual request and response, with cost broken down by input, cached input, output, and reasoning tokens.
    • Attribution by tagging at the chokepoint. Catalyst's header contract makes this concrete: tag requests with an environment (x-inference-environment), group them under a logical task (x-inference-task-id), or attach arbitrary metadata via x-inference-metadata-* headers, then filter spend by those dimensions. Tag once at the gateway client, and every dashboard downstream inherits the dimension.
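
Tagging at the client can be wrapped in one helper so call sites never build header names by hand. The header names below are the ones described in the list above; the helper itself (`tagHeaders`, `RequestTags`) and the example values are hypothetical.

```typescript
// Hypothetical helper that turns attribution tags into gateway headers.
interface RequestTags {
  environment?: string;              // for example "production" or "staging"
  taskId?: string;                   // logical task used for grouping
  metadata?: Record<string, string>; // arbitrary key/value dimensions
}

function tagHeaders(tags: RequestTags): Record<string, string> {
  const headers: Record<string, string> = {};
  if (tags.environment) headers["x-inference-environment"] = tags.environment;
  if (tags.taskId) headers["x-inference-task-id"] = tags.taskId;
  for (const [key, value] of Object.entries(tags.metadata ?? {})) {
    // The gateway strips the prefix to recover the metadata key.
    headers[`x-inference-metadata-${key}`] = value;
  }
  return headers;
}
```

A call such as `tagHeaders({ environment: "production", taskId: "ticket-triage", metadata: { team: "support" } })` can be merged into a client's default headers, or passed per request when one client serves several tasks.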

    A note on enforcement style: budgets come in soft and hard variants, and you want both. Soft limits alert a human when a key crosses a threshold, which catches the slow leak: the retry loop that quietly doubled traffic to a feature. Hard caps stop requests outright, which is the only thing that catches the fast leak: a deploy that accidentally points a batch job at your most expensive model on a Friday night. Teams that configure only alerts learn the difference from an invoice.
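
The soft/hard distinction reduces to a three-way decision made before each request is dispatched. The sketch below is a generic illustration; `checkBudget`, the `Budget` shape, and the dollar figures in the usage note are assumptions, and a real gateway would also need to estimate request cost before the response exists.

```typescript
// Hypothetical budget check run at the gateway before dispatch.
type BudgetDecision = "allow" | "allow_and_alert" | "block";

interface Budget {
  softLimitUsd: number; // crossing this alerts a human
  hardCapUsd: number;   // crossing this stops the request
}

function checkBudget(
  spentUsd: number,
  estimatedCostUsd: number,
  budget: Budget,
): BudgetDecision {
  const projected = spentUsd + estimatedCostUsd;
  if (projected > budget.hardCapUsd) return "block";
  if (projected >= budget.softLimitUsd) return "allow_and_alert";
  return "allow";
}
```

With a soft limit of 80 and a hard cap of 100, a key that has spent 79.50 and is about to spend 1.00 is allowed with an alert, and the same request at 99.50 is blocked. The hard cap is checked first so that a misconfigured soft limit above the cap cannot let spend through.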

    One pricing-model gotcha worth knowing before you pick a gateway: aggregators and transparent proxies bill differently. OpenRouter passes through provider pricing with no markup on inference, but charges a fee on credit purchases and a separate BYOK fee when you bring your own provider keys. A transparent proxy like Catalyst leaves billing with your provider entirely, because your own API keys ride along on every request. Neither model is wrong; they're just different line items to budget for.


    The Observability Layer a Gateway Unlocks

    Observability is usually listed as a gateway "feature," one bullet among twelve. That undersells it. For most teams, the capture plane ends up being the reason the gateway pays rent.

    Start with what the proxy sees for free. TTFT and tokens per second are the metrics application code rarely measures consistently, and they're the metrics that define perceived latency for streaming UIs. Beyond those, a gateway in the request path records the full request and response payloads, cost per call, input and output token counts, error rates and status codes, and the model and provider for every request, all with zero changes to your request logic.

    What should you expect from the dashboard side? As a benchmark, Catalyst's Metrics Explorer shows duration and time-to-first-token as percentile charts (p50, p75, p90, p99), alongside request volume, error rate, cost and token breakdowns, payload sizes, peak throughput, and model distribution, all filterable by task, model, provider, environment, and status. Percentiles are the detail to insist on. Averages hide the p99 stall that makes users close the tab, and a gateway that can't show you tail latency per model can't tell you which provider is actually slow.
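
A small worked example shows why the average misleads. The sketch uses the nearest-rank percentile definition, which is one of several common conventions; the function names and the sample latencies are made up for illustration.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1];
}

function mean(samples: number[]): number {
  return samples.reduce((sum, x) => sum + x, 0) / samples.length;
}

// Illustrative data: 98 requests at 800 ms and 2 stalls at 12 s.
const latenciesMs: number[] = [
  ...Array.from({ length: 98 }, () => 800),
  12000,
  12000,
];

const summary = {
  mean: mean(latenciesMs),          // 1024
  p50: percentile(latenciesMs, 50), // 800
  p99: percentile(latenciesMs, 99), // 12000
};
```

The mean of 1024 ms looks like a mild slowdown, while p50 says the typical request is fine and p99 says one request in fifty takes twelve seconds. Only the percentile view points at the stall.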

    A gateway is not a tracing system, and it's worth keeping the two layers straight. The gateway captures one record per LLM request from the proxy's vantage point. Tracing instruments your code from the inside, capturing the full agent hierarchy (LLM calls, tool calls, framework steps) via SDKs that emit OpenInference-shaped OpenTelemetry over OTLP/HTTP. They're independent and complementary: the gateway tells you what every request cost and how it performed; traces tell you why the agent did what it did.

    From Captured Traffic to Datasets and Evals

    Here's the payoff almost no gateway discussion mentions: captured traffic is training data.

    Once requests flow through the gateway, you can filter the capture by model, task, provider, or status code, and save the result as a dataset for evals or fine-tuning. That turns the gateway from a cost center into a flywheel: production traffic becomes eval sets that catch regressions, and training sets that let you fine-tune a smaller task-specific model on exactly the distribution your app sees.

    The discipline that makes it work: keep eval sets small, stable, and hard; keep training sets large, diverse, and representative; never let the two overlap.
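
One way to guarantee that eval and training sets never overlap is to assign each captured example to a split by hashing a stable key, so the same example always lands on the same side no matter when it is captured. The sketch below uses the 32-bit FNV-1a hash; the function names, the choice of key, and the 5% eval share in the usage note are assumptions for illustration.

```typescript
// 32-bit FNV-1a hash over UTF-16 code units; returns an unsigned integer.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0;
}

type Split = "eval" | "train";

// Deterministic assignment: the same key always maps to the same split.
// evalPercent is the share of examples (0 to 100) reserved for evals.
function assignSplit(key: string, evalPercent: number): Split {
  return fnv1a(key) % 100 < evalPercent ? "eval" : "train";
}

// Partitions examples by a caller-supplied key, for example a normalized prompt.
function partition<T>(
  examples: T[],
  keyOf: (example: T) => string,
  evalPercent: number,
): { evalSet: T[]; trainSet: T[] } {
  const evalSet: T[] = [];
  const trainSet: T[] = [];
  for (const example of examples) {
    const target =
      assignSplit(keyOf(example), evalPercent) === "eval" ? evalSet : trainSet;
    target.push(example);
  }
  return { evalSet, trainSet };
}
```

Hashing the normalized prompt rather than the request id, for instance `partition(rows, (r) => r.prompt.trim().toLowerCase(), 5)`, means duplicate prompts captured on different days cannot straddle the boundary and leak eval content into training.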


    Build vs Buy: A Decision Framework

    Every platform team eventually sketches their own gateway. Sometimes that's right. Here's how to decide, given what's actually on the market in 2026.

    What Building Actually Involves

    The naive gateway is a weekend project: an HTTP service, a provider map, a retry loop. The production gateway is a different animal:

    • Streaming pass-through without buffering. Buffer the SSE stream and you've destroyed TTFT for every user.
    • Routing state. Cooldown tracking, health scoring, and usage counters that survive restarts and coordinate across replicas.
    • Key vaulting. You're now the custodian of every provider credential in the company.
    • A capture pipeline that never blocks the hot path, plus storage and dashboards for what it captures.
    • An on-call rotation. Remember the February OpenRouter outages: a team whose entire product is gateway reliability still lost two production windows to a caching dependency. Your two-engineer platform team will not do better while also doing everything else.
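
Of the items above, the capture pipeline is the easiest to get subtly wrong, so here is a minimal sketch of the core idea: the hot path only appends to a bounded in-memory buffer, and a separate flush (run on a timer or worker) ships batches to storage. All names are hypothetical, and the policy of dropping records when the buffer is full is an assumption; some teams prefer to spill to disk instead.

```typescript
// Hypothetical capture record and sink; a real record carries payloads too.
interface CaptureRecord {
  requestId: string;
  model: string;
  latencyMs: number;
  costUsd: number;
}

type Sink = (batch: CaptureRecord[]) => Promise<void>;

function createCaptureQueue(maxSize: number, sink: Sink) {
  let buffer: CaptureRecord[] = [];
  let dropped = 0;

  return {
    // Hot path: constant time, never awaits, never throws.
    // Returns false when the record was dropped because the buffer is full.
    enqueue(record: CaptureRecord): boolean {
      if (buffer.length >= maxSize) {
        dropped += 1;
        return false;
      }
      buffer.push(record);
      return true;
    },

    // Off the hot path: swaps the buffer out and ships it as one batch.
    // Returns the number of records delivered to the sink.
    async flush(): Promise<number> {
      const batch = buffer;
      buffer = [];
      if (batch.length === 0) return 0;
      try {
        await sink(batch);
        return batch.length;
      } catch {
        dropped += batch.length; // a failing sink must not back up into requests
        return 0;
      }
    },

    droppedCount(): number {
      return dropped;
    },
  };
}
```

The design choice worth noticing is that a slow or failing sink costs you records, not latency. The dropped counter is what you alert on, because silent capture loss undermines every dashboard built on top of it.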

    The 2026 Landscape

    The buy (or adopt) side has matured fast, and recent shifts changed the calculus:

    • Self-hosted open source. LiteLLM remains the default choice: 100+ providers behind the OpenAI format, virtual keys, spend tracking, and load balancing, with roughly 49,900 GitHub stars and a v1.88.1 release shipped June 9, 2026. The bigger 2026 shift: Portkey open-sourced its entire gateway in March, including the governance, observability, authentication, and cost-control layers that used to require its SaaS. That's software that was processing over a trillion tokens and 120M+ requests daily across 24,000+ organizations at announcement. The repo ships under the MIT license. Full-featured gateways are now genuinely free to run in your own VPC, if you staff the ops.
    • Managed aggregators. OpenRouter raised a $113M Series B in May 2026 led by CapitalG, with weekly volume growing from 5 trillion to 25 trillion tokens in six months across 8M+ developers. Maximum convenience, one bill, 400+ models, and the SPOF and governance trade-offs discussed above.
    • API-gateway vendors moving in. Kong's AI Gateway 3.14 (April 2026) added native agent-to-agent (A2A) traffic support alongside MCP token validation and multi-provider LLM routing. A reasonable path if you're already deep in Kong's ecosystem.
    • Managed transparent proxies. Catalyst Gateway takes the low-friction position: keep your provider keys and request shapes, change a base URL, and get capture, metrics, datasets, and evals on the same traffic. There are dedicated guides for OpenAI, Anthropic, Vertex AI, Gemini, OpenRouter, Cerebras, and Groq, plus an escape hatch to any OpenAI-compatible provider via the x-inference-provider-url header.

    | Path | Representative option | What you get | What it costs you |
    |---|---|---|---|
    | Build in-house | Your own proxy service | Exact fit, full control | Streaming, routing state, key vaulting, capture pipeline, and an on-call rotation; even gateway specialists have outages (OpenRouter: two in Feb 2026) |
    | Self-host OSS | LiteLLM (100+ providers, v1.88.1 Jun 2026) or Portkey (fully open source since Mar 2026, MIT) | In-VPC deployment, no license fees, full feature set | Redis/Postgres dependencies, upgrades, and the ops burden land on your team |
    | Managed aggregator | OpenRouter (400+ models, 25T tokens/week) | One API key, one bill, instant model breadth | Aggregator fees on credits/BYOK; their availability is your availability |
    | API-gateway vendor | Kong AI Gateway 3.14 (LLM + MCP + A2A) | Rides existing Kong investment and governance | Heavy adoption cost if you are not already a Kong shop |
    | Managed transparent proxy | Catalyst Gateway (~10ms overhead, BYO provider keys) | Capture, percentile metrics, datasets/evals on the same traffic; native bypass path | Another vendor in the request path (mitigated by keeping your own keys) |

    Factor in migration cost when comparing paths, in both directions. Getting in is usually cheap, since most gateways adopt via a base-URL change. Getting out depends on the architecture you chose: leaving a transparent proxy means reverting a base URL, while leaving a translating gateway means re-integrating every provider's native API your code stopped speaking. Pick the exit you can live with before you need it.

    The heuristics compress to three questions:

    1. Is routing your product? If you're selling inference or model access, build; the gateway is your moat. Otherwise, don't.
    2. Does compliance require in-VPC? Self-host LiteLLM or Portkey and budget real ops time.
    3. Do you want the capability without the pager? Use a managed gateway, and prefer a transparent proxy if you want to keep direct provider relationships and a bypass path.

    Putting a Gateway in Front of Your Traffic

    For OpenAI-compatible SDKs, adopting a transparent proxy is a base-URL change plus headers. Here's the Catalyst version in TypeScript. The client points at the gateway, your project key authenticates to Catalyst, and your provider key rides along in a header:

    import OpenAI from "openai";
    
    // Point the SDK at the gateway; the project key authenticates to Catalyst.
    const client = new OpenAI({
      baseURL: "https://api.inference.net/v1",
      apiKey: process.env.INFERENCE_API_KEY,
      defaultHeaders: {
        // Your provider key rides along and is forwarded upstream.
        "x-inference-provider-api-key": process.env.OPENAI_API_KEY,
        "x-inference-provider": "openai",
      },
    });
    
    // The request itself is unchanged from a direct OpenAI call.
    const response = await client.chat.completions.create({
      model: "gpt-4.1",
      messages: [{ role: "user", content: "Hello, world!" }],
    });
    
    console.log(response.choices[0].message.content);

    The same request as raw HTTP, which makes the header contract explicit:

    curl https://api.inference.net/v1/chat/completions \
      -H "Authorization: Bearer $INFERENCE_API_KEY" \
      -H "x-inference-provider-api-key: $OPENAI_API_KEY" \
      -H "x-inference-provider: openai" \
      -H "Content-Type: application/json" \
      -d '{
        "model": "gpt-4.1",
        "messages": [{"role": "user", "content": "Hello, world!"}]
      }'

    That's the whole integration. The headers are doing the routing work:

    | Header | Required | What it does |
    |---|---|---|
    | Authorization | Yes | Bearer <project-api-key>; authenticates to the gateway. For OpenAI-compatible SDKs, set it as the SDK's apiKey |
    | x-inference-provider-api-key | When proxying a provider | Your provider's API key; forwarded downstream as bearer auth (Anthropic's native /v1/messages uses x-api-key instead) |
    | x-inference-provider | Optional | Forces a provider (openai, anthropic, groq, cerebras, vertex-ai, gemini); otherwise inferred from SDK path or base URL |
    | x-inference-provider-url | Optional | Routes to any OpenAI-compatible provider by base URL |
    | x-inference-environment | Optional | Tags the request as production, staging, etc. for filtering |
    | x-inference-task-id | Optional | Groups requests under a logical task for analytics and datasets |
    | x-inference-metadata-* | Optional | Attaches arbitrary metadata; the prefix is stripped to form the key |

    Two habits worth adopting on day one. Tag x-inference-environment and x-inference-task-id from the start, because cost attribution and per-feature metrics only work if requests carry the dimensions. And wire up more than one provider early, even if the second one only takes 1% of traffic. Failover paths that have never carried production traffic don't work when you need them.

    If you'd rather not edit client code by hand, the Inference CLI automates it: inf instrument --mode gateway launches a coding agent that scans your codebase, redirects LLM client base URLs to the gateway, and adds the routing headers, with a --dry-run flag to preview changes.


    FAQ

    What is an LLM gateway? A proxy layer between your application and LLM providers that centralizes routing, authentication, failover, cost tracking, and observability behind a single endpoint. Instead of N provider integrations, you maintain one.

    How is an LLM gateway different from an API gateway? An API gateway routes and rate-limits opaque HTTP. An LLM gateway understands the traffic: tokens, streaming, model identity, and cost. The clearest example is measurement: time to first token and tokens per second are measured consistently only by a proxy in the request path, and only an LLM-aware one records them.

    Do I need an LLM gateway if I only use one provider? For routing, no. But the capture argument can still hold: a gateway records payloads, cost, latency, and token counts with no changes to request logic, which is the cheapest observability you'll ever install.

    How much latency does an LLM gateway add? A transparent proxy adds roughly 10ms, which is noise against multi-second model inference. Translation layers and self-hosted gateways vary with implementation and load; measure TTFT through the gateway before and after.

    Should I build or buy an LLM gateway? Build only if routing is your product. Self-host open source (LiteLLM, or Portkey's now fully open-source gateway) if compliance demands in-VPC. Otherwise use a managed gateway and spend your engineers on your actual product.


    Where to Go From Here

    The LLM gateway pattern is simple to state: put one chokepoint between your application and your model providers, and let it do the routing, failover, cost enforcement, and capture that application code does badly. The hard questions are which chokepoint, and who operates it.

    If you're running multiple providers without a gateway today, start with the failure math: what happens to your product during your primary provider's next outage? Then run the cost math: can you attribute last month's LLM spend by feature? If either answer is uncomfortable, you already know the next move.

    Need provider routing with built-in observability? Catalyst gives you the gateway layer without the DIY ops. Point your SDK at the gateway, keep your provider keys, and get routing headers, percentile metrics, and traffic-to-dataset workflows out of the box. The gateway quickstart takes one base-URL change.


    References

    1. Catalyst Gateway overview — https://docs.inference.net/integrations/gateway/overview
    2. Catalyst Gateway quickstart — https://docs.inference.net/integrations/gateway/quickstart
    3. Catalyst Metrics Explorer — https://docs.inference.net/platform/observe/metrics-explorer
    4. Build a Dataset from Traffic — https://docs.inference.net/platform/datasets/build-from-traffic
    5. Catalyst Tracing overview — https://docs.inference.net/integrations/traces/overview
    6. LiteLLM repository — https://github.com/BerriAI/litellm
    7. LiteLLM Router documentation — https://docs.litellm.ai/docs/routing
    8. OpenRouter Series B announcement — https://openrouter.ai/announcements/series-b
    9. OpenRouter outage postmortem (Feb 17 & 19, 2026) — https://openrouter.ai/announcements/openrouter-outages-on-february-17-and-19-2026
    10. OpenRouter FAQ (pricing model) — https://openrouter.ai/docs/faq
    11. Portkey gateway open-source announcement (Mar 24, 2026) — https://www.globenewswire.com/news-release/2026/03/24/3261574/0/en/portkey-s-gateway-is-now-fully-open-source-processing-over-1-trillion-tokens-every-day.html
    12. Kong AI Gateway 3.14 release — https://konghq.com/blog/product-releases/kong-ai-gateway-3-14