

Introduction
Your product is working. Users are signing up. And somewhere in the background, an LLM is generating tokens, answering questions, summarizing documents, writing code. The more users you get, the more tokens you produce. The more tokens you produce, the bigger your inference bill.
At some point, every company building on LLMs asks the same question: how should I think about this cost?
This post will walk you through the two main pricing models: per-token and dedicated capacity. We'll cover how each works, when one makes more sense than the other, and how to calculate the crossover point for your own product. We'll also look at the tradeoffs around model quality, how fine-tuning has changed the game, and what to consider when operating dedicated infrastructure. By the end, you should have a clear framework for making this decision as your usage grows.
The two pricing models
There are two main ways to pay for LLM inference: per-token pricing, where you pay for exactly what you use, and dedicated capacity, where you rent the hardware and get everything it can produce.
Per-token pricing
Think of per-token pricing like a taxi: you pay per ride, no commitment. Most companies start here and never leave, because simple is underrated.
You call an API, you get a response, you get charged. Prices are listed per million tokens. Claude Sonnet 4.5 costs around $3 per million input tokens and $15 per million output tokens; GPT 5.2 and Gemini Pro 3.0 have similar pricing structures. These numbers change frequently, but the structure stays the same: you pay for consumption.
What exactly is a token? Roughly speaking, a token is about three-quarters of a word in English. The sentence "The quick brown fox jumps over the lazy dog" is about 10 tokens. A typical page of text is around 400-500 tokens. When you send a prompt to an LLM, both your input and the model's output are measured in tokens, and you pay for both.
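You can get a feel for token counts by running text through a tokenizer yourself. The sketch below uses OpenAI's tiktoken library with the cl100k_base encoding as an example; other model families use their own tokenizers, so exact counts will vary.

```python
# pip install tiktoken
import tiktoken

# cl100k_base is one widely used tokenizer; Claude, Gemini, and open-source
# models each have their own, so treat these counts as ballpark figures.
enc = tiktoken.get_encoding("cl100k_base")

sentence = "The quick brown fox jumps over the lazy dog"
tokens = enc.encode(sentence)

print(len(tokens))          # roughly one token per short English word
print(enc.decode(tokens))   # decodes back to the original sentence
```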
Input vs output tokens
One detail that matters more than people realize: input and output tokens are priced differently. Output tokens typically cost 3-5x more than input tokens. Why? Generating new tokens requires more computation than processing existing ones. The model has to run its full prediction process for each output token, while input tokens can be processed in parallel.
This means your product's usage pattern directly affects your bill. A RAG application that stuffs 50,000 tokens of context into every request but only generates short answers is input-heavy. A creative writing tool that takes a brief prompt and produces 2,000 words is output-heavy. An AI agent that reasons through complex problems with long chains of thought? Heavy on both. A classification endpoint that takes a sentence and returns a label? Light on both. Know which one you're building.
A concrete example
Let's make this concrete. Say you're running a document Q&A product. Each query sends 10,000 input tokens (the document context plus the user's question) and generates 500 output tokens (the answer). At $3/$15 per million, that's $0.03 for input and $0.0075 for output, about 3.8 cents per query.
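If you want to sanity-check that arithmetic, here's a minimal sketch. The prices and token counts are the illustrative figures from this example, not quotes; plug in your own.

```python
# Per-token cost for the document Q&A example above.
# Prices and token counts are the illustrative figures from this post.
INPUT_PRICE_PER_M = 3.00    # dollars per million input tokens
OUTPUT_PRICE_PER_M = 15.00  # dollars per million output tokens

def cost_per_query(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens / 1e6) * INPUT_PRICE_PER_M \
         + (output_tokens / 1e6) * OUTPUT_PRICE_PER_M

per_query = cost_per_query(input_tokens=10_000, output_tokens=500)
print(f"${per_query:.4f} per query")                    # $0.0375, i.e. ~3.8 cents
print(f"${per_query * 100_000:,.0f} per 100K queries")  # ~$3,750/month at that volume
```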
Here's how that scales:
| Monthly queries | Per-token cost |
|---|---|
| 10,000 | ~$380 |
| 100,000 | ~$3,800 |
| 1,000,000 | ~$38,000 |
The math is linear and predictable, as long as your usage is.
The beauty of per-token pricing is that your costs scale directly with value delivered. No users, no cost. Lots of users, higher cost but presumably higher revenue too. This alignment makes it easy to reason about unit economics. If each query generates more revenue than it costs in inference, you have a viable business.
Dedicated capacity
Dedicated capacity works differently. Instead of paying per token, you rent GPUs by the hour. Think of it like leasing a car: fixed monthly cost, but you need to drive enough to make it worthwhile. An H100, the workhorse of modern inference, runs about $2-4 per hour from cloud providers, depending on who you use and how you commit. You pay for the GPU whether you're using it or not. What you get in return is all the tokens that GPU can produce.
The economics here are fundamentally different. With per-token, your marginal cost is constant: the millionth token costs the same as the first. With dedicated capacity, your marginal cost approaches zero: once you've paid for the GPU, additional tokens are essentially free. The question becomes whether you can generate enough tokens to make the fixed cost worthwhile.
The model constraint
But here's where it gets interesting: if you want dedicated capacity, you're almost certainly running open-source models. Closed-source models like Claude, GPT, and Gemini don't let you run them on your own infrastructure. You're locked into their per-token pricing. Dedicated capacity means choosing models like Llama, Mistral, or DeepSeek.
This is a fundamental constraint that shapes the entire decision. It's not just about cost optimization. It's about which models you're willing and able to use.
Running the numbers
Model sizes and GPU requirements
Open-source models come in different sizes, and size determines cost. Families like Llama and Qwen ship in a range of parameter counts, from small models to very large ones. The parameter count is the number of weights the model has. More parameters generally means better quality, but also more memory and compute required.
This creates a tradeoff. Smaller models are cheaper to run but may not match the quality you need. Larger models can rival closed-source quality but require serious hardware investment.
The cost comparison
Let's revisit our document Q&A example. Say you deploy Llama 3.3 70B on 4 H100s at $3/hour each, $12/hour total, or roughly $8,600/month for 24/7 operation. With our workload (10,000 input tokens, 500 output tokens per query), a well-optimized setup using modern inference engines like vLLM or TensorRT-LLM can serve roughly 1-2 requests per second.* At 1.5 req/s, that's about 3.9 million queries per month of capacity from a single 4xH100 deployment.
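The capacity figure is back-of-the-envelope arithmetic. Here's a sketch using the assumptions above; the 1.5 req/s throughput is an estimate for this specific workload, not a guarantee.

```python
# Back-of-the-envelope capacity and cost for the 4xH100 deployment above.
# The 1.5 req/s throughput is an assumption for this workload; measure your own.
GPU_COUNT = 4
PRICE_PER_GPU_HOUR = 3.00   # dollars per H100-hour
THROUGHPUT_RPS = 1.5        # sustained requests per second at 10K in / 500 out
HOURS_PER_MONTH = 24 * 30

monthly_cost = GPU_COUNT * PRICE_PER_GPU_HOUR * HOURS_PER_MONTH
monthly_capacity = THROUGHPUT_RPS * 3600 * HOURS_PER_MONTH

print(f"Monthly cost:     ${monthly_cost:,.0f}")             # $8,640 (the ~$8,600 above)
print(f"Monthly capacity: {monthly_capacity:,.0f} queries")  # ~3.9 million
```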
Here's how the costs compare:
| Monthly queries | Per-token cost | Dedicated cost | Savings |
|---|---|---|---|
| 100,000 | $3,800 | $8,600 | -126% (per-token wins) |
| 250,000 | $9,500 | $8,600 | 9% |
| 500,000 | $19,000 | $8,600 | 55% |
| 1,000,000 | $38,000 | $8,600 | 77% |
| 5,000,000 | $190,000 | $17,200 (2x capacity) | 91% |
*Throughput varies significantly based on input/output lengths, batch size, and quantization. NVIDIA benchmarks show 8xH100 achieving ~5 req/s on Llama 70B with shorter sequences (2,048 input tokens). Our heavier workload (10,000 input tokens) reduces throughput proportionally.
The crossover point
The crossover point is somewhere around 200,000-250,000 queries per month. Below that, per-token is cheaper. Above that, dedicated capacity wins, and the savings grow quickly.
At a million queries per month, you're saving almost $30,000. At five million queries, you need to scale to 2x the GPU capacity ($17,200/month), but you're still saving over $170,000 per month compared to per-token pricing. That's the power of near-zero marginal costs when you have the volume to justify the fixed investment.
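The break-even point falls out of two numbers we already have: the fixed monthly GPU cost and the per-query cost on the per-token plan. A quick sketch with the example figures:

```python
# Break-even volume: where the fixed GPU cost equals per-token spend.
# Uses the example figures from this post; the table above rounds the
# per-query cost to ~3.8 cents, hence its slightly higher dollar amounts.
DEDICATED_MONTHLY = 8_600   # 4xH100 at ~$3/hour each, running 24/7
PER_QUERY_COST = 0.0375     # 10K input + 500 output tokens at $3/$15 per million

breakeven = DEDICATED_MONTHLY / PER_QUERY_COST
print(f"Break-even: ~{breakeven:,.0f} queries/month")   # ~229,000

monthly_queries = 1_000_000
savings = monthly_queries * PER_QUERY_COST - DEDICATED_MONTHLY
print(f"Savings at 1M queries/month: ${savings:,.0f}")  # ~$28,900, i.e. almost $30K
```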
That sounds like a clear win for dedicated capacity at scale. But there are catches.
The quality tradeoff
Open-source vs closed-source
The gap has narrowed significantly: open-source models like Qwen and GLM now compete with frontier models on many benchmarks. But for complex reasoning, nuanced writing, or tasks requiring broad knowledge, closed-source models like GPT 5.2, Claude Sonnet 4.5, and Gemini Pro 3.0 often still have an edge.
How much of an edge? It depends entirely on your use case. For some applications, open-source models work just as well. For others, the quality difference is immediately noticeable to users. The only way to know is to test with your actual workload.
Fine-tuning bridges the gap
Here's what has changed: fine-tuning actually works now.
A year or two ago, fine-tuning was expensive, finicky, and often disappointing. You'd spend weeks and thousands of dollars training a model, only to find it wasn't much better than the base model for your task. The tooling was immature. The techniques were still being figured out. Many companies tried fine-tuning, got mediocre results, and concluded it wasn't worth the effort.
That's no longer true. Two things have shifted.
Fine-tuning has matured
First, fine-tuning itself has matured. Modern post-training frameworks have dramatically reduced the cost and complexity of training. Whether you're doing parameter-efficient fine-tuning or full model training, the infrastructure and tooling have gotten much better. The cost has dropped by an order of magnitude. What used to require a dedicated ML team and serious infrastructure can now be done with a few hundred dollars and some Python scripts.
Better base models
Second, the base models got much better. The floor of open-source intelligence is dramatically higher than it was. When you fine-tune a model, you're not teaching it to be smart from scratch. You're teaching it your specific task, your tone, your domain. The base model provides the foundation. If that foundation is weak, no amount of fine-tuning will save you. But today's open-source models (GLM, Qwen, DeepSeek) are genuinely capable. They understand language, reason about problems, and follow instructions well out of the box. Fine-tuning takes them from generally capable to specifically excellent at your task.
This changes the calculus. A fine-tuned Llama 70B can match or exceed frontier model quality for your specific use case, while running on dedicated capacity at a fraction of the per-token cost. The combination of better base models and cheaper fine-tuning makes this path viable in a way it simply wasn't before.
Operating dedicated capacity
Provisioning for traffic
There's an operational reality with dedicated capacity: you need to provision for your traffic. If your product serves users in real-time, you need enough GPUs to handle peak load, not average load. A product with spiky traffic might need to over-provision by 2-3x to avoid slow responses during bursts.
Auto-scaling sounds like the answer, but GPU spin-up times are measured in minutes, not seconds. And on-demand GPU availability can be spotty, especially for high-end chips like H100s. For real-time products, over-provisioning is often the only reliable option. That means you're paying for capacity you're not always using, which changes the economics.
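To see how provisioning for peak changes the bill, here's a rough sketch. The throughput and cost per deployment are the assumptions from the earlier example, and the 3x peak-to-average ratio is made up for illustration.

```python
import math

# Real-time traffic means provisioning for peak, not average.
# Throughput and cost per 4xH100 deployment are the earlier assumptions;
# the peak-to-average ratio is illustrative.
RPS_PER_DEPLOYMENT = 1.5
COST_PER_DEPLOYMENT = 8_600   # dollars per month

average_rps = 1.0
peak_rps = 3.0

deployments = math.ceil(peak_rps / RPS_PER_DEPLOYMENT)
utilization = average_rps / (deployments * RPS_PER_DEPLOYMENT)

print(f"Deployments for peak: {deployments}")                           # 2
print(f"Monthly cost:         ${deployments * COST_PER_DEPLOYMENT:,}")  # $17,200
print(f"Average utilization:  {utilization:.0%}")                       # 33%
```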
Hosted inference providers
The good news is that you don't need to manage GPUs yourself. Hosted inference providers like Together, Fireworks, Replicate, and others will run open-source models on dedicated capacity for you. They handle the infrastructure, the optimization, the scaling. You get an OpenAI-compatible API endpoint.
Switching from a closed-source model to dedicated capacity can be as simple as changing the base URL in your code. You get the economics of dedicated GPUs without hiring a team to operate them. This dramatically lowers the barrier to experimenting with dedicated capacity.
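For illustration, here's roughly what that switch looks like with the OpenAI Python SDK. The base URL and model identifier below are placeholders; each hosted provider documents its own values.

```python
from openai import OpenAI

# Most hosted inference providers expose an OpenAI-compatible endpoint.
# The base_url and model id below are placeholders; use the values your
# provider actually documents.
client = OpenAI(
    base_url="https://api.your-provider.example/v1",  # previously api.openai.com
    api_key="YOUR_PROVIDER_API_KEY",
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",  # model id format varies by provider
    messages=[{"role": "user", "content": "Summarize this document: ..."}],
)
print(response.choices[0].message.content)
```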
Async and batch processing
If your product doesn't need real-time responses, you have even more options. Background work such as generating reports, analyzing documents overnight, or other batch operations can use async inference APIs that queue work and deliver results via webhooks.
This changes the economics significantly. With async processing, you don't need to provision for peak load. Work gets queued and processed as capacity becomes available. You can run at much higher utilization without worrying about response times. Some providers offer batch APIs with significant discounts. Anthropic offers 50% off for batch processing. If your workload tolerates latency, this is worth exploring.
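As a rough illustration of what a batch discount does to the earlier per-token numbers, here's a sketch assuming the 50% discount mentioned above; exact discounts and eligibility vary by provider.

```python
# Effect of a 50% batch discount on the document Q&A example.
# 50% is the discount level mentioned above; terms vary by provider.
PER_QUERY_REALTIME = 0.0375
BATCH_DISCOUNT = 0.50

monthly_queries = 1_000_000
realtime_cost = monthly_queries * PER_QUERY_REALTIME
batch_cost = realtime_cost * (1 - BATCH_DISCOUNT)

print(f"Real-time per-token: ${realtime_cost:,.0f}/month")  # $37,500
print(f"Batched per-token:   ${batch_cost:,.0f}/month")     # $18,750
```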
The key question is whether your users can wait. A chatbot needs real-time responses. A nightly report generation job doesn't. Many products have a mix of both, and the smart approach is to use the right pricing model for each workload.
Making the decision
Quick decision guide
| Your situation | Recommended approach | Why |
|---|---|---|
| Early stage, still finding product-market fit | Per-token | Your usage patterns will change as you iterate. Don't lock into infrastructure before you know what you're building. |
| Low volume (under 100K requests/month) | Per-token | The math doesn't work yet. You'd pay $8,600/month for capacity you're not using. |
| High volume, need frontier model quality | Per-token with closed-source | GPT 5.2, Claude Sonnet 4.5, and Gemini Pro 3.0 aren't available on dedicated infrastructure. Quality has a price. |
| High volume, quality requirements are flexible | Dedicated capacity via hosted provider | Open-source models can handle many tasks well. Test before committing, but the savings are substantial. |
| High volume, task-specific quality needs | Fine-tune an open-source model on dedicated capacity | A fine-tuned model can match frontier quality for your specific task at a fraction of the cost. |
| High volume, latency-tolerant workloads | Batch APIs or async inference | If users aren't waiting, why pay real-time prices? Batch discounts (up to 50%) add up fast. |
| Predictable budget is critical | Dedicated capacity | Your CFO will thank you. Fixed cost means no surprise bills when usage spikes. |
| Spiky, unpredictable traffic | Per-token | Dedicated capacity means paying for peak even during valleys. Per-token matches cost to demand. |
Pros & Cons
Here are the main pros and cons to weigh as you decide which of the two pricing models makes sense for you.
| Factor | Serverless (Per-token) | Dedicated Capacity |
|---|---|---|
| Cost at low volume | 🟢 Only pay for what you use, no minimums | 🔴 Paying for GPUs whether they're busy or idle |
| Cost at high volume | 🔴 Linear scaling gets expensive fast | 🟢 Near-zero marginal cost once provisioned |
| Budget predictability | 🔴 Spend fluctuates with usage, hard to forecast | 🟢 Fixed monthly cost, easy to budget |
| Infrastructure | 🟢 Nothing to manage, just call the API | 🔴 Must provision, monitor, and scale for peak load |
| Scaling | 🟢 Provider handles traffic spikes automatically | 🔴 GPU spin-up takes minutes, not seconds |
| Uptime | 🔴 Shared infrastructure with all other customers; providers have had reliability issues | 🟢 Dedicated resources isolated from noisy neighbors |
| Rate limits | 🔴 Capped by default, need to request increases through support | 🟢 No artificial limits, scale by adding capacity |
| Vendor lock-in | 🔴 Tied to one provider's API and pricing changes | 🟢 Portable across providers, can self-host if needed |
How to think about this
Here's how I'd approach it: start with per-token. It's simple, there's no upfront commitment, and you'll learn your actual usage patterns. Instrument your application to track tokens consumed per request, requests per user, and total monthly spend. This data is essential for making good decisions later.
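A minimal instrumentation sketch, assuming an OpenAI-compatible API whose responses include a usage field; the model name and the log destination are placeholders.

```python
import logging
from openai import OpenAI

# Minimal usage tracking: log the token counts behind every request.
# Assumes an OpenAI-compatible API that returns a `usage` object; the
# model name and log destination are placeholders.
logging.basicConfig(level=logging.INFO)
client = OpenAI()

def tracked_completion(user_id: str, messages: list[dict]) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model id
        messages=messages,
    )
    usage = response.usage
    logging.info(
        "user=%s prompt_tokens=%d completion_tokens=%d",
        user_id, usage.prompt_tokens, usage.completion_tokens,
    )
    return response.choices[0].message.content
```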
As you scale, watch two numbers: your monthly inference spend and your query volume. When you're spending enough that dedicated capacity would clearly be cheaper, start experimenting. Run some workloads on open-source models through a hosted provider. See if the quality holds for your use case.
If the base open-source model falls short, consider fine-tuning. Collect examples of good outputs from your current system. Use them to train an open-source model on your specific task. The investment is much smaller than it used to be, and the results are much better.
If it works, you have a path to much lower unit economics. If it doesn't, you've learned something valuable, and per-token pricing is always there.
The best infrastructure decision is the one that matches where you are today while leaving room for where you're going. Don't over-optimize for a future that may never arrive. But don't ignore the economics either. At scale, the difference between per-token and dedicated capacity can be the difference between a profitable product and one that bleeds money.
Own your model. Scale with confidence.
Schedule a call with our research team to learn more about custom training. We'll propose a plan that beats your current SLA and unit cost.





