What is Speculative Decoding & Does It Speed up Language Models?
Published on Aug 15, 2025
You know the frustration: a chat stalls while a model grinds through a long response, or your app slows when multiple users request text at once. Speculative decoding speeds things up by letting a fast proposal model draft multiple tokens while a larger verifier model checks them, cutting latency and boosting throughput. In this article, we’ll break down how it works, compare it to traditional decoding, and show practical ways to generate text faster without sacrificing quality. We will also touch upon LLM inference optimization and its benefits.
Inference's AI inference APIs make it simple to run speculative execution and managed speculative decoding pipelines so you can boost GPU utilization, lower latency, and deliver faster, more efficient text generation with minimal system overhead.
What is Speculative Decoding, and is it Faster?

Large language models produce excellent text but often at a cost: latency. Waiting many seconds for each response hurts user experience and drives up compute bills.
Speculative decoding was developed to cut that wait by shifting some of the work to a faster, smaller model without losing the quality that users expect.
The technique answers a simple question:
Can we quickly guess the likely next tokens, then let the big model only check those guesses instead of generating every token itself?
How Speculative Decoding Differs From Standard Token-By-Token Decoding
Standard decoding generates one token at a time from the big model, sampling from its full probability distribution and then moving on. Speculative decoding changes that flow. A smaller draft model proposes a batch of candidate tokens ahead of time.
The large model then runs a verification step:
- It confirms which draft tokens match its conditional distribution and corrects any mismatches.
- The significant change is that the heavy model does verification work in bulk instead of running a full forward pass for every single token.
Core Idea: Draft Model And Verifier Working Together
Think of two agents. The draft model is fast and lightweight. It produces speculative generations or candidate tokens from a proposal distribution. The verifier model is larger and more accurate. It evaluates the draft tokens by comparing their logits or probabilities to its own.
If the big model accepts a proposed token under an acceptance test, that token becomes part of the final output. If not, the big model rejects it and computes the correct token itself. This draft-verify loop lets the large model avoid computing every next token from scratch.
Step By Step: How Speculative Decoding Runs in Practice
- Draft generation: The small model samples a chunk of tokens conditioned on the current prefix. You can choose how many tokens the draft produces per round.
- Parallel verification: The large model scores those draft tokens in parallel. It applies an acceptance test that ensures the draft token is plausible under the large model probability. Accepted tokens are emitted without further heavy sampling. Rejected tokens force the large model to compute the correct token and possibly reattempt the draft for the remaining positions.
- Final output: Accepted draft tokens join the stream. The large model output replaces rejections, and the process repeats until the sequence ends.
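To make the loop concrete, here is a minimal greedy draft-verify sketch in Python. It assumes two Hugging Face-style causal language models that share a tokenizer and return `.logits`; the function and parameter names are illustrative, and real implementations add KV cache reuse, sampling, and stop-token handling.

```python
import torch

@torch.no_grad()
def speculative_generate(target, draft, input_ids, max_new_tokens=128, n_draft=4):
    """Greedy draft-verify loop (sketch): the draft proposes n_draft tokens,
    the target scores them in one forward pass, and the longest matching
    prefix is accepted plus one corrected/bonus token from the target."""
    out = input_ids  # shape (1, seq_len)
    while out.shape[1] - input_ids.shape[1] < max_new_tokens:
        # 1) Draft generation: propose n_draft tokens with the cheap model.
        draft_ids = out
        for _ in range(n_draft):
            next_tok = draft(draft_ids).logits[:, -1, :].argmax(-1, keepdim=True)
            draft_ids = torch.cat([draft_ids, next_tok], dim=-1)
        proposed = draft_ids[:, out.shape[1]:]

        # 2) Parallel verification: one target pass scores every proposed position.
        tgt_logits = target(draft_ids).logits
        tgt_pred = tgt_logits[:, out.shape[1] - 1:-1, :].argmax(-1)

        # 3) Accept the longest matching prefix, then append one token chosen
        #    by the target so every round is guaranteed to make progress.
        matches = (tgt_pred == proposed)[0].long()
        n_accept = int(matches.cumprod(0).sum())
        bonus = tgt_logits[:, out.shape[1] - 1 + n_accept, :].argmax(-1, keepdim=True)
        out = torch.cat([out, proposed[:, :n_accept], bonus], dim=-1)
    return out
```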
When is Speculative Decoding Faster, and How are Speed Gains Achieved?
When the draft model aligns well with the large model, many proposed tokens pass the acceptance test. That means the heavy model only needs to score batches of tokens instead of sampling a token at a time.
The parallel verification step exploits efficient matrix operations and throughput on accelerators, reducing wall clock latency. Scenarios with long outputs, stable token distributions, and good draft model calibration show the best gains.
Performance Gains and Resource Efficiency of Speculative Decoding
How much faster can it be? Benchmarks often report a 30 to 40 percent reduction in latency. For example, end-to-end generation that used to take 25 to 30 seconds can fall to about 15 to 18 seconds under well-tuned speculative decoding.
Memory and compute numbers shift too because most token generation moves to the smaller model. In some setups, memory use drops from roughly 26 GB to about 14 GB, and heavy-model compute roughly halves when the acceptance rate stays high.
What Achieves Those Gains Technically
- Batch token proposals reduce the number of sequential generation steps.
- Verification leverages parallel scoring, so the big model performs fewer autoregressive samples.
- The small model runs cheaply on a CPU or a smaller GPU slice, lowering peak memory and power draw.
- Acceptance tests and probability rescaling limit quality loss while permitting more aggressive draft sampling.
Trade Offs and When Benefits Shrink
Speculative decoding is not magic. If the draft model has poor alignment with the verifier, rejection rates rise, and the verifier must repair many tokens, erasing gains. Overhead from a round-trip between models can negate benefits for very short outputs or setups with high IPC overhead.
Tight sampling regimes like deterministic beams or specialized constrained decoding can conflict with speculative proposals.
Extra engineering complexity appears:
- Maintain two models
- Tune acceptance thresholds
- Handle KL correction
- Rescale logits
What Practical Failure Modes Look Like
- There is a high rejection rate when the draft diverges from the large model style or domain.
- Increased variance in latency when some rounds accept many tokens and others accept few.
- Compatibility issues with decoding features such as constrained tokens, alignment filters, or complex stopping conditions.
- There is extra memory usage if you keep large caches for both models concurrently on the same device.
Practical Tips and Knobs That Matter for Speed and Quality
- Which draft size to pick? Larger batches give more parallelism but can increase wasted work when rejection is common. Start small and raise the batch size while watching the acceptance ratio.
- How to set acceptance tests? Use per-token probability ratio tests or KL-aware thresholds. Probability rescaling and acceptance tests based on logit ratios let you control the trade-off between speed and fidelity (see the sketch after this list).
- Which draft model? Pick a model that is fast but not too cheap. Models with similar token distribution reduce rejection. You can try intermediate sizes or domain-tuned drafts.
- Where to run which model? Run the draft on CPU or a small accelerator if latency and memory allow. Keep the verifier on the powerful accelerator to preserve quality.
- What about sampling modes? Speculative decoding works best with sampling style generation, such as top k or nucleus sampling. Deterministic beam search needs extra care.
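For reference, the per-token probability ratio test mentioned above is usually implemented along these lines. This is a minimal NumPy sketch of the standard speculative sampling rule (accept with probability min(1, p/q), otherwise resample from the renormalized residual); the function name and interface are ours.

```python
import numpy as np

def accept_or_resample(p, q, x, rng=None):
    """p: target probabilities over the vocab at this position,
    q: draft probabilities, x: the token id the draft proposed.
    Returns (token, accepted_flag). Accepting with probability
    min(1, p[x]/q[x]) and resampling rejects from max(p - q, 0)
    keeps the output distribution identical to the target's."""
    rng = rng or np.random.default_rng()
    if rng.random() < min(1.0, p[x] / q[x]):
        return x, True
    residual = np.maximum(p - q, 0.0)
    residual /= residual.sum()
    return int(rng.choice(len(p), p=residual)), False
```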
A Quick Checklist to Test Speculative Decoding on Your Stack
- Measure acceptance rate and per-round overhead.
- Tune draft size and acceptance threshold to maximize accepted tokens per verifier call.
- Monitor latency variance and peak memory.
- Verify output quality with automatic metrics and human checks for subtle shifts.
- Add KL correction or simple reweighting if you see distribution drift.
Common Technical Terms and Mechanisms Worth Knowing
- Proposal model
- Draft model
- Verifier model
- Speculative generation
- Speculative sampling
- Draft verify loop
- Parallel verification
- Candidate tokens
- Token batching
- Acceptance test
- Probability ratio test
- Logits rescaling
- KL correction
- Reranking
- Approximate decoding
- Two model decoding
- Ensembled decoding
- Early acceptance
- Rejection correction
- Proposal distribution
- Verification stage
- Token proposals
- Acceptance rate
Questions to Ask When Adopting Speculative Decoding
- Do you need lower tail latency or just lower average latency?
- Can you host two models cost-effectively?
- How similar is your draft model distribution to the verifier?
- Will your application tolerate occasional correction delays when rejections occur?
Related Reading
- Model Context Protocol
- Lora Fine Tuning
- Gradient Checkpointing
- Post Training Quantization
- LLM Use Cases
- vLLM Continuous Batching
- LLM Quantization
- LLM Inference Optimization
Step-By-Step Speculative Decoding

1. Ingest The User Request And Warm The Context
The server accepts the prompt and prepares the whole context window, including any system prompts and recent tokens. The target model must have that context ready before any speculative work begins.
2. The Target Model Produces The First Ground Truth Token
The target model runs one autoregressive step and emits the next token using its logits, sampling policy, or beam choice. That first token anchors the sequence and ensures the target model stays authoritative.
3. Send Context Plus Anchor To The Draft Model And Ask For A Chunk
The server copies the input plus the first output token to the draft model and requests N draft tokens. N is typically chosen to balance validation cost and throughput, for example, 8 to 64 tokens.
4. Draft Model Generates Draft Tokens Quickly
The draft model runs ordinary inference using the same tokenizer and produces a sequence of candidate tokens using greedy sampling or top-k sampling. Because it is smaller or optimized, it creates tokens at lower latency.
5. Target Model Validates Draft Tokens In A Single Verification Step
The target model scores the draft tokens by computing logits for the sequence positions covered by the draft. If the draft tokens match the tokens the target model would have produced under its sampling policy, those tokens are accepted. If any token deviates, the target model rejects the draft suffix from the first mismatch onward.
6. If Accepted, Adopt The Draft Tokens And Advance The Output Pointer
The server appends accepted tokens to the output and moves the context forward by the accepted length. This is where latency gains appear because the target model avoided extra autoregressive calls for each accepted token.
7. If Rejected, Discard The Mismatching Suffix And Fall Back To The Target Model
The server keeps the correct prefix returned during validation and asks the target model to generate the next single token by its normal autoregressive step. That keeps the system lossless because the target model remains the final authority.
8. Repeat: Feed New Context Into The Draft Model And Request Another Chunk
The loop continues until the sequence reaches an end-of-text token or another stopping condition. At each iteration, the draft model contributes candidate tokens and the target model validates or corrects them. Short illustration:
Imagine a user asks for a short poem. The target model emits token A. The draft model returns tokens B, C, D, E, and F. The target model computes logits for B, C, and D and finds that B and C match, but D differs. The system accepts B and C, discards D, E, and F, then the target model emits the next valid token G, and the server asks the draft model for a new proposal starting after G.
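The poem illustration maps directly onto a tiny helper like the one below (a toy sketch with string stand-ins for token ids): the accepted prefix is everything before the first disagreement, and the target supplies the corrected token.

```python
def split_on_first_mismatch(proposed, target_choice):
    """proposed: tokens the draft suggested; target_choice: the token the
    target would emit at each of those positions under its own policy.
    Returns (accepted_prefix, corrected_token_or_None)."""
    for i, (d, t) in enumerate(zip(proposed, target_choice)):
        if d != t:
            return proposed[:i], t      # keep the matching prefix, emit the fix
    return proposed, None               # everything matched; no correction needed

# Illustration from the text: draft proposes B C D E F, target checks B, C, D
# and agrees only on B and C, then emits G in place of D.
accepted, corrected = split_on_first_mismatch(["B", "C", "D", "E", "F"],
                                              ["B", "C", "G"])
print(accepted, corrected)   # ['B', 'C'] G
```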
Draft Model Selection And Compatibility
Choose a draft model in the same model family so it shares the tokenizer and similar tokenization behavior. Pick a size that runs substantially faster on your hardware while still reflecting the target model token distribution.
Too small, and the acceptance rate collapses. Too large and you lose the latency benefit. Test for token distribution alignment and measure per token cross-entropy between draft and target.
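One way to check that alignment, assuming PyTorch-style models that return `.logits` over a shared vocabulary, is a quick cross-entropy probe like the sketch below; lower values generally track higher acceptance rates.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def draft_target_alignment(target, draft, input_ids):
    """Average per-position cross-entropy H(p_target, q_draft) on a sample
    prompt. A rough alignment probe, not a substitute for measuring the
    actual acceptance rate in your pipeline."""
    p_log = F.log_softmax(target(input_ids).logits, dim=-1)   # target log-probs
    q_log = F.log_softmax(draft(input_ids).logits, dim=-1)    # draft log-probs
    # H(p, q) = -sum_x p(x) log q(x), averaged over sequence positions
    return -(p_log.exp() * q_log).sum(-1).mean().item()
```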
Fine-Tuning The Draft To Match The Target
Train the draft model on samples of inputs and target model responses so it learns the target model's conditional distribution. Supervised fine-tuning on a curated set of target model outputs often raises the token acceptance rate substantially. We measured about a 15 percent improvement in acceptance rate after modest fine-tuning on representative prompts.
Validation Policy And Sampling Strategy
Use deterministic greedy proposals for higher acceptance rates, or controlled sampling, such as top-k with a small k, when you need diversity.
Adjust N, the draft length:
Longer drafts increase potential speedups but raise rejection probability and revalidation cost. The validation step should compute logits only over the draft span in a batch-friendly manner.
Orchestration and RPC Efficiency
Reduce latency from model to model by colocating services when possible, using fast RPC, and keeping models warm in memory. Avoid frequent context serialization and small RPC messages. Overlap network and compute by pipelining the draft generation while the target model prepares validation logits.
Resource Allocation And GPU Packing
Reserve a small GPU partition or use mixed precision for the draft model to enable instant startup. Use memory pooling and preloaded weights to avoid cold start stalls that kill latency. If you have spare compute on the same machine, run the draft model on that spare to minimize cross-host hops.
Instrumentation And Metrics To Watch
Track token acceptance rate, raw latency per token, end-to-end throughput, and the cost per generated token. Monitor the distribution of rejection points to decide if you should shorten or lengthen draft chunks. Log token-level disagreements to drive fine-tuning and model selection.
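A minimal way to collect those signals is a per-request counter object like the sketch below (the class and field names are our own); it yields the acceptance rate, rejection-point distribution, and tokens-per-verifier-call figures used in the tuning decisions above.

```python
from dataclasses import dataclass, field

@dataclass
class SpecDecodeStats:
    proposed: int = 0           # draft tokens offered to the verifier
    accepted: int = 0           # draft tokens the verifier kept
    verifier_calls: int = 0     # full-model verification passes
    rejection_points: list = field(default_factory=list)  # first-mismatch index per round

    def record_round(self, n_proposed: int, n_accepted: int) -> None:
        self.proposed += n_proposed
        self.accepted += n_accepted
        self.verifier_calls += 1
        if n_accepted < n_proposed:
            self.rejection_points.append(n_accepted)

    @property
    def acceptance_rate(self) -> float:
        return self.accepted / max(self.proposed, 1)

    @property
    def tokens_per_verifier_call(self) -> float:
        # each round also yields one corrected/bonus token from the verifier
        return (self.accepted + self.verifier_calls) / max(self.verifier_calls, 1)
```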
When to Use Speculative Decoding: Practical Triggers and Use Cases
If you run huge target models at low batch sizes and you need per-token latency guarantees, speculative decoding gives predictable speedups. Interactive code generation while a user types is a typical example where outputs are long and latency matters.
What Conditions Justify The Extra Complexity?
Use speculative decoding when:
- The target model latency per token is high
- Batch size cannot be increased due to context limits or memory
- You have spare compute capacity to host a draft model.
If your operation is purely throughput-oriented with large batches, other scaling techniques may win.
Concrete Decision Checklist
- Measure baseline per token latency and GPU utilization
- Test a candidate draft model for acceptance rate and latency
- Simulate overall savings for different draft chunk sizes
- Ensure orchestration can supply low-overhead RPC and warm starts
If the expected latency reduction exceeds the added orchestration and validation cost, proceed to production testing.
Quick Cost and Quality Trade Off Example
Suppose the target model emits tokens at 120 ms each, and a draft model can propose eight tokens in 80 ms with a 60 percent acceptance rate. The net expected per-token time drops because many tokens avoid the 120 ms step. If acceptance falls, the savings shrink, and you re-tune the draft model or chunk size.
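Here is a back-of-the-envelope version of that arithmetic, under two assumptions of ours (one batched verification pass costs roughly the same as one ordinary target step, and "60 percent acceptance" means 60 percent of proposed tokens are kept on average):

```python
# Rough estimate only; real systems follow a per-position acceptance curve.
target_step_ms = 120.0   # ordinary autoregressive step of the target model
draft_round_ms = 80.0    # time for the draft to propose the whole chunk
chunk = 8                # draft tokens proposed per round
acceptance = 0.60        # fraction of proposed tokens accepted

tokens_per_round = acceptance * chunk + 1      # accepted + one corrected/bonus token
round_ms = draft_round_ms + target_step_ms     # draft pass + single verification pass
per_token_ms = round_ms / tokens_per_round

print(f"baseline: {target_step_ms:.0f} ms/token, "
      f"speculative: {per_token_ms:.1f} ms/token")   # ~34.5 ms/token here
```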
Edge Cases and Failure Modes to Watch
Draft and target must share a tokenizer and similar vocab mapping, or you will see systematic mismatches. High rejection rates waste cycles and add jitter. Cold starts, network latency, and incorrect validation logic can introduce correctness bugs. Build retry and monitoring logic so you detect regressions fast.
Want to Run a Quick Experiment?
Try a small draft model with greedy sampling and a short draft length on a held-out set of prompts. Measure acceptance rate, per token latency, and end-user latency to see if the approach pays off for your workload.
Related Reading
- Serving Ml Models
- LLM Serving
- LLM Benchmark Comparison
- Inference Optimization
- Inference Latency
- LLM Performance Metrics
- Pytorch Inference
- KV Cache Explained
- LLM Performance Benchmarks
Challenges of Speculative Decoding

Speculative decoding keeps a small draft model and a larger verifier live at the same time. That increases memory demands in two places:
- Model parameters
- The growing key-value (KV) cache for attention
The KV cache scales with:
- Layer count
- Head size
- Sequence length
So a long prompt or long generated stream can blow past GPU memory budgets. What does that mean in practice? You either must use more GPUs, accept slower CPU offload, or limit the draft length. Those choices raise cost, increase latency, or constrain the size of the verifier you can run.
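To put numbers on that, a commonly used approximation of KV cache size (ours, and only a rough estimate; exact figures depend on the runtime) multiplies layers, KV heads, head dimension, sequence length, and dtype width:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """Rough KV cache footprint: keys and values for every layer and position.
    2 (K and V) * layers * kv_heads * head_dim * seq_len * batch * dtype size."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Example: a hypothetical 32-layer model with 8 KV heads of dim 128 at fp16,
# holding a 32k-token context, needs roughly this many gigabytes per sequence:
print(kv_cache_bytes(32, 8, 128, 32_768) / 1e9)   # ~4.3 GB, and the draft adds its own cache
```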
Practical Mitigations
Options include KV compression, quantization, storing only one complete set of parameters on the GPU, swapping small drafts in and out, and using memory-efficient attention. Each mitigation reduces the raw memory footprint but adds implementation complexity or runtime cost, and may lower throughput when the verifier must be reloaded or when offload trips to the CPU.
Choosing and Tuning the Draft Model: The Speed Versus Correctness Tradeoff
Which draft model you pick and how you set its sampling parameters determine the acceptance rate and the eventual speed gains. A tiny draft model with a high temperature produces diverse tokens fast, but its proposals are also rejected more often under verification.
That causes rollbacks and recomputation by the verifier. A stronger draft reduces rollbacks but consumes more compute and memory, cutting the theoretical speedup.
How do you tune this in practice? Measure:
- Acceptance rate
- Latency per token
- Cost per verification
Try different draft lengths, temperatures, and top p or top k settings. Use validation prompts similar to production traffic to estimate the break-even point where extra verification work outweighs speculative savings.
You can also distill a draft from the verifier to improve acceptance while keeping size down. These steps trade engineering time and validation compute for more predictable latency and higher throughput.
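For the break-even estimate, a simple geometric acceptance model (each drafted token accepted independently with probability alpha, one target pass per round for verification) is often good enough to sweep draft lengths before running real experiments. The cost ratio below is a hypothetical placeholder:

```python
def expected_tokens_per_round(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per draft-verify round under the simple
    geometric model often used in the speculative decoding literature."""
    if alpha == 1.0:
        return gamma + 1
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

def estimated_speedup(alpha: float, gamma: int, draft_cost_ratio: float) -> float:
    """Speedup over plain decoding, with draft_cost_ratio = draft step cost /
    target step cost and one target pass per round for verification."""
    round_cost = gamma * draft_cost_ratio + 1.0   # in units of target steps
    return expected_tokens_per_round(alpha, gamma) / round_cost

# Sweep draft lengths for a draft that costs ~10% of a target step, at two
# hypothetical acceptance levels, to locate the break-even/peak region.
for alpha in (0.6, 0.8):
    best = max(range(1, 13), key=lambda g: estimated_speedup(alpha, g, 0.1))
    print(alpha, best, round(estimated_speedup(alpha, best, 0.1), 2))
```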
Implementation Complexity: Syncing Streams and Recovering from Rollbacks
Speculative decoding changes a simple decode loop into a two-level pipeline. You must manage concurrency between the draft-producing tokens and the verifier checking and committing them. That requires careful state management for:
- KV caches
- Atomic commits of accepted tokens
- Precise token alignment
- Robust rollback procedures for rejected tokens.
Streaming outputs add pressure:
You must hold back or tentatively stream tokens while verification completes, which complicates user-facing latency and consistency guarantees.
Error Handling Needs to be Explicit
What happens if the verifier fails mid-block, or a hardware fault corrupts a KV cache? You need snapshot and restore paths or incremental checkpointing.
Tests must simulate partial commits and race conditions. These engineering tasks increase code complexity, add test coverage needs, and raise maintenance costs in production systems.
Decoding Strategy Limits: Why Beam Search and Diverse Decoding Don’t Fit Easily
Speculative decoding works naturally with greedy search and sampling because the draft can propose a contiguous token sequence that the verifier can accept in one pass. Beam search, by contrast, explores many branches in parallel.
You cannot easily propose a single draft sequence that faithfully represents multiple beams. The verifier would need to check combinations of partial beams and reconcile divergences, which multiplies compute and memory and destroys the simplicity that makes speculative decoding attractive. That incompatibility matters when your application relies on beam search for quality gains, like precise translation or constrained generation.
You can try approximations:
- Run speculative decoding on the top beam only
- Convert the beam to a constrained sampling strategy
- Perform speculative steps at fixed points between beam expansions
Each workaround reduces potential speedups and may change quality characteristics in ways that require careful validation.
Verification Overhead: When Many Draft Tokens Get Thrown Away
If the draft model produces tokens that the verifier often rejects, you pay two costs.
- You waste cycles running the draft.
- You re-run the verifier from a rollback point to generate corrected tokens.
Rollback cost rises with the length of speculative blocks and with the extent of the rewind. In the worst case, you lose the entire benefit and add more latency than a plain verifier run would have. Operators tune speculative block size and acceptance thresholds to reduce wasted work.
Another approach is layered verification: run a cheap classifier or lightweight verifier on draft tokens before invoking the whole model. That filters obvious errors and reduces full verifier invocations. Instrumentation that tracks token rejection rates and per-token recompute time gives you the visibility needed to decide when to back off from speculation.
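The layered pattern can be as simple as the following sketch, where `cheap_score` and `full_verify` are placeholders for your own lightweight screen and full verification call:

```python
def layered_verify(draft_tokens, cheap_score, full_verify, threshold=0.2):
    """Illustrative two-stage check: a lightweight scorer screens a draft chunk,
    and the full verifier only runs when the chunk looks plausible.

    cheap_score(tokens) -> float in [0, 1], an estimate that the chunk will pass;
    full_verify(tokens) -> list of accepted tokens from the large model."""
    if cheap_score(draft_tokens) < threshold:
        return []                      # skip the expensive pass; force a redraft
    return full_verify(draft_tokens)   # otherwise pay for full verification
```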
Batching and Throughput: Why Speculation Often Fights Parallel Efficiency
High-throughput systems rely on batching to amortize GPU cost. Speculative decoding creates divergence between requests because each request's draft stream accepts a different number of tokens at different times.
That makes it hard to form large uniform batches for the verifier and reduces GPU utilization. The practical impact is lower throughput per dollar compared to traditional batched inference.
Strategies for Improving Throughput and Performance
You can partially recover by grouping similar prompts or by running the draft model in larger batches while using smaller microbatches for verifier runs.
Another option is to limit speculative decoding to interactive low-latency paths where batching is less critical. These mitigations can restore some throughput but add scheduling complexity and may require online prompt clustering or custom batching logic.
Start Building with $10 in Free API Credits Today!
Inference offers OpenAI-compatible serverless inference APIs that run top open source language models with predictable latency and elastic scaling. You call an API and the system handles:
- Model loading
- KV cache reuse
- Mixed precision
- Autoscaling
So you pay for inference time, not idle GPU cycles. The API surface matches common OpenAI patterns so you can swap endpoints with minimal code change while gaining access to model variants and runtime knobs for throughput tuning.
Want to keep your existing code and squeeze more performance from the same models?
Scale Async Workloads With Batch Processing That Actually Scales
Specialized batch processing handles high-volume async jobs such as bulk generation, summarization, and long document parsing. The system uses dynamic batching, request coalescing, and batch decode strategies to raise throughput and lower per-token cost while controlling tail latency.
It supports prefix caching and shared KV cache reuse to avoid redundant recomputation across similar requests and improves GPU utilization through larger per-step workloads.
How would you use batch decode to process millions of pages per day?
Extract Documents for RAG With Pipelines Built For Retrieval And Accuracy
Document extraction focuses on reliable chunking, OCR integration, field extraction, and semantic segmentation tailored for retrieval augmented generation. The pipeline supports overlapping chunk windows, dense embedding exports to vector stores, and structured metadata capture so retrieval returns high-quality passages for the verifier model.
You can route extracted content to different stores and tune chunk size against token budget to optimize recall and latency.
Which retrieval strategy fits your index and query patterns?
Start Building With $10 in Free API Credits and Clear Cost Signals
New accounts get $10 in free credits so you can test latency, throughput, and quality before committing budget. Pricing aligns with the serverless model, so you see costs per token and request, and you can profile how optimization techniques affect spend.
Use small experiments to measure the speedup from tactics like:
- Quantization
- Mixed precision
- Speculative decoding
Then pick the best cost-quality point.
Ready to run a cost per completion test?
Speculative Decoding Explained, and How It Speeds Things Up
Speculative decoding uses a two-model pipeline where a small draft model proposes candidate tokens and a larger acceptor model verifies them. The draft model generates token proposals in parallel with the complete model running verification, which reduces the number of expensive full model forward passes and increases throughput.
Techniques include speculative sampling, greedy acceptor rules, token rollouts, and speculative beam search to balance acceptance rate against quality. Implementation details matter: you need logits transfer, accept probability calculation, and a verification step that can consume the draft proposals without redoing work.
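As one example of what "logits transfer plus accept-probability calculation" can look like in code, here is a vectorized sketch (our own helper, assuming both models' logits at the proposed positions have already been gathered into tensors) that produces the per-position acceptance probabilities min(1, p/q) for a whole draft chunk:

```python
import torch

def chunk_accept_probs(target_logits, draft_logits, proposed):
    """target_logits, draft_logits: (chunk, vocab) logits at the proposed positions.
    proposed: (chunk,) token ids from the draft model.
    Returns min(1, p_target(x) / p_draft(x)) per position, ready to compare
    against uniform random draws in the acceptance test."""
    p = torch.softmax(target_logits, dim=-1)
    q = torch.softmax(draft_logits, dim=-1)
    idx = proposed.unsqueeze(-1)
    ratio = p.gather(-1, idx) / q.gather(-1, idx).clamp_min(1e-12)
    return ratio.clamp_max(1.0).squeeze(-1)
```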
Architectural Trade-Offs When You Add Speculative Decoding
Speculative strategies raise throughput and lower cost but introduce calibration and safety choices. A poorly tuned draft model can push low-quality token proposals that the verifier must reject, leading to extra work.
Acceptance thresholds and temperature affect acceptance probability and stability. Monitor perplexity shifts, acceptance rate, and rollback overhead to ensure net speedup.
How will you set acceptance thresholds for your use case?
Practical Tuning and Production Rules for Fast Sampling
Pick a draft model that is substantially smaller than the verifier and train or tune it to mimic the acceptor’s token distribution for common prompts. Use mixed precision and quantization on the draft to reduce memory usage and increase the sample rate.
Tune batch sizes, speculative window length, and the number of candidate tokens per step while measuring throughput, latency, and answer quality. Instrument acceptance metrics, token-level confidence, and end-to-end latency so you can A/B test different speculative sampling policies.
Operational Tips for Async Batch and Document RAG Pipelines
Combine batch processing with speculative decoding to exploit both GPU throughput and parallel token proposal. Use request coalescing to feed the draft model with many short sequences at once and keep the verifier busy on the accepted rollouts.
Cache embeddings and partial KV cache entries for repeated queries to reduce cold start overhead. Log mismatches between draft proposals and verifier decisions to improve the draft model iteratively.
Quality Control and Safety When You Push Speed
Add verification checkpoints, fallback decoding paths, and thresholded acceptance to prevent degradation. Use smaller verification beams on cheap rejects and a full decode only when acceptance fails.
Audit hallucination rates and maintain prompt level checks that trigger conservative decoding when risk is high.
What safety signals will you monitor in real time?
Related Reading
- Inference
- vLLM Multi-GPU
- Distributed Inference
- KV Caching