Step-By-Step Pytorch Inference Tutorial for Beginners
Published on Aug 30, 2025
Within deep learning workflows, PyTorch inference is the bridge between training and real-world use: it determines how a model delivers predictions, how quickly it responds, and how efficiently it uses compute resources. Many beginners master training loops but stumble when moving models into inference, facing slow response times or unclear deployment steps. This guide walks you through the process step by step: loading a trained PyTorch model, switching to evaluation mode, disabling gradients, applying LLM Inference Optimization techniques, batching inputs, and measuring performance, so you can go from a checkpoint to reliable predictions without wasted effort.
To reach that goal, Inference's AI inference APIs give you ready endpoints, managed scaling, batching, and runtime optimizations so you spend minutes on deployment and hours on improving results rather than fixing plumbing.
What is Inference in PyTorch?

Inference is the act of using a trained neural network to produce predictions from new input data. During inference, the model consumes input data and produces output results. It does not compute gradients or update parameters. In plain terms, training modifies the model, and inference queries it.
When is Inference Critical
In PyTorch, the distinction maps to specific behaviors. Training uses forward and backward passes, gradient buffers, optimizer steps, dropout layers that randomize activations, and batch normalization layers that update running statistics.
During inference, you disable gradient computation, freeze the model behavior for modules like dropout and batch normalization, and focus on predictable, efficient forward passes. Which PyTorch primitives control these behaviors?
Where Does Inference Performance Matter Most?
- Cloud deployment: Lower latency and higher throughput reduce cost per request.
- Edge device inference: Limited compute and memory require model compression and quantization.
- Real-time use cases: Fraud detection, recommendation, video analysis, and autonomous systems need predictable low latency.
- High-volume batch serving: Throughput and efficient batching determine capacity and cost.
Which Scenario Applies to Your Model Right Now
Training turns on gradient tracking and parameter updates. Inference turns those things off. If you forget to switch modes, you may receive incorrect outputs from batch normalization or incur extra memory usage from gradient buffers.

```python
import torch
from torchvision.models import resnet50

model = resnet50(pretrained=True)
dummy_input = torch.randn(1, 3, 224, 224)

# Training behavior
model.train()
output_train = model(dummy_input)

# Inference behavior
model.eval()
with torch.no_grad():
    output_infer = model(dummy_input)
```
What Do These Calls Do in Practice
model.eval() sets dropout to pass through and makes batch normalization use its running estimates instead of batch statistics. torch.no_grad() disables gradient tracking so torch.autograd does not allocate gradient buffers.
Both reduce memory usage and improve speed during forward-only execution. What happens if you leave the model in train mode when serving requests?
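To make that failure mode concrete, here is a small sketch (the layer sizes are arbitrary) showing that a module containing dropout gives different answers for the same input while in train mode:

```python
import torch

# a toy module with dropout; sizes are arbitrary
layer = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.Dropout(p=0.5))
x = torch.randn(1, 8)

layer.train()
print(torch.allclose(layer(x), layer(x)))  # usually False: dropout randomizes activations

layer.eval()
with torch.no_grad():
    print(torch.allclose(layer(x), layer(x)))  # True: deterministic forward pass
```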
Practical Considerations for Inference: Speed, Memory, and Deployment Constraints
Latency and Throughput Trade-Offs
Every millisecond matters in production. Reduce wall clock latency by lowering per-request compute and by increasing concurrency with batching. Measure both tail latency and median latency. Use a profiler to identify hotspots before making code changes.
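As a starting point, here is a hedged sketch of a latency benchmark that discards warm-up iterations and reports median and tail latency; the batch shape and iteration counts are placeholders you would replace with realistic values:

```python
import time
import torch

def benchmark(model, x, warmup=10, iters=100):
    model.eval()
    timings = []
    with torch.no_grad():
        for i in range(warmup + iters):
            if x.is_cuda:
                torch.cuda.synchronize()  # make GPU timings meaningful
            start = time.perf_counter()
            model(x)
            if x.is_cuda:
                torch.cuda.synchronize()
            if i >= warmup:  # exclude warm-up runs from steady-state numbers
                timings.append(time.perf_counter() - start)
    timings.sort()
    p50 = timings[len(timings) // 2]
    p95 = timings[int(len(timings) * 0.95)]
    return p50, p95

# example usage with a placeholder input shape
# p50, p95 = benchmark(model, torch.randn(8, 3, 224, 224))
```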
Memory and Compute Limits
On edge devices, memory often limits batch size and precision. On servers, GPU memory and PCIe bandwidth can be the bottleneck. When gradients are disabled, you free memory previously used for activations and gradient accumulators.
Device Placement and Data Pipeline
Move the model and input tensors to the same device once. Use pinned memory and non-blocking copies on the data loader when feeding a GPU. Profile host-to-device transfer versus kernel execution with nvprof or NVIDIA Nsight Systems so you do not mask CPU overhead with GPU speed.
Profiling and Benchmarking
Use torch.profiler or the older autograd profiler to capture operator time, FLOPs, and memory. Benchmark with realistic batch sizes and realistic preprocessing. Are you measuring warm start and steady state? Keep the same seeds, and disable deterministic flags when you are optimizing for throughput.
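A minimal profiling sketch with torch.profiler might look like the following; the model and batch shape are assumed from the earlier examples, and you would swap in your real workload:

```python
import torch
from torch.profiler import profile, record_function, ProfilerActivity

model.eval()
x = torch.randn(8, 3, 224, 224)  # example batch

with torch.no_grad():
    with profile(activities=[ProfilerActivity.CPU],  # add ProfilerActivity.CUDA on GPU
                 record_shapes=True, profile_memory=True) as prof:
        with record_function("inference"):
            model(x)

# summary table of the most expensive operators
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```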
Pytorch Inference Wrappers: The Essentials
torch.no_grad
What it does: disables autograd during the forward pass, so no gradient graphs are built. When to use it: every forward pass that will not call backward.

```python
with torch.no_grad():
    output = model(input)
```

torch.jit: TorchScript trace and script
What TorchScript Gives You
A statically serialized model that removes Python overhead and can run in a C++ runtime on servers or devices.
- torch.jit.trace records tensor operations for fixed input shapes. It is fast to use but does not capture Python control flow.
- torch.jit.script compiles Python control flow into scripted code. It supports dynamic behavior at the cost of more rigorous type and shape checks.
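To see the difference, here is a small sketch of a module with a data-dependent branch; tracing would record only the branch taken by the example input, while scripting preserves both paths:

```python
import torch

class Gate(torch.nn.Module):
    def forward(self, x):
        # data-dependent control flow: torch.jit.trace would freeze one branch,
        # torch.jit.script keeps the conditional
        if x.sum() > 0:
            return x * 2
        return x - 1

scripted = torch.jit.script(Gate())
print(scripted(torch.ones(3)))    # tensor([2., 2., 2.])
print(scripted(-torch.ones(3)))   # tensor([-2., -2., -2.])
```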
Example tracing flow
```python
model.eval()
traced = torch.jit.trace(model, dummy_input)
torch.jit.save(traced, "traced_model.pt")

loaded = torch.jit.load("traced_model.pt")
loaded.eval()
with torch.no_grad():
    out = loaded(dummy_input)
```
When to Prefer Trace or Script
Use trace when your model forward is static and the input size is fixed. Use script when loops or conditionals depend on tensor values. How does this affect deployment target choices?

torch.compile in PyTorch 2.0 and later
What it does: compiles and optimizes the model under the hood to generate faster kernels and remove Python overhead while preserving dynamic behavior. Using it is simple:

```python
compiled_model = torch.compile(model)
compiled_model.eval()
with torch.no_grad():
    out = compiled_model(dummy_input)
```
When it Helps
It often improves speed for dynamic graphs and workloads that run many small Python-level operations. Test it against your baseline, as interactions with third-party operations or custom kernels can alter the results.
Quantization for Inference: Reduce Size and Speed Up CPU Execution
Why Quantize
Lower-precision computing reduces memory bandwidth and can speed up inference, especially on CPUs and mobile devices. Common choices include INT8 and mixed-precision FP16 on GPU.
Types of Quantization
- Dynamic quantization: Convert weights to low precision at runtime without calibration. Simple and effective for transformer and linear heavy models.
- Static quantization: Calibrate activations with representative data to get better accuracy.
- Quantization aware training: Simulate quantization during training to recover most of the original accuracy.
Dynamic Quantization Example
```python
import torch
import torch.quantization
from torchvision.models import resnet50

model = resnet50(pretrained=True)
quant_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

dummy_input = torch.randn(1, 3, 224, 224)  # defined here so the snippet is self-contained
with torch.no_grad():
    out = quant_model(dummy_input)
```
When to Use Each Method
Start with dynamic quantization for CPU inference. Move to static quantization or QAT when you need more aggressive compression with minimal accuracy loss.
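For reference, a hedged sketch of eager-mode post-training static quantization might look like this; it assumes `model` has already been wrapped with QuantStub/DeQuantStub (plus any module fusion) and that `calib_loader` is a hypothetical loader of representative batches:

```python
import torch
import torch.quantization

model.eval()
# choose the quantized backend: "fbgemm" for x86 servers, "qnnpack" for ARM
model.qconfig = torch.quantization.get_default_qconfig("fbgemm")

prepared = torch.quantization.prepare(model)      # insert observers

with torch.no_grad():
    for x in calib_loader:                        # calibration pass records activation ranges
        prepared(x)

quantized = torch.quantization.convert(prepared)  # swap in quantized modules
```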
Utility Patterns and Small Helpers
A compact wrapper keeps code consistent and prevents mistakes across a code base.

```python
def run_inference(model, input_tensor):
    model.eval()
    with torch.no_grad():
        return model(input_tensor)
```

Use this pattern in tests and production code so the calls are less error-prone. Does your team have a single point of entry for inference?
Common Pitfalls and Mistakes During Deployment
Leaving The Model in Train Mode
Batch normalization and dropout behave differently and will produce inconsistent predictions when the mode is wrong.
Measuring Warm-Up and Caching Effects as if They Were Real Speed
GPU kernels and caches take time to warm up. Exclude warm-up runs from your steady-state numbers.
Device Transfers and Data Preprocessing Overhead
Moving tensors across devices or doing heavy Python side preprocessing per request hides actual model time. Profile the whole pipeline.
Using cudnn.benchmark Incorrectly
Turning torch.backends.cudnn.benchmark on can speed up training or inference for fixed-size inputs. It can cause non-determinism and extra overhead when input sizes vary.
Tools and Further Optimizations to Try
- Mixed precision and AMP for GPU: reduce compute and memory with FP16 for many CNN and transformer workloads.
- ONNX export and ONNX Runtime: target other runtimes and accelerators. Try ORT for CPU inference and GPU integrations.
- TensorRT or TVM for kernel fusion and platform-specific acceleration.
- Pruning and weight sharing when you need smaller models at inference time.
- FX graph mode transformations for custom quantization or operator fusion.
- Set environment and runtime knobs: torch.set_num_threads, OMP_NUM_THREADS, and the FBGEMM or QNNPACK backends for quantized paths (a short sketch follows this list).
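As a quick illustration, here is a small sketch of those runtime knobs; the thread count and backend choice are examples, not recommendations:

```python
import torch

# OMP_NUM_THREADS is set in the environment before launching Python,
# e.g. OMP_NUM_THREADS=4 python serve.py

torch.set_num_threads(4)                    # intra-op CPU thread pool (example value)
torch.backends.quantized.engine = "fbgemm"  # x86 servers; use "qnnpack" on ARM/mobile
```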
Key Takeaway Actionable Checklist
- Use model.eval and torch.no_grad for forward-only execution.
- Profile the entire pipeline, not just the model forward pass.
- Try torch.compile for dynamic graphs and torch.jit for serialized deployments.
- Apply quantization for CPU and mixed precision for GPU when accuracy permits.
- Optimize data transfers, batching, and thread settings to raise throughput without harming latency.
Related Reading
- LLM Inference Optimization
- Model Context Protocol
- Speculative Decoding
- Lora Fine Tuning
- Gradient Checkpointing
- LLM Quantization
- LLM Use Cases
- Post Training Quantization
- vLLM Continuous Batching
Inference PyTorch Models

Place the model on the correct device, turn off gradients, set evaluation mode, and run batched inputs. Use torch.no_grad to avoid computing gradients. Use model.eval() to disable dropout and enable deterministic paths for many layers. Send tensors to the device and control the dtype for speed and memory efficiency.
Example for simple batched CPU or GPU inference:

```python
# setup
import torch
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

# batched inference loop
inputs = torch.randn(1024, 3, 224, 224)  # example inputs
dataset = TensorDataset(inputs)
loader = DataLoader(dataset, batch_size=32, shuffle=False,
                    pin_memory=True, num_workers=4)

with torch.no_grad():
    for batch in loader:
        x = batch[0].to(device, non_blocking=True)
        out = model(x)  # forward pass
        # process out or move to CPU as needed
```

Try this loop with increasing batch sizes until GPU memory limits appear.
Save and Load Models: Complete Model vs State_dict
- Save the entire model when you want exact object graph restoration and you control both save and load environments.

```python
# save whole model
torch.save(model, "model.pt")

# load whole model
model = torch.load("model.pt")
model.eval()
```
- Save the state_dict when you want smaller, more portable artifacts and you have the model class available.

```python
# save state dict
torch.save(model.state_dict(), "weights.pt")

# reload state dict
model = TheModel(...)  # reconstruct the model class first
model.load_state_dict(torch.load("weights.pt"))
model.eval()
```
Which one to pick depends on whether you can recreate the model class during load or you need a self-contained artifact; what do you have on the deployment side?
Efficient Batching and Throughput Best Practices
Batch inputs rather than single examples to improve utilization. Use a collate_fn that pads variable-length sequences and returns attention masks when working with text or token sequences.
Prefer larger batches up to the memory and latency tradeoff point. Use pinned memory and num_workers > 0 for CPU to GPU transfers.
Example collate and DataLoader for token sequences:

```python
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

def collate_tokens(batch):
    input_ids = [torch.tensor(x["input_ids"]) for x in batch]
    input_ids = pad_sequence(input_ids, batch_first=True, padding_value=pad_token_id)
    attention_mask = (input_ids != pad_token_id).long()
    return {"input_ids": input_ids, "attention_mask": attention_mask}

loader = DataLoader(dataset, batch_size=8, collate_fn=collate_tokens,
                    pin_memory=True, num_workers=4)
```

Use bucketing or sorting by length to reduce padding overhead in large text workloads. When serving low-latency requests, prefer smaller batches or adaptive batching with a short timeout.
Torchscript for Portable Inference Without a Python Interpreter
Use torch.jit.script or torch.jit.trace to create a serialized model that runs without Python. Scripted modules can be loaded from C++ with libtorch.
```python
# script the model (torch.jit.script takes only the module, not example inputs)
model.eval()
scripted = torch.jit.script(model)
scripted.save("model_script.pt")

# load in Python
model = torch.jit.load("model_script.pt")
model.eval()
```

C++ loader example (skeleton):

```cpp
#include <torch/script.h>

torch::jit::script::Module module;
try {
  module = torch::jit::load("model_script.pt");
} catch (const c10::Error& e) {
  // handle error
}
```

Test behavior differences between scripting and tracing when your model uses Python control flow. Script when logic depends on data-dependent branches.
Onnxruntime For Cross-Platform Speed and Deployment
Export to ONNX when you need to run on a different runtime, leverage hardware execution providers, or reduce latency. Include input and output names, opset version, and dynamic axes if sequence lengths vary.
```python
# export with dynamic axes and names
seq_len = 128  # example sequence length
dummy = torch.randint(0, 1000, (1, seq_len), dtype=torch.long).to(device)  # integer token IDs; 1000 is a placeholder vocab bound

torch.onnx.export(
    model, dummy, "model.onnx",
    input_names=["input_ids"], output_names=["logits"],
    dynamic_axes={"input_ids": {0: "batch", 1: "seq"},
                  "logits": {0: "batch", 1: "seq"}},
    opset_version=17,
)
```
Run in Python with ONNX Runtime:

```python
import onnxruntime as ort

sess = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
inputs = {sess.get_inputs()[0].name: input_array}
outs = sess.run(None, inputs)
```

Pick the ONNX Runtime execution provider that matches your target platform, such as:
- CUDA
- TensorRT
- OpenVINO
- CoreML
- NNAPI
ONNX Runtime applies graph optimizations and supports quantized models to reduce memory usage and accelerate execution.
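If you take this route, one option is ONNX Runtime's own post-training dynamic quantization tool; a hedged sketch (the file names are placeholders) looks like this:

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

# writes an INT8-weight copy of the exported model alongside the original
quantize_dynamic("model.onnx", "model.int8.onnx", weight_type=QuantType.QInt8)
```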
Quantization, Mixed Precision, and Other Inference Accelerations
For CPU and edge, try dynamic or static quantization. For transformer-style models, post-training dynamic quantization often gives a reasonable tradeoff.
```python
# dynamic quantization on linear layers
import torch.quantization

qmodel = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```

For GPU, use AMP and half precision to cut memory and increase throughput:

```python
from torch.cuda.amp import autocast

with torch.no_grad():
    with autocast():
        out = model(x)
```

Consider torch.compile to optimize kernels for inference in PyTorch 2.x, but test stability and performance for your model. Enable cudnn benchmark when input sizes are fixed:

```python
torch.backends.cudnn.benchmark = True
```
Advanced LLM Inference Patterns and Generation Speedups
For autoregressive generation, cache key value states to avoid recomputing past attention (use_cache or past_key_values). Batch prompts by similar length. Use attention mask and pad correctly. For Hugging Face models, generate in batched mode with return_dict_in_generate and use_cache True.
Example with transformers:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token, so reuse EOS for batching
model = AutoModelForCausalLM.from_pretrained("gpt2").to(device).eval()

prompts = ["Hello world", "Short prompt"]
tokens = tokenizer(prompts, return_tensors="pt", padding=True).to(device)

with torch.no_grad():
    outputs = model.generate(**tokens, max_new_tokens=50,
                             do_sample=False, use_cache=True)
```

Use beam search only when necessary. For large-scale async workloads, run specialized batch processing pipelines that assemble many requests into microbatches and apply generation in fewer passes. How will you structure request aggregation for your service?
Putting it into Production: Profiling and Observability
Profile end-to-end: decode and CPU time, GPU kernels, transfers, memory peaks, and GC. Use torch.profiler, NVIDIA Nsight Systems, and ONNX Runtime profiling. Track tail latency as well as average throughput. Automate load tests that match real request patterns so you measure realistic batch shapes and sequence lengths.
Inference delivers OpenAI-compatible serverless AI inference APIs for top open source LLM models, offering developers the highest performance at the lowest cost in the market. Start building with $10 in free API credits and experience state of the art language models that balance cost efficiency with high performance.
Related Reading
- KV Cache Explained
- LLM Performance Metrics
- LLM Serving
- Serving ML Models
- LLM Performance Benchmarks
- LLM Benchmark Comparison
- Inference Latency
- Inference Optimization
Start Building with $10 in Free API Credits Today!
Inference offers OpenAI-compatible serverless inference APIs for top open-source LLM models. You call a familiar endpoint, and the service handles model hosting, autoscaling, and secure access control so you can move from prototype to production without rewriting your API layer. Endpoints accept the same request shape that many engineers already use, while the backend applies GPU scheduling, model packing, and runtime optimizations that reduce latency and GPU cost.
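As an illustration only, swapping in an OpenAI-compatible endpoint usually amounts to pointing the standard client at a different base URL; the URL, model name, and environment variable below are placeholders rather than documented values:

```python
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.example-inference.net/v1",  # placeholder endpoint
    api_key=os.environ["INFERENCE_API_KEY"],          # placeholder env var
)

resp = client.chat.completions.create(
    model="llama-3.3-70b-instruct",  # example model id
    messages=[{"role": "user", "content": "Summarize PyTorch inference in one sentence."}],
)
print(resp.choices[0].message.content)
```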
Batch Processing for Large-Scale Async AI Jobs
When throughput matters, utilize specialized batch processing for large-scale asynchronous AI workloads. The system batches many requests into fewer GPU runs, supports micro-batching for variable-length token streams, and offers job queueing with retry and priority controls.
You can push thousands of documents or millions of tokens and let the service optimize I/O, padding, and tokenized batch layout to maximize tensor core utilization. How would you size batches to accommodate your typical prompt length and target tail latency?
Document Extraction Tuned for Retrieval Augmented Generation
Document extraction features are designed for RAG use cases that require clean passages, entity extraction, and chunking strategies that preserve context. The pipeline supports text chunking, passage embedding, semantic search, and extractive QA models.
It integrates with vector stores and can produce embeddings using CPU inference or GPU-accelerated models, which can be exported to ONNX Runtime or TorchScript for a reduced memory footprint and faster embedding throughput.
Start Building With Ten Dollars Free and Fast Onramps
You get ten dollars in free API credits to test endpoints, run batch jobs, and evaluate model trade-offs. SDKs and example notebooks demonstrate how to replace an OpenAI-compatible call with the Inference endpoint, export Hugging Face transformers to TorchScript or ONNX, and run profiling on a representative workload. What small experiment would prove cost per token and p95 latency for your app?
PyTorch Inference Playbook To Cut Cost and Latency
Use TorchScript and FX graph tracing to freeze the model graph, enabling operator fusion and kernel fusion. Export critical components to the ONNX runtime or run via TensorRT for int8 and fp16 kernels that use tensor cores.
Apply dynamic quantization for CPU-friendly transformers, or use static quantization or quantization-aware training for aggressive int8 models on GPU inference paths. Try mixed precision using AMP or fp16 model weights to halve memory and boost throughput where numeric stability allows.
Strategies for Optimizing Model Inference
Optimize memory and scheduling by pinning input tensors, using asynchronous CUDA streams, and reusing preallocated buffers for tokenizer outputs. For multi-GPU serving, apply model sharding or tensor parallelism while maintaining efficient batch packing. Use TorchServe or Triton as an orchestration layer when you need custom scheduling, or run compiled models via torch compile or TorchInductor to reduce Python overhead and improve operator fusion.
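One of those ideas in miniature: a hedged sketch of overlapping a pinned host-to-device copy with other work via a side CUDA stream before the forward pass (buffer shapes are arbitrary, and the model is assumed to already be on the GPU in eval mode):

```python
import torch

copy_stream = torch.cuda.Stream()
x_cpu = torch.randn(32, 3, 224, 224).pin_memory()  # pinned buffer, arbitrary shape

# launch the copy on a side stream so it can overlap other work
with torch.cuda.stream(copy_stream):
    x_gpu = x_cpu.to("cuda", non_blocking=True)

# make the default stream wait for the copy before using the tensor
torch.cuda.current_stream().wait_stream(copy_stream)
with torch.no_grad():
    out = model(x_gpu)
```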
Operational Tricks For Production Throughput
Warm models with synthetic traffic to avoid cold start penalties and keep a small pool of hot replicas for tail latency-sensitive requests. Cache embeddings and prompt templates to avoid repeated computation.
Monitor p50, p95, p99 latencies, GPU utilization, memory footprint, and throughput per second so autoscaling rules reflect real costs. Profile with Torch Profiler or Nsight to identify operator hotspots and verify that your quantized model utilizes the intended kernels rather than falling back to slow reference operations.
Integration Patterns That Respect Cost And Accuracy
Choose the right model size for the task and split responsibilities across models when it saves cost. Use smaller encoder-only models for embeddings and rerankers, and reserve larger decoder models for final generation.
Consider two-stage pipelines: a low-cost, fast model for candidate generation followed by a higher-quality model for the final output. Use int8 quantized checkpoints where loss in quality is acceptable and keep fp16 or bf16 models for core generation when you need higher fidelity.
Questions To Tune Your Inference Strategy
- What is your target p95 latency and cost per 1k tokens?
- How many concurrent users create peak load and how variable is prompt length?
Answering those helps choose batch sizes, autoscaling thresholds, and whether to invest in quantization-aware training or model compilation.
Related Reading
- Continuous Batching LLM
- Inference Solutions
- vLLM Multi-GPU
- Distributed Inference
- KV Caching
- Inference Acceleration
- Memory-Efficient Attention