
    A Practical Guide to Post Training Quantization for Edge AI

    Published on Aug 20, 2025


    When you try to run a large language model on a phone or a tiny server, speed and memory become the limits, not ideas. Within LLM inference optimization techniques, post-training quantization offers a practical path to cut model size and boost throughput: it converts floating-point weights and activations into lower-precision formats like INT8 or FP16 while using calibration data and per-channel scaling to preserve accuracy. Want to keep quality while lowering latency and memory footprint? This article will help you shrink and speed up your AI models so they run efficiently on edge devices without losing accuracy.

    To make that easier, Inference's AI inference APIs provide ready-made tools to apply quantization, handle calibration, and deploy optimized models so you get lower latency, smaller memory use, and simpler edge deployment without heavy changes to your workflow.

    What Is Post-Training Quantization (PTQ) and Why Is It Important?


    Model quantization reduces the number of bits used to represent each numerical value inside a model. In practice, that means converting weights, biases, and activations that live as 32-bit floats into lower precision formats such as 16-bit floats or 8-bit integers. The result is a smaller model, lower memory traffic, and faster inference.

    This matters when you deploy models to phones, embedded controllers, drones, or any system where storage, latency, and power matter. Quantized operators typically support only the forward pass, so quantized inference uses integer or reduced-precision arithmetic while training stays in full precision.

    What Quantization Does: Scales, Zero Points, and Formats

    Quantization maps real numbers x to integer values q using a scale and an optional zero point. A simple affine mapping looks like q = round(x / scale) + zero_point. When you dequantize, you recover a float approximation via x̂ = scale * (q - zero_point). That scale and zero point encode dynamic range and allow asymmetric quantization when values are not centered at zero.
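
    To make the mapping concrete, here is a small NumPy sketch (illustrative values only, not tied to any particular framework) that derives a scale and zero point from a tensor's min and max, quantizes to int8, and dequantizes back:

    import numpy as np

    def affine_quantize(x, num_bits=8):
        # Asymmetric (affine) quantization into the signed range [-128, 127]
        qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
        scale = (x.max() - x.min()) / (qmax - qmin)
        zero_point = int(round(qmin - x.min() / scale))
        q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
        return q, scale, zero_point

    def dequantize(q, scale, zero_point):
        # Recover the float approximation: x_hat = scale * (q - zero_point)
        return scale * (q.astype(np.float32) - zero_point)

    x = np.random.randn(4, 8).astype(np.float32) * 0.2 + 0.5  # values not centered at zero
    q, scale, zp = affine_quantize(x)
    x_hat = dequantize(q, scale, zp)
    print("max abs reconstruction error:", np.abs(x - x_hat).max())

    The reported reconstruction error is exactly the quantization noise that the rest of this article tries to keep small.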

    Choices you make here matter:

    • Per tensor versus per channel
    • Symmetric versus asymmetric
    • Integer versus float scale

    Per-channel weight quantization usually preserves accuracy better for convolution and linear layers. Activation quantization often needs per-tensor ranges and careful calibration to handle outliers.
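
    To see why, the sketch below (symmetric int8, a made-up weight matrix with one outlier channel) compares the mean reconstruction error of a single per-tensor scale against one scale per output channel:

    import numpy as np

    def sym_quant_dequant(w, scale):
        # Symmetric int8 quantize followed by dequantize
        q = np.clip(np.round(w / scale), -127, 127)
        return q * scale

    w = np.random.randn(64, 128).astype(np.float32)
    w[0] *= 20.0  # one output channel with a much larger dynamic range

    # Per-tensor: a single scale for the whole matrix
    scale_tensor = np.abs(w).max() / 127.0
    err_tensor = np.abs(w - sym_quant_dequant(w, scale_tensor)).mean()

    # Per-channel: one scale per output row
    scale_channel = np.abs(w).max(axis=1, keepdims=True) / 127.0
    err_channel = np.abs(w - sym_quant_dequant(w, scale_channel)).mean()

    print(f"per-tensor mean abs error:  {err_tensor:.5f}")
    print(f"per-channel mean abs error: {err_channel:.5f}")

    The outlier channel inflates the per-tensor scale and hurts every other channel; per-channel scales isolate the damage.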

    Post-Training Quantization Explained: How It Works and When to Use It

    Post training quantization applies reduced precision after you finish training. You take a pretrained float32 model and compress its tensors into int8 or float16 without retraining. PTQ can be static or dynamic. Static PTQ uses a small calibration dataset to collect activation ranges and fix quantization parameters before deployment.

    Dynamic PTQ computes ranges on the fly during inference. Calibration methods include simple min-max, percentile clipping, histogram-based methods, and KL divergence for distribution matching. Calibration data should resemble real inputs and can be a few hundred to a few thousand samples for many models. PTQ reduces model size, lowers memory traffic, and accelerates inference on hardware that supports int8 math.
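
    As a concrete example of the dynamic flavor, PyTorch ships dynamic quantization that converts Linear weights to int8 up front and estimates activation ranges at run time. A minimal sketch, with a toy model standing in for a pretrained network:

    import torch
    import torch.nn as nn

    # Toy float32 model standing in for a pretrained network
    model = nn.Sequential(
        nn.Linear(256, 512),
        nn.ReLU(),
        nn.Linear(512, 64),
    ).eval()

    # Dynamic PTQ: weights become int8 now, activation ranges are estimated on the fly
    quantized = torch.ao.quantization.quantize_dynamic(
        model,
        {nn.Linear},        # module types to quantize
        dtype=torch.qint8,
    )

    x = torch.randn(8, 256)
    with torch.no_grad():
        out = quantized(x)
    print(out.shape)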

    Quantization Aware Training: Why It Differs and When to Choose It

    Quantization aware training simulates reduced precision during training by inserting fake quantization operators in forward and backward passes. The model learns weights that are robust to quantization error. QAT tends to achieve higher final accuracy than straight PTQ on complex networks, but it raises training cost.

    You must modify the training graph, maintain fake quant modules, and tune hyperparameters. QAT is the right choice when accuracy budgets are tight and you can afford extra training time and engineering work.

    PTQ Techniques and Calibration Strategies You Can Use Today

    • Static versus dynamic quantization: Use static with a representative calibration set when you want predictable performance and lower variance. Use dynamic when calibration data is limited or activations vary a lot.
    • Calibration methods: Min max is fast but sensitive to outliers; percentile clipping trims extremes; histogram and KL divergence-based methods match distributions for better scale selection.
    • Granularity: Per channel weights, per tensor activations. Per channel often recovers more accuracy for weights; per tensor is cheaper and sometimes necessary for hardware.
    • Symmetric versus asymmetric: Symmetric quantization uses zero point zero and works well when values center on zero; asymmetric handles non-centered activations and dynamic ranges.
    • Bias correction and outlier handling: Correct bias introduced by quantizing weights, or clip extreme activation values before scale computation. These reduce quantization noise.
    • Mixed precision and integer only: Keep sensitive layers in float16 or float32 and quantize the rest to int8. For best throughput, target integer-only inference to avoid repeated dequantize and requantize overhead.
    • Small fine-tuning after PTQ: A few epochs of quantization-aware fine-tuning on a small dataset can recover much of the accuracy lost by naive PTQ.
    • Tooling: Use TFLite, ONNX Runtime, TensorRT, the PyTorch quantization toolkit, or OpenVINO to experiment with static calibration, dynamic quantization, and integer-only kernels; a minimal PyTorch static-quantization sketch follows this list.
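
    Here is a minimal eager-mode static PTQ sketch using the PyTorch quantization toolkit; the toy model, random calibration batches, and the fbgemm backend are placeholders for your own network, calibration loader, and target hardware:

    import torch
    import torch.nn as nn

    class SmallNet(nn.Module):
        def __init__(self):
            super().__init__()
            self.quant = torch.ao.quantization.QuantStub()      # float -> int8 boundary
            self.conv = nn.Conv2d(3, 16, 3, padding=1)
            self.relu = nn.ReLU()
            self.fc = nn.Linear(16 * 32 * 32, 10)
            self.dequant = torch.ao.quantization.DeQuantStub()  # int8 -> float boundary

        def forward(self, x):
            x = self.quant(x)
            x = self.relu(self.conv(x))
            x = torch.flatten(x, 1)
            return self.dequant(self.fc(x))

    model = SmallNet().eval()
    model.qconfig = torch.ao.quantization.get_default_qconfig("fbgemm")  # x86 backend

    # 1) insert observers, 2) run calibration batches, 3) convert to int8 modules
    prepared = torch.ao.quantization.prepare(model)
    with torch.no_grad():
        for _ in range(32):  # replace random data with representative calibration inputs
            prepared(torch.randn(8, 3, 32, 32))
    quantized = torch.ao.quantization.convert(prepared)
    print(quantized)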

    Operator Support and Practical Constraints

    Not every operator or custom kernel has a quantized implementation. Check your inference engine. Fused ops like conv plus bias plus activation usually perform better when fused and quantized together. Remember that quantized operators generally support only the forward pass, which means you cannot backprop through quantized kernels during training.

    Some accelerators only support per-tensor quantization or require symmetric scales, so you must map your quantization scheme to the hardware capabilities. Also, watch memory layout and alignment; quantized kernels can be bandwidth-bound as well as compute-bound.

    Accuracy Tradeoffs and How to Measure Them

    Quantization introduces quantization noise and distribution shifts that can reduce accuracy. Measure accuracy on a realistic validation set and evaluate latency, memory, and energy separately.

    If accuracy loss exceeds your threshold, try per-channel weight quantization, better calibration strategies, bias correction, or targeted mixed precision. Use end-to-end benchmarks with representative input pipelines to capture absolute latency and power numbers.

    Which Quantization Strategy Fits Your Use Case?

    Ask these questions:

    • Do you have a representative calibration dataset?
    • Is model accuracy critical?
    • Does target hardware support int8 or float16 well?

    If you lack calibration data and the accuracy tolerance is moderate, start with dynamic PTQ. If you can collect calibration samples and need a smaller model size and better throughput, use static PTQ with histogram or KL-based calibration plus per-channel weights. If the accuracy margin is tight and you can retrain, invest in QAT or PTQ plus short fine-tuning.

    Deployment Tips and Hardware Considerations

    Select quantization parameters that match the device: many mobile NPUs and GPUs expose optimized int8 paths, and some tensor cores prefer float16. Use integer-only inference kernels when possible to avoid dequantize overhead. Validate operator coverage in your inference engine and convert the model with tools that preserve quantization parameters, such as scale and zero point.

    Monitor memory bandwidth and cache behavior. Quantization reduces model size but may shift bottlenecks to data movement. Measure power and thermal behavior on target hardware to confirm energy savings for battery-powered devices.

    How Can Post-Training Quantization Be Achieved?


    Post-training quantization converts a pretrained floating-point model to lower precision to shrink model size and speed up inference without retraining. It usually reduces weights and activations to 8-bit integers, which, when starting from FP16, cuts the parameter footprint in half and often speeds up LLM inference.

    The technique uses a small calibration set to estimate ranges and find quantization scaling factors with a search over a compact parameter space. Metrics for that search include mean squared error and cosine distance.

    Calibration: Collecting Representative Activation Statistics

    PTQ begins with a representative dataset, sometimes called a calibration dataset. You run forward passes through the unquantized model while inserting observer modules at layer outputs and key tensors.

    Observers record min and max values and can also capture histograms, percentiles, and other distribution details for activations. No training or weight updates happen here. The goal is to capture how the model uses dynamic range on realistic inputs so you can estimate quantization ranges later.
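
    Conceptually, an observer is just a forward hook that records range statistics without touching weights. The sketch below uses plain PyTorch hooks instead of the built-in observer classes, with a toy model and random inputs standing in for real calibration data:

    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 64)).eval()
    stats = {}  # layer name -> (running_min, running_max)

    def make_observer(name):
        def hook(module, inputs, output):
            lo, hi = output.min().item(), output.max().item()
            if name in stats:
                lo, hi = min(lo, stats[name][0]), max(hi, stats[name][1])
            stats[name] = (lo, hi)
        return hook

    handles = [m.register_forward_hook(make_observer(n))
               for n, m in model.named_modules() if isinstance(m, nn.Linear)]

    # Calibration: forward passes only, no gradients, no weight updates
    with torch.no_grad():
        for _ in range(100):
            model(torch.randn(16, 128))  # replace with representative inputs

    for h in handles:
        h.remove()
    print(stats)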

    Quantization Parameter Calculation: How Scales and Zero Points Are Found

    After calibration, you analyze observer statistics to compute scale and zero point for each tensor or channel. Standard range estimation methods include simple min-max, entropy-based methods, and percentile clipping, such as using the 99th or 99.9th percentile to ignore extreme outliers. You can choose symmetric or asymmetric mapping, per-tensor or per-channel linear quantization, and whether to apply integer-only inference.

    PTQ often uses a small parameter grid search over clip thresholds and per-layer scales, optimizing a metric like MSE or cosine distance between original and quantized activations. That search stays cheap because it uses collected statistics rather than gradient updates.
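
    As an illustration of that search, the following sketch picks a clipping threshold for a single activation tensor by minimizing MSE between the original values and their quantize-dequantize reconstruction (NumPy, symmetric int8, synthetic heavy-tailed data):

    import numpy as np

    def quant_dequant(x, clip):
        # Symmetric int8 quantize-dequantize with the given clipping threshold
        scale = clip / 127.0
        q = np.clip(np.round(x / scale), -127, 127)
        return q * scale

    # Stand-in for calibration activations with heavy tails
    acts = np.random.laplace(scale=1.0, size=100_000).astype(np.float32)

    best = None
    for pct in (99.0, 99.9, 99.99, 100.0):
        clip = np.percentile(np.abs(acts), pct)
        mse = np.mean((acts - quant_dequant(acts, clip)) ** 2)
        print(f"clip at p{pct}: threshold={clip:.3f}  mse={mse:.6f}")
        if best is None or mse < best[1]:
            best = (pct, mse)

    print("selected percentile:", best[0])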

    Model Conversion: Quantizing Weights and Activations in Practice

    With scales and zero points in hand, you quantize weights by rounding to integer levels and store the quantized tensors. For activations, you either insert quantize and dequantize operations around operators or rely on runtime support that performs quantization on the fly.

    Many deployments prefer weight-only quantization because weights dominate memory for large language models, but weight-only quantization forces dequantization at runtime, which can add compute overhead if your bottleneck is arithmetic rather than memory. You can choose to quantize per tensor for simplicity or to quantize per channel for better accuracy on convolutions and linear layers.
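
    A minimal sketch of the weight-only pattern: store a Linear weight as int8 with per-output-channel scales, then dequantize just before the matmul, which is exactly where the extra compute shows up (PyTorch, toy shapes):

    import torch

    def quantize_weight_only(w):
        # Per-output-channel symmetric int8 quantization of a Linear weight [out, in]
        scale = w.abs().amax(dim=1, keepdim=True) / 127.0
        q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
        return q, scale

    w_fp32 = torch.randn(512, 256)
    q, scale = quantize_weight_only(w_fp32)  # store these two; about 4x smaller than fp32

    def linear_weight_only(x, q, scale, bias=None):
        w_hat = q.to(torch.float32) * scale  # dequantize at run time (the extra compute)
        return torch.nn.functional.linear(x, w_hat, bias)

    x = torch.randn(8, 256)
    y = linear_weight_only(x, q, scale)
    print(y.shape, (w_fp32 - q.float() * scale).abs().max().item())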

    Activation Aware Weight Selection

    Not all weights are equally important, and activation signals can tell you which ones to keep precise. Activation-aware approaches observe activation magnitudes and identify the small fraction of weights that carry outsized influence, then keep that tiny subset in higher precision while quantizing the bulk to integers.

    One recent method applies this idea to LLMs and finds that about 0.1 to 1 percent of weights are salient and worth preserving at higher precision. Using activation-driven selection reduces accuracy loss while keeping most memory and compute savings.
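
    The sketch below is a heavily simplified illustration of that idea rather than the published algorithm: rank the input channels of a Linear weight by the average activation magnitude seen during calibration, keep roughly the top 1 percent in float, and quantize the rest to int8:

    import torch

    def mixed_precision_quant(w, act_magnitude, keep_frac=0.01):
        # w: Linear weight [out, in]; act_magnitude: mean |activation| per input channel [in]
        n_keep = max(1, int(keep_frac * w.shape[1]))
        salient = torch.topk(act_magnitude, n_keep).indices  # most active input channels

        scale = w.abs().amax(dim=1, keepdim=True) / 127.0    # symmetric per-row scales
        q = torch.clamp(torch.round(w / scale), -127, 127)
        w_hat = q * scale                                    # int8 approximation
        w_hat[:, salient] = w[:, salient]                    # keep salient columns in float
        return w_hat, salient

    w = torch.randn(256, 1024)
    act_magnitude = torch.rand(1024)  # stands in for calibration statistics
    w_hat, salient = mixed_precision_quant(w, act_magnitude)
    print("kept in float:", salient.numel(), "of", w.shape[1], "input channels")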

    Parameter Search and Metrics: How PTQ Tunes Scales Without Retraining

    PTQ commonly chooses quantization parameters by searching in a small space driven by calibration statistics. Search objectives include mean squared error between tensors, cosine distance of embedding or activation vectors, and end-to-end task proxies when feasible.

    Search variables include clipping percentiles, symmetric versus asymmetric range, and per-channel versus per-tensor scaling. Because you do not update weights, this search is fast and reproducible even with limited calibration examples.

    Handling Outliers and Clipping: Percentile and Entropy-Based Methods

    Outliers can wreck integer ranges. Percentile clipping sets range endpoints at a high percentile value, such as 99 or 99.9, to ignore rare spikes. Entropy-based methods try to choose ranges that minimize information loss measured by quantization-induced distribution changes.

    Both approaches trade off tail fidelity for better average performance and minor quantization distortion. Your choice often depends on the activation distribution shape and downstream sensitivity.

    Insertion of Quantize and Dequantize Operations and Runtime Models

    You can insert explicit quantize and dequantize operations into the model graph to produce int outputs and then dequantize when needed. Alternatively, some runtimes perform mixed precision or dynamic quantization without modifying the graph.

    Runtimes can also provide integer-only kernels that avoid dequantization entirely. The insertion strategy affects runtime memory traffic and operator fusion.

    When to Choose Quantization Aware Training Instead of PTQ

    Choose quantization-aware training when your architecture or task is sensitive to activation quantization, when you need high accuracy, or when you target very low bit widths, such as 4 or 2 bits.

    QAT simulates quantized arithmetic during training using fake quantization and updates weights with backpropagation so the model can learn to compensate for discretization errors. QAT adds training cost but recovers accuracy that PTQ alone cannot in many complex cases.

    When to Prefer PTQ Instead of Quantization Aware Training

    Choose PTQ when retraining is impractical due to model size, data privacy, or limited compute. PTQ works well when a slight drop in accuracy is acceptable and you need rapid deployment. It scales to huge models because calibration runs are cheap and do not require gradient memory. A well-tuned PTQ model also makes a good initialization point if you later decide to apply QAT in a targeted way.

    Combining PTQ With QAT Fine Tuning: A Best of Both Worlds Workflow

    Start by applying PTQ to quantize weights or both weights and activations to the target precision and use that quantized model as initialization. Then fine-tune under quantization-aware training, where fake quantization simulates integer behavior during forward and backward passes.

    Fine-tuning from a PTQ state usually converges faster than running QAT from scratch, often needing only a few epochs to recover lost accuracy because the model has already adapted to the quantization noise. You update weights with regular optimization while keeping the quantization simulation active.
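
    A compact sketch of that workflow using PyTorch's eager-mode QAT tooling; the toy model, random data, and short loop are placeholders, and in practice you would initialize from your PTQ-calibrated weights and scales:

    import torch
    import torch.nn as nn

    class TinyNet(nn.Module):
        def __init__(self):
            super().__init__()
            self.quant = torch.ao.quantization.QuantStub()
            self.fc1 = nn.Linear(64, 128)
            self.relu = nn.ReLU()
            self.fc2 = nn.Linear(128, 10)
            self.dequant = torch.ao.quantization.DeQuantStub()

        def forward(self, x):
            x = self.quant(x)
            x = self.relu(self.fc1(x))
            return self.dequant(self.fc2(x))

    model = TinyNet()  # in practice, load PTQ-initialized weights here
    model.qconfig = torch.ao.quantization.get_default_qat_qconfig("fbgemm")
    model.train()
    qat_model = torch.ao.quantization.prepare_qat(model)  # inserts fake-quant modules

    opt = torch.optim.SGD(qat_model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for step in range(100):  # short fine-tune, not full training
        x, y = torch.randn(32, 64), torch.randint(0, 10, (32,))
        loss = loss_fn(qat_model(x), y)
        opt.zero_grad()
        loss.backward()  # gradients flow through the fake-quant ops
        opt.step()

    qat_model.eval()
    int8_model = torch.ao.quantization.convert(qat_model)  # real int8 kernels for deployment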

    Practical Trade-Offs for LLM Inference: Memory, Compute, and Throughput

    Large language model inference often hits memory limits first, which makes weight-only quantization attractive because it reduces the model footprint. If your hardware is compute-bound, weight-only quantization can add dequantization overhead that reduces throughput.

    Mixed precision and selective higher precision for salient weights balance accuracy and efficiency. Also consider operator fusion, integer kernel availability, and memory bandwidth when choosing per-tensor or per-channel strategies.

    Tooling Note: Framework Support for PTQ

    PyTorch, TensorFlow, and ONNX Runtime all include built-in PTQ tooling such as observers, calibration pipelines, quantize and dequantize transforms, and interfaces for per-tensor and per-channel quantization. These frameworks also offer integration with integer-only runtimes and support for calibration metrics and simple parameter searches for scale selection.

    How to Create Your PTQ Application in C++ and Python


    PTQ Playbook: Four clear steps to shrink FP32 to INT8:

    • Load a pre-trained model you intend to optimize. Use a PyTorch nn.Module for quick experiments or a scripted LibTorch module for C++ deployment.
    • Choose a quantization method: Static post-training quantization with calibration, entropy-based or min-max calibrators, per-channel or per-tensor weight quantization.
    • Calibrate with a representative sample dataset. Run the model on that dataset and record FP32 activations to compute scales and zero points for INT8. Save a calibration cache for reuse.
    • Export the quantized artifact: A serialized TensorRT engine, a Torch‑TensorRT module, or a quantized ONNX graph. Then run inference and validate accuracy and throughput.

    Below you will find parallel, compact examples in Python and C++ that show loading, quantizing, calibrating, and running inference.

    Python lab: Rapid PTQ with PyTorch and Torch‑TensorRT

    • Goal: Prototype fast, iterate on calibrator choice, and dataset size.
    • Key libs: Torch, torchvision, torch_tensorrt.

    Python example (minimal, end-to-end):

    import torch
    import torchvision
    import torchvision.transforms as transforms
    import torch_tensorrt
    
    # ------------------- Load pretrained model -------------------
    model = torchvision.models.resnet18(pretrained=True).eval().cuda()
    
    # ------------------- Representative dataset + DataLoader -------------------
    transform = transforms.Compose([
        transforms.Resize(224),
        transforms.ToTensor(),
        transforms.Normalize(
            (0.485, 0.456, 0.406),   # mean (ImageNet)
            (0.229, 0.224, 0.225)    # std
        ),
    ])
    
    testset = torchvision.datasets.CIFAR10(
        root="./data",
        train=False,
        download=True,
        transform=transform,
    )
    
    loader = torch.utils.data.DataLoader(
        testset,
        batch_size=8,
        shuffle=False,
        num_workers=2,
    )
    
    # ------------------- Create INT8 calibrator -------------------
    calibrator = torch_tensorrt.ptq.DataLoaderCalibrator(
        loader,
        cache_file="./calib.cache",
        use_cache=False,  # set True to reuse cache
        algo_type=torch_tensorrt.ptq.CalibrationAlgo.ENTROPY_CALIBRATION_2,
        device=torch.device("cuda:0"),
    )
    
    # ------------------- Compile Torch-TensorRT module -------------------
    trt_mod = torch_tensorrt.compile(
        model,
        inputs=[torch_tensorrt.Input((8, 3, 224, 224))],  # input spec
        enabled_precisions={torch.float, torch.half, torch.int8},  # allow FP32, FP16, INT8
        calibrator=calibrator,
        device={"device_type": torch_tensorrt.DeviceType.GPU, "gpu_id": 0},
    )
    
    # ------------------- Run inference (input is FP32) -------------------
    sample = torch.randn(8, 3, 224, 224).cuda()
    out = trt_mod(sample)
    
    print("Output shape:", out.shape)
    



    Python notes: Switch algo_type to the min-max calibration variant of CalibrationAlgo for certain NLP or symmetric-range tasks. Set use_cache=True together with cache_file when you want to avoid recalibrating every time.

    C++ shop: Production PTQ with LibTorch and Torch‑TensorRT

    • Goal: Production-grade, typed APIs, control over data loading and preprocessing.
    • Key libs: LibTorch, Torch‑TensorRT, TensorRT runtime for engine hardening.

    C++ example (sketch, core parts only):

    #include <torch/script.h>
    #include <torch/torch.h>
    #include <torch_tensorrt/torch_tensorrt.h>
    
    // ------------------- Load scripted model -------------------
    auto module = torch::jit::load("model_scripted.pt");
    module.to(torch::kCUDA);
    module.eval();
    
    // ------------------- Build Dataset + DataLoader -------------------
    // Example: use CIFAR10 test set with normalization and stacking
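    // Note: LibTorch has no built-in CIFAR10 dataset; datasets::CIFAR10 here is assumed to be
    // a custom torch::data::Dataset helper such as the one shipped with the Torch-TensorRT PTQ example.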
    auto calibration_dataset = datasets::CIFAR10(
            data_dir, datasets::CIFAR10::Mode::kTest)
        .use_subset(320)  // use subset for calibration
        .map(torch::data::transforms::Normalize<>(
            {0.4914, 0.4822, 0.4465},   // mean
            {0.2023, 0.1994, 0.2010}))  // std
        .map(torch::data::transforms::Stack<>());
    
    auto calibration_dataloader = torch::data::make_data_loader(
        std::move(calibration_dataset),
        torch::data::DataLoaderOptions()
            .batch_size(32)
            .workers(2)
    );
    
    // ------------------- Make Torch-TensorRT calibrator -------------------
    auto calibrator = torch_tensorrt::ptq::make_int8_calibrator(
        std::move(calibration_dataloader),
        "./calib.cache",
        /*use_cache=*/true
    );
    
    // ------------------- Compile spec with INT8 enabled -------------------
    std::vector<std::vector<int64_t>> input_shape = {{32, 3, 32, 32}};
    auto compile_spec = torch_tensorrt::CompileSpec({input_shape});
    
    // Allow FP16 + INT8
    compile_spec.enabled_precisions.insert(torch::kF16);
    compile_spec.enabled_precisions.insert(torch::kI8);
    
    // Attach calibrator for INT8 PTQ
    compile_spec.ptq_calibrator = calibrator;
    
    // ------------------- Compile Torch-TensorRT module -------------------
    auto trt_mod = torch_tensorrt::CompileGraph(module, compile_spec);
    
    // ------------------- Run inference -------------------
    torch::Tensor input = torch::rand({32, 3, 32, 32}).to(torch::kCUDA);
    auto out = trt_mod.forward({input}).toTensor();
    



    C++ notes: You can also set compile_spec to accept an external nvinfer1::IInt8Calibrator pointer if you already wrote a native TensorRT calibrator. Serialize trt_mod or the TensorRT engine to disk for fast startup.

    TensorRT native: Custom INT8 calibrator in C++

    Why write one? You may need custom preprocessing, streaming datasets, or constrained storage. The calibrator must implement TensorRT’s IInt8Calibrator interface. Use nvinfer1::IInt8EntropyCalibrator2 for entropy-based calibration or IInt8MinMaxCalibrator for min-max style.

    Minimal calibrator skeleton (concept):

    #include <NvInfer.h>
    #include <cuda_runtime_api.h>
    #include <fstream>
    #include <vector>
    #include <string>
    
    class MyCalibrator : public nvinfer1::IInt8EntropyCalibrator2
    {
    public:
        MyCalibrator(DataLoader& dl, int batch_size, const std::string& cacheFile)
            : dataloader_(dl),
              batch_size_(batch_size),
              cacheFile_(cacheFile)
        {
            // Allocate device buffer(s) for input
            int inputSize = batch_size_ * inputChannels * inputHeight * inputWidth; 
            cudaMalloc(&deviceInput_, inputSize * sizeof(float));
        }
    
        ~MyCalibrator() override
        {
            cudaFree(deviceInput_);
        }
    
        int getBatchSize() const noexcept override
        {
            return batch_size_;
        }
    
        bool getBatch(void* bindings[], const char* names[], int nbBindings) noexcept override
        {
            // Fetch next batch from DataLoader
            if (!dataloader_.nextBatch(hostData_))  // Fill hostData_ with batch
            {
                return false;  // No more data
            }
    
            // Copy batch to device
            size_t inputSize = hostData_.size() * sizeof(float);
            cudaMemcpy(deviceInput_, hostData_.data(), inputSize, cudaMemcpyHostToDevice);
    
            // Bind the device pointer to TensorRT
            bindings[0] = deviceInput_;
            return true;
        }
    
        const void* readCalibrationCache(size_t& length) noexcept override
        {
            calibrationCache_.clear();
            std::ifstream input(cacheFile_, std::ios::binary);
            if (input.good())
            {
                input.seekg(0, std::ios::end);
                size_t size = input.tellg();
                input.seekg(0, std::ios::beg);
                calibrationCache_.resize(size);
                input.read(reinterpret_cast<char*>(calibrationCache_.data()), size);
                length = size;
                return calibrationCache_.data();
            }
            length = 0;
            return nullptr;
        }
    
        void writeCalibrationCache(const void* cache, size_t length) noexcept override
        {
            std::ofstream output(cacheFile_, std::ios::binary);
            output.write(reinterpret_cast<const char*>(cache), length);
        }
    
    private:
        DataLoader& dataloader_;
        int batch_size_;
        void* deviceInput_{nullptr};
        std::vector<float> hostData_;
        std::vector<char> calibrationCache_;
        std::string cacheFile_;
    
        // Example shape params (should come from your DataLoader or network input)
        int inputChannels = 3;
        int inputHeight = 224;
        int inputWidth = 224;
    };
    


    Then:

    // Create builder and config
    nvinfer1::IBuilder* builder = nvinfer1::createInferBuilder(logger);
    nvinfer1::IBuilderConfig* config = builder->createBuilderConfig();
    
    // Enable INT8 mode
    config->setFlag(nvinfer1::BuilderFlag::kINT8);
    
    // Attach calibrator (implements nvinfer1::IInt8Calibrator)
    config->setInt8Calibrator(myCalibratorPointer);
    
    // Build engine
    nvinfer1::ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);
    

    ONNX route: Export from PyTorch, quantize, deploy with ONNX Runtime C++

    When to use this? You want portable quantized models supported across runtimes and hardware. The usual flow exports an ONNX graph, uses ONNX quantization tools to produce a static INT8 graph, then loads that graph in C++ using ONNX Runtime.

    Export and quantize (Python):

    import torch
    from onnxruntime.quantization import (
        quantize_static,
        CalibrationDataReader,
        QuantFormat,
        QuantType,
    )
    
    # Export model to ONNX (name the graph inputs/outputs so the data reader
    # and the C++ code below can refer to them)
    dummy = torch.randn(1, 3, 224, 224)
    torch.onnx.export(
        model.cpu(),
        dummy,
        "model.onnx",
        input_names=["input"],
        output_names=["output"],
        opset_version=13,
    )
    
    # Implement a CalibrationDataReader
    class MyDataReader(CalibrationDataReader):
        def __init__(self, dataloader):
            self.dataloader = dataloader
            self.iterator = iter(dataloader)
    
        def get_next(self):
            try:
                x, _ = next(self.iterator)
                return {"input": x.numpy()}
            except StopIteration:
                return None
    
    # Perform static quantization with representative data
    quantize_static(
        "model.onnx",
        "model_quant.onnx",
        MyDataReader(loader),
        quant_format=QuantFormat.QOperator,
        per_channel=True,
        activation_type=QuantType.QUInt8
    )
    



    Load and run quantized ONNX in C++ with ONNX Runtime (sketch):

    #include <onnxruntime_cxx_api.h>
    
    Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "ptq");
    
    Ort::SessionOptions opts;
    opts.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_ALL);
    
    Ort::Session session(env, "model_quant.onnx", opts);
    
    // Prepare an input Ort::Value from raw float data
    // (names match the input_names/output_names used during export)
    std::vector<int64_t> shape = {1, 3, 224, 224};
    std::vector<float> input_data(1 * 3 * 224 * 224, 0.0f);  // fill with preprocessed pixels
    
    Ort::MemoryInfo mem_info = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
    Ort::Value input_tensor = Ort::Value::CreateTensor<float>(
        mem_info, input_data.data(), input_data.size(), shape.data(), shape.size());
    
    const char* input_names[] = {"input"};
    const char* output_names[] = {"output"};
    
    auto output_tensors = session.Run(
        Ort::RunOptions{nullptr},
        input_names, &input_tensor, 1,
        output_names, 1);
    
    float* scores = output_tensors.front().GetTensorMutableData<float>();
    



    ONNX notes: The onnxruntime.quantization module also supports dynamic quantization. Static quantization with calibration usually yields higher accuracy for activations that need fixed per-tensor scales.

    Which calibrator and algorithm should you pick:

    • Entropy calibration (EntropyCalibrator2) reduces the KL divergence between the FP32 activation distribution and the quantized distribution. Suitable for many vision tasks.
    • MinMax style (IInt8MinMaxCalibrator) captures full activation range and can work well for NLP when tails matter.
    • Per-channel weight quantization often preserves accuracy on conv layers. Per tensor is simpler and faster.
    • Use representative calibration data that matches the expected runtime input distribution. A few hundred to a few thousand samples are commonly sufficient.
    • Choose cache reuse when calibration is expensive or when the build machine lacks the dataset. Use a cache file to skip repeating calibration.

    Quantization design choices that affect accuracy and performance:

    • Scale and zero point: Symmetric vs asymmetric affects dynamic range and integer math requirements.
    • Granularity: Per-channel weight scales usually yield better accuracy for conv layers.
    • Granularity for activations: per tensor is the norm for speed.
    • Mixed precision: Enable INT8 plus FP16 or FP32 fallback for operations that do not quantize well. Torch‑TensorRT and TensorRT support mixed precision via enabled_precisions.
    • Batch size and calibration batch size change observed activation ranges and therefore scale estimates.

    Deploying the quantized model and running inference:

    • TensorRT: Serialize the engine after build, then deserialize on the target host. The engine contains the calibrated scales when built with an INT8 calibrator.
    • Torch‑TensorRT: trt_mod can be saved and loaded using its serialization APIs or by serializing the underlying TensorRT engine for fast startup.
    • ONNX Runtime: Load model_quant.onnx in C++ and run using the Ort C++ API. The quantized graph includes int8 ops where applicable.

    Testing accuracy and performance before you ship:

    • Accuracy checks: Run the quantized model on a held-out validation set and compare top1/top5 to the FP32 baseline. Report degradation per class and layer if possible.
    • Latency and throughput: measure tail latency and sustained throughput under expected batch sizes, warm up the GPU, and measure with realistic input pipelines.
    • Memory and power: check GPU memory footprint and power draw for the quantized model. INT8 often reduces memory and increases throughput.
    • A/B testing: Run the quantized model alongside the FP32 baseline in a canary deployment and compare user-facing metrics.
    • If accuracy drops too much, try per-channel quantization, different calibrator algorithms, larger or more representative calibration datasets, or fall back to mixed precision for problematic subgraphs.

    Quick checklist for a safe PTQ rollout:

    • Export reproducible calibration cache and store it with your build artifacts.
    • Validate a range of inputs, including edge cases.
    • Benchmark on the actual target hardware under real load.
    • Automate calibration, build, and validation as part of your CI so you can track quality regressions.

    References and small pointers:

    • Use LibTorch DataLoader and Dataset for C++ preprocessing and batching. See PyTorch C++ frontend docs for details on Dataset and DataLoader APIs.
    • Torch‑TensorRT offers DataLoaderCalibrator and CacheCalibrator factories to simplify calibration integration.
    • If you already have a native TensorRT calibrator, set it directly on the compile spec or builder config.
    • Save calibration caches to avoid repeated long calibration builds when iterating.

    Start Building with $10 in Free API Credits Today!

    Inference delivers OpenAI-compatible serverless inference APIs for top open source LLM models, offering developers the highest performance at the lowest cost in the market. It supports standard request-response patterns and exposes the same predictable endpoints developers already know.

    Start building with ten dollars in free API credits and test state-of-the-art language models that balance cost efficiency with high performance.

    Batch Processing for Large-Scale Async AI Workloads: Fast and Efficient Throughput

    Inference provides specialized batch processing for large-scale async AI workloads. You can group many short requests into efficient batches to increase throughput per GPU and cut cost per token.

    The system handles queueing, dynamic batching, and retry logic so you do not waste cycles on tiny requests. For heavy batch jobs, tune batch size, max latency, and worker concurrency so that GPU utilization climbs while tail latency stays acceptable.

    Document Extraction for RAG Applications: Accurate Chunking and Embeddings

    Inference includes document extraction features tailored to retrieval augmented generation workflows. It splits documents into semantically coherent chunks, extracts metadata, runs OCR when needed, and generates embeddings for each chunk. The pipeline supports text normalization, sentence boundary aware chunking, and overlap control so context windows align with model capacities while preserving key facts for retrieval.

    Why Post Training Quantization Matters for LLM Inference: Cut Cost, Keep Accuracy

    Post-Training Quantization reduces model size and runtime compute by converting weights and activations from floating point to low-bit integer formats after training. PTQ with INT8 or INT4 formats reduces memory-bandwidth pressure and speeds up matrix multiply operations. You keep most of the model's behavior while lowering inference cost and raising throughput.

    PTQ Methods Explained: Static, Dynamic, and Hybrid Approaches

    Static post-training quantization calibrates scales and zero points with a representative dataset and quantizes both weights and activations. Dynamic quantization converts weights to low-bit formats and quantizes activations at runtime based on observed ranges, which simplifies calibration but can add runtime overhead. Hybrid or mixed precision keeps sensitive layers in higher precision and quantizes the rest to INT8 or INT4 to balance perf with quality.

    Quant Formats and Granularity: Per Channel, Per Tensor, Symmetric, Asymmetric

    Choose quantization granularity carefully. Per-channel quantization assigns independent scale factors per weight channel and often preserves accuracy for convolution or transformer linear layers. Per-tensor quantization uses one scale for a whole tensor and runs faster on some kernels.

    Symmetric quantization centers the range around zero and simplifies arithmetic, while asymmetric quantization uses a zero point to represent an offset. Scale factors, zero points, and the rounding strategy together define the quantization error.

    Calibration and Accuracy Recovery: Histograms, KL, and Advanced Rounding

    Calibration uses a small representative dataset to estimate activation ranges. Min max, percentile clipping, and KL divergence-based histogram methods all aim to choose scales that reduce quantization noise.

    Advanced techniques like AdaRound adjust the rounding of weights to minimize loss, and SmoothQuant redistributes scales between weights and activations to improve INT8 results. Use bias correction and per-channel rescales when simple calibration harms accuracy.

    Model Level Tricks: Group Wise, Block Wise, and Non-Uniform Quant

    Group-wise or block-wise quantization divides large matrices into smaller blocks and quantizes each block independently to reduce dynamic range issues. Non-uniform schemes like k-means quantization and lookup table quantization compress weights with clusters instead of linear scales. These approaches can mitigate quantization error at the cost of more complex dequantization logic and memory layout changes.
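
    A sketch of the block-wise idea: split each row of a weight matrix into fixed-size blocks and give every block its own symmetric int8 scale (NumPy, illustrative block size of 64):

    import numpy as np

    def blockwise_quant(w, block=64):
        # Symmetric int8 quantization with one scale per (row, block-of-columns)
        out, cols = w.shape
        assert cols % block == 0
        w_blocks = w.reshape(out, cols // block, block)
        scales = np.abs(w_blocks).max(axis=2, keepdims=True) / 127.0
        q = np.clip(np.round(w_blocks / scales), -127, 127).astype(np.int8)
        return q, scales  # store int8 values plus one float scale per block

    def blockwise_dequant(q, scales):
        return (q.astype(np.float32) * scales).reshape(q.shape[0], -1)

    w = np.random.randn(128, 512).astype(np.float32)
    q, scales = blockwise_quant(w, block=64)
    err = np.abs(w - blockwise_dequant(q, scales)).mean()
    print("blocks per row:", q.shape[1], " mean abs error:", float(err))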

    Runtime and Hardware Considerations: Kernels, Libraries, and Acceleration

    Performance depends on kernel support. INT8 GEMM on modern GPUs, optimized CPU paths in oneDNN, and specialized NPUs deliver significant gains when kernels support per-channel scales and fast dequantize paths.

    Use cuBLASLt, Triton custom kernels, or vendor runtimes for best throughput. Keep an eye on memory bandwidth and cache friendliness because quantized models trade math precision for smaller data movement.

    Integrating PTQ into Serverless Inference Pipelines: Convert, Store, Serve

    Convert models once during the deployment pipeline using quantization tools and store the quantized artifacts alongside float checkpoints. Design the serverless runtime to select the quantized model by request profile and to fall back to float32 when high fidelity is needed. Expose tracing information so clients can pick the level of quantization by use case, for instance, using INT8 for chat and float32 for critical scoring tasks.

    Validation, Monitoring, and Rollback: Safety Nets for Production

    Measure both offline metrics, like perplexity, and online metrics such as user quality signals and error rates. Run A/B tests and shadow traffic to detect any degradation introduced by quantization.

    Track latency, throughput, and tail percentiles to ensure expected cost gains. Keep rollback paths that swap in float models without service disruption if accuracy drops or rare failure modes appear.

    Practical Steps and Quick Wins: Where to Start with Limited Credits

    Start small by quantizing only linear layer weights to INT8 and leave layer norm and embedding layers in float. Calibrate with a few hundred to a few thousand representative examples.

    Compare inference quality with a set of downstream tasks and measure latency improvements. If you need more aggressive compression, test per-channel INT4 or group-wise quantization and evaluate advanced methods like AdaRound and SmoothQuant on a holdout.


