
    Is TensorRT the Best LLM Inference Engine? A Head-To-Head Comparison

    Published on Mar 15, 2025

    As deep learning models grow more complex, getting the best possible performance during inference is crucial. Slowdowns during inference waste time and resources, often negating the benefits of using more advanced models in the first place. This article explains how TensorRT can help you reach goals like finding the fastest, most efficient LLM inference engine, one that minimizes cost, maximizes throughput, and scales seamlessly for real-world AI applications. Understanding AI Inference vs Training is essential to optimizing performance and making informed decisions when deploying deep learning models.

    One way to think about inference is as running a race. You train for weeks or months to improve your performance, but when it comes time to run the race, you want to complete it as quickly as possible. With an AI inference API, this “race” is the execution of a trained model, which occurs after a user prompts the system for information. TensorRT is designed to make this process as fast as possible.

    What is NVIDIA TensorRT, and How Does It Perform LLM Optimization?


    NVIDIA TensorRT is a high-performance deep learning inference library optimized for NVIDIA GPUs. It enhances LLM (Large Language Model) performance by leveraging techniques like quantization, kernel fusion, and efficient memory management. Specifically, it optimizes transformer-based architectures, improving speed, efficiency, and scalability.


    Weight and Activation Precision Calibration

    During training, parameters and activations are stored in FP32 (32-bit floating point) precision. Converting them to FP16 or INT8 reduces both latency and model size.


    When converting to FP16 (lower precision), some weights are clamped due to overflow, because the dynamic range of FP16 is smaller than that of FP32. Experimentally, this does not affect accuracy significantly. How do we justify that? Weights and activation values are generally resilient to noise: during training, the model learns to preserve the features necessary for inference.

    Managing Precision Loss in INT8 Model Conversion

    Discarding unnecessary information is built into training, so when we convert the model to lower precision, we assume it mostly discards noise. However, this same technique of clipping overflowing weights does not work when converting to INT8 precision.

    Because the INT8 range is so small, [-127 to +127], most weights would overflow and be distorted at this precision, causing a significant drop in the model's accuracy. Instead, the weights are mapped into INT8 precision using scaling and bias terms.
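
    To make the scaling idea concrete, here is a minimal NumPy sketch of per-tensor symmetric INT8 quantization. The quantize_int8 and dequantize_int8 helpers are illustrative stand-ins, not TensorRT's calibration API:

    ```python
    import numpy as np

    def quantize_int8(weights):
        # Per-tensor symmetric quantization: choose a scale so the largest
        # absolute weight maps to 127, then round and clip into [-127, 127].
        scale = float(np.abs(weights).max()) / 127.0
        q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
        return q, scale

    def dequantize_int8(q, scale):
        # Recover an FP32 approximation of the original weights.
        return q.astype(np.float32) * scale

    weights = np.random.randn(4, 4).astype(np.float32)
    q, scale = quantize_int8(weights)
    print("max abs error:", np.abs(weights - dequantize_int8(q, scale)).max())
    ```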

    Layer and Tensor Fusion

    While executing a graph, any deep learning framework repeatedly launches kernels and moves tensor data between layers. To reduce this overhead, TensorRT uses layer and tensor fusion, fusing nodes in a kernel vertically, horizontally, or both, which optimizes GPU memory and bandwidth by cutting the cost of reading and writing tensor data for each layer.

    Here is a simple analogy: instead of making three trips to the market to buy three items, you make a single trip and buy all three. Similarly, TensorRT recognizes layers with the same input and filter sizes but different weights and combines them into a single 1x1 CBR (convolution + bias + ReLU) layer.

    Figure: unoptimized network vs. TensorRT-optimized network
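
    For intuition only, here is a toy NumPy sketch of why fusion helps: the unfused version computes the result in three separate steps with full-sized intermediates, while the fused version expresses the same math in one step. TensorRT performs the real fusion at the CUDA-kernel level, so treat this purely as an analogy.

    ```python
    import numpy as np

    x = np.random.randn(1_000_000).astype(np.float32)
    scale, bias = np.float32(0.5), np.float32(0.1)

    def unfused(x):
        # Three separate steps, each producing a full-sized intermediate
        # tensor -- analogous to launching three GPU kernels that each
        # read and write global memory.
        y = x * scale
        y = y + bias
        return np.maximum(y, 0.0)

    def fused(x):
        # The same math expressed as a single step -- analogous to one
        # fused kernel that avoids round-trips through memory between layers.
        return np.maximum(x * scale + bias, 0.0)

    assert np.allclose(unfused(x), fused(x))
    ```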

    Kernel Auto Tuning

    While building an engine, TensorRT performs kernel-specific optimizations: it selects the best kernels, algorithms, and optimal batch size for the target GPU platform. For example, there are multiple ways to perform a convolution, but which is fastest on the selected platform? TensorRT picks that automatically.
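
    As a rough illustration of the idea (not TensorRT's actual tactic selection), the sketch below times two candidate convolution algorithms from SciPy and keeps the faster one; pick_fastest is a hypothetical helper:

    ```python
    import time
    import numpy as np
    from scipy.signal import convolve2d, fftconvolve

    x = np.random.randn(256, 256).astype(np.float32)
    k = np.random.randn(5, 5).astype(np.float32)

    # Candidate "tactics": different algorithms for the same convolution.
    candidates = {
        "direct": lambda: convolve2d(x, k, mode="same"),
        "fft": lambda: fftconvolve(x, k, mode="same"),
    }

    def pick_fastest(candidates, repeats=5):
        # Time each candidate and return the name of the fastest one.
        timings = {}
        for name, fn in candidates.items():
            start = time.perf_counter()
            for _ in range(repeats):
                fn()
            timings[name] = (time.perf_counter() - start) / repeats
        return min(timings, key=timings.get), timings

    best, timings = pick_fastest(candidates)
    print("fastest tactic:", best, timings)
    ```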

    Dynamic Tensor Memory

    TensorRT improves memory reuse by allocating memory to a tensor only for the duration of its usage. It helps reduce memory footprints and avoid allocation overhead for fast and efficient execution.

    Multiple Stream Execution

    TensorRT is designed to process multiple input streams in parallel, building on NVIDIA's CUDA streams.
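
    Below is a minimal PyTorch sketch of the underlying mechanism, assuming a CUDA-capable GPU: two independent batches are enqueued on separate CUDA streams so their work can overlap on the device.

    ```python
    import torch

    assert torch.cuda.is_available()
    model = torch.nn.Linear(1024, 1024).cuda().eval()
    a = torch.randn(64, 1024, device="cuda")
    b = torch.randn(64, 1024, device="cuda")

    s1, s2 = torch.cuda.Stream(), torch.cuda.Stream()
    with torch.no_grad():
        # Work submitted on different streams may execute concurrently.
        with torch.cuda.stream(s1):
            out_a = model(a)
        with torch.cuda.stream(s2):
            out_b = model(b)
    torch.cuda.synchronize()  # wait for both streams to finish
    print(out_a.shape, out_b.shape)
    ```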


    Is TensorRT the Best LLM Inference Engine? TensorRT vs vLLM vs LMDeploy vs MLC-LLM


    TensorRT-LLM is another inference engine that accelerates and optimizes inference performance for the latest LLMs on NVIDIA GPUs. LLMs are compiled into a TensorRT engine and then deployed with a Triton server to leverage inference optimizations such as in-flight batching (which reduces wait time and allows higher GPU utilization), paged KV caching, multi-GPU multi-node inference, and FP8 support.

    Usage

    We will compare execution time, ROUGE scores, latency, and throughput across the Hugging Face (HF), TensorRT, and TensorRT INT8 (quantized) models.

    You need to install the NVIDIA Container Toolkit on your Linux system, initialize Git LFS (to download HF models), and install the necessary packages as follows:

    ```bash
    # Add the NVIDIA Container Toolkit apt repository and install its key
    !curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
      && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
        sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
        sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
    !apt-get update
    # Get the TensorRT-LLM examples and install TensorRT-LLM with its dependencies
    !git clone https://github.com/NVIDIA/TensorRT-LLM/
    !apt-get update && apt-get -y install python3.10 python3-pip openmpi-bin libopenmpi-dev
    !pip3 install tensorrt_llm -U --pre --extra-index-url https://pypi.nvidia.com
    !pip install -r TensorRT-LLM/examples/phi/requirements.txt
    !pip install flash_attn pytest
    # Install Git LFS so the Hugging Face model weights can be downloaded
    !curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | bash
    !apt-get install git-lfs
    ```

    Now Retrieve The Model Weights

    ```bash
    PHI_PATH="TensorRT-LLM/examples/phi"
    !rm -rf $PHI_PATH/7B
    !mkdir -p $PHI_PATH/7B && git clone https://huggingface.co/microsoft/Phi-3-small-128k-instruct $PHI_PATH/7B
    ```

    Convert the model into the TensorRT-LLM checkpoint format and build the TensorRT-LLM engine from the checkpoint.

    ```bash
    !python3 $PHI_PATH/convert_checkpoint.py --model_dir $PHI_PATH/7B/ \
                    --dtype bfloat16 \
                    --output_dir $PHI_PATH/7B/trt_ckpt/bf16/1-gpu/
    # Build the TensorRT-LLM engine from the checkpoint
    !trtllm-build --checkpoint_dir $PHI_PATH/7B/trt_ckpt/bf16/1-gpu/ \
                    --gemm_plugin bfloat16 \
                    --output_dir $PHI_PATH/7B/trt_engines/bf16/1-gpu/
    ```

    Next, apply INT8 weight-only quantization to the HF model, convert the checkpoint to the TensorRT-LLM format, and build the engine.

    ```bash
    !python3 $PHI_PATH/convert_checkpoint.py --model_dir $PHI_PATH/7B \
                    --dtype bfloat16 \
                    --use_weight_only \
                    --output_dir $PHI_PATH/7B/trt_ckpt/int8_weight_only/1-gpu/
    !trtllm-build --checkpoint_dir $PHI_PATH/7B/trt_ckpt/int8_weight_only/1-gpu/ \
                    --gemm_plugin bfloat16 \
                    --output_dir $PHI_PATH/7B/trt_engines/int8_weight_only/1-gpu/
    ```

    Test the Base Phi-3 and the Two TensorRT Models on the Summarization Task

    ```bash
    %%capture phi_hf_results
    # Hugging Face baseline
    !time python3 $PHI_PATH/../summarize.py --test_hf \
                           --hf_model_dir $PHI_PATH/7B/ \
                           --data_type bf16 \
                           --engine_dir $PHI_PATH/7B/trt_engines/bf16/1-gpu/
    %%capture phi_trt_results
    # TensorRT-LLM
    !time python3 $PHI_PATH/../summarize.py --test_trt_llm \
                           --hf_model_dir $PHI_PATH/7B/ \
                           --data_type bf16 \
                           --engine_dir $PHI_PATH/7B/trt_engines/bf16/1-gpu/
    %%capture phi_int8_results
    # TensorRT-LLM (INT8 weight-only)
    !time python3 $PHI_PATH/../summarize.py --test_trt_llm \
                           --hf_model_dir $PHI_PATH/7B/ \
                           --data_type bf16 \
                           --engine_dir $PHI_PATH/7B/trt_engines/int8_weight_only/1-gpu/
    ```

    After capturing the results, you can parse the output and plot it to compare execution time, ROUGE scores, latency, and throughput across all models.
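
    For example, here is a rough sketch of that parsing step. The regexes assume the summarize.py logs contain lines such as "rouge1 : ..." and "total latency: ...", so adjust them to the actual output format.

    ```python
    import re
    import matplotlib.pyplot as plt

    # Text captured by the %%capture cells above.
    captured = {
        "HF": phi_hf_results.stdout,
        "TensorRT-LLM": phi_trt_results.stdout,
        "TensorRT-LLM INT8": phi_int8_results.stdout,
    }

    def extract(text, pattern):
        # Return the first numeric match for the pattern, or None if absent.
        m = re.search(pattern, text)
        return float(m.group(1)) if m else None

    rouge1 = {k: extract(v, r"rouge1\s*:\s*([\d.]+)") for k, v in captured.items()}
    latency = {k: extract(v, r"total latency:\s*([\d.]+)") for k, v in captured.items()}

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    ax1.bar(rouge1.keys(), [v or 0 for v in rouge1.values()])
    ax1.set_title("ROUGE-1")
    ax2.bar(latency.keys(), [v or 0 for v in latency.values()])
    ax2.set_title("Total latency (s)")
    plt.tight_layout()
    plt.show()
    ```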

    vLLM: Fast Inference and Serving for LLMs

    vLLM offers LLM inference and serving with SOTA throughput, PagedAttention, continuous batching, quantization (GPTQ, AWQ, FP8), and optimized CUDA kernels.

    Usage

    Let’s evaluate the throughput and latency of microsoft/Phi-3-mini-4k-instruct. Start by setting up dependencies and importing libraries.

    ```python
    !pip install -q vllm
    !git clone https://github.com/vllm-project/vllm.git
    !pip install -q datasets
    !pip install transformers scipy

    from vllm import LLM, SamplingParams
    from datasets import load_dataset
    import time
    from tqdm import tqdm
    from transformers import AutoTokenizer
    ```

    Let’s Load The Model and Generate its Outputs on a Small Slice of The Dataset.

    ```python
    dataset = load_dataset("akemiH/MedQA-Reason", split="train").select(range(10))
    prompts = []
    for sample in dataset:
        # Each sample is a dict; pass the question text as the prompt
        # (assumes the dataset exposes a "question" column; adjust if needed).
        prompts.append(sample["question"])
    sampling_params = SamplingParams(max_tokens=524)
    llm = LLM(model="microsoft/Phi-3-mini-4k-instruct", trust_remote_code=True)

    def generate_with_time(prompt):
        start = time.time()
        outputs = llm.generate(prompt, sampling_params)
        taken = time.time() - start
        generated_text = outputs[0].outputs[0].text
        return generated_text, taken

    generated_text = []
    time_taken = 0
    for sample in tqdm(prompts):
        text, taken = generate_with_time(sample)
        time_taken += taken
        generated_text.append(text)

    # Tokenize the outputs and calculate the throughput
    tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
    token = 0
    for sample in generated_text:
        tokens = tokenizer(sample)
        tok = len(tokens.input_ids)
        token += tok
    print(token)
    print("tok/s", token // time_taken)
    ```

    Let’s benchmark the model’s performance through vLLM on the ShareGPT dataset.

    ```bash
    !wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
    %cd vllm
    !python benchmarks/benchmark_throughput.py --backend vllm --dataset ../ShareGPT_V3_unfiltered_cleaned_split.json --model microsoft/Phi-3-mini-4k-instruct --tokenizer microsoft/Phi-3-mini-4k-instruct --num-prompts=1000
    ```

    LMDeploy: Efficient Inference and Deployment for LLMs

    This package also allows compressing, deploying, and serving LLMs while offering:

    • Efficient inference (persistent batching, blocked KV cache, dynamic split & fuse, tensor parallelism, and high-performance CUDA kernels)
    • Effective quantization (4-bit inference performance is 2.4x higher than FP16)
    • Effortless distribution server (deployment of multi-model services across multiple machines and cards)
    • Interactive inference mode (remembers dialogue history and avoids repetitive processing of historical sessions)

    It also supports profiling token latency and throughput, request throughput, API server performance, and Triton Inference Server performance.

    Usage

    Install dependencies and import packages.

    ```python
    !pip install -q lmdeploy
    !pip install nest_asyncio
    import nest_asyncio
    nest_asyncio.apply()
    !git clone --depth=1 https://github.com/InternLM/lmdeploy
    %cd lmdeploy/benchmark
    ```

    LMDeploy has developed two inference engines: TurboMind and PyTorch.

    Let’s profile the PyTorch engine on microsoft/Phi-3-mini-128k-instruct.

    ```bash
    !python3 profile_generation.py microsoft/Phi-3-mini-128k-instruct --backend pytorch
    ```

    It profiles the engine over multiple rounds and reports the token latency & throughput for each round.

    MLC-LLM: A High-Performance Deployment and Inference Engine

    MLC-LLM offers a high-performance deployment and inference engine called MLCEngine.

    Usage

    Let’s install the dependencies: set up a conda environment, install the MLC-LLM packages, then clone the git repository and configure the build.

    ```bash
    # Option 1: install the prebuilt nightly wheels into an existing environment
    conda activate your-environment
    python -m pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly-cu121 mlc-ai-nightly-cu121

    # Option 2: build from source in a fresh environment
    conda env remove -n mlc-chat-venv
    conda create -n mlc-chat-venv -c conda-forge \
        "cmake>=3.24" \
        rust \
        git \
        python=3.11
    conda activate mlc-chat-venv
    git clone --recursive https://github.com/mlc-ai/mlc-llm.git && cd mlc-llm/
    mkdir -p build && cd build
    # Generate build/config.cmake; to enable FlashInfer, set the following
    # option in that file before building: set(USE_FLASHINFER ON)
    python ../cmake/gen_cmake_config.py
    cmake .. && cmake --build . --parallel $(nproc) && cd ..
    # Install the Python package from the source checkout
    conda activate your-own-env
    cd mlc-llm/python
    pip install -e .
    ```

    We need to convert the model weights into MLC format to run a model with MLC LLM. Download the HF model with Git LFS, then convert the weights.

    ```bash
    mlc_llm convert_weight ./dist/models/Phi-3-small-128k-instruct/ \
        --quantization q0f16 \
        --model-type "phi3" \
        -o ./dist/Phi-3-small-128k-instruct-q0f16-MLC
    ```

    Now load your MLC-format model into the MLC engine.

    ```python
    from mlc_llm import MLCEngine
    import time
    from transformers import AutoTokenizer

    # Create the engine from a prebuilt MLC-format model
    model = "HF://mlc-ai/Phi-3-mini-128k-instruct-q0f16-MLC"
    engine = MLCEngine(model)

    # Calculate throughput for a single chat completion
    start = time.time()
    response = engine.chat.completions.create(
        messages=[{"role": "user", "content": "What is Machine Learning?"}],
        model=model,
        stream=False,
    )
    taken = time.time() - start

    # Count the generated tokens with the matching HF tokenizer
    tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-128k-instruct")
    num_tokens = len(tokenizer(response.choices[0].message.content).input_ids)
    print("tok/s", num_tokens // taken)
    ```

    Summary

    The TensorRT INT8 model outperforms both the HF model and regular TensorRT in inference speed, while the regular TensorRT model performs best on the summarization task, with the highest ROUGE score among the three. LMDeploy delivers up to 1.8x higher request throughput than vLLM on an A100.


    Start Building with $10 in Free API Credits Today!

    Inference delivers OpenAI-compatible serverless inference APIs for top open-source LLM models, offering developers the highest performance at the lowest cost in the market. Beyond standard inference, Inference provides specialized batch processing for large-scale async AI workloads and document extraction capabilities designed explicitly for RAG applications.

    Start building with $10 in free API credits and experience state-of-the-art language models that balance cost-efficiency with high performance.


    START BUILDING TODAY

    15 minutes could save you 50% or more on compute.