
    What is vLLM, and How Does It Achieve Fast Inference?

    Published on Mar 19, 2025

    Large language models are powerful tools for generating text that closely resembles human writing. Running them for inference, especially at scale, can be challenging: even optimized versions of popular models like GPT-3 take up many gigabytes of memory for a single instance. If you want to run these models locally or even on a cloud instance, you may need to worry about memory bottlenecks, long loading times, and higher-than-expected costs. As this article will show, if you want ultra-fast, cost-efficient LLM inference at scale without memory bottlenecks or performance trade-offs, you must look beyond traditional inference engines to cutting-edge solutions like vLLM.

    One such solution is AI inference APIs like those offered by Inference. These tools can help you achieve the low-latency, cost-effective LLM inference you need to deploy your next-generation large language model successfully.

    What is vLLM, and What Is the Core Idea Behind It?

    vLLM landing page

    vLLM, or virtualized large language models, is an open-source library designed to enhance the efficiency of large language model inference and serving. It introduces PagedAttention, an innovative attention algorithm that optimizes memory management by allowing non-contiguous storage of attention keys and values, significantly reducing memory waste. This optimization leads to state-of-the-art serving throughput, up to 24 times higher than traditional serving methods.

    vLLM supports continuous batching of incoming requests and seamless integration with popular Hugging Face models, making it a flexible and high-performance solution for deploying LLMs in various applications.

    Key Features of vLLM

    Let’s examine what distinguishes vLLM from the competition and why it is one of the most important tools in the broad landscape of modern machine learning.

    Multi-GPU Support

    For high-demand services like ChatGPT, where performance requirements are a must, it’s crucial that multi-GPU setups work efficiently and parallelize model computations well.

    Thanks to vLLM’s multi-GPU architecture, workloads are distributed effectively across devices, reducing bottlenecks and increasing the system’s overall throughput. vLLM supports multi-GPU setups natively, which simplifies this kind of optimization.
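    As a hedged sketch of what this looks like in practice (the model name and GPU count below are placeholders), vLLM's `tensor_parallel_size` argument shards a model across several GPUs on one machine:

```python
from vllm import LLM

# Hedged sketch: shard one model across 4 GPUs with tensor parallelism.
# The model name and GPU count are illustrative placeholders.
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    tensor_parallel_size=4,  # number of GPUs to split the model across
)
```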

    Continuous Batching

    Most traditional systems require fixed batch sizes. vLLM eliminates this archaic need and allows continuous batching. This enables dynamic task distribution, allowing for better resource management and efficiency.

    Continuous batching is especially useful in environments with fluctuating workloads. Because it continuously manages input streams, vLLM minimizes idle time, making it an almost necessary component for real-time AI applications.

    Speculative Decoding for Speed

    Chatbots and real-time text generators are incredibly latency-sensitive, and vLLM optimizes for this by generating potential future tokens preemptively and validating them in parallel. This speculative decoding approach reduces end-to-end inference time and is one of vLLM's many standout innovations.
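    As a rough sketch of how this is configured (parameter names have shifted between vLLM releases, and the model names here are placeholders), recent versions let you attach a small draft model through a speculative decoding configuration:

```python
from vllm import LLM, SamplingParams

# Hedged sketch: pair a large target model with a small draft model for
# speculative decoding. Exact argument names vary across vLLM releases
# (older versions took speculative_model= / num_speculative_tokens= directly).
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    speculative_config={
        "model": "meta-llama/Llama-3.2-1B-Instruct",  # draft model (placeholder)
        "num_speculative_tokens": 5,                   # tokens proposed per step
    },
)

outputs = llm.generate(["The capital of France is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```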

    Optimized Memory Usage with Embedding Layers

    Scaling LLMs requires excellent GPU memory efficiency. vLLM's embedding layers are highly optimized for memory efficiency, ensuring that the LLM utilizes GPU memory in a balanced and effective manner. This addresses the memory pressure that large models like those behind ChatGPT face, without sacrificing performance.
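    As a hedged illustration of how that memory budget is controlled in practice, vLLM exposes engine arguments such as `gpu_memory_utilization` and `max_model_len` (the values and model name below are placeholders):

```python
from vllm import LLM

# Hedged sketch: cap how much GPU memory the engine reserves for weights plus
# KV cache, and bound the maximum sequence length. Values are illustrative.
llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    gpu_memory_utilization=0.85,  # fraction of GPU memory vLLM may use
    max_model_len=8192,           # longest sequence the KV cache must hold
)
```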

    LLM Adapters

    vLLM supports the integration of LLM adapters (such as LoRA), allowing developers to fine-tune and customize LLMs without retraining whole models. Modular approaches like this are gaining popularity in AI and ML, mainly because they save time and resources and offer an easier path to integrating LLMs into specialized applications.
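    A minimal sketch of serving a LoRA adapter with vLLM (the base model, adapter name, and adapter path are placeholders):

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Hedged sketch: load a base model with LoRA support enabled, then attach an
# adapter per request. Model name and adapter path are placeholders.
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", enable_lora=True)

outputs = llm.generate(
    ["Translate to SQL: list all customers in Berlin"],
    SamplingParams(temperature=0.0, max_tokens=64),
    lora_request=LoRARequest("sql_adapter", 1, "/path/to/sql_lora_adapter"),
)
print(outputs[0].outputs[0].text)
```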

    Benefits of vLLM

    vLLM offers several benefits for LLM applications, including:

    Open-source

    vLLM is a freely available inference and serving engine that allows developers to access, modify, and contribute to its codebase.

    High Performance

    vLLM is one of the fastest inference servers, achieving up to 24 times higher throughput than traditional methods. So you can significantly reduce response times for your AI applications.

    Broad Model Support

    vLLM integrates seamlessly with a wide range of open-source models, such as the generative Transformer models in Hugging Face Transformers, including:

    • Llama 3.1
    • Llama 3
    • Mistral
    • Mixtral-8x7B
    • Qwen2
    • And more

    Easy Deployment

    vLLM is a user-friendly tool. Its architecture minimizes setup complexity, allowing you to get your models up and running quickly without requiring deep hardware optimization or memory management expertise.
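    As a quick, hedged example of that workflow (the model name is a placeholder, and you'll need a GPU with enough memory for it):

```python
# pip install vllm
from vllm import LLM, SamplingParams

# Hedged sketch of the basic offline-inference workflow; the model is a placeholder.
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

outputs = llm.generate(["Explain PagedAttention in one sentence."], params)
for output in outputs:
    print(output.outputs[0].text)
```

    For serving over HTTP, recent releases also ship an OpenAI-compatible server that can be started with a single command (for example, `vllm serve <model>`).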


    How Does vLLM Achieve Fast Inference?

    LLM model in action

    One of the biggest challenges in LLM inference is the high memory consumption required to process long sequences of tokens. Traditional inference mechanisms struggle to scale efficiently, often leading to excessive memory usage and reduced throughput. vLLM solves this problem with PagedAttention, an attention algorithm inspired by the classic idea of virtual memory and paging in operating systems. Instead of allocating all memory upfront, PagedAttention allocates memory in smaller, fixed-size blocks. This approach (illustrated by the toy sketch after the list below):

    • Reduces GPU Memory Overhead: Paged Attention avoids unnecessary allocations by using only the memory needed at any given moment.
    • Enables Larger Context Windows: Developers can work with longer sequences without worrying about memory constraints.
    • Improves Scalability: Multiple models or larger batch sizes can run simultaneously on the same hardware.
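    The following is a toy Python sketch of the paging idea only, not vLLM's actual implementation: KV-cache memory is divided into fixed-size blocks, and each sequence keeps a block table mapping its logical token positions to whatever physical blocks happen to be free.

```python
# Toy illustration of the paging idea behind PagedAttention -- not vLLM's
# real data structures. Block size and capacity below are arbitrary.
BLOCK_SIZE = 16  # tokens stored per physical KV-cache block


class ToyKVCacheManager:
    def __init__(self, num_physical_blocks: int):
        self.free_blocks = list(range(num_physical_blocks))
        self.block_tables: dict[int, list[int]] = {}  # seq_id -> physical block ids

    def append_token(self, seq_id: int, tokens_so_far: int) -> None:
        table = self.block_tables.setdefault(seq_id, [])
        # A new (possibly non-contiguous) block is grabbed only when needed.
        if tokens_so_far % BLOCK_SIZE == 0:
            table.append(self.free_blocks.pop())

    def free(self, seq_id: int) -> None:
        # Finished sequences return their blocks immediately for reuse.
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
```

    Because blocks are handed out and reclaimed on demand, memory is spent only on tokens that actually exist, which is what enables the larger context windows and batch sizes listed above.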

    Keep it Moving with Continuous Batching

    Traditional batching methods in LLM inference often fail to utilize GPU resources fully. Static batching requires waiting until a batch is filled before processing, leading to underutilization during periods of low activity. vLLM instead uses continuous batching, an approach that dynamically merges incoming requests into ongoing batches. This system, sketched conceptually after the list below, offers:

    • Higher Throughput: By continuously feeding the GPU with data, vLLM minimizes idle time and maximizes utilization.
    • Reduced Latency: Real-time applications benefit from faster response times, as requests no longer have to wait for a whole batch.
    • Support for Diverse Workloads: Continuous Batching adapts seamlessly to varying request sizes and arrival patterns, making it ideal for multi-tenant environments.
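    Here is a toy sketch of the scheduling idea (not vLLM's actual scheduler, and `model_step` is a hypothetical function): new requests join the running batch at every decoding step, and finished requests leave immediately.

```python
from collections import deque

# Toy illustration of continuous (iteration-level) batching -- not vLLM's
# scheduler. `model_step` is a hypothetical function that runs one decode
# step for every running request and returns the ones that finished.
def continuous_batching_loop(model_step, incoming: deque, max_batch_size: int = 32):
    running = []
    while incoming or running:
        # Admit new requests as soon as there is room, instead of waiting
        # for a full, fixed-size batch.
        while incoming and len(running) < max_batch_size:
            running.append(incoming.popleft())
        finished = model_step(running)  # one decoding iteration for the batch
        # Completed requests free their slots (and KV-cache blocks) right away.
        running = [r for r in running if r not in finished]
```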

    Optimized CUDA Kernels for Next-Level Performance

    Optimized CUDA kernels are tailored to perform low-level GPU operations with maximum efficiency. CUDA is NVIDIA's parallel computing platform for accelerating AI workloads, and vLLM takes this further by fine-tuning its kernels specifically for LLM inference.

    These optimizations include integrating FlashAttention and FlashInfer, resulting in faster end-to-end inference. The optimized kernels are designed to leverage the full potential of NVIDIA GPUs, such as the NVIDIA A100 and the NVIDIA H100, to ensure top-tier performance across hardware generations.
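    As a hedged example, recent vLLM releases let you choose which attention kernel backend to use via an environment variable; the accepted values depend on your vLLM version and GPU, and the model name below is a placeholder.

```python
import os

# Hedged sketch: select the attention kernel backend before importing vLLM.
# Valid values (e.g. FLASH_ATTN, FLASHINFER, XFORMERS) depend on the release.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")  # placeholder model
```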


    Why vLLM Is Becoming a Standard for Enhancing LLM Performance

    LLM models

    Large language models can cost thousands of dollars to train and deploy, so it’s no surprise that organizations are looking for ways to reduce the costs and compute resources required to run them. This has led to the growing adoption of vLLM, an inference engine specifically designed to improve the performance of large language models. vLLM is changing the game for organizations looking to deploy large language models by optimizing scalability, reducing latency, and lowering inference costs.

    The primary algorithm that came out of vLLM is PagedAttention. However, PagedAttention is not the only capability that vLLM provides. Additional performance optimizations that vLLM can offer include:

    • PyTorch compile/CUDA graphs: reduce per-step Python and kernel-launch overhead on the GPU.
    • Quantization: reduces the memory required to run models (see the sketch after this list).
    • Tensor parallelism: splits the processing work among multiple GPUs.
    • Speculative decoding: speeds up text generation by using a smaller model to propose tokens and the larger model to validate them.
    • FlashAttention: improves the efficiency of attention in transformer models.
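    A hedged usage sketch for one of these knobs, quantization (the checkpoint name is a placeholder for any AWQ-quantized model, and this can be combined with `tensor_parallel_size` as shown earlier):

```python
from vllm import LLM

# Hedged sketch: serve an AWQ-quantized checkpoint to cut memory use.
# The model name is a placeholder for any AWQ-quantized weights.
llm = LLM(
    model="TheBloke/Llama-2-13B-chat-AWQ",
    quantization="awq",
)
```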

    Aside from its optimization abilities, vLLM's flexible nature has also helped it grow in popularity. vLLM works with small and large language models and integrates with popular models and frameworks. Finally, its open-source nature allows for code transparency, customization, and faster bug fixes.

    LoRA and QLoRA: Resource-Efficient Fine-Tuning Techniques

    LoRA and QLoRA are resource-efficient fine-tuning techniques that help users optimize costs and compute resources. Instead of updating all 70 billion parameters of a large language model when fine-tuning for a specific task, LoRA freezes the base weights and trains only a small set of additional low-rank adapter parameters, typically well under 1% of the model's total size. QLoRA pushes the savings further by training those adapters on top of a quantized base model.

    Once fine-tuning is complete, only the adapter weights need to be kept for that task; at inference time they condition the unchanged pre-trained model, and several adapters can share a single base model. This approach preserves performance while significantly reducing the cost of fine-tuning large models.
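    A minimal, hedged sketch of what configuring LoRA looks like with the Hugging Face PEFT library (separate from vLLM itself; the base model and hyperparameters are illustrative):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Hedged sketch: wrap a frozen base model with small low-rank adapters.
# Model name, rank, and target modules are illustrative placeholders.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
lora_config = LoraConfig(
    r=16,                                 # low-rank dimension of the adapters
    lora_alpha=32,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # prints the small trainable fraction
```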

    vLLM Outperforms Other Inference Engines Across Multiple Metrics

    The official vLLM benchmark results (as reported by the vLLM team) compare vLLM v0.6.0 against several alternative serving engines, including TensorRT-LLM r24.07, SGLang v0.3.0, and LMDeploy v0.6.0a0. These benchmarks were conducted across various models and datasets to determine how well each engine performs under different conditions.

    Benchmark Setup

    The following configurations were used for benchmarking vLLM:

    • Models: vLLM's performance was tested on two popular models, Llama 3 8B and Llama 3 70B.
    • Hardware: The official vLLM benchmarks were carried out using high-end GPUs like the NVIDIA A100 and NVIDIA H100.
    • Datasets: The tests were performed using three datasets:
      • ShareGPT: Comprising 500 prompts randomly sampled from the ShareGPT dataset.
      • Prefill-heavy Dataset: A synthetic dataset using the sonnet dataset with an average of 462 input tokens and 16 output tokens.
      • Decode-heavy Dataset: Another synthetic dataset generated from the sonnet dataset, containing similar input tokens but with a significantly higher average of 256 output tokens.
    • Metrics: The benchmarks evaluated the engines based on:
      • Throughput: The number of requests served per second (QPS) under simultaneous request loads.

    vLLM Performance Results

    For throughput, vLLM showed the highest performance on NVIDIA H100 GPUs for both the Llama 3 8B and Llama 3 70B models compared to the other serving engines.

    vLLM outperformed all alternatives on the ShareGPT and Decode-heavy datasets. While TensorRT-LLM is a strong player in this space, especially with its hardware-optimized inference pipeline, vLLM showed superior throughput and reduced token-generation times. The SGLang and LMDeploy engines also performed decently in some instances, but vLLM surpassed them in both throughput and processing efficiency for more complex queries.

    Start Building with $10 in Free API Credits Today!

    Inference offers OpenAI-compatible serverless inference APIs for top open-source LLM models. The service provides developers with the highest performance at the lowest cost.

    Beyond standard inference, Inference provides specialized batch processing for large-scale async AI workloads and document extraction capabilities designed explicitly for RAG applications.

    A Quick Explanation of vLLM

    vLLM is an open-source library that optimizes the inference of large language models (LLMs) on cloud environments. Developed by researchers at UC Berkeley, vLLM has been shown to deliver performance gains of up to 30% over existing inference libraries such as Hugging Face’s Optimum and Triton.

    vLLM's speed and efficiency make it ideal for serving LLMs in production, particularly for applications that require low-latency responses to user queries, such as chatbots and RAG.


    START BUILDING TODAY

    15 minutes could save you 50% or more on compute.