What Is the SGLang Inference Engine, and How Does It Stack Up?
Published on Mar 20, 2025
Running large language models to get predictions or embeddings for downstream tasks is like driving a car fast: you don't want a speeding ticket, and a smooth ride matters as much as raw speed. Inference engines are the smooth asphalt you want to drive your LLM on. SGLang, short for Structured Generation Language, is an open-source framework designed to help LLMs run faster while consuming fewer resources. This article unpacks what SGLang is, how it works, and how you can use it to build a fast, efficient LLM inference stack for your AI applications. Inference's AI inference APIs are a valuable tool for achieving fast, efficient, and cost-effective LLM inference while integrating seamlessly with your existing AI stack.
What is SGLang and Its Key Features?

SGLang is an open-source inference engine designed to optimize large language model deployments. It improves performance during the inference phase of LLM applications to reduce costs and enhance reliability.
SGLang accomplishes this by streamlining resource management for processing requests to achieve substantially higher throughput than many competitive solutions.
Why are Inference Engines Like SGLang Needed?
Organizations face significant challenges when deploying LLMs in today’s technology landscape. The primary issues include:
- Managing the enormous computational demands required to process high volumes of data
- Achieving low latency
- Ensuring an optimal balance between CPU-intensive tasks (such as scheduling and memory allocation) and GPU-intensive computations
Repeatedly processing similar inputs compounds these inefficiencies, leading to redundant computations that slow down overall performance. Generating structured outputs like JSON or XML in real time introduces additional delays, making it difficult for applications to deliver fast, reliable, cost-effective performance at scale.
How Does SGLang Optimize Resource Usage?
SGLang optimizes CPU and GPU resources during inference, achieving significantly higher throughput than many competitive solutions. Its design reduces redundant computations and enhances overall efficiency, enabling organizations to better manage the complexities associated with LLM deployment.
How Does RadixAttention Work?
RadixAttention is central to SGLang. It reuses shared prompt prefixes across multiple requests, so the key-value (KV) cache for a common prefix is computed once and served from cache thereafter. This minimizes repeated processing of similar input sequences and improves throughput.
The technique is advantageous in conversational interfaces or retrieval-augmented generation applications, where similar prompts are frequently processed. By eliminating redundant computations, the system ensures that resources are used more efficiently, contributing to faster processing times and more responsive applications.
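To make the idea concrete, here is a small, purely illustrative Python sketch of prefix reuse: a trie over token IDs records which prefixes already have cached KV entries, so a new request only needs to compute its unmatched suffix. The class name and token values are hypothetical and this is not SGLang's actual implementation, which manages real KV tensors in GPU memory.

```python
# Toy illustration of prefix reuse in the spirit of RadixAttention.
# Not SGLang's implementation; PrefixCache and the token IDs are hypothetical.

class PrefixCache:
    """Maps token prefixes to (placeholder) KV-cache entries via a trie."""

    def __init__(self):
        self.root = {}  # token_id -> {"kv": ..., "children": {...}}

    def match(self, tokens):
        """Return how many leading tokens already have cached KV entries."""
        node, matched = self.root, 0
        for tok in tokens:
            if tok not in node:
                break
            matched += 1
            node = node[tok]["children"]
        return matched

    def insert(self, tokens):
        """Record (placeholder) KV entries for every prefix of `tokens`."""
        node = self.root
        for tok in tokens:
            entry = node.setdefault(tok, {"kv": f"kv({tok})", "children": {}})
            node = entry["children"]


cache = PrefixCache()
system_prompt = [101, 7, 42, 9]           # shared system/instruction tokens
request_a = system_prompt + [55, 66]      # first user turn
request_b = system_prompt + [77, 88, 99]  # second user turn, same prefix

for req in (request_a, request_b):
    reused = cache.match(req)
    print(f"reused {reused} cached tokens, computing {len(req) - reused} new ones")
    cache.insert(req)
```

The second request reuses the four system-prompt tokens instead of recomputing them, which is the effect RadixAttention achieves at the KV-cache level.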
What About SGLang's Batch Scheduler?
SGLang also features a zero-overhead batch scheduler. Earlier inference systems often suffered from significant CPU overhead due to tasks like:
- Batch scheduling
- Memory allocation
- Prompt preprocessing
In many cases, these operations result in idle periods for the GPU, which in turn hampers overall performance.
SGLang Optimizes GPU Utilization for Faster Inference
SGLang addresses this bottleneck by overlapping CPU scheduling with ongoing GPU computations. The scheduler keeps the GPUs continuously engaged by running one batch ahead and preparing all necessary metadata for the next batch.
Profiling has shown that this design reduces idle time and achieves measurable speed improvements, especially in configurations that involve smaller models and extensive tensor parallelism.
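The sketch below illustrates the "run one batch ahead" idea with plain Python threads: while the current batch executes (a stand-in for the GPU forward pass), the metadata for the next batch is prepared on the CPU. The function names and timings are invented for illustration and do not reflect SGLang's internal scheduler.

```python
# Toy sketch of overlapping CPU-side batch preparation with "GPU" execution.
# All names, sleeps, and data are illustrative, not SGLang internals.
import threading
import time

def prepare_metadata(batch_id):
    time.sleep(0.02)               # CPU work: batching, memory planning, etc.
    return {"batch": batch_id}

def execute_on_gpu(metadata):
    time.sleep(0.05)               # stands in for the GPU forward pass

def run(num_batches=5):
    next_meta = prepare_metadata(0)          # prime the pipeline
    for i in range(num_batches):
        current = next_meta
        holder = {}

        def prep_next(idx=i + 1, out=holder):
            out["meta"] = prepare_metadata(idx)

        worker = threading.Thread(target=prep_next)
        worker.start()               # CPU prepares batch i+1...
        execute_on_gpu(current)      # ...while batch i "runs on the GPU"
        worker.join()
        next_meta = holder["meta"]
        print(f"finished batch {current['batch']}")

run()
```

Because preparation for batch i+1 overlaps with execution of batch i, the "GPU" never sits idle waiting for scheduling work, which is the effect the zero-overhead scheduler is after.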
How Does SGLang Balance Loads Efficiently?
SGLang incorporates a cache-aware load balancer that departs from conventional load-balancing methods such as round-robin scheduling. Traditional techniques often ignore the state of the key-value (KV) cache, leading to inefficient resource use.
SGLang’s Load Balancer Boosts Efficiency
In contrast, SGLang’s load balancer predicts the cache hit rates of different workers and directs incoming requests to those with the highest likelihood of a cache hit. This targeted routing increases throughput and enhances cache utilization.
The mechanism relies on an approximate radix tree that reflects each worker's current cache state and is updated lazily to keep overhead minimal. The load balancer is implemented in Rust for high concurrency, making it well-suited to distributed, multi-node environments.
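A rough Python sketch of cache-aware routing is shown below: each worker keeps an approximate view of its cached prefixes, and a request is sent to the worker with the longest matching prefix, falling back to the lighter-loaded one. This is only a toy model of the idea; the real router is written in Rust and tracks an approximate radix tree per worker.

```python
# Toy sketch of cache-aware routing. Worker, route, and the token lists are
# hypothetical; SGLang's router is a separate Rust component.

def shared_prefix_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

class Worker:
    def __init__(self, name):
        self.name = name
        self.cached_prefixes = []   # approximate view of this worker's KV cache
        self.load = 0

def route(request_tokens, workers):
    def score(w):
        best = max((shared_prefix_len(request_tokens, p) for p in w.cached_prefixes),
                   default=0)
        return (best, -w.load)      # prefer likely cache hits, then lighter load
    chosen = max(workers, key=score)
    chosen.cached_prefixes.append(list(request_tokens))  # lazily update the view
    chosen.load += 1
    return chosen

workers = [Worker("w0"), Worker("w1")]
system = [1, 2, 3, 4]
print(route(system + [9], workers).name)    # no cache anywhere -> falls back to load
print(route(system + [10], workers).name)   # routed to the worker that now caches the prefix
```

Routing by predicted cache hit rather than round-robin keeps requests with shared prefixes on the same worker, which is where the throughput gain comes from.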
What About SGLang's Data Parallelism Attention?
SGLang supports data parallelism attention, a strategy particularly tailored for DeepSeek models. While many modern models use tensor parallelism, which can lead to duplicated KV cache storage when scaling across multiple GPUs, SGLang employs a different method for models utilizing multi-head latent attention.
In this approach, individual data-parallel workers independently handle different kinds of batches, such as:
- Prefill
- Decode
- Idle
The attention-processed data is then aggregated across workers before passing through subsequent layers, such as a mixture-of-experts layer, and later redistributed.
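The toy NumPy simulation below mirrors that data flow: each data-parallel worker applies attention to its own batch, the outputs are concatenated across workers before a shared layer standing in for the mixture-of-experts block, and the result is split back per worker. The shapes, the "attention" function, and the shared layer are placeholders, not DeepSeek or SGLang internals.

```python
# Toy simulation of data-parallel attention followed by a shared layer.
# Everything here is illustrative; real workers run on separate GPUs/processes.
import numpy as np

hidden = 8

def attention(worker_batch):
    # Placeholder for multi-head latent attention on this worker's batch.
    return worker_batch * 0.5

def shared_moe_layer(x):
    # Placeholder for the shared mixture-of-experts block.
    return x @ np.eye(hidden)

worker_batches = [np.random.randn(4, hidden),   # e.g. a prefill batch
                  np.random.randn(2, hidden),   # e.g. a decode batch
                  np.random.randn(3, hidden)]

attn_out = [attention(b) for b in worker_batches]            # independent per worker
gathered = np.concatenate(attn_out, axis=0)                  # aggregate across workers
moe_out = shared_moe_layer(gathered)                         # shared downstream layer
splits = np.split(moe_out, np.cumsum([len(b) for b in worker_batches])[:-1])

print([s.shape for s in splits])   # redistributed back to per-worker batch sizes
```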
How Efficient Is SGLang at Generating Structured Outputs?
SGLang excels in the efficient generation of structured outputs. Many inference systems struggle with the real-time decoding of formats like JSON, which can be a critical requirement in many applications. SGLang addresses this by integrating a specialized grammar backend known as xgrammar.
This integration streamlines the decoding process, allowing the system to generate structured outputs up to ten times faster than other open-source alternatives. This capability is especially valuable when rapidly producing machine-readable data, which is essential for downstream processing or interactive applications.
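As a rough illustration, the snippet below asks a locally running SGLang server for JSON through its OpenAI-compatible endpoint. The URL, port, model name, and the use of the OpenAI-style response_format field are assumptions for this example; check the SGLang documentation for the exact constrained-decoding options your version exposes.

```python
# Hedged sketch: requesting JSON output from a local SGLang server via its
# OpenAI-compatible API. Base URL, port, and model name are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",   # whatever model the server was launched with
    messages=[{
        "role": "user",
        "content": "Return the capital of France as JSON with a 'capital' key.",
    }],
    response_format={"type": "json_object"},     # OpenAI-style JSON mode
    temperature=0,
)
print(resp.choices[0].message.content)
```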
How Are Companies Using SGLang?
Several high-profile companies have recognized SGLang’s practical benefits. For example, ByteDance channels a significant portion of its internal NLP pipelines through this engine, processing petabytes of data daily.
Similarly, xAI has reported a notable reduction in serving expenses by leveraging SGLang's optimized scheduling and effective cache management. These real-world applications highlight SGLang's ability to operate efficiently at scale, delivering performance improvements and cost benefits.
What Else Should I Know About SGLang?
SGLang is released under the Apache 2.0 open-source license and is available for both academic research and commercial applications. Its compatibility with the OpenAI API standard and its Python API allow developers to integrate it seamlessly into existing workflows.
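A minimal sketch of that integration path is shown below, assuming a server started locally with SGLang's launch_server entry point; the model path and port are placeholders. Because the endpoint follows the OpenAI API shape, the standard OpenAI Python client works unchanged.

```python
# Minimal sketch of SGLang's OpenAI-compatible serving. Model path and port
# are placeholders. The server is started with something like:
#   python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --port 30000
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain KV-cache reuse in one sentence."}],
)
print(resp.choices[0].message.content)
```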
SGLang Supports Top Models & Advanced Quantization
The engine supports many models, including popular ones such as:
- Llama
- Mistral
- Gemma
- Qwen
- DeepSeek
- Phi
- Granite
It is designed to work across various hardware platforms, including NVIDIA and AMD GPUs, and integrates advanced quantization techniques like FP8 and INT4.
Future enhancements will include:
- FP6 weight and FP8 activation quantization
- Faster startup times
- Cross-cloud load balancing
Comparing vLLM, LMDeploy, and SGLang

vLLM: A Memory Efficiency Powerhouse
vLLM tackles the memory challenges of large language models head-on. When running large models, GPUs can run out of memory, leading to errors and crashes. vLLM reduces memory consumption by optimizing how LLMs use memory during inference.
vLLM Reduces Memory & Speeds Inference
As a result, it can run LLMs with much lower memory requirements across a range of hardware. vLLM can reduce memory usage by 50% or more, and it achieves this via memory-management techniques borrowed from operating systems, most notably:
- PagedAttention, which stores the KV cache in fixed-size blocks instead of one large contiguous buffer
- Efficient organization of the data structures that track and reuse those blocks
In addition, vLLM uses continuous batching and parallel execution to further accelerate inference and improve resource utilization, speeding up inference without compromising accuracy.
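For instance, here is a hedged sketch of vLLM's offline API; the model name and the memory fraction are placeholders. The gpu_memory_utilization argument caps how much GPU memory vLLM may claim for its paged KV cache.

```python
# Hedged sketch of offline inference with vLLM. The model and the 0.5 memory
# fraction are placeholders chosen for illustration.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m", gpu_memory_utilization=0.5)
params = SamplingParams(temperature=0.8, max_tokens=64)

outputs = llm.generate(["The capital of France is"], params)
print(outputs[0].outputs[0].text)
```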
LMDeploy: Simplifying Large Model Deployment
Deploying large language models can be daunting. LMDeploy simplifies the deployment of LLMs at scale to make them more accessible for real-world applications. The framework integrates model parallelism and fine-tuning techniques to optimize large model deployment and improve inference speed and scalability.
LMDeploy excels in distributed settings, allowing users to deploy and run LLMs seamlessly across multiple GPUs or nodes. With LMDeploy, organizations can more easily harness the power of LLMs and put them to practical use.
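A hedged sketch of LMDeploy's high-level pipeline API follows; the model name is a placeholder, and the multi-GPU engine configuration is only indicated in a comment.

```python
# Hedged sketch of LMDeploy's pipeline API. The model name is a placeholder.
# For multi-GPU serving you could pass a backend config, e.g.:
#   from lmdeploy import TurbomindEngineConfig
#   pipe = pipeline(model, backend_config=TurbomindEngineConfig(tp=2))
from lmdeploy import pipeline

pipe = pipeline("internlm/internlm2_5-7b-chat")
responses = pipe([
    "What is LLM inference?",
    "Summarize tensor parallelism in one sentence.",
])
for r in responses:
    print(r.text)
```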
SGLang: Using Structured Programming to Optimize LLMs
SGLang approaches LLM optimization from a structured programming perspective. The framework introduces specialized programming abstractions and tools to give users fine-grained control over model execution. This enables efficient resource management.
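The snippet below sketches those abstractions: an SGLang program is a decorated Python function that interleaves prompt segments with generation calls and runs against an SGLang server. The endpoint URL and the question are assumptions for this example.

```python
# Sketch of SGLang's frontend abstractions. Assumes an SGLang server is
# already running at the URL below (an assumption for this example).
import sglang as sgl

@sgl.function
def qa(s, question):
    s += sgl.user(question)
    s += sgl.assistant(sgl.gen("answer", max_tokens=64, temperature=0))

sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))
state = qa.run(question="In one sentence, what does an inference engine do?")
print(state["answer"])
```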
LMDeploy, vLLM, and SGLang are three notable frameworks designed to optimize large language model (LLM) inference, each with its own unique features and capabilities.
LMDeploy delivers significant performance improvements with up to 1.8x higher throughput than vLLM. It supports key features like persistent batch (also known as continuous batching), blocked KV cache, dynamic split & fuse, tensor parallelism, and high-performance CUDA kernels. LMDeploy has two main inference engines: PyTorch and TurboMind, and supports weight-only and k/v quantization. The 4-bit inference performance is 2.4x better than FP16, making it a highly efficient option. LMDeploy also supports distributed inference and can handle transformers, multimodal LLMs, and mixture-of-expert LLMs.
vLLM is a fast and user-friendly library that focuses on LLM inference and serving, providing features such as cached PagedAttention, continuous batching, and distributed inference. vLLM benefits from fast model execution via CUDA/HIP graph and supports various quantizations, including GPTQ, AWQ, INT4, INT8, and FP8. It also utilizes optimized CUDA kernels and integrates with FlashAttention and FlashInfer. vLLM supports a wide range of models, including transformers, multimodal LLMs, mixture-of-expert LLMs, and embedding models.
SGLang builds upon open-source LLM engines like LightLLM, vLLM, and Guidance, incorporating advanced features such as RadixAttention for KV cache reuse and a compressed state machine for fast constrained decoding. It also leverages high-performance CUDA kernels from FlashInfer and torch.compile, as popularized by gpt-fast. SGLang uses a highly efficient Python-based batch scheduler, often outperforming C++-based systems, and is compatible with almost all transformer-based models.
Supported Architectures and GPUs:
- LMDeploy: Primarily supports NVIDIA GPUs.
- vLLM: Supports NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs and GPUs, PowerPC CPUs, TPUs, and AWS Neuron.
- SGLang: Primarily supports NVIDIA GPUs and has recently added support for AMD GPUs.
Each framework has a distinct focus on optimizing model performance and supporting various hardware and software configurations, making them suitable for different use cases in LLM deployment and inference.
Start Building with $10 in Free API Credits Today!
Inference delivers OpenAI-compatible serverless inference APIs for top open-source LLMs, offering developers the highest performance at the lowest cost in the market. Beyond standard inference, Inference provides specialized batch processing for large-scale async AI workloads and document extraction capabilities designed explicitly for RAG applications.
Start building with $10 in free API credits and experience state-of-the-art language models that balance cost-efficiency with high performance.
Related Reading
- Gemma LLM
- Llama.cpp