
    Exploring Llama.cpp With Practical Steps for Smarter AI Deployment

    Published on Mar 24, 2025


    Running and optimizing large language models can be quite a task: inference is often slow, costly, and complex. Fortunately, Llama.cpp has emerged as a powerful tool for tackling these challenges. This article will help you use Llama.cpp to run and optimize large language models efficiently on your local machine, achieving fast, cost-effective, and scalable AI deployment without relying on expensive cloud services. A complementary tool for the same goal is Inference's AI inference APIs, which can simplify your workflows and improve performance as you get started with Llama.cpp.

    What is Llama.cpp and What is it Used For?


    Llama.cpp is an efficient inference framework that allows users to run the LLaMa model and similar large language models easily. Developed by Georgi Gerganov, the library implements Meta’s LLaMa architecture in C/C++.

    Open-source contributions have fueled its rapid evolution. There are over 900 contributors, 69,000 stars on GitHub, and more than 2,600 releases.

    Why Llama.cpp Matters

    Large language models are revolutionizing various industries. From customer service chatbots to sophisticated data analysis tools, the capabilities of this powerful technology are reshaping digital interaction and automation. However, practical applications can be limited by high-powered computing requirements and the need for quick response times.

    LLMs typically require sophisticated hardware and extensive dependencies, making them difficult to adopt in constrained environments. Llama.cpp rescues users from these challenges, providing a lightweight and portable alternative to heavyweight frameworks.

    How Llama.cpp Works

    Llama.cpp is a CPU-first C++ library that prioritizes low resource requirements and cross-platform support. This design keeps complexity low and makes integration into existing programming environments straightforward, which has accelerated adoption across platforms. The framework also acts as a repository of critical low-level building blocks, streamlining development, much as LangChain does for high-level capabilities, though this consolidation may pose scalability challenges down the road. Llama.cpp's focused optimization improves the efficiency of running LLaMa and its variants: by supporting formats like GGUF and GGML, the library enables precise and effective enhancements for this model architecture.

    How Llama.cpp Differs From PyTorch and TensorFlow

    While Llama.cpp was built from the outset for running LLM inference, PyTorch and TensorFlow are end-to-end solutions that offer data processing, training, validation, and inference capabilities in one package. Both also have lightweight extensions for inference only, namely ExecuTorch and TensorFlow Lite. Considering only the inference phase of a model, Llama.cpp stays lightweight because it has no third-party dependencies and does not need to support an extensive set of operators or model formats.

    Analogy: Llama.cpp vs. PyTorch and TensorFlow

    If PyTorch and TensorFlow are luxury cruise ships, Llama.cpp is a small, speedy motorboat. The inference framework is lightweight and efficient, allowing users to run LLaMa and similar models quickly with low resource requirements. PyTorch and TensorFlow are heavier, more complex frameworks with extensive capabilities and features that can be overkill for inference of LLMs.

    Llama.cpp Architecture

    The backbone of Llama.cpp is the original LLaMa family of models, which is based on the transformer architecture. The authors of LLaMa leveraged various improvements that were proposed after the original transformer and used in different models such as PaLM. The main differences between the LLaMa architecture and the original transformer are:

    • Pre-normalization (GPT-3): improves training stability by normalizing the input of each transformer sub-layer using the RMSNorm approach, instead of normalizing the output (a minimal C++ sketch of RMSNorm follows this list).
    • SwiGLU activation function (PaLM): the original ReLU non-linearity is replaced by the SwiGLU activation function, which improves performance.
    • Rotary embeddings (GPT-Neo): the absolute positional embeddings are removed, and rotary positional embeddings (RoPE) are added at each layer of the network.
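
    To make the pre-normalization point concrete, below is a minimal, framework-independent sketch of RMSNorm in C++. It is an illustrative toy rather than the ggml kernel llama.cpp actually uses; the input vector x, the learned gain weight, and the epsilon value are assumptions chosen for demonstration.

    cpp
    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Toy RMSNorm: scale each element by the reciprocal root-mean-square of the
    // vector, then apply a learned per-channel gain. LLaMa applies this to the
    // *input* of every attention/feed-forward sub-layer (pre-normalization).
    std::vector<float> rmsNorm(const std::vector<float>& x,
                               const std::vector<float>& weight,
                               float eps = 1e-5f) {
        float meanSquare = 0.0f;
        for (float v : x) {
            meanSquare += v * v;
        }
        meanSquare /= static_cast<float>(x.size());
        const float invRms = 1.0f / std::sqrt(meanSquare + eps);

        std::vector<float> out(x.size());
        for (std::size_t i = 0; i < x.size(); ++i) {
            out[i] = x[i] * invRms * weight[i];
        }
        return out;
    }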


    Setting Up the Environment and Performing Inference

    Before installing llama.cpp, ensure your system meets the following requirements:

    • A Linux-based operating system (macOS also works)
    • CMake installed (version 3.10 or newer)
    • A C++17 compatible compiler (GCC 9.0 or newer, or Clang 9.0 or newer)
    • At least 1.5 GB of free memory to load the model and run inference.

    Installing Llama.cpp

    We will start in a Linux-based environment (native or WSL) with CMake and the GNU/Clang toolchain installed. We'll compile llama.cpp from source and link it into our chat executable as a shared library. We create our project directory smol_chat with an externals directory to store the cloned llama.cpp repository.

    bash
    mkdir smol_chat
    cd smol_chat
    
    mkdir src
    mkdir externals
    touch CMakeLists.txt

    Next, navigate to the externals directory and clone the llama.cpp repository.

    bash
    cd externals
    git clone --depth=1 https://github.com/ggerganov/llama.cpp


    CMakeLists.txt is where we define our build, allowing CMake to compile our C/C++ code using the default toolchain (GNU/clang) by including headers and shared libraries from externals/llama.cpp.

    cmake
    cmake_minimum_required(VERSION 3.10)
    project(llama_inference)
    
    set(CMAKE_CXX_STANDARD 17)
    set(LLAMA_BUILD_COMMON On)
    add_subdirectory("${CMAKE_CURRENT_SOURCE_DIR}/externals/llama.cpp")
    
    add_executable(
        chat
        src/LLMInference.cpp src/main.cpp
    )
    target_link_libraries(
        chat 
        PRIVATE
        common llama ggml
    )

    Loading the Model

    We have now defined how CMake should build our project. Next, we create a header file LLMInference.h which declares a class containing high-level functions to interact with the LLM. llama.cpp exposes a C-style API, so embedding it within a class helps us abstract away the inner working details.

    cpp
    #ifndef LLMINFERENCE_H
    #define LLMINFERENCE_H
    
    #include "common.h"
    #include "llama.h"
    #include <string>
    #include <vector>
    
    class LLMInference {
    
        // llama.cpp-specific types
        llama_context* _ctx;
        llama_model* _model;
        llama_sampler* _sampler;
        llama_batch _batch;
        llama_token _currToken;
        
        // container to store user/assistant messages in the chat
        std::vector<llama_chat_message> _messages;
        // stores the string generated after applying
        // the chat-template to all messages in `_messages`
        std::vector<char> _formattedMessages;
        // stores the tokens for the last query
        // appended to `_messages`
        std::vector<llama_token> _promptTokens;
        int _prevLen = 0;
    
        // stores the complete response for the given query
        std::string _response = "";
    
        public:
    
        void loadModel(const std::string& modelPath, float minP, float temperature);
    
        void addChatMessage(const std::string& message, const std::string& role);
        
        void startCompletion(const std::string& query);
    
        std::string completionLoop();
    
        void stopCompletion();
    
        ~LLMInference();
    };
    
    #endif


    The private members declared in the header above will be used to implement the public member functions described in the following sections of the blog. Let us define each of these member functions in LLMInference.cpp.

    cpp
    #include "LLMInference.h"
    #include <cstdlib>
    #include <cstring>
    #include <iostream>
    
    void LLMInference::loadModel(const std::string& model_path, float min_p, float temperature) {
        // create an instance of llama_model
        llama_model_params model_params = llama_model_default_params();
        _model = llama_load_model_from_file(model_path.data(), model_params);
    
        if (!_model) {
            throw std::runtime_error("load_model() failed");
        }
    
        // create an instance of llama_context
        llama_context_params ctx_params = llama_context_default_params();
        ctx_params.n_ctx = 0;               // take context size from the model GGUF file
        ctx_params.no_perf = true;          // disable performance metrics
        _ctx = llama_new_context_with_model(_model, ctx_params);
    
        if (!_ctx) {
            throw std::runtime_error("llama_new_context_with_model() returned null");
        }
    
        // initialize sampler
        llama_sampler_chain_params sampler_params = llama_sampler_chain_default_params();
        sampler_params.no_perf = true;      // disable performance metrics
        _sampler = llama_sampler_chain_init(sampler_params);
        llama_sampler_chain_add(_sampler, llama_sampler_init_min_p(min_p, 1));
        llama_sampler_chain_add(_sampler, llama_sampler_init_temp(temperature));
        llama_sampler_chain_add(_sampler, llama_sampler_init_dist(LLAMA_DEFAULT_SEED));
    
        _formattedMessages = std::vector<char>(llama_n_ctx(_ctx));
        _messages.clear();
    }



    llama_load_model_from_file reads the model from the file (using llama_load_model internally) and populates the llama_model instance according to the given llama_model_params. The user can supply these parameters, or obtain a pre-initialized default struct with llama_model_default_params. llama_context represents the execution environment for the loaded GGUF model. llama_new_context_with_model instantiates a new llama_context and prepares a backend for execution, either by reading the supplied parameters or by automatically detecting the available backends. It also initializes the K-V cache, which is important in the decoding or inference step, along with a backend scheduler that manages computations across multiple backends.

    A llama_sampler determines how we sample/choose tokens from the probability distribution derived from the outputs (logits) of the model (specifically the decoder of the LLM). LLMs assign a probability to each token present in the vocabulary, representing the chances of the token appearing next in the sequence. The temperature and min-p we set with llama_sampler_init_temp and llama_sampler_init_min_p are two parameters controlling the token sampling process.
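
    The header also declares addChatMessage, which the methods below rely on, but its definition does not appear in the listings here. A minimal sketch consistent with the rest of the code (it strdup's both strings so the destructor can later free them) might look like the following; treat it as an assumption about the implementation rather than the author's exact code.

    cpp
    void LLMInference::addChatMessage(const std::string& message, const std::string& role) {
        // llama_chat_message stores raw `const char*` pointers, so copy the role and
        // content into malloc'ed buffers that outlive the std::string arguments;
        // the destructor releases them with free()
        _messages.push_back({ strdup(role.data()), strdup(message.data()) });
    }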

    Performing Inference

    The inference process involves multiple steps. It takes a user's text query as input and returns the LLM’s response.

    1. Applying the chat template to the queries

    For an LLM, the incoming messages belong to three roles: user, assistant, and system. User and assistant messages are given by the user and the LLM, respectively, whereas the system denotes a system-wide prompt that is followed across the entire conversation. Each message consists of a role and content, where content is the actual text and role is any of the three roles.

    The system prompt is the first message of the conversation. In our code, the messages are stored as a std::vector<llama_chat_message> named _messages, where llama_chat_message is a llama.cpp struct with role and content attributes. We use the llama_chat_apply_template function from llama.cpp to apply the chat template stored in the GGUF file as metadata. We store the string, i.e. std::vector<char>, obtained after applying the chat template in _formattedMessages.

    2. Tokenization

    Tokenization divides a given text into smaller parts (tokens). Each part/token is assigned a unique integer ID, transforming the input text into a sequence of integers that forms the input to the LLM. Llama.cpp provides the common_tokenize and llama_tokenize functions to perform tokenization, where common_tokenize returns the sequence of tokens as a std::vector<llama_token>.

    cpp
    void LLMInference::startCompletion(const std::string& query) {
        addChatMessage(query, "user");
    
        // apply the chat-template 
        int new_len = llama_chat_apply_template(
                _model,
                nullptr,
                _messages.data(),
                _messages.size(),
                true,
                _formattedMessages.data(),
                _formattedMessages.size()
        );
        if (new_len > (int)_formattedMessages.size()) {
            // resize the output buffer `_formattedMessages`
            // and re-apply the chat template
            _formattedMessages.resize(new_len);
            new_len = llama_chat_apply_template(_model, nullptr, _messages.data(), _messages.size(), true, _formattedMessages.data(), _formattedMessages.size());
        }
        if (new_len < 0) {
            throw std::runtime_error("llama_chat_apply_template() in LLMInference::start_completion() failed");
        }
        std::string prompt(_formattedMessages.begin() + _prevLen, _formattedMessages.begin() + new_len);
        
        // tokenization
        _promptTokens = common_tokenize(_model, prompt, true, true);
    
        // create a llama_batch containing a single sequence
        // see llama_batch_init for more details
        _batch.token = _promptTokens.data();
        _batch.n_tokens = _promptTokens.size();
    }


    In the code, we apply the chat template, perform tokenization in the LLMInference::startCompletion method, and then create a llama_batch instance holding the model's final inputs.
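
    Writing the token pointer and count into _batch by hand works here because we never pass this batch through llama_batch_init/llama_batch_free. llama.cpp also provides a helper, llama_batch_get_one, that assembles such a single-sequence batch; its signature has changed across releases (older versions also took a starting position and a sequence id), so check llama.h in your checkout before using the sketch below.

    cpp
    // alternative: let llama.cpp assemble the single-sequence batch
    // (recent llama.cpp versions take only the token pointer and count)
    _batch = llama_batch_get_one(_promptTokens.data(),
                                 static_cast<int32_t>(_promptTokens.size()));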

    3. Decoding, Sampling and the KV Cache

    As highlighted earlier, LLMs generate a response by successively predicting the next token in the given sequence. LLMs are also trained to predict a special end-of-generation (EOG) token, indicating the end of the generated token sequence. The completionLoop function returns the next token in the sequence and keeps getting called until it returns the EOG token.

    Using llama_n_ctx and llama_get_kv_cache_used_cells, we determine how much of the context length has already been used for storing the inputs. Currently, we terminate the program with an error message if the length of the tokenized inputs exceeds the context size. llama_decode performs a forward pass of the model, given the inputs in _batch.

    Using the _sampler initialized in LLMInference::loadModel, we sample or choose a token as our prediction and store it in _currToken. We check whether the token is an EOG token; if so, we append a new message to _messages containing the complete _response generated by the LLM with the role assistant, and return "[EOG]" to signal that the text-generation loop calling LLMInference::completionLoop should terminate. Otherwise, _currToken, which is an integer token ID, is converted to a string token piece by the common_token_to_piece function, and this string is returned from the completionLoop method. We also reinitialize _batch so that it now contains only _currToken and not the entire input sequence, i.e. _promptTokens. This works because the 'keys' and 'values' of all previous tokens have been cached, which reduces the inference time by avoiding recomputation of the 'keys' and 'values' for every token in _promptTokens.

    cpp
    std::string LLMInference::completionLoop() {
        // check if the length of the inputs to the model
        // have exceeded the context size of the model
        int contextSize = llama_n_ctx(_ctx);
        int nCtxUsed = llama_get_kv_cache_used_cells(_ctx);
        if (nCtxUsed + _batch.n_tokens > contextSize) {
            std::cerr << "context size exceeded" << '\n';
            exit(0);
        }
        // run the model
        if (llama_decode(_ctx, _batch) < 0) {
            throw std::runtime_error("llama_decode() failed");
        }
    
        // sample a token and check if it is an EOG (end of generation token)
        // convert the integer token to its corresponding word-piece
        _currToken = llama_sampler_sample(_sampler, _ctx, -1);
        if (llama_token_is_eog(_model, _currToken)) {
            addChatMessage(_response, "assistant");
            _response.clear();
            return "[EOG]";
        }
        std::string piece = common_token_to_piece(_ctx, _currToken, true);
        // accumulate the response so it can be added to `_messages` once EOG is reached
        _response += piece;

        // re-init the batch with the newly predicted token
        // key, value pairs of all previous tokens have been cached
        _batch.token = &_currToken;
        _batch.n_tokens = 1;
    
        return piece;
    }


    Also, for each query made by the user, the LLM takes the entire tokenized conversation (all messages stored in _messages) as input. If we tokenized the whole conversation every time in the startCompletion method, the preprocessing time, and thus the overall inference time, would grow as the conversation gets longer. To avoid this computation, we only tokenize the latest message/query added to _messages. The length up to which the messages in _formattedMessages have been tokenized is stored in _prevLen. At the end of response generation, i.e. in LLMInference::stopCompletion, after the LLM's response has been appended to _messages, we update the value of _prevLen using the return value of llama_chat_apply_template.

    cpp
    void LLMInference::stopCompletion() {
        _prevLen = llama_chat_apply_template(
                _model,
                nullptr,
                _messages.data(),
                _messages.size(),
                false,
                nullptr,
                0
        );
        if (_prevLen < 0) {
            throw std::runtime_error("llama_chat_apply_template() in LLMInference::stop_completion() failed");
        }
    }

    Good Habits: Writing a Destructor

    We implement a destructor that deallocates the dynamically allocated objects, both those held in _messages and those allocated internally by llama.cpp.

    cpp
    LLMInference::~LLMInference() {
        // free memory held by the message text in messages
        // (as we had used strdup() to create a malloc'ed copy)
        for (llama_chat_message& message : _messages) {
            // strdup() allocates with malloc(), so release with free(), not delete
            free(const_cast<char*>(message.role));
            free(const_cast<char*>(message.content));
        }
        llama_kv_cache_clear(_ctx);
        llama_sampler_free(_sampler);
        llama_free(_ctx);
        llama_free_model(_model);
    }

    Writing a Small CMD Application

    We create a small command-line interface that allows us to converse with the LLM. It instantiates the LLMInference class and calls all the methods we defined in the previous sections.

    cpp
    #include "LLMInference.h"
    #include <memory>
    #include <iostream>
    
    int main(int argc, char* argv[]) {
    
        std::string modelPath = "smollm2-360m-instruct-q8_0.gguf";
        float temperature = 1.0f;
        float minP = 0.05f;
        std::unique_ptr<LLMInference> llmInference = std::make_unique<LLMInference>();
        llmInference->loadModel(modelPath, minP, temperature);
    
        llmInference->addChatMessage("You are a helpful assistant", "system");
    
        while (true) {
            std::cout << "Enter query:\n";
            std::string query;
            std::getline(std::cin, query);
            if (query == "exit") {
                break;
            }
            llmInference->startCompletion(query);
            std::string predictedToken;
            while ((predictedToken = llmInference->completionLoop()) != "[EOG]") {
                std::cout << predictedToken;
                std::cout.flush();
            }
            std::cout << '\n';
        }
    
        return 0;
    }

    Running the Application

    We use the CMakeLists.txt authored in a previous section to generate a Makefile, which compiles the code and creates an executable ready for use.

    bash
    mkdir build
    cd build
    cmake ..
    make
    ./chat



    Here's how the output looks:

    register_backend: registered backend CPU (1 devices)
    register_device: registered device CPU (11th Gen Intel(R) Core(TM) i3-1115G4 @ 3.00GHz)
    llama_model_loader: loaded meta data with 33 key-value pairs and 290 tensors from /home/shubham/CPP_Projects/llama-cpp-inference/models/smollm2-360m-instruct-q8_0.gguf (version GGUF V3 (latest))
    llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
    llama_model_loader: - kv   0: general.architecture str = llama
    llama_model_loader: - kv   1: general.type str = model
    llama_model_loader: - kv   2: general.name str = Smollm2 360M 8k Lc100K Mix1 Ep2
    llama_model_loader: - kv   3: general.organization str = Loubnabnl
    llama_model_loader: - kv   4: general.finetune str = 8k-lc100k-mix1-ep2
    llama_model_loader: - kv   5: general.basename str = smollm2
    llama_model_loader: - kv   6: general.size_label str = 360M
    llama_model_loader: - kv   7: general.license str = apache-2.0
    llama_model_loader: - kv   8: general.languages arr[str,1] = ["en"]
    llama_model_loader: - kv   9: llama.block_count u32 = 32
    llama_model_loader: - kv   10: llama.context_length u32 = 8192
    llama_model_loader: - kv   11: llama.embedding_length u32 = 960
    llama_model_loader: - kv   12: llama.feed_forward_length u32 = 2560
    llama_model_loader: - kv   13: llama.attention.head_count u32 = 15
    llama_model_loader: - kv   14: llama.attention.head_count_kv u32 = 5
    llama_model_loader: - kv   15:  llama.rope.freq_base f32 = 100000.000000
    llama_model_loader: - kv   16: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
    llama_model_loader: - kv   17: general.file_type u32  = 7
    llama_model_loader: - kv   18: llama.vocab_size u32 = 49152
    llama_model_loader: - kv   19: llama.rope.dimension_count u32 = 64
    llama_model_loader: - kv   20: tokenizer.ggml.add_space_prefix bool = false
    llama_model_loader: - kv   21: tokenizer.ggml.add_bos_token bool = false
    llama_model_loader: - kv   22: tokenizer.ggml.model str = gpt2
    llama_model_loader: - kv   23: tokenizer.ggml.pre str              = smollm
    llama_model_loader: - kv   24: tokenizer.ggml.tokens arr[str,49152] = ["<|endoftext|>", "<|im_start|>", "<|...
    llama_model_loader: - kv   25: tokenizer.ggml.token.type arr[i32,49152] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
    llama_model_loader: - kv   26: tokenizer.ggml.merges arr[str,48900]   = ["Ġ t", "Ġ a", "i n", "h e", "Ġ Ġ...
    llama_model_loader: - kv   27: tokenizer.ggml.bos_token_id u32 = 1
    llama_model_loader: - kv   28: tokenizer.ggml.eos_token_id u32 = 2
    llama_model_loader: - kv   29: tokenizer.ggml.unknown_token_id u32 = 0
    llama_model_loader: - kv   30: tokenizer.ggml.padding_token_id u32 = 2
    llama_model_loader: - kv   31: tokenizer.chat_template str = {% for message in messages %}{% if lo...
    llama_model_loader: - kv   32: general.quantization_version u32 = 2
    llm_load_vocab: control token: 7 '<gh_stars>' is not marked as EOG
    llm_load_vocab: control token: 13 '<jupyter_code>' is not marked as EOG
    llm_load_vocab: control token: 16 '<empty_output>' is not marked as EOG
    llm_load_vocab: control token: 11 '<jupyter_start>' is not marked as EOG
    llm_load_vocab: control token: 10 '<issue_closed>' is not marked as EOG
    llm_load_vocab: control token: 6 '<filename>' is not marked as EOG
    llm_load_vocab: control token: 8 '<issue_start>' is not marked as EOG
    llm_load_vocab: control token: 3 '<repo_name>' is not marked as EOG
    llm_load_vocab: control token: 12 '<jupyter_text>' is not marked as EOG
    llm_load_vocab: control token: 15 '<jupyter_script>' is not marked as EOG
    llm_load_vocab: control token: 4 '<reponame>' is not marked as EOG
    llm_load_vocab: control token: 1 '<|im_start|>' is not marked as EOG
    llm_load_vocab: control token: 9 '<issue_comment>' is not marked as EOG
    llm_load_vocab: control token: 5 '<file_sep>' is not marked as EOG
    llm_load_vocab: control token: 14 '<jupyter_output>' is not marked as EOG
    llm_load_vocab: special tokens cache size = 17
    llm_load_vocab: token to piece cache size = 0.3170 MB
    llm_load_print_meta: format = GGUF V3 (latest)
    llm_load_print_meta: arch = llama
    llm_load_print_meta: vocab type = BPE
    llm_load_print_meta: n_vocab = 49152
    llm_load_print_meta: n_merges = 48900
    llm_load_print_meta: vocab_only = 0
    llm_load_print_meta: n_ctx_train = 8192
    llm_load_print_meta: n_embd = 960
    llm_load_print_meta: n_layer = 32
    llm_load_print_meta: n_head = 15
    llm_load_print_meta: n_head_kv = 5
    llm_load_print_meta: n_rot = 64
    llm_load_print_meta: n_swa = 0
    llm_load_print_meta: n_embd_head_k = 64
    llm_load_print_meta: n_embd_head_v = 64
    llm_load_print_meta: n_gqa = 3
    llm_load_print_meta: n_embd_k_gqa = 320
    llm_load_print_meta: n_embd_v_gqa = 320
    llm_load_print_meta: f_norm_eps = 0.0e+00
    llm_load_print_meta: f_norm_rms_eps = 1.0e-05
    llm_load_print_meta: f_clamp_kqv = 0.0e+00
    llm_load_print_meta: f_max_alibi_bias = 0.0e+00
    llm_load_tensors: ggml ctx size = 0.14 MiB
    llm_load_tensors: CPU buffer size =   366.80 MiB
    ...............................................................................
    llama_new_context_with_model: n_ctx = 8192
    llama_new_context_with_model: n_batch = 2048
    llama_new_context_with_model: n_ubatch = 512
    llama_new_context_with_model: flash_attn = 0
    llama_kv_cache_init: CPU KV buffer size = 320.00 MiB
    llama_new_context_with_model: KV self size = 320.00 MiB, K (f16): 160.00 MiB, V (f16):  160.00 MiB
    llama_new_context_with_model: CPU  output buffer size = 0.19 MiB
    ggml_gallocr_reserve_n: reallocating CPU buffer from size 0.00 MiB to 263.51 MiB
    llama_new_context_with_model: CPU compute buffer size =   263.51 MiB
    llama_new_context_with_model: graph nodes  = 1030
    llama_new_context_with_model: graph splits = 1
    
    
    Enter query:
    How are you?
    I'm a text-based AI assistant. I don't have emotions or personal feelings, but I can understand and respond to your requests accordingly. If you have questions or need help with anything, feel free to ask.
    
    
    Enter query:
    Write a one line description on the C++ keyword 'new' 
    New C++ keyword represents memory allocation for dynamically allocated memory.
    Enter query:
    exit


    Llama.cpp Real-World Applications


    ETP4Africa, a tech startup, needed a language model for its educational app that could operate efficiently on various devices without causing delays. The company’s app focuses on helping young learners master programming languages, so the model’s ability to deliver prompt responses was crucial to ensuring a smooth and practical coding learning experience.

    Solution with Llama.cpp: A Practical Approach to Building Educational Apps

    They implemented Llama.cpp, taking advantage of its CPU-optimized performance and the ability to interface with their Go-based backend. The lightweight nature of Llama.cpp allowed the model to run smoothly on the app without hogging device resources or impacting performance.

    Benefits: Customization, Portability and Speed

    Llama.cpp's lightweight design ensures fast responses and compatibility with many devices. The integration of Llama.cpp allows the ETP4Africa app to offer immediate, interactive programming guidance, improving user experience and engagement. Tailored low-level features allow the app to provide practical real-time coding assistance.

    Start Building with $10 in Free API Credits Today!

    Inference delivers OpenAI-compatible serverless inference APIs for top open-source LLM models, offering developers the highest performance at the lowest cost in the market. Beyond standard inference, Inference provides specialized batch processing for large-scale async AI workloads and document extraction capabilities designed explicitly for RAG applications.

    Start building with $10 in free API credits and experience state-of-the-art language models that balance cost-efficiency with high performance.
