Most teams treat LLM calls as either synchronous (show me the answer now) or batch (run the whole dataset tonight). But batch inference is a poor fit for some workloads, which is why we are introducing a new type of LLM request: the asynchronous request.
Asynchronous requests are fire-and-forget calls that finish whenever idle GPUs are free.
Below is the 5‑minute explainer on why this matters and how to use it today with Inference.net.
Why “synchronous” is expensive
Serverless inference providers, from OpenAI and Anthropic to open-source LLM hosts like Inference.net, all face the same problem: usage spikes are hard to absorb, and it's difficult to scale up GPUs fast enough to serve every request without failing some. Individuals buy credits and expect to spend that LLM inference however and whenever they want. They might fire off requests as fast as rate limits allow, or go quiet for weeks and still expect capacity to be there the moment they decide to run inference.
The only way to minimize error rates in interactive endpoints is by keeping GPUs warm and waiting. The problem is that renting and underutilizing GPUs is expensive, and the costs are passed down to the user.
Industry surveys put average GPU utilization in the 15–40% range, meaning you're paying for a lot of dark time (DevZero, LinkedIn). That wasted capacity is ultimately baked into the pricing of every LLM model on offer.
This is also why most LLM providers have extremely restrictive rate limits: they simply can't handle sudden jumps in inference volume while maintaining reliability, and their margin for error is usually surprisingly thin. So what can you do if you need hundreds of thousands, or tens of millions, of requests to enrich your database?
Batch jobs help—but may be the wrong choice
Batch APIs let you upload a giant file and get a single discounted bill. From a pricing perspective, batching is actually really great for both sides. The user gets a discount, and the LLM inference provider gets to use underutilized capacity throughout the day.
Batching is excellent for processing large datasets for tasks like classification or synthetic dataset generation, but often you just want to process a single request and update your database. You don't need an instant response for this (in fact, waiting on one may tie up valuable server resources), but you also don't want the overhead of accumulating LLM requests into one big batch.
For many products that don't require an instant response, developers either:
- wait artificially until midnight to run a batch (which means managing queued requests and building a whole workflow, with high risk if the batch fails for some reason), or
- keep burning money with synchronous calls.
Meet the asynchronous request
An asynchronous request is just a normal API call that says:
“Here’s my payload. I don’t care if the answer comes back in two minutes or 10 hours—just make it cheap.”
Because the provider can slide these jobs into idle GPU gaps (just as with batch), the price drops and rate limits become largely unnecessary. 10M asynchronous requests spread over a day are substantially easier to process than 50k requests in 10 minutes, even though the daily workload actually averages more requests per second (roughly 116/s versus 83/s).
These asynchronous requests are also much more convenient for developers, who can configure a webhook endpoint in the Inference.net dashboard and handle responses as they arrive.
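To make the webhook flow concrete, here is a minimal receiver sketch using FastAPI. The payload field names are assumptions for illustration only; check the webhook documentation in the dashboard for the actual schema.

```python
# Minimal webhook receiver sketch (FastAPI). The payload fields used below
# (request_id and the nested completion content) are assumptions -- verify
# them against a real webhook delivery before relying on this.
from fastapi import FastAPI, Request

app = FastAPI()

def save_result(request_id: str, content: str) -> None:
    """Stand-in for your own persistence layer."""
    print(f"storing result for {request_id}: {content[:80]}")

@app.post("/webhooks/inference")
async def handle_inference_result(request: Request):
    payload = await request.json()
    request_id = payload.get("request_id")  # hypothetical field name
    choices = payload.get("data", {}).get("choices", [])
    content = choices[0]["message"]["content"] if choices else ""
    save_result(request_id, content)
    return {"ok": True}
```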
Using /v1/slow on Inference.net
```python
import os

from openai import OpenAI  # the OpenAI SDK works out of the box

client = OpenAI(
    base_url="https://api.inference.net/v1/slow",  # the only change
    api_key=os.getenv("INFERENCE_API_KEY"),
)

job = client.chat.completions.create(
    model="meta-llama/llama-3.2-1b-instruct/fp-8",
    messages=[{"role": "user", "content": "Summarize War and Peace in 10 bullets"}],
    metadata={
        "webhook_id": "AhALzdz8S"
    },
)

print(job.id)  # store this for later
```
The call returns immediately with a request ID. When GPU capacity frees up (which can be within seconds or up to 24 hours later), grab the result:
```bash
curl https://api.inference.net/v1/generation/$ID \
  -H "Authorization: Bearer $INFERENCE_API_KEY"
```
Or, more likely, receive the response at your webhook endpoint!
When asynchronous makes sense
- Large‑scale content generation (translations, synthetic data, SEO blobs, enrichment)
- Async enrichment jobs (company‐data lookups, document embeddings)
Some example applications that could benefit from asynchronous requests
Synchronous (sub‑second or streamed)
- Live chat completions for support widgets
- Code autocomplete in an IDE
- Real‑time policy filtering before a user message is posted
Batch (big payloads on a schedule)
- Nightly bulk translation of that day’s articles into 20 languages
- Weekly model‑retraining pass scoring 10 M reviews to refresh embeddings
- End‑of‑month contract audit that runs an LLM over every new PDF for compliance flags
Asynchronous (fire‑and‑forget single calls)
- User-profile enrichment: when someone signs up, kick off a cheap async request to summarize their LinkedIn/GitHub and store it for later personalization (see the sketch after this list)
- Background tagging: each newly uploaded PDF is sent to /v1/slow for keyword extraction; the tags land in your DB whenever ready
- Async sentiment/tone scoring: every inbound customer email gets an asynchronous request to label sentiment, feeding analytics dashboards without blocking support agents
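All three async examples follow the same fire-and-store pattern: submit the request when the event happens, save the returned id next to the record, and reconcile when the webhook delivers the result. Here is a sketch of the profile-enrichment case, where the persistence helper and webhook id are placeholders for your own setup.

```python
# Fire-and-store sketch for user-profile enrichment. `save_pending_job` is a
# stand-in for your own persistence code, and the webhook id is whatever you
# configured in the Inference.net dashboard.
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.inference.net/v1/slow",
    api_key=os.getenv("INFERENCE_API_KEY"),
)

def save_pending_job(user_id: str, job_id: str) -> None:
    """Stand-in: record which async job will produce this user's summary."""
    print(f"user {user_id} is waiting on job {job_id}")

def enrich_profile(user_id: str, public_bio: str) -> None:
    """Kick off a cheap async summarization when a user signs up."""
    job = client.chat.completions.create(
        model="meta-llama/llama-3.2-1b-instruct/fp-8",
        messages=[{"role": "user", "content": f"Summarize this profile in 3 bullets:\n{public_bio}"}],
        metadata={"webhook_id": "AhALzdz8S"},
    )
    save_pending_job(user_id, job.id)  # the summary arrives later via webhook
```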
Why asynchronous might be better than a synchronous request
With synchronous requests, you have to wait for a response, and you often end up paying for that waiting time (I have personally racked up over $10k in Vercel function-duration fees from awaiting slow LLM requests). Asynchronous requests don't require you to wait, so handling large numbers of them is an order of magnitude simpler. On top of that you get price savings and better reliability. It's a no-brainer if you don't need instant results.
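To illustrate the difference, here is a side-by-side sketch, assuming the synchronous endpoint lives at the base URL without /slow; the handler names are hypothetical.

```python
# Contrast sketch: the synchronous handler blocks (and bills you) until the
# model finishes; the async handler returns as soon as the job is accepted.
# The synchronous base URL and handler names are assumptions for illustration.
import os

from openai import OpenAI

client_sync = OpenAI(
    base_url="https://api.inference.net/v1",       # assumed synchronous endpoint
    api_key=os.getenv("INFERENCE_API_KEY"),
)
client_slow = OpenAI(
    base_url="https://api.inference.net/v1/slow",  # asynchronous endpoint
    api_key=os.getenv("INFERENCE_API_KEY"),
)

MODEL = "meta-llama/llama-3.2-1b-instruct/fp-8"

def handle_event_sync(prompt: str) -> str:
    # Blocks until generation completes -- every second of waiting is billed
    # serverless-function time.
    resp = client_sync.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def handle_event_async(prompt: str) -> str:
    # Returns immediately with a job id; the result lands at your webhook.
    job = client_slow.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        metadata={"webhook_id": "AhALzdz8S"},
    )
    return job.id
```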
How much cheaper?
Inference.net currently prices asynchronous inference at 10% below synchronous rates; if you expect large workloads, reach out and we may be able to increase that discount.
TL;DR
- Synchronous for UX, batch for big offline jobs.
- Everything else? Asynchronous requests.
- They're drop-in (/v1/slow), webhook-friendly, and cut your LLM bill.