The serverless LLM inference market has exploded with dozens of providers serving both proprietary and open-source models at varying prices, latencies, reliability levels, and floating-point precisions. While choosing the perfect provider and LLM isn't straightforward, most providers offer OpenAI-compatible endpoints that make integration simple. This standardization has made cost the primary differentiator for serverless open-source model inference. With new open-source models achieving state-of-the-art performance weekly, routing services like OpenRouter and Portkey have become essential infrastructure for seamlessly switching between providers.
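To make the "integration is simple" point concrete, here is a minimal sketch using the `openai` Python client against an OpenAI-compatible endpoint. The base URL, model slug, and environment variable name are illustrative placeholders, not a prescribed setup; the point is that switching providers typically amounts to changing the endpoint and model identifier.

```python
# Minimal sketch: the same client code can talk to any OpenAI-compatible provider.
# The base URL, model slug, and env var below are illustrative placeholders.
import os
from openai import OpenAI

# Point the client at whichever provider or router you want to use.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",   # or any provider's compatible endpoint
    api_key=os.environ["PROVIDER_API_KEY"],
)

# Swapping providers is just a different base_url and model identifier;
# the request and response shapes stay the same.
response = client.chat.completions.create(
    model="meta-llama/llama-3.1-8b-instruct",  # example open-source model slug
    messages=[{"role": "user", "content": "Summarize why provider switching is easy."}],
)
print(response.choices[0].message.content)
```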
The Race to the Bottom
Serverless inference is undeniably a race to the bottom—look at the provider prices on OpenRouter. Most serverless providers use OpenRouter as a top-of-funnel strategy for their more profitable dedicated endpoints and operate on either negative or razor-thin margins. Unless you're running a proprietary model, margins above 20% are virtually nonexistent.
Providers push the limits of optimization, implementing model-specific tweaks that allow them to serve inference faster and cheaper than competitors using default vLLM configurations. These optimizations require expensive LLMOps talent, a cost ultimately reflected in inference pricing.
Still, serverless inference prices could fall further: today, consumers can choose from only a handful of providers that either own GPU clusters or have deals with datacenters. Tens of millions of consumer GPUs sit on the sidelines, unable to sell inference even though they could serve it more cheaply than a datacenter!
The Trust Bottleneck
Currently, serverless inference remains a competition among a few dozen players, most of them visible on OpenRouter. Individual operators and datacenters with idle capacity can't participate because platforms like OpenRouter can't verify their reliability or validate their outputs. How do you confirm an LLM output is legitimate, and not gibberish from a smaller model or the same model running at lower precision? Trust and reliability are non-negotiable requirements that cannot be sacrificed for cheaper inference.
Breaking the Bottleneck
But what if any GPU could run serverless inference, and we could verify that its LLM output is correct?
Suddenly, the market opens to anyone worldwide—from college students leveraging dorm room electricity to datacenters strategically located near hydroelectric dams for rock-bottom power costs. The margins for LLM inference with existing GPUs far exceed those in cryptocurrency mining, which means a secondary income stream is unlocked for anybody with a consumer GPU.
The Inference.net Solution
This is precisely what we're building at Inference.net: a decentralized, verified network of LLM inference providers. We've achieved 99.99%+ accuracy in verifying LLM outputs, and through token staking and penalties for bad actors, we're driving the remaining error rate toward zero. Any GPU owner can run our Docker container, become an inference provider, and earn rewards for their compute.
In the future, we see model routers like OpenRouter becoming Inference.net wrappers as we dominate open-source model inference through our decentralized network.
The Migration Has Begun
Thousands of GPUs all over the world, from massive cryptocurrency operations to gaming rigs to idle workstations, are already switching to us. Someone with cheap electricity can serve inference profitably at prices that would bankrupt the current small set of centralized providers.
Crypto mining already proved that all computation eventually trades at the cost of electricity. Through our network of verifiable serverless inference, the same will be true of next-token prediction.
We'll see you at the bottom.