How Smart Routing Saved Exa 90% on LLM Costs During Their Viral Moment
Michael Ryaboy
Published on May 29, 2025

When an LLM project gets traction, the focus shifts from growth to scrambling to save on tokens. This is especially common with multi-turn chat applications, where users often have cost-inefficient habits like pasting massive texts or PDFs, or forgetting to start new chats.
Several founders have scrambled as their tools went viral and their Anthropic/OpenAI costs ballooned to hundreds of thousands of dollars in a week. One infamous example is Wordware, whose AI roast tool went megaviral and left the entire team sleepless for days. A similar thing happened at Exa when their Twitter Wrapped tool took off - but they came up with a clever solution that saved them 90% on tokens: route the users with the most followers to Claude, and everyone else to dirt-cheap open-source models.
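To make that concrete, here's a minimal sketch of tier-based routing. The follower threshold, function name, and model ids are illustrative assumptions on my part, not Exa's actual implementation:

# Illustrative tier-based router: premium model for high-visibility users,
# cheap open-source model for everyone else.
def choose_model(follower_count: int) -> tuple[str, str]:
    """Return a (provider, model) pair for this request."""
    if follower_count >= 100_000:  # hypothetical VIP cutoff
        return ("anthropic", "claude-3-5-sonnet-20241022")  # premium path for high-profile users
    return ("inference.net", "deepseek/r1-distill-llama-70b/fp-8")  # budget path for everyone else

In a real app you'd tune the cutoff from your own cost data rather than hard-coding a number.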
This is a problem most developers working with LLMs face at some point. Typically, fewer than 5% of your users account for most of your costs. The trick is not to offer fewer features, but to make those users less expensive to serve.
But how do you actually do this?
Using a Cheaper LLM/Provider
OpenAI is great for prototyping, but its per-token pricing lags behind cheaper providers. While OpenAI works to catch up, you may still need to offer its proprietary models to discerning users (if your app allows model selection) - but free users typically shouldn't get premium OpenAI models, since that would destroy your margins.
Price Comparison: GPT-4.1 Mini vs DeepSeek R1 Distill Llama 70B
GPT-4.1 Mini (OpenAI):
- Input: $0.40 / 1M tokens
- Output: $1.60 / 1M tokens
DeepSeek R1 Distill Llama 70B (Inference.net):
- Input: $0.10 / 1M tokens
- Output: $0.40 / 1M tokens
Cost Savings
DeepSeek R1 Distill Llama 70B is 4x cheaper than GPT-4.1 Mini on both input and output tokens - significant savings for a production application. Inference.net specializes in serving open-source models like this at competitive rates, with up to 90% cost savings compared to legacy providers, making it particularly attractive for free-tier users, where margins matter most.
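Because Inference.net exposes an OpenAI-compatible API, switching is usually just a matter of changing the base URL and model id. Here's a minimal sketch using the OpenAI Python SDK; the environment variable name is my assumption, while the base URL and model id match the curl example later in this post:

import os
from openai import OpenAI

# Point the standard OpenAI client at an OpenAI-compatible provider.
client = OpenAI(
    base_url="https://api.inference.net/v1",
    api_key=os.environ["INFERENCE_API_KEY"],
)

response = client.chat.completions.create(
    model="deepseek/r1-distill-llama-70b/fp-8",
    messages=[{"role": "user", "content": "Summarize this conversation so far."}],
)
print(response.choices[0].message.content)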
Context Manipulation
Here's the thing with context manipulation: some users massively abuse the context window. They paste entire PDFs, dump huge text files, or keep adding to one conversation without ever starting fresh. You don't want to pass all of that irrelevant information to the LLM, because it hurts both performance and your wallet.
The trick is creating a chat experience where users never have to think about how you're managing context behind the scenes. In practice, you don't want to feed everything to the LLM - and holding some of it back pays off twice.
First, it can actually improve quality, because you're removing irrelevant information from the context. Second, it dramatically cuts cost. You win on both fronts.
How to Reduce Your Context
You can chunk your inputs, filter chat history for relevance, use re-rankers to surface the most important information, and implement all these optimizations yourself. Or you can use a service that handles this for you.
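If you'd rather roll your own, here's a rough sketch of the reranker approach, assuming the sentence-transformers cross-encoder and chat history stored as a flat list of message dicts; the token budget and the four-characters-per-token estimate are crude assumptions:

from sentence_transformers import CrossEncoder

# Score each prior message against the new query and keep only the most
# relevant ones, up to a rough token budget.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def prune_history(history: list[dict], query: str, token_budget: int = 4000) -> list[dict]:
    scores = reranker.predict([(query, msg["content"]) for msg in history])
    ranked = sorted(zip(scores, history), key=lambda pair: pair[0], reverse=True)

    kept, used = [], 0
    for _score, msg in ranked:
        est_tokens = len(msg["content"]) // 4  # crude token estimate
        if used + est_tokens > token_budget:
            continue
        kept.append(msg)
        used += est_tokens

    # Restore the original conversation order for the messages we kept.
    return [msg for msg in history if msg in kept]

In practice you'd also pin the system prompt and the last couple of turns so the model never loses the immediate thread.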
Supermemory is one service that tackles this exact problem. Under the hood, it chunks your inputs and uses a cross-encoder reranker to filter out irrelevant data that might confuse your model. This allows for much longer context sizes - they call it "unlimited context" since conversations can extend indefinitely without hitting traditional token limits. When conversations exceed 20k tokens, Supermemory intelligently retrieves only the most relevant context from previous messages.
What's particularly impressive is how much better performance holds up with Supermemory in the loop. In their benchmarks, standard models degrade sharply as context length grows (dropping from a score of ~100 to ~30 at 32K tokens), while Supermemory-enhanced models stay much more stable, typically holding a score above 60-70 even at longer contexts.
Beyond performance improvements, Supermemory also delivers significant latency benefits. By filtering out irrelevant context, they claim up to 80% latency reduction on large texts - your users get faster responses while you save money.
Combining Both Strategies for Maximum Savings
Here's where it gets interesting: you can use both approaches together for maximum cost optimization. Supermemory works with any OpenAI-compatible API, which means you can point it at Inference.net's endpoints instead of OpenAI's.
Setting this up is straightforward. You need an API key from each service and the right endpoint:
- Get a Supermemory API key from their Developer Platform
- Get your Inference.net API key from their dashboard
- Choose your endpoint: instead of https://api.supermemory.ai/v3/https://api.openai.com/v1/chat/completions, you'd use the Inference.net endpoint https://api.supermemory.ai/v3/https://api.inference.net/v1/chat/completions
Your request would look something like this:
curl https://api.supermemory.ai/v3/https://api.inference.net/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $INFERENCE_API_KEY" \
  -H "x-api-key: $SUPERMEMORY_API_KEY" \
  -d '{
    "model": "deepseek/r1-distill-llama-70b/fp-8",
    "messages": [
      {"role": "user", "content": "Your long conversation here..."}
    ]
  }'
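If you're calling the API from Python, the same proxied request should work with the OpenAI SDK by overriding the base URL and attaching the Supermemory key as a default header; the environment variable names are assumptions:

import os
from openai import OpenAI

# Same call as the curl example: Supermemory sits in front and forwards to Inference.net.
client = OpenAI(
    base_url="https://api.supermemory.ai/v3/https://api.inference.net/v1",
    api_key=os.environ["INFERENCE_API_KEY"],  # sent as the Bearer token
    default_headers={"x-api-key": os.environ["SUPERMEMORY_API_KEY"]},
)

response = client.chat.completions.create(
    model="deepseek/r1-distill-llama-70b/fp-8",
    messages=[{"role": "user", "content": "Your long conversation here..."}],
)
print(response.choices[0].message.content)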
The Math on Combined Savings
Let's calculate the real savings with a practical example. Assume you're processing 1B input tokens and 300M output tokens per month:
GPT-4.1 Mini (OpenAI) Baseline:
- Input: 1B tokens × $0.40/M = $400
- Output: 300M tokens × $1.60/M = $480
- Total: $880/month
DeepSeek + Supermemory Combined:
- Base model cost (4x cheaper): $880 ÷ 4 = $220
- Supermemory token reduction (50%+ savings): $220 × 0.5 = $110
- Total: $110/month
Total savings: 87.5% - you're paying $110 instead of $880.
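If you want to run the same back-of-the-envelope math on your own traffic, it's a few lines of arithmetic. Prices are the ones quoted above; the 50% reduction is Supermemory's claimed figure, applied to both input and output exactly as in the calculation above:

def monthly_cost(input_tokens, output_tokens, in_price_per_m, out_price_per_m, token_reduction=0.0):
    # Prices are dollars per 1M tokens; token_reduction is the fraction of tokens saved.
    kept = 1.0 - token_reduction
    return (input_tokens * in_price_per_m + output_tokens * out_price_per_m) * kept / 1_000_000

baseline = monthly_cost(1e9, 300e6, 0.40, 1.60)                       # GPT-4.1 Mini: 880.0
combined = monthly_cost(1e9, 300e6, 0.10, 0.40, token_reduction=0.5)  # DeepSeek + Supermemory: 110.0
print(f"${baseline:.0f} -> ${combined:.0f} ({1 - combined / baseline:.1%} saved)")  # $880 -> $110 (87.5% saved)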
The math is even more compelling for power users with longer conversations:
- Inference.net saves you 4x on model costs
- Supermemory saves you 50% on token usage through intelligent context management
- Plus up to 80% latency reduction on large texts
- And performance that holds up better as context grows
Since Supermemory acts as an intelligent proxy, you get the benefits of smart context management while routing your optimized requests to cheaper models. Your users get unlimited context, better performance (thanks to filtered irrelevant information), faster responses, and you get sustainable economics.
All of this assumes you're using GPT-4.1 Mini, not even full GPT-4.1. If you're using larger models, the absolute savings are roughly 5x bigger.
Final Thoughts
These two strategies are essential for anyone running LLM applications at scale. Switching to cheaper providers for the right use cases and implementing smart context management can save you 87% or more of your token costs while often improving the user experience.
You can implement smart context management yourself without external tools, but studying existing solutions like Supermemory provides a useful blueprint for context expansion and management strategies.
The key is being strategic about where you apply premium models versus cheaper alternatives, and not letting power users destroy your margins through inefficient context usage. Your business model should work for both casual users and power users - these techniques make that possible.
Remember: the goal isn't to restrict features, but to deliver them more efficiently. When done right, your users get better performance, faster responses, and you get sustainable economics. The 5% of users who drive most of your costs become manageable, and you can focus back on growth instead of scrambling to optimize every token.