How Olive Delivers Real-Time Food Verdicts on a Model It Owns
Olive is a consumer food transparency app that helps families scan groceries, understand ingredient risks, and make better food choices in real time.
Company Overview
Olive is a consumer food transparency app. A user scans a product by its barcode or label, and the app returns a score from 0 to 100 along with a short verdict flagging any potential issues with consuming that product.
The score itself comes from Olive's knowledge system, which weighs factors such as seed oils, additives, processing level, organic sourcing, and lab-tested toxins. The model handles the part the user actually reads: turning that score and the ingredient breakdown into a short message that tells the user whether any ingredients warrant caution.
A typical verdict reads like this:
The main issue here is sodium benzoate, a synthetic preservative that has no real reason to be in a ferment or brine. Look for a version where the ingredient list stops at salt, vinegar, and spices.
Because the verdict streams onto the screen word by word while the user waits, response speed is essential to a seamless experience.
Results at a Glance
Using Inference.net, Olive replaced the third-party frontier models it had been running, first GPT-4.1 and later Claude Sonnet 4.6, with a custom model tuned specifically for its product. The custom model is a fine-tuned Qwen 3.5 9B served on NVIDIA GPUs and deployed at production scale on the Inference.net platform.
The custom model uses the same brand-voice prompt Olive had already written, so the comparison was direct, and the evals confirmed that quality was maintained. It cut median end-to-end latency to under 600ms, about 4.6x faster than Claude Sonnet 4.6, brought p99 response time below one second, and lowered inference cost by approximately 70%. Olive now serves over 70,000 product scans a day on a much smaller, faster, and cheaper model that still delivers the responses its users have come to expect.
Challenges
Olive ran into several problems that are common when using frontier models at scale for task-specific inference.
Latency wasn't optimal
Olive's verdict is meant to appear the instant a user finishes a scan. When the model takes two to three seconds to respond, the user experiences that latency directly, which makes the product feel less snappy and less pleasant to use. On Claude Sonnet 4.6, median end-to-end responses sat near 2.7 seconds, and the first visible word could feel slow to appear. For a feature that people use repeatedly while shopping for many different items in the grocery aisle, that delay added up and detracted from the whole experience.
Costs climbed with usage
Olive's traffic grows with its user base, and frontier model pricing grows right along with it. Running every scan through a general-purpose model meant paying general-purpose prices for a narrow, repetitive task, with the bill rising every month. Very large models built to write code, draft complex outlines, and handle other open-ended work are simply not the right tool for a use case like this.
Switching frontier models did not fix it
Olive had already tried to improve things by moving from GPT-4.1 to Claude Sonnet 4.6, rewriting its prompt in the process. Quality held up, but latency got worse rather than better. It became clear that picking a different frontier model was not going to solve a problem that came from running a large general model for a small, specialized job.
Solution
Olive partnered with Inference.net to train a model built for this one task.
A specialized model on the existing prompt
The Inference.net team trained a 9B model on Olive's production data and kept Olive's existing brand-voice system prompt unchanged across the model switch. Because the prompt stayed the same, the before-and-after model evaluations were a clean comparison rather than an approximate one.
They did not ask us to rewrite anything or change how we work. We pointed our traffic at the new model and the verdicts read the same as before, just faster. — Gino Rey, Head of Engineering, Olive
The model is private to Olive. It runs on NVIDIA GPUs deployed on the Inference.net platform and configured to handle Olive's production load.
Built for streaming
All of Olive's requests stream their response, so the user sees the verdict build word by word. The custom model was tuned for exactly this pattern, which is where much of the perceived-speed improvement comes from. The first words show up almost immediately, and the rest follows fast enough to read in one motion.

Results
After the switch to the new model, every latency measure improved, and the gains were largest where they matter most for a live user experience.
End-to-end latency

- p50: 2,721ms → 591ms (4.6x faster)
- p75: 3,089ms → 671ms (4.6x faster)
- p90: 3,508ms → 755ms (4.6x faster)
- p95: 3,967ms → 816ms (4.9x faster)
- p99: 6,414ms → 998ms (6.4x faster)
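The speedup factors in the list above can be reproduced with simple arithmetic. The following is a minimal sketch that uses only the reported figures; the function and variable names are illustrative and are not part of Olive's or Inference.net's tooling.

```python
# End-to-end latency in milliseconds, (before, after), as reported above.
LATENCY_MS = {
    "p50": (2721, 591),
    "p75": (3089, 671),
    "p90": (3508, 755),
    "p95": (3967, 816),
    "p99": (6414, 998),
}


def speedup(before_ms, after_ms):
    """Return how many times faster the new latency is (before / after)."""
    return before_ms / after_ms


def percent_reduction(before_ms, after_ms):
    """Return the share of the original latency that was removed, in percent."""
    return 100.0 * (before_ms - after_ms) / before_ms


def summarize(latencies):
    """Map each percentile to (speedup factor, percent reduction), rounded to 0.1."""
    return {
        name: (
            round(speedup(before, after), 1),
            round(percent_reduction(before, after), 1),
        )
        for name, (before, after) in latencies.items()
    }


if __name__ == "__main__":
    for name, (factor, pct) in summarize(LATENCY_MS).items():
        print(f"{name}: {factor}x faster ({pct}% lower latency)")
```

Expressed as a reduction rather than a multiple, a 4.6x speedup corresponds to roughly 78% lower latency, which can make comparisons against percentage-based figures easier.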
The figures above use Claude Sonnet 4.6 as the baseline. Measured against the GPT-4.1 setup, median latency dropped 65% and p99 dropped 79%. On the custom model, 99% of responses complete in under a second, whereas the Sonnet p99 was above six seconds. Olive tried Sonnet as an experiment after GPT-4.1, but latency on that model was slower, not the improvement the team had hoped for.
Time to first token
The first visible word is what tells the user the app is working. On the custom model, it arrives in about a quarter of a second.
- TTFT p50: 1,130ms → 259ms
- TTFT p95: 2,002ms → 413ms
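Time to first token and end-to-end latency can be measured on any streaming response by timestamping the first chunk and the last. The sketch below is one way to do that; `fake_verdict_stream` is a hypothetical stand-in for a real streaming client, and its delays are made up for illustration rather than taken from Olive's deployment.

```python
import time
from typing import Iterable, Iterator, Tuple


def fake_verdict_stream(
    first_token_delay_s: float = 0.05,
    per_token_delay_s: float = 0.005,
) -> Iterator[str]:
    """Hypothetical stand-in for a streaming model response (not a real API)."""
    tokens = "The main issue here is sodium benzoate.".split()
    time.sleep(first_token_delay_s)  # simulated wait before the first token
    for index, token in enumerate(tokens):
        if index > 0:
            time.sleep(per_token_delay_s)  # simulated gap between tokens
        yield token + " "


def measure_stream(stream: Iterable[str]) -> Tuple[float, float, str]:
    """Consume a stream and return (ttft_ms, end_to_end_ms, full_text)."""
    start = time.perf_counter()
    ttft_s = None
    parts = []
    for chunk in stream:
        if ttft_s is None:
            # The first chunk marks the time to first token.
            ttft_s = time.perf_counter() - start
        parts.append(chunk)
    total_s = time.perf_counter() - start
    if ttft_s is None:
        # Empty stream: no token ever arrived, so fall back to total time.
        ttft_s = total_s
    return ttft_s * 1000.0, total_s * 1000.0, "".join(parts)


if __name__ == "__main__":
    ttft_ms, total_ms, text = measure_stream(fake_verdict_stream())
    print(f"TTFT: {ttft_ms:.0f}ms, end-to-end: {total_ms:.0f}ms")
    print(text.strip())
```

Because the generator does no work until it is iterated, starting the timer inside `measure_stream` captures the full wait before the first token.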
Streaming speed
Once the response starts, it streams roughly 4.8x faster than Sonnet and about 2.8x faster than GPT-4.1, so the full two-sentence verdict finishes in around 0.6 seconds instead of 2.7. That is the difference between a noticeable wait and an answer that simply appears.
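The streaming comparison against Sonnet can be sanity-checked from the published medians. This rough sketch rests on two assumptions that the figures themselves do not state: that the TTFT baseline is the same Claude Sonnet 4.6 setup as the end-to-end baseline, and that subtracting median TTFT from median end-to-end latency approximates the median streaming time. The names are illustrative.

```python
# Published medians in milliseconds. "baseline" is assumed to be Claude Sonnet 4.6.
E2E_P50_MS = {"baseline": 2721, "custom": 591}
TTFT_P50_MS = {"baseline": 1130, "custom": 259}


def generation_ms(e2e_ms, ttft_ms):
    """Approximate time spent streaming after the first token arrives."""
    return e2e_ms - ttft_ms


def streaming_speedup(e2e, ttft):
    """Ratio of baseline streaming time to custom-model streaming time."""
    baseline = generation_ms(e2e["baseline"], ttft["baseline"])
    custom = generation_ms(e2e["custom"], ttft["custom"])
    return baseline / custom


if __name__ == "__main__":
    base = generation_ms(E2E_P50_MS["baseline"], TTFT_P50_MS["baseline"])
    cust = generation_ms(E2E_P50_MS["custom"], TTFT_P50_MS["custom"])
    ratio = streaming_speedup(E2E_P50_MS, TTFT_P50_MS)
    print(f"baseline streaming: {base}ms, custom streaming: {cust}ms")
    print(f"approximate streaming speedup: {ratio:.1f}x")
```

Under those assumptions the streaming phase shrinks from about 1,591ms to about 332ms, a ratio of about 4.8, in line with the figure quoted for Sonnet.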

Cost
Moving off third-party frontier pricing onto a model Olive owns reduced inference cost by roughly 70%, while serving the same volume at better performance.
Reliability & Consistency
Across the full history of the project, refusals and malformed outputs do not appear among the top failure modes, which is meaningful reassurance for a feature that runs millions of times.
Once we were past the first week of tuning, it just ran. We stopped thinking about the model as something that might go down, and started thinking about it as part of our own stack. — Gino Rey, Head of Engineering, Olive
According to recent metrics, p50 has also held near 580ms and p95 near 800ms with almost no drift, even during the busiest hours. Predictable latency matters as much as low latency for a feature users reach for dozens of times a week.
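Drift of this kind can be tracked by recomputing percentiles over a recent window and comparing them with a reference value. The following is a minimal sketch using the nearest-rank percentile method; the function names and the sample window are hypothetical and do not describe Olive's actual monitoring.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def drift_pct(observed_ms, reference_ms):
    """Relative change of an observed latency against a reference, in percent."""
    return 100.0 * (observed_ms - reference_ms) / reference_ms


if __name__ == "__main__":
    # Hypothetical window of end-to-end latencies in milliseconds.
    window_ms = [560, 575, 580, 583, 590, 610, 640, 700, 790, 805]
    p50 = percentile(window_ms, 50)
    p95 = percentile(window_ms, 95)
    # Reference values are the p50 and p95 reported in the results list.
    print(f"p50 {p50}ms, drift {drift_pct(p50, 591):+.1f}%")
    print(f"p95 {p95}ms, drift {drift_pct(p95, 816):+.1f}%")
```

Nearest-rank always returns a latency that was actually observed, which keeps the numbers easy to trace back to individual requests. Interpolating methods give smoother values on small windows.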
What's Next
Olive serves its current volume with significant headroom on its existing deployment, leaving room to grow well beyond today's traffic before adding capacity. With the training pipeline already in place, Olive can retrain and ship improved versions of the model as its rubric evolves and the product expands, all without changing the integration.
For teams running a frontier model on a single, well-defined task in front of real users, Olive's path is a useful reference point. A smaller model trained for the job and served on Inference.net infrastructure can hold quality steady while making the experience faster and cheaper at the same time.