    How Olive Delivers Real-Time Food Verdicts on a Model It Owns

    Olive is a consumer food transparency app that helps families scan groceries, understand ingredient risks, and make better food choices in real time.

    Outcomes

    • 78% lower p50 latency
    • p99 response time: 6.4 s → under 1 s
    • 591 ms median end-to-end latency
    • ~0.25 s to the first streamed word
    • 70% cheaper inference

    Company Overview

    Olive is a consumer food transparency app. A user scans a product by its barcode or label, and the app returns a score from 0 to 100 along with a short verdict that flags any potential issues with consuming that product.

    The score itself comes from Olive's knowledge system, which weighs factors such as seed oils, additives, processing level, organic sourcing, and lab-tested toxins. The model handles the part the user actually reads: turning that score and the ingredient breakdown into a short message that tells the user which ingredients, if any, to be wary of.

    A typical verdict reads like this:

    The main issue here is sodium benzoate, a synthetic preservative that has no real reason to be in a ferment or brine. Look for a version where the ingredient list stops at salt, vinegar, and spices.

    Because the verdict streams onto the screen word by word while the user waits, response speed is essential to a seamless user experience.

    Results at a Glance

    Using Inference.net, Olive replaced the third-party frontier models it had been running, first GPT-4.1 and later Claude Sonnet 4.6, with a custom model tuned specifically for its product. The new model is a fine-tuned Qwen 3.5 9B served on NVIDIA GPUs and deployed at production scale on the Inference.net platform.

    The custom model uses the same brand-voice prompt Olive had already written, so the comparison was direct, and the evals confirmed that quality was maintained. The new model cut median end-to-end latency to under 600ms, about 4.6x faster, while p99 response time fell below a second and inference cost dropped by approximately 70%. Olive now serves over 70,000 product scans a day on a much smaller, faster, and cheaper model that still delivers the responses its users have grown to love.

    Challenges

    Olive ran into several problems that are common when using frontier models at scale for task-specific inference.

    Latency wasn't optimal

    Olive's verdict appears the instant a user finishes a scan. When the model takes two to three seconds to respond, the user experiences that latency directly, which makes the product feel less snappy and less pleasant to use. On Claude Sonnet 4.6, median end-to-end responses sat near 2.7 seconds, and the first visible word could feel slow to appear. For a feature that people use repeatedly while picking up many different items in the grocery aisle, that delay added up and detracted from the whole experience.

    Costs climbed with usage

    Olive's traffic grows with its user base, and frontier model pricing grows right along with it. Running every scan through a general-purpose model meant paying general-purpose prices for a narrow, repetitive task, with the bill rising every month. Huge models capable of writing code, drafting complex outlines, and handling other open-ended work are simply not the right tool for a use case like this.

    Switching frontier models did not fix it

    Olive had already tried to improve things by moving from GPT-4.1 to Claude Sonnet 4.6, rewriting its prompt in the process. Quality held up, but latency got worse rather than better. It became clear that picking a different frontier model was not going to solve a problem that came from running a large general model for a small, specialized job.

    Solution

    Olive partnered with Inference.net to train a model built for this one task.

    A specialized model on the existing prompt

    The Inference.net team trained a 9B model on Olive's production data and kept Olive's existing brand-voice system prompt unchanged across the model switch. Because the prompt stayed the same, the before-and-after model evaluations were a clean comparison rather than an approximate one.

    They did not ask us to rewrite anything or change how we work. We pointed our traffic at the new model and the verdicts read the same as before, just faster. — Gino Rey, Head of Engineering, Olive

    The model is private to Olive. It runs on NVIDIA GPUs deployed on the Inference.net platform and configured to handle Olive's production load.

    Built for streaming

    All of Olive's requests stream their response, so the user sees the verdict build word by word. The custom model was tuned for exactly this pattern, which is where much of the perceived-speed improvement comes from. The first words show up almost immediately, and the rest follows fast enough to read in one motion.

    [Figure: Olive request flow diagram]

    Results

    After the switch to the new model, every latency measure improved, and the gains were largest where they matter most for a live user experience.

    End-to-end latency

    [Figure: Olive end-to-end latency graph]

    • p50: 2,721ms → 591ms (4.6x faster)

    • p75: 3,089ms → 671ms (4.6x faster)

    • p90: 3,508ms → 755ms (4.6x faster)

    • p95: 3,967ms → 816ms (4.9x faster)

    • p99: 6,414ms → 998ms (6.4x faster)

    Measured against the GPT-4.1 setup, median latency dropped 65% and p99 dropped 79%. The slowest responses now land under a second, where they previously could take more than 6 seconds. Olive tried Claude Sonnet 4.6 after GPT-4.1 as an experiment, but it turned out slower rather than faster, which was not the improvement the team had hoped for.
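
    The speedup factors in the percentile list can be reproduced from the before and after values. The short sketch below does that arithmetic; the names LATENCIES_MS, speedup, and percent_reduction are illustrative and not part of Olive's or Inference.net's tooling.

```python
# Before/after end-to-end latencies in milliseconds, copied from the list above.
LATENCIES_MS = {
    "p50": (2721, 591),
    "p75": (3089, 671),
    "p90": (3508, 755),
    "p95": (3967, 816),
    "p99": (6414, 998),
}


def speedup(before_ms: float, after_ms: float) -> float:
    """How many times faster the new latency is than the old one."""
    return before_ms / after_ms


def percent_reduction(before_ms: float, after_ms: float) -> float:
    """Share of the original latency that was removed, in percent."""
    return (before_ms - after_ms) / before_ms * 100


for name, (before, after) in LATENCIES_MS.items():
    print(
        f"{name}: {speedup(before, after):.1f}x faster, "
        f"{percent_reduction(before, after):.0f}% lower"
    )
```

    A 4.6x speedup and a 78% reduction describe the same p50 change in two ways: the first divides old by new, and the second expresses the removed time as a share of the old value.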

    Time to first token

    The first visible word is what tells the user the app is working. On the custom model, it arrives in about a quarter of a second.

    • TTFT p50: 1,130ms → 259ms

    • TTFT p95: 2,002ms → 413ms
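
    As a rough illustration of how time to first token and end-to-end latency can be measured on a streamed response, here is a minimal sketch. It assumes the response arrives as an iterable of text chunks; measure_stream and fake_stream are hypothetical helpers, not an Inference.net API, and a real client would start the clock just before sending the request.

```python
import time
from typing import Callable, Iterable, Iterator, List, Optional, Tuple


def measure_stream(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float, str]:
    """Consume a streamed response and time it.

    Returns (ttft_seconds, total_seconds, full_text). TTFT is taken at the
    first non-empty chunk, since that is the first text a user could see.
    """
    start = clock()
    ttft: Optional[float] = None
    parts: List[str] = []
    for chunk in chunks:
        if chunk and ttft is None:
            ttft = clock() - start
        parts.append(chunk)
    total = clock() - start
    if ttft is None:
        ttft = total  # no visible text ever arrived
    return ttft, total, "".join(parts)


def fake_stream(words: List[str], first_delay_s: float, gap_s: float) -> Iterator[str]:
    """Hypothetical stand-in for a model's streaming response."""
    time.sleep(first_delay_s)
    for i, word in enumerate(words):
        if i:
            time.sleep(gap_s)
        yield word + " "


ttft, total, text = measure_stream(
    fake_stream("Look for a shorter ingredient list".split(), 0.05, 0.01)
)
print(f"TTFT {ttft * 1000:.0f} ms, total {total * 1000:.0f} ms")
```

    The clock is injectable so the timing logic can be checked deterministically without sleeping.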

    Streaming speed

    Once the response starts, it streams roughly 4.8x faster than Sonnet and about 2.8x faster than GPT-4.1, so the full two-sentence verdict finishes in around 0.6 seconds instead of 2.7. That is the difference between a noticeable wait and an answer that simply appears.
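
    The 4.8x figure is consistent with the medians reported in this section if streaming time is approximated as end-to-end latency minus time to first token. The sketch below makes two assumptions that the source does not state: that the TTFT baseline comes from the same Sonnet setup as the end-to-end baseline, and that subtracting medians is an acceptable approximation.

```python
# Medians in milliseconds, copied from the Results section.
E2E_P50_BEFORE, E2E_P50_AFTER = 2721, 591
TTFT_P50_BEFORE, TTFT_P50_AFTER = 1130, 259

# Assumption: streaming time is roughly end-to-end latency minus TTFT.
# A difference of medians only approximates the median of differences.
stream_before = E2E_P50_BEFORE - TTFT_P50_BEFORE  # 1591 ms
stream_after = E2E_P50_AFTER - TTFT_P50_AFTER     # 332 ms
ratio = stream_before / stream_after

print(f"streaming phase: {stream_before} ms -> {stream_after} ms ({ratio:.1f}x)")
# prints "streaming phase: 1591 ms -> 332 ms (4.8x)"
```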

    [Figure: Olive TTFT and streaming speed graph]

    Cost

    Moving off third-party frontier pricing onto a model Olive owns reduced inference cost by roughly 70%, while serving the same volume at better performance.

    Reliability & Consistency

    Across the full history of the project, there are no refusal or malformed-output errors among the top failure modes, which is meaningful reassurance for a feature that runs millions of times.

    Once we were past the first week of tuning, it just ran. We stopped thinking about the model as something that might go down, and started thinking about it as part of our own stack. — Gino Rey, Head of Engineering, Olive

    According to recent metrics, p50 has also held near 580ms and p95 near 800ms with almost no drift, even during the busiest hours. Predictable latency matters as much as low latency for a feature users reach for dozens of times a week.
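
    One simple way to check a claim like this is to compute percentiles per hour and compare the spread. The sketch below uses the nearest-rank method; percentile and max_drift are illustrative helpers, and the hourly samples are made-up values, not Olive's data.

```python
import math
from typing import Dict, List, Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of the data at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def max_drift(hourly: Dict[str, List[float]], p: float) -> float:
    """Spread between the highest and lowest hourly value of percentile p."""
    values = [percentile(samples, p) for samples in hourly.values()]
    return max(values) - min(values)


# Hypothetical latency samples in milliseconds, grouped by hour of day.
hourly_ms = {
    "09:00": [570, 580, 590],
    "18:00": [575, 585, 600],
}
print(f"p50 drift across hours: {max_drift(hourly_ms, 50)} ms")
```

    A small drift value across quiet and busy hours is what "predictable latency" looks like in practice; production monitoring would use far more samples per window.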

    What's Next

    Olive serves its current volume with significant headroom on its existing deployment, leaving room to grow well beyond today's traffic before it needs to add capacity. With the training pipeline already in place, Olive can retrain and ship improved versions of the model as its rubric evolves and the product expands, all without changing the integration.

    For teams running a frontier model on a single, well-defined task in front of real users, Olive's path is a useful reference point. A smaller model trained for the job, served on Inference.net infrastructure, can hold quality steady while making the experience faster and cheaper at the same time.
