    How Olive Delivers Real-Time Food Verdicts on a Model It Owns

    Olive is a consumer food transparency app that helps families scan groceries, understand ingredient risks, and make better food choices in real time.

    Outcomes

    78% lower p50 latency
    Company Overview

    Olive is a consumer food transparency app. A user scans a product's barcode, its label, or a menu, and the app returns a score from 0 to 100 along with a short, plain-language verdict on whether the product is worth eating.

    The score itself comes from Olive's own rubric, which weighs factors like seed oils, additives, processing level, organic sourcing, and lab-tested toxins. The model handles the part the user actually reads: turning that score and the ingredient breakdown into a two-sentence verdict written in the voice of a knowledgeable friend, delivered live as the result loads.

    A typical verdict reads like this:

    The main issue is sodium benzoate, a synthetic preservative that has no real reason to be in a ferment or brine. Look for a version where the ingredient list stops at salt, vinegar, and spices.

    Because the verdict streams onto the screen word by word while the user waits, response speed is not a backend detail for Olive. It is the experience.

    Results at a Glance

    Using Inference.net, Olive replaced the third-party frontier models it had been running, first GPT-4.1 and later Claude Sonnet 4.6, with a custom model trained specifically for its product. The new model, olive/analyze-product-v1, is a 9B-parameter fine-tune served on NVIDIA GPUs deployed on the Inference.net platform.

    The custom model uses the same brand-voice prompt Olive had already written, so the comparison is direct. It cut median end-to-end latency to under 600ms, brought p99 response time below one second, lowered inference cost by roughly 70%, and held a trailing-7-day error rate of 0.06%. Olive now serves around 70,000 product scans a day on infrastructure it controls.

    [GRAPH: Daily request volume from December 2025 to present, showing the handoff from GPT-4.1 to Claude Sonnet to olive/analyze-product-v1, with the latency drop marked at cutover.]

    Challenges

    Olive ran into three problems that are familiar to any team putting a frontier model directly in front of users.

    Latency was the product, and it was too slow

    Olive's verdict is meant to appear the moment a user finishes a scan. When the model takes two to three seconds to respond, the user is left watching a spinner at the exact moment they are deciding whether to trust the app. On Claude Sonnet 4.6, median end-to-end responses sat near 2.7 seconds, and the first visible word took over a second to arrive. For a feature meant to feel like a quick gut check in the grocery aisle, that delay worked against the whole experience.

    Our entire pitch is that you scan something and you just know. Every second the user spends watching a loading state is a second they start to wonder why this is taking so long. We needed the answer to feel immediate. — [Name, Title], Olive

    Costs climbed with usage

    Olive's traffic grows with its user base, and frontier model pricing grows right along with it. Running every scan through a general-purpose model meant paying general-purpose prices for a narrow, repetitive task, with the bill rising every month.

    Switching frontier models did not fix it

    Olive had already tried to improve things by moving from GPT-4.1 to Claude Sonnet 4.6, rewriting its prompt in the process. Quality held up, but latency got worse rather than better. It became clear that picking a different frontier model was not going to solve a problem that came from running a large general model for a small, specialized job.

    Solution

    Olive partnered with Inference.net to train a model built for this one task.

    A specialized model on the existing prompt

    The Inference.net team trained a 9B model on Olive's production data and kept Olive's existing brand-voice prompt unchanged. Prompt hashes confirm that the large majority of traffic runs on exactly the same system prompt Sonnet used, which made the before-and-after comparison clean rather than approximate.
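    A prompt-hash check of this kind can be sketched as follows. This is a minimal illustration, not Olive's or Inference.net's actual tooling: the function names, the whitespace trimming, and the sample prompts are all assumptions made for the example.

```python
import hashlib
from collections import Counter


def prompt_hash(system_prompt: str) -> str:
    """Return a SHA-256 hex digest of the prompt after trimming outer whitespace."""
    return hashlib.sha256(system_prompt.strip().encode("utf-8")).hexdigest()


def share_on_reference_prompt(logged_hashes: list[str], reference_hash: str) -> float:
    """Fraction of logged requests whose prompt hash equals the reference hash."""
    if not logged_hashes:
        return 0.0
    counts = Counter(logged_hashes)
    return counts[reference_hash] / len(logged_hashes)


# Hypothetical data: placeholder prompt text, not Olive's actual prompt.
reference = prompt_hash("You are a knowledgeable friend. Write a two-sentence verdict.")
logged = [reference] * 97 + [prompt_hash("an older prompt variant")] * 3
share = share_on_reference_prompt(logged, reference)
print(f"{share:.0%} of sampled requests use the reference prompt")
```

    Comparing hashes rather than raw prompt text keeps the check cheap and avoids storing the prompt in every log line. Any edit to the prompt, however small, produces a different hash.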

    They did not ask us to rewrite anything or change how we work. We pointed our traffic at the new model and the verdicts read the same as before, just faster. — [Name, Title], Olive

    The model takes text and image input with a 32K context window, and it was trained on a Qwen base. It is private to Olive and runs on NVIDIA GPUs deployed on the Inference.net platform, with automatic failover configured so that serving capacity has no single point of failure.

    Built for streaming

    More than 99% of Olive's requests stream their response, so the user sees the verdict build word by word. The custom model was tuned for exactly this pattern, which is where most of the perceived-speed improvement comes from. The first words show up almost immediately, and the rest follows fast enough to read in one motion.
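    For a streamed response, the two numbers that matter are time to first token and total time. The sketch below shows one way to measure both on the client side. It assumes the response arrives as an iterable of text chunks; `fake_stream` is a hypothetical stand-in for a real streaming client, and its delays are made up.

```python
import time
from typing import Callable, Iterable, Iterator


def measure_stream(chunks: Iterable[str], clock: Callable[[], float] = time.perf_counter) -> dict:
    """Consume a text stream; return TTFT and total latency in milliseconds plus the text."""
    start = clock()
    first_chunk_at = None
    parts = []
    for chunk in chunks:
        # Record the arrival time of the first non-empty chunk only.
        if first_chunk_at is None and chunk:
            first_chunk_at = clock()
        parts.append(chunk)
    end = clock()
    ttft_ms = None if first_chunk_at is None else (first_chunk_at - start) * 1000
    return {"ttft_ms": ttft_ms, "total_ms": (end - start) * 1000, "text": "".join(parts)}


def fake_stream(words: list[str], first_delay_s: float, gap_s: float) -> Iterator[str]:
    """Stand-in for a streaming API response (hypothetical; not a real client)."""
    time.sleep(first_delay_s)
    for i, word in enumerate(words):
        if i:
            time.sleep(gap_s)
        yield word + " "


result = measure_stream(fake_stream("Look for a shorter ingredient list.".split(), 0.05, 0.005))
print(f"TTFT {result['ttft_ms']:.0f} ms, total {result['total_ms']:.0f} ms")
```

    Passing the clock as a parameter makes the measurement easy to test with fixed timestamps.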

    [GRAPH: Request flow. Olive app, to Olive backend, to Inference.net API, to olive/analyze-product-v1, to streamed verdict. Inputs labeled: product data, Olive score, sub-scores, detected issues. Output labeled: short streamed verdict.]

    Results

    After the cutover, every latency measure improved, and the gains were largest where they matter most for a live experience.

    End-to-end latency

    [GRAPH: End-to-end latency by percentile, previous frontier models vs. olive/analyze-product-v1. Highlight p99 under one second.]

    • p50: 2,721ms → 591ms (4.6x faster)
    • p75: 3,089ms → 671ms (4.6x faster)
    • p90: 3,508ms → 755ms (4.6x faster)
    • p95: 3,967ms → 816ms (4.9x faster)
    • p99: 6,414ms → 998ms (6.4x faster)
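    The speedup multiples above, and the 78% p50 reduction quoted in the outcomes, follow directly from the before and after figures. A short sketch of the arithmetic, with helper names chosen for this example:

```python
# Percentile latencies in milliseconds, copied from the list above: (before, after).
LATENCY_MS = {
    "p50": (2721, 591),
    "p75": (3089, 671),
    "p90": (3508, 755),
    "p95": (3967, 816),
    "p99": (6414, 998),
}


def speedup(before_ms: float, after_ms: float) -> float:
    """How many times faster the new latency is than the old one."""
    return before_ms / after_ms


def reduction_pct(before_ms: float, after_ms: float) -> float:
    """Percentage by which latency dropped relative to the old value."""
    return (1 - after_ms / before_ms) * 100


for name, (before, after) in LATENCY_MS.items():
    print(f"{name}: {speedup(before, after):.1f}x faster, {reduction_pct(before, after):.0f}% lower")
```

    The two views describe the same change: a 4.6x speedup at p50 is a 78% reduction in latency.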

    The percentile figures above use Claude Sonnet 4.6 as the baseline. Measured against the original GPT-4.1 setup, median latency dropped 65% and p99 dropped 79%. Either way, 99% of responses now land under a second.

    Time to first token

    The first visible word is what tells the user the app is working. On the custom model, it arrives in about a quarter of a second.

    • TTFT p50: 1,130ms → 259ms
    • TTFT p95: 2,002ms → 413ms

    Streaming speed

    Once the response starts, it streams roughly 4.8x faster than Sonnet and about 2.8x faster than GPT-4.1, so the full two-sentence verdict finishes in around 0.6 seconds instead of 2.7. That is the difference between a noticeable wait and an answer that simply appears.

    [GRAPH: Time to first token and decode throughput, previous models vs. custom model. Annotate a typical Olive verdict at roughly 55 output tokens.]
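    The streaming-speed multiple can be approximated from numbers already reported here. The sketch below assumes decode time is roughly the median end-to-end latency minus the median time to first token, and it uses the 55-token typical verdict from the chart annotation. Subtracting medians is only an approximation, since percentiles of different measurements do not subtract exactly.

```python
def decode_tokens_per_second(e2e_ms: float, ttft_ms: float, output_tokens: int) -> float:
    """Approximate decode throughput: output tokens over the time after the first token."""
    decode_ms = e2e_ms - ttft_ms
    if decode_ms <= 0:
        raise ValueError("end-to-end latency must exceed time to first token")
    return output_tokens / (decode_ms / 1000)


TOKENS = 55  # typical verdict length, per the chart annotation
before = decode_tokens_per_second(2721, 1130, TOKENS)  # Claude Sonnet 4.6, p50 figures
after = decode_tokens_per_second(591, 259, TOKENS)     # custom model, p50 figures
print(f"{before:.0f} -> {after:.0f} tokens/s ({after / before:.1f}x)")
```

    Under these assumptions the ratio comes out close to the 4.8x figure reported against Sonnet.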

    Cost

    Moving off third-party frontier pricing onto a model Olive owns reduced inference cost by roughly 70%, while serving the same volume at better performance.

    Reliability

    The custom model's error rate over the last 7 days is 0.06%, and nearly all of those rare errors are operational issues rather than model-quality issues. Across the full history of the project there are no refusal or malformed-output errors among the top failure modes, which is a meaningful reassurance for a feature that runs millions of times.

    Once we were past the first week of tuning, it just ran. We stopped thinking about the model as something that might go down, and started thinking about it as part of our own stack. — [Name, Title], Olive

    Latency that stays put

    [GRAPH: Hourly p50 and p95 over the last 18 days, showing a flat line through traffic peaks.]

    Across 18 days of production traffic, p50 has held near 580ms and p95 near 800ms with almost no drift, even during the busiest hours. Predictable latency matters as much as low latency for a feature users reach for dozens of times a week.
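    A stability check like this one amounts to bucketing request latencies by hour and tracking each bucket's percentiles. The sketch below is an illustration only: it uses the nearest-rank percentile definition, which may differ from the method behind the chart, and the sample data is made up.

```python
import math
from collections import defaultdict


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def hourly_percentiles(samples, pcts=(50, 95)):
    """samples: iterable of (unix_seconds, latency_ms). Returns {hour_start: {pct: value}}."""
    buckets = defaultdict(list)
    for ts, latency_ms in samples:
        buckets[int(ts) // 3600 * 3600].append(latency_ms)
    return {hour: {p: percentile(vals, p) for p in pcts} for hour, vals in sorted(buckets.items())}


# Hypothetical samples: two hours of made-up latencies, not Olive's data.
samples = [(0, 570), (60, 590), (120, 610), (180, 800),
           (3600, 575), (3700, 585), (3800, 600), (3900, 790)]
by_hour = hourly_percentiles(samples)
p50_values = [stats[50] for stats in by_hour.values()]
print("p50 drift across hours:", max(p50_values) - min(p50_values), "ms")
```

    A flat line in the chart corresponds to a small spread between the highest and lowest hourly values.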

    What's Next

    Olive serves its current volume with significant headroom on its existing deployment, leaving room to grow well beyond today's traffic before adding capacity. With the training pipeline already in place, Olive can retrain and ship improved versions of the model as its rubric evolves and the product expands, all without changing the integration.

    For teams running a frontier model on a single, well-defined task in front of real users, Olive's path is a useful reference point. A smaller model trained for the job, served on infrastructure you control, can hold quality steady while making the experience faster and cheaper at the same time.
