    How Gravity Ads Swapped Cerebras for Catalyst and Cut Latency 4x

    Native, high-intent sponsored suggestions inside AI conversations.

    Outcomes

    • 4x lower p90 latency (959ms to 240ms)
    • 10x lower inference cost
    • 70x smaller model (70B to 1B)
    • 3M requests per day
    Overview

    Gravity is an AI-native advertising platform that places high-intent sponsored suggestions inside AI conversations. The platform sits in the request path between users and LLM-powered chat surfaces, extracting purchase intent from each query and matching it against advertiser inventory in real time.

    They replaced a 70B model with a 1B model that matched or beat quality while cutting latency 4x and inference cost roughly 10x.

    Swapping from the 70B to Inference.net's 1B was a single config change on our side. The hard part was getting a model that small to actually hold quality at our throughput. They handled it, and it just worked.

    — Zach Oldham, CEO, Gravity
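To illustrate what a "single config change" swap can look like, here is a minimal sketch. It assumes an OpenAI-compatible endpoint; the base URLs, model names, and config shape are all invented for illustration and are not Gravity's or Inference.net's actual values.

```python
# Hypothetical provider configs; only the config the client reads changes.
OLD_CONFIG = {
    "base_url": "https://api.cerebras.example/v1",  # prior 70B provider (illustrative)
    "model": "llama-3.3-70b",
}

NEW_CONFIG = {
    "base_url": "https://api.inference.example/v1",  # custom 1B deployment (illustrative)
    "model": "gravity-intent-1b",
}

def make_client(config):
    """Bind a minimal client to one provider config; swapping providers
    means passing a different config dict, nothing else."""
    return {"endpoint": config["base_url"], "model": config["model"]}

client = make_client(NEW_CONFIG)
```

Because both deployments speak the same API shape, the application code above the config boundary does not change at all.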

    Key Outcomes

    • 4x lower p90 latency. p90 went from 959ms on the prior 70B setup to 240ms on the custom 1B model. Tail latency improved more than the median: p99 dropped from 1992ms to 352ms.
    • Roughly 10x lower inference cost at production volume.
    • 70x smaller model (70B to 1B), with better task performance on Gravity's eval set.
    • 3 million requests per day served on commodity Nvidia hardware.

    How Gravity Works

    Gravity inserts sponsored product suggestions directly into AI assistant responses when users show purchase intent.

    Because the call sits inline with a live conversation, every millisecond of inference latency is a millisecond the conversation feels slower than it should.
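A rough sketch of why the inference call dominates perceived latency: the model call sits synchronously in the conversation's hot path, so its duration adds directly to the user's wait. The function and argument names below are illustrative, not Gravity's actual code.

```python
import time

def handle_chat_turn(query, infer, match_inventory):
    """Inline ad call sketch: `infer` extracts purchase intent and
    `match_inventory` maps intent to advertiser inventory. The timed
    inference call is synchronous, so its latency is user-visible."""
    start = time.perf_counter()
    intent = infer(query)  # the model call sits in the request path
    latency_ms = (time.perf_counter() - start) * 1000
    suggestion = match_inventory(intent) if intent else None
    return suggestion, latency_ms
```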

    The Challenge

    Before working with Inference.net, Gravity ran a 70B open-source model on Cerebras. Two problems compounded.

    • Latency was too high for an inline conversational surface. p90 was 959ms, meaning one in ten requests waited nearly a second or more on the inference call alone. p99 was nearly two seconds. The slowest queries shaped the user-perceived feel of the entire surface.
    • Costs scaled linearly with traffic. A general-purpose 70B was far more capacity than the task required, but it was the only open-source model that met Gravity's quality bar at the time.

    The team evaluated other fine-tuning providers and self-hosted approaches before partnering with Inference.net.

    Why a 1B Could Beat a 70B

    Gravity's task is narrow and well-defined. Most of the capacity in a 70B general-purpose model is spent on capabilities the task does not need. A smaller model trained on production traffic, with the output schema locked in during training, can match or beat the 70B for this specific shape of work.

    The win is two-sided. The smaller model uses tens of times less compute per request, which is where the latency and cost gains come from. And because it is trained directly against the schema, it does not drift outside the format the way a generalist model can on edge cases.
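The case study describes a pipe-delimited output contract with fallback strings for edge cases but does not publish the actual schema. As a hedged illustration, a conformance check for such a contract might look like the following; the field count and fallback string are invented.

```python
# Illustrative contract: a fixed number of pipe-delimited fields,
# or a designated fallback string when no suggestion applies.
EXPECTED_FIELDS = 4       # assumed arity, not Gravity's real schema
FALLBACK = "NO_MATCH"     # assumed fallback string

def conforms(output: str) -> bool:
    """Accept the fallback string, or a row with the right arity
    and no empty fields."""
    if output == FALLBACK:
        return True
    parts = output.split("|")
    return len(parts) == EXPECTED_FIELDS and all(p.strip() for p in parts)
```

Training the model directly against a check like this, rather than prompting a generalist model to follow it, is what locks the format in.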

    The Solution: Catalyst Train and Deploy

    Gravity used Catalyst Train and Deploy, Inference.net's platform for distilling and serving custom models on Nvidia GPUs. Inference.net trained a 1B parameter custom model on Gravity's production traffic, with the existing 70B as a teacher. The training ran on Nvidia GPUs using Megatron Bridge, Inference.net's internal fork of Nvidia's Megatron training library.

    • Production traffic as training data. Real Gravity queries with teacher model outputs as the supervision signal, capturing the actual distribution Gravity sees in production rather than a synthetic dataset.
    • Schema-locked outputs. The model was trained against Gravity's exact pipe-delimited output contract, including the fallback strings for edge cases. The schema drift common with generalist models was eliminated.
    • Megatron Bridge on Nvidia. High-throughput distillation runs on commodity Nvidia hardware, replacing the specialty inference accelerator that powered the 70B.
    • Pre-deployment evaluation. The 1B was scored against the 70B teacher on Gravity's own eval set before the switch. The smaller model matched or beat the teacher on every metric Gravity tracks.
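The first bullet above, production traffic paired with teacher outputs, is the core of the distillation data pipeline. A minimal sketch of that pairing step, with invented function names standing in for the real logging and teacher-inference machinery:

```python
def build_distillation_set(logged_queries, teacher):
    """Pair real production queries with the 70B teacher's outputs
    to produce (prompt, completion) records for training the 1B student.
    `teacher` is a stand-in for a call to the existing 70B model."""
    dataset = []
    for query in logged_queries:
        target = teacher(query)  # supervision signal from the teacher
        dataset.append({"prompt": query, "completion": target})
    return dataset
```

Because the prompts come from live traffic, the student is trained on exactly the input distribution it will serve, which is why it can hold quality despite being 70x smaller.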

    Validation

    The team benchmarked the new deployment against the prior 70B-on-Cerebras setup across four dimensions:

    • Latency at p50, p75, p90, p95, and p99, at both peak and steady-state load
    • Task accuracy and schema conformance against Gravity's eval set
    • Sustained throughput at peak load
    • Cost per million tokens at production volume
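For readers unfamiliar with percentile latency figures like the ones reported below: a pXX value means XX percent of requests completed at or under that time. One common way to compute them from raw samples is the nearest-rank method, sketched here (illustrative, not Gravity's benchmark harness):

```python
import math

def percentile(samples_ms, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are at or below it."""
    ordered = sorted(samples_ms)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]
```

Tail percentiles (p95, p99) are dominated by the slowest handful of requests, which is why they can improve more than the median when per-request compute drops.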

    On Nvidia hardware, the 1B model improved latency at every percentile against the prior 70B-on-Cerebras setup. The tail of the distribution improved more than the median: p99 dropped 5.7x, while p50 dropped 2.4x.

    • p50: 388ms to 161ms (2.4x faster)
    • p75: 631ms to 192ms (3.3x faster)
    • p90: 959ms to 240ms (4.0x faster)
    • p95: 1327ms to 275ms (4.8x faster)
    • p99: 1992ms to 352ms (5.7x faster)

    Inference cost dropped roughly 10x at Gravity's production volume against the prior 70B-on-Cerebras setup, and the 1B matched or beat the 70B teacher on Gravity's eval set.

    Business Impact

    Our p90 latency dropped from 959ms to 240ms, and our cost per request dropped at the same time. That changed what we could ship as a business. We turned on features we had been holding back, and we stopped having to throttle traffic during spikes. I would recommend Inference.net to anyone running LLMs at this kind of scale.

    — Leo Martinez, CTO, Gravity

    Lower inline latency directly improves the conversational experience for end users on Gravity's partner platforms. Lower per-request cost lets Gravity support more advertiser surfaces and more LLM platforms without breaking its unit economics. The 70x parameter reduction also moves Gravity off specialty inference hardware and onto commodity Nvidia, which the broader market supplies at scale.

    What's Next

    Gravity continues to scale on the custom model and is working with Inference.net on the next generation of task-specific distillations as the platform expands. The Megatron Bridge training pipeline is in place to retrain the model as Gravity's traffic patterns evolve and as new advertiser categories come online.

    CONTACT

    Meet with our research team

    Schedule a call with our research team to learn more about how Specialized Language Models can cut costs and improve performance.