
    How Wynd Labs Processed 1 Billion Videos at 17x Lower Cost with ClipTagger-12b


    Wynd Labs builds Grass.io, a decentralized data network (3M+ nodes) aggregating public-web content into usable AI training and search datasets.

    Outcomes

    17x cheaper inference
    Billions of frames annotated
    3x lower latency

    Background

    Wynd Labs operates Grass.io, a decentralized network with 3M+ nodes collecting public web video data. Their goal was to build a video clip search engine.


    To power this, Wynd needed to transform their billion-video corpus into structured, queryable data. They partnered with Inference.net to build ClipTagger, a custom vision-language model that could process frames at massive scale while maintaining strict schema compliance and economic viability.

    Challenges

    Wynd Labs wanted to make its collection of over a billion public-web videos searchable. Their use case required strict, fixed JSON per frame (objects, actions, scene, production attributes), which generic captioning models could not follow, so they turned to LLMs.
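The "strict, fixed JSON per frame" requirement can be sketched as a validation step. The field names below are illustrative assumptions, not Wynd's actual production schema:

```python
import json

# Illustrative per-frame schema: field names are hypothetical, but show
# the strict, fixed structure (objects, actions, scene, production
# attributes) that generic captioning models failed to follow.
REQUIRED_FIELDS = {
    "description": str,         # natural-language caption of the frame
    "objects": list,            # visible objects, e.g. ["car", "person"]
    "actions": list,            # observed actions, e.g. ["driving"]
    "environment": str,         # scene/setting, e.g. "city street, daytime"
    "production_quality": str,  # e.g. "professional", "user-generated"
}

def validate_frame(raw: str) -> dict:
    """Parse a model response and enforce the fixed schema."""
    frame = json.loads(raw)  # raises on malformed JSON
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in frame:
            raise ValueError(f"missing field: {field}")
        if not isinstance(frame[field], expected_type):
            raise ValueError(f"wrong type for field: {field}")
    return frame
```

At billion-frame scale, every response that fails a check like this must be retried or discarded, which is why schema compliance was a hard requirement rather than a nice-to-have.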

    Frontier models were accurate and followed the schema reliably, but were prohibitively expensive (~$5,850 per 1M frames for Claude 4 Sonnet). Wynd experimented with multiple open-source models, but all of them hallucinated and produced generally poor captions.
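At a billion-frame scale, the quoted frontier price compounds quickly. A back-of-envelope using the ~$5,850 per 1M frames figure and the 17x reduction reported above:

```python
# Quoted frontier rate and the reported 17x cost reduction.
frontier_cost_per_million = 5_850.0                          # Claude 4 Sonnet, per 1M frames
distilled_cost_per_million = frontier_cost_per_million / 17  # 17x cheaper inference

frames = 1_000_000_000  # one billion frames
frontier_total = frames / 1_000_000 * frontier_cost_per_million    # $5.85M
distilled_total = frames / 1_000_000 * distilled_cost_per_million  # ~$344K
```

The gap between roughly $5.85M and $344K for a single pass over the corpus is what made a custom model worth building.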

    Model sizing was a trade-off: a 7B model couldn't meet their quality bar, while a 27B model would be prohibitively expensive at their scale (and still fell short on quality). Most serverless LLM APIs also weren't designed for large asynchronous workloads, so Wynd initially approached Inference.net to scale SLM (Small Language Model) inference through our batch API.

    However, it quickly became clear that simply choosing the best available small open-source model was not enough; Wynd needed a new approach.

    Solutions

    To reach Wynd's cost targets, we proposed knowledge distillation—a technique that transfers the capabilities of a large, intelligent teacher model into a smaller, more efficient student model. Wynd curated 2M diverse keyframes from their corpus and finalized both the prompt structure and JSON schema before training began.
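In sequence-level distillation, the teacher's structured outputs become supervised targets for the student. A minimal sketch of the data-preparation step (function and field names are hypothetical, not Inference.net's actual pipeline):

```python
def build_distillation_dataset(frames, teacher_annotate, prompt):
    """Pair each curated keyframe with the teacher model's JSON output.

    `teacher_annotate` stands in for a call to the large frontier teacher;
    its responses become fine-tuning targets for the smaller student.
    """
    dataset = []
    for frame in frames:
        target_json = teacher_annotate(prompt, frame)  # expensive teacher call
        dataset.append({"prompt": prompt, "frame": frame, "target": target_json})
    return dataset
```

The student is then fine-tuned to reproduce each `target` given the prompt and frame, transferring the teacher's schema-following behavior at a fraction of the inference cost. Freezing the prompt and JSON schema before training matters because the student learns that exact input/output format.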

    After settling on the Gemma family, we experimented with several model sizes: 7B, 12B, and 27B parameters. The 12B model proved the best middle ground between quality and cost. To fit the model on a single 80GB GPU, we used FP8 quantization, which delivered significant latency and cost benefits without measurable quality loss.
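The memory savings from FP8 are straightforward: weights drop from 2 bytes per parameter (BF16) to 1. A rough estimate for a 12B-parameter model, ignoring KV cache and activation memory:

```python
params = 12e9  # 12B parameters

bf16_weights_gb = params * 2 / 1e9  # 2 bytes/param -> 24 GB of weights
fp8_weights_gb = params * 1 / 1e9   # 1 byte/param  -> 12 GB of weights
```

On an 80GB GPU, the memory freed by halving the weight footprint goes to KV cache and larger batch sizes, which is where the latency and throughput gains come from.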

    We conducted rigorous evaluations of schema compliance and caption quality. We shared a dashboard with LLM-as-a-Judge scores, teacher–student diffs, schema error rates, and p50/p95/p99 latencies, after which the model was ready for joint sign-off.

    To scale to billions of video frames, our batch API was the natural fit: Wynd processed their entire dataset at their own pace, gradually scaling up batch sizes.
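Gradually scaling up batches amounts to chunking the corpus and submitting chunks asynchronously; a sketch (the submission step is described in comments because the actual batch API calls are not shown in this case study):

```python
def chunk(items, size):
    """Split a frame corpus into fixed-size batches for async submission."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

frame_ids = [f"frame_{i}" for i in range(10)]
batches = list(chunk(frame_ids, 4))
# Each batch would be submitted to the batch API and polled for results,
# with the batch size ramped up as throughput is validated.
```

Because batches are independent, failures are retried per-batch rather than restarting the whole corpus, which is what makes billion-frame jobs tractable.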

    What's next

    Wynd is currently rolling out search.video, the video clip search engine they built with ClipTagger, to select partners. They continue to use Inference.net to process additional billions of frames across their expanding corpus.

    Own your model. Scale with confidence.

    From custom training to seamless deployment, we help you launch specialist LLMs that outperform generic models, adapt autonomously, and run reliably at scale—so you can move faster, deliver smarter, and stay ahead.

    Talk to an Engineer