    Jun 20, 2026

    Best Web Scraping API for E-Commerce Product Data

    Inference Research

    Why "Best Web Scraping API" Is the Wrong Question for E-Commerce

    You picked a web scraping API, pointed it at a product page, and it worked. The request came back fast, past the anti-bot wall, with the full page in hand. Then you tried to load it into your catalog and hit the real problem: the price is a string wedged inside markup, the sale price and list price are tangled together, availability is nowhere obvious, and the size and color options live in a script tag. You have the page. You do not have a record.

    That gap is why the search for the best web scraping API for e-commerce sites usually starts in the wrong place. "Scraping" bundles two different jobs. The first is retrieval: getting the rendered page past proxies and rate limits. The second is extraction: turning that page into typed fields a downstream system can trust. Most APIs marketed as scraping APIs are graded on the first job, where they are genuinely good. The job that decides whether your pipeline produces anything useful is the second one.

    So the better question is not which API fetches pages best. It is which approach produces a clean product record at a unit cost you can scale to your whole catalog. This guide treats those two jobs separately and stays on the second. It covers the difference between a crawl/fetch API and an extraction API, the fields e-commerce extraction actually needs, a four-axis rubric for grading any candidate, an example product schema, a runnable extraction call with validation, a cost framework with the actual math, recommended stack patterns, and the cases where an extraction layer is the wrong tool.

    Read time: 9 minutes


    Crawl/Fetch API vs Extraction API

    A crawl/fetch API discovers and downloads pages: it handles proxy rotation, JavaScript rendering, and rate limits, and it returns the page as HTML, markdown, or a screenshot. An extraction API does something different. It takes page content plus a schema you define and returns typed JSON that conforms to that schema. One succeeds when you get the bytes. The other succeeds when the price is a number and the required fields are present.

    Some fetch APIs bolt a schema-to-JSON option onto their scrape endpoint, which blurs the line. The distinction that matters is whether extraction is a trained, first-class capability or a feature layered on top of a retrieval product. Schematron is a family of long-context models trained specifically to extract clean, typed JSON from messy HTML, with the schema as the only input that steers it. If your bottleneck is JSON quality rather than page retrieval, that is a different tool than your crawler, and the two belong in the same pipeline rather than in competition. Our guide on when you need JSON, not markdown walks through that split in more detail.

    It helps to picture a web data pipeline as a stack of distinct layers, not one tool doing everything.

    Figure 1: The Web Data Stack, Layer by Layer

    Each layer fails in its own way and improves on its own schedule. Fetching is the rendering, proxy, and rate-limit problem. Cleaning strips scripts, styles, and boilerplate so the model sees signal instead of noise; lxml is the recommended cleaner for Schematron because it matches how the training data was preprocessed. Extraction maps page content to your fields. Validation enforces types and required fields before anything is stored. When you keep these separate, you can swap crawlers without touching schemas, and re-run extraction on cached HTML without re-fetching. The table below splits the responsibilities, and schema-guided extraction is the job in the third column.

    Responsibility                     | Crawl/Fetch API | Extraction API
    Discover pages                     | Yes             | No
    Render JavaScript                  | Yes             | No
    Rotate proxies, solve CAPTCHAs     | Yes             | No
    Return page bytes (HTML/markdown)  | Yes             | No
    Return typed fields                | No              | Yes
    Enforce your schema                | No              | Yes
    Validate types on output           | No              | Yes
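
    One way to keep the layers separate in code is to give each its own function and compose them. The sketch below is illustrative only: `fake_fetch`, `regex_clean`, `stub_extract`, and `require_fields` are hypothetical stand-ins (a canned page, a regex cleaner, and a regex-based stub), not a real crawler, the lxml cleaner, or a Schematron call.

```python
import re
from typing import Callable


def fake_fetch(url: str) -> str:
    """Pretend crawler: returns canned HTML instead of making a request."""
    return "<div><script>track()</script><h1>Desk Lamp</h1><b>$19.50</b></div>"


def regex_clean(html: str) -> str:
    """Toy cleaner: drops <script> blocks (the article recommends lxml)."""
    return re.sub(r"<script\b.*?</script>", "", html, flags=re.DOTALL | re.IGNORECASE)


def stub_extract(html: str) -> dict:
    """Stub standing in for a schema-guided extraction call."""
    name = re.search(r"<h1>(.*?)</h1>", html)
    price = re.search(r"\$([0-9][0-9,]*\.?[0-9]*)", html)
    return {
        "name": name.group(1) if name else None,
        "price": float(price.group(1).replace(",", "")) if price else None,
    }


def require_fields(record: dict) -> dict:
    """Validation layer: reject records with missing or mistyped fields."""
    if not isinstance(record.get("name"), str):
        raise ValueError("name must be a string")
    if not isinstance(record.get("price"), float):
        raise ValueError("price must be a float")
    return record


def run_pipeline(
    url: str,
    fetch: Callable[[str], str],
    clean: Callable[[str], str],
    extract: Callable[[str], dict],
    validate: Callable[[dict], dict],
) -> dict:
    """Compose the four layers; each one can be replaced independently."""
    return validate(extract(clean(fetch(url))))


record = run_pipeline(
    "https://example.com/p/1", fake_fetch, regex_clean, stub_extract, require_fields
)
print(record)
```

    Because `run_pipeline` only sees callables, swapping the crawler or re-running extraction on cached HTML means passing a different `fetch` function, with no change to the schema or validation code.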

    What E-Commerce Product Extraction Actually Needs

    A product record is not a single price string. Before you can compare product scraping APIs, you need to be honest about the fields a useful e-commerce record carries. At minimum:

    • Product identity: name, brand, SKU or MPN or GTIN, canonical URL
    • Price: list price, current or sale price, currency, unit price
    • Availability and inventory: in-stock flag, stock count, ship or lead time
    • Variants: size, color, and configuration options with per-variant price, stock, and SKU
    • Seller and marketplace: seller name, fulfillment type, buy-box winner, rating
    • Reviews and ratings: average rating, review count, rating distribution
    • Media and specs: image URLs, a spec key-value map, category breadcrumbs

    Rule-based selectors struggle here because e-commerce templates vary by site, by locale, and even by A/B test bucket. A CSS selector that nails the price on one retailer returns null on the next, and a selector that worked last quarter breaks the morning a redesign ships. Schema-guided extraction sidesteps that: you describe the field you want, and the model maps whatever the page happens to call it to your canonical key. The output is strict JSON that matches the schema you defined, not the page's accidental structure.
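
    To make the brittleness concrete, here is a small sketch using two made-up retailer templates that show the same price in different markup. The templates and the `price_by_selector` helper are hypothetical; the point is only that a lookup tuned to one template silently returns nothing on the other.

```python
import xml.etree.ElementTree as ET

# Two hypothetical retailer templates that show the same price differently.
RETAILER_A = '<div><span class="price">24.99</span></div>'
RETAILER_B = '<div><p data-amount="24.99">Now only $24.99</p></div>'


def price_by_selector(markup: str):
    """Rule-based lookup tuned to retailer A's template."""
    node = ET.fromstring(markup).find(".//span[@class='price']")
    return float(node.text) if node is not None else None


print(price_by_selector(RETAILER_A))  # 24.99
print(price_by_selector(RETAILER_B))  # None
```

    Nothing raises on the second page. The selector just comes back empty, which is why selector drift tends to surface as missing data downstream rather than as an error at scrape time.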

    Each field group maps to a downstream job. Price and currency feed pricing intelligence. Availability and variants feed catalog sync and stock monitoring. Seller and buy-box fields feed marketplace competitiveness analysis. Decide which records you actually need before you grade any API, because the right answer depends on what you are trying to load, not on which vendor has the biggest proxy pool.

    A Rubric for Evaluating a Web Scraping API

    Once retrieval is handled, grade any candidate web scraping API on four axes that decide whether you get usable records. Most roundups of web scraping APIs for data extraction stop at retrieval features; these four axes are what separate a usable record from a wasted call. The first three are quality; the fourth is economics.

    JSON validity

    Does it return parseable JSON on every call, or do you write defensive code for the times it returns prose, a half-finished object, or markdown? An extraction layer that occasionally breaks the contract forces a retry-and-repair loop around every request. Schematron runs in strict JSON mode, so output is always valid JSON.
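
    For a sense of what that defensive code looks like, the sketch below shows the kind of guard you end up writing around an extractor that can break the contract. The three sample responses are invented for illustration and are not output from any particular API.

```python
import json


def parse_json_object(text: str):
    """Return a dict if `text` is a complete JSON object, else None."""
    try:
        value = json.loads(text)
    except json.JSONDecodeError:
        return None
    return value if isinstance(value, dict) else None


responses = [
    '{"name": "Desk Lamp", "price": 19.5}',      # valid object
    'Sure! Here is the product: {"name": ...}',  # prose wrapped around JSON
    '{"name": "Desk Lamp", "price": ',           # half-finished object
]
for raw in responses:
    print(parse_json_object(raw))
```

    Only the first response parses. Each `None` is a request you would have to retry or repair, which is the loop a strict JSON mode is meant to remove.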

    Schema adherence

    Valid JSON is not enough. The output has to match your schema: your keys, your types, your required fields. A response that is valid JSON but renames price to cost or returns it as a string is still a bug you have to catch. Schematron is built for 100% schema adherence, so the shape you define is the shape you get back.
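
    The renamed-key case is easy to reproduce. The sketch below uses a hand-rolled, standard-library check as a stand-in for a real validator such as Pydantic; the `EXPECTED` contract and the drifted record are hypothetical.

```python
import json

# Hypothetical minimal contract: required key -> expected Python type.
EXPECTED = {"name": str, "price": float, "currency": str}


def schema_violations(record: dict) -> list[str]:
    """List every way `record` departs from EXPECTED (empty means it conforms)."""
    problems = []
    for key, expected_type in EXPECTED.items():
        if key not in record:
            problems.append(f"missing required field: {key}")
        elif not isinstance(record[key], expected_type):
            problems.append(f"{key} should be {expected_type.__name__}")
    for key in record:
        if key not in EXPECTED:
            problems.append(f"unexpected field: {key}")
    return problems


# Valid JSON, wrong shape: `price` renamed to `cost` and returned as a string.
drifted = json.loads('{"name": "Desk Lamp", "cost": "19.50", "currency": "USD"}')
print(schema_violations(drifted))
```

    The record parses without complaint, yet the check reports a missing `price` and an unexpected `cost`, which is exactly the class of bug that JSON validity alone does not catch.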

    Long-page support

    Real product pages are long. A page with dozens of variants, a spec table, and a wall of reviews can blow past smaller context limits and get silently truncated, dropping exactly the fields you came for. Schematron is a long-context model trained for long, noisy HTML, with context up to 128K tokens; for very large pages you truncate or chunk and trim to the relevant region.
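
    A rough sketch of the trim-then-truncate step might look like the following. The four-characters-per-token ratio is an assumed heuristic rather than the model's real tokenizer, and the marker strings are hypothetical; the function returns a flag so that truncation is never silent.

```python
import math

CHARS_PER_TOKEN = 4  # assumed rough average; measure with a real tokenizer


def estimate_tokens(text: str) -> int:
    """Crude size estimate based on character count."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def trim_to_region(html: str, start_marker: str, end_marker: str) -> str:
    """Keep only the span between two markers; fall back to the full page."""
    start = html.find(start_marker)
    if start == -1:
        return html
    end = html.find(end_marker, start)
    if end == -1:
        return html
    return html[start : end + len(end_marker)]


def fit_to_budget(html: str, max_tokens: int, start_marker: str, end_marker: str):
    """Return (html, was_truncated): trim to the region first, cut only if needed."""
    if estimate_tokens(html) <= max_tokens:
        return html, False
    region = trim_to_region(html, start_marker, end_marker)
    if estimate_tokens(region) <= max_tokens:
        return region, False
    # Last resort: hard cut at the assumed budget, and report that it happened.
    return region[: max_tokens * CHARS_PER_TOKEN], True


page = (
    "<nav>" + "x" * 4000 + "</nav>"
    "<main id='product'>Lamp $19.50</main>"
    "<footer>" + "y" * 4000 + "</footer>"
)
fitted, truncated = fit_to_budget(page, 100, "<main", "</main>")
print(fitted, truncated)
```

    When the flag comes back `True`, log the page and consider chunking it instead, since a hard cut can drop the very fields you wanted.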

    Unit cost

    Quality you cannot afford at scale is not quality. The axis that decides whether you can run extraction across a whole catalog is cost per extracted page, which comes down to your token profile and the model's per-token price. Schematron is priced per input and output token, with two model tiers covered in the cost section below.

    On the quality axes, the proof is measurable. Schematron V2 was evaluated with an LLM-as-a-judge methodology, with the grader scoring extractions on a one-to-five scale.

    Schematron V2 Extraction Quality - LLM-as-a-Judge average score, 1-5 scale (higher is better)

    V2 Small scores 4.060 and V2 Turbo scores 4.039, putting both within striking distance of the original 8B model at 4.070 and comfortably ahead of the first-generation 3B at 3.909; both V2 models also outscore DeepSeek V3.2 and GPT-5.4 Nano, which are substantially larger. Throughput backs the economics: on a single H100, V2 Turbo handles 4.14 requests per second and V2 Small 2.47. And when Schematron is used as the extraction layer in a web-search pipeline, SimpleQA factuality climbs to 83.10 for Small and 79.42 for Turbo, versus 8.54 for the base model with no extraction step. For a deeper benchmark walkthrough, see our notes on the published quality metrics.
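
    The throughput figures translate into wall-clock time with simple division. The sketch below assumes one request per page, a sustained rate equal to the single-H100 numbers quoted above, and a hypothetical catalog of one million pages; real pages that are longer or shorter than the benchmark's will shift the result.

```python
# Single-H100 request rates quoted in the text above.
REQUESTS_PER_SECOND = {"V2 Turbo": 4.14, "V2 Small": 2.47}
CATALOG_PAGES = 1_000_000  # hypothetical catalog size


def hours_to_process(pages: int, requests_per_second: float) -> float:
    """Wall-clock hours to process `pages` at a sustained request rate."""
    return pages / requests_per_second / 3600


for model, rps in REQUESTS_PER_SECOND.items():
    print(f"{model}: {hours_to_process(CATALOG_PAGES, rps):.1f} hours")
```

    Under those assumptions the million-page run takes about 67.1 hours on V2 Turbo and about 112.5 hours on V2 Small, per GPU.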

    If the rubric points you toward a high-throughput model, V2 Turbo is the one built for it.

    An Example Product Schema

    The rubric is abstract until you write a schema. Here is a Pydantic product model with the kind of field descriptions that make extraction reliable.

    from typing import Literal
    
    from pydantic import BaseModel, Field
    
    
    class Variant(BaseModel):
        sku: str = Field(..., description="Variant SKU or option identifier.")
        option: str = Field(
            ...,
            description="Variant label, e.g. size or color, exactly as shown.",
        )
        price: float = Field(
            ...,
            description="Price for this variant, numeric only, no currency symbol.",
        )
        in_stock: bool = Field(..., description="Whether this variant is purchasable.")
    
    
    class Product(BaseModel):
        name: str = Field(
            ...,
            description="Exact product name as shown in the title or primary heading.",
        )
        price: float = Field(
            ...,
            description=(
                "Current price the customer pays, after discounts. "
                "Numeric only, no currency symbol."
            ),
        )
        currency: str = Field(
            ...,
            description="ISO 4217 currency code for the price, e.g. USD.",
        )
        # Literal makes the allowed values an enum that validation can enforce.
        availability: Literal["in_stock", "out_of_stock", "preorder"] = Field(
            ...,
            description="Stock status: one of in_stock, out_of_stock, or preorder.",
        )
        specs: dict[str, str] = Field(
            default_factory=dict,
            description="Key attributes describing the product.",
        )
        variants: list[Variant] = Field(
            default_factory=list,
            description="Purchasable variants such as size or color options.",
        )
        tags: list[str] = Field(
            default_factory=list,
            description="Tags assigned to the product.",
        )

    The descriptions are not decoration. Schematron does not take a prompt or system message; the schema is the only channel you have to steer it, so any direction the model needs lives in the field descriptions. When a field requires a judgment call, like which of several numbers on the page is the current price, spell that out in the description rather than hoping the model guesses. Keep the structure typed: numbers for prices, enums for availability, lists for tags and variants, so validation has something to enforce.

    Extracting a Product Page with Schematron

    With a schema in hand, the flow has three steps: clean the HTML, send it with the schema, then validate the result.

    Clean the HTML

    Strip scripts, styles, and inline JavaScript before extraction. This cuts tokens and matches the preprocessing the model was trained on, which improves consistency.

    import os
    
    import lxml.html as LH
    # On lxml 5.2+, this import needs the separate lxml_html_clean package.
    from lxml.html.clean import Cleaner
    from openai import OpenAI
    
    from product_schema import Product
    
    HTML_CLEANER = Cleaner(
        scripts=True,
        javascript=True,
        style=True,
        inline_style=True,
        safe_attrs_only=False,
    )
    
    
    def strip_noise(html: str) -> str:
        """Remove scripts, styles, and JavaScript from HTML using lxml."""
        if not html or not html.strip():
            return ""
        try:
            doc = LH.fromstring(html)
            cleaned = HTML_CLEANER.clean_html(doc)
            return LH.tostring(cleaned, encoding="unicode")
        except Exception:
            return ""
    
    
    # Messy HTML (could be the full page; trim to the relevant region when possible)
    raw_html = """
    <div id="item">
      <h2 class="title">MacBook Pro M3</h2>
      <p>Price: <b>$2,499.99</b> USD</p>
      <ul info>
        <li>RAM: 16GB</li>
        <li>Storage: 512GB SSD</li>
      </ul>
      <span class="tag">laptop</span>
      <span class="tag">professional</span>
      <span class="tag">macbook</span>
      <span class="tag">apple</span>
    </div>
    """
    
    html = strip_noise(raw_html)
    
    client = OpenAI(
        base_url="https://api.inference.net/v1",
        api_key=os.environ.get("INFERENCE_API_KEY"),
    )
    
    resp = client.beta.chat.completions.parse(
        model="inference-net/schematron-v2-small",
        messages=[
            {"role": "user", "content": html},
        ],
        response_format=Product,
        temperature=0,
    )
    
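    # Validate on ingest, even though output is schema-conforming by design.
    parsed = resp.choices[0].message.parsed
    if parsed is None:
        # The SDK leaves `parsed` empty when the reply is a refusal or has no content.
        raise ValueError("Extraction returned no parsed product")
    product = Product.model_validate(parsed.model_dump())
    print(product.model_dump_json(indent=2))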

    Call the API and validate

    The call uses the OpenAI-compatible API: set the base URL to the Inference endpoint, pick a Schematron model, pass the HTML in a single user message, and hand the schema to response_format so the SDK parses and types the result. Keep temperature at zero, since the model is trained to perform best there. Setup details for the OpenAI-compatible API cover the base URL and key. For the example HTML, the model returns strict JSON that conforms to the schema; the excerpt below is abridged to the name, price, specs, and tags fields:

    {
      "name": "MacBook Pro M3",
      "price": 2499.99,
      "specs": {
        "RAM": "16GB",
        "Storage": "512GB SSD"
      },
      "tags": [
        "laptop",
        "professional",
        "macbook",
        "apple"
      ]
    }

    Even though the output is schema-conforming by design, validate on ingest and handle errors explicitly. That last step is what turns a model response into a record you can load without a second thought.

    A Cost-Comparison Framework

    Cost per extracted page is simple arithmetic once you know your token profile:

    cost_per_page = (input_tokens / 1e6 * input_price) + (output_tokens / 1e6 * output_price)

    Plug in a representative product page at roughly 10,000 input tokens after cleaning and 2,000 output tokens, and the two Schematron tiers compare like this.

    Model               | Input $/M | Output $/M | Cost / 1k pages
    Schematron V2 Small | $0.05     | $0.25      | $1.00
    Schematron V2 Turbo | $0.03     | $0.15      | $0.60

    Assumptions: 10,000 input + 2,000 output tokens per page after cleaning. Cost / 1k pages = 1,000 × ((10,000/1e6 × input $/M) + (2,000/1e6 × output $/M)). Your real token counts depend on page size and schema complexity.
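
    The same arithmetic is easy to script so you can rerun it with your own token profile. This sketch uses the per-million prices and the 10,000 / 2,000 token assumption stated above; the 250,000-page catalog in the last line is a hypothetical figure for illustration.

```python
# (input $/M tokens, output $/M tokens), as listed in the comparison above.
PRICES_PER_MILLION = {
    "Schematron V2 Small": (0.05, 0.25),
    "Schematron V2 Turbo": (0.03, 0.15),
}


def cost_per_page(input_tokens, output_tokens, input_price, output_price):
    """Dollar cost of one extraction given token counts and $/M prices."""
    return (input_tokens / 1e6 * input_price) + (output_tokens / 1e6 * output_price)


def catalog_cost(pages, input_tokens, output_tokens, input_price, output_price):
    """Dollar cost of extracting `pages` pages with the same token profile."""
    return pages * cost_per_page(input_tokens, output_tokens, input_price, output_price)


for model, (input_price, output_price) in PRICES_PER_MILLION.items():
    per_1k = catalog_cost(1_000, 10_000, 2_000, input_price, output_price)
    print(f"{model}: ${per_1k:.2f} per 1k pages")

# Hypothetical catalog of 250,000 pages on the Turbo tier.
print(f"${catalog_cost(250_000, 10_000, 2_000, 0.03, 0.15):.2f}")
```

    Swap in token counts measured on a sample of your own cleaned pages before trusting the totals.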

    Those numbers are a framework, not a quote: your real token counts depend on how large your pages are after cleaning and how rich your schema is. The point is the shape of the decision. V2 Turbo is the pick for high-volume work where cost and latency dominate, like e-commerce price scraping and inventory monitoring across a large catalog. V2 Small is the pick when the schema is unusually complex or the pages are unusually long and you want the highest quality. When you are extracting at scale, the Batch API submits up to 50,000 jobs at once with async status and optional webhooks, and very high-volume pipelines can move to dedicated capacity. If you are weighing this against writing and maintaining your own parsers, our build versus buy breakdown puts numbers on both paths.

    Recommended Stack Patterns

    The patterns that work all keep fetch and extract as separate steps.

    Pattern A, one-off or on-demand extraction. Your existing crawler fetches the page, lxml cleans it, Schematron extracts against your schema, Pydantic validates, and you store the record. Because the crawler and the extractor are decoupled, you can change retrieval tools later without rewriting a single schema.

    Pattern B, recurring monitoring. For price and inventory tracking, schedule the fetch, send the batch of pages through the Batch API, receive a webhook when the run completes, then diff against the last snapshot and store the changes. Re-running extraction on cached HTML costs nothing in fetch budget.
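
    The diff step in Pattern B can be quite small once records are typed. The sketch below assumes each snapshot is a dictionary keyed by SKU and compares only `price` and `availability`; the snapshot shape and the sample data are hypothetical.

```python
def diff_snapshots(previous: dict, current: dict) -> dict:
    """Compare two {sku: record} snapshots and report what changed."""
    added = sorted(current.keys() - previous.keys())
    removed = sorted(previous.keys() - current.keys())
    changed = {}
    for sku in sorted(previous.keys() & current.keys()):
        # Record (old, new) pairs only for the fields that differ.
        fields = {
            field: (previous[sku].get(field), current[sku].get(field))
            for field in ("price", "availability")
            if previous[sku].get(field) != current[sku].get(field)
        }
        if fields:
            changed[sku] = fields
    return {"added": added, "removed": removed, "changed": changed}


yesterday = {
    "A1": {"price": 19.5, "availability": "in_stock"},
    "B2": {"price": 40.0, "availability": "in_stock"},
}
today = {
    "A1": {"price": 17.0, "availability": "in_stock"},
    "C3": {"price": 9.99, "availability": "preorder"},
}
print(diff_snapshots(yesterday, today))
```

    Storing only the `changed`, `added`, and `removed` entries keeps the history compact, and the comparison works the same whether the current snapshot came from a fresh fetch or from cached HTML.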

    Pattern C, high-volume catalog ingestion. When you are processing pages continuously at scale, dedicated capacity gives you predictable throughput for the extraction layer. For a fuller picture of how crawlers and extractors fit together, we compared them layer by layer.

    When Schematron Is Not the Right Layer

    An extraction model is not always the answer, and being clear about that is part of choosing well.

    It is not a fetch layer. Schematron does not discover pages, rotate proxies, render JavaScript, or solve CAPTCHAs; if retrieval is your actual bottleneck, you need a crawler and Schematron is the step after it. If a site already exposes a clean structured feed, like an official product API, trustworthy JSON-LD, or a stable export, parse that directly rather than running a model over rendered HTML. And if you are extracting from a single page on a fixed template that never changes, a hand-written selector may be cheaper than any API call. The extraction layer earns its place when pages are messy, templates vary, and schema drift is the thing breaking your pipeline.
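
    When a page does carry JSON-LD, reading it directly can be a few lines of standard-library code. This is a minimal sketch, assuming the page embeds a top-level schema.org `Product` object (or a list of objects) in a `script` tag; it does not handle `@graph` wrappers or decide whether the feed is trustworthy, and the sample page is invented.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self.blocks.append("")

    def handle_data(self, data):
        if self._in_jsonld:
            self.blocks[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False


def product_from_jsonld(html: str):
    """Return the first schema.org Product object found, or None."""
    collector = JsonLdCollector()
    collector.feed(html)
    for block in collector.blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # malformed block: skip rather than guess
        items = data if isinstance(data, list) else [data]
        for item in items:
            if isinstance(item, dict) and item.get("@type") == "Product":
                return item
    return None


page = (
    '<html><head><script type="application/ld+json">'
    '{"@type": "Product", "name": "Desk Lamp", '
    '"offers": {"price": "19.50", "priceCurrency": "USD"}}'
    "</script></head><body></body></html>"
)
print(product_from_jsonld(page))
```

    Note that the price arrives as a string here, so the validation step still applies even when no model is involved.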

    Conclusion

    The best web scraping API for e-commerce is not the one with the largest proxy network. It is the one that turns a messy product page into a typed, validated record at a unit cost you can run across your entire catalog. Grade candidates on the four rubric axes (JSON validity, schema adherence, long-page support, and unit cost), and keep retrieval and extraction as separate jobs so you can improve each on its own.

    The fastest way to know whether this fits your data is to try it on a page you already struggle with. Take one of your hardest product pages, write the schema for the record you wish you had, and run it through Schematron.

    Extract typed JSON from messy HTML

    Send HTML and a Pydantic, Zod, or JSON Schema definition to Schematron and get structured JSON back without writing selectors or prompts.
