    Jun 19, 2026

    Price Scraping with AI: Extract Product Prices into JSON

    Inference Research

    What a Price Scraping Pipeline Actually Needs

    A price scraper that returns the string $1,299.99 has not finished the job. Price scraping is the practice of turning e-commerce HTML into typed price records: not a single number, but a structured offer with an amount, a currency, a list price, a sale price, an availability state, the variant it applies to, and the seller behind it. The number is the easy part. The record is what your pipeline can validate, deduplicate, and store.

    That distinction is where most price scraping projects quietly fall apart. The same product page can show three different prices at once: a crossed-out list price, a sale price, and a "from $X" range for variants. Prices arrive as locale-formatted strings (1.299,00 in one market, 1,299.00 in another) wrapped in currency symbols, and the markup that holds them changes whenever the site ships a redesign. A CSS selector that works on Monday returns None on Friday.

    This article is for engineers building price-monitoring, catalog, and marketplace pipelines who already have a way to fetch pages and need reliable typed JSON out of the HTML those crawlers return. We will design a real price schema, run schema-guided extraction on raw HTML, validate the price and currency fields properly, and choose between Schematron V2 Turbo and Small with side-by-side outputs.

    Before any of that, it helps to be precise about what a usable price record contains:

    Field         Type      Why it matters
    sale_price    number    The current price the buyer pays
    list_price    number    Original price; needed to compute discount
    currency      ISO 4217  Disambiguates amounts across markets
    availability  enum      Out-of-stock pages must not yield a number
    variants      list      Per-SKU price and stock on multi-variant pages
    seller        string    Same product, different sellers and prices
    coupons       list      Conditional discounts, not part of base price
    evidence      string    Source text the price was read from

    Fields map onto the schema.org Offer vocabulary (price, priceCurrency, availability).

    Most of these fields map almost one-to-one onto the schema.org Offer vocabulary that many e-commerce sites already embed as JSON-LD: price, priceCurrency, availability, priceValidUntil. When that JSON-LD is present and stable, it is the cleanest signal on the page. When it is missing, partial, or wrong, you are back to reading the rendered HTML, and the stability hierarchy of on-page signals runs from JSON-LD down through meta tags and data-* attributes to brittle CSS selectors over visible text. A schema-guided model reads all of those at once, which is the whole reason it holds up where selectors break.



    Fetch and Extract Are Two Different Jobs

    Price scraping bundles two jobs that are easy to conflate. The first is fetch: getting the HTML at all, which means rendering JavaScript, rotating proxies, handling rate limits, and getting past anti-bot defenses. The second is extract: turning that HTML into a typed record. Most price scraping tools are built around the fetch problem and treat extraction as an afterthought, which is exactly why their extraction breaks first.

    Keeping the two separate is the single most useful architectural decision in a price pipeline. The fetch layer is your existing crawler or proxy stack. The extraction layer is where schema-guided extraction with Schematron lives: a family of long-context models trained to take messy HTML plus a schema and return conforming JSON. Schematron does not fetch pages, render JavaScript, rotate proxies, or solve CAPTCHAs. It runs after retrieval, on the HTML you already have.

    If you do not yet have the fetch side solved, that is its own topic; fetching and cleaning page HTML in Python covers it end to end, and choosing a crawler is a separate decision from choosing an extractor. For the rest of this guide, assume a product page's raw HTML has already landed in a string variable, and focus on everything downstream of that.

    The full pipeline, with the seam between fetch and extract drawn explicitly:

    Figure 1: Fetch is your existing stack; Schematron is the extraction layer that runs on the HTML you already have. Records that fail validation route to a review queue instead of the store.


    A Schema for Product Price Extraction

    The schema is the contract. With Schematron there is no system prompt and no user instruction; the model uses only the schema you pass and the descriptions on its fields to decide what to extract. That constraint works in your favor. There is no prompt to drift and no instruction that quietly means one thing on a clean page and something else on a messy one. Everything the model needs lives in one typed object, so the whole behavior of the extractor sits in code you can review and version.

    Here is a ProductOffer model in Pydantic. Notice that every ambiguous field carries a description, because the description is the only place to disambiguate which of several displayed prices is the sale price.

    from enum import Enum
    from typing import Optional
    
    from pydantic import BaseModel, Field
    
    
    class Availability(str, Enum):
        in_stock = "in_stock"
        out_of_stock = "out_of_stock"
        preorder = "preorder"
        backorder = "backorder"
        discontinued = "discontinued"
        unknown = "unknown"
    
    
    class Coupon(BaseModel):
        code: Optional[str] = Field(
            default=None, description="Coupon or promo code, e.g. 'SPRING15'."
        )
        description: str = Field(
            ..., description="What the coupon does, e.g. 'Save 15% at checkout'."
        )
        value: Optional[str] = Field(
            default=None,
            description="The discount the coupon applies, as shown, e.g. '15%' or '$10'.",
        )
    
    
    class Bundle(BaseModel):
        is_bundle: bool = Field(
            ..., description="True if this listing sells multiple products together."
        )
        components: list[str] = Field(
            default_factory=list,
            description="Names of the products included in the bundle.",
        )
    
    
    class Variant(BaseModel):
        sku: Optional[str] = Field(default=None, description="Variant SKU or id if shown.")
        name: str = Field(..., description="Variant name, e.g. 'Space Gray, 512GB'.")
        sale_price: Optional[float] = Field(
            default=None, description="Current price for this variant if shown."
        )
        availability: Availability = Field(
            default=Availability.unknown,
            description="Stock state for this specific variant.",
        )
    
    
    class ProductOffer(BaseModel):
        name: str = Field(
            ..., description="Exact product name as shown in the title or primary heading."
        )
        sku: Optional[str] = Field(
            default=None, description="Product SKU, MPN, or id if present on the page."
        )
        currency: Optional[str] = Field(
            default=None,
            description="ISO 4217 currency code inferred from the price, e.g. 'USD', 'EUR'.",
        )
        list_price: Optional[float] = Field(
            default=None,
            description=(
                "Original or struck-through price before any discount. "
                "Null if the page shows only one price."
            ),
        )
        sale_price: Optional[float] = Field(
            default=None,
            description=(
                "Current price the buyer pays. If only one price is shown, put it here. "
                "Null if the product has no purchasable price."
            ),
        )
        availability: Availability = Field(
            default=Availability.unknown,
            description="Whether the product can be purchased right now.",
        )
        seller: Optional[str] = Field(
            default=None, description="Seller or merchant name if shown (e.g. marketplace)."
        )
        coupons: list[Coupon] = Field(
            default_factory=list, description="Any coupons or promo codes offered."
        )
        bundle: Optional[Bundle] = Field(
            default=None, description="Bundle details if this listing is a bundle."
        )
        variants: list[Variant] = Field(
            default_factory=list,
            description="Per-variant prices and stock if the page lists variants.",
        )
        evidence: Optional[str] = Field(
            default=None,
            description="Short snippet of source text the price was read from.",
        )

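    The same schema expressed with Zod, for TypeScript pipelines. Field names here follow camelCase (salePrice, listPrice, isBundle), so the JSON keys returned for this schema differ from the snake_case keys of the Pydantic version: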

    import { z } from "zod";
    
    const Availability = z.enum([
      "in_stock",
      "out_of_stock",
      "preorder",
      "backorder",
      "discontinued",
      "unknown",
    ]);
    
    const Coupon = z.object({
      code: z.string().nullable().describe("Coupon or promo code, e.g. 'SPRING15'."),
      description: z
        .string()
        .describe("What the coupon does, e.g. 'Save 15% at checkout'."),
      value: z
        .string()
        .nullable()
        .describe("The discount the coupon applies, as shown, e.g. '15%' or '$10'."),
    });
    
    const Variant = z.object({
      sku: z.string().nullable().describe("Variant SKU or id if shown."),
      name: z.string().describe("Variant name, e.g. 'Space Gray, 512GB'."),
      salePrice: z
        .number()
        .nullable()
        .describe("Current price for this variant if shown."),
      availability: Availability.describe("Stock state for this specific variant."),
    });
    
    export const ProductOffer = z.object({
      name: z
        .string()
        .describe("Exact product name as shown in the title or primary heading."),
      sku: z.string().nullable().describe("Product SKU, MPN, or id if present."),
      currency: z
        .string()
        .nullable()
        .describe("ISO 4217 currency code inferred from the price, e.g. 'USD', 'EUR'."),
      listPrice: z
        .number()
        .nullable()
        .describe(
          "Original or struck-through price before any discount. Null if one price is shown.",
        ),
      salePrice: z
        .number()
        .nullable()
        .describe(
          "Current price the buyer pays. If only one price is shown, put it here. Null if no purchasable price.",
        ),
      availability: Availability.describe(
        "Whether the product can be purchased right now.",
      ),
      seller: z.string().nullable().describe("Seller or merchant name if shown."),
      coupons: z.array(Coupon).describe("Any coupons or promo codes offered."),
      bundle: z
        .object({
          isBundle: z
            .boolean()
            .describe("True if this listing sells multiple products together."),
          components: z
            .array(z.string())
            .describe("Names of the products included in the bundle."),
        })
        .nullable()
        .describe("Bundle details if this listing is a bundle."),
      variants: z
        .array(Variant)
        .describe("Per-variant prices and stock if the page lists variants."),
      evidence: z
        .string()
        .nullable()
        .describe("Short snippet of source text the price was read from."),
    });

    Sale Price, List Price, Coupons, and Bundles

    The hard cases all come down to telling the model which number is which. Model list_price and sale_price as separate fields rather than a single price, and let the discount be derived rather than scraped. Describe list_price as the original or struck-through price and sale_price as the current price the buyer pays; that one sentence is what stops the model from swapping them on a page where the sale price appears first visually.

    Coupons and bundles need their own structure. A coupon is rarely a price; it is a conditional modifier ("save 15% with code SPRING"), so model it as a list of {code, description, value} rather than folding it into the price. A bundle ("buy the camera and get the bag") changes what the offer even refers to, so a bundle sub-object with is_bundle and a list of components keeps a bundled listing from silently overwriting the standalone product's price. Because the model only has the schema to work from, each of these fields earns its keep through its description.

    Unavailable Products and Variants

    An out-of-stock page is where naive extractors hallucinate. If sale_price is required and the page shows no purchasable price, a model under pressure to fill the field may invent one. Make list_price and sale_price optional, and add an availability enum (in_stock, out_of_stock, preorder, backorder, discontinued, unknown) so the correct output for a sold-out product is a record with a null sale_price and availability="out_of_stock", not a fabricated number. Modeling absence explicitly is how you keep junk out of the database.

    Variants are the other multiplier. A single page often lists several SKUs (sizes, colors, configurations), each with its own price and stock state. A variants list, where each entry carries its own sku, sale_price, and availability, captures that without collapsing everything into one ambiguous "from" price.


    Running Extraction on Raw HTML

    With the schema defined, extraction is three steps: clean the HTML, call the model, and read the typed result.

    Schematron was trained on HTML that had been pre-cleaned with lxml's Cleaner to strip scripts, styles, and inline JavaScript, so aligning your preprocessing with that improves both quality and cost. Cleaning removes the markup the model does not need, which cuts input tokens, and input tokens are what you pay for most. Other tools (Readability, Trafilatura, BeautifulSoup, or targeted regex) are acceptable; lxml simply matches the training distribution. When in doubt, remove less rather than more.
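To see how much cleaning shrinks a page before paying for it, a rough measurement can be sketched with the standard library. The regex stripper here is a stand-in for lxml's Cleaner so the example has no dependencies; `strip_noise_regex` and `size_report` are illustrative names, and character counts are only a proxy for token counts.

```python
import re

# Stand-in for lxml's Cleaner: script blocks, style blocks, comments, inline styles.
# Caveat: the inline-style pattern also matches visible text that looks like ` style="..."`.
_NOISE_PATTERNS = [
    re.compile(r"<script\b.*?</script\s*>", re.IGNORECASE | re.DOTALL),
    re.compile(r"<style\b.*?</style\s*>", re.IGNORECASE | re.DOTALL),
    re.compile(r"<!--.*?-->", re.DOTALL),
    re.compile(r"\sstyle\s*=\s*(\".*?\"|'.*?')", re.IGNORECASE | re.DOTALL),
]


def strip_noise_regex(html: str) -> str:
    """Remove markup the extractor does not need, leaving tags and text intact."""
    for pattern in _NOISE_PATTERNS:
        html = pattern.sub("", html)
    return html


def size_report(raw: str) -> dict:
    """Compare raw and cleaned sizes in characters."""
    cleaned = strip_noise_regex(raw)
    saved = len(raw) - len(cleaned)
    return {
        "raw_chars": len(raw),
        "cleaned_chars": len(cleaned),
        "saved_ratio": round(saved / len(raw), 3) if raw else 0.0,
    }
```

Running `size_report` over a sample of real pages gives a quick sense of how much of the input bill is scripts and styling rather than content.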

    Call Schematron with Your Schema

    The call uses the OpenAI-compatible Chat Completions API, so if you have an OpenAI client you have most of the integration already. Point the base_url at https://api.inference.net/v1, supply your Inference API key, put the cleaned HTML in the user message, and pass the schema as the response format. Use inference-net/schematron-v2-turbo as the default for price work and keep temperature at 0, which is what the model is tuned for.

    The input is the page's raw HTML. A representative product page — a struck-through list price, a sale price, a coupon, a stock badge, and two variants — lands in the user message looking like this:

    <div class="pdp" data-product-id="SKU-44821">
      <h1 class="product-title">Aurora 27" 4K Monitor</h1>
      <div class="seller">Sold by <span>NorthLight Electronics</span></div>
      <div class="price-box">
        <span class="was">List: <s>$499.99</s></span>
        <span class="now">$399.99</span>
        <span class="ccy">USD</span>
      </div>
      <div class="promo">Use code <b>SAVE10</b> for an extra 10% off at checkout</div>
      <div class="stock in">In stock - ships in 24h</div>
      <ul class="variants">
        <li data-sku="SKU-44821-SLV">Silver - $399.99 - in stock</li>
        <li data-sku="SKU-44821-BLK">Black - $409.99 - backordered</li>
      </ul>
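    </div>

    The script below cleans that HTML and sends it to the model together with the schema:

    import os
    
    import lxml.html as LH
    # On lxml 5.2+, lxml.html.clean ships separately: pip install lxml_html_clean
    from lxml.html.clean import Cleaner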
    from openai import OpenAI
    
    from schema import ProductOffer
    
    HTML_CLEANER = Cleaner(
        scripts=True,
        javascript=True,
        style=True,
        inline_style=True,
        safe_attrs_only=False,
    )
    
    
    def strip_noise(html: str) -> str:
        """Remove scripts, styles, and JavaScript from HTML using lxml."""
        if not html or not html.strip():
            return ""
        try:
            doc = LH.fromstring(html)
            cleaned = HTML_CLEANER.clean_html(doc)
            return LH.tostring(cleaned, encoding="unicode")
        except Exception:
            return ""
    
    
    client = OpenAI(
        base_url="https://api.inference.net/v1",
        api_key=os.environ.get("INFERENCE_API_KEY"),
    )
    
    
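    def extract_offer(raw_html: str) -> ProductOffer:
        cleaned = strip_noise(raw_html)
        if not cleaned:
            # strip_noise returns "" for empty or unparseable input;
            # fail here instead of paying for a call on an empty page.
            raise ValueError("No HTML left to extract from after cleaning")
        resp = client.beta.chat.completions.parse(
            model="inference-net/schematron-v2-turbo",
            messages=[{"role": "user", "content": cleaned}],
            response_format=ProductOffer,
            temperature=0,
        )
        return resp.choices[0].message.parsed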
    
    
    if __name__ == "__main__":
        with open("messy-product.html") as f:
            offer = extract_offer(f.read())
        print(offer.model_dump_json(indent=2))

    Two things are doing the work here. The HTML goes in as the raw user message because the model expects the page, not a prompt about the page. And strict JSON Schema outputs mean the response conforms to ProductOffer rather than free text you would have to parse and repair.

    Representative JSON Output

    For a product page showing a sale price, a struck-through list price, a coupon, and an in-stock badge, the extractor returns a record like this:

    {
      "name": "Aurora 27\" 4K Monitor",
      "sku": "SKU-44821",
      "currency": "USD",
      "list_price": 499.99,
      "sale_price": 399.99,
      "availability": "in_stock",
      "seller": "NorthLight Electronics",
      "coupons": [
        {
          "code": "SAVE10",
          "description": "Extra 10% off at checkout",
          "value": "10%"
        }
      ],
      "bundle": null,
      "variants": [
        {
          "sku": "SKU-44821-SLV",
          "name": "Silver",
          "sale_price": 399.99,
          "availability": "in_stock"
        },
        {
          "sku": "SKU-44821-BLK",
          "name": "Black",
          "sale_price": 409.99,
          "availability": "backorder"
        }
      ],
      "evidence": "List: $499.99 $399.99 USD"
    }

    Every field conforms to the schema, the optional fields that did not apply are absent or null, and there is nothing to clean up before it enters your store.

    If high-throughput price extraction is where you are headed, Schematron V2 Turbo is built for exactly this shape of work.



    Validating Price and Currency Fields

    Schema-valid is not the same as semantically valid. Schematron guarantees the JSON matches the structure of ProductOffer, but it cannot know that your business rules require a non-negative price or that a sale price above the list price is a sign something went wrong. Those invariants are the pipeline's job, and price data has a specific set worth enforcing on every record.

    Check that the amount is non-negative, that the currency is in an allowlist of ISO 4217 codes you actually support, that sale_price does not exceed list_price, and that availability is one of your enum values. Records that fail go to a review queue rather than into the catalog; the validation and review-queue patterns for automated extraction generalize directly to prices.

    Two normalization steps matter enough to call out. First, never store money as a float; binary floating point cannot represent most decimal prices exactly, so convert to integer minor units or Decimal before any arithmetic, and remember that most currencies have two minor-unit digits while a few like JPY have zero. Second, normalize the currency itself: map the symbol the model saw ($, £, €, ¥) to an ISO 4217 code and parse the numeric string with the page's locale in mind.

    from decimal import Decimal, ROUND_HALF_UP
    
    from schema import ProductOffer
    
    # Currencies you actually support, with their minor-unit digits (ISO 4217).
    SUPPORTED_CURRENCIES = {"USD": 2, "EUR": 2, "GBP": 2, "JPY": 0}
    
    
    class PriceValidationError(ValueError):
        pass
    
    
    def to_minor_units(amount: float, currency: str) -> int:
        """Convert a price to integer minor units (e.g. cents) using Decimal."""
        digits = SUPPORTED_CURRENCIES[currency]
        quant = Decimal(1).scaleb(-digits)  # 0.01 for 2-digit, 1 for JPY
        value = Decimal(str(amount)).quantize(quant, rounding=ROUND_HALF_UP)
        return int(value.scaleb(digits))
    
    
    def validate_offer(offer: ProductOffer) -> dict:
        """Check semantic invariants and return a money-safe record, or raise."""
        if offer.currency is None or offer.currency not in SUPPORTED_CURRENCIES:
            raise PriceValidationError(f"Unsupported currency: {offer.currency!r}")
    
        if offer.sale_price is None and offer.availability.value == "in_stock":
            raise PriceValidationError("In-stock product is missing a price")
    
        for label, price in (("list", offer.list_price), ("sale", offer.sale_price)):
            if price is not None and price < 0:
                raise PriceValidationError(f"Negative {label} price: {price}")
    
        if (
            offer.list_price is not None
            and offer.sale_price is not None
            and offer.sale_price > offer.list_price
        ):
            raise PriceValidationError("sale_price exceeds list_price")
    
        return {
            "name": offer.name,
            "currency": offer.currency,
            "list_minor": (
                to_minor_units(offer.list_price, offer.currency)
                if offer.list_price is not None
                else None
            ),
            "sale_minor": (
                to_minor_units(offer.sale_price, offer.currency)
                if offer.sale_price is not None
                else None
            ),
            "availability": offer.availability.value,
        }
    
    
    def ingest(offer: ProductOffer, review_queue: list) -> dict | None:
        try:
            return validate_offer(offer)
        except PriceValidationError as exc:
            review_queue.append({"offer": offer.model_dump(), "reason": str(exc)})
            return None

    Schematron V2 Turbo vs Small for Price Extraction

    Most price pipelines face a cost-quality tradeoff across millions of pages, which is precisely the choice between the two Schematron V2 models.

    Quality and Throughput (Published Metrics)

    The Schematron V2 launch post graded extractions with an LLM-as-judge (GPT-5.4, on a 1-to-5 scale): V2 Small scored 4.060 and V2 Turbo 4.039, close enough that both clear the bar for structured price extraction, and both outperformed DeepSeek V3.2 and GPT-5.4 Nano on the same benchmark. On the SimpleQA benchmark the two land at 83.10 (Small) and 79.42 (Turbo). Where they separate is throughput: on a single H100, Turbo runs 4.14 requests per second against Small's 2.47.

    Metric                        V2 Turbo  V2 Small
    LLM-as-judge (1-5)            4.039     4.060
    SimpleQA                      79.42     83.10
    Throughput (req/s, 1x H100)   4.14      2.47
    Input ($/M tokens)            $0.03     $0.05
    Output ($/M tokens)           $0.15     $0.25

    Quality and throughput from the Schematron V2 launch post (LLM-as-judge graded by GPT-5.4); pricing from the current model pages. Throughput measured at 10k input / 500 output tokens.

    Figure: Schematron V2 throughput in requests per second on a single H100 (10k in / 500 out), Turbo versus Small.

    You can see the published extraction quality metrics in more detail, but the short version is that the quality gap is small and the throughput and cost gap is not.

    Side-by-Side Output and a Failure Case

    Run the same cleaned HTML through both models and the difference usually disappears. For a standard offer (one price, one currency, in stock), Turbo and Small return effectively identical records.

    import os
    
    from openai import OpenAI
    
    from schema import ProductOffer
    
    client = OpenAI(
        base_url="https://api.inference.net/v1",
        api_key=os.environ.get("INFERENCE_API_KEY"),
    )
    
    MODELS = [
        "inference-net/schematron-v2-turbo",
        "inference-net/schematron-v2-small",
    ]
    
    
    def extract_with(model: str, cleaned_html: str) -> ProductOffer:
        resp = client.beta.chat.completions.parse(
            model=model,
            messages=[{"role": "user", "content": cleaned_html}],
            response_format=ProductOffer,
            temperature=0,
        )
        return resp.choices[0].message.parsed
    
    
    if __name__ == "__main__":
        with open("messy-product.html") as f:
            cleaned_html = f.read()  # already cleaned upstream in your pipeline
        for model in MODELS:
            offer = extract_with(model, cleaned_html)
            print(f"\n=== {model} ===")
            print(offer.model_dump_json(indent=2))

    The gap shows up on complexity. On a page with a deep variant matrix or an ambiguous bundle, Turbo occasionally mis-assigns a per-variant price or flattens a bundle, while Small more reliably keeps the structure intact, which tracks with Small being tuned for the highest quality on complex schemas and very long pages. The practical rule: make Turbo the default for high-volume, simple-to-moderate offer schemas, and route the page classes that are genuinely hard (dense variants, ambiguous coupon math, very long pages) to Small. Sample both on a representative set before committing a page class to one model.
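That routing rule can be sketched as a small function. The model identifiers come from this article; everything else (the function name, the hint parameters, and both thresholds) is a hypothetical placeholder to be tuned from your own side-by-side sampling.

```python
TURBO = "inference-net/schematron-v2-turbo"
SMALL = "inference-net/schematron-v2-small"

# Hypothetical thresholds; calibrate them against a representative sample.
MAX_TURBO_CHARS = 200_000
MAX_TURBO_VARIANTS = 12


def choose_model(cleaned_html: str, variant_hint: int = 0, is_bundle_hint: bool = False) -> str:
    """Default to Turbo; send long, variant-dense, or bundle pages to Small."""
    if len(cleaned_html) > MAX_TURBO_CHARS:
        return SMALL
    if variant_hint > MAX_TURBO_VARIANTS or is_bundle_hint:
        return SMALL
    return TURBO
```

The hints are assumed to come from cheap signals the crawler already has, such as a count of data-sku attributes or a page class label, so routing adds no model call of its own.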

    Cost per Page

    Cost follows directly from per-token pricing. V2 Turbo is $0.03 per million input tokens and $0.15 per million output tokens; V2 Small is $0.05 per million input and $0.25 per million output. Take an illustrative cleaned product page of roughly 3,000 input tokens producing a 300-token record. On Turbo that is about $0.00009 input plus $0.000045 output, or $0.000135 per page: a little over one hundredth of a cent. The same page on Small costs about $0.00015 input plus $0.000075 output, or $0.000225, roughly 1.7 times as much.

    Figure: Illustrative extraction cost per one million product pages, Turbo versus Small, assuming ~3,000 input + 300 output tokens per cleaned page.

    Input tokens dominate the bill, which is the second reason to clean HTML with lxml: fewer input tokens is both a quality and a cost lever. Comparing Schematron V2 Turbo and Small on your own pages, with your own token profile, is the only way to size this precisely.
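To size the bill against your own token profile, the per-page arithmetic can be wrapped in a small calculator. The prices are the per-million-token figures quoted in this section; the function names and the short model keys are illustrative.

```python
from decimal import Decimal

# Per-million-token prices in USD, as quoted in this section.
PRICING = {
    "turbo": {"input": Decimal("0.03"), "output": Decimal("0.15")},
    "small": {"input": Decimal("0.05"), "output": Decimal("0.25")},
}
MILLION = Decimal(1_000_000)


def cost_per_page(model: str, input_tokens: int, output_tokens: int) -> Decimal:
    """USD cost of one extraction call for the given token counts."""
    price = PRICING[model]
    return (price["input"] * input_tokens + price["output"] * output_tokens) / MILLION


def cost_per_million_pages(model: str, input_tokens: int, output_tokens: int) -> Decimal:
    """USD cost of one million pages with the same token profile."""
    return cost_per_page(model, input_tokens, output_tokens) * MILLION


def input_share(model: str, input_tokens: int, output_tokens: int) -> Decimal:
    """Fraction of the per-page cost that comes from input tokens."""
    price = PRICING[model]
    input_cost = price["input"] * input_tokens
    return input_cost / (input_cost + price["output"] * output_tokens)
```

For the illustrative page of 3,000 input and 300 output tokens, this works out to $135 per million pages on Turbo and $225 on Small, with input tokens accounting for two thirds of the cost on either model.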


    When Schematron Is Not the Right Layer

    Schema-guided extraction is the right tool for messy, varied, changing HTML. It is the wrong tool in four situations, and being honest about them keeps your pipeline cheap and your expectations realistic.

    If your real problem is fetching (JavaScript rendering, residential proxies, CAPTCHAs, anti-bot), an extraction model does nothing for you; pair it with a crawler or proxy provider that solves that. If the page already exposes clean, stable JSON-LD Offer data or a public or backend JSON/GraphQL API with the exact fields you need, parse that directly and skip the model call entirely. If you are scraping a handful of fixed layouts that never change, a maintained set of CSS or XPath selectors is cheaper and fully deterministic. And if you have a sub-millisecond synchronous latency budget, any model call is too slow.

    The build-versus-buy tradeoff for extraction is worth working through deliberately rather than reaching for a model by default.


    Putting the Pipeline Together

    A reliable price pipeline is the same six steps every time: fetch the HTML with your existing stack, clean it with lxml, define the offer schema once, extract with Schematron, validate the price and currency invariants, and store or deduplicate the result. The seam between fetch and extract is what keeps each layer replaceable and each failure diagnosable.

    For recurring monitoring at scale, move off synchronous calls. The Batch API accepts up to 50,000 extraction jobs at once with optional webhooks and a 24-hour window, and the Group API handles smaller groups of 50 or fewer requests as a single job with one callback when it finishes. Both let a monitoring run process a day's worth of pages without holding open connections.
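The request format for those APIs is not shown here, so the sketch below covers only the planning step: splitting a monitoring run so that no submission exceeds the limits quoted above. The function names are illustrative, and nothing in the block calls an API.

```python
from itertools import islice
from typing import Iterable, Iterator

BATCH_LIMIT = 50_000  # Batch API job limit quoted above
GROUP_LIMIT = 50      # Group API request limit quoted above


def chunked(items: Iterable, size: int) -> Iterator[list]:
    """Yield consecutive lists of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    iterator = iter(items)
    while chunk := list(islice(iterator, size)):
        yield chunk


def plan_submissions(urls: list[str]) -> dict:
    """Pick group or batch by run size and split the run into submissions."""
    if len(urls) <= GROUP_LIMIT:
        return {"api": "group", "chunks": [urls] if urls else []}
    return {"api": "batch", "chunks": list(chunked(urls, BATCH_LIMIT))}
```

A run of 120,000 pages would be planned as three batch submissions of 50,000, 50,000, and 20,000 jobs.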

    The fastest way to see whether this fits your pages is to run one through it. Send a sample product page's HTML and your schema and look at the JSON that comes back.

