Web Scraping with ChatGPT vs Schematron

Web Scraping with ChatGPT Starts Great and Then Drifts

You paste a product page's HTML into ChatGPT, ask for the name, price, and specs, and get clean JSON back in seconds. By the afternoon you have a working prototype. Then you point it at 50,000 pages on a nightly schedule, and the output starts to wobble: a sale price shows up where the list price should be, a missing field becomes an empty string on some pages and null on others, and a model update you didn't ask for quietly changes how a whole class of pages gets parsed.

Can ChatGPT scrape websites? It can extract data from HTML you give it, and with Structured Outputs it returns valid JSON every time. What it can't do is fetch live sites for you, or keep a prompt meaning exactly the same thing across model upgrades and edge cases. The difference between a demo and a pipeline is the interface: a prompt is a moving target, and a JSON Schema is a contract.

This article is a fair comparison. We will cover where web scraping with ChatGPT genuinely helps, why prompt-only extraction drifts once it hits production, what JSON mode actually fixes, how a schema-driven model like Schematron holds its shape, what each costs per page, how to migrate from one to the other, and when a general LLM is still the right tool.

Where ChatGPT Helps: Prototyping Extraction Fast

ChatGPT is a great place to start. When people talk about ChatGPT scraping, they usually mean one of three patterns that work well, and all three are legitimately useful early in a project.

The first is pasting HTML directly into the chat window and asking for fields. This is unbeatable for a one-off: you want the pricing table off a single page, you have it in two minutes, and you never write a line of code. The second is asking ChatGPT to generate a scraper for you, complete with CSS selectors and a requests or BeautifulSoup skeleton. The third is calling the OpenAI API with a prompt and Structured Outputs, which is the path most people take when a prototype graduates into something that runs more than once.

That last pattern matters because it removes the easiest objection to LLM extraction. With Structured Outputs and strict: true, the model's response is constrained to match a JSON Schema you supply, and OpenAI reports 100% reliability matching output schemas on recent GPT-4o snapshots. So the old complaint that "the LLM returns broken JSON" is largely solved.

The strengths here are exploratory strengths. No schema design, instant iteration, graceful handling of one strange page, and a model that can explain its own output. When you are still figuring out which fields even exist on a page, that flexibility is exactly what you want, and it is why so many extraction projects begin with a chat window rather than a codebase. The friction of defining types and writing validation early would only slow down the part of the work that benefits most from speed.

None of that is the same thing as a reliable nightly job, though, and the qualities that make ChatGPT a great sketchpad are not the qualities that make a pipeline trustworthy. A sketchpad rewards looseness; a pipeline rewards a fixed contract. That is where the trouble starts.

Crawl and Fetch Are Not Extraction

Before comparing tools, it helps to split a scraping system into two layers that get blurred constantly.

The first layer is fetch and crawl: getting the HTML in the first place. That is the job of requests, Playwright, a crawler, or a proxy network, and it is where anti-bot defenses, pagination, rate limits, and JavaScript rendering live. The second layer is extraction: turning that HTML into typed records you can store and query. Neither ChatGPT nor Schematron is a crawler or a proxy. They both operate on HTML you already have.

Most "web scraping with ChatGPT" tutorials mix these together and end up recommending a proxy product, because the author sells proxies. We are going to keep them separate, because the hard, durable problem for AI web scraping is the extraction layer, not the fetch. If you want to keep your existing crawler and only change how you turn pages into JSON, that is exactly the boundary this article works at. The same logic shows up when teams keep their crawler and swap the extraction layer, and it is why it pays to think of HTML-to-JSON as typed extraction rather than a generic page dump.

The shape of the pipeline stays the same no matter which model does the extracting: fetch the page, optionally clean the HTML with lxml to strip scripts and boilerplate, define the output you want, run the extraction, then validate and store the result.

Figure 1: Fetch vs Extract Pipeline. The fetch layer gets the HTML; the extraction layer turns it into typed JSON.
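The optional cleaning step can be sketched in a few lines. The article's pipeline uses lxml for this; the version below swaps in Python's standard-library html.parser so it runs without extra dependencies. The names clean_html and SKIP_TAGS, and the choice of which tags to drop, are assumptions for illustration rather than part of any documented pipeline.

```python
import re
from html import escape
from html.parser import HTMLParser

# Tags whose entire contents are dropped before extraction (assumed list).
SKIP_TAGS = {"script", "style", "noscript"}


class _Cleaner(HTMLParser):
    """Re-emits the HTML while dropping skipped tags, comments, and blank text."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts: list[str] = []
        self.skip_depth = 0  # > 0 while inside a skipped tag

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are kept verbatim.
        if self.skip_depth == 0 and tag not in SKIP_TAGS:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Collapse whitespace runs; drop text that is only whitespace.
        text = re.sub(r"\s+", " ", data)
        if self.skip_depth == 0 and text.strip():
            self.parts.append(escape(text, quote=False))


def clean_html(html: str) -> str:
    """Return the HTML with scripts, styles, comments, and blank text removed."""
    parser = _Cleaner()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


sample = (
    '<div id="item">\n'
    "  <script>track()</script>\n"
    '  <h2 class="title">MacBook Pro M3</h2>\n'
    "  <p>Price: <b>$2,499.99</b> USD</p>\n"
    "  <!-- promo banner -->\n"
    "</div>"
)
print(clean_html(sample))
```

Cleaning like this mainly reduces input tokens; whether it changes extraction quality is something to measure on your own pages.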

Why Prompt-Only Extraction Drifts in Production

A prompt is asked to do two jobs at once. It defines the shape of the output, and it carries the instructions about what to extract and how. Structured Outputs lets you lock the first job down. The second job is the one that drifts.

There are four common sources of drift. The first is prompt tuning: someone tweaks the wording to fix one stubborn page, and the change quietly alters behavior on a thousand pages nobody re-checked. The second is silent model upgrades: the provider ships a new snapshot, your prompt now lands slightly differently, and an extraction that was correct last week is subtly wrong this week. The third is edge cases the prompt never anticipated, like sale-versus-list pricing, multi-currency, or fields that simply aren't on some pages. The fourth is orchestration: as the task grows, the prompt accumulates clauses and exceptions until no one can reason about what it will do.

A concrete version makes this easier to see. Suppose your prompt says "extract the product price." On most pages that is unambiguous. Then you hit a page with a list price struck through next to a sale price, a page that bundles two items under one price, and a page where the price only appears inside a currency-formatted data attribute. Your prompt never said which of those counts as "the price," so the model guesses, and it may guess differently on similar pages or after the next model update. You can patch the prompt to handle sale prices, but now the clause that fixed sales subtly changes how bundles are read. Each fix ripples.

Be precise about what this is and isn't. This is not "ChatGPT can't produce valid JSON." With Structured Outputs the JSON is valid and schema-shaped. The problem is that valid JSON can still hold the wrong value, and a prompt-shaped interface gives you no stable contract for which values are right across pages and across time.
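One way to make that distinction concrete is a small regression check against hand-verified records. The sketch below is illustrative only: the GOLDEN data, the page id, and the two example outputs are hypothetical, and the shape check is a simplified stand-in for real schema validation. Both outputs pass the shape check, but only one matches the verified value.

```python
import json

# Hand-verified expected records, keyed by page id (hypothetical data).
GOLDEN = {
    "page-1": {"name": "MacBook Pro M3", "price": 2499.99},
}


def has_expected_shape(record: dict) -> bool:
    """Shape check only: the right keys with the right types."""
    return (
        set(record) == {"name", "price"}
        and isinstance(record["name"], str)
        and isinstance(record["price"], (int, float))
        and not isinstance(record["price"], bool)
    )


def drifted_fields(page_id: str, raw_json: str) -> list[str]:
    """Return the fields whose value differs from the hand-verified record."""
    record = json.loads(raw_json)
    expected = GOLDEN[page_id]
    return sorted(k for k in expected if record.get(k) != expected[k])


# Two runs over the same page: one picked the list price, one the sale price.
list_price_run = '{"name": "MacBook Pro M3", "price": 2499.99}'
sale_price_run = '{"name": "MacBook Pro M3", "price": 2199.99}'

for label, raw in [("list", list_price_run), ("sale", sale_price_run)]:
    shape_ok = has_expected_shape(json.loads(raw))
    print(label, "shape ok:", shape_ok, "drifted:", drifted_fields("page-1", raw))
```

Running a check like this after every prompt edit or model snapshot change is a cheap way to notice drift that schema validation alone will not flag.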

Figure 2: Prompt-as-Interface vs Schema-as-Interface. A prompt makes one channel carry both shape and instructions, so the instruction half drifts. A schema fixes the shape as a contract.

JSON Mode and Structured Outputs: What They Fix and What They Don't

JSON mode guarantees that the response parses as JSON. Its stricter successor, Structured Outputs, also guarantees that the response conforms to the schema you supply. Both remove a real class of bugs, and Structured Outputs is worth using whenever you call a general LLM.

What they do not guarantee is field semantics, consistency across model versions, or that your prompt still means what you intended after a snapshot change. The shape is enforced; the meaning is not. That gap is the entire argument for moving the interface from a prompt to a schema.

The Schema as the Interface: How Schematron Works

Schematron is a family of long-context models trained specifically to read messy HTML and emit JSON that conforms to a schema you provide. The thing that makes it different from web scraping with ChatGPT is what it does not have: there is no system or user prompt. The model uses only the schema to decide what to extract. The schema is the interface, which means there is no prompt to drift.

Because the output is driven entirely by your JSON Schema or typed model, Schematron runs in strict JSON mode and is documented as producing 100% schema-conforming output, while matching frontier-model quality at a lower cost. You drive it with the schema alone, and the workflow is the same fetch, clean, define, extract, validate loop from earlier, with a schema-only call in the middle.

This also reframes the price-ambiguity problem from earlier. You don't argue with a prompt about what counts as the price; you encode it once in the field, with a description like "primary list price in the page's displayed currency," and that decision becomes a stable part of the contract rather than a clause you keep re-tuning. When the rule needs to change, you change the field description, not a paragraph of prose whose side effects you can't predict.
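As a rough illustration of encoding that decision in the field, here is a hand-written JSON Schema expressed as a Python dict. The field names list_price and sale_price, the split into two fields, and the description wording are assumptions for this sketch, not taken from Schematron's documentation.

```python
import json

# A hand-written JSON Schema for the product record (illustrative only).
PRODUCT_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "list_price": {
            "type": "number",
            "description": "Primary list price in the page's displayed currency.",
        },
        "sale_price": {
            "type": ["number", "null"],
            "description": "Discounted price if one is shown, otherwise null.",
        },
    },
    "required": ["name", "list_price", "sale_price"],
    "additionalProperties": False,
}

# The rule for "which price" now lives in one reviewable place.
print(json.dumps(PRODUCT_SCHEMA["properties"]["list_price"], indent=2))
```

Splitting the ambiguous "price" into two explicitly described fields means a change to the rule shows up as a one-line diff in the schema, which is easier to review than an edited prompt paragraph.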

A Small Example: Messy HTML to Typed JSON

Start with a small, realistic chunk of product HTML: a heading, an inline price with currency, a couple of spec list items, and some tag spans.

<div id="item">
  <h2 class="title">MacBook Pro M3</h2>
  <p>Price: <b>$2,499.99</b> USD</p>
  <ul info>
    <li>RAM: 16GB</li>
    <li>Storage: 512GB SSD</li>
  </ul>
  <span class="tag">laptop</span>
  <span class="tag">professional</span>
  <span class="tag">macbook</span>
  <span class="tag">apple</span>
</div>

Defining the Schema (Pydantic and Zod)

Next, define the record you actually want. Because Schematron has no prompt, the field descriptions are where guidance lives: a description like "primary price of the product" is how you tell the model what to do. Here is the same product schema in Pydantic and in Zod.

from pydantic import BaseModel, Field

# Define your schema (nested data and lists are supported)
class Product(BaseModel):
    name: str
    price: float = Field(
        ...,
        description="Primary price of the product.",
    )
    specs: dict[str, str] = Field(
        default_factory=dict,
        description="Specs of the product.",
    )
    tags: list[str] = Field(
        default_factory=list,
        description="Tags assigned to the product.",
    )

import { z } from "zod";

// Define your schema (nested data and lists are supported)
const Product = z.object({
  name: z.string(),
  price: z
    .number()
    .describe(
      "Primary price of the product."
    ),
  specs: z
    .record(z.string())
    .describe(
      "Specs of the product."
    ),
  tags: z
    .array(z.string())
    .describe("Tags assigned to the product."),
});

Calling Schematron (No Prompt)

The call uses the OpenAI SDK with no prompt at all. You point the client at the Inference base URL with your API key, pass the HTML as the only message content, hand the schema to response_format, and keep the temperature at zero, which is what the model is trained for.

import os

from openai import OpenAI

from product_schema import Product

# The HTML you fetched and cleaned (see messy-html.html)
with open("messy-html.html") as f:
    html = f.read()

# Client setup: OpenAI-compatible API pointed at Inference
client = OpenAI(
    base_url="https://api.inference.net/v1",
    api_key=os.environ.get("INFERENCE_API_KEY"),
)

# No system or user prompt: the schema is the interface.
resp = client.beta.chat.completions.parse(
    model="inference-net/schematron-v2-small",
    messages=[
        {"role": "user", "content": html},
    ],
    response_format=Product,
    temperature=0,
)

print(resp.choices[0].message.parsed.model_dump_json(indent=2))

Representative Output and Validation

For the HTML above, the model returns JSON that conforms to the schema.

{
  "name": "MacBook Pro M3",
  "price": 2499.99,
  "specs": {
    "RAM": "16GB",
    "Storage": "512GB SSD"
  },
  "tags": [
    "laptop",
    "professional",
    "macbook",
    "apple"
  ]
}

Even though the output is schema-conforming by design, validate it on ingest with the same Pydantic model and handle errors explicitly. Validation is cheap insurance, and it is where you catch the upstream surprises, like a page whose structure quietly changed overnight.

from pydantic import ValidationError

from product_schema import Product

# `raw` is the JSON string returned by Schematron.
raw = resp.choices[0].message.content  # noqa: F821 (resp from schematron_call.py)

try:
    # Validate on ingest even though strict JSON mode returns schema-conforming output.
    product = Product.model_validate_json(raw)
    store(product)  # noqa: F821 (your persistence function)
except ValidationError as err:
    # Catch upstream surprises (e.g. a page whose structure changed) explicitly.
    log_extraction_error(raw, err)  # noqa: F821 (your error sink)

Once a single page works, the next decision is which Schematron model to run at volume and what it will cost.

Schematron V2 Turbo is the natural pick when throughput and cost matter, and it is the model most extraction pipelines settle on after the prototype proves out.

Cost Comparison for Page Extraction

Extraction cost is driven by tokens, so the model's per-token price is most of the story. Take a representative page of 10,000 input tokens after cleaning and 500 output tokens, which is the same shape used in Schematron's own throughput tests.

At that mix, Schematron V2 Small costs about $0.000625 per page, or roughly $0.63 per 1,000 pages. Schematron V2 Turbo costs about $0.000375 per page, or roughly $0.38 per 1,000 pages. Turbo comes out around 40% cheaper at this token mix, which is why it is the default for high-volume jobs, while Small is worth the premium when the schema is complex or the pages are unusually long.

Model	Input ($/M)	Output ($/M)	Best for
Schematron V2 Small	$0.05	$0.25	Complex schemas, long pages
Schematron V2 Turbo	$0.03	$0.15	High throughput at scale

Prices per million tokens, from the current model pages. Turbo trades a little quality for throughput and cost.
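The per-page figures follow directly from the per-token rates in the table. A minimal sketch of the arithmetic, assuming the 10,000-input, 500-output token mix used in this section (the function and dictionary names are just for illustration):

```python
# Per-million-token prices from the table above (USD).
PRICES = {
    "schematron-v2-small": {"input": 0.05, "output": 0.25},
    "schematron-v2-turbo": {"input": 0.03, "output": 0.15},
}


def cost_per_page(model: str, input_tokens: int = 10_000, output_tokens: int = 500) -> float:
    """Cost in USD of one extraction at the given token counts."""
    price = PRICES[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000


for model in PRICES:
    per_page = cost_per_page(model)
    print(f"{model}: ${per_page:.6f} per page, ${per_page * 1000:.3f} per 1,000 pages")

# Relative saving of Turbo over Small at this token mix.
saving = 1 - cost_per_page("schematron-v2-turbo") / cost_per_page("schematron-v2-small")
print(f"Turbo saving: {saving:.0%}")
```

Plugging in your own average token counts is worthwhile, since pages that are much longer or shorter than 10,000 tokens shift the absolute numbers, though not the ratio between the two models while their input and output prices keep the same proportion.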

Figure 3: Extraction Cost per 1,000 Pages - 10,000 input + 500 output tokens per page.

The throughput difference is just as real. On a single H100 with the same 10,000-in, 500-out workload, Turbo runs at 4.14 requests per second against Small's 2.47. A general-purpose GPT model can do this job too, but you are paying frontier-model rates for a task a purpose-built model handles at a fraction of the cost. That tradeoff is worth measuring directly by comparing general LLMs with task-specific extraction models on quality.
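To put those rates in terms of a nightly job, the sketch below estimates wall-clock time for the 50,000-page run from the introduction. It assumes a single H100 sustaining the quoted requests per second for the whole run, which is a simplification; real throughput depends on page length, batching, and how the service is deployed.

```python
PAGES = 50_000  # nightly volume from the introduction

# Requests per second on a single H100, 10k input + 500 output tokens.
REQUESTS_PER_SECOND = {
    "schematron-v2-small": 2.47,
    "schematron-v2-turbo": 4.14,
}


def hours_for(pages: int, requests_per_second: float) -> float:
    """Wall-clock hours to process `pages` at a sustained request rate."""
    return pages / requests_per_second / 3600


for model, rps in REQUESTS_PER_SECOND.items():
    print(f"{model}: {hours_for(PAGES, rps):.2f} hours for {PAGES:,} pages")
```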

Quality does not get sacrificed to hit those prices. On an LLM-as-judge evaluation where GPT-5.4 grades extractions on a one-to-five scale, V2 Small scores 4.060 and V2 Turbo 4.039, close to the previous-generation 8B model and ahead of the smaller prior model. On a factuality benchmark, both V2 models score above 79 and beat DeepSeek V3.2 and GPT-5.4 Nano. The point of the cost section is not that cheaper is automatically better; it is that for the specific job of turning HTML into typed JSON, a model trained for it gives you frontier-class adherence and accuracy without frontier-class pricing.

Figure 4: Schematron V2 Throughput - Single H100, 10k input + 500 output tokens.

Migrating from a Prompt to a Schema

If you already have a working extraction prompt, moving to a schema-driven call is mostly mechanical. Because both speak the OpenAI-compatible API, the change is the base URL, the model id, and dropping the prompt.

1. List the fields your prompt asks for and turn each one into a schema field with a type.
2. Move every instruction in your prompt into the description of the field it governs, since descriptions are how Schematron receives guidance.
3. Leave your fetch and HTML-cleaning steps exactly as they are.
4. Point the OpenAI client at https://api.inference.net/v1 and set the model to inference-net/schematron-v2-turbo.
5. Remove the prompt: pass the HTML as the message content and the schema as response_format.
6. Validate the response with the same typed model you defined in step one.
7. When you have many pages to process, move them to the Batch API, which accepts up to 50,000 jobs at once with optional webhook updates.
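The request-level part of the steps above can be sketched as a plain dictionary transformation, without calling any API. The helper to_schema_only and the example old_request are hypothetical. The response_format layout follows the OpenAI Chat Completions json_schema format; that the Inference endpoint accepts the same layout is an assumption to verify against its documentation.

```python
import copy

SCHEMATRON_MODEL = "inference-net/schematron-v2-turbo"


def to_schema_only(old_request: dict, html: str) -> dict:
    """Turn a prompt-based request into a schema-only one.

    Keeps response_format, swaps the model id, and replaces every prompt
    message with a single user message that carries only the HTML.
    """
    new_request = copy.deepcopy(old_request)
    new_request["model"] = SCHEMATRON_MODEL
    new_request["messages"] = [{"role": "user", "content": html}]
    new_request["temperature"] = 0
    return new_request


# A prompt-based request as it might look before migration (hypothetical).
old_request = {
    "model": "gpt-4o",
    "messages": [
        {"role": "system", "content": "Extract the product name and price as JSON."},
        {"role": "user", "content": "<h2>MacBook Pro M3</h2><p>$2,499.99</p>"},
    ],
    "response_format": {
        "type": "json_schema",
        "json_schema": {
            "name": "product",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    # The prompt's instruction now lives in the field description.
                    "price": {"type": "number", "description": "Primary list price."},
                },
                "required": ["name", "price"],
                "additionalProperties": False,
            },
        },
    },
}

new_request = to_schema_only(old_request, "<h2>MacBook Pro M3</h2><p>$2,499.99</p>")
print([m["role"] for m in new_request["messages"]], new_request["model"])
```

With a client whose base URL points at https://api.inference.net/v1, the resulting dictionary could then be passed as keyword arguments to client.chat.completions.create, with the response validated as in the earlier ingest example.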

If you want the full version of this with fetching and cleaning included, the companion walkthrough builds a full Python pipeline that fetches, cleans, and validates end to end.

When Schematron Is Not the Right Layer

A schema-driven extractor is the right tool for a narrow, valuable job, and the wrong tool for several others. The boundaries are worth being honest about.

Reasoning and synthesis tasks are where a general LLM wins. If you need to summarize the sentiment of reviews, judge whether a listing looks fraudulent, rank pages by quality, or make any call that requires open-ended judgment, that is reasoning a schema cannot express, and ChatGPT is the better choice. Schematron is an extraction layer, not a reasoning engine.

The same goes for quick one-offs, where the setup cost of defining a schema isn't worth it and the chat window is simply faster. It goes for fetching, crawling, and anti-bot work, which belong to the proxy and browser layer. And it goes for any task that genuinely needs free-text instructions beyond what a field description can carry, because there is no prompt channel to put them in. When the work is structure, reach for a schema; when the work is judgment, reach for the general model.

ChatGPT vs Schematron: Which to Use When

Here is the comparison in one place. Web scraping with ChatGPT and schema-driven extraction are not really competitors so much as tools for different stages and different jobs.

Dimension	ChatGPT (prompt + Structured Outputs)	Schematron (schema-only)
Interface	Prompt + optional schema	Schema only, no prompt
JSON validity	Valid with strict mode	100% schema-conforming
Drift risk	Higher (prompt + model updates)	Lower (schema is the contract)
Instructions / reasoning	Strong	None (schema fields only)
Cost at scale	Frontier-model rates	$0.03–$0.05 / M input
Best fit	Prototyping, synthesis	Production extraction

Cost shown as input-token rates; see the pricing table for full per-token rates.

So the honest answer to "what is the best AI for web scraping" is that it depends on the job: there is no single winner, only a better tool for each stage. The practical recommendation is simple. Prototype with ChatGPT because it is fast and forgiving, then harden the part that runs in production by moving the interface from a prompt to a schema. Keep the general LLM around for the synthesis and judgment tasks it is actually good at.

Conclusion

Prompt-based extraction is the right way to start and the wrong way to scale. A prompt has to define a shape and carry instructions at the same time, and the instruction half drifts as you tune it, as the model updates, and as edge cases pile up. Structured Outputs fixes the shape, but only a schema-as-interface fixes the contract. Prototype with prompts, ship on schemas, and keep the general model for the work that needs judgment instead of structure.

The fastest way to feel the difference is to take one brittle extraction prompt and replace it with a JSON Schema.

AI Web Scraping in Python: From HTML to Validated JSON Schema — the full Python pipeline that fetches, cleans, and validates.
Best LLM for Data Extraction — how general LLMs compare to task-specific extraction models on quality.
HTML-to-JSON Extraction API — typed extraction as a product category, not generic conversion.
Automated Data Extraction from Websites: Architecture, Schema Design, and Validation — the architecture, schema, and validation patterns for running extraction beyond a one-off chat session.