Jun 17, 2026
Automated Data Extraction from Websites: Architecture, Schema Design, and Validation
Inference Research
What Automated Data Extraction Actually Requires
A prompt that pulls clean fields out of one product page feels like a solved problem. Run the same prompt across a thousand pages, a few site redesigns, and a quarter's worth of traffic, and it stops being a solution and starts being an incident. The page that extracted perfectly last week now returns a null where the price used to be, and nothing in your logs says anything is wrong.
Automated data extraction from websites is the practice of turning web pages into typed, validated records on a repeatable schedule, without a human checking each page. Automatic data extraction in this sense is an architecture, not a single API call. Fetching is the easy half now; it is mostly commoditized. The hard half is the layer that turns messy HTML into data a downstream system can trust. That layer is what this article builds.
The shift matters now because the tooling moved. Through 2026, AI spread across the entire extraction lifecycle, from planning and crawling to extraction and validation, and the work moved with it: less time writing and maintaining selectors, more time specifying intent and supervising automated extraction. The interesting design decisions are no longer about CSS paths. They are about schemas, validation gates, and what happens when extraction fails.
Done well, data extraction automation is not a clever prompt; it is a pipeline with checkpoints. A production extraction job has to do six things a one-off prompt does not. It has to be repeatable, so the same input produces the same record. It has to return typed output, not free text. It has to validate that output before anything downstream consumes it. It has to handle failure deliberately, not silently. It has to notice when the upstream site changes. And it has to scale to the volume you actually run.
Selector-based scraping fails most of these quietly. The dangerous case is silent selector degradation: when a target site changes its markup, the selector still matches something, so the field is captured as a blank or a default and the error log stays silent. Schema-guided extraction moves the contract from a brittle DOM path to a typed schema, which is both the instruction to the model and the thing you validate against.
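To make silent selector degradation concrete, here is a minimal sketch using only the standard library. The two page snippets, the class names, and both function names are hypothetical; the point is only that the same selector keeps matching after a redesign and a "safe" default hides the break.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup before a redesign.
OLD_PAGE = '<div><h1>Desk Lamp</h1><span class="price">19.99</span></div>'
# Hypothetical markup after a redesign: the old element survives as an empty
# wrapper and the value moves to a new class name.
NEW_PAGE = (
    '<div><h1>Desk Lamp</h1><span class="price"></span>'
    '<span class="price-value">19.99</span></div>'
)

PRICE_SELECTOR = ".//span[@class='price']"


def scrape_price(page: str) -> float:
    """Selector-style extraction with a default, as many scrapers do."""
    root = ET.fromstring(page)
    text = root.findtext(PRICE_SELECTOR) or ""
    # A blank match quietly becomes 0.0; nothing is logged.
    return float(text) if text.strip() else 0.0


def scrape_price_checked(page: str) -> float:
    """Same selector, but a blank match is treated as a failure."""
    root = ET.fromstring(page)
    text = (root.findtext(PRICE_SELECTOR) or "").strip()
    if not text:
        raise ValueError("price selector matched nothing usable")
    return float(text)
```

On the old page both functions return 19.99. On the new page the first returns 0.0 without complaint, while the second raises, which is the behavior a validation gate is meant to enforce.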
A Reference Architecture for Automated Extraction
Before any code, it helps to see the whole pipeline. An automated extraction job is a sequence of stages, each with one job, and the model is only one of them.
Figure 1: Automated Extraction Pipeline
The stages are:
- Source the HTML, from a live crawler, a stored snapshot, or a cached page.
- Preprocess it, cleaning out scripts, styles, and boilerplate.
- Define the schema for the record you want, as a typed model or JSON Schema.
- Extract by sending the cleaned HTML and the schema to a schema-guided model.
- Validate the returned JSON against the schema on ingest.
- Retry failures with the validation error fed back, then route the unrecoverable ones to a review queue.
- Store the typed records.
- Monitor for drift in the data you are storing.
- Scale recurring, high-volume work onto a batch or async path.
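The stages above can be sketched as one small driver function. This is an illustrative skeleton, not a prescribed implementation: `run_pipeline`, `PipelineResult`, and the three callables are hypothetical names, and sourcing, storage, monitoring, and batching are left to the caller.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, Optional


@dataclass
class PipelineResult:
    stored: list = field(default_factory=list)  # records that passed validation
    review: list = field(default_factory=list)  # (raw_html, last_error) pairs


def run_pipeline(
    pages: Iterable[str],
    preprocess: Callable[[str], str],
    extract: Callable[[str, Optional[str]], dict],
    validate: Callable[[dict], list],
    max_attempts: int = 3,
) -> PipelineResult:
    """Preprocess, extract, validate, retry with the error, else escalate."""
    result = PipelineResult()
    for raw_html in pages:
        cleaned = preprocess(raw_html)
        error: Optional[str] = None
        for _ in range(max_attempts):
            record = extract(cleaned, error)  # error is None on the first try
            problems = validate(record)       # empty list means valid
            if not problems:
                result.stored.append(record)
                break
            error = "; ".join(problems)
        else:
            # Every attempt failed validation: route to review, never drop.
            result.review.append((raw_html, error))
    return result
```

Keeping each stage behind a plain callable is what lets you swap the crawler, the cleaner, or the model without touching the rest of the flow.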
The important boundary is between fetching and extracting. Sourcing pages is one problem; turning them into typed records is another. Schematron, the schema-guided extraction model used in the examples here, lives only in the extraction step. It is not a crawler, a proxy network, or a browser-automation layer, and keeping it in its lane is what makes the rest of the architecture composable.
Sources: Crawler Output, HTML Snapshots, and Cached Pages
The input side of the pipeline has three common shapes, and the choice affects everything downstream.
Live crawler output is the freshest but the least reproducible: by the time you debug a bad extraction, the page may have changed. Stored HTML snapshots trade freshness for replayability. If you keep the raw HTML you fetched, you can re-run extraction with a new schema or a fixed bug without re-crawling, and you can test a schema change against last month's pages before you ship it. Cached pages sit in between, useful when you re-process the same corpus repeatedly.
Decoupling fetch from extract is the single most useful structural decision here. It lets the crawl layer (your existing stack, or a dedicated crawler) evolve independently from the extraction layer, and it gives you a fixed corpus to validate against. If you want the full fetch-and-clean walkthrough in Python, the companion piece on building an AI web scraping pipeline in Python covers the retrieval side in detail.
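One way to get that fixed corpus is to write every fetched page to disk before extraction and replay from there. The sketch below assumes a simple layout of one folder per URL hash and one file per fetch timestamp; `save_snapshot`, `replay`, and the layout itself are illustrative choices, not part of any library.

```python
import hashlib
import json
import time
from pathlib import Path
from typing import Callable, Optional


def snapshot_key(url: str) -> str:
    """Stable, filesystem-safe folder name for a URL."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]


def save_snapshot(
    root: Path, url: str, html: str, fetched_at: Optional[float] = None
) -> Path:
    """Store raw HTML under root/<url-hash>/<unix-timestamp>.html."""
    ts = int(fetched_at if fetched_at is not None else time.time())
    folder = root / snapshot_key(url)
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / f"{ts}.html"
    path.write_text(html, encoding="utf-8")
    # Keep the original URL next to the snapshots, since the folder is a hash.
    (folder / "meta.json").write_text(json.dumps({"url": url}), encoding="utf-8")
    return path


def replay(root: Path, extract: Callable[[str], dict]) -> list:
    """Re-run an extraction function over every stored snapshot, no re-crawl."""
    records = []
    for path in sorted(root.glob("*/*.html")):
        records.append(extract(path.read_text(encoding="utf-8")))
    return records
```

With this in place, testing a schema change against last month's pages is a call to `replay` with the new extraction function.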
Whatever the source, clean the HTML before extraction. Schematron was trained on HTML that had been pre-cleaned with lxml to strip scripts, styles, and inline JavaScript, so aligning your preprocessing with that training tends to improve quality. Alternatives like Readability, Trafilatura, BeautifulSoup, or targeted regex are fine; the guidance is simply to err on the side of removing less content, because the model can still find what it needs in the noise.
import lxml.html as LH
# Note: since lxml 5.2 the cleaner ships as a separate package; install
# lxml_html_clean (or lxml[html_clean]) for this import to work.
from lxml.html.clean import Cleaner

HTML_CLEANER = Cleaner(
    scripts=True,
    javascript=True,
    style=True,
    inline_style=True,
    safe_attrs_only=False,
)


def strip_noise(html: str) -> str:
    """Remove scripts, styles, and JavaScript from HTML using lxml."""
    if not html or not html.strip():
        return ""
    try:
        doc = LH.fromstring(html)
        cleaned = HTML_CLEANER.clean_html(doc)
        return LH.tostring(cleaned, encoding="unicode")
    except Exception:
        # Unparseable input yields an empty string; the caller decides what to do.
        return ""

Designing the Extraction Schema
In a schema-guided pipeline, the schema is not just validation. It is the only channel you have to tell the model what to extract. Schematron takes no system or user prompt; it reads the schema and returns conforming JSON. That makes schema design the place where most of your extraction quality is won or lost.
Two habits pay off. First, write field descriptions as if they were instructions, because they are. A field called price with the description "Primary price of the product" extracts more reliably than a bare price: float, and ambiguous fields that need any interpretation must say so explicitly, since there is no prompt to fall back on. Second, model the record you actually want, including nested objects and lists, rather than mirroring the page's DOM. The goal is typed business data, not a transcription of the HTML.
Here is a web-specific schema for a product record, defined as a Pydantic model. The same shape works in Zod for TypeScript pipelines. This is the typed JSON you want to turn messy pages into, expressed as a contract.
from pydantic import BaseModel, Field


# The schema is the instruction. Field descriptions tell Schematron what to extract;
# there is no prompt, so anything ambiguous must be stated here.
class Product(BaseModel):
    name: str = Field(
        ...,
        description="Exact product name as shown in the title or primary heading.",
    )
    price: float = Field(
        ...,
        description="Primary price of the product.",
    )
    specs: dict[str, str] = Field(
        default_factory=dict,
        description="Specs of the product.",
    )
    tags: list[str] = Field(
        default_factory=list,
        description="Tags assigned to the product.",
    )

Required versus optional matters here too. Mark a field required only if a record is useless without it, because required fields are what your validation gate will reject on. Optional fields with sensible defaults keep partial records flowing instead of dumping every imperfect page into the review queue.
The Extraction Step: Schema-Guided Model Calls
With a schema in hand, the extraction call itself is small. Schematron is OpenAI-compatible, so you point the SDK at https://api.inference.net/v1, pass the cleaned HTML as the user message, and hand the schema in as the response format. Keep the temperature at zero; the model is trained to perform best there.
import os

from openai import OpenAI

from product_schema import Product

html = """
<div id="item">
  <h2 class="title">MacBook Pro M3</h2>
  <p>Price: <b>$2,499.99</b> USD</p>
  <ul info>
    <li>RAM: 16GB</li>
    <li>Storage: 512GB SSD</li>
  </ul>
  <span class="tag">laptop</span>
  <span class="tag">professional</span>
</div>
"""

client = OpenAI(
    base_url="https://api.inference.net/v1",
    api_key=os.environ.get("INFERENCE_API_KEY"),
)

resp = client.beta.chat.completions.parse(
    model="inference-net/schematron-v2-small",
    messages=[
        {"role": "user", "content": html},
    ],
    response_format=Product,
    temperature=0,  # the text above recommends temperature zero
)

print(resp.choices[0].message.parsed.model_dump_json(indent=2))

There are two models behind the same call. inference-net/schematron-v2-small is the higher-quality option for complex schemas and very long pages, and inference-net/schematron-v2-turbo is tuned for throughput and the lowest cost at scale. You select between them with the model string and nothing else, which makes it easy to start on one and move to the other. The schema-guided extraction model documentation covers both, along with the best-practice details.
For the product schema above, the call returns strictly valid JSON that conforms to the schema. A representative response:
{
  "name": "MacBook Pro M3",
  "price": 2499.99,
  "specs": {
    "RAM": "16GB",
    "Storage": "512GB SSD"
  },
  "tags": [
    "laptop",
    "professional",
    "macbook",
    "apple"
  ]
}

The output is schema-compliant by construction: the model runs in strict JSON mode with 100% schema adherence, so you get the shape you asked for rather than prose you have to parse. That property is what makes the next stage, validation, fast instead of forgiving.
If high-throughput extraction is where you are headed, Schematron V2 Turbo is the model built for it.
Validation, Retries, and Review Queues
A schema-compliant response is not the same as a correct one. The JSON will have the right shape, but a price can still come back as zero, a required name can be empty, or a field can hold a plausible-but-wrong value. The validation gate is what turns "the model returned JSON" into "the record is safe to store."
Validate on ingest, every time, even though the model reliably returns valid JSON. Parsing the response back through the same Pydantic model that defined the schema catches type mismatches and missing required fields for free, and it is cheap relative to the cost of a bad record propagating downstream.
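Shape checks do not cover the zero price or the empty name, so a second pass of business rules is worth adding on top of the Pydantic parse. The sketch below works on a plain dict so it stays self-contained; `semantic_problems` and the `MAX_PLAUSIBLE_PRICE` ceiling are hypothetical and should be replaced with rules that fit your own catalog.

```python
from typing import Any

# Hypothetical sanity ceiling; pick one that fits your catalog.
MAX_PLAUSIBLE_PRICE = 100_000.0


def semantic_problems(record: dict) -> list:
    """Return human-readable problems that a shape-only check would miss."""
    problems = []

    name: Any = record.get("name")
    if not isinstance(name, str) or not name.strip():
        problems.append("name is empty")

    price: Any = record.get("price")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(price, bool) or not isinstance(price, (int, float)):
        problems.append("price is not a number")
    elif price <= 0:
        problems.append(f"price must be positive, got {price}")
    elif price > MAX_PLAUSIBLE_PRICE:
        problems.append(f"price {price} exceeds plausible maximum")

    return problems
```

An empty list means the record is safe to store. A non-empty list doubles as the error text to feed back on a retry.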
Figure 2: Validation, Retry, and Review Queue
When validation fails, retry with the error fed back into the next attempt. Including the specific validation error gives the model the context to correct itself, and a small cap keeps a stubborn page from looping forever; libraries like instructor default to a maximum of three attempts. Past that cap, do not silently accept the record and do not silently drop it. Route it to a human review queue, which is where ambiguous fields, missing required values, and genuinely malformed pages belong.
from pydantic import ValidationError

from product_schema import Product

MAX_ATTEMPTS = 3  # instructor defaults to a maximum of 3 retries


class ReviewQueue(Exception):
    """Unrecoverable record: route to a human review queue."""

    def __init__(self, html: str, error: str | None):
        self.html = html
        self.error = error
        super().__init__(error)


def extract(html: str, last_error: str | None = None) -> str:
    """Call Schematron and return the raw JSON string.

    On a retry, last_error is the previous validation message; append it to the
    request context so the model can correct itself. See schematron_call.py for
    the underlying OpenAI-compatible call.
    """
    ...


def extract_validated(html: str) -> Product:
    """Extract a Product, retrying on validation failure, else escalate to review."""
    last_error: str | None = None
    for _ in range(MAX_ATTEMPTS):
        raw = extract(html, last_error=last_error)
        try:
            return Product.model_validate_json(raw)
        except ValidationError as err:
            last_error = str(err)
    # Retries exhausted: do not silently accept or drop the record.
    raise ReviewQueue(html=html, error=last_error)

The review queue is not an admission of failure; it is the pressure-release valve that lets the automated path stay automated. A job that escalates one percent of pages and ingests the rest cleanly is far more useful than one that either blocks on every edge case or quietly corrupts your dataset.
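On the calling side, escalation can be as simple as catching the exception and appending the page to a file a human works through. This is a sketch under assumptions: `ReviewNeeded` is a stand-in for the exception raised once retries are exhausted, and `run_job` and the JSON Lines queue file are hypothetical choices rather than a required design.

```python
import json
from pathlib import Path
from typing import Callable, Iterable, Tuple


class ReviewNeeded(Exception):
    """Stand-in for the exception raised once retries are exhausted."""


def run_job(
    pages: Iterable[Tuple[str, str]],
    extract_one: Callable[[str], dict],
    review_file: Path,
) -> dict:
    """Ingest clean records, append unrecoverable pages to a review file."""
    stored = []
    escalated = 0
    with review_file.open("a", encoding="utf-8") as queue:
        for url, html in pages:
            try:
                stored.append(extract_one(html))
            except ReviewNeeded as err:
                escalated += 1
                entry = {"url": url, "error": str(err), "html": html}
                queue.write(json.dumps(entry) + "\n")
    total = len(stored) + escalated
    return {
        "stored": stored,
        "escalated": escalated,
        "escalation_rate": escalated / total if total else 0.0,
    }
```

Tracking the escalation rate per run also gives you an early drift signal: a queue that suddenly fills up usually means a source changed.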
Monitoring Extraction Drift
Validation catches a bad record. Drift monitoring catches a bad week. Schema drift is what happens upstream over time: new fields appear, existing fields move, and types shift, like a price that used to be a number arriving as a string.
Schema-guided extraction is more resistant to layout changes than selectors, because the model reads meaning rather than DOM position, but it does not make drift monitoring optional. You still watch the data you are producing. Three signals cover most failures: per-field completeness, so you notice when a field's null or blank rate jumps; type stability, so you catch a field that starts arriving in a different shape; and value distribution, so an unusual swing in extracted values raises a flag.
The point of all three is to catch a break from the data side before a downstream consumer complains. A completeness alert on the price field is how you learn a retailer redesigned its product template, days before someone notices the dashboard went flat.
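The three signals can be computed from the stored records alone. The following is a rough sketch comparing a current window against a baseline window; the function names and both thresholds (`max_drop`, `max_median_shift`) are illustrative assumptions that you would tune per field.

```python
from statistics import median
from typing import Any


def _is_blank(value: Any) -> bool:
    return value is None or value == "" or value == [] or value == {}


def _kind(value: Any) -> str:
    """Coarse type label; int and float both count as 'number'."""
    if isinstance(value, bool):
        return "bool"
    if isinstance(value, (int, float)):
        return "number"
    return type(value).__name__


def completeness(records: list, name: str) -> float:
    """Share of records where the field is present and non-blank."""
    if not records:
        return 0.0
    return sum(not _is_blank(r.get(name)) for r in records) / len(records)


def kinds(records: list, name: str) -> set:
    return {_kind(r[name]) for r in records if not _is_blank(r.get(name))}


def numbers(records: list, name: str) -> list:
    return [r[name] for r in records if _kind(r.get(name)) == "number"]


def drift_alerts(baseline, current, fields, max_drop=0.10, max_median_shift=0.5):
    """Compare two windows of records and describe anything that moved."""
    alerts = []
    for name in fields:
        before, after = completeness(baseline, name), completeness(current, name)
        if before - after > max_drop:
            alerts.append(f"{name}: completeness fell from {before:.0%} to {after:.0%}")
        new_kinds = kinds(current, name) - kinds(baseline, name)
        if new_kinds:
            alerts.append(f"{name}: new value types {sorted(new_kinds)}")
        old, new = numbers(baseline, name), numbers(current, name)
        if old and new and median(old):
            shift = abs(median(new) - median(old)) / abs(median(old))
            if shift > max_median_shift:
                alerts.append(f"{name}: median moved by {shift:.0%}")
    return alerts
```

Run it per source and per schema version, since a completeness drop that is invisible across the whole corpus can be total for one retailer.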
Batch and Async Processing at Scale
Synchronous calls are right for interactive or low-volume work. Once you are extracting from a large, recurring set of pages, the Batch API is the better path. It is built for exactly this kind of job: extracting structured data from a large number of documents is the first use case it lists.
The flow is asynchronous. You prepare a .jsonl file where each line is one request, with a custom_id, the method, the endpoint URL, and the request body. You upload that file, then create a batch against https://batch.inference.net/v1 with a 24-hour completion window. A single submission can carry up to 50,000 jobs, and a Group API handles smaller sets of 50 or fewer requests as one tracked job. The trade is latency for headroom: the Batch API runs with substantially higher rate limits than the synchronous endpoints, and each batch completes within the 24-hour window, often much sooner.
import os

from openai import OpenAI

# Batch API uses a different base URL from the synchronous API.
client = OpenAI(
    base_url="https://batch.inference.net/v1",
    api_key=os.environ["INFERENCE_API_KEY"],
)

# 1) Upload a .jsonl file; each line is one request:
# {"custom_id": "...", "method": "POST", "url": "/v1/chat/completions", "body": {...}}
with open("batchinput.jsonl", "rb") as f:
    batch_input_file = client.files.create(
        file=f,
        purpose="batch",
    )

# 2) Create the batch. completion_window can only be "24h".
batch = client.batches.create(
    input_file_id=batch_input_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
    metadata={
        "description": "nightly extraction job",
    },
    # Optional. Must be HTTPS.
    extra_body={
        "webhook_url": "https://example.com/my_webhook",
    },
)

print(batch)

Rather than poll for completion, register an HTTPS webhook callback so the pipeline is notified when a batch finishes. The full request lifecycle, file format, and status semantics are in the Batch API documentation. For a high-volume recurring pipeline, this is the part of the architecture worth getting right early, because it is where cost and throughput live.
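The .jsonl input file that the upload step expects can be generated with the standard library. The sketch below follows the line format described above and splits the work so no file exceeds the 50,000-job limit. The helper names are hypothetical, and the `response_format` body shown is an assumption modeled on the OpenAI-style JSON Schema format, so check it against the Batch API documentation before relying on it.

```python
import json
from pathlib import Path
from typing import Iterable, Tuple

MAX_JOBS_PER_BATCH = 50_000  # per-submission limit stated in the text above


def build_request_line(custom_id: str, html: str, schema: dict, model: str) -> str:
    """Serialize one extraction request as a single JSON line."""
    request = {
        "custom_id": custom_id,
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": model,
            "messages": [{"role": "user", "content": html}],
            "temperature": 0,
            # Assumption: OpenAI-style JSON Schema response format.
            "response_format": {
                "type": "json_schema",
                "json_schema": {"name": "record", "schema": schema},
            },
        },
    }
    return json.dumps(request)  # newlines inside html are escaped


def write_batch_files(
    pages: Iterable[Tuple[str, str]],
    schema: dict,
    model: str,
    out_dir: Path,
    max_jobs: int = MAX_JOBS_PER_BATCH,
) -> list:
    """Write (custom_id, html) pairs into files of at most max_jobs lines."""
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    handle = None
    count = 0
    try:
        for custom_id, html in pages:
            if handle is None or count == max_jobs:
                if handle is not None:
                    handle.close()
                path = out_dir / f"batchinput_{len(paths):03d}.jsonl"
                handle = path.open("w", encoding="utf-8")
                paths.append(path)
                count = 0
            handle.write(build_request_line(custom_id, html, schema, model) + "\n")
            count += 1
    finally:
        if handle is not None:
            handle.close()
    return paths
```

Using the page URL or snapshot key as `custom_id` makes it straightforward to join batch results back to their source pages.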
What It Costs: Schematron V2 Small vs Turbo
Cost in an extraction pipeline is dominated by the page itself: HTML is large, so input tokens vastly outnumber output tokens, and input pricing is what moves your bill. Both models are priced per million tokens.
| Model | Input ($/M) | Output ($/M) |
|---|---|---|
| Schematron V2 Small | $0.05 | $0.25 |
| Schematron V2 Turbo | $0.03 | $0.15 |
Turbo costs roughly 40 percent less on input than Small and processes pages faster, with a measured 2.5x throughput improvement over the previous-generation 8B model on a single H100. The natural worry is that the cheaper, faster model gives up too much quality. The published benchmarks say it gives up very little.
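The arithmetic behind a job estimate is short enough to write down. The prices below come from the table above; the page count and per-page token counts in the usage note are hypothetical, so substitute numbers measured on your own corpus.

```python
# Prices in dollars per million tokens, taken from the table above.
PRICES_PER_MILLION = {
    "schematron-v2-small": {"input": 0.05, "output": 0.25},
    "schematron-v2-turbo": {"input": 0.03, "output": 0.15},
}


def job_cost(
    model: str,
    pages: int,
    input_tokens_per_page: int,
    output_tokens_per_page: int,
) -> float:
    """Estimated dollar cost of extracting `pages` pages with `model`."""
    price = PRICES_PER_MILLION[model]
    input_cost = pages * input_tokens_per_page / 1_000_000 * price["input"]
    output_cost = pages * output_tokens_per_page / 1_000_000 * price["output"]
    return input_cost + output_cost
```

For a hypothetical job of one million pages at 10,000 input tokens and 200 output tokens each, `job_cost` gives about $550 on Small ($500 input plus $50 output) and about $330 on Turbo ($300 plus $30). Input tokens account for roughly 90 percent of the bill in both cases, which is why input pricing dominates.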

On the launch post's LLM-as-Judge benchmark, V2 Small scores 4.060 and V2 Turbo 4.039, both within striking distance of the older, larger 8B model at 4.070 and well ahead of the first-generation 3B at 3.909. The practical guidance follows from that: reach for Turbo for high-volume, cost-sensitive jobs, and use Small when a schema is genuinely complex or pages are very long and you want the extra quality headroom. If you want to verify this on your own pages rather than take a benchmark on faith, the sibling article on the published quality metrics walks through a reproducible test harness.
When Schematron Is Not the Right Layer
A credible architecture knows its own boundaries. Schema-guided extraction is the right tool for turning messy HTML into typed records, and the wrong tool for several adjacent problems.
If your bottleneck is getting the page at all, because of JavaScript rendering, anti-bot defenses, or proxy rotation, that is a retrieval problem, and a crawler or proxy layer solves it. Schematron extracts from HTML you already have; it does not fetch. The companion piece on separating crawling from extraction makes that split explicit. Very large pages are another limit: the context window is long but finite, so pages beyond it have to be chunked or truncated before extraction.
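For pages that exceed the window, a crude chunker is often enough to get started. The sketch below splits on a character budget and prefers to cut just after a `>` so no tag is broken in half. `chunk_html` and the budget are hypothetical, the chunks are not well-formed HTML on their own, and merging the per-chunk records is left out.

```python
def chunk_html(html: str, max_chars: int) -> list:
    """Split HTML into pieces of at most max_chars characters.

    Each cut is moved back to just after the last '>' in the window when one
    exists, so a tag is not split in the middle. Joining the chunks gives
    back the original string.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    chunks = []
    start = 0
    while start < len(html):
        end = min(start + max_chars, len(html))
        if end < len(html):
            cut = html.rfind(">", start, end)
            if cut > start:
                end = cut + 1  # cut right after the tag boundary
        chunks.append(html[start:end])
        start = end
    return chunks
```

A character budget is only a proxy for tokens, so leave generous headroom, and prefer cutting on repeated record containers when the page has them.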
There is also a simpler boundary. If a page is stable and a single CSS selector reliably pulls the one field you need, a model may be more machinery than the job requires, and lightweight automated data extraction software built around selectors will do. The build-versus-buy tradeoff for that decision, including the real maintenance cost of DIY parsers, is the subject of the sibling article on whether to build or buy this layer. Use the model where messiness, variety, or schema complexity make selectors expensive, not as a reflex.
Conclusion
Automated data extraction stops being fragile when you treat it as an architecture instead of a prompt. Source the HTML, clean it, define a typed schema, extract against that schema, validate on ingest, retry with the error fed back and escalate what still fails, store the result, watch for drift, and move volume onto a batch path. Each stage has one job, and the model is just the one in the middle.
The fastest way to feel the difference is to run one page through it. Send raw HTML and a schema, and get typed JSON back, with no selectors and no prompt to maintain.
Related Reading
- HTML-to-JSON Extraction API — the product-hub view of schema-driven extraction from messy pages.
- AI Web Scraping in Python — the hands-on Python fetch, clean, and validate tutorial.
- Best LLM for Data Extraction — Schematron V2 quality metrics and a test harness for your own pages.
- Web Scraping with ChatGPT vs Schematron — when a general chat model handles extraction and when a task-specific model is the better fit.