Jun 14, 2026
AI Web Scraping in Python: HTML to Validated JSON
Inference Research
From Brittle Selectors to a Typed JSON Schema
Anyone who has shipped a scraper knows the failure mode: it works for a week, the target site nudges a class name, and your CSS selectors return empty strings. AI web scraping in Python flips the problem around. Instead of telling your code where the data lives in the DOM, you tell it the shape of the data you want, and a model reads the page for you.
This is a hands-on build, not a survey. You will fetch raw HTML, clean it, define a Pydantic schema, send it to an extraction model, and end with a validated Python object you can drop into a database or a RAG pipeline. We will use Schematron, a model trained specifically to turn messy HTML into typed JSON, and we will keep the fetch step and the extraction step cleanly separate so you can reuse your existing crawler. If you want the product-level companion to this walkthrough, the HTML-to-JSON extraction API covers the same idea from the API side.
Read time: 10 minutes
Why AI Web Scraping Is Different From Selector Scraping
AI web scraping is the practice of extracting structured data from web pages using a language model instead of hand-written CSS or XPath selectors. You define the fields you want, and the model reads the page content to fill them in, even when the markup changes between sites or over time.
Selector-based scraping with BeautifulSoup or lxml is precise and fast, but it ties your code to a page's structure. The moment a site reorders its DOM or renames a class, your selectors break, and you end up maintaining a separate rule set for every site you scrape. Schematron and similar models work the other way around. They read content rather than structure, so one schema generalizes across layouts. The catch is that you pay for per-token inference instead of CPU parsing, which matters at scale and which we will price out later.
There are two flavors of AI scraping, and the distinction is the whole point of this article. Prompt-based tools hand a model the page text plus an instruction like "extract the product name and price." Schema-guided extraction hands the model the page and a schema, with no written instructions. With Schematron there is no system or user prompt for directions at all; the schema is the only instruction the model receives. That matters for two reasons. Plain JSON mode on a general model guarantees valid JSON, not JSON that matches your schema. And the shape of a prompt's output can drift between runs, while a schema holds the contract steady.
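To make the gap between "valid JSON" and "JSON that matches your schema" concrete, here is a minimal sketch. The two-field `Product` model and the `raw` string are made up for illustration; the string stands in for what a prompt-driven model in plain JSON mode might plausibly return.

```python
import json

from pydantic import BaseModel, ValidationError


class Product(BaseModel):
    name: str
    price: float


# Hypothetical output from a prompt-driven model in plain JSON mode:
# well-formed JSON, but the key is renamed and the price is free text.
raw = '{"product_name": "MacBook Pro M3", "price": "about $2,500"}'

data = json.loads(raw)  # succeeds: the string is valid JSON

try:
    Product.model_validate(data)
    matches_schema = True
except ValidationError as err:
    matches_schema = False
    # Each entry names the offending field and the kind of failure.
    problems = [(e["loc"], e["type"]) for e in err.errors()]

print(matches_schema)
print(problems)
```

The JSON parser is satisfied, but the record is unusable downstream: `name` is missing and `price` is not a number. A schema-constrained model is meant to rule out exactly this class of output.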
None of this makes selectors obsolete. If you are pulling one stable field from one page you control, a one-line CSS query is cheaper and simpler. We will come back to where the line sits.
The Pipeline: From Fetched HTML to a Validated Object
Every reliable AI scraping pipeline follows the same six steps, and it helps to see them before we write any code:
- Fetch the HTML with your existing stack (requests, Playwright, or a crawler).
- Clean the HTML to remove scripts, styles, and boilerplate.
- Define the output you want as a Pydantic schema.
- Extract by sending the cleaned HTML and the schema to Schematron.
- Validate the result and handle the rare failure.
- Scale with batching when you have more than a handful of pages.
The line that matters most there is the split between step 1 and step 4. Fetching is one job and extraction is another. Schematron is the extraction layer; it does not fetch pages, render JavaScript, rotate proxies, or solve CAPTCHAs. Keep the two apart and you can swap crawlers without touching your extraction code, or swap extraction models without touching your crawler.
Figure 1: AI Web Scraping Pipeline (Fetch to Validated JSON)
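One way to keep fetching and extraction apart in code is to pass each stage in as a plain function. The sketch below is illustrative only: `run_pipeline`, the type aliases, and the three `fake_*` stand-ins are hypothetical names, and the stand-ins exist so the example runs offline. Real stages would wrap requests or Playwright, the lxml cleaner, and the Schematron call.

```python
from typing import Callable

FetchFn = Callable[[str], str]      # URL -> raw HTML
CleanFn = Callable[[str], str]      # raw HTML -> cleaned HTML
ExtractFn = Callable[[str], dict]   # cleaned HTML -> structured record


def run_pipeline(url: str, fetch: FetchFn, clean: CleanFn, extract: ExtractFn) -> dict:
    """Wire the stages together without any stage knowing about the others."""
    return extract(clean(fetch(url)))


# Stand-in stages so the sketch runs without network access.
def fake_fetch(url: str) -> str:
    return "<html><script>track()</script><h1>Widget</h1></html>"


def fake_clean(html: str) -> str:
    return html.replace("<script>track()</script>", "")


def fake_extract(cleaned_html: str) -> dict:
    return {"name": "Widget", "html_chars": len(cleaned_html)}


record = run_pipeline(
    "https://example.com/product/123", fake_fetch, fake_clean, fake_extract
)
print(record)
```

Swapping crawlers then means passing a different `fetch` function, and swapping extraction models means passing a different `extract` function; nothing else changes.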
Step 1: Fetch the HTML
For static pages, requests is all you need. For pages that build their content with JavaScript, you need a real browser, and Playwright is the standard choice in Python. Either way, the output is the same: a string of HTML you hand to the next step.
```python
# Static pages: a plain GET is enough.
import requests

url = "https://example.com/product/123"
response = requests.get(url, timeout=30)
response.raise_for_status()  # fail loudly on 4xx/5xx instead of extracting an error page
html = response.text

# JavaScript-rendered pages: drive a real browser with Playwright.
# Run `playwright install chromium` once before using this.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto(url, wait_until="networkidle")
    html = page.content()
    browser.close()
```

You are not limited to these two libraries. Anything that produces HTML works here, including a full crawler like Crawl4AI, Firecrawl, Apify, or Scrapy. The extraction step downstream does not care how the bytes arrived, only that they did.
Step 2: Clean the HTML With lxml
Raw HTML is mostly noise: <script> tags, inline styles, tracking pixels, and navigation chrome that cost tokens without carrying any of the data you want. Stripping that out before extraction both improves accuracy and trims your token bill.
Schematron was trained on HTML that had been pre-cleaned with lxml.html.clean.Cleaner, so aligning your preprocessing with the training data tends to improve extraction quality and consistency. Here is the cleaner the docs recommend:
```python
import lxml.html as LH

# On lxml 5.2 and later the cleaner ships as a separate package:
# install it with `pip install lxml_html_clean` (or `pip install "lxml[html_clean]"`).
from lxml.html.clean import Cleaner

HTML_CLEANER = Cleaner(
    scripts=True,
    javascript=True,
    style=True,
    inline_style=True,
    safe_attrs_only=False,
)


def strip_noise(html: str) -> str:
    """Remove scripts, styles, and JavaScript from HTML using lxml."""
    if not html or not html.strip():
        return ""
    try:
        doc = LH.fromstring(html)
        cleaned = HTML_CLEANER.clean_html(doc)
        return LH.tostring(cleaned, encoding="unicode")
    except Exception:
        # Unparseable input: return an empty string so the caller can skip the page.
        return ""
```

You do not have to use lxml. Readability, Trafilatura, BeautifulSoup, or even targeted regex are all acceptable, and Readability is often more aggressive about stripping content. When in doubt, err on the side of removing less; the model can still find the fields you asked for even with some clutter left in.
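If you would rather not add a dependency for cleaning, a rough standard-library alternative looks like the sketch below. This is an assumption-laden stand-in, not the cleaner Schematron was trained against: `NoiseStripper` and `strip_noise_stdlib` are hypothetical names, and unlike the lxml `Cleaner` this version leaves inline `style` attributes and `javascript:` links in place.

```python
from html.parser import HTMLParser


class NoiseStripper(HTMLParser):
    """Re-emit HTML while dropping <script>, <style>, <noscript>, and comments."""

    SKIP_TAGS = {"script", "style", "noscript"}

    def __init__(self) -> None:
        # Keep entities such as &amp; as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.parts: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1
        elif self._skip_depth == 0:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if tag not in self.SKIP_TAGS and self._skip_depth == 0:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif self._skip_depth == 0:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self._skip_depth == 0:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self._skip_depth == 0:
            self.parts.append(f"&#{name};")

    # Comments are dropped because handle_comment is left as the default no-op.


def strip_noise_stdlib(html: str) -> str:
    parser = NoiseStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


sample = (
    "<div><script>var x = 1;</script><p>Hi &amp; bye</p>"
    "<style>p{color:red}</style><!-- tracking --></div>"
)
cleaned = strip_noise_stdlib(sample)
print(cleaned)
```

Because the training data was cleaned with lxml, treat this as a fallback and compare extraction quality on a few real pages before relying on it.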
Step 3: Define a Pydantic Schema
This is where you describe what you want. A Pydantic model gives you the field names, the types, and a description per field that tells the model what each one means. Nested models and lists are both supported, so you can model a real record rather than a flat bag of strings.
```python
from typing import Optional

from pydantic import BaseModel, Field


class Product(BaseModel):
    name: str = Field(
        ...,
        description="Exact product name as shown in the title or primary heading.",
    )
    price: float = Field(
        ...,
        description="Primary price of the product as a number.",
    )
    specs: dict[str, str] = Field(
        default_factory=dict,
        description="Specs of the product.",
    )
    tags: list[str] = Field(
        default_factory=list,
        description="Tags assigned to the product.",
    )
    # Optional fields let the model emit null instead of inventing a value.
    availability: Optional[str] = Field(
        None,
        description="Stock status if shown on the page, e.g. 'In stock'. Null if absent.",
    )
```

Because Schematron takes no prompt, those description strings carry all the instructions the model gets. A field called price with the description "Primary price of the product as a number" extracts far more reliably than a bare price: float. For data that may be absent on a page, make the field optional and say so in the description, so the model emits null instead of inventing a value. You can read more about how schemas drive output in the structured outputs docs.
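Nested models and lists follow the same pattern. In the sketch below, `Review` and `ProductWithReviews` are hypothetical models invented for illustration; `model_json_schema()` is standard Pydantic v2 and shows the JSON Schema, descriptions included, that a model like this serializes to. The exact schema the SDK sends on the wire may differ in small details.

```python
import json
from typing import Optional

from pydantic import BaseModel, Field


class Review(BaseModel):
    author: Optional[str] = Field(
        None, description="Reviewer display name. Null if not shown."
    )
    rating: float = Field(..., description="Star rating as a number, e.g. 4.5.")
    body: str = Field(..., description="Full text of the review.")


class ProductWithReviews(BaseModel):
    name: str = Field(
        ..., description="Exact product name as shown in the primary heading."
    )
    # A list of nested models: one Review object per review on the page.
    reviews: list[Review] = Field(
        default_factory=list, description="Customer reviews listed on the page."
    )


schema = ProductWithReviews.model_json_schema()
print(json.dumps(schema, indent=2))
```

Printing the schema is a quick way to check that every field carries a description before you spend tokens on extraction.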
Step 4: Call Schematron Through the OpenAI-Compatible API
Schematron speaks the OpenAI API, so you use the OpenAI Python SDK and point it at Inference.net by setting the base URL and your API key. The API quickstart covers account setup if you do not have a key yet.
The call itself is short. You pass the cleaned HTML as the user message and your Pydantic model as the response format, and you keep the temperature at zero, which is what the model is trained for. There is no system prompt, because the schema is the instruction.
```python
import os

from openai import OpenAI

from schema import Product  # from 03_schema.py
from clean_html import strip_noise  # from 02_clean_html.py

client = OpenAI(
    base_url="https://api.inference.net/v1",
    api_key=os.environ.get("INFERENCE_API_KEY"),
)

cleaned_html = strip_noise(html)  # `html` comes from the fetch step

# No system or user prompt for directions: the schema is the only instruction.
resp = client.beta.chat.completions.parse(
    model="inference-net/schematron-v2-small",
    messages=[{"role": "user", "content": cleaned_html}],
    response_format=Product,
    temperature=0,
)
product = resp.choices[0].message.parsed
```

Two model identifiers are available: inference-net/schematron-v2-small for the highest quality on complex schemas and very long pages, and inference-net/schematron-v2-turbo for maximum throughput and lowest cost. The first-generation schematron-3b and schematron-8b identifiers are retired and now route to the V2 models automatically, so always use the V2 names in new code.
Step 5: Read the JSON Output
The parse helper returns a typed object directly. Read resp.choices[0].message.parsed to get your Pydantic instance, and call .model_dump_json(indent=2) on it to see the JSON. For the messy product HTML in our example, Schematron returns clean, conforming JSON:
```json
{
  "name": "MacBook Pro M3",
  "price": 2499.99,
  "specs": {
    "RAM": "16GB",
    "Storage": "512GB SSD"
  },
  "tags": [
    "laptop",
    "professional",
    "macbook",
    "apple"
  ],
  "availability": null
}
```

The output is strictly valid against your schema. Schematron runs in strict JSON mode with full schema adherence, and the keys come out in the same order you declared them, which makes diffs and downstream parsing predictable.
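If you store raw responses and diff them, the key-order property is cheap to assert. This is a small sketch, assuming the example output above and the field order of the `Product` model from step 3; `declared_order` is written out by hand here so the snippet runs on the standard library alone.

```python
import json

# Field order as declared on the Product model in step 3.
declared_order = ["name", "price", "specs", "tags", "availability"]

payload = """
{
  "name": "MacBook Pro M3",
  "price": 2499.99,
  "specs": {"RAM": "16GB", "Storage": "512GB SSD"},
  "tags": ["laptop", "professional", "macbook", "apple"],
  "availability": null
}
"""

record = json.loads(payload)  # dicts preserve insertion order in Python 3.7+
keys_in_order = list(record) == declared_order
print(keys_in_order)
```

With a real Pydantic model you could derive the list from `Product.model_fields` instead of typing it out.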
Step 6: Handle Validation Failures
Schematron should always return valid JSON that matches your schema, but production code validates anyway. The most common real-world failure is not malformed JSON; it is a page that exceeds the model's token budget, which shows up as a finish_reason of length and a truncated response.
```python
from openai import LengthFinishReasonError
from pydantic import ValidationError

from schema import Product  # from 03_schema.py


def extract_product(client, cleaned_html: str) -> Product | None:
    try:
        resp = client.beta.chat.completions.parse(
            model="inference-net/schematron-v2-small",
            messages=[{"role": "user", "content": cleaned_html}],
            response_format=Product,
            temperature=0,
        )
    except LengthFinishReasonError:
        # The OpenAI SDK's parse helper raises this when the output was cut off.
        # Trim the HTML harder or chunk the page, then retry.
        return None

    choice = resp.choices[0]

    # Most common real failure: the page exceeded the token budget and was cut off.
    # Kept as a defensive check in case a truncated response is returned as-is.
    if choice.finish_reason == "length":
        return None

    # Validate on ingest even though strict JSON mode should guarantee a match.
    try:
        return Product.model_validate_json(choice.message.content)
    except ValidationError as err:
        # Route to a retry or a human review queue instead of dropping silently.
        print("Validation failed:", err)
        return None
```

The pattern is simple: handle truncation first, then validate the payload with Product.model_validate_json, and route anything that fails to a retry or a human review queue. Truncation can surface in two ways: the SDK's parse helper raises LengthFinishReasonError, and a response that does come back carries a finish_reason of length, so the code checks both. For oversized pages, the fix is upstream: trim the HTML harder in step 2, or chunk the page and extract per section, since the model accepts long but not unlimited input.
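Chunking can be as simple as packing block-level pieces into size-limited strings. The sketch below is one possible approach, not a documented Schematron recipe: `chunk_html` is a hypothetical helper, the list of closing tags is an assumption, and the budget is measured in characters as a crude proxy for tokens.

```python
import re

# Split points sit immediately after a closing block-level tag (assumed list).
_SPLIT_AFTER = re.compile(
    r"(?<=</div>)|(?<=</section>)|(?<=</article>)"
    r"|(?<=</table>)|(?<=</ul>)|(?<=</p>)"
)


def chunk_html(html: str, max_chars: int = 40_000) -> list[str]:
    """Greedily pack block-level pieces into chunks of at most max_chars."""
    pieces = [p for p in _SPLIT_AFTER.split(html) if p]
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        # A single piece larger than the budget is hard-split as a last resort.
        while len(piece) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(piece[:max_chars])
            piece = piece[max_chars:]
        # Start a new chunk when adding this piece would exceed the budget.
        if len(current) + len(piece) > max_chars:
            chunks.append(current)
            current = ""
        current += piece
    if current:
        chunks.append(current)
    return chunks


sample = "<div>a</div><p>bb</p><p>ccc</p>"
chunks = chunk_html(sample, max_chars=15)
print(chunks)
```

Chunks cut this way are not guaranteed to be well-formed HTML when blocks are nested, and you still need a merge step, for example taking scalar fields from the first chunk that yields them and concatenating list fields across chunks.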
Batch Extraction: Scaling to Many Pages
Extracting one page is a tutorial. Extracting ten thousand is a job. For modest volumes, the simplest scale-up is to run the synchronous call concurrently with a thread pool, since each extraction is an independent network-bound request.
```python
from concurrent.futures import ThreadPoolExecutor

from validate import extract_product  # from 06_validate.py


def extract_many(client, pages: list[str], max_workers: int = 16) -> list:
    """Extract a Product from each cleaned HTML page concurrently."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(lambda html: extract_product(client, html), pages)
    # Failed pages are dropped, so the result list may be shorter than `pages`.
    return [r for r in results if r is not None]


# For tens of thousands of pages, switch to the asynchronous Batch API instead
# of a thread pool. Submit a JSONL file (custom_id, method, url, body per line)
# to https://batch.inference.net/v1; jobs complete within a 24-hour window and
# accept up to 50,000 requests per submission.
```

When you outgrow a thread pool, switch to the asynchronous Batch API, which accepts up to 50,000 jobs in a single submission, runs them with higher rate limits, and completes within a 24-hour window with optional webhook updates. You upload a JSONL file where each line carries a custom_id, the method, the endpoint, and the same request body you would send synchronously. A smaller Group API handles batches of 50 or fewer in one payload with a single callback.
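A batch input file along those lines might be built as follows. Treat the details as assumptions to check against the Batch API docs: the four top-level keys come from the description above, but the `/v1/chat/completions` path and the OpenAI-style `json_schema` response format are guesses modeled on the OpenAI batch format, and `build_batch_lines` is a hypothetical helper.

```python
import json


def build_batch_lines(
    pages: dict[str, str],
    schema: dict,
    model: str = "inference-net/schematron-v2-turbo",
) -> list[str]:
    """Return one JSON string per page, keyed by a caller-chosen custom_id."""
    lines = []
    for custom_id, cleaned_html in pages.items():
        request = {
            "custom_id": custom_id,
            "method": "POST",
            "url": "/v1/chat/completions",  # assumed endpoint path
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": cleaned_html}],
                "temperature": 0,
                # Assumed: OpenAI-style structured-output request format.
                "response_format": {
                    "type": "json_schema",
                    "json_schema": {
                        "name": "Product",
                        "schema": schema,
                        "strict": True,
                    },
                },
            },
        }
        # json.dumps escapes newlines, so each request stays on one line.
        lines.append(json.dumps(request))
    return lines


demo_pages = {"page-1": "<h1>Widget A</h1>", "page-2": "<h1>Widget\nB</h1>"}
demo_schema = {"type": "object", "properties": {"name": {"type": "string"}}}
lines = build_batch_lines(demo_pages, demo_schema)
jsonl = "\n".join(lines) + "\n"
print(jsonl)
```

In practice the `schema` argument could come from `Product.model_json_schema()`, and the `custom_id` is what lets you join results back to source URLs when the job completes.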
Schematron V2 Small vs Turbo: Quality, Speed, and Cost
Picking a model comes down to the shape of your pages and your volume. Turbo is the default for high-volume, mostly-structured pages where cost and latency dominate; Small is the choice when schemas are complex or pages are very long and you want the highest accuracy.
| Dimension | V2 Small | V2 Turbo |
|---|---|---|
| Best for | Complex schemas, long pages | High volume, low cost |
| Input price (/1M tok) | $0.05 | $0.03 |
| Output price (/1M tok) | $0.25 | $0.15 |
| Throughput (H100, 10k/500) | 2.47 req/s | 4.14 req/s |
| LLM-judge score (1-5) | 4.060 | 4.039 |
| Model id | schematron-v2-small | schematron-v2-turbo |
Prices from the model pages; throughput and judge scores from the Schematron V2 launch post. Model ids are prefixed with inference-net/.
The pricing is transparent and per-token, so you can estimate before you commit. Schematron V2 Small is $0.05 per million input tokens and $0.25 per million output tokens; Turbo is $0.03 per million input tokens and $0.15 per million output tokens. A cleaned product page of roughly 8,000 input tokens producing a 400-token JSON record costs about $0.0003 on Turbo, or roughly $0.30 per thousand pages; the same job on Small is about $0.0005 per page, or $0.50 per thousand.
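The arithmetic is easy to rerun for your own page sizes. This sketch hard-codes the per-million-token prices quoted above; `cost_per_page` is a hypothetical helper, and the 8,000 and 400 token counts are the same illustrative figures used in the paragraph.

```python
# Prices in USD per one million tokens, as quoted in the table above.
PRICES_PER_MILLION = {
    "small": {"input": 0.05, "output": 0.25},
    "turbo": {"input": 0.03, "output": 0.15},
}


def cost_per_page(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of extracting one page."""
    price = PRICES_PER_MILLION[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


turbo = cost_per_page("turbo", input_tokens=8_000, output_tokens=400)
small = cost_per_page("small", input_tokens=8_000, output_tokens=400)

print(f"Turbo: ${turbo:.4f} per page, ${turbo * 1000:.2f} per 1,000 pages")
print(f"Small: ${small:.4f} per page, ${small * 1000:.2f} per 1,000 pages")
```

Input tokens dominate the bill in this example (80% of the cost on both models), which is why the cleaning step in step 2 pays for itself.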
Quality is not a guess either. On an LLM-as-a-judge evaluation graded on a one-to-five scale, Schematron V2 Small scores 4.060 and Turbo scores 4.039, both within range of the previous-generation 8B model at 4.070 and ahead of the older 3B at 3.909. On throughput, a single H100 serves 4.14 Turbo requests per second on a 10,000-input, 500-output workload, about 2.5 times the old 8B, while Small serves 2.47 per second.

If your pages are uniform and your volume is high, Turbo is usually the right starting point, and you can reach for Small on the pages that need it.
Once you have a schema working locally, the fastest way to see Turbo on your own pages is to run it directly.
ScrapeGraphAI vs Schema-First Extraction
Because the search results for AI web scraping in Python are clustered around ScrapeGraphAI, it is worth drawing the contrast directly. ScrapeGraphAI is an LLM-driven scraper that you drive with a natural-language prompt and a URL; it interprets the page and returns JSON, and it bundles fetching, crawling, and extraction into one hosted API, with a popular open-source library behind it and credit-based pricing.
| Dimension | ScrapeGraphAI | Schematron |
|---|---|---|
| Interface | Prompt + URL | Schema only |
| Output guarantee | Prompt-shaped | Strict schema |
| Fetch/crawl included | Yes | No |
| Pricing model | Credits | Per token |
| Best for | All-in-one setup | Typed pipelines |
ScrapeGraphAI is a hosted API with an open-source library (26.5k+ GitHub stars); Schematron is the extraction layer you pair with your own crawler.
The core difference is the interface. A prompt is flexible but can produce different output shapes from run to run and page to page, while a schema is a fixed contract the model is constrained to fill, which is what you want when the output feeds a typed database or a downstream pipeline. ScrapeGraphAI is a strong choice when you want one tool to do everything with minimal setup. Schematron fits when you already have a crawler you trust, you want a stable typed contract, and you want to pay per token rather than per credit.
When Schematron Is Not the Right Layer
Schema-guided extraction is not the answer to every scraping problem, and it is worth being clear about where it stops.
It is not a fetch, render, proxy, or CAPTCHA layer. If your problem is getting the page at all, you need a crawler or a browser automation tool, and Schematron sits after that rather than in place of it. When a single stable selector already gets you the one field you need, a CSS query is cheaper and has no per-token cost. For inputs that are not HTML, like scanned PDFs, a document or OCR pipeline is the right first step, with extraction applied afterward if needed. Schematron is the extraction layer for messy web pages, not a replacement for your proxy network or your crawler.
Frequently Asked Questions
Can you scrape websites with AI in Python? Yes. You fetch the HTML with your usual tools, then use a schema-guided model to turn that HTML into typed JSON, which is the pipeline this article builds.
Do I still need a crawler? Yes. Schematron only does extraction; fetching and rendering remain the job of requests, Playwright, or a crawler.
Is the output guaranteed to match my schema? Schematron runs in strict JSON mode with full schema adherence, so the output conforms to your schema; you should still validate on ingest as a safety net.
How much does it cost per page? It is priced per token, so cost scales with page size; a typical cleaned product page runs a fraction of a cent, as worked out in the cost section above.
Should I use Small or Turbo? Start with Turbo for high-volume, uniform pages, and use Small when schemas are complex or pages are very long.
Conclusion
AI web scraping in Python is not magic. It is a disciplined pipeline: fetch HTML with the stack you already have, clean it with lxml, describe the output with a Pydantic schema, send it to Schematron, and validate what comes back. The schema is the contract, and the contract is what keeps your output stable while the pages underneath it keep changing.
The fastest way to internalize the pattern is to run it on a page you actually care about.
Extract typed JSON from messy HTML
Send HTML and a Pydantic, Zod, or JSON Schema definition to Schematron and get structured JSON back without writing selectors or prompts.
Related Reading
- HTML-to-JSON Extraction API — the product-level companion to this tutorial.
- Crawl4AI Complete Guide — the fetch and crawl layer that feeds Schematron.
- Best AI APIs — broader context for choosing an extraction API.