Jun 14, 2026
HTML-to-JSON Extraction API: Extract Typed Data from Messy Web Pages
Inference Research
Extract Typed Business Data, Not a Tag Tree
If you've spent time trying to use web-scraped data in an AI pipeline, you've probably hit the same wall: the data is on the page, but getting it into a form your code can actually use is the hard part. A price scraper that works on Monday breaks when the site redesigns on Friday. A regex that extracts product names handles 80% of pages and silently fails on the rest.
HTML-to-JSON extraction using a schema-driven API takes a different approach. Instead of writing selectors that target specific DOM nodes, you define the data structure you want: a Product with a name, a price, and a tags list. Then you send the HTML to a purpose-built extraction model. The model maps page content to your schema and returns validated JSON. No selectors to maintain. No brittle regex.
HTML-to-JSON extraction is the process of pulling typed, structured data out of a raw web page and returning it as a JSON object that conforms to a developer-defined schema. Unlike DOM serialization, which mirrors the tag hierarchy, extraction maps page content to real-world entities like products, job postings, or article metadata.
This guide covers the full pipeline: how to preprocess HTML, write effective schemas in Python (Pydantic) and TypeScript (Zod), choose between the two Schematron V2 model variants, and scale to thousands of pages with the Batch API.
DOM Serialization vs. Typed Business JSON (Two Very Different Things)
The phrase "html to json" covers two fundamentally different operations. Confusing them leads to building the wrong tool.
What DOM Serialization Gives You
DOM serialization converts the HTML tag hierarchy into a JSON tree. An <h2> becomes a key; its text content becomes the value. A <ul> becomes an array of <li> values. The structure is preserved, but the semantics are lost.
A DOM serializer on a product page might return something like:
{
  "div": {
    "h2": "MacBook Pro M3",
    "p": "Price: $2,499.99 USD",
    "ul": ["RAM: 16GB", "Storage: 512GB SSD"]
  }
}
That output is structurally correct but semantically useless for a downstream pipeline. The price is a string buried inside a "p" key. There's no name field, no price: float, no tags array. Your code needs a second parsing step just to get data into a useful shape, and that second step breaks for the same reasons selectors do.
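To make that second parsing step concrete, here is a minimal sketch of what downstream code ends up doing with the serialized tree above. The helper names (parse_price, to_product) and the regular expression are illustrative assumptions, not part of any library.

```python
import re
from typing import Optional

# DOM-serializer output from the example above.
dom_json = {
    "div": {
        "h2": "MacBook Pro M3",
        "p": "Price: $2,499.99 USD",
        "ul": ["RAM: 16GB", "Storage: 512GB SSD"],
    }
}


def parse_price(text: str) -> Optional[float]:
    """Pull a dollar amount out of free text; returns None when the pattern misses."""
    match = re.search(r"\$([\d,]+(?:\.\d+)?)", text)
    if match is None:
        return None
    return float(match.group(1).replace(",", ""))


def to_product(tree: dict) -> dict:
    """Hand-written mapping from tag names to business fields (hypothetical helper)."""
    node = tree["div"]
    specs = dict(item.split(": ", 1) for item in node["ul"])
    return {"name": node["h2"], "price": parse_price(node["p"]), "specs": specs}


product = to_product(dom_json)
print(product)
```

The mapping hard-codes both the tag names and the price format. A page that moves the price out of the "p" key raises KeyError, and a price written as "2.499,99 EUR" makes parse_price return None. That coupling is the maintenance cost schema-driven extraction is meant to remove.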
What Typed Business JSON Gives You
Typed extraction starts from a schema you define. You tell the model: the entity is a Product, with a name (string), a price (float), and tags (list of strings). The model reads page content and maps what it finds onto your schema semantically, not structurally.
The same product page via a schema-driven extraction API returns:
{
  "name": "MacBook Pro M3",
  "price": 2499.99,
  "specs": {"RAM": "16GB", "Storage": "512GB SSD"},
  "tags": ["laptop", "apple"]
}
The price is a float. The name is a named field. The tags are an array. This is what downstream code like vector stores, databases, and agents actually needs.
DOM serializers are the right tool for document processing where structure traversal matters. For anything feeding a typed pipeline, schema-driven extraction is the correct approach. The rest of this article covers exactly that.
How HTML-to-JSON Extraction APIs Work
The architecture of a schema-driven extraction API is simpler than it looks. Three things happen in sequence: fetch the page, preprocess the HTML, extract against a schema.
Schema-First Extraction: No Prompt Required
Schematron is a family of purpose-built extraction models from Inference.net designed to run this pattern. The API interface is exactly what you'd expect if you've used OpenAI's structured outputs: pass the HTML as the user message and your schema as the response_format. That's it.
There is no system prompt. There is no instruction string telling the model to "extract the product name." Schematron was trained entirely on HTML-to-JSON extraction tasks, so the schema is the signal it needs. The docs cover how response_format and structured outputs work under the hood, but from a usage standpoint, the pattern is identical to any OpenAI-compatible SDK call.
If you're already using the OpenAI Python or TypeScript SDK, you swap one line, base_url, and the rest of your code is unchanged.
The Three-Step Pipeline: Fetch, Preprocess, Extract
Figure 1: HTML-to-JSON Extraction Pipeline
Schematron handles extraction only; it doesn't fetch URLs. This separation is deliberate. Teams that already have a crawler (Scrapy, Playwright grid, Crawl4AI) don't replace it; they add Schematron as the extraction step. For teams that need a crawler, the Crawl4AI guide walks through setting up open-source async crawling with JavaScript rendering.
Purpose-Built vs. General LLM + JSON Schema
You can point GPT-4o at a page with a JSON Schema in response_format, and it often produces conforming output. The problem is that GPT-4o wasn't trained on this task specifically. It is applying general reasoning to HTML while simultaneously trying to conform to a schema. Results are inconsistent on pages with noisy structure or ambiguous fields.
Schematron was trained specifically on HTML-to-JSON extraction. Both V2 models score above DeepSeek V3.2 and GPT-5.4 Nano on LLM-as-judge quality evaluation. The 100% schema adherence means no retry-and-validate loop; the output always conforms to your schema.
HTML Preprocessing: Cleaning Before You Extract
Sending raw HTML directly to an extraction model is wasteful. A typical unprocessed page contains <script> blocks, <style> sheets, cookie banners, navigation menus, and footer boilerplate. None of that helps extraction, and all of it consumes tokens.
Stripping Noise with lxml
The Schematron models were trained on HTML pre-cleaned with lxml.html.clean.Cleaner. Aligning your preprocessing with the model's training distribution is the single most impactful thing you can do for consistent extraction results.
import lxml.html as LH
# On recent lxml releases (5.2 and later) the cleaner ships as the separate
# lxml_html_clean package: pip install lxml_html_clean
from lxml.html.clean import Cleaner
HTML_CLEANER = Cleaner(
    scripts=True,  # remove <script> tags
    javascript=True,  # remove javascript: hrefs and event handlers
    style=True,  # remove <style> tags
    inline_style=True,  # remove style= attributes
    safe_attrs_only=False,
)
def strip_noise(html: str) -> str:
    """Remove scripts, styles, and JavaScript from HTML using lxml."""
    if not html or not html.strip():
        return ""
    try:
        doc = LH.fromstring(html)
        cleaned = HTML_CLEANER.clean_html(doc)
        return LH.tostring(cleaned, encoding="unicode")
    except Exception:
        return ""
What to strip: <script> tags, <style> blocks, inline JavaScript, and navigation boilerplate. These consume tokens without adding extractable signal.
What to keep: Text content, <a> href attributes, and semantic markup (<h1>–<h6>, <table>, product-specific data-* attributes). When in doubt, keep a little more; Schematron handles noise gracefully.
Alternative preprocessors that work: Mozilla Readability (strips nav, focuses on main content), Trafilatura (fast, configurable), and BeautifulSoup (for targeted element extraction when you know the DOM structure). The lxml Cleaner is the recommended default because it matches Schematron's training distribution.
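To gauge how much of a page is noise before choosing a preprocessor, a rough measurement can help. The sketch below uses only the standard library's html.parser and is an illustration, not a replacement for the lxml Cleaner; the NoiseMeter class and noise_ratio helper are hypothetical names.

```python
from html.parser import HTMLParser


class NoiseMeter(HTMLParser):
    """Counts characters inside <script>/<style> versus visible text."""

    NOISY_TAGS = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._noisy_depth = 0
        self.noise_chars = 0
        self.text_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.NOISY_TAGS:
            self._noisy_depth += 1

    def handle_endtag(self, tag):
        if tag in self.NOISY_TAGS and self._noisy_depth > 0:
            self._noisy_depth -= 1

    def handle_data(self, data):
        if self._noisy_depth > 0:
            self.noise_chars += len(data)
        else:
            # Ignore whitespace-only runs between tags.
            self.text_chars += len(data.strip())


def noise_ratio(html: str) -> float:
    """Share of characters that sit inside script or style blocks."""
    meter = NoiseMeter()
    meter.feed(html)
    meter.close()
    total = meter.noise_chars + meter.text_chars
    return meter.noise_chars / total if total else 0.0


sample = (
    "<html><head><style>body{color:red}</style>"
    "<script>var a = 1;</script></head>"
    "<body><h1>Hi</h1></body></html>"
)
print(f"{noise_ratio(sample):.0%} of characters sit in script/style blocks")
```

A high ratio suggests that cleaning will cut token usage substantially. Character counts are only a proxy for tokens, so treat the number as a rough signal.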
JavaScript-rendered pages require a browser-based fetch before preprocessing. If your targets use React, Vue, or Next.js and render content client-side, you need Playwright or Puppeteer to get the fully-populated HTML first. The Crawl4AI guide covers async crawling with JavaScript rendering. Once you have the rendered HTML, preprocessing and extraction are identical.
Getting Started: Convert HTML to JSON in Python and TypeScript
Here's the end-to-end flow before diving into code:
- Install the OpenAI SDK and set your INFERENCE_API_KEY environment variable. Get your key from the API Quickstart.
- Fetch the HTML for the target page (requests, Playwright, or your existing crawler).
- Strip scripts and styles with lxml.html.clean.Cleaner.
- Define your output schema as a Pydantic BaseModel (Python) or Zod object (TypeScript).
- Call client.beta.chat.completions.parse() with model="inference-net/schematron-v2-small" and response_format=YourModel.
- Access the validated result at response.choices[0].message.parsed.
Python with Pydantic
import os
import lxml.html as LH
from lxml.html.clean import Cleaner
from pydantic import BaseModel, Field
from openai import OpenAI
HTML_CLEANER = Cleaner(
    scripts=True,
    javascript=True,
    style=True,
    inline_style=True,
    safe_attrs_only=False,
)
def strip_noise(html: str) -> str:
    """Remove scripts, styles, and JavaScript from HTML using lxml."""
    if not html or not html.strip():
        return ""
    try:
        doc = LH.fromstring(html)
        cleaned = HTML_CLEANER.clean_html(doc)
        return LH.tostring(cleaned, encoding="unicode")
    except Exception:
        return ""
# 1) Define your schema (nested data and lists are supported)
class Product(BaseModel):
    name: str
    price: float = Field(
        ...,
        description="Primary price of the product.",
    )
    specs: dict[str, str] = Field(
        default_factory=dict,
        description="Specs of the product.",
    )
    tags: list[str] = Field(
        default_factory=list,
        description="Tags assigned to the product.",
    )
# 2) Messy HTML (could be the full page; trim to the relevant region when possible)
raw_html = """
<div id="item">
  <h2 class="title">MacBook Pro M3</h2>
  <p>Price: <b>$2,499.99</b> USD</p>
  <ul info>
    <li>RAM: 16GB</li>
    <li>Storage: 512GB SSD</li>
  </ul>
  <span class="tag">laptop</span>
  <span class="tag">professional</span>
  <span class="tag">macbook</span>
  <span class="tag">apple</span>
</div>
"""
html = strip_noise(raw_html)  # pre-clean before sending to the model
# 3) Client setup
client = OpenAI(
    base_url="https://api.inference.net/v1",
    api_key=os.environ.get("INFERENCE_API_KEY"),
)
resp = client.beta.chat.completions.parse(
    model="inference-net/schematron-v2-small",
    messages=[
        {"role": "user", "content": html},
    ],
    response_format=Product,
)
print(resp.choices[0].message.parsed.model_dump_json(indent=2))
# Expected output:
# {
#   "name": "MacBook Pro M3",
#   "price": 2499.99,
#   "specs": {
#     "RAM": "16GB",
#     "Storage": "512GB SSD"
#   },
#   "tags": ["laptop", "professional", "macbook", "apple"]
# }
The SDK call uses client.beta.chat.completions.parse() with response_format=Product. Pydantic handles deserialization automatically: the returned object is already a typed Product instance, with no manual JSON parsing required.
The output for the MacBook Pro example:
{
  "name": "MacBook Pro M3",
  "price": 2499.99,
  "specs": {"RAM": "16GB", "Storage": "512GB SSD"},
  "tags": ["laptop", "professional", "macbook", "apple"]
}
Set temperature=0 for deterministic extraction. The model is trained for a single correct answer, not sampling across possibilities.
Note the model identifier, inference-net/schematron-v2-small. This variant prioritizes accuracy on complex schemas and dense pages. The section "Schematron V2 Small vs. V2 Turbo: Choosing the Right Model" covers when to switch to schematron-v2-turbo for higher throughput.
TypeScript with Zod
import OpenAI from "openai";
import { z } from "zod";
import { zodResponseFormat } from "openai/helpers/zod";
const openai = new OpenAI({
  baseURL: "https://api.inference.net/v1",
  apiKey: process.env.INFERENCE_API_KEY,
});
// 1) Define your schema (nested data and lists are supported)
const Product = z.object({
  name: z.string(),
  price: z
    .number()
    .describe("Primary price of the product."),
  specs: z
    .record(z.string())
    .describe("Specs of the product."),
  tags: z
    .array(z.string())
    .describe("Tags assigned to the product."),
});
// 2) Messy HTML (could be the full page; trim to the relevant region when possible)
const html = `
<div id="item">
  <h2 class="title">MacBook Pro M3</h2>
  <p>Price: <b>$2,499.99</b> USD</p>
  <ul info>
    <li>RAM: 16GB</li>
    <li>Storage: 512GB SSD</li>
  </ul>
  <span class="tag">laptop</span>
  <span class="tag">professional</span>
  <span class="tag">macbook</span>
  <span class="tag">apple</span>
</div>
`;
// 3) Chat Completions extraction (typed validation)
const resp = await openai.chat.completions.parse({
  model: "inference-net/schematron-v2-turbo",
  messages: [{ role: "user", content: html }],
  response_format: zodResponseFormat(Product, "product"),
});
console.log(resp.choices[0].message.content);
The TypeScript version uses zodResponseFormat(Product, "product") in place of the Pydantic model. The API contract is identical; only the schema wrapper differs between languages.
Both examples produce the same JSON output. The TypeScript version uses inference-net/schematron-v2-turbo to show both model identifiers early; the throughput and cost difference between them is covered in the next sections.
Raw JSON Schema is also supported via response_format: { type: "json_schema", json_schema: { ... } }, which is useful for languages without a Pydantic or Zod equivalent.
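For callers without Pydantic or Zod, the request body can be assembled by hand. The sketch below builds the payload as a plain dictionary. The envelope mirrors the json_schema shape used in the batch example later in this guide, while the schema contents and the build_payload helper are assumptions written for illustration. Sending the request is left out so the snippet runs offline.

```python
import json

# Hand-written JSON Schema roughly equivalent to the Product model above (illustrative).
PRODUCT_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "price": {"type": "number", "description": "Primary price of the product."},
        "specs": {
            "type": "object",
            "additionalProperties": {"type": "string"},
            "description": "Specs of the product.",
        },
        "tags": {
            "type": "array",
            "items": {"type": "string"},
            "description": "Tags assigned to the product.",
        },
    },
    "required": ["name", "price"],
}


def build_payload(html: str, model: str = "inference-net/schematron-v2-small") -> dict:
    """Assemble a chat-completions body: HTML as the only message, schema as response_format."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": html}],  # no system prompt
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": "product", "schema": PRODUCT_SCHEMA},
        },
    }


payload = build_payload("<h2>MacBook Pro M3</h2><p>Price: $2,499.99</p>")
print(json.dumps(payload, indent=2))
```

Assuming the endpoint follows the usual OpenAI-compatible convention, the serialized payload is POSTed to the chat completions path under the base URL shown earlier, with the API key sent as a bearer token.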
Schema Design for HTML Extraction
Getting a first extraction working is usually fast. Getting consistent results on production pages with varied formatting, missing fields, or ambiguous data requires schema design.
Writing Field Descriptions That Improve Accuracy
Field names establish the extraction contract, but they're not always enough. A page showing a sale price, an original price, and a shipping price gives the model three plausible values for a field named price: float. It will often pick correctly, but "often" isn't production-grade.
Field descriptions resolve ambiguity. In Pydantic, that's Field(..., description="Primary sale price of the product in USD, not original or shipping price"). In Zod, it's .describe("Primary sale price..."). The model reads description text as additional guidance when multiple candidates appear on the page.
Compare these two field definitions:
# Without description: potentially ambiguous
price: float
# With description: unambiguous
price: float = Field(..., description="Primary sale price of the product in USD.")
Add descriptions on any field where the page could contain multiple valid values. The description text is the tie-breaker.
Handling Optional Fields and Nested Objects
Not every page has every field. For fields that may be absent, use float | None = Field(None, ...) in Pydantic or .optional() in Zod. When the source data doesn't have the field, the model returns null rather than hallucinating a value.
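Downstream code then has to treat a missing value as missing rather than as zero. The following is a minimal sketch that uses a standard-library dataclass in place of Pydantic so it runs without dependencies; the Listing type and the sample JSON strings are hypothetical.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Listing:
    name: str
    price: Optional[float] = None  # None means the page did not state a price


def load_listing(raw_json: str) -> Listing:
    """Parse an extraction result; absent or null fields fall back to None."""
    data = json.loads(raw_json)
    return Listing(name=data["name"], price=data.get("price"))


def display_price(listing: Listing) -> str:
    # Compare against None explicitly: a price of 0.0 is a real value, not a gap.
    if listing.price is None:
        return "price not listed"
    return f"${listing.price:,.2f}"


with_price = load_listing('{"name": "MacBook Pro M3", "price": 2499.99}')
without_price = load_listing('{"name": "Refurbished unit", "price": null}')
print(display_price(with_price))
print(display_price(without_price))
```

Checking `is None` instead of truthiness keeps a legitimate zero price from being mistaken for a missing one.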
Designing Schemas for Reviewable Pipelines
Production extraction schemas should be easy to inspect after the model runs. Add fields that preserve context when a human or downstream validator needs to understand an answer: source_url, source_text, confidence_note, or evidence. These fields are not required for every schema, but they make review queues and debugging much easier. When an extracted price looks suspicious, the operator can inspect the relevant page fragment instead of re-opening the entire crawl result.
Keep the extraction schema close to what appears on the page, then normalize in application code. For example, extract price_text, currency, and price_amount if pages mix "$1,299", "USD 1,299", and "Starting at $1,299". Schematron maps page content into those fields; your database layer can enforce canonical currency codes, decimal precision, and business rules. That split keeps extraction understandable and avoids turning the schema into an all-purpose ETL layer.
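The normalization half of that split might look like the sketch below. It assumes the extraction schema returned price_text, currency, and price_amount as described; the normalize_price function and the CURRENCY_CODES table are hypothetical application code, not part of Schematron.

```python
import re
from decimal import Decimal, ROUND_HALF_UP
from typing import Optional

# Hypothetical mapping from on-page currency markers to ISO 4217 codes.
CURRENCY_CODES = {"$": "USD", "usd": "USD", "us$": "USD"}


def normalize_price(record: dict) -> Optional[dict]:
    """Turn extracted price fields into a canonical amount and currency code."""
    amount = record.get("price_amount")
    if amount is None:
        # Fall back to the raw text when the numeric field came back null.
        match = re.search(r"\d[\d,]*(?:\.\d+)?", record.get("price_text") or "")
        if match is None:
            return None
        amount = match.group(0).replace(",", "")
    code = CURRENCY_CODES.get((record.get("currency") or "").strip().lower())
    if code is None:
        return None  # unknown currency: route to review instead of guessing
    value = Decimal(str(amount)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    return {"amount": value, "currency": code}


extracted = [
    {"price_text": "$1,299", "currency": "$", "price_amount": 1299.0},
    {"price_text": "USD 1,299", "currency": "USD", "price_amount": None},
    {"price_text": "Starting at $1,299", "currency": "$", "price_amount": None},
]
for record in extracted:
    print(normalize_price(record))
```

All three page variants collapse to the same canonical record, and anything the rules cannot resolve returns None so it can go to a review queue.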
import os
from pydantic import BaseModel, Field
from openai import OpenAI
class JobPosting(BaseModel):
    title: str
    company: str
    location: str = Field(
        ...,
        description="City and country of the role, e.g. 'San Francisco, US'. Extract from the location or address field.",
    )
    remote: bool = Field(
        ...,
        description="True if the posting explicitly states the role is remote or fully remote.",
    )
    salary_min: float | None = Field(
        None,
        description="Lower bound of the advertised salary range in USD per year. Null if not stated.",
    )
    salary_max: float | None = Field(
        None,
        description="Upper bound of the advertised salary range in USD per year. Null if not stated.",
    )
    requirements: list[str] = Field(
        default_factory=list,
        description="List of required skills or qualifications as stated in the posting.",
    )
client = OpenAI(
    base_url="https://api.inference.net/v1",
    api_key=os.environ.get("INFERENCE_API_KEY"),
)
resp = client.beta.chat.completions.parse(
    model="inference-net/schematron-v2-small",
    messages=[
        {"role": "user", "content": html},  # html = your pre-cleaned job posting HTML
    ],
    response_format=JobPosting,
)
posting = resp.choices[0].message.parsed
print(posting.model_dump_json(indent=2))
This JobPosting schema demonstrates the full range: required string fields (title, company), optional floats (salary_min, salary_max), a list of strings (requirements), and a boolean (remote). Schematron enforces 100% schema adherence across all field types; nested objects and arrays of objects work the same way.
For raw JSON Schema users, Pydantic's Field(description="...") maps directly to the "description" key in JSON Schema, so this technique is available from any language or HTTP client.
Schematron V2 Small vs. V2 Turbo: Choosing the Right Model
Both V2 models handle the same extraction tasks. The choice is a tradeoff of accuracy against throughput and cost.
Quality scores: GPT 5.4 judge, 1–5 scale (April 2026 launch benchmark). Throughput: single H100. Pricing as of 2026-06-13.
| Model | Quality Score | Throughput | Input / 1M tokens | Output / 1M tokens |
|---|---|---|---|---|
| inference-net/schematron-v2-small | 4.060 | 2.47 req/s | $0.05 | $0.25 |
| inference-net/schematron-v2-turbo | 4.039 | 4.14 req/s | $0.03 | $0.15 |
Quality: V2 Small scores 4.060 and V2 Turbo scores 4.039 on an LLM-as-judge benchmark (GPT 5.4 judge, 1–5 scale). Both models outperform DeepSeek V3.2 and GPT-5.4 Nano on this benchmark. The full methodology is in the Schematron V2 launch post.
Throughput: V2 Turbo processes 4.14 requests per second on a single H100, roughly 2.5× faster than the V1 8B it replaces. V2 Small runs at 2.47 req/sec on the same hardware.
Pricing: V2 Small costs $0.05 per million input tokens and $0.25 per million output tokens. V2 Turbo costs $0.03 per million input tokens and $0.15 per million output tokens. At a typical cleaned page size of 5,000 tokens, V2 Turbo costs roughly $0.00015 per page in input tokens.
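The per-page arithmetic can be checked with a small calculator. The prices come from the table above; the estimate_cost helper and the 200-token output size are assumptions for illustration, since output length depends on the schema.

```python
from decimal import Decimal

# Prices in USD per one million tokens, taken from the table above.
PRICING = {
    "inference-net/schematron-v2-small": {"input": Decimal("0.05"), "output": Decimal("0.25")},
    "inference-net/schematron-v2-turbo": {"input": Decimal("0.03"), "output": Decimal("0.15")},
}
MILLION = Decimal(1_000_000)


def estimate_cost(model: str, input_tokens: int, output_tokens: int = 0, pages: int = 1) -> Decimal:
    """Estimated USD cost for `pages` pages with the given per-page token counts."""
    price = PRICING[model]
    per_page = (input_tokens * price["input"] + output_tokens * price["output"]) / MILLION
    return per_page * pages


TURBO = "inference-net/schematron-v2-turbo"
SMALL = "inference-net/schematron-v2-small"

print("Turbo, input only, one page:", estimate_cost(TURBO, 5_000))
print("Small, input only, one page:", estimate_cost(SMALL, 5_000))
print("Turbo, input only, 50,000 pages:", estimate_cost(TURBO, 5_000, pages=50_000))
# 200 output tokens per page is an assumed figure, not a measured one.
print("Turbo, with output, one page:", estimate_cost(TURBO, 5_000, output_tokens=200))
```

Decimal is used here so small per-page figures do not pick up floating-point noise when multiplied across large page counts.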
Decision guidance: Use V2 Small when the schema is complex (many optional fields, nested objects), the pages are long or content-dense, or accuracy is the primary constraint. Use V2 Turbo when throughput matters: real-time processing, large catalog ingestion, or cost-sensitive pipelines. The quality gap is minimal (4.060 vs 4.039), so for most workloads V2 Turbo is the right default.
Legacy routing: inference-net/schematron-3b routes transparently to V2 Turbo; inference-net/schematron-8b routes to V2 Small. Existing integrations continue to work without code changes, at V2 pricing.
Both models deliver better extraction quality per dollar than general-purpose LLMs on HTML-to-JSON workloads. See the Schematron V2 Turbo model page for current pricing and context window specs.
Schematron V2 Turbo's higher throughput (4.14 vs. 2.47 req/s for V2 Small, about 1.7×, and roughly 2.5× the V1 8B) and lower per-token cost make it the right starting point for high-volume extraction pipelines. Try it against your target pages to validate quality before committing to a model.
Scaling to Thousands of Pages: Batch HTML Extraction
Single-page extraction via the synchronous API works well during development and for low-volume pipelines. For thousands or tens of thousands of pages, the Batch API handles the scale.
Building a Batch JSONL Pipeline
The Batch API uses a JSONL format: one extraction request per line, each a JSON object with custom_id, method, url, and body fields. Build the JSONL locally, upload it, create a batch job, and then poll or receive a webhook when it completes.
The batch endpoint uses https://batch.inference.net/v1 as the base URL, separate from the synchronous API.
import json
import os
import time
from pydantic import BaseModel, Field
from openai import OpenAI
class Product(BaseModel):
    name: str
    price: float = Field(..., description="Primary price of the product.")
    specs: dict[str, str] = Field(default_factory=dict, description="Specs of the product.")
    tags: list[str] = Field(default_factory=list, description="Tags assigned to the product.")
client = OpenAI(
    base_url="https://batch.inference.net/v1",
    api_key=os.environ["INFERENCE_API_KEY"],
)
# Step 1: Build JSONL input
# Each line is one extraction request; custom_id links inputs to outputs after completion.
pages = [
    {"id": "page-001", "html": "<div>...</div>"},
    {"id": "page-002", "html": "<div>...</div>"},
    # ... up to 50,000 pages per batch
]
with open("batchinput.jsonl", "w") as f:
    for page in pages:
        request = {
            "custom_id": page["id"],
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "inference-net/schematron-v2-small",
                "messages": [{"role": "user", "content": page["html"]}],
                "response_format": {
                    "type": "json_schema",
                    "json_schema": {
                        "name": "product",
                        "strict": True,
                        "schema": Product.model_json_schema(),
                    },
                },
            },
        }
        f.write(json.dumps(request) + "\n")
# Step 2: Upload batch input file (the context manager closes the handle afterwards)
with open("batchinput.jsonl", "rb") as f:
    batch_input_file = client.files.create(
        file=f,
        purpose="batch",
    )
print(batch_input_file)
# Step 3: Create batch job (completion_window must be "24h")
batch = client.batches.create(
    input_file_id=batch_input_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
    metadata={
        "description": "product extraction run",
    },
    # Optional. Must be HTTPS.
    extra_body={
        "webhook_url": "https://example.com/my_webhook",
    },
)
print(batch)
# Step 4: Poll status until the batch reaches a terminal state
while True:
    retrieved_batch = client.batches.retrieve(batch.id)
    print(retrieved_batch.status)
    if retrieved_batch.status in ("completed", "failed", "expired", "cancelled"):
        break
    time.sleep(30)
# Step 5: Download results
# output_file_id is set only when the batch reaches "completed" status,
# so stop here if the job failed, expired, or was cancelled.
if retrieved_batch.status != "completed":
    raise RuntimeError(f"Batch ended with status {retrieved_batch.status!r}")
file_response = client.files.content(retrieved_batch.output_file_id)
print(file_response.text)
# Parse each output line; use custom_id to match back to input pages
for line in file_response.text.strip().splitlines():
    result = json.loads(line)
    page_id = result["custom_id"]
    content = result["response"]["body"]["choices"][0]["message"]["content"]
    product = Product.model_validate_json(content)
    print(page_id, product.name, product.price)
Batch limits: up to 50,000 requests per batch, with input files up to 200 MB. A single batch job handles price monitoring crawls, full catalog ingestions, and RAG corpus builds without needing to manage a concurrent request pool or rate-limit backoff. See the Batch API documentation for the full reference.
The most useful production pattern is to keep custom_id stable and meaningful. Use an ID that combines the source system and entity key, such as shopify:product:12345 or crawl:2026-06-13:urlhash. When results come back, you can upsert by custom_id, retry only failed records, and compare old and new extractions without guessing which page produced which JSON object.
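One way to generate such IDs is to hash the URL and prefix it with the source and crawl date. This is a sketch only: the make_custom_id helper, the 16-character hash length, and the exact ID layout are assumptions, and any length or character limits the Batch API places on custom_id should be checked against its documentation.

```python
import hashlib
from datetime import date


def make_custom_id(source: str, url: str, crawl_date: date) -> str:
    """Build a stable ID of the form source:date:urlhash (layout is an assumption)."""
    # Only surrounding whitespace is stripped; further URL normalization
    # (case, trailing slashes, query strings) is left to the pipeline.
    url_hash = hashlib.sha256(url.strip().encode("utf-8")).hexdigest()[:16]
    return f"{source}:{crawl_date.isoformat()}:{url_hash}"


first = make_custom_id("crawl", "https://example.com/products/12345", date(2026, 6, 13))
again = make_custom_id("crawl", "https://example.com/products/12345", date(2026, 6, 13))
other = make_custom_id("crawl", "https://example.com/products/67890", date(2026, 6, 13))
print(first)
print(first == again, first == other)
```

Because the ID is derived from the inputs instead of a counter, re-running the crawl for the same URL and date yields the same key, which is what makes upserts and targeted retries possible.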
Webhooks and Polling for Results
Pass webhook_url in the batch creation request and Inference.net POSTs to your endpoint when the job finishes, so no polling loop is needed.
Polling fallback: GET /v1/batches/{id} returns the current status field, such as validating, in_progress, completed, or failed. Once completed, the output_file_id in the response gives you the file to download. Each output line matches a custom_id from your input, making it easy to correlate results with source pages.
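Correlating by custom_id also makes it straightforward to separate records that can be stored from records that need a retry. The sketch below works on the downloaded output text. The success path follows the response.body.choices shape used in the batch example above; the failure shape (an "error" object or a non-200 status_code) is an assumption based on the OpenAI-style batch format and should be confirmed against the Batch API documentation.

```python
import json


def partition_results(output_jsonl: str) -> tuple[dict, list]:
    """Split batch output into parsed successes and custom_ids to retry."""
    succeeded: dict = {}
    failed: list = []
    for line in output_jsonl.strip().splitlines():
        if not line.strip():
            continue
        result = json.loads(line)
        custom_id = result["custom_id"]
        response = result.get("response") or {}
        # Assumed failure markers: an "error" object or a non-200 status code.
        if result.get("error") or response.get("status_code", 200) != 200:
            failed.append(custom_id)
            continue
        try:
            content = response["body"]["choices"][0]["message"]["content"]
            succeeded[custom_id] = json.loads(content)
        except (KeyError, IndexError, TypeError, json.JSONDecodeError):
            failed.append(custom_id)
    return succeeded, failed


# Hypothetical output lines for illustration.
sample_output = "\n".join([
    json.dumps({
        "custom_id": "page-001",
        "response": {
            "status_code": 200,
            "body": {"choices": [{"message": {
                "content": json.dumps({"name": "MacBook Pro M3", "price": 2499.99}),
            }}]},
        },
    }),
    json.dumps({"custom_id": "page-002", "response": None, "error": {"message": "timeout"}}),
])

succeeded, failed = partition_results(sample_output)
print("ok:", sorted(succeeded), "retry:", failed)
```

The failed list can be fed straight into the next batch input file, so only the affected pages are re-extracted.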
For smaller batches of up to 50 requests, the Group API at /api/async-inference/group is a lighter alternative. You submit the group as a single job and receive one callback when everything completes. It's useful for mid-pipeline fan-out patterns: kicking off 20–30 extraction requests for a product category and waiting for all of them before proceeding.
Cost at scale: At V2 Turbo pricing, 50,000 pages averaging 5,000 tokens of cleaned HTML is 250 million input tokens, roughly $7.50 per batch run in input costs. That's less than $0.0002 per page, competitive with any extraction approach that uses a capable model.
Schematron vs. Other HTML-to-JSON Tools
There are roughly four categories of tools in this space. Knowing where each fits makes the decision straightforward.
Schema Adherence: Yes = 100% schema adherence guaranteed; Best-effort = valid JSON but no schema guarantee; Per-field = field-level quality scores reported. Batch 50K+: Yes = native 50K-request batch; Partial = batch supported but not at this scale in a single job; No = per-page only.
| Tool | Approach | Schema Adherence | Batch (50K+ pages) |
|---|---|---|---|
| Schematron V2 (inference.net) | Dedicated extraction model | Yes | Yes |
| Firecrawl /agent | Scrape + LLM extraction | Best-effort | No |
| Scrapfly Extraction API | Multi-strategy AI models | Per-field | Partial |
| Diffbot API | Computer vision + ML | Best-effort | No |
| GPT-4o + JSON Schema | General LLM + strict mode | Yes | No |
All-in-one crawl + extract services (Firecrawl, Scrapfly) are convenient for teams without a crawler. Firecrawl's /agent endpoint handles fetching and structured extraction in one call. Scrapfly's Extraction API accepts HTML directly and offers pre-trained models for 20+ entity types alongside LLM-based extraction. The tradeoff: pricing is per crawl, extraction uses a general LLM rather than a purpose-built model, and dedicated batch processing at 50K+ pages isn't an option in the same way.
General LLMs with JSON Schema (GPT-4o, Claude) work for low-volume extraction when you're already paying for the model. With strict: true in response_format, schema adherence improves significantly. But these models weren't trained for HTML extraction. Quality degrades on long or noisy pages, latency is higher, and cost-per-extraction is substantially higher than Schematron.
DOM serializers are the wrong tool for typed extraction. They preserve structure, not semantics, as covered in the opening section.
Schematron V2 is the right choice when you have a fetch layer and need the best extraction quality on HTML you control. It composes with any crawler rather than replacing it. The 50,000-request batch capacity, purpose-built training, and 100% schema adherence combine into an extraction stack that scales cleanly.
The honest case for Firecrawl or Scrapfly: if you need managed crawl infrastructure and don't want to run a browser fleet, they handle both steps. Schematron is the extraction layer for teams that want precision on HTML they already fetch.
Common Pitfalls and How to Avoid Them
Four issues account for most extraction problems in production.
Sending un-cleaned HTML. Raw pages with <script> and <style> blocks waste tokens and reduce extraction accuracy on pages where JavaScript content is large relative to visible content. Pre-clean with lxml before sending.
Writing a system prompt. Schematron was not trained to read instruction prompts, and it ignores them. All extraction guidance must flow through the schema: field names, types, and field descriptions (Field(description=...) in Pydantic, .describe() in Zod). If you're getting inconsistent results, better field descriptions are almost always the fix, not a prompt.
Forgetting JavaScript rendering. Schematron receives HTML, not URLs. If your target pages use React, Next.js, or client-side rendering, the static HTML you'd get from requests is often a near-empty skeleton. Use Playwright or Puppeteer to render the page before passing its HTML to the extraction API.
Hitting the 128K token limit. Most cleaned pages run 2,000–10,000 tokens, so this is rarely a problem in practice. For exceptionally long pages, such as full catalog archives or paginated tables, extract the relevant DOM section first using lxml (target the product container by CSS class or ID), then send only that section.
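A cheap pre-flight check can flag the pages that need trimming before they are sent. The sketch below uses a four-characters-per-token rule of thumb, which is an assumption: real tokenizers vary, and markup-heavy HTML often produces more tokens per character than prose. The helper names and the output reserve are hypothetical.

```python
TOKEN_LIMIT = 128_000      # limit cited above
CHARS_PER_TOKEN = 4        # rough heuristic (assumption); real tokenizers vary


def estimate_tokens(text: str) -> int:
    """Approximate token count using ceiling division on character length."""
    return -(-len(text) // CHARS_PER_TOKEN)


def needs_trimming(html: str, reserve_for_output: int = 4_000) -> bool:
    """True when the estimate exceeds the limit minus an assumed output reserve."""
    return estimate_tokens(html) > TOKEN_LIMIT - reserve_for_output


typical_page = "x" * 20_000     # about 5,000 estimated tokens
oversized_page = "x" * 600_000  # about 150,000 estimated tokens
print(estimate_tokens(typical_page), needs_trimming(typical_page))
print(estimate_tokens(oversized_page), needs_trimming(oversized_page))
```

Pages flagged by the check are candidates for the section-level trimming described above. Because the heuristic can undercount, a conservative reserve is safer than running close to the limit.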
Conclusion
Schema-driven extraction with Schematron delivers typed business JSON from messy HTML: not tag trees, not selector maintenance, and not prompt engineering. The model enforces 100% field adherence, handles ambiguity through schema descriptions, and scales from a single-page prototype to a 50,000-page batch job without architectural changes.
For most workloads, start with V2 Turbo: 4.14 req/sec, $0.03 per million input tokens, and quality scores above general-purpose LLMs on extraction benchmarks. Switch to V2 Small when schemas are complex or pages are dense and accuracy matters more than throughput.
The Schematron documentation has the full API reference, additional schema examples, and the batch pipeline walkthrough.
Related Reading
- Schematron V2 launch post: benchmark methodology, quality scores, and throughput numbers behind the model comparison in this article
- LLM API Pricing Comparison: broader context for how Schematron pricing compares across the model landscape
- LLM Cost Optimization: practical guidance for cost-sensitive pipelines where V2 Turbo is the right choice