Jun 16, 2026
Best LLM for Data Extraction: Schematron Quality Metrics for HTML, Tables, and Noisy Pages
Inference Research
Best LLM for Data Extraction Is a Workload Question
Ask an AI engineering channel for the best LLM for data extraction and you'll get a list of frontier models. Grab one, point it at a product page, and it works on the first try. Then you run it across 100,000 pages and the wheels come off: the bill is uncomfortable, the latency drags your pipeline, and every few hundred pages the model returns JSON that almost matches your schema but not quite.
The list was never wrong, exactly. It was answering the wrong question. The right model depends on the layer you're solving and the metrics that matter for your pages, not on a general leaderboard. Two models with nearly identical reasoning scores can be an order of magnitude apart on cost per page, and on how often they hand back a record your code can actually use.
This guide does two things. First, it leads with published quality metrics for a task-specific extraction model, Schematron V2, so you have a concrete baseline instead of vibes. Then it hands you a reproducible evaluation harness so you can measure any model, general or task-specific, on your own pages and decide for yourself.
Read time: 10 minutes
Why "Best LLM for Data Extraction" Needs a Schema
Generic "best LLM" rankings measure reasoning, conversation, and coding. None of that tells you whether a model returns the exact typed record your pipeline expects. LLM data extraction is only measurable once you fix a schema: the required fields, their types, and the nested structure you want back. A schema turns a fuzzy task into a scored one.
That distinction matters because two outcomes that look similar are not. Valid JSON means the response parses as JSON. Schema adherence means the parsed object has the right fields, the right types, and every required key. A model can return perfectly valid JSON that fails your schema by dropping a required field or returning a price as a string.
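The difference between the two checks can be sketched with the standard library alone. This is an illustrative sketch, not how Schematron validates output: the `SCHEMA` mapping and both helper functions are hypothetical names invented for the example, and a real pipeline would more likely use Pydantic or JSON Schema.

```python
import json

# Hypothetical target schema: field name -> (accepted types, required?)
SCHEMA = {
    "name": (str, True),
    "price": ((int, float), True),
    "tags": (list, False),
}


def parses_as_json(raw: str) -> bool:
    """Valid JSON: the response can be parsed at all."""
    try:
        json.loads(raw)
        return True
    except json.JSONDecodeError:
        return False


def adheres_to_schema(raw: str) -> bool:
    """Schema adherence: required keys are present and every value has the right type."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(obj, dict):
        return False
    for field, (accepted, required) in SCHEMA.items():
        if field not in obj:
            if required:
                return False
            continue
        if not isinstance(obj[field], accepted):
            return False
    return True


good = '{"name": "MacBook Pro M3", "price": 2499.99}'
price_as_string = '{"name": "MacBook Pro M3", "price": "2,499.99"}'
missing_field = '{"name": "MacBook Pro M3"}'

for raw in (good, price_as_string, missing_field):
    print(parses_as_json(raw), adheres_to_schema(raw))
```

All three strings parse as JSON, but only the first passes the schema check. The other two are exactly the failures described above: a price returned as a string and a dropped required field.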
This is why an extraction benchmark without a schema tells you almost nothing. If two models extract against different targets, their scores aren't comparable. Once a schema is fixed, you can ask the only questions that matter: does the output parse, does it adhere, and are the values correct? Models that guarantee schema-conforming output by design, as Schematron's schema-guided extraction does, change the reliability math, because the validity question is settled before you ever run the model.
Crawl and Fetch Is Not Extraction
Before comparing models, separate two jobs that get bundled together. A web-data pipeline has distinct stages: fetch or crawl the page, clean the markup, extract typed fields, validate them, and store the result. The model question lives entirely in the extract step.
Figure 1: The Web-Data Pipeline (Extraction Is One Stage)
An LLM does not fetch pages, rotate proxies, render JavaScript, or solve CAPTCHAs. That work belongs to your retrieval layer, whether that's requests, Playwright, or a crawler like Crawl4AI. When someone asks for the best LLM for web scraping data extraction, the honest scope is narrow: HTML goes in, typed JSON comes out. Everything upstream is a separate tool choice.
One preprocessing step does belong with extraction. Schematron was trained on HTML that had been pre-cleaned with lxml to strip scripts, styles, and inline JavaScript, so aligning your preprocessing with that improves results. Readability, Trafilatura, or BeautifulSoup work too; the guidance is to err on the side of removing less content.
A Test Set That Reflects Real Pages
Evaluating on one page type tells you almost nothing, because noise, nesting, and ambiguity differ sharply by domain. A model that nails clean product pages can fall apart on a financial report. A useful extraction test set spans the page types you actually run.
Four domains stress different failure modes and cover most production extraction work.
| Domain | Example fields | What makes it hard |
|---|---|---|
| E-commerce product | name, price, variants, specs | Sale vs list price; variants |
| Real estate listing | address, beds, area, agent | Duplicate facts; hidden metadata |
| Financial page | metrics, period, guidance | Long, table-dense pages |
| Company profile | name, hq, headcount, links | Sparse, scattered fields |
E-commerce pages punish models that can't tell a sale price from a list price, or that flatten variants into one record. Real estate listings repeat the same fact in three places and bury others in metadata, so a model has to reconcile rather than copy. Financial pages are long and table-dense, which tests whether a model holds context across thousands of tokens without losing the thread. Company profiles are sparse and scattered, where the hard part is leaving a field null instead of hallucinating a plausible value.
For each domain, define a typed schema and label a small gold set, roughly 10 to 50 pages, with the correct values by hand. That gold set is what makes field accuracy measurable later, and it's worth labeling a few deliberately ugly pages, the ones with missing fields or duplicated facts, because that's where models diverge. Keep the test set focused on web HTML; scanned PDFs and image-only documents are a different problem that needs OCR first.
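One possible layout for the gold set is JSON Lines, with one labeled page per line. The record shape below (`page_id`, `domain`, `expected`) and the sample values are assumptions made for illustration, not a format Schematron requires. A `null` marks a field that is genuinely absent from the page.

```python
import json
from collections import Counter

# Hypothetical gold-set records: one labeled page per line.
GOLD_JSONL = """\
{"page_id": "shop-001", "domain": "ecommerce", "expected": {"name": "MacBook Pro M3", "price": 2499.99}}
{"page_id": "shop-002", "domain": "ecommerce", "expected": {"name": "USB-C Hub", "price": null}}
{"page_id": "co-001", "domain": "company_profile", "expected": {"name": "Example Corp", "headcount": null}}
"""


def load_gold(text: str) -> list[dict]:
    """Parse JSON Lines text into a list of gold records, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


gold = load_gold(GOLD_JSONL)

# Pages labeled per domain, to check each one reaches the 10-to-50 range.
per_domain = Counter(record["domain"] for record in gold)

# "Ugly" pages: at least one field is labeled as absent.
ugly_pages = [r["page_id"] for r in gold if None in r["expected"].values()]

print(dict(per_domain))
print(ugly_pages)
```

Labeling absent fields as `null` rather than omitting them makes hallucinated values visible when you score: a model that fills in a plausible headcount for `co-001` gets marked wrong.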
Schematron V2 Quality Metrics
Here's the baseline. Schematron V2 is a task-specific model family trained for HTML-to-JSON extraction, and its quality numbers are published. The evaluation uses an LLM-as-a-judge methodology, with GPT-5.4 grading extractions on a 1-to-5 scale, the same approach used at the original launch. The full numbers and methodology are in the Schematron V2 launch post.
Extraction Quality: LLM-as-a-Judge
On the quality benchmark, Schematron V2 Small scores 4.060, within striking distance of the first-generation Schematron 8B at 4.070 and clearly ahead of the original 3B at 3.909. The throughput-tuned V2 Turbo scores 4.039, barely behind Small despite being optimized for speed. Both V2 models outscore DeepSeek V3.2 and GPT-5.4 Nano on this benchmark, two substantially larger general-purpose models. The original 3B and 8B are retired now (requests to those model IDs route transparently to their V2 successors), so the numbers here are the historical baseline the benchmark ran against, not models you would still deploy.

What this means for model selection: a small, specialized model can match a much larger one on extraction quality, because it spends all of its parameters on one task instead of spreading them across general capabilities you aren't using here.
Throughput: Requests per Second
Quality only matters if you can afford to run it at volume. On a single H100 with 10,000 input tokens and 500 output tokens per request, V2 Turbo processes 4.14 requests per second, against 2.47 for V2 Small, 2.47 for the original 3B, and 1.63 for the original 8B. That makes Turbo roughly a 2.5x throughput improvement over the original 8B.
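To make those rates concrete, here is a rough conversion from requests per second to single-GPU wall-clock time. It assumes one request per page, pages that match the benchmark profile, and a sustained rate with no queueing or retries, so treat the result as an estimate rather than a measurement.

```python
# Requests per second on a single H100, as quoted above.
RPS = {
    "V2 Turbo": 4.14,
    "V2 Small": 2.47,
    "Original 3B": 2.47,
    "Original 8B": 1.63,
}


def gpu_hours(pages: int, rps: float) -> float:
    """Hours one GPU needs to process `pages` requests at a sustained `rps`."""
    return pages / rps / 3600


for model, rps in RPS.items():
    print(f"{model}: {gpu_hours(1_000_000, rps):.1f} GPU-hours per 1M pages")
```

At these rates, one million pages take roughly 67 GPU-hours on V2 Turbo, about 112 on V2 Small, and about 170 on the original 8B.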

For latency-sensitive pipelines, such as real-time price monitoring or parsing financial documents against a deadline, that speed compounds across millions of pages.
Factuality: SimpleQA with Web Search
Extraction quality also shows up downstream. On SimpleQA, a factual question-answering benchmark run through a web-search-augmented pipeline with GPT-5 Nano as the base model and Exa for search, the extraction layer drives the score. V2 Small reaches 83.10 and V2 Turbo 79.42, against 85.58 for the original 8B and 75.47 for the original 3B. The same base model with no search scores 8.54.

The gap between roughly 80 and 8.54 is the whole argument for structured extraction as a layer: instead of stuffing raw HTML into a large model, you extract only the structured data that matters and make web-augmented factuality viable at scale.
The Metrics That Actually Decide a Model
Published numbers are a starting hypothesis. To choose for your workload, measure five things on your own pages.
| Metric | What it measures | How to compute |
|---|---|---|
| Valid JSON rate | Response parses as JSON | Parse, count successes |
| Schema adherence | Fields, types, required keys correct | Validate vs schema |
| Field accuracy | Values match the truth | Compare vs gold set |
| Latency | Per-page speed | p50 and p95 |
| Cost per page | Spend at volume | Tokens x rate |
Two of these are worth dwelling on. Valid JSON rate and schema adherence rate are where general-purpose models quietly cost you. When a model occasionally returns malformed or non-conforming output, you end up building a retry-and-repair loop around it, and that loop is real latency and real spend. A model with strict JSON mode and 100% schema adherence by design removes that tax entirely.
Field accuracy is the metric that separates "valid" from "correct." A response can adhere perfectly to your schema and still get the price wrong, swap two specs, or invent a value for a field that wasn't on the page. Scoring per-field values against a labeled gold set is the only way to catch that. Use exact match for atomic fields like names and SKUs, and a normalized comparison for prices, dates, and currencies so that $2,499.99 and 2499.99 count as a match. Field accuracy is the number that should decide which model ships, because it's the one that maps directly to whether your downstream data is trustworthy.
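A normalized comparison for prices might look like the sketch below. The function names and the `PRICE_FIELDS` set are hypothetical, and the cleanup assumes US-style formatting (comma as thousands separator, period as decimal point); European formats such as `1.299,00` would need their own handling.

```python
import re
from decimal import Decimal, InvalidOperation
from typing import Optional

PRICE_FIELDS = {"price"}  # hypothetical: the fields that get normalized comparison


def normalize_price(value) -> Optional[Decimal]:
    """Reduce '$2,499.99', '2499.99', or 2499.99 to Decimal('2499.99')."""
    if value is None:
        return None
    # Keep digits, the decimal point, and a minus sign; drop symbols and commas.
    cleaned = re.sub(r"[^0-9.\-]", "", str(value))
    try:
        return Decimal(cleaned).quantize(Decimal("0.01"))
    except InvalidOperation:
        return None


def fields_match(field: str, actual, expected) -> bool:
    """Normalized comparison for price fields, exact match for everything else."""
    if field in PRICE_FIELDS:
        a, e = normalize_price(actual), normalize_price(expected)
        if a is None or e is None:
            # Unparseable values only match when both sides are truly empty.
            return actual is None and expected is None
        return a == e
    return actual == expected


print(fields_match("price", "$2,499.99", 2499.99))
print(fields_match("sku", "ab-1", "AB-1"))
```

Using `Decimal` instead of `float` avoids rounding surprises when comparing currency values, and keeping exact match as the default means a new field is scored strictly until you decide it needs normalization.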
Latency and cost round out the picture, and they're easy to forget in a notebook where you test one page at a time. Measure p50 and p95 latency, not just the average, since a long tail on a few pages can stall a batch. Compute cost per page from the actual input and output token counts your pages produce, then multiply out to the volume you actually run. A model that's a few cents cheaper per thousand pages can be the difference between a pipeline that's viable and one that isn't at internet scale.
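As a small illustration of why the tail matters, this sketch computes nearest-rank percentiles over a list of per-page latencies. The latency values are made up for the example, and nearest-rank is only one of several common percentile definitions.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Made-up per-page latencies in seconds; one slow page sits in the tail.
latencies = [0.8, 0.9, 0.9, 1.0, 1.0, 1.1, 1.1, 1.2, 1.3, 9.5]

mean = sum(latencies) / len(latencies)
print(f"mean={mean:.2f}s p50={percentile(latencies, 50)}s p95={percentile(latencies, 95)}s")
```

Here the median is 1.0 s and the mean is about 1.9 s, but p95 is 9.5 s. Neither the mean nor the median warns you that one page in ten could hold up a batch.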
General-Purpose LLMs vs Task-Specific Extraction Models
With the metrics defined, the comparison is clearer, and choosing a data extraction LLM comes down to one trade-off between two families of models. General frontier models like Claude Sonnet 4, Gemini 2.5 Pro, and GPT-5.5 are strong when extraction shades into reasoning: ambiguous fields, inference across a page, or synthesis. The cost is real, though. They charge frontier prices per page, run slower at volume, and often need constrained decoding or retries to guarantee valid JSON.
Task-specific models take the opposite trade. Schematron is trained on HTML-to-JSON, returns strict JSON by design, and runs far cheaper and faster. The catch is that it has no prompt channel, so anything that needs instructions beyond the schema is out of scope. The published benchmark backs this up: V2 Small scores within 0.01 of the first-generation 8B (4.060 vs 4.070), and both V2 models beat two larger general-purpose models on the extraction judge.
| Dimension | General-purpose LLM | Task-specific model |
|---|---|---|
| Valid JSON | Often needs retries | Strict by design |
| Cost per page | Frontier prices | Far lower |
| Throughput | Slower at volume | Higher |
| Reasoning / synthesis | Strong | Out of scope |
| Prompt control | Yes | Schema only |
| Best fit | Ambiguous, reasoning | High-volume extraction |
The decision rule follows from the table. If the job is reasoning-heavy synthesis, reach for a frontier model. If it's high-volume, schema-strict extraction from web pages, a task-specific model wins on cost, speed, and reliability while holding quality. If your workload looks like the second case, Schematron V2 Turbo is the natural model to put under test.
A Reproducible Evaluation Harness
This is the part most guides skip. Don't take the published numbers, or this article, on faith. Run the same evaluation on your own pages. The harness is five steps, and it's model-agnostic: swap the model identifier and you can score a frontier model on exactly the same gold set.
First, define the schema. Use Pydantic in Python or Zod in TypeScript, and put your extraction guidance in the field descriptions, since Schematron reads only the schema, not a prompt.
```python
from pydantic import BaseModel, Field


# 1) Define your schema (nested data and lists are supported)
class Product(BaseModel):
    name: str
    price: float = Field(
        ...,
        description="Primary price of the product.",
    )
    specs: dict[str, str] = Field(
        default_factory=dict,
        description="Specs of the product.",
    )
    tags: list[str] = Field(
        default_factory=list,
        description="Tags assigned to the product.",
    )
```

```typescript
import { z } from "zod";

// 1) Define your schema (nested data and lists are supported)
const Product = z.object({
  name: z.string(),
  price: z.number().describe("Primary price of the product."),
  // Explicit key and value schemas; the two-argument form works in both Zod 3 and Zod 4.
  specs: z.record(z.string(), z.string()).describe("Specs of the product."),
  tags: z.array(z.string()).describe("Tags assigned to the product."),
});
```

Second, call the model. Schematron speaks the OpenAI-compatible API, so you point the OpenAI SDK at https://api.inference.net/v1 and pass the schema as the response format. Keep temperature at 0, which is what the model is trained for, and pre-clean the HTML with lxml first.
```python
import lxml.html as LH
# Newer lxml releases ship this module separately as the lxml_html_clean package.
from lxml.html.clean import Cleaner

HTML_CLEANER = Cleaner(
    scripts=True,
    javascript=True,
    style=True,
    inline_style=True,
    safe_attrs_only=False,
)


def strip_noise(html: str) -> str:
    """Remove scripts, styles, and JavaScript from HTML using lxml."""
    if not html or not html.strip():
        return ""
    try:
        doc = LH.fromstring(html)
        cleaned = HTML_CLEANER.clean_html(doc)
        return LH.tostring(cleaned, encoding="unicode")
    except Exception:
        # Unparseable input yields an empty string rather than an exception.
        return ""
```

```python
import os

from openai import OpenAI

# Assumes the schema from step 1 is saved as pydantic_schema.py.
from pydantic_schema import Product

# 2) Messy HTML (could be the full page; pre-clean with lxml first)
html = """
<div id="item">
  <h2 class="title">MacBook Pro M3</h2>
  <p>Price: <b>$2,499.99</b> USD</p>
  <ul info>
    <li>RAM: 16GB</li>
    <li>Storage: 512GB SSD</li>
  </ul>
  <span class="tag">laptop</span>
  <span class="tag">professional</span>
  <span class="tag">macbook</span>
  <span class="tag">apple</span>
</div>
"""

# 3) Client setup (OpenAI-compatible API)
client = OpenAI(
    base_url="https://api.inference.net/v1",
    api_key=os.environ.get("INFERENCE_API_KEY"),
)

resp = client.beta.chat.completions.parse(
    model="inference-net/schematron-v2-small",
    messages=[
        {"role": "user", "content": html},
    ],
    response_format=Product,
    temperature=0,
)

print(resp.choices[0].message.parsed.model_dump_json(indent=2))
```

For the example HTML, the model returns a schema-conforming object like this.
```json
{
  "name": "MacBook Pro M3",
  "price": 2499.99,
  "specs": {
    "RAM": "16GB",
    "Storage": "512GB SSD"
  },
  "tags": [
    "laptop",
    "professional",
    "macbook",
    "apple"
  ]
}
```

Third and fourth, validate and score. Parse the response against your schema and handle errors explicitly, even though strict JSON mode should always return conforming output, then compare each field to the gold value to compute valid-JSON, schema-adherence, and field-accuracy rates.
```python
import json

from pydantic import ValidationError

# Assumes the schema from step 1 is saved as pydantic_schema.py.
from pydantic_schema import Product


def normalize(value):
    """Loose comparison: numbers compare numerically, strings case-insensitively."""
    if isinstance(value, (int, float)):
        return round(float(value), 2)
    return str(value).strip().lower()


def score(raw_responses: list[str], gold: list[dict]) -> dict:
    """Score model output against a labeled gold set.

    raw_responses: the model's raw string output, one per page.
    gold: the correct field values for each page, same order.
    Returns valid-JSON, schema-adherence, and field-accuracy rates.
    """
    n = len(gold)
    if n == 0:
        raise ValueError("gold set is empty")

    valid_json = 0
    schema_ok = 0
    field_hits = 0
    field_total = 0

    for raw, truth in zip(raw_responses, gold):
        # 1) Valid JSON: does it parse at all?
        try:
            obj = json.loads(raw)
            valid_json += 1
        except (json.JSONDecodeError, TypeError):
            continue

        # 2) Schema adherence: validate against the Pydantic model.
        #    model_validate also rejects non-object JSON (for example a list)
        #    with a ValidationError instead of raising a TypeError.
        try:
            parsed = Product.model_validate(obj)
            schema_ok += 1
        except ValidationError:
            continue

        # 3) Field accuracy: compare each gold field to the parsed value.
        #    Pages that failed either check above are skipped, so this rate
        #    covers schema-valid responses only.
        for field, expected in truth.items():
            field_total += 1
            actual = getattr(parsed, field, None)
            if normalize(actual) == normalize(expected):
                field_hits += 1

    return {
        "valid_json_rate": valid_json / n,
        "schema_adherence_rate": schema_ok / n,
        "field_accuracy": (field_hits / field_total) if field_total else 0.0,
    }


if __name__ == "__main__":
    # Swap in your own model's raw outputs and labeled gold set.
    responses = ['{"name": "MacBook Pro M3", "price": 2499.99, "specs": {}, "tags": []}']
    gold = [{"name": "MacBook Pro M3", "price": 2499.99}]
    print(json.dumps(score(responses, gold), indent=2))
```

Fifth, compare. Run the same harness against a frontier model by changing one identifier, and put the two side by side on valid JSON, schema adherence, field accuracy, latency, and cost. The point of keeping the harness model-agnostic is exactly this: the only fair comparison is the same schema, the same pages, and the same scorer, with nothing changing but the model. Once you have that table for your four domains, the "best LLM" question answers itself for your workload.
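The scorer covers three of the five metrics. Latency can be collected with a thin wrapper around whatever function calls the model. This is a minimal sketch: `time_extractions` and `fake_extract` are hypothetical names, and the stand-in extractor exists only so the example runs without network access. In practice you would pass a function that sends the page to the model under test and returns its raw string output.

```python
import time
from typing import Callable


def time_extractions(
    extract: Callable[[str], str],
    pages: list[str],
) -> tuple[list[str], list[float]]:
    """Run extract() on each page; return raw outputs and per-page latency in seconds."""
    outputs: list[str] = []
    latencies: list[float] = []
    for html in pages:
        start = time.perf_counter()
        outputs.append(extract(html))
        latencies.append(time.perf_counter() - start)
    return outputs, latencies


def fake_extract(html: str) -> str:
    """Stand-in for a real model call; always returns the same JSON string."""
    return '{"name": "stub", "price": 0.0, "specs": {}, "tags": []}'


outputs, latencies = time_extractions(fake_extract, ["<p>a</p>", "<p>b</p>"])
print(len(outputs), [f"{t:.6f}" for t in latencies])
```

Because the wrapper only sees a callable, swapping models means swapping one function. The raw outputs feed the scorer unchanged, and the latencies feed your p50 and p95 calculation.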
To run this across thousands of pages instead of a handful, submit the jobs through the Batch API rather than looping synchronously, which keeps the evaluation cheap and lets you score a statistically meaningful sample instead of a token few.
Cost: Schematron V2 Small vs Turbo
Cost per page is usually the metric that decides volume work, and it's straightforward to compute. Schematron V2 Small is priced at $0.05 per million input tokens and $0.25 per million output tokens; V2 Turbo is $0.03 per million input and $0.15 per million output.
Using the same page profile as the throughput benchmark, 10,000 input tokens plus 500 output tokens per page, the arithmetic works out to roughly $62.50 per 100,000 pages for Small and $37.50 per 100,000 pages for Turbo. These are worked numbers from the per-token rates, not a published benchmark, and your real cost depends on your page sizes.
| Model | Input $/1M | Output $/1M | ~Cost / 100k pages |
|---|---|---|---|
| V2 Small | $0.05 | $0.25 | $62.50 |
| V2 Turbo | $0.03 | $0.15 | $37.50 |
Cost per 100k pages is derived arithmetic at 10k input + 500 output tokens per page, not a published benchmark; actual cost depends on your page sizes.
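The same arithmetic as a small function, so you can plug in your own token counts. The rates are the per-million-token prices quoted above; the function name and dictionary are illustrative.

```python
# USD per 1M tokens as (input, output), from the rates quoted above.
PRICES = {
    "V2 Small": (0.05, 0.25),
    "V2 Turbo": (0.03, 0.15),
}


def cost_per_page(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one page given its input and output token counts."""
    in_rate, out_rate = PRICES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000


for model in PRICES:
    total = cost_per_page(model, 10_000, 500) * 100_000
    print(f"{model}: ${total:.2f} per 100k pages")
```

With 10,000 input and 500 output tokens per page, this reproduces the $62.50 and $37.50 figures in the table. Replace those two numbers with the median token counts from your own pages to get a figure for your workload.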

For context on how that compares to general-purpose model APIs, see our model API pricing comparison. And if you're planning a pipeline well into the millions of pages, dedicated capacity and volume pricing are worth a conversation before you commit a budget.
Choosing Between Schematron V2 Small and Turbo
The two variants map cleanly to the two binding constraints most teams hit.
| Binding constraint | Pick |
|---|---|
| Field accuracy, complex schemas, long pages | V2 Small |
| Throughput, lowest cost, latency-sensitive | V2 Turbo |
| Mixed or unsure | Test both on your gold set |
Pick V2 Small when extraction quality on complex or deeply nested schemas and very long pages is the priority. Pick V2 Turbo when throughput and cost are the binding constraint and quality is still strong enough, which the benchmark says it is. When you're unsure, run both through the harness on your gold set; the field-accuracy gap on your pages is the answer, not a rule of thumb.
When Schematron Is Not the Right Layer
A task-specific extraction model is the wrong tool for several real jobs, and saying so is part of choosing well.
If you need to retrieve pages, render JavaScript, rotate proxies, or get past CAPTCHAs, that's a crawler or fetch layer, not an extraction model. Schematron only handles the extract step.
If the task requires synthesizing across multiple documents, summarizing, or any reasoning beyond pulling fields into a schema, reach for a general LLM. Schematron has no prompt channel, so it can't follow instructions that don't fit in a schema.
If your source is an image-only or scanned PDF, run OCR or document intelligence first, then normalize the extracted text. And if you want free-form answers rather than schema-conforming records, an extraction model isn't the right shape at all.
Conclusion: Pick the Layer, Then Prove It on Your Pages
The best LLM for data extraction isn't a single model. It's whichever one wins on valid JSON, schema adherence, field accuracy, latency, and cost for your specific pages. The published Schematron V2 metrics give you a strong starting hypothesis: near-8B quality, higher throughput, and a cost low enough for internet-scale work. The harness is how you confirm or kill that hypothesis on your own data.
Start by defining a schema, running it against a few of your hardest pages, and checking whether the output is something your pipeline can use without a repair loop.
Related Reading
- HTML-to-JSON Extraction API: Extract Typed Data from Messy Web Pages — the typed-extraction fundamentals behind this benchmark.
- Crawl4AI: The Complete Guide to LLM Web Scraping — how to fetch the pages you then extract.
- LLM API Pricing Comparison 2026 — how general-purpose model API costs stack up.