Jun 16, 2026
Data Extraction SDKs and APIs: Build vs Buy for HTML-to-JSON Extraction Pipelines
Inference Research
The Build-vs-Buy Decision for HTML-to-JSON Extraction
The first version of a web scraper is almost always cheap. You open the page, find the price in a <span class="price">, write a selector, and ship it. The expensive part arrives later, when the site ships an A/B test, renames a class, or moves the price into client-side rendering. The selector still runs. It just returns nothing, or the wrong thing, and the bad data flows downstream until a dashboard goes blank and someone files a ticket.
That maintenance tail is the real subject of any build-vs-buy decision for data extraction. A data extraction SDK or a data extraction API turns raw web pages into typed JSON without per-site parsing code: instead of maintaining CSS selectors and XPath, you pass cleaned HTML and a schema to a hosted model and get back conforming records. Calling an API to extract data from a website this way, rather than hand-writing an HTML-to-JSON parser per site, reframes the build-vs-buy question as who maintains the parsing logic, and how much that maintenance costs as your number of source sites grows.
This guide frames that decision for backend and platform engineers. It covers what DIY parsers actually cost to keep alive, an SDK/API evaluation checklist you can score vendors against, a working integration with a real schema and validation step, observability and error handling for production, a grounded cost note, a staged migration plan, and an honest section on when a hosted extraction API is the wrong layer.
Read time: 12 minutes
What "Build" Actually Costs: DIY Parser Maintenance
The cost of building is not the first parser. It is the slope of the maintenance curve as you add sites and as the web underneath you keeps moving:
- Selector and XPath fragility. CSS and XPath are coupled to the page's DOM. A class rename, a markup refresh, or an experiment quietly produces empty or wrong fields. The failure shows up downstream, not at extraction time.
- Per-site sprawl. Each new source site needs its own selector set. N sites means N parsers to keep alive, so build cost scales roughly linearly with the number of sites you cover.
- JS-render drift. When a site moves data into client-side rendering, a static fetch stops working and you inherit a headless browser to install, pin, and babysit.
- Schema drift. When the target output changes, a new field or a renamed one, every per-site parser needs editing.
- The silent-failure tax. Brittle parsers fail open. Wrong data lands in RAG corpora, price monitors, and enrichment tables until someone notices. The cost is data-quality incidents, not only engineering hours.
A short example makes the fragility concrete. The parser below works perfectly until the day the markup changes:
```python
from bs4 import BeautifulSoup


def parse_product(html: str) -> dict:
    soup = BeautifulSoup(html, "html.parser")

    # Each of these is a bet that the markup never changes.
    # Swap the <b> around the price for a <span> upstream and `price` becomes None,
    # silently, with no error raised here or anywhere downstream.
    title_el = soup.select_one("div#item h2.title")
    price_el = soup.select_one("div#item p b")
    spec_els = soup.select("div#item ul li")
    tag_els = soup.select("div#item span.tag")

    price_text = price_el.get_text(strip=True) if price_el else ""
    price = price_text.replace("$", "").replace(",", "").replace(" USD", "")

    specs = {}
    for li in spec_els:
        text = li.get_text(strip=True)
        if ":" in text:
            key, value = text.split(":", 1)
            specs[key.strip()] = value.strip()

    return {
        "name": title_el.get_text(strip=True) if title_el else None,
        "price": float(price) if price else None,  # blows up or returns None on layout drift
        "specs": specs,
        "tags": [t.get_text(strip=True) for t in tag_els],
    }
```

Every selector in that file is a small bet that the page will not change. Across dozens of sites, you are running dozens of those bets at once, and you only learn you lost when the data is already wrong.
What "Buy" Changes: One Schema Instead of N Parsers
The buy alternative replaces per-site parsing code with one schema per record type. A schema-guided extraction model reads the page's content rather than its tag structure, so a layout change does not require a code change. You describe a Product or a JobPosting once, and the model maps each site's markup into that shape, which is what lets you collapse many brittle parsers onto a single definition.
The table below contrasts the two approaches across the dimensions that actually drive cost over time.
| Dimension | Build (DIY parsers) | Buy (schema-guided API) |
|---|---|---|
| New source onboarding | New selector set per site | Reuse one schema |
| Layout change | Edit selectors | No code change |
| JS-rendered pages | Add headless browser | Pre-clean, then extract |
| Schema change | Edit every parser | Edit one schema |
| Failure visibility | Fails silently | Validate on ingest |
| Cost shape | Engineering time | Per-token usage |
The "no code change on layout" and per-schema reuse properties follow from schema-guided extraction reading page content rather than DOM structure.
The decision is not all-or-nothing, and it is reversible. You can buy the extraction step for the sites that change the most, keep a hand-written parser for the one stable site you have owned for years, and migrate in stages. The migration plan later in this guide walks through exactly that.
Figure 1: Build vs Buy in the Extraction Pipeline. Fetch and clean stay in your stack. The build-vs-buy choice happens at the extract step: keep a hand-written parser for a stable single site, or buy schema-guided extraction when many sites and changing layouts make per-site parsers expensive. Validation and storage are the same either way.
Separate Crawl/Fetch from Extraction
The most common mistake in this decision is treating "extraction API" and "scraping API" as the same purchase. They are not. A web data pipeline has distinct layers, and build-vs-buy is a per-layer choice:
- Crawl / fetch — get the HTML. This layer owns proxies, rate limits, JavaScript rendering, and anti-bot handling.
- Clean — strip scripts, styles, and boilerplate.
- Extract — turn cleaned HTML into typed fields.
- Validate — confirm the output conforms to the schema.
- Store — persist typed JSON to a database, warehouse, or vector store.
You can keep your existing crawler (requests, Playwright, Scrapy, Crawl4AI, Firecrawl, Apify, or Bright Data) and buy only the layer that hurts. For most teams, fetching is a solved, commoditized problem. Extraction quality is where the engineering time disappears.
Schematron, the schema-guided extraction model used in the examples here, is the extraction layer only. It does not fetch URLs, rotate proxies, or render JavaScript. It takes HTML you already have and a schema, and returns conforming JSON. Before sending HTML, pre-clean it with lxml to match how the model was trained, which improves extraction quality and consistency; Readability, Trafilatura, BeautifulSoup, or even targeted regex are acceptable alternatives. For the mechanics of how schema-guided extraction turns messy markup into typed records, the companion HTML-to-JSON extraction API guide goes deeper than we will here.
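To make the clean step concrete, here is a minimal sketch of stripping scripts, styles, and comments before extraction. It is a dependency-free stand-in built on Python's standard-library `html.parser`, not the lxml-based cleaning recommended above, and the names `clean_html` and `DROP_TAGS` are illustrative assumptions rather than part of any SDK.

```python
import re
from html import escape
from html.parser import HTMLParser

# Tags whose whole subtree is dropped before extraction (an assumed, adjustable set).
DROP_TAGS = {"script", "style", "noscript", "template"}


class _Cleaner(HTMLParser):
    """Re-emits the document while skipping DROP_TAGS subtrees and comments."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # greater than 0 while inside a dropped subtree

    @staticmethod
    def _open(tag, attrs, close=""):
        parts = [tag]
        for key, value in attrs:
            parts.append(key if value is None else f'{key}="{escape(value, quote=True)}"')
        return "<" + " ".join(parts) + close + ">"

    def handle_starttag(self, tag, attrs):
        if tag in DROP_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0:
            self.out.append(self._open(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        if tag not in DROP_TAGS and self.skip_depth == 0:
            self.out.append(self._open(tag, attrs, close="/"))

    def handle_endtag(self, tag):
        if tag in DROP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Keep visible text, collapsing runs of whitespace to a single space.
        if self.skip_depth == 0 and data.strip():
            self.out.append(escape(re.sub(r"\s+", " ", data), quote=False))


def clean_html(raw_html: str) -> str:
    """Return the HTML with scripts, styles, and comments removed."""
    cleaner = _Cleaner()
    cleaner.feed(raw_html)
    cleaner.close()
    return "".join(cleaner.out)
```

Comments disappear because the parser's default comment handler does nothing. Smaller input also means fewer input tokens per page, so the same step that helps quality also lowers cost.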
Data Extraction SDK vs API: What You Are Actually Adopting
"SDK" and "API" get used interchangeably in this space, so it is worth being precise about what you adopt when you buy an HTML-to-JSON API.
The API is the OpenAI-compatible Chat Completions endpoint at https://api.inference.net/v1. The SDK is whatever OpenAI-compatible client your team already uses, most likely the OpenAI Python or TypeScript library. There is no bespoke extraction client to learn. Adopting extraction means three small changes: point the base URL at the inference endpoint, select an extraction model, and pass your schema as the response_format.
That schema is enforced through the same Structured Outputs contract used elsewhere on the platform. Structured Outputs guarantees schema adherence, whereas plain JSON mode only guarantees valid JSON with no promise that it matches your shape. The practical effect is that you stop writing retry-and-reparse loops around a general model that occasionally returns the wrong fields.
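The difference between the two modes is visible in the request itself. The sketch below builds both `response_format` payloads as plain dictionaries, using the shapes defined by the OpenAI Chat Completions API; that this endpoint accepts the `strict` flag in exactly this form is an assumption to confirm against the platform docs, and `build_request` is a hypothetical helper.

```python
import json

# Plain JSON mode: the server only promises syntactically valid JSON.
json_mode = {"type": "json_object"}

# Structured Outputs: the response must conform to the supplied JSON Schema.
product_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "price": {"type": "number", "description": "Primary listed price."},
        "currency": {"type": "string", "description": "ISO 4217 code, e.g. USD."},
    },
    "required": ["name", "price", "currency"],
    "additionalProperties": False,
}
structured_outputs = {
    "type": "json_schema",
    "json_schema": {"name": "product", "strict": True, "schema": product_schema},
}


def build_request(cleaned_html: str, response_format: dict) -> dict:
    """Assemble a Chat Completions request body (hypothetical helper)."""
    return {
        "model": "inference-net/schematron-v2-turbo",
        "messages": [{"role": "user", "content": cleaned_html}],
        "response_format": response_format,
        "temperature": 0,
    }


if __name__ == "__main__":
    print(json.dumps(build_request("<p>...</p>", structured_outputs), indent=2))
```

With `json_mode`, nothing in the request describes your record, so any shape checking has to happen in your own code after the response arrives.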
An SDK/API Evaluation Checklist
When you do decide to buy, score candidates on the same six dimensions rather than on landing-page adjectives. A quick way to run the evaluation:
- Confirm it accepts your schema (Pydantic, Zod, or JSON Schema), not just its own record shapes.
- Check that schema adherence is guaranteed, not best-effort.
- Verify there is an async or batch path for volume, with completion notifications.
- Model the cost per page at your real token sizes, not the headline per-token rate.
- Measure latency and throughput on your own pages.
- Confirm you can observe failures: retries, validation errors, and per-page cost.
The table below expands each criterion into what to actually score and why it matters.
| Criterion | What to score | Why it matters |
|---|---|---|
| Schema support | Pydantic / Zod / JSON Schema | You control the output shape |
| Validation | Strict adherence, typed return | Catch bad records at ingest |
| Async & batch | Bulk path with callbacks | Backfills and recurring jobs |
| Cost | Price per extracted page | Spend scales with volume |
| Latency & throughput | Per-page ms, req/s | Bounds request paths and backfills |
| Observability | Retries, errors, metering | See failures before users do |
Score schema adherence as strict, not best-effort: that is the difference between Structured Outputs and plain JSON mode. Async/batch and per-page cost map to the Batch API limits and current per-token rates; throughput maps to published req/s figures; validation is a documented best practice; user-defined schema support is core to the model.
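One way to keep the scoring honest is to write it down as a weighted scorecard. The sketch below is illustrative only: the 0 to 5 scale, the weights, and the function name `score_vendor` are assumptions, and you should set weights to reflect what actually hurts in your pipeline.

```python
# The six checklist criteria, as keys.
CRITERIA = (
    "schema_support",
    "validation",
    "async_batch",
    "cost",
    "latency_throughput",
    "observability",
)

# Hypothetical weights; adjust them to your own priorities.
WEIGHTS = {
    "schema_support": 3,
    "validation": 3,
    "async_batch": 2,
    "cost": 2,
    "latency_throughput": 1,
    "observability": 1,
}


def score_vendor(scores: dict) -> float:
    """Return the weighted average of per-criterion scores on a 0-5 scale."""
    missing = [c for c in CRITERIA if c not in scores]
    if missing:
        raise ValueError(f"unscored criteria: {missing}")
    for criterion in CRITERIA:
        if not 0 <= scores[criterion] <= 5:
            raise ValueError(f"{criterion} must be between 0 and 5")
    total_weight = sum(WEIGHTS.values())
    return sum(WEIGHTS[c] * scores[c] for c in CRITERIA) / total_weight
```

Refusing to score a vendor with a missing criterion is deliberate: a gap in the evaluation should be visible rather than silently counted as zero.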
Schema Support
The single most important criterion is whether the vendor extracts into your schema. Tools that auto-classify content into their own record shapes save you schema-writing but hand you a structure you do not control. A schema-guided model takes a user-defined JSON Schema, or a typed Pydantic or Zod model, and returns conforming JSON. Adherence should be strict: the model should fill every required field and never invent an out-of-range enum value, which is the difference between structured outputs and plain JSON mode.
Validation
Even with guaranteed schema adherence, validate on ingest. Validation is your contract boundary: it is where a malformed record gets caught before it pollutes a downstream store, and where you decide whether a null means "absent on the page" or "needs review." A good SDK makes this a one-liner by returning an already-typed object you can hand straight to Pydantic or Zod.
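The "absent versus needs review" decision is a policy you own, layered on top of schema validation. A minimal sketch of that policy in plain Python follows; the field groupings and the three outcome labels are assumptions chosen for illustration.

```python
from typing import Any

# Hypothetical policy: which nulls block a record and which only flag it.
REQUIRED = ("name", "price", "currency")
REVIEW_IF_NULL = ("in_stock",)


def route_record(record: dict[str, Any]) -> str:
    """Return 'reject', 'review', or 'accept' for an already-parsed record."""
    for field in REQUIRED:
        if record.get(field) is None:
            return "reject"
    price = record["price"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(price, bool) or not isinstance(price, (int, float)) or price < 0:
        return "reject"
    for field in REVIEW_IF_NULL:
        if record.get(field) is None:
            return "review"
    return "accept"
```

Rejected records belong in a dead-letter store and reviewed ones in a queue a human can see, so neither kind reaches the downstream table unmarked.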
Async and Batch Processing
Single-request extraction is fine for interactive use, but backfills and recurring jobs need a batch path. Score whether the vendor offers asynchronous processing with completion callbacks. The Batch API here accepts up to 50,000 requests per batch in an input file of up to 200 MB, completes within a 24-hour window, and can call an optional webhook on completion. For smaller jobs, the Group API takes up to 50 requests in a single payload and returns one callback when the group is done.
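Those limits translate directly into a chunking step when you prepare a backfill. The sketch below splits request lines into batches that respect both the request cap and the file-size cap. Two things are assumptions: the JSONL line shape follows the OpenAI Batch API input format, which this endpoint may or may not mirror exactly, and 200 MB is read conservatively as 200,000,000 bytes.

```python
import json
from typing import Iterable, Iterator

MAX_REQUESTS = 50_000
MAX_BYTES = 200 * 1000 * 1000  # conservative reading of "200 MB"


def batch_line(custom_id: str, cleaned_html: str, response_format: dict) -> str:
    """One JSONL request line (shape assumed from the OpenAI Batch API)."""
    return json.dumps({
        "custom_id": custom_id,
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "inference-net/schematron-v2-turbo",
            "messages": [{"role": "user", "content": cleaned_html}],
            "response_format": response_format,
            "temperature": 0,
        },
    })


def chunk_lines(
    lines: Iterable[str],
    max_requests: int = MAX_REQUESTS,
    max_bytes: int = MAX_BYTES,
) -> Iterator[list[str]]:
    """Yield lists of lines that each fit within both batch limits."""
    chunk: list[str] = []
    size = 0
    for line in lines:
        line_bytes = len(line.encode("utf-8")) + 1  # +1 for the newline
        if line_bytes > max_bytes:
            raise ValueError("a single request exceeds the file size limit")
        if chunk and (len(chunk) >= max_requests or size + line_bytes > max_bytes):
            yield chunk
            chunk, size = [], 0
        chunk.append(line)
        size += line_bytes
    if chunk:
        yield chunk
```

Each yielded chunk can be written out as one input file with `"\n".join(chunk)`. The `custom_id` is what lets you match results back to source pages when the batch completes.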
Cost
Compare cost per extracted page, not per token in isolation. For a typical cleaned page the input tokens dominate, so a vendor's input rate and how aggressively you can trim HTML matter more than the output rate. We work a concrete example in the cost note below.
Latency and Throughput
If you are extracting in a request path, latency per page is the number to watch. If you are running pipelines, throughput in requests per second per accelerator is what bounds your backfill time. Both should be measured on your own pages, since page size drives both.
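Measuring on your own pages can be as simple as timing a sample. The harness below is a sketch that takes any callable, so you can pass in your own extraction function; `measure` and its return keys are hypothetical names, and because it runs sequentially it reports single-stream throughput rather than what you would see under concurrency.

```python
import math
import statistics
import time
from typing import Callable, Sequence


def measure(extract: Callable[[str], object], pages: Sequence[str]) -> dict:
    """Time `extract` over `pages` and summarize latency and throughput."""
    if not pages:
        raise ValueError("need at least one page to measure")
    latencies = []
    started = time.perf_counter()
    for page in pages:
        t0 = time.perf_counter()
        extract(page)
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - started
    latencies.sort()
    # Nearest-rank 95th percentile.
    p95_index = max(0, math.ceil(0.95 * len(latencies)) - 1)
    return {
        "pages": len(latencies),
        "p50_s": statistics.median(latencies),
        "p95_s": latencies[p95_index],
        "pages_per_s": len(latencies) / elapsed if elapsed > 0 else float("inf"),
    }
```

Run it over a sample that includes your largest pages, since page size drives both numbers and the median alone hides the slow tail.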
Observability and Error Handling
The extraction step sits in the middle of a pipeline, so you need to see when it misbehaves. Score whether you can wire retries, route validation failures somewhere safe, and meter per-page cost. The next two sections turn this criterion into working code.
An Example Integration: Schema, Call, JSON, Validation
Here is the buy path end to end. Assume you have already fetched and cleaned the HTML for a product page; the only new step is extraction plus validation.
Python with Pydantic
Define the output shape as a Pydantic model with descriptions on the ambiguous fields, then call the model with temperature=0 for deterministic extraction. The model takes no natural-language instructions: there is no system prompt, and the user message carries only the cleaned HTML. The schema is therefore the only instruction it receives, so field names, types, and descriptions are how you steer it.
```python
import os

from pydantic import BaseModel, Field
from openai import OpenAI


# 1) Define your schema (nested data and lists are supported)
class Product(BaseModel):
    name: str
    price: float = Field(
        ...,
        description="Primary listed price in the page's currency.",
    )
    currency: str = Field(
        ...,
        description="ISO 4217 currency code, e.g. USD.",
    )
    in_stock: bool = Field(
        ...,
        description="True if the product is purchasable now.",
    )
    specs: dict[str, str] = Field(
        default_factory=dict,
        description="Specs of the product.",
    )


# 2) Client setup (OpenAI-compatible)
client = OpenAI(
    base_url="https://api.inference.net/v1",
    api_key=os.environ.get("INFERENCE_API_KEY"),
)

# 3) Extraction with typed validation (cleaned_html is your pre-cleaned page)
resp = client.beta.chat.completions.parse(
    model="inference-net/schematron-v2-small",
    messages=[
        {"role": "user", "content": cleaned_html},
    ],
    response_format=Product,
    temperature=0,
)

product = resp.choices[0].message.parsed  # already a validated Product
print(product.model_dump_json(indent=2))

# Representative output:
# {
#   "name": "MacBook Pro M3",
#   "price": 2499.99,
#   "currency": "USD",
#   "in_stock": true,
#   "specs": { "RAM": "16GB", "Storage": "512GB SSD" }
# }
```

The parse() helper validates the response against your Pydantic model on the way back, so a successful return is already a typed Product object rather than a dict you have to check by hand. If you are calling from a language without a typed helper, the raw JSON Schema form works the same way through response_format.
TypeScript with Zod
The TypeScript path is identical in shape. Define a Zod schema, wrap it with zodResponseFormat, and validate the response with .parse().
```typescript
import OpenAI from "openai";
import { z } from "zod";
import { zodResponseFormat } from "openai/helpers/zod";

const openai = new OpenAI({
  baseURL: "https://api.inference.net/v1",
  apiKey: process.env.INFERENCE_API_KEY,
});

// 1) Define your schema (nested data and lists are supported)
const Product = z.object({
  name: z.string(),
  price: z.number().describe("Primary listed price in the page's currency."),
  currency: z.string().describe("ISO 4217 currency code, e.g. USD."),
  inStock: z.boolean().describe("True if the product is purchasable now."),
  specs: z.record(z.string()).describe("Specs of the product."),
});

// 2) Chat Completions extraction (cleanedHtml is your pre-cleaned page)
const resp = await openai.chat.completions.parse({
  model: "inference-net/schematron-v2-turbo",
  messages: [{ role: "user", content: cleanedHtml }],
  response_format: zodResponseFormat(Product, "product"),
  temperature: 0,
});

// 3) Validate on ingest
const product = Product.parse(JSON.parse(resp.choices[0].message.content!));
console.log(product);
```

In both languages you wrote one schema, made one call, and got back a validated record. There were no selectors, no prompt engineering, and nothing site-specific to maintain.
Observability and Error Handling in Production
A working call is not a production integration. Because extraction sits between fetch and storage, the failure modes you need to instrument are specific:
- Transport errors and rate limits. Retry with backoff. The OpenAI-compatible client surfaces standard HTTP errors, so your existing retry middleware applies.
- Validation failures on ingest. Even with strict schema adherence, validate the parsed object and route failures to a dead-letter queue or review table instead of dropping them.
- Field-level confidence. Make genuinely optional fields Optional and treat a null as "needs review" rather than forcing the model to guess a value.
- Per-field auditing. Store the source URL and a hash of the cleaned HTML alongside each record, so you can re-extract deterministically when the schema changes.
- Cost and latency metering. Track input and output tokens per page so you can project spend and pick the right model per workload.
The wrapper below combines the call, validation, a bounded retry, and a dead-letter path into something you can drop into a worker:
```python
import hashlib
import os
import time

from openai import OpenAI
from pydantic import BaseModel, Field, ValidationError

client = OpenAI(
    base_url="https://api.inference.net/v1",
    api_key=os.environ.get("INFERENCE_API_KEY"),
)

DEAD_LETTER: list[dict] = []


class Product(BaseModel):
    name: str
    price: float = Field(..., description="Primary listed price in the page's currency.")
    currency: str = Field(..., description="ISO 4217 currency code, e.g. USD.")
    in_stock: bool = Field(..., description="True if the product is purchasable now.")


def extract_product(source_url: str, cleaned_html: str, max_retries: int = 3) -> Product | None:
    html_hash = hashlib.sha256(cleaned_html.encode("utf-8")).hexdigest()
    for attempt in range(1, max_retries + 1):
        try:
            resp = client.beta.chat.completions.parse(
                model="inference-net/schematron-v2-small",
                messages=[{"role": "user", "content": cleaned_html}],
                response_format=Product,
                temperature=0,
            )
            # parsed is already validated against Product on the way back
            product = resp.choices[0].message.parsed
            if product is None:  # validate on ingest, fail closed
                raise ValidationError.from_exception_data("Product", [])
            return product
        except ValidationError as exc:
            # malformed output: do not retry forever, route to review
            DEAD_LETTER.append({"url": source_url, "html_hash": html_hash, "error": str(exc)})
            return None
        except Exception:
            # transport / rate-limit: retry with backoff
            if attempt == max_retries:
                DEAD_LETTER.append({"url": source_url, "html_hash": html_hash, "error": "exhausted retries"})
                return None
            time.sleep(2 ** attempt)
    return None
```

The important property is that it fails closed. A page the model cannot extract cleanly ends up in the dead-letter list with its source URL, not silently absent from your dataset.
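The auditing and metering items from the list above can share one small record per page. The sketch below is one possible shape, with hypothetical names throughout (`AuditRecord`, `SCHEMA_VERSION`, `needs_reextract`). It assumes you read token counts from the `usage` object that OpenAI-compatible responses return (`resp.usage.prompt_tokens` and `resp.usage.completion_tokens`).

```python
import hashlib
from dataclasses import dataclass

SCHEMA_VERSION = "product-v1"  # hypothetical label; bump it when the schema changes


@dataclass(frozen=True)
class AuditRecord:
    source_url: str
    html_hash: str
    schema_version: str
    prompt_tokens: int
    completion_tokens: int


def html_digest(cleaned_html: str) -> str:
    """SHA-256 of the cleaned HTML, used to detect page changes."""
    return hashlib.sha256(cleaned_html.encode("utf-8")).hexdigest()


def make_audit(
    source_url: str,
    cleaned_html: str,
    prompt_tokens: int,
    completion_tokens: int,
    schema_version: str = SCHEMA_VERSION,
) -> AuditRecord:
    return AuditRecord(
        source_url=source_url,
        html_hash=html_digest(cleaned_html),
        schema_version=schema_version,
        prompt_tokens=prompt_tokens,
        completion_tokens=completion_tokens,
    )


def needs_reextract(
    audit: AuditRecord,
    cleaned_html: str,
    schema_version: str = SCHEMA_VERSION,
) -> bool:
    """True when the page content or the schema differs from the stored run."""
    return audit.schema_version != schema_version or audit.html_hash != html_digest(cleaned_html)
```

Storing the audit row next to the extracted record lets a recurring job skip pages that have not changed, which cuts spend without touching the extraction code.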
Schematron V2 Small vs. Turbo: A Cost Note
The cost and latency criteria from the checklist come down to which model you run. There are two extraction models, and they trade quality against throughput and price.
| Model | Quality (judge) | Throughput | Input $/M | Output $/M |
|---|---|---|---|---|
| V2 Small | 4.060 | 2.47 req/s | $0.05 | $0.25 |
| V2 Turbo | 4.039 | 4.14 req/s | $0.03 | $0.15 |
Quality is an LLM-as-judge score (1–5) and throughput is measured on a single H100. Pricing is per million tokens from the model pages. Use Small for complex schemas or long pages; use Turbo for high-throughput, cost-sensitive pipelines.
Schematron V2 Small posts the higher quality score and is the choice for complex schemas or very long pages, while V2 Turbo runs faster and cheaper and is the default for high-throughput, cost-sensitive pipelines. Both quality scores come from an LLM-as-judge benchmark in the Schematron V2 launch post. In that benchmark both models outperformed DeepSeek V3.2 and GPT-5.4 Nano, and Turbo ran about 2.5 times faster than the previous-generation 8B model.
To make the cost concrete, take a cleaned page of roughly 6,000 input tokens and 300 output tokens. Input tokens dominate, so Schematron V2 Turbo at $0.03 per million input and $0.15 per million output tokens lands well under a tenth of a cent per page. V2 Small at $0.05 per million input and $0.25 per million output buys the top quality score when a complex schema or a long page needs it. Both sit far below the per-page cost of a frontier general-purpose model doing the same extraction, and neither carries the maintenance tail of DIY selectors.
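The arithmetic behind that estimate is short enough to write out. The sketch below uses the per-million-token prices from the table and the 6,000-input, 300-output page from the paragraph above; the function name and dictionary keys are illustrative.

```python
# (input $/M tokens, output $/M tokens), taken from the pricing table.
PRICES = {
    "v2-small": (0.05, 0.25),
    "v2-turbo": (0.03, 0.15),
}


def cost_per_page(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one extraction at the listed per-million-token rates."""
    input_rate, output_rate = PRICES[model]
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


if __name__ == "__main__":
    for model in PRICES:
        per_page = cost_per_page(model, input_tokens=6_000, output_tokens=300)
        print(f"{model}: ${per_page:.6f} per page, ${per_page * 1_000_000:,.0f} per million pages")
```

For this page size the result is about $0.000225 per page on Turbo and $0.000375 on Small, or roughly $225 and $375 per million pages. Input tokens account for 80% of the cost on both models, which is why trimming HTML before extraction matters more than the output rate.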
A note for teams already on older identifiers: calls to inference-net/schematron-3b are served by V2 Turbo and calls to inference-net/schematron-8b are served by V2 Small, with V2 pricing on the rerouted requests. The models are built for long-context HTML, so most pages fit in a single request after cleaning. Updating your model identifiers explicitly keeps your code, dashboards, and receipts consistent.
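If legacy identifiers are scattered through configuration, a small lookup can make the rerouting explicit in your own code. This sketch uses the mapping described above and the V2 identifiers from the code samples in this guide; `resolve_model_id` is a hypothetical helper.

```python
# Legacy identifier -> the V2 model that serves it, per the note above.
LEGACY_MODEL_IDS = {
    "inference-net/schematron-3b": "inference-net/schematron-v2-turbo",
    "inference-net/schematron-8b": "inference-net/schematron-v2-small",
}


def resolve_model_id(model_id: str) -> str:
    """Return the explicit V2 identifier for a legacy id, or the id unchanged."""
    return LEGACY_MODEL_IDS.get(model_id, model_id)
```

Logging the resolved identifier rather than the configured one keeps dashboards aligned with what is billed.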
If throughput and cost are what will decide this for you, Schematron V2 Turbo is the model to start with.
A Migration Plan from In-House Parsers to a Schema-Guided API
You do not need a big-bang cutover to move off in-house parsers. A staged plan lets you prove field-level accuracy before you trust the new layer:
- Inventory the per-site parsers and the record types they produce, then pick the highest-maintenance site to migrate first.
- Define the schema once as a Pydantic or Zod model for that record type, taking the union of fields the old parsers produced.
- Shadow-run the extraction model next to the existing parser on the same fetched HTML, diff the outputs, and measure field agreement.
- Promote the model to primary for that site once agreement clears your bar, keep the old parser as a fallback for one release, then delete it.
- Collapse the remaining per-site parsers onto the same schema, since the model, not a selector, now handles each site's layout.
- Move volume to batch for backfills and recurring jobs, using the batch path's 50,000-request limit and completion webhook.
- Update model identifiers explicitly if you were on legacy ids, so dashboards and billing reflect the model you intend to call.
The shadow-run step is the one that earns trust. Running both extractors over the same HTML and comparing field agreement gives you a measurable gate, rather than a leap of faith, before anything reaches production.
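The shadow-run gate can be expressed as a per-field agreement rate over paired outputs. The sketch below is one way to compute it; the normalization rules and the 0.98 threshold are assumptions you should tune, and the function names are illustrative.

```python
from collections import Counter
from typing import Any


def normalize(value: Any) -> Any:
    """Smooth over formatting differences that are not real disagreements."""
    if isinstance(value, str):
        return " ".join(value.split()).casefold()
    if isinstance(value, float):
        return round(value, 2)
    return value


def field_agreement(pairs: list[tuple[dict, dict]], fields: list[str]) -> dict[str, float]:
    """Share of pages where the old parser and the model agree, per field."""
    if not pairs:
        return {}
    matches: Counter = Counter()
    for old, new in pairs:
        for field in fields:
            if normalize(old.get(field)) == normalize(new.get(field)):
                matches[field] += 1
    return {field: matches[field] / len(pairs) for field in fields}


def ready_to_promote(agreement: dict[str, float], threshold: float = 0.98) -> bool:
    """Gate promotion on every field clearing the bar (threshold is an assumption)."""
    return bool(agreement) and all(rate >= threshold for rate in agreement.values())
```

Disagreements are worth reading before you act on the rate. Some will be cases where the old parser was the one returning wrong data, which is exactly what the migration is meant to surface.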
When a Hosted Extraction API Is Not the Right Layer
Buying is not always the answer, and a schema-guided model is the wrong tool for several real jobs. Be honest about these before you adopt:
- You need a crawler or proxy network. Fetching, JavaScript rendering, rate limiting, and anti-bot handling are the crawl layer. If your problem is getting the page at all, extraction is not what you are missing.
- You want DOM-tree serialization. If you literally want the HTML structure mirrored into nested JSON, a converter library is the right tool, not a schema-guided model.
- Your workflow needs free-form instructions. The model takes no prompt; all guidance flows through the schema. Tasks that need natural-language instructions per request are a poor fit.
- You are document- or OCR-first. For PDFs and scanned images, OCR or document-intelligence tools come first. Extraction fits after you have HTML or text.
- You have one stable site and a five-line parser. If a single source never changes its layout, the maintenance argument for buying is weak. Keep the parser.
The point of the checklist is to make this call deliberately. Buy the layer that is costing you maintenance, and keep the parts that are already cheap.
Conclusion
Build versus buy for HTML-to-JSON extraction comes down to one question: who maintains the parsing logic as your sites multiply and the web underneath you keeps shifting. Building means owning selectors, headless browsers, and a silent-failure tax that scales with every site you add. Buying means defining one schema per record type and letting a model handle each site's layout.
Score candidates on the six checklist criteria (schema support, validation, async and batch, cost, latency, and observability), and remember that the decision is per layer. Keep the crawler you already have, buy only the extraction step, and migrate in stages with a shadow-run gate so you never cut over on faith. When extraction quality is what is eating your time, a schema-guided API replaces brittle parsing code with a typed contract you can validate on every page.
Related Reading
- HTML-to-JSON Extraction API: Extract Typed Data from Messy Web Pages — the deeper dive on how schema-guided extraction turns messy HTML into typed JSON.
- Crawl4AI Complete Guide — the crawl and fetch layer that feeds the extractor.
- LLM Cost Optimization — broader framing for choosing models per workload by cost.