    Jun 18, 2026

    ScrapeGraphAI Alternatives: Schema-First Extraction

    Inference Research

    ScrapeGraphAI Made LLM Scraping Easy. Then Production Arrived.

    ScrapeGraphAI made a real promise feel achievable. Write a sentence describing the data you want, point it at a page, and get JSON back. For a prototype, that is genuinely delightful. The trouble shows up later, once the same extraction has to run unattended across thousands of pages, feed a database that expects a fixed shape, and not wake anyone at 3 a.m.

    At that point the question changes. It stops being "can this tool extract the data?" and becomes "will it extract the same shape every time, and can I check the result before it reaches my store?" That is when people start searching for ScrapeGraphAI alternatives. Not because the tool is bad, but because production rewards a different property than prototyping does.

    This article compares ScrapeGraphAI's prompt-graph approach with schema-first extraction, using Schematron as the concrete example. The comparison is deliberately objective and the lens is operational: repeatability, validation, cost, and latency. We will look at what each tool optimizes for, walk through a working schema-first example end to end, and lay out where each one is the right call.

    Read time: 9 minutes


    What ScrapeGraphAI Does

    ScrapeGraphAI is an AI web scraping platform that turns natural-language prompts into structured data. It ships as both an MIT-licensed open-source Python library and a managed cloud API, so you can run it yourself or call it as a service. You describe what you want in plain language and it returns structured results, with no CSS selectors to write or maintain.

    The open-source library uses LLMs and graph logic to build scraping pipelines: you supply a prompt describing the information you need, and that prompt flows through LLM-powered graph nodes that fetch and interpret the page. It exposes pipeline classes for different shapes of work, including SmartScraperGraph for a single page from a prompt and URL, SearchGraph for extracting across top search results, and SmartScraperMultiGraph for several sources under one prompt. The library is actively maintained, with roughly 27,200 GitHub stars and a 99.9% Python codebase. It plugs into OpenAI, Groq, Azure, Gemini, MiniMax, and local models via Ollama, and you bring your own model API key.

    The cloud API wraps the same idea in a hosted service with six endpoints: Scrape, Extract, Search, Crawl, Monitor, and Schema. The Extract endpoint takes a required user_prompt, a textual description of what to pull, plus an optional output_schema defined in Pydantic or Zod. It accepts a URL, raw HTML, or markdown, which means the managed service can fetch the page for you.

    The defining trait across both forms is the same: extraction is driven by a natural-language prompt. A schema, when you use one, sits on top of that prompt rather than replacing it. ScrapeGraphAI is strong precisely where that flexibility helps: fast prototyping, fetching and extracting in a single call, and tolerating layout changes because the model reads the page semantically instead of by selector.

    What Schema-First Extraction Does Differently

    Schema-first extraction inverts the interface. Instead of describing the data in prose, you declare the exact output you want as a typed schema, a JSON Schema or its Pydantic or Zod equivalent, and the model's only job is to fill that schema from the page. The contract is the schema itself, and your type system enforces it.

    Schematron is the example we will use throughout. It is a family of long-context models trained to extract clean, typed JSON from messy HTML, purpose-built for web scraping and turning arbitrary pages into structured data. The interface is the part that matters here: the model takes no instruction prompt, in either the system or the user role. Its only inputs are the HTML and the schema, so there is no instruction text to tune, drift, or A/B test. The schema is the whole specification.

    The family has two members. Schematron V2 Small delivers the highest extraction quality for complex schemas and very long pages, while V2 Turbo is optimized for throughput and cost at scale. Both run in strict JSON mode with 100% schema adherence, so the output always conforms to the schema you passed. You can read the full interface in the schema-guided extraction docs.

    One boundary is worth stating immediately, because it shapes everything below: Schematron is an extraction layer only. It does not fetch or crawl. You bring the HTML, from any crawler or HTTP client, and it returns typed JSON.

    Prompt Graph vs JSON Schema: The Interface Is the Decision

    Strip away branding and the choice between these tools is a choice of interface. A prompt graph encodes your intent in natural language, optionally tightened with a schema. A schema-first model encodes intent purely in types. That single difference cascades into every operational property you care about.

    A prompt is a soft contract. It works, but its meaning can shift as pages vary, as edge cases appear, or as the underlying model is updated beneath you. A schema is a hard contract: a validator either accepts the output or rejects it, with no interpretation in between. ScrapeGraphAI lets you add an output_schema to its Extract call, which genuinely narrows this gap. The remaining difference is structural. With a prompt graph the schema is a wrapper around a prompt; with a schema-first model the schema is the entire interface, and there is no prompt left to drift.

    The table below maps the two approaches across the dimensions that decide reliability.

    Dimension            | ScrapeGraphAI (prompt graph) | Schema-first (Schematron)
    Primary interface    | Natural-language prompt      | JSON Schema
    Schema role          | Optional, on top of prompt   | The entire interface
    Output contract      | Soft (prose)                 | Hard (validator)
    Output shape         | Can vary                     | 100% schema-conforming
    Fetch/crawl included | Yes (cloud Scrape/Crawl)     | No (extract only)
    Best fit             | Prototyping, exploration     | Repeatable pipelines

    Schematron uses no prompt at all; the schema is the whole specification. ScrapeGraphAI's cloud Extract takes a required prompt plus an optional schema and can fetch the page itself.

    Neither column is wrong. They optimize for different things. A prompt graph optimizes for getting something useful out of an unfamiliar page quickly. A schema-first model optimizes for getting the same well-typed thing out of every page, forever.

    Local Library vs Managed API

    The other axis the comparison turns on is where the work runs. There are three practical shapes here, and they have different operational costs.

    • Self-host the ScrapeGraphAI library. You run the process, choose the LLM provider, and pay that provider directly. You get full control and you own the operations, including latency and cost, both of which depend on the model you wire in.
    • Call the ScrapeGraphAI cloud API, which is managed and credit-priced and fetches pages for you.
    • Call Schematron, which is a managed, OpenAI-compatible API priced per token.

    The first-generation Schematron models were retired, and requests to them now route to the V2 successors. If you would rather run extraction yourself, the original Schematron weights remain available on Hugging Face for self-hosting.

    The practical takeaway: "managed versus self-hosted" is not a single decision. You are choosing between hosting a library against your own model bill, calling a credit-priced API that bundles fetching, or calling a token-priced extraction API that you point at HTML you already have.

    A Schema-First Extraction Example, End to End

    The fastest way to feel the difference is to run one extraction. The flow is fetch, clean, define the schema, call the model, and validate. Fetching is out of scope here, since extraction is a separate layer, so we start from HTML you already have. If you want the full pipeline including retrieval, the companion walkthrough on how to build the full pipeline in Python covers fetching and cleaning in detail.

    The Messy HTML

    Here is a small, realistically messy product fragment. Note the inline tags, the price buried in bold, and the specs scattered across list items.

    <div id="item">
      <h2 class="title">MacBook Pro M3</h2>
      <p>Price: <b>$2,499.99</b> USD</p>
      <ul info>
        <li>RAM: 16GB</li>
        <li>Storage: 512GB SSD</li>
      </ul>
      <span class="tag">laptop</span>
      <span class="tag">professional</span>
      <span class="tag">macbook</span>
      <span class="tag">apple</span>
    </div>

    Before sending HTML to Schematron, pre-clean it with lxml to strip scripts, styles, and boilerplate. This is recommended because it matches how the training data was preprocessed, though Readability, Trafilatura, or BeautifulSoup are all acceptable.
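
    The pre-cleaning step can be sketched as follows. This is a minimal example, assuming lxml is installed; the function name clean_html and the choice of which nodes to drop (scripts, styles, noscript blocks, comments) are illustrative assumptions, not the exact preprocessing used for Schematron's training data.

```python
import lxml.html


def clean_html(raw_html: str) -> str:
    """Drop scripts, styles, noscript blocks, and comments; keep visible markup."""
    root = lxml.html.fromstring(raw_html)
    # xpath() returns a list, so the tree is not mutated while it is being queried.
    for node in root.xpath("//script | //style | //noscript | //comment()"):
        # drop_tree() removes the node and its children but keeps any tail text.
        node.drop_tree()
    return lxml.html.tostring(root, encoding="unicode")


if __name__ == "__main__":
    messy = (
        "<div id='item'><script>track()</script>"
        "<style>.title{color:red}</style>"
        "<h2 class='title'>MacBook Pro M3</h2><!-- promo -->"
        "<p>Price: <b>$2,499.99</b> USD</p></div>"
    )
    print(clean_html(messy))
```

    Stripping non-content nodes also shortens the input, which matters for a token-priced API.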

    Define the Schema

    The schema is the entire specification, so this is where your effort goes. A Pydantic model with field descriptions tells the model exactly what each field means.

    from pydantic import BaseModel, Field
    
    # 1) Define your schema (nested data and lists are supported)
    class Product(BaseModel):
        name: str
        price: float = Field(
            ...,
            description="Primary price of the product.",
        )
        specs: dict[str, str] = Field(
            default_factory=dict,
            description="Specs of the product.",
        )
        tags: list[str] = Field(
            default_factory=list,
            description="Tags assigned to the product.",
        )

    The Zod equivalent is a one-to-one translation for TypeScript users. Field descriptions do real work here: because there is no prompt, a clear description is how you disambiguate fields that need interpretation.
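
    To see what "the schema is the specification" means in practice, it can help to print the JSON Schema that Pydantic generates from the model. The sketch below assumes Pydantic v2 and redefines Product inline so that it runs on its own. How Schematron consumes the schema on the server side is not described here; this only shows where the field descriptions end up.

```python
import json

from pydantic import BaseModel, Field


class Product(BaseModel):
    name: str
    price: float = Field(..., description="Primary price of the product.")
    specs: dict[str, str] = Field(
        default_factory=dict, description="Specs of the product."
    )
    tags: list[str] = Field(
        default_factory=list, description="Tags assigned to the product."
    )


# model_json_schema() returns a plain dict in JSON Schema form.
schema = Product.model_json_schema()
print(json.dumps(schema, indent=2))
```

    Fields without a default (name and price) are listed under "required", and each description travels with its field, which is why a precise description is the main tuning lever in a prompt-free interface.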

    Call the Schematron API

    The call uses the OpenAI SDK pointed at Inference's base URL. There is no prompt, only the HTML as content and the schema as the response format. Keep temperature at zero, which is what the model is trained for.

    import os
    
    from openai import OpenAI
    
    from product_schema import Product
    
    # Messy HTML (could be the full page; trim to the relevant region when possible).
    # In a real pipeline this comes from your crawler or HTTP client.
    with open("messy-product.html", encoding="utf-8") as f:
        html = f.read()
    
    # Client setup (OpenAI-compatible)
    client = OpenAI(
        base_url="https://api.inference.net/v1",
        api_key=os.environ.get("INFERENCE_API_KEY"),
    )
    
    # Schematron uses no prompt: only the HTML content and the response schema.
    resp = client.beta.chat.completions.parse(
        model="inference-net/schematron-v2-small",
        messages=[
            {"role": "user", "content": html},
        ],
        response_format=Product,
        temperature=0,  # the model is trained for temperature zero
    )
    
    print(resp.choices[0].message.parsed.model_dump_json(indent=2))

    We use inference-net/schematron-v2-small here for top quality on a structured schema. Swapping to inference-net/schematron-v2-turbo is a one-line change when you want more throughput.

    The JSON Output

    For the fragment above, the model returns JSON that conforms to the schema:

    {
      "name": "MacBook Pro M3",
      "price": 2499.99,
      "specs": {
        "RAM": "16GB",
        "Storage": "512GB SSD"
      },
      "tags": [
        "laptop",
        "professional",
        "macbook",
        "apple"
      ]
    }

    The price is parsed to a number, the specs are keyed, and the tags are a clean array, with no DOM noise carried through.

    Validate Before You Store

    Strict JSON mode means the output should always conform. Even so, validate on ingest. A validation gate costs nothing and turns any future surprise into a caught exception instead of a corrupt row.

    from pydantic import ValidationError
    
    from product_schema import Product
    
    # `raw` is the JSON string returned by Schematron. Strict JSON mode means it
    # should always conform, but validating on ingest is cheap insurance.
    raw = resp.choices[0].message.content  # from the extraction call
    
    try:
        product = Product.model_validate_json(raw)
        # store `product` with confidence: it matches the schema by construction
        save(product)
    except ValidationError as exc:
        # route to a dead-letter queue / review instead of writing a bad row
        handle_invalid(raw, exc)

    Because the same Pydantic model defines both the request schema and the validator, this step is nearly free to write. That symmetry is the quiet advantage of schema-first extraction.
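
    The validation snippet calls two helpers, save and handle_invalid, that are not defined in this article. One way they might look is sketched below with JSON Lines files. The file names and the record layout are hypothetical; a real pipeline would more likely write to a database and a queue.

```python
import json
import time
from pathlib import Path

# Hypothetical locations, chosen only for this sketch.
STORE_PATH = Path("products.jsonl")
DEAD_LETTER_PATH = Path("dead_letter.jsonl")


def save(product, path: Path = STORE_PATH) -> None:
    """Append one validated Pydantic model as a JSON line."""
    with path.open("a", encoding="utf-8") as f:
        f.write(product.model_dump_json() + "\n")


def handle_invalid(raw: str, exc: Exception, path: Path = DEAD_LETTER_PATH) -> None:
    """Append the rejected payload and the reason, for later review."""
    entry = {"received_at": time.time(), "error": str(exc), "raw": raw}
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
```

    Keeping the raw payload next to the error message makes it possible to replay rejected records after a schema fix.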

    Evaluating on What Matters in Production

    Prototypes reward flexibility; production rewards predictability. Here is how the two approaches compare on the four properties that actually determine whether a pipeline is calm or constantly on fire.

    Repeatability

    Repeatability is about the shape of the output, not just its content. When extraction is prompt-driven, that shape can drift as pages change or as the model underneath you is updated. A strict-schema model returns 100% schema-conforming JSON by construction, so the shape is fixed no matter what the page throws at it. Schematron V2 holds this even while staying fast, producing 100% schema-compliant output across its evaluation set.

    Validation

    With schema-first extraction, validation is trivial because the same type defines both ends of the call. With a prompt-first tool you get the same check by attaching an output_schema, and you should do so whenever the output feeds a pipeline. The difference is whether validation is the default or something you remember to add.

    Cost

    The two cost models are structurally different, and the difference matters for long pages. ScrapeGraphAI's cloud is credit-priced: an Extract call costs 5 credits and a markdown scrape costs 1, on plans from a free tier up to higher volumes. That decouples cost from page size. Schematron is token-priced, so cost tracks input length, which is the variable that dominates when you extract from long HTML.

    Model               | Input ($/M) | Output ($/M) | Best for
    Schematron V2 Small | $0.05       | $0.25        | Complex schemas, long pages
    Schematron V2 Turbo | $0.03       | $0.15        | Throughput and cost at scale

    Schematron is token-priced, so cost scales with HTML length, unlike a per-request credit model.

    Figure 1: Schematron V2 Token Pricing - USD per million input and output tokens, Small vs Turbo

    V2 Turbo runs at roughly 40% lower input and output cost than V2 Small, which is the lever you pull when volume rather than schema complexity is your constraint.
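
    A short calculation makes the token-priced model concrete. The prices below are the per-million-token figures quoted in this article; the workload (100,000 pages at 10,000 input and 500 output tokens each) is an assumption chosen for illustration, and the model keys are shorthand labels rather than API model IDs.

```python
# USD per million tokens, from the pricing figures in this article.
PRICES = {
    "v2-small": {"input": 0.05, "output": 0.25},
    "v2-turbo": {"input": 0.03, "output": 0.15},
}


def extraction_cost(model: str, pages: int, input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost for `pages` requests of the given token sizes."""
    price = PRICES[model]
    per_page = (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000
    return pages * per_page


# Assumed workload: 100,000 pages, 10,000 input and 500 output tokens each.
for model in PRICES:
    print(model, round(extraction_cost(model, 100_000, 10_000, 500), 2))
```

    Under these assumptions the run costs about $62.50 on V2 Small and $37.50 on V2 Turbo. Input tokens account for most of the bill, which is why trimming the HTML before extraction pays off.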

    Latency and Throughput

    Throughput is where Turbo earns its name. On a single H100 with a 10,000-token input and 500-token output, V2 Turbo reaches 4.14 requests per second, a 2.5x improvement over the original 8B model, while V2 Small sustains 2.47 requests per second. Crucially, that speed does not come at the cost of quality: on an LLM-as-a-judge benchmark, V2 Small scores 4.060 and V2 Turbo 4.039 on a 1-to-5 scale, both close to the original 8B at 4.070.

    Figure 2: Schematron V2 Throughput - Requests per second on a single H100 (10K in / 500 out), versus the original 8B
    Figure 3: Schematron V2 Extraction Quality - LLM-as-a-judge score, 1 to 5 scale, V2 versus legacy models

    If cost and latency are your binding constraints, Turbo is the model to reach for first. It is the same interface as Small, just tuned for volume.
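
    The throughput figures translate into wall-clock time with simple arithmetic. The sketch below uses the requests-per-second numbers quoted above for a single H100 and assumes throughput scales linearly with the number of GPUs. That assumption is a simplification, and the numbers describe the benchmark setup rather than what a managed API will deliver.

```python
# Requests per second on one H100 at 10,000 input / 500 output tokens,
# as quoted in this article.
RPS = {"v2-small": 2.47, "v2-turbo": 4.14}


def hours_to_process(pages: int, requests_per_second: float, gpus: int = 1) -> float:
    """Wall-clock hours for `pages` requests, assuming linear scaling across GPUs."""
    return pages / (requests_per_second * gpus) / 3600


for model, rps in RPS.items():
    print(f"{model}: {hours_to_process(1_000_000, rps):.1f} h for 1M pages on one H100")
```

    For one million pages of that size, this works out to roughly 112 hours on V2 Small and 67 hours on V2 Turbo per GPU.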

    Crawl and Fetch Are a Separate Layer

    A lot of confusion in this space comes from collapsing two jobs into one. A web-data pipeline really has distinct layers: fetch or crawl, clean, extract, validate, store. Keeping them separate makes each one easier to reason about.

    Figure 4: The Web-Data Pipeline by Layer

    Fetch/crawl tools (Firecrawl, Crawl4AI, ScrapeGraphAI Scrape/Crawl, or a plain HTTP client) handle retrieval. Cleaning typically uses lxml. A schema-first model such as Schematron handles only Extract, returning JSON that the Validate step checks before Store.

    ScrapeGraphAI's Scrape and Crawl endpoints, Firecrawl, and Crawl4AI all live in the fetch and crawl layer. Schematron lives only in the extract layer. If your real bottleneck is retrieving pages (JavaScript rendering, proxies, pagination, or a site-wide crawl), then a crawler is the right tool, and you hand its HTML to a schema-first extractor afterward. For a fuller treatment of the retrieval side, see the companion articles on what to do when the bottleneck is the fetch layer and on breaking the stack down into crawling and extraction layers. The clean division of labor is a crawler for retrieval and a schema-first model for the typed JSON.
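
    The layering described above can be sketched as a small composition of callables. Everything here is a stand-in: the Pipeline class, the check function, and the stubbed fetch and extract steps are hypothetical names used to show the shape, with no network or model call involved.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass
class Pipeline:
    """Each layer is a plain callable, so any one can be swapped independently."""

    fetch: Callable[[str], str]       # URL -> raw HTML (crawler or HTTP client)
    clean: Callable[[str], str]       # raw HTML -> cleaned HTML
    extract: Callable[[str], str]     # cleaned HTML -> JSON string (schema-first model)
    validate: Callable[[str], dict]   # JSON string -> checked record, raises on mismatch
    store: Callable[[dict], None]     # checked record -> persistence

    def run(self, url: str) -> dict:
        record = self.validate(self.extract(self.clean(self.fetch(url))))
        self.store(record)
        return record


def check(raw: str) -> dict:
    """Toy validator standing in for a real schema check."""
    record = json.loads(raw)
    if not isinstance(record.get("name"), str):
        raise ValueError("name must be a string")
    return record


stored: list[dict] = []
pipeline = Pipeline(
    fetch=lambda url: "<div>  <h2>MacBook Pro M3</h2>  </div>",   # stub, no network call
    clean=lambda html: " ".join(html.split()),                    # collapse whitespace
    extract=lambda html: json.dumps({"name": "MacBook Pro M3"}),  # stub for the model call
    validate=check,
    store=stored.append,
)
print(pipeline.run("https://example.com/product"))
```

    Because each layer is only a function signature, replacing the fetch stub with a crawler or the extract stub with a Schematron call leaves the other layers untouched.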

    Use-Case Recommendations

    Match the tool to the stage of work and the contract you need:

    • Prototyping, one-off pulls, exploration: ScrapeGraphAI's prompt interface is fast and forgiving, and getting something useful out of an unfamiliar page is the whole goal.
    • Recurring pipelines that need a fixed JSON contract across many pages: schema-first extraction, where 100% schema adherence is the point.
    • Long HTML such as filings or long product pages: a long-context schema-first model, while watching token cost.
    • High-volume or batch work: a token-priced API with async processing. Schematron's batch extraction at scale handles up to 50,000 jobs per submission, and dedicated capacity is worth a conversation past a certain volume.
    • Fetch and extract in one managed call with minimal setup: ScrapeGraphAI's cloud, which bundles retrieval.

    When Schematron Is Not the Right Layer

    Being honest about boundaries is part of choosing well. Schematron is the wrong layer when:

    • You need to fetch or crawl. It does no JavaScript rendering, proxying, CAPTCHA handling, or site crawling. That is the crawler's job.
    • You need free-form reasoning or summarization beyond schema-shaped extraction. A general LLM with a prompt fits better, though you can sometimes encode the requirement explicitly in field descriptions since there is no prompt channel.
    • Your pages exceed the context window. Up to 128K tokens fit per request; larger pages need chunking or truncation.
    • You are writing a throwaway script where iterating on a prompt is faster than designing a schema. Reach for ScrapeGraphAI or a general LLM.
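
    For the context-window case in the list above, one possible chunking approach is to split the cleaned page into blocks (for example, one per top-level element) and pack them greedily under a token budget. The sketch below assumes roughly four characters per token, which is a crude heuristic rather than the model's real tokenizer. The function names are hypothetical, and merging the per-chunk results back into one record is left out.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; the real tokenizer will give different counts."""
    return int(len(text) / chars_per_token) + 1


def pack_blocks(blocks: list[str], max_tokens: int) -> list[str]:
    """Greedily pack HTML blocks into chunks whose estimate stays under max_tokens."""
    chunks: list[str] = []
    current: list[str] = []
    used = 0
    for block in blocks:
        cost = estimate_tokens(block)
        if current and used + cost > max_tokens:
            chunks.append("\n".join(current))
            current, used = [], 0
        # A single block larger than the budget still becomes its own chunk
        # and would need truncation before it is sent.
        current.append(block)
        used += cost
    if current:
        chunks.append("\n".join(current))
    return chunks


if __name__ == "__main__":
    blocks = [f"<li>Item {i}: " + "x" * 40 + "</li>" for i in range(10)]
    for chunk in pack_blocks(blocks, max_tokens=40):
        print(estimate_tokens(chunk), "estimated tokens")
```

    In practice the budget should sit well below the 128K limit so that the schema and the generated output still fit.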

    If you are still deciding which model to trust on your own pages, the writeup on published quality metrics and a test harness shows how to measure extraction quality directly rather than taking anyone's word for it.

    Conclusion

    Prompt graphs and schema-first extraction are not really competing to be the same thing. A prompt graph is built for flexibility, which is exactly what you want while exploring a page or shipping a prototype. A schema-first model is built for a repeatable, checkable contract, which is what an unattended production pipeline needs. So the right choice comes down to one question: what does your pipeline actually have to guarantee?

    If your next step is to send HTML and a schema and get typed JSON back, the quickstart below is the shortest path to a working extraction.

    Extract typed JSON from messy HTML

    Send HTML and a Pydantic, Zod, or JSON Schema definition to Schematron and get structured JSON back without writing selectors or prompts.
