    Jun 22, 2026

    Financial Data Extraction Software for Web Pages & Filings

    Inference Research

    Financial Data Extraction Software Starts With the Right Layer

    Open a 10-K filing page, an investor-relations page, or an earnings release in your browser and the numbers you want are right there: revenue by segment, guidance ranges, share counts, the fiscal year end. Open the same page's HTML and those numbers are buried in nested tables, footnote markup, and boilerplate. The data exists. Getting it out as something a database can trust is the hard part.

    Financial data extraction software is the layer that turns financial web surfaces (HTML filings, IR pages, and long tabular pages) into typed, validated JSON. That is a different job from the PDF and OCR document-capture tools that dominate this category, which read pixels off scanned statements. If you are reading this, you probably already have the page HTML. What you need is reliable structured records out of it.

    The fetch step is rarely the bottleneck. The bottleneck is producing the same fields, in the same shape, every quarter and across every company, with enough provenance that an auditor can trace a number back to its source. A schema-guided extraction model does exactly that: you hand it your schema and the page HTML, and it returns JSON that conforms. Schematron is that model. This guide covers the schemas financial pages need, how to keep the crawl separate from the extraction, an end-to-end example, evidence preservation and validation, a cost model for monitoring filings at scale, and the cases where this is the wrong layer.

    Read time: 10 minutes


    What Financial Web Extraction Actually Needs

    A general-purpose LLM prompt can pull a few fields out of a clean snippet. Production financial extraction is a higher bar, and the gap between the two is what separates a credible tool from a one-off script.

    Filing pages are long and noisy, so the extractor has to handle large inputs. Schematron is built for long-context HTML and is robust to messy markup. The output structure has to be strict and repeatable: the same fields, the same types, every time, so downstream joins and time-series comparisons do not break. Schematron's strict JSON mode gives 100% schema adherence.

    Financial pages are also full of ambiguity. Is a figure GAAP or adjusted? As-reported or restated? Which of three periods in a table is the current quarter? Schematron has no system or user prompt; it reads only the schema. So every piece of guidance, including how to resolve ambiguous fields, lives in the schema and its field descriptions. That constraint is a feature here: the rules are written down in the schema, which is the model's only interface, rather than hidden in a prompt that drifts. Any field that needs real disambiguation gets an explicit description.

    Normalization is part of the structure problem. The same metric appears as "Total revenue," "Net revenues," and "Revenue, net" across filers, and values show up in thousands, millions, or with currency symbols inline. You cannot prompt your way to consistency across thousands of pages, but you can encode it: name the canonical field, describe the units you expect, and let the schema enforce the shape on every page. Because the interface is the schema and not a prompt, the same definition produces the same field whether you run it today or next quarter.
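    Once the schema has captured a value and its scale as separate fields, bringing rows onto one scale is ordinary application code. The following is a minimal sketch, assuming the unit strings are "units", "thousands", "millions", or "billions"; the mapping and the function name are illustrative and not part of Schematron.

```python
from decimal import Decimal

# Hypothetical mapping from the schema's unit strings to multipliers.
UNIT_SCALE = {
    "units": Decimal(1),
    "thousands": Decimal(1_000),
    "millions": Decimal(1_000_000),
    "billions": Decimal(1_000_000_000),
}


def to_base_units(value: float, unit: str) -> Decimal:
    """Convert an extracted value and its scale to an unscaled amount."""
    key = unit.strip().lower()
    if key not in UNIT_SCALE:
        # Fail loudly so an unexpected unit string reaches review, not the database.
        raise ValueError(f"unknown unit: {unit!r}")
    # Go through str() so 1482.3 becomes Decimal('1482.3') rather than the
    # long binary expansion of the float.
    return Decimal(str(value)) * UNIT_SCALE[key]
```

    Raising on an unknown unit is a deliberate choice in this sketch: a silent default of 1 would store a figure that is off by three or six orders of magnitude.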

    Two more requirements are specific to finance. You need evidence: the source text behind each number, so values are auditable. And you need cost discipline, because monitoring filings is a recurring, high-volume job. Both are addressed below.

    Common Schemas: Company Profile, Metric Table, Guidance, Filing Metadata

    Most financial extraction work reduces to four reusable schemas. A company profile captures name, ticker, exchange, sector, headquarters, fiscal year end, and description. A metric table captures line items with period, value, unit, and currency. Guidance captures a metric, a period, a low and high range, and the GAAP or non-GAAP basis. Filing metadata captures form type, filer, period of report, filing date, and source URL.

    Because Schematron reads only the schema, you express all of this with Pydantic, Zod, or JSON Schema, and nested objects and lists are supported. Field descriptions do the work that a prompt would otherwise do: they tell the model what each field means and how to fill it. Here is a metric-table schema that also carries an evidence field, so every extracted value keeps the verbatim text it came from.

    from pydantic import BaseModel, Field
    
    class MetricRow(BaseModel):
        line_item: str = Field(
            ...,
            description="Canonical metric name, e.g. 'Total revenue'. Normalize variants like 'Net revenues' to the same name.",
        )
        period: str = Field(
            ...,
            description="Reporting period the value covers, e.g. 'Q1 2026' or 'FY2025'.",
        )
        value: float = Field(
            ...,
            description="Numeric value with no symbols or separators. Prefer the GAAP figure when both GAAP and adjusted are shown.",
        )
        unit: str = Field(
            ...,
            description="Scale of the value, e.g. 'thousands', 'millions', or 'units'.",
        )
        currency: str = Field(
            ...,
            description="ISO currency code such as 'USD'. Use 'USD' if the page shows a $ amount with no other currency.",
        )
        source_text: str = Field(
            ...,
            description="Verbatim text from the page that this value was drawn from, for auditability.",
        )
    
    class FinancialReport(BaseModel):
        company: str = Field(..., description="Issuer or company name as shown on the page.")
        metrics: list[MetricRow] = Field(
            default_factory=list,
            description="One row per reported metric found on the page.",
        )

    The source_text field is the whole auditability story in one line of schema. You do not need a separate citation system; you add a field, describe it, and the model fills it from the page. The same pattern applies whenever you pull typed JSON instead of raw DOM from a page, and it is covered in more depth in our guide to the HTML-to-JSON extraction API.
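    The guidance and filing-metadata schemas described at the start of this section follow the same shape. As a rough sketch, here they are written as plain JSON Schema held in Python dictionaries. Every field name, description, and enum value below is an assumption made for illustration, not a schema published with Schematron.

```python
import json

# Illustrative guidance schema; field names and descriptions are assumptions.
GUIDANCE_SCHEMA = {
    "title": "Guidance",
    "type": "object",
    "properties": {
        "metric": {"type": "string", "description": "Canonical metric name, e.g. 'Total revenue'."},
        "period": {"type": "string", "description": "Period the guidance covers, e.g. 'Q2 2026' or 'FY2026'."},
        "low": {"type": "number", "description": "Low end of the guided range. Repeat the point estimate if only one number is given."},
        "high": {"type": "number", "description": "High end of the guided range. Repeat the point estimate if only one number is given."},
        "basis": {"type": "string", "enum": ["GAAP", "non-GAAP"], "description": "Accounting basis the page states for this guidance."},
        "source_text": {"type": "string", "description": "Verbatim text from the page that states this guidance."},
    },
    "required": ["metric", "period", "low", "high", "basis", "source_text"],
    "additionalProperties": False,
}

# Illustrative filing-metadata schema; field names are assumptions.
FILING_METADATA_SCHEMA = {
    "title": "FilingMetadata",
    "type": "object",
    "properties": {
        "form_type": {"type": "string", "description": "Form type as shown on the page, e.g. '10-K', '10-Q', or '8-K'."},
        "filer": {"type": "string", "description": "Name of the filing entity."},
        "period_of_report": {"type": "string", "description": "Period end date in YYYY-MM-DD form."},
        "filing_date": {"type": "string", "description": "Date the filing was submitted, in YYYY-MM-DD form."},
        "source_url": {"type": "string", "description": "URL of the page the record was extracted from."},
    },
    "required": ["form_type", "filer", "period_of_report", "filing_date", "source_url"],
    "additionalProperties": False,
}

# Serialized form, ready to store next to the records it produced.
GUIDANCE_JSON = json.dumps(GUIDANCE_SCHEMA, indent=2)
```

    Restricting basis to an enum puts the GAAP versus non-GAAP decision in the schema itself, which is where this guide argues disambiguation belongs.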

    Separate the Crawl From the Extraction

    The most important architectural decision is to keep fetching and extracting as distinct steps. Schematron is the extraction layer, not a crawler, proxy network, or browser-automation tool. You fetch with whatever you already use, then extract.

    A financial extraction pipeline has six steps:

    1. Fetch the filing, IR, or tabular page HTML with your existing stack: requests, Playwright, a crawler, or cached snapshots.
    2. Clean the HTML with lxml to strip scripts, styles, and boilerplate. This matches Schematron's training preprocessing and cuts tokens.
    3. Define the schema for the records you want.
    4. Extract by sending the cleaned HTML and the schema to Schematron.
    5. Validate the response on ingest.
    6. Store or monitor, using the Batch or Group API for recurring runs.

    Figure 1: Financial Extraction Pipeline
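    The six steps above can be sketched as one small function with each stage passed in as a callable, so the fetcher or the extractor can be swapped without touching the rest. This is a sketch under stated assumptions: run_pipeline, PipelineResult, and the callable signatures are hypothetical names, and the schema is assumed to be bound inside the extract callable.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class PipelineResult:
    url: str
    record: Optional[Any]          # validated record, or None on failure
    error: Optional[str] = None    # reason the page was not ingested


def run_pipeline(
    url: str,
    fetch: Callable[[str], str],
    clean: Callable[[str], str],
    extract: Callable[[str], str],
    validate: Callable[[str], Any],
) -> PipelineResult:
    html = fetch(url)                  # step 1: your existing fetch stack
    cleaned = clean(html)              # step 2: strip scripts, styles, boilerplate
    if not cleaned:
        return PipelineResult(url, None, "empty page after cleaning")
    raw_json = extract(cleaned)        # steps 3-4: schema + cleaned HTML -> JSON text
    try:
        record = validate(raw_json)    # step 5: validate on ingest
    except ValueError as err:
        return PipelineResult(url, None, f"validation failed: {err}")
    return PipelineResult(url, record)  # step 6: caller stores or monitors
```

    Keeping the fetch and extract stages as separate arguments mirrors the separation this section recommends: either side can change without the other noticing.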

    The cleaning step is worth doing even though Schematron tolerates messy markup. Aligning preprocessing with how the model was trained improves consistency, and on a long filing page it removes a lot of tokens you would otherwise pay for. lxml is recommended because it matches training, but Readability, Trafilatura, BeautifulSoup, or regex also work; err on the side of removing less.

    import lxml.html as LH
    # Since lxml 5.2 the cleaner ships separately: pip install lxml_html_clean
    from lxml.html.clean import Cleaner
    
    HTML_CLEANER = Cleaner(
        scripts=True,
        javascript=True,
        style=True,
        inline_style=True,
        safe_attrs_only=False,
    )
    
    def strip_noise(html: str) -> str:
        """Remove scripts, styles, and JavaScript from HTML using lxml."""
        if not html or not html.strip():
            return ""
        try:
            doc = LH.fromstring(html)
            cleaned = HTML_CLEANER.clean_html(doc)
            return LH.tostring(cleaned, encoding="unicode")
        except Exception:
            return ""

    Extract Financial Data From a Filing Page

    With the schema and the clean step in place, the extraction itself is a single API call. Here is the end-to-end path on a small piece of filing HTML.

    Clean the HTML

    Run the page through the lxml cleaner from the previous section. What is left is the same content with the noise removed, ready to send.

    Define the snippet

    A realistic fragment from a filing or IR page looks like this. The numbers are wrapped in tables and spans, with labels that a selector-based scraper would have to chase across layouts.

    <section id="results">
      <h3 class="company">Acme Robotics, Inc.</h3>
      <table class="financials">
        <tr><th>Metric</th><th>Q1 2026</th></tr>
        <tr>
          <td>Total revenue</td>
          <td><span class="amt">$1,482.3</span> <small>(in millions)</small></td>
        </tr>
        <tr>
          <td>Operating income</td>
          <td><span class="amt">$214.9</span> <small>(in millions)</small></td>
        </tr>
      </table>
      <p class="note">Revenue figures are presented on a GAAP basis.</p>
    </section>

    Call Schematron

    Schematron speaks the OpenAI-compatible API, so you point the OpenAI SDK at https://api.inference.net/v1 and pass your Pydantic model as the response format. Use client.beta.chat.completions.parse so the response is parsed straight into your model. Keep temperature at 0, which is what the model is trained for.

    import os
    from openai import OpenAI
    
    from financial_schema import FinancialReport
    from clean_html_lxml import strip_noise
    
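    # Read the snippet with an explicit encoding and close the file afterwards.
    with open("filing_snippet.html", encoding="utf-8") as fh:
        raw_html = fh.read()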
    html = strip_noise(raw_html)
    
    client = OpenAI(
        base_url="https://api.inference.net/v1",
        api_key=os.environ.get("INFERENCE_API_KEY"),
    )
    
    resp = client.beta.chat.completions.parse(
        model="inference-net/schematron-v2-small",
        messages=[
            {"role": "user", "content": html},
        ],
        response_format=FinancialReport,
        temperature=0,
    )
    
    print(resp.choices[0].message.parsed.model_dump_json(indent=2))

    Representative JSON output

    The response is strictly valid JSON that conforms to the schema. Each metric carries its value, unit, currency, period, and the source_text it was drawn from.

    {
      "company": "Acme Robotics, Inc.",
      "metrics": [
        {
          "line_item": "Total revenue",
          "period": "Q1 2026",
          "value": 1482.3,
          "unit": "millions",
          "currency": "USD",
          "source_text": "Total revenue $1,482.3 (in millions)"
        },
        {
          "line_item": "Operating income",
          "period": "Q1 2026",
          "value": 214.9,
          "unit": "millions",
          "currency": "USD",
          "source_text": "Operating income $214.9 (in millions)"
        }
      ]
    }

    That is the core loop. Everything else in this guide (validation, evidence, cost, and scale) is built around that one call.

    Preserve Evidence and Validate on Ingest

    A correct number nobody can trace back to the page is still a liability when an auditor asks where it came from. Evidence and validation are what make the output safe to put in front of an analyst or an auditor.

    Evidence preservation is handled by the schema. The source_text field on each record carries the verbatim quote that backs the number, so anyone reviewing the output can trace a value to the exact phrasing on the page. Because the model fills the schema you give it, this costs you one field, not a separate pipeline.
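    Because the evidence travels with the record, a cheap consistency check becomes possible: confirm that the extracted number actually appears in its own source_text. The following is a minimal sketch, assuming evidence strings format numbers with optional thousands separators and a decimal point; value_in_evidence is a hypothetical helper, not a Schematron feature.

```python
import re

# A digit, then any run of digits or thousands separators, then optional decimals.
_NUMBER = re.compile(r"\d[\d,]*(?:\.\d+)?")


def value_in_evidence(value: float, source_text: str, rel_tol: float = 1e-9) -> bool:
    """Return True if some number in source_text matches the magnitude of value."""
    # Compare magnitudes only: filings often write negatives as "(12.5)".
    target = abs(value)
    for token in _NUMBER.findall(source_text):
        candidate = float(token.replace(",", ""))
        if abs(candidate - target) <= rel_tol * max(1.0, target):
            return True
    return False
```

    This catches a value that is missing from its evidence, but it is not proof of correctness. Any matching number in the quote passes, including a digit inside a label such as "Q1", so treat a failure as a reason to review rather than a pass as a guarantee.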

    Validation happens on ingest. Even though strict JSON mode should always return schema-conforming JSON, you still parse the response with Pydantic and handle validation errors explicitly. A valid record goes to your store; an invalid one goes to a review or retry queue. The same discipline of validation gates and async monitoring is covered in our guide to automated data extraction from websites.

    Figure 2: Validation and Auditability Flow

    from pydantic import ValidationError
    
    from financial_schema import FinancialReport
    
    def ingest(raw_json: str):
        """Validate a Schematron response and route it. Strict JSON mode should
        always return schema-conforming JSON, but validate anyway."""
        try:
            report = FinancialReport.model_validate_json(raw_json)
        except ValidationError as err:
            return send_to_review(raw_json, reason=str(err))
    
        # Flag rows that are missing the evidence that makes a value auditable.
        missing_evidence = [m for m in report.metrics if not m.source_text.strip()]
        if missing_evidence:
            return send_to_review(raw_json, reason="empty evidence field")
    
        return store(report)
    
    def store(report: FinancialReport):
        ...  # write typed records to your database
    
    def send_to_review(raw_json: str, reason: str):
        ...  # enqueue for human correction with the failure reason

    If a field is genuinely ambiguous, the fix is in the schema, not a prompt. Tighten the field description so the model knows, for example, to prefer the GAAP figure and to record the adjusted one separately.

    For monitoring at scale, the review queue is where human attention should be spent, not on every extraction. A practical setup routes records that fail validation, that are missing a required value, or whose evidence field is empty to a reviewer, and lets the rest flow through. Because each record carries its own source_text, a reviewer can confirm or correct a value against the exact phrasing without reopening the filing page. You can also dedupe on filing metadata, since the same metric can appear in multiple sections of a single document, and keep the instance whose evidence matches the primary financial statement. None of this requires a second model; it is schema fields plus ordinary application logic.
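    The dedupe step described above can be sketched in a few lines. This assumes rows from a single filing arrive as dictionaries with line_item, period, and source_text keys, and that the caller supplies an is_primary predicate that recognizes evidence from the primary financial statement. Both the function name and the predicate are hypothetical.

```python
from typing import Callable, Iterable


def dedupe_rows(
    rows: Iterable[dict],
    is_primary: Callable[[dict], bool],
) -> list:
    """Keep one row per (line_item, period) within a single filing."""
    best = {}
    for row in rows:
        key = (row["line_item"], row["period"])
        kept = best.get(key)
        # Keep the first row seen for a key, but let a primary-statement row
        # replace a row that came from somewhere else in the document.
        if kept is None or (is_primary(row) and not is_primary(kept)):
            best[key] = row
    return list(best.values())
```

    For example, is_primary could test whether source_text mentions the statement title; how you recognize the primary statement depends on your filers and is left to the caller here.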

    A Cost Model for Monitoring Filings Pages

    Monitoring filings is recurring work, so unit cost matters more than the cost of a single call. Current per-token pricing is on the model pages: Schematron V2 Small is $0.05 per million input tokens and $0.25 per million output tokens, and V2 Turbo is $0.03 per million input and $0.15 per million output. (The April 2026 launch post lists older, higher input pricing; use the model pages.)

    At a representative 10,000 input tokens and 500 output tokens per filing page, that works out to roughly $0.000625 per page on Small and $0.000375 per page on Turbo, or about $62.50 and $37.50 per 100,000 pages. This is straight arithmetic from the per-token rates, not a published benchmark, but it is enough to size a monitoring budget.
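    The arithmetic above is easy to rerun for your own page sizes. A minimal sketch, using the per-token rates quoted in this section; the model keys and the function name are labels chosen for illustration.

```python
from decimal import Decimal

# USD per million tokens (input, output), as quoted in this section.
PRICES = {
    "v2-small": (Decimal("0.05"), Decimal("0.25")),
    "v2-turbo": (Decimal("0.03"), Decimal("0.15")),
}
MILLION = Decimal(1_000_000)


def cost_per_page(model: str, input_tokens: int = 10_000, output_tokens: int = 500) -> Decimal:
    """Cost in USD of one extraction call at the given token counts."""
    in_rate, out_rate = PRICES[model]
    return (in_rate * input_tokens + out_rate * output_tokens) / MILLION


def cost_per_pages(model: str, pages: int, **tokens: int) -> Decimal:
    """Cost in USD of extracting the given number of pages."""
    return cost_per_page(model, **tokens) * pages
```

    With the defaults, cost_per_page returns 0.000625 for "v2-small" and 0.000375 for "v2-turbo". Pass your measured input_tokens and output_tokens to size a budget for pages that are longer or shorter than the representative case.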

    Figure: Extraction Cost per 100k Filing Pages, Schematron V2 Small vs Turbo (at 10k input + 500 output tokens per page)

    Throughput matters too. On a single H100 at 10k input and 500 output tokens, V2 Turbo runs at 4.14 requests per second, a 2.5x improvement over the original 8B model, and the launch post specifically calls out financial document parsing as a latency-sensitive case where that speed compounds. The model choice comes down to the workload: use V2 Small for the most complex schemas or very long pages, and V2 Turbo for maximum throughput and the lowest cost at scale.

    Model      Input $/M   Output $/M   Best for
    V2 Small   $0.05       $0.25        Complex schemas, very long pages
    V2 Turbo   $0.03       $0.15        Throughput, lowest cost at scale

    For high-throughput, cost-sensitive monitoring, V2 Turbo is usually the right default. If you want to see how it behaves on your own filing pages, try it directly.

    For recurring runs, do not call the synchronous API in a loop. The Batch API takes up to 50,000 extraction jobs at once with optional webhook updates inside a 24-hour window, and the Group API sends up to 50 requests in one payload with a single callback. If you are weighing model quality before you commit, our writeup on the published quality metrics and the Schematron V2 quality and throughput numbers go deeper.
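    Whichever API you use, a large crawl has to be split to fit the stated limits before it is submitted. The sketch below only does the splitting; it deliberately makes no Batch or Group API calls, since their request formats are not shown in this guide. The constant names and the chunked helper are illustrative.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")

BATCH_LIMIT = 50_000  # jobs per Batch API submission, as stated above
GROUP_LIMIT = 50      # requests per Group API payload, as stated above


def chunked(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    # islice pulls the next `size` items; an empty list means the input is exhausted.
    while chunk := list(islice(iterator, size)):
        yield chunk
```

    For example, chunked(page_urls, GROUP_LIMIT) yields payload-sized lists of a hypothetical page_urls collection, and chunked(jobs, BATCH_LIMIT) does the same for batch submissions.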

    When Schematron Is Not the Right Layer

    Schematron is the right tool when you have HTML or text and need a custom typed schema. Two situations call for something else first.

    Scanned or image-only filings are one. Schematron extracts from HTML and text, not from pixels. If your source is a scanned PDF or an image, run OCR first to produce text, then extract.

    Clean structured data that already exists is the other. SEC filings ship machine-readable XBRL alongside the human-readable HTML, and tools that convert XBRL to JSON return standardized line items fast, without any model guessing. When the exact figures you need are already in a clean XBRL feed or a vendor API, parse that feed. There is no reason to extract a number from rendered HTML when the filer has already tagged it for you.

    Schematron earns its place on the surfaces where no clean feed exists. XBRL covers the core financial statements of US public filers, but it does not cover an investor-relations page, a press release, a foreign filing outside the EDGAR system, a non-financial metric a company reports in prose, or a private company's posted report. It also does not give you a custom schema; you take the tagged taxonomy as-is. When you need one consistent typed schema across many heterogeneous sources, or you are pulling fields that no standardized feed exposes, HTML extraction is the right layer. A common production pattern is to use both: parse XBRL where it exists, and fall back to schema-guided extraction for everything else, writing both into the same typed records.
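    The use-both pattern reduces to a merge in which tagged XBRL values win and extracted values fill the gaps. A minimal sketch, assuming both sources have already been mapped to dictionaries keyed by the same canonical line_item and period; the origin field and the function name are hypothetical additions for traceability.

```python
from typing import Iterable


def merge_records(xbrl_rows: Iterable[dict], extracted_rows: Iterable[dict]) -> list:
    """Combine rows from both sources, preferring XBRL where keys collide."""
    merged = {}
    # Load extracted rows first so that XBRL rows can overwrite them.
    for row in extracted_rows:
        merged[(row["line_item"], row["period"])] = {**row, "origin": "html_extraction"}
    for row in xbrl_rows:
        merged[(row["line_item"], row["period"])] = {**row, "origin": "xbrl"}
    return list(merged.values())
```

    Recording the origin on every row keeps the merged table auditable: a reviewer can see at a glance whether a figure was tagged by the filer or extracted from rendered HTML.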

    Conclusion

    Financial data extraction software is most useful when you treat it as a layer rather than a monolith. Keep your existing fetch step, clean the HTML, define the schema for the records you actually need, let a schema-guided model produce typed JSON with evidence fields, validate on ingest, and budget per page. That gives you records that are repeatable across quarters and companies and traceable back to the source text.

    If you are planning a high-volume filings-monitoring pipeline, the throughput, schema design, and dedicated-capacity questions are worth talking through before you build.
