    Jun 21, 2026

    Real Estate Data Scraping: Extract Listings, Agents, and Prices into JSON

    Inference Research

    What Real Estate Data Scraping Actually Means

    Fetching a listing page is mostly a solved problem. You point a headless browser at a Zillow or Redfin URL, wait for the map and the photo carousel to hydrate, and pull the HTML. The part that breaks pipelines comes next: turning that messy, inconsistent markup into a listing record you can actually trust. The price shows up three different ways on the same page, the square footage is buried in a spec table, and the agent's name is in a JSON blob you didn't know was there.

    Real estate data scraping is the automated extraction of property listing data, such as the address, price, beds and baths, area, agent, open house dates, and listing status, from web pages into structured records. The goal is not raw HTML. It is a typed listing object that a database, a comps model, or a search index can consume without hand-cleaning.

    The pipeline this article builds has five steps:

    1. Fetch the listing HTML with your own crawler or browser.
    2. Clean the HTML to strip scripts, styles, and boilerplate.
    3. Define the listing record you want as a typed schema.
    4. Extract the record with a schema-guided model.
    5. Normalize and validate the fields before you store them.

    Steps three through five are where most guides go quiet. They list the fields you want in prose and then leave you to write brittle selectors. This one gives you a real schema, a working extraction call, and a validation step you can run. The extraction layer here is Schematron, a family of long-context models trained to extract clean, typed JSON from messy HTML. It sits after retrieval. It does not crawl pages for you, and that separation is the whole point: you keep the fetch stack you already have and swap brittle parsing for a model that reads the page. If you want the broader pattern first, see how to turn any page into typed JSON.

    Read time: 10 minutes


    The Listing Schema: Address, Price, Beds/Baths, Agent, Status

    Start with the schema, because in Schematron the schema is the only instruction the model gets. There are no system or user prompts. The field names, the types, and the descriptions you attach to them are what tell the model what to pull and how to shape it. A vague schema produces vague output; a precise one does most of your work for you.

    Here is a listing model that covers the core fields: address, price, beds and baths, area, status, agent, and open houses. Because extraction is schema-guided, the descriptions matter as much as the types.

    from enum import Enum
    
    from pydantic import BaseModel, Field
    
    
    class ListingStatus(str, Enum):
        for_sale = "for_sale"
        pending = "pending"
        sold = "sold"
        off_market = "off_market"
    
    
    class Agent(BaseModel):
        name: str = Field(..., description="Listing agent's full name.")
        brokerage: str = Field(
            default="",
            description="Brokerage the agent represents, if shown.",
        )
        phone: str = Field(
            default="",
            description="Agent contact phone number, if shown.",
        )
    
    
    class Listing(BaseModel):
        address: str = Field(
            ...,
            description="Full street address including city, state, and postal code.",
        )
        price: float = Field(
            ...,
            description="List price as a number in the listing's primary currency, excluding HOA fees.",
        )
        currency: str = Field(
            default="USD",
            description="ISO 4217 currency code for the price.",
        )
        beds: float = Field(
            ...,
            description="Number of bedrooms.",
        )
        baths: float = Field(
            ...,
            description="Number of bathrooms; use 2.5 for two full plus one half bath.",
        )
        area_sqft: float = Field(
            ...,
            description="Interior living area in square feet.",
        )
        status: ListingStatus = Field(
            ...,
            description="Current listing status.",
        )
        agent: Agent = Field(
            ...,
            description="Listing agent and brokerage details.",
        )
        open_houses: list[str] = Field(
            default_factory=list,
            description="Scheduled open house dates and times, as shown on the page.",
        )

    A few choices worth calling out. Price is a number with a separate currency field, not a string, because you will compare and aggregate it later, and a string like "$450K" is useless for that. Status is an enum, so "pending" and "Pending Sale" collapse to one value your filters can rely on. The agent is a nested object, since a listing carries both the property and the person behind it, which mirrors how schema.org models a listing as a RealEstateListing wrapped around a RealEstateAgent. Open houses are a list, because a property can have several. Beds and baths are numbers rather than text so a half-bath reads as 2.5, not "2 full, 1 half."

    The descriptions are not decoration. Because the schema is the only instruction the model receives, a field described as "list price in the listing's primary currency, excluding HOA fees" extracts more reliably than a bare price: float. Spend your effort there. Pydantic is the example throughout this article, but Zod and plain JSON Schema work the same way and produce the same guarantees.
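
    For readers working in plain JSON Schema rather than Pydantic, a trimmed hand-written equivalent of the listing model might look like the sketch below. It covers only a few of the fields, and the helper `fields_missing_descriptions` is a hypothetical lint added for illustration; the idea is simply that a field without a description is a field without instructions.

```python
import json

# Hand-written JSON Schema for a subset of the Listing model above.
# Illustrative only: field coverage is partial.
LISTING_SCHEMA = {
    "type": "object",
    "properties": {
        "address": {
            "type": "string",
            "description": "Full street address including city, state, and postal code.",
        },
        "price": {
            "type": "number",
            "description": "List price as a number in the listing's primary currency, excluding HOA fees.",
        },
        "status": {
            "type": "string",
            "enum": ["for_sale", "pending", "sold", "off_market"],
            "description": "Current listing status.",
        },
        "open_houses": {
            "type": "array",
            "items": {"type": "string"},
            "description": "Scheduled open house dates and times, as shown on the page.",
        },
    },
    "required": ["address", "price", "status"],
}


def fields_missing_descriptions(schema: dict) -> list[str]:
    """Return the names of properties that carry no description text."""
    return [
        name
        for name, spec in schema["properties"].items()
        if not spec.get("description", "").strip()
    ]


print(json.dumps(LISTING_SCHEMA, indent=2))
print("fields without descriptions:", fields_missing_descriptions(LISTING_SCHEMA))
```

    Running a check like this before you ship a schema change is a cheap way to catch a bare field that would otherwise reach the model with no guidance.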

    Why Real Estate Pages Are Hard to Scrape

    Real estate web scraping looks easy until you do it at scale. Selectors feel fine on the first listing and fall apart by the hundredth. Four things make property pages especially hostile to rule-based parsing, and they are the reason real estate listing web scraping has a reputation for breaking.

    Dynamic sections. Listings load through JavaScript, infinite scroll, and AJAX calls, so the rendered DOM looks nothing like the HTML the server first sent. A selector written against the initial markup misses half the page.

    Duplicate facts. The same property shows up across platforms, and even within one page the price or status can appear in several places with conflicting values. Reconciling them means standardizing addresses and doing fuzzy matching, not just grabbing the first match.

    Hidden metadata. Listing pages embed JSON-LD and microdata in script tags that are often cleaner and richer than the visible HTML, but they sit outside the rendered text where a naive scraper never looks.

    Inconsistent formats. Price renders as "$450,000" on one site, "450000" on another, and "$450K" on a third. Square footage sometimes includes the basement and sometimes does not. Address conventions differ everywhere, and layouts change often enough to break a selector overnight.

    Challenge             | Why it breaks selectors
    ----------------------|-------------------------------------------------
    Dynamic sections      | Rendered DOM differs from the initial HTML
    Duplicate facts       | Same property, conflicting values across sources
    Hidden metadata       | Clean JSON-LD lives in script tags, not the visible text
    Inconsistent formats  | Price, area, and address shapes vary per site

    A model that reads the whole page tolerates all four. It is robust to noisy markup and trained for long documents, so it can take the full listing page, hidden metadata included, and still return the field you asked for. Layout drift that would break a selector is just more of the noise the model already expects. That is the real win: you maintain one schema instead of a per-site library of selectors that rot every time a platform ships a redesign.
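
    The duplicate-facts problem above can be made concrete with a small sketch. It groups already-extracted records by a normalized address and flags the groups whose prices disagree. The record shape and the `source` key are assumptions for illustration, and the whitespace-and-case normalization is deliberately naive; real address matching usually needs fuzzier logic.

```python
import re
from collections import defaultdict


def address_key(address: str) -> str:
    """Naive canonical key: collapse whitespace and lowercase."""
    return re.sub(r"\s+", " ", address).strip().lower()


def find_price_conflicts(records: list[dict]) -> dict[str, list[dict]]:
    """Group records by address key and keep groups with more than one distinct price."""
    groups: dict[str, list[dict]] = defaultdict(list)
    for record in records:
        groups[address_key(record["address"])].append(record)
    return {
        key: group
        for key, group in groups.items()
        if len({r["price"] for r in group}) > 1
    }


# Hypothetical records for the same property seen on two sites.
records = [
    {"source": "site_a", "address": "142 Linden Ave,  Oakland, CA 94610", "price": 1095000.0},
    {"source": "site_b", "address": "142 linden ave, oakland, ca 94610", "price": 1075000.0},
    {"source": "site_a", "address": "9 Elm St, Oakland, CA 94610", "price": 820000.0},
]

for key, group in find_price_conflicts(records).items():
    print(key, [(r["source"], r["price"]) for r in group])
```

    Which value wins in a conflict is a policy decision (most recent, most trusted source, or manual review), so the sketch only surfaces the disagreement.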

    Fetch and Clean the Page (Separate from Extraction)

    Most of what people call real estate web scraping is really two jobs, and keeping them in separate boxes is what makes the pipeline maintainable. Getting past anti-bot defenses, rendering JavaScript, and rotating proxies is a crawler and browser problem that you own; Schematron starts once you have HTML in hand. Use requests for static pages and a headless browser for the JavaScript-heavy ones. For a full walkthrough of that side, the guide on how to fetch HTML and clean it in Python goes deeper than we will here.

    Once you have the raw HTML, clean it. Schematron was trained on HTML pre-cleaned with lxml to strip scripts, styles, and inline JavaScript, so matching that preprocessing tends to improve extraction quality and cut your token count at the same time. You do not have to use lxml; Readability, Trafilatura, BeautifulSoup, or targeted regex are all fine. lxml is recommended only because it matches how the model was trained. One side effect to be aware of: stripping scripts also removes JSON-LD blocks, so parse those before cleaning if you want them.

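    import lxml.html as LH
    # Note: in lxml 5.2 and later, lxml.html.clean lives in the separate
    # lxml_html_clean package, which must be installed for this import to work.
    from lxml.html.clean import Cleaner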
    
    HTML_CLEANER = Cleaner(
        scripts=True,
        javascript=True,
        style=True,
        inline_style=True,
        safe_attrs_only=False,
    )
    
    
    def strip_noise(html: str) -> str:
        """Remove scripts, styles, and JavaScript from HTML using lxml."""
        if not html or not html.strip():
            return ""
        try:
            doc = LH.fromstring(html)
            cleaned = HTML_CLEANER.clean_html(doc)
            return LH.tostring(cleaned, encoding="unicode")
        except Exception:
            return ""
    
    
    # A small, messy listing fragment of the kind a fetcher returns.
    raw_html = """
    <div class="listing" data-id="MLS-882104">
      <script type="application/ld+json">{"@type":"RealEstateListing"}</script>
      <h1 class="addr">142 Linden Ave, Oakland, CA 94610</h1>
      <span class="price">$1,095,000</span>
      <ul class="facts">
        <li>3 bd</li><li>2.5 ba</li><li>1,840 sqft</li>
      </ul>
      <div class="status">Pending</div>
      <div class="agent">Listed by <b>Dana Ruiz</b>, Bay Bridge Realty</div>
      <div class="oh">Open House: Sat 1-4pm</div>
    </div>
    """
    
    cleaned_html = strip_noise(raw_html)

    That strip_noise step is the boundary between the two boxes. Everything above it is your crawler. Everything below it is extraction.
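
    If lxml is not available, the "targeted regex" option mentioned above can stand in for a first pass. The sketch below is a standard-library fallback, not the preprocessing the model was trained on, and `strip_noise_regex` is a hypothetical name. Regex cannot parse HTML in general, so treat it as an approximation that works on reasonably well-formed markup.

```python
import re

# <script>...</script> and <style>...</style> blocks, case-insensitive.
_BLOCK_RE = re.compile(r"<(script|style)\b[^>]*>.*?</\1\s*>", re.IGNORECASE | re.DOTALL)
# HTML comments.
_COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
# Quoted inline event handlers and style attributes, e.g. onclick="..." or style='...'.
_ATTR_RE = re.compile(r"""\s(?:on\w+|style)\s*=\s*(?:"[^"]*"|'[^']*')""", re.IGNORECASE)


def strip_noise_regex(html: str) -> str:
    """Approximate cleaner: drop script/style blocks, comments, and inline handlers."""
    if not html or not html.strip():
        return ""
    html = _BLOCK_RE.sub("", html)
    html = _COMMENT_RE.sub("", html)
    html = _ATTR_RE.sub("", html)
    return html.strip()


sample = (
    '<div class="listing" style="color:red" onclick="track()">'
    "<script>var x=1;</script><h1>142 Linden Ave</h1><!-- ad slot -->"
    '<span class="price">$1,095,000</span></div>'
)
cleaned = strip_noise_regex(sample)
print(len(sample), "->", len(cleaned), "characters")
print(cleaned)
```

    Printing the before-and-after length is a quick way to see how much of a page is noise before you pay to send it to a model.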

    Figure 1: The Listing Extraction Pipeline

    Run Schematron with Your Listing Schema

    With a clean HTML string and a listing schema, the extraction call is short. Schematron speaks the OpenAI-compatible API, so you point the OpenAI SDK at the Inference base URL and pass your key. If you have not set that up, the OpenAI-compatible API quickstart covers it.

    import os
    
    from openai import OpenAI
    
    from fetch_clean import cleaned_html
    from listing_schema import Listing
    
    client = OpenAI(
        base_url="https://api.inference.net/v1",
        api_key=os.environ.get("INFERENCE_API_KEY"),
    )
    
    resp = client.beta.chat.completions.parse(
        model="inference-net/schematron-v2-small",
        messages=[
            {"role": "user", "content": cleaned_html},
        ],
        response_format=Listing,
        temperature=0,
    )
    
    print(resp.choices[0].message.parsed.model_dump_json(indent=2))

    A few things are happening in that call. You pass the cleaned HTML as the message content and the listing model as the response format; the schema does the rest. Temperature is zero, which is what the model is trained for. Strict JSON mode means the response conforms to your schema every time. And because the model handles long context, a full, long listing page fits in a single request; you only need to truncate or chunk when a page runs past the context window.

    The result is a clean object that matches the schema you defined:

    {
      "address": "142 Linden Ave, Oakland, CA 94610",
      "price": 1095000,
      "currency": "USD",
      "beds": 3,
      "baths": 2.5,
      "area_sqft": 1840,
      "status": "pending",
      "agent": {
        "name": "Dana Ruiz",
        "brokerage": "Bay Bridge Realty",
        "phone": ""
      },
      "open_houses": [
        "Sat 1-4pm"
      ]
    }
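
    For the pages that do run past the context window, a rough guard in front of the extraction call might look like the sketch below. The article does not state a context size, so `max_input_tokens` is a parameter you would set from the model documentation, and the four-characters-per-token ratio is a heuristic rather than a real tokenizer.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate; the default ratio is a heuristic, not a tokenizer."""
    return math.ceil(len(text) / chars_per_token)


def fit_to_budget(
    html: str, max_input_tokens: int, chars_per_token: float = 4.0
) -> tuple[str, bool]:
    """Return (html, truncated), cutting at the last '>' so no tag is split in half."""
    if estimate_tokens(html, chars_per_token) <= max_input_tokens:
        return html, False
    max_chars = int(max_input_tokens * chars_per_token)
    clipped = html[:max_chars]
    last_close = clipped.rfind(">")
    if last_close != -1:
        clipped = clipped[: last_close + 1]
    return clipped, True


page = "<li>3 bd</li>" * 100
fitted, truncated = fit_to_budget(page, max_input_tokens=100)
print(truncated, estimate_tokens(page), "->", estimate_tokens(fitted))
```

    Truncation can drop fields that sit late in the page, so it is worth logging the `truncated` flag and routing those records to review rather than trusting them silently.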

    If you are running this at any volume, throughput and cost start to matter more than the last fraction of a quality point.

    Normalize Fields and Validate Required Values

    Schema-valid is not the same as correct. The model returns JSON that matches your types, but it cannot know that a price of zero means "not listed" rather than "free," or that a missing address should block the record from your database. Those are semantic checks, and they belong to you. The Schematron docs make the same point: validate on ingest even though the output is already schema-conforming.

    Normalization comes first, and the listing world gives you plenty to normalize. Parse the price string into a number and a currency, since the same value renders as "$450,000", "450000", or "$450K" depending on the source. Convert area to one consistent unit, and record what the figure covers, because square footage sometimes includes the basement and sometimes does not. Map the status text onto your enum, collapsing the dozen ways a site can say "under contract" into one. And standardize the address: that last step does double duty, since a canonical address is also what lets you deduplicate the same property when it appears across several sources with conflicting details.

    import re
    
    from pydantic import ValidationError
    
    from listing_schema import Listing
    
    REQUIRED_FIELDS = ("address", "price", "status")
    
    
    def normalize_price(value) -> float:
        """Turn '$450,000', '450000', or '$450K' into a float."""
        if isinstance(value, (int, float)):
            return float(value)
        text = str(value).strip().lower().replace(",", "").replace("$", "")
        multiplier = 1_000 if text.endswith("k") else 1
        text = text.rstrip("k")
        return float(text) * multiplier
    
    
    def normalize_address(value: str) -> str:
        """Collapse whitespace and lowercase so the same property dedupes across sources."""
        return re.sub(r"\s+", " ", value).strip().lower()
    
    
    def to_record(raw_json: str) -> dict | None:
        """Validate the Schematron response, then normalize and gate it."""
        try:
            listing = Listing.model_validate_json(raw_json)
        except ValidationError as exc:
            # Schema-invalid output is rare, but handle it explicitly.
            print(f"validation failed: {exc}")
            return None
    
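        record = listing.model_dump()
        # After validation the price is already a float, so this call passes it
        # through unchanged; normalize_price matters for raw string inputs.
        record["price"] = normalize_price(record["price"])
        record["address_key"] = normalize_address(record["address"])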
    
        # Gate on required values: store if complete, otherwise send to review.
        missing = [f for f in REQUIRED_FIELDS if not record.get(f)]
        if missing:
            print(f"sending to review queue, missing: {missing}")
            return None
    
        return record

    Then gate on required fields. If address, price, and status are all present, the record goes to storage. If any are missing, it goes to a review queue instead of silently corrupting your dataset. A review queue sounds like overhead, but it usually catches a tiny fraction of records, and it is far cheaper than finding bad rows weeks later in a report someone already trusted.

    Batch Extraction and Update Detection

    One listing is a function call. A catalog of them is a batch job. The Batch API accepts up to 50,000 extraction jobs at once, runs them asynchronously, and can fire webhooks as they finish within a 24-hour window. For smaller runs, the Group API takes up to 50 requests in a single payload and gives you one callback when the whole group is done.
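
    The request format for these APIs is not shown in this article, so the sketch below stops short of any network call. It only illustrates the client-side step of splitting a list of cleaned pages into groups that respect the 50-request limit quoted above; `chunked` and `GROUP_API_MAX` are hypothetical names.

```python
from typing import Iterator

GROUP_API_MAX = 50  # per-group request limit quoted in the text above


def chunked(items: list[str], size: int = GROUP_API_MAX) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start : start + size]


# Stand-ins for cleaned listing pages.
pages = [f"<div class='listing'>{i}</div>" for i in range(120)]
groups = list(chunked(pages))
print([len(group) for group in groups])
```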

    Listings are not static, so you will re-extract them on a schedule. When you do, diff the normalized record, not the raw HTML. A price change, a status flip to pending or sold, or a newly posted open house are real updates worth acting on. Raw HTML changes constantly for reasons that have nothing to do with the listing, like ad slots and layout tweaks, so diffing it produces noise instead of signal. Diffing the typed record instead means a meaningful change is one your downstream systems can react to directly: trigger an alert when a watched property goes pending, or refresh a comp when a price drops.
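
    Diffing the typed record can be as small as the sketch below. It compares two normalized records on a watched set of fields and returns only what changed. The choice of watched fields and the name `diff_records` are assumptions for illustration.

```python
# Fields whose changes are worth acting on; an illustrative choice.
WATCHED_FIELDS = ("price", "status", "open_houses")


def diff_records(old: dict, new: dict, fields=WATCHED_FIELDS) -> dict[str, tuple]:
    """Return {field: (old_value, new_value)} for watched fields that changed."""
    return {
        field: (old.get(field), new.get(field))
        for field in fields
        if old.get(field) != new.get(field)
    }


previous = {
    "address": "142 Linden Ave, Oakland, CA 94610",
    "price": 1095000.0,
    "status": "for_sale",
    "open_houses": ["Sat 1-4pm"],
}
current = {
    "address": "142 Linden Ave, Oakland, CA 94610",
    "price": 1075000.0,
    "status": "pending",
    "open_houses": ["Sat 1-4pm"],
}

changes = diff_records(previous, current)
print(changes)
```

    An empty result means nothing a downstream system cares about has moved, even if the page's raw HTML is byte-for-byte different.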

    This also keeps re-extraction cheap. The only recurring cost is running pages through the model again, and on long or inconsistent listing pages the long-context model handles the variation without per-site special-casing. If you want this wired into a fuller architecture with validation gates and monitoring, the piece on extraction architecture with validation gates lays out the surrounding system.

    Small vs Turbo: A Cost Note for Listing Extraction

    Before the cost question, the quality question. Schematron V2 comes in two variants, and both hold up against much larger general-purpose models. On an LLM-as-a-judge benchmark graded on a one-to-five scale, V2 Small scores 4.060 and V2 Turbo scores 4.039, against 4.070 for the original 8B model and 3.909 for the first-generation 3B. Both V2 models outscore DeepSeek V3.2 and GPT-5.4 Nano on the same benchmark despite being far smaller.

    Figure: Schematron Extraction Quality, LLM-as-a-judge average score (1-5 scale)

    So the choice between them is mostly about throughput and price, not quality.

    Dimension     | V2 Small        | V2 Turbo
    --------------|-----------------|------------------
    Input price   | $0.05 / M       | $0.03 / M
    Output price  | $0.25 / M       | $0.15 / M
    Quality score | 4.060           | 4.039
    Best for      | Hardest schemas | High-volume batch

    Prices are per million tokens. Quality is the LLM-as-a-judge average (1-5 scale).

    The practical guidance: reach for Turbo on high-volume, cost-sensitive batch runs, which is most catalog work, and keep Small for the hardest schemas or the longest pages where you want every bit of quality. For very high volume, dedicated capacity brings the per-page cost down further.
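
    To turn the table into a number for your own catalog, the arithmetic is simple enough to sketch. The per-token prices come from the table above; the page count and per-page token figures in the example are made-up inputs, and the dictionary keys are illustrative labels rather than API model names.

```python
# USD per million tokens, copied from the table above.
PRICES = {
    "v2-small": {"input": 0.05, "output": 0.25},
    "v2-turbo": {"input": 0.03, "output": 0.15},
}


def batch_cost(
    model: str, pages: int, input_tokens_per_page: int, output_tokens_per_page: int
) -> float:
    """Estimated USD cost of extracting `pages` listings with the given token sizes."""
    price = PRICES[model]
    input_cost = pages * input_tokens_per_page * price["input"] / 1_000_000
    output_cost = pages * output_tokens_per_page * price["output"] / 1_000_000
    return input_cost + output_cost


# Hypothetical workload: 100,000 pages, 10,000 input and 300 output tokens each.
for model in PRICES:
    print(model, round(batch_cost(model, 100_000, 10_000, 300), 2))
```

    On listing pages the input side dominates, since a cleaned page is far larger than the JSON record it produces, which is also why cleaning the HTML first has a direct effect on the bill.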

    When Schematron Is Not the Right Layer

    A model is the right tool for messy extraction, not for everything. Three cases call for something simpler.

    If your only problem is getting the page, this is not your tool. Rendering JavaScript, solving anti-bot challenges, and rotating proxies are crawler and browser jobs; Schematron only runs once you already have HTML.

    If the data is already clean structured JSON-LD, just parse it. When a listing page hands you a well-formed RealEstateListing block, deterministic parsing is faster, cheaper, and exactly as accurate as a model.

    And if you need fields that are not on the page, like computed comps or geocoded coordinates, no extractor can invent them. Schematron pulls what is present, and any field that requires synthesis has to be spelled out explicitly in the schema description. When the real question is whether to build this layer yourself at all, the breakdown of whether to build the extraction layer or call an API is a good next read.

    Conclusion

    The reliable path through real estate data scraping is the same every time: fetch and clean the page, define the listing schema, extract with a schema-guided model, normalize and validate the fields, then batch the job and diff normalized records to catch updates. That is what separates durable real estate scraping from a script that breaks on the next redesign. The schema carries the intent, the model absorbs the mess, and your validation gate keeps the bad rows out.

    The fastest way to see whether it fits your pages is to run one. Take a single listing page, write the schema for the record you want, and send both to Schematron.
