Jun 15, 2026

    AI Regression Testing for LLM Apps: How to Gate Prompts, Models, and Agents

    Inference Research

    The Prompt Tweak That Breaks Production

    A one-line edit to a system prompt ships on Friday afternoon. The diff is trivial, the unit tests are green, and code review took ninety seconds. By Monday, the support agent built on that prompt is quietly refusing half of its tool calls. No exception was thrown. No test failed. The first signal is a customer email.

    That failure mode is what AI regression testing exists to catch. The term gets used two ways, so it's worth being precise. Most pages ranking for "AI regression testing" describe AI tools that automate traditional QA suites through self-healing selectors, visual diffing, and test generation. This article is about the other meaning. AI regression testing, in the sense that matters to teams shipping LLM apps, is the practice of verifying that a change to a prompt, a model, or an agent's logic doesn't break behavior that previously worked, both before and after that change reaches users.

    The thesis is simple: prompt edits, model swaps, and agent changes are code changes. They deserve the same discipline: no merge without a passing regression suite. The rest of this guide walks through how to build that suite for an LLM app: why these systems regress silently, which changes need a gate, how to build a regression dataset from production traces instead of hand-written examples, how to write pass/fail criteria with LLM-as-a-judge rubrics, how to wire the gate into CI, and how to catch the regressions that only show up after deploy.

    Read time: 13 minutes


    Why LLM Apps Regress Silently

    Traditional regression testing assumes a deterministic system: same input, same output, assert equality. LLM apps violate that assumption three different ways, and each one defeats a different layer of your existing test pyramid.

    Nondeterminism Breaks assertEqual

    An LLM's output is a sample from a distribution, not a value. Run the same prompt twice and you can get two different completions, both acceptable, or one acceptable and one subtly wrong. Correctness is graded, not boolean. A unit test that asserts string equality is either flaky or pinned to a single canned response that proves nothing. QA regression suites have the same blind spot one level up: they replay user flows and check that screens render and buttons work, which they will, right up until the text inside the response stops being good.

    This is also why prompt testing can't be eyeballed. A prompt edit that fixes the case you were staring at can degrade ten cases you weren't, and nothing in a traditional pipeline will tell you. That blind spot is exactly what LLM regression testing exists to close.
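
    As a rough illustration of "graded, not boolean," the sketch below replaces a single string-equality assertion with a pass rate over repeated samples. The fake_model function and the mentions_refund_tool check are hypothetical stand-ins for a real model call and a real acceptance check.

```python
import random
from typing import Callable

_rng = random.Random(0)  # seeded so the stand-in model is repeatable


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call: usually uses the tool, sometimes refuses."""
    if _rng.random() < 0.9:
        return "I've called issue_refund for the duplicate charge."
    return "Sorry, I can't help with billing."


def mentions_refund_tool(output: str) -> bool:
    """Acceptance check: many phrasings pass as long as the tool was used."""
    return "issue_refund" in output


def pass_rate(generate: Callable[[str], str],
              check: Callable[[str], bool],
              prompt: str,
              n: int = 20) -> float:
    """Sample the model n times and return the fraction of outputs that pass."""
    passes = sum(1 for _ in range(n) if check(generate(prompt)))
    return passes / n


rate = pass_rate(fake_model, mentions_refund_tool, "I was charged twice", n=50)
print(f"pass rate over 50 samples: {rate:.2f}")
```

    A gate built this way thresholds on the rate (for example, "at least 0.85") rather than on any one completion, which is why it tolerates acceptable variation without going flaky.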

    Models Change Under You

    Even with zero changes to your code, the model behind your API can move. Researchers at Stanford and UC Berkeley measured this directly: between the March and June 2023 snapshots of GPT-4, accuracy on a prime-vs-composite identification task fell from 84.0% to 51.1%, and the fraction of code generations that were directly executable dropped from 52.0% to 10.0%. Same model name, same API, materially different behavior.

    Figure 1: GPT-4 Behavior Drift, March vs June 2023. Same model name, different snapshot. Source: Stanford/Berkeley study (arXiv 2307.09009).

    Providers also retire models outright. OpenAI removed GPT-4o, GPT-4.1, GPT-4.1 mini, and o4-mini from ChatGPT in February 2026 and deprecated the chatgpt-4o-latest API snapshot the same week. Its deprecations page is a standing calendar of forced migrations. When the pin you've been holding expires, you will swap models on the provider's schedule, not yours. You'll want a regression suite waiting when you do.

    Agent Loops Drift

    Single-call apps regress at the level of one response. Agents regress at the level of trajectories. Add a tool, reword a tool description, change the routing logic or what gets stuffed into memory, and the failure shows up as shape: loops that run three extra iterations, the wrong tool selected for a familiar request, context dropped between steps. The final answer can even look fine while the run quietly costs four times as much. These regressions are invisible in single-call diffs. They only become legible across many runs, in traces.

    The Three Change Types You Need to Gate

    A regression gate is only useful if you know which pull requests trigger it. For LLM apps, three classes of change should never merge ungated.

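    | Change type | Example trigger      | Typical regression     | Gate checks         |
    |-------------|----------------------|------------------------|---------------------|
    | Prompt edit | System-prompt tweak  | Broken tool calls      | Output quality      |
    | Model swap  | Provider deprecation | Refusal, format shifts | Side-by-side scores |
    | Agent logic | New tool, routing    | Loops, wrong tool      | Trajectory checks   |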

    Notes: model deprecations are routine (OpenAI retired the GPT-4o family from ChatGPT in February 2026), and behavior can drift even without a swap, as the 2023 GPT-4 snapshot study showed.

    Prompt edits are the highest-frequency change and get the least ceremony. A system-prompt tweak can shift tool-call rates, output formats, and refusal behavior all at once. Because prompts look like copy rather than code, they're often edited directly in a dashboard with no review at all, which is exactly backwards given their blast radius.

    Model and provider swaps happen less often but change more. Deprecations force them, cost optimization motivates them, and new releases tempt them. A swap can alter refusal behavior, JSON-mode quirks, tool-calling conventions, and latency in one move. The gate's job here is comparative: run baseline and candidate on the same dataset and find out exactly what you'd be trading.

    Agent logic changes (new tools, edited descriptions, routing, memory) need the gate to look at trajectories, not just final outputs. Did tool selection stay correct? Did loop counts hold steady? Did the run still complete the task?
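
    The three questions above can be sketched as a comparison between a baseline run and a candidate run on the same input. The Trajectory fields and the max_extra_iterations threshold are assumptions made for illustration; they are not a schema from any particular tracing tool.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    tool_calls: list[str]  # tool names in call order
    iterations: int        # number of agent loop iterations
    completed: bool        # whether the run finished the task


def trajectory_regressions(baseline: Trajectory,
                           candidate: Trajectory,
                           max_extra_iterations: int = 1) -> list[str]:
    """Return human-readable reasons the candidate run regressed, if any."""
    problems = []
    if baseline.completed and not candidate.completed:
        problems.append("task no longer completes")
    if candidate.tool_calls != baseline.tool_calls:
        problems.append(
            f"tool sequence changed: {baseline.tool_calls} -> {candidate.tool_calls}"
        )
    if candidate.iterations > baseline.iterations + max_extra_iterations:
        problems.append(
            f"loop count grew from {baseline.iterations} to {candidate.iterations}"
        )
    return problems


baseline = Trajectory(["lookup_invoice", "issue_refund"], iterations=2, completed=True)
candidate = Trajectory(["lookup_invoice"], iterations=5, completed=True)
for problem in trajectory_regressions(baseline, candidate):
    print(problem)
```

    Exact sequence matching is deliberately strict. If several tool orders are equally valid for a task, comparing sets of tool names, or leaving that judgment to a rubric, may fit better.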

    All three funnel through the same machinery. Only the emphasis differs: single-turn output quality for prompt edits, side-by-side comparison for swaps, trajectory-level checks for agent changes.

    Build the Regression Dataset from Production Traces

    Here's where most LLM testing advice goes wrong: it starts with hand-written test cases. Hand-written cases encode what you imagined users would do. Production traffic contains what they actually did: the ambiguous phrasing, the half-formed requests, the adversarial weirdness that actually broke you last quarter. The best regression dataset isn't written; it's curated from real traces.

    A good eval dataset is small, stable, and challenging: pick the hard cases, make them your benchmark, and resist the urge to churn them. Stability is what makes scores comparable across runs. If the dataset moves, the baseline means nothing.
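
    One way to make "the dataset must not move" checkable is to fingerprint it and store the fingerprint next to the baseline score. This is a minimal sketch that assumes samples are plain dictionaries; the field names are hypothetical.

```python
import hashlib
import json


def dataset_fingerprint(samples: list[dict]) -> str:
    """Return a SHA-256 digest that changes whenever any sample changes."""
    digest = hashlib.sha256()
    for sample in samples:
        # sort_keys makes the digest independent of key order within a sample
        line = json.dumps(sample, sort_keys=True, ensure_ascii=False)
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


samples = [
    {"id": "s1", "input": "I was charged twice", "reference": "Refund issued."},
    {"id": "s2", "input": "Cancel my plan", "reference": "Plan cancelled."},
]
baseline_fingerprint = dataset_fingerprint(samples)

# Before comparing a new score to the stored baseline, confirm the benchmark
# is still the same benchmark.
scores_comparable = dataset_fingerprint(samples) == baseline_fingerprint
print("scores comparable:", scores_comparable)
```

    When the fingerprint changes because new cases were deliberately added, the honest move is to re-run the baseline on the new dataset before comparing anything against it.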

    Figure 2: The LLM Regression Gate Pipeline

    Capture: Tracing SDK or Gateway

    You can't curate what you didn't record, so capture comes first. There are two paths, and they're independent.

    The first is to install a tracing SDK in-process. Catalyst's tracing packages, @inference/tracing for TypeScript and inference-catalyst-tracing for Python, initialize with a single setup() call before your LLM clients are constructed, and the Python SDK auto-detects whichever instrumented packages you have installed. Wrapping a logical unit of work in agent_span nests every LLM call under an agent row carrying agent.id, agent.name, and session.id, which is what makes runs findable and curatable later.

    # Install the tracing SDK with the OpenAI extra:
    #   pip install 'inference-catalyst-tracing[openai]'
    #
    # Configure export before the app starts:
    #   export CATALYST_OTLP_ENDPOINT="https://telemetry.inference.net"
    #   export CATALYST_OTLP_TOKEN="<your-token>"
    #   export CATALYST_SERVICE_NAME="support-agent"
    
    import os
    
    from inference_catalyst_tracing import agent_span, setup
    from openai import OpenAI
    
    tracing = setup()  # auto-detects installed packages; call before clients
    client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    
    # Wrap each session so every LLM call nests under an AGENT row
    # carrying agent.id, agent.name, and session.id — this is what makes
    # runs filterable and curatable into datasets later.
    with agent_span(
        tracing.tracer,
        agent_id="support-agent",
        agent_name="Support Agent",
        session_id="session-4821",
    ) as span:
        span.set_input("My last invoice was charged twice — can you fix it?")
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "user", "content": "My last invoice was charged twice — can you fix it?"}
            ],
            max_tokens=16,
        )
        span.set_output(response.choices[0].message.content or "")
    
    tracing.shutdown()  # flush batched spans before a short-lived process exits

    The second path is to route traffic through the Catalyst Gateway: point your existing SDK at https://api.inference.net/v1, keep your provider keys, and every request is captured with full payloads, cost, latency, and token counts at roughly 10ms of overhead. Tagging requests with the x-inference-task-id header keeps each AI feature's traffic separately filterable, which pays off the moment you start curating.
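
    A minimal sketch of the gateway path, assuming the OpenAI Python SDK's base_url and default_headers constructor options. The task ID value is hypothetical, and which credential the gateway expects is an assumption based on "keep your provider keys," so confirm it against the Gateway docs.

```python
import os

GATEWAY_BASE_URL = "https://api.inference.net/v1"  # endpoint named in the article


def gateway_client_kwargs(task_id: str, api_key: str) -> dict:
    """Build constructor arguments for an OpenAI-compatible client."""
    return {
        "base_url": GATEWAY_BASE_URL,
        "api_key": api_key,  # assumption: the existing provider key is passed through
        # Tag every request so this feature's traffic can be filtered later.
        "default_headers": {"x-inference-task-id": task_id},
    }


kwargs = gateway_client_kwargs(
    task_id="support-agent",  # hypothetical task name
    api_key=os.environ.get("OPENAI_API_KEY", "<your-key>"),
)

# With the openai package installed, the client would be built as:
#   from openai import OpenAI
#   client = OpenAI(**kwargs)
```

    Setting the header once on the client, rather than per request, keeps the tag from being forgotten on individual calls.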

    Curate: Failures and Golden Paths into Datasets

    With traffic flowing, building the dataset is a filtering exercise. Filter captured inferences by model, task, provider, or status code, then save the filtered slice as an eval dataset directly from the dashboard. Curate two kinds of cases. Failures, the sessions users actually hit problems in, are your regression bait; if a change reintroduces one of these, the gate should catch it. Golden paths, the representative sessions that must keep working, are your insurance against collateral damage.
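
    For teams that export traces and curate them by hand, the same split can be sketched in a few lines. The record fields (task, status_code, user_flagged, resolved) are hypothetical, and the JSONL layout is not guaranteed to match what any dataset-upload command expects.

```python
import json


def curate(traces: list[dict], task: str) -> dict[str, list[dict]]:
    """Split one task's traces into failure cases and golden-path cases."""
    failures, golden_paths = [], []
    for trace in traces:
        if trace.get("task") != task:
            continue
        if trace.get("status_code", 200) >= 400 or trace.get("user_flagged"):
            failures.append(trace)      # regression bait
        elif trace.get("resolved"):
            golden_paths.append(trace)  # must keep working
    return {"failures": failures, "golden_paths": golden_paths}


def to_jsonl(samples: list[dict]) -> str:
    """Serialize samples as one JSON object per line."""
    return "\n".join(json.dumps(s, ensure_ascii=False) for s in samples)


traces = [
    {"id": "t1", "task": "support-agent", "status_code": 200, "resolved": True},
    {"id": "t2", "task": "support-agent", "status_code": 500, "resolved": False},
    {"id": "t3", "task": "search", "status_code": 200, "resolved": True},
    {"id": "t4", "task": "support-agent", "status_code": 200, "user_flagged": True},
]
buckets = curate(traces, task="support-agent")
print(to_jsonl(buckets["failures"] + buckets["golden_paths"]))
```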

    One more property matters if you also fine-tune: eval data and training data must never overlap, and Catalyst enforces that automatically by excluding overlapping examples from training datasets. Your benchmark stays a benchmark.
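
    For pipelines that assemble training data outside the platform, the same guarantee can be approximated by hand. This sketch is not Catalyst's implementation: it assumes each sample has an "input" string and treats two samples as overlapping when their normalized inputs match exactly.

```python
import hashlib


def input_key(sample: dict) -> str:
    """Hash the lowercased, whitespace-normalized input text."""
    normalized = " ".join(sample["input"].lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def exclude_eval_overlap(training: list[dict], eval_set: list[dict]) -> list[dict]:
    """Return the training samples whose input does not appear in the eval set."""
    eval_keys = {input_key(s) for s in eval_set}
    return [s for s in training if input_key(s) not in eval_keys]


eval_set = [{"input": "I was charged twice"}]
training = [
    {"input": "  I was  charged TWICE "},  # same text after normalization
    {"input": "How do I export my invoices?"},
]
clean_training = exclude_eval_overlap(training, eval_set)
print(len(clean_training), "training samples kept")
```

    Exact matching misses paraphrases, so near-duplicate detection would need a similarity measure on top of this.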

    Write Pass/Fail Criteria with LLM-as-a-Judge Rubrics

    A regression gate needs a number to threshold on. For graded qualities (was the answer grounded, did it complete the task, were the tool calls right) that number comes from an LLM judge scoring outputs against a rubric.

    A rubric is a plain-English description of one quality dimension, scored numerically. You can generate one from your data, start from a template, or just write it yourself. Rubrics inject context through three template variables: {{ conversation_context }} for the input, {{ conversation_response }} for the reference output, and {{ eval_model_response }} for the output being scored. The last one is required in every rubric.

    That reference variable is what makes rubrics fit regression testing so naturally. An adherence rubric uses all three variables and scores the candidate's output against the known-good response already in your dataset; a direct rubric omits the reference and judges the output on its own terms. For gating changes, adherence is the workhorse: your curated golden paths are literally the reference answers.

    Specificity is everything. "Is this response accurate?" produces mush; "does the extracted JSON contain all required fields with correct types?" produces a signal. The default 1–10 range gives the judge room to express degrees; a 1–3 range works for nearly boolean dimensions. Describe what separates each score level, or the judge will improvise.

    # Rubric: Tool-Call Correctness and Task Completion (adherence, 1-10)
    
    You are judging a support agent's response against a known-good reference
    response from the same conversation.
    
    The conversation so far:
    
    {{ conversation_context }}
    
    The reference response that previously resolved this session correctly:
    
    {{ conversation_response }}
    
    The response you are scoring:
    
    {{ eval_model_response }}
    
    Score from 1 to 10 using these levels:
    
    - **9-10:** Calls the same tools as the reference (or an equivalent set) with
      correct arguments, completes the user's task, and stays grounded in the
      conversation. Differences from the reference are stylistic only.
    - **7-8:** Completes the task with correct tool calls, but with minor gaps:
      an unnecessary extra tool call, slightly off formatting, or missing a
      secondary detail the reference included.
    - **4-6:** Partially completes the task. A required tool call is missing,
      has wrong arguments, or the response answers from memory where the
      reference used a tool.
    - **2-3:** Fails the task: wrong tool selected, refuses a request the
      reference fulfilled, or contradicts information in the conversation.
    - **1:** No meaningful attempt — off-topic, empty, or hallucinated content
      unrelated to the conversation.
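
    To make the variable rules concrete, here is a minimal sketch of filling the three placeholders and rejecting a rubric that omits the required one. It illustrates the rules described above and is not the platform's renderer; the function names are hypothetical.

```python
import re

PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")
REQUIRED = "eval_model_response"


def render_rubric(template: str, values: dict[str, str]) -> str:
    """Fill {{ name }} placeholders; reject rubrics missing the required variable."""
    names = set(PLACEHOLDER.findall(template))
    if REQUIRED not in names:
        raise ValueError(f"rubric must reference {{{{ {REQUIRED} }}}}")
    missing = names - set(values)
    if missing:
        raise KeyError(f"no value supplied for: {sorted(missing)}")
    return PLACEHOLDER.sub(lambda match: values[match.group(1)], template)


def rubric_kind(template: str) -> str:
    """Adherence rubrics reference the known-good response; direct rubrics do not."""
    names = set(PLACEHOLDER.findall(template))
    return "adherence" if "conversation_response" in names else "direct"


template = (
    "Context: {{ conversation_context }}\n"
    "Reference: {{ conversation_response }}\n"
    "Candidate: {{ eval_model_response }}"
)
prompt = render_rubric(template, {
    "conversation_context": "User: I was charged twice.",
    "conversation_response": "Refund issued via issue_refund.",
    "eval_model_response": "I can't help with billing.",
})
print(rubric_kind(template))
print(prompt)
```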

    Two operational notes. Use the strongest model available as your judge: a judge weaker than the models it's scoring can't reliably distinguish quality. And budget for the cross-product: every sample runs through every candidate model, and every output gets one judge call, so 10 samples across 3 models means 30 generations and 30 judge scores.
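
    The cross-product arithmetic is simple enough to put in a helper when estimating cost before a run. This sketch only counts calls; token counts and prices are left out because they depend on the models involved.

```python
def eval_budget(samples: int, models: int) -> dict[str, int]:
    """Count the calls in one run: a generation and a judge call per (sample, model) pair."""
    generations = samples * models
    judge_calls = generations  # one judge score per generated output
    return {
        "generations": generations,
        "judge_calls": judge_calls,
        "total_calls": generations + judge_calls,
    }


print(eval_budget(samples=10, models=3))
# {'generations': 30, 'judge_calls': 30, 'total_calls': 60}
```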

    When to Use Exact Match and Code Assertions Instead

    Not everything needs a judge. Deterministic contracts (JSON schema validity, required fields, a regex the output must match, a latency budget) are better checked with exact-match and code assertions: they're free, instant, and have zero variance. This is the assertion model that frameworks like promptfoo and DeepEval are built around, and Evidently's descriptor checks (semantic similarity, length, regex, toxicity) live in the same family. The practical rule: assert the contract with code, judge the quality with a rubric, and don't pay an LLM to check whether a field exists.
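
    A minimal sketch of "assert the contract with code," using only the standard library. The required fields and the ticket ID pattern are hypothetical; substitute whatever the application's output contract actually specifies.

```python
import json
import re

# Hypothetical contract for a support agent's structured output.
REQUIRED_FIELDS = {"ticket_id": str, "action": str, "approved": bool}
TICKET_ID = re.compile(r"^TCK-\d{4,}$")


def contract_violations(raw_output: str) -> list[str]:
    """Return every way the output breaks the contract; an empty list means it passes."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level JSON value is not an object"]
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in data:
            problems.append(f"missing field: {field}")
        elif not isinstance(data[field], expected):
            problems.append(f"{field} should be {expected.__name__}")
    ticket_id = data.get("ticket_id")
    if isinstance(ticket_id, str) and not TICKET_ID.match(ticket_id):
        problems.append("ticket_id does not match TCK-<digits>")
    return problems


good = '{"ticket_id": "TCK-10482", "action": "refund", "approved": true}'
bad = '{"ticket_id": "10482", "approved": "yes"}'
print(contract_violations(good))
print(contract_violations(bad))
```

    Checks like this run before the judge, so an output that fails the contract never costs a judge call.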

    Wire the Gate into CI

    Everything so far is preparation. The gate itself is mechanical: every PR that touches a prompt, a model pin, or agent logic triggers an eval run, and the merge waits on the score.

    The whole loop is scriptable from the eval CLI: create a rubric from a markdown file, materialize an eval dataset, launch a run group against one or more models with a judge model, and poll for results.

    # One-time setup (locally or in the CI image):
    #   npm install -g @inference/cli && inf auth login
    
    # 1. Create the rubric from a markdown file (one-time; reuse the ID afterwards)
    inf eval rubric create -n tool-call-correctness-v1 -f ./rubric.md
    # → Rubric rub_abc12 / version rv_xyz45 created.
    
    # 2. Materialize the eval dataset (traffic-backed, upload-backed, or from a file)
    inf dataset create -n support-agent-eval -t eval --file ./samples.jsonl
    # → Dataset ds_def78 created.
    
    # 3. On every prompt/model/agent PR: launch the run group.
    #    Pin the rubric version so the measuring stick can't move, and pass
    #    baseline + candidate models for a side-by-side on identical samples.
    inf eval run \
      --rubric-id rub_abc12 \
      --rubric-version-id rv_xyz45 \
      --dataset-id ds_def78 \
      --models openai:gpt-5.2,anthropic:claude-sonnet-4-6 \
      --judge-model anthropic:claude-sonnet-4-6
    # → Run group rg_20260415_152340 created.
    
    # 4. Read the verdict: one row per model with status, average score,
    #    failed samples, and completed/total. Gate the merge on
    #    "candidate average >= baseline average - tolerance".
    inf eval get rg_20260415_152340

    Three details make the gate trustworthy rather than noisy. First, pin the rubric version with --rubric-version-id so the measuring stick can't move between baseline and candidate. Second, size the sample deliberately: --sample-size draws up to 100 samples per model from the dataset, and too few samples make the gate flaky in exactly the way that teaches engineers to ignore it. Third, gate on the comparison, not an absolute. inf eval get reports a per-model average score alongside failed-sample counts and completion totals; the verdict you want is "candidate within tolerance of baseline," because absolute scores drift with judge and dataset in ways that paired comparisons don't.

    Model swaps get the same treatment with one twist: pass both route IDs to --models and the run becomes a side-by-side on identical samples and an identical rubric. OpenAI, Anthropic, open-source, and your own fine-tuned models can all sit in the same comparison. Before promoting a candidate, read the per-sample breakdown rather than trusting the average; aggregates are where localized regressions hide.
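
    The verdict logic can be sketched as a paired comparison over per-sample scores. This assumes scores have already been exported into dictionaries keyed by sample ID (how that export happens is not shown here), and the tolerance and per_sample_drop values are hypothetical defaults.

```python
from statistics import mean


def gate(baseline: dict[str, float],
         candidate: dict[str, float],
         tolerance: float = 1.0,
         per_sample_drop: float = 2.0) -> dict:
    """Compare paired scores: check the average, then list samples that dropped sharply."""
    shared = sorted(baseline.keys() & candidate.keys())
    if not shared:
        raise ValueError("no samples in common between baseline and candidate")
    baseline_avg = mean(baseline[s] for s in shared)
    candidate_avg = mean(candidate[s] for s in shared)
    regressed = [s for s in shared if baseline[s] - candidate[s] >= per_sample_drop]
    return {
        "average_ok": candidate_avg >= baseline_avg - tolerance,
        "regressed_samples": regressed,
    }


baseline_scores = {f"s{i:02d}": 9.0 for i in range(1, 11)}
candidate_scores = dict(baseline_scores, s10=3.0)  # one sample collapses
verdict = gate(baseline_scores, candidate_scores)
print(verdict)
```

    In this example the average stays within tolerance while one sample drops six points, which is the case the per-sample breakdown exists to expose. Whether a single regressed sample should block the merge is a policy choice for the team.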

    If you'd rather start from real traffic than hand-built test sets, this is the loop to steal.


    Catch What Offline Gates Miss

    An offline gate answers one question: how would this candidate perform on the benchmark? That's the right question at merge time, but it isn't the same as "how is production behaving?" Offline evals re-generate outputs from your dataset's inputs. They don't judge what your deployed model actually said to a real user this morning. Meanwhile the traffic distribution shifts under your benchmark: new user cohorts, new phrasing, new edge cases your curated set hasn't absorbed yet.

    Closing that gap takes two mechanisms. The first is online evaluation: scoring live production outputs continuously, with sample-rate controls to manage judge cost. That's coming to Catalyst but isn't shipped yet, so treat it as the direction of travel rather than something to deploy today.

    The second is shipped, and it's the one that catches regressions nobody wrote a rubric for: scheduled cross-run trace analysis. Halo is an open-source, MIT-licensed engine that reads OpenTelemetry-compatible spans, decomposes them across many runs to find systemic failure modes, and writes findings that cite the specific trace IDs they came from. Hosted inside Catalyst, Halo can run on an hourly, daily, weekly, or monthly schedule, with a lookback window that pre-fills to match the cadence. A daily run reviewing the last 24 hours of an agent's traces is the post-deploy analog of the CI gate. It flags the recurring failure, names the traces that prove it, and does so before the user complaints aggregate into a pattern someone notices.

    It's scriptable too: inf halo run create starts an analysis, inf halo run poll exits zero only when the run completes, and inf halo conversation get prints the report. A nightly job after each deploy window is a few lines of cron.

    The loop closes where it started. Failures Halo surfaces get curated back into the eval dataset, so the next prompt PR gets gated against the newest known failure mode. The regression suite compounds.

    LLM Testing Tools and Evaluation Frameworks

    LLM evaluation framework options split into two families, and the split is about where test data comes from.

    Config-file frameworks assume you'll write test cases. promptfoo is the canonical example: declarative YAML configs, a CLI, CI/CD integration, and a mix of programmatic assertions and model-graded checks. It's also no longer independent: OpenAI announced in March 2026 that it's acquiring promptfoo, with the tool staying open source and the technology folding into OpenAI's Frontier platform. DeepEval takes the pytest-native route: assert_test() assertions and a deepeval test run command slot LLM evals into the same CI muscle memory as unit tests, with 50+ metrics spanning RAG, agents, and tool use. Evidently approaches it as descriptor checks against golden references (semantic similarity, length, regex, toxicity, sentiment), though its regression-testing tutorial explicitly defers CI integration, LLM-as-judge scoring, and agentic systems.

    Trace-native platforms invert the starting point: capture production behavior first, then test against it. That's the Catalyst model: datasets built from captured traffic, rubric-based judge evals, a CLI gate for CI, and Halo for cross-run analysis.

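    | Tool      | Type                  | Test data source          | CI story        |
    |-----------|-----------------------|---------------------------|-----------------|
    | promptfoo | OSS CLI               | Hand-written YAML         | CLI in pipeline |
    | DeepEval  | OSS framework         | Hand-written pytest cases | pytest-native   |
    | Evidently | OSS library           | Curated golden sets       | Library-level   |
    | Catalyst  | Trace-native platform | Production traces         | inf eval CLI    |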

    Notes: OpenAI announced it is acquiring promptfoo (March 2026); it remains open source. Evidently's regression tutorial defers CI integration and agentic systems. Catalyst builds datasets from captured traffic and adds Halo cross-run analysis after deploy.

    The choice heuristic is one question: where should your test data come from? If hand-written examples genuinely cover your risk (stable task, narrow input space), a config-file LLM testing framework is lightweight and excellent. If your failures keep coming from inputs you didn't anticipate, you need capture and curation, and the eval layer belongs where the traces already are.

    Worked Example: Gating a Prompt Change End to End

    Here's the whole methodology compressed into one change: prompt regression testing in practice. The scenario: a support agent's system prompt gets a one-line addition, "always answer in the user's language," and the team has been burned before by exactly this kind of edit silently degrading tool-calling.

    1. Capture is already running. The agent was instrumented with the tracing SDK at launch: setup() at boot, each session wrapped in agent_span so every run carries its agent and session identity.
    2. Curate the dataset. Filter the last month of the support task's traffic, pull the hard cases (sessions where tool calls failed or users rephrased in frustration) plus a spread of golden paths, and save roughly 60–80 of them as an eval dataset. Hard, stable, and small is the goal.
    3. Write the rubric. An adherence rubric scoring tool-call correctness and task completion against each session's reference response, on a 1–10 scale with each level described.
    4. Run the gate. The PR triggers inf eval run with the rubric version pinned, the dataset fixed, and both the baseline and candidate configurations scored by the same judge model.
    5. Read the verdict. inf eval get shows the candidate's average within a point of baseline, but the per-sample rows reveal three multilingual sessions where tool calls now fail. The PR goes back for one more iteration; the second run comes back clean and merges.
    6. Watch the aftermath. The nightly scheduled Halo run reviews the next day's traces and reports no new recurring failures. When it eventually does find one, that trace becomes next month's regression case.

    Total ceremony added to the PR: one CI job and a few minutes of judge time. Total regressions that reach users without a fight: ideally, none twice.

    Treat Every Prompt Like a Deploy

    LLM apps don't fail loudly. Models drift under stable APIs, deprecations force migrations on someone else's calendar, and a prompt edit can break behavior three features away. AI regression testing is the discipline that turns those surprises into failed CI checks: a dataset curated from production traces, rubrics that turn quality into thresholds, a gate on every prompt, model, and agent PR, and scheduled trace analysis for everything the offline gate can't see.

    Every piece of that loop sits downstream of one prerequisite: captured traffic. Start there: instrument the app, let real sessions accumulate, and your first regression dataset is a filtering exercise instead of a writing project.



    References

    1. Chen, L., Zaharia, M., Zou, J. — "How Is ChatGPT's Behavior Changing over Time?" — https://arxiv.org/abs/2307.09009
    2. OpenAI — Retiring GPT-4o and other ChatGPT models — https://help.openai.com/en/articles/20001051-retiring-gpt-4o-and-other-chatgpt-models
    3. OpenAI — API Deprecations — https://developers.openai.com/api/docs/deprecations
    4. OpenAI — OpenAI to acquire Promptfoo — https://openai.com/index/openai-to-acquire-promptfoo/
    5. promptfoo — https://github.com/promptfoo/promptfoo
    6. DeepEval — Unit Testing in CI/CD — https://deepeval.com/docs/evaluation-unit-testing-in-ci-cd
    7. Evidently AI — LLM Regression Testing Tutorial — https://www.evidentlyai.com/blog/llm-regression-testing-tutorial
    8. Catalyst docs — Traces Quickstart, Gateway, Datasets, Eval, Halo — https://docs.inference.net