
Silent-regression detection

A replay bank plus periodic LLM-judge re-evaluation catches upstream provider degradations that would otherwise go unnoticed.

Why this exists

When OpenAI silently ships a new gpt-4o snapshot that's slightly worse on your workload, you probably won't notice — the model name didn't change, the API didn't change, and average latency looks fine. The regression shows up only if you specifically compare quality on the same prompts over time.

Provara does this automatically.

How it works

  1. Replay bank — periodically captures top-quality historical prompts per cell (tenant-scoped), gated by a judge rating ≥ 4 and selected via embedding-based diversity sampling (see the sketch after this list)
  2. Replay execute — scheduled job re-runs those prompts against the current model-of-record
  3. Judge compare — LLM-as-judge scores the fresh response against the original
  4. Alert — if the median fresh score drops below a threshold, emit a regression event
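
The sketch below pulls the four steps into one Python loop. It is illustrative only: the helper callables (embed, call_model, judge), the similarity cutoff, the bank size, and the alert threshold are assumptions rather than Provara's internals; the parts taken from the steps above are the judge-rating ≥ 4 gate, embedding-based diversity sampling, re-running against the current model-of-record, and alerting on the median fresh score.

# Illustrative sketch of the detection loop; not Provara's actual implementation.
from statistics import median

JUDGE_GATE = 4               # step 1: only prompts the judge rated >= 4 enter the bank
SIMILARITY_CUTOFF = 0.9      # assumed: skip prompts too close to ones already picked
ALERT_THRESHOLD = 3.5        # assumed: median fresh score that triggers an alert

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def capture_replay_bank(history, embed, size=50):
    """Step 1: keep judge-rated >= 4 prompts, then diversity-sample by embedding."""
    good = [h for h in history if h["judge_score"] >= JUDGE_GATE]
    bank, vectors = [], []
    for item in sorted(good, key=lambda h: h["judge_score"], reverse=True):
        v = embed(item["prompt"])
        if all(cosine(v, u) < SIMILARITY_CUTOFF for u in vectors):
            bank.append(item)
            vectors.append(v)
        if len(bank) == size:
            break
    return bank

def run_regression_check(bank, call_model, judge):
    """Steps 2-4: re-run the bank, judge fresh vs. original, alert on a drop."""
    fresh_scores = [
        judge(item["prompt"], item["response"], call_model(item["prompt"]))
        for item in bank
    ]
    if fresh_scores and median(fresh_scores) < ALERT_THRESHOLD:
        return {"type": "regression", "median_fresh_score": median(fresh_scores)}
    return None  # no regression event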

What you get

  • Regression events on /dashboard/quality, or via GET /v1/regression/events (see the polling example after this list)
  • Boosted ε-greedy exploration on affected cells (PROVARA_REGRESSED_EXPLORATION_RATE, default 0.5) so alternative models get tested (see the routing sketch after this list)
  • Resolution UI — dismiss as noise or mark resolved with a note
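
If you want events programmatically rather than via the dashboard, you can poll the endpoint above. A minimal Python sketch: the host and path come from this page, but the PROVARA_TOKEN environment variable and the shape of the response body (a list of event objects) are assumptions.

import os
import requests

# Hypothetical polling script; the response field layout is an assumption.
resp = requests.get(
    "https://gateway.provara.xyz/v1/regression/events",
    headers={"Authorization": f"Bearer {os.environ['PROVARA_TOKEN']}"},
    timeout=10,
)
resp.raise_for_status()
for event in resp.json():
    print(event)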
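
The exploration boost itself is plain ε-greedy routing: on a regressed cell, ε is read from PROVARA_REGRESSED_EXPLORATION_RATE (default 0.5), so roughly half of requests try alternative models until the cell recovers. The sketch below assumes a 5% base exploration rate and a caller-supplied list of alternatives; neither comes from this page.

import os
import random

def pick_model(model_of_record, alternatives, cell_regressed, base_rate=0.05):
    """Epsilon-greedy routing with the boosted rate on regressed cells."""
    epsilon = base_rate  # assumed baseline exploration rate
    if cell_regressed:
        epsilon = float(os.environ.get("PROVARA_REGRESSED_EXPLORATION_RATE", "0.5"))
    if alternatives and random.random() < epsilon:
        return random.choice(alternatives)  # explore an alternative model
    return model_of_record                  # exploit the current model-of-record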

Opt-in

The feature is behind a per-tenant opt-in flag:

curl -X POST https://gateway.provara.xyz/v1/regression/opt-in \
  -H "Authorization: Bearer <token>" \
  -d '{"enabled": true}'

The Free tier is excluded; Pro+ can enable it. See the self-host vs Cloud matrix for tier specifics.