Evals
Run a dataset of prompts against a model (or Provara's own classifier), score each result, and get an aggregate — your golden test set and your prod quality monitor on the same loop.
Tier: Intelligence (paid). Self-host and cloud tenants without a Pro+ subscription get a 402 with an upgrade payload.
What Evals actually does
You give Evals three things:
- A dataset — a list of prompts, each optionally labeled with an expected answer.
- A target — the thing you want to test. Usually a `(provider, model)` pair like `openai/gpt-4o`, but it can also be Provara's own built-in classifier.
- A scorer — how to decide if the target did well on each prompt.
Provara runs every prompt against the target, scores each result, stores outputs + scores + cost, and shows you a per-case breakdown plus an aggregate. Runs are idempotent and replayable — the same dataset, rerun tomorrow, tells you whether anything drifted.
The point: your regression check and your prod quality monitor run on the same scoring logic, against the same models, in the same plane. No drift between what "good" means in staging and in production.
When you'd reach for it
- Pre-deploy regression check. Before flipping a prompt version or migrating a model, rerun the canonical dataset and confirm scores didn't drop.
- Accuracy on a classifier or extractor. You have prompts with a known-correct answer (a label, a field value, a JSON schema). You want to know how often the system gets it right.
- Model comparison. Same dataset, same scorer, two models — see which one is actually better on your traffic shape, not on someone else's benchmark.
- Quality baseline. Lock in a numeric bar a new model needs to clear before you trust it in production.
The three concepts
Dataset
A JSONL file where each line is one test case:
| Field | Required | Description |
|---|---|---|
| `input` | yes | `ChatMessage[]` — exactly what you'd POST to `/v1/chat/completions`. Text or multimodal. |
| `expected` | sometimes | The correct answer. Used by the match-based scorers. Ignored by the LLM judge. |
| `metadata` | no | Arbitrary object. Handy for tagging cases (category, source, ticket ID). |
Example — mixed dataset with open-ended cases and labeled cases:
```
{"input":[{"role":"user","content":"summarize the attached report in one paragraph"}]}
{"input":[{"role":"user","content":"extract the invoice total from this email"}],"expected":"$1,247.50"}
{"input":[{"role":"user","content":"is this message spam? yes or no"}],"expected":"no"}
```

Target
What you're testing. Two kinds:
- A model — pick any `(provider, model)` pair from your registered providers. Provara sends each case's `input` as a completion request and captures the response. Costs tokens.
- The Provara classifier — Provara has an internal classifier that labels every production request with a `taskType` (coding / creative / summarization / qa / general / vision) and a `complexity` (simple / medium / complex). You can evaluate its accuracy by picking Provara classifier as the target. It runs in-process (zero tokens, zero cost) and emits a label like `coding/medium (conf=0.84/0.71)` per case.
Scorer
How Evals decides whether the target did well on a case. Three options:
- LLM judge (1–5) — a cheap grader model reads the prompt + response and rates quality from 1 (terrible) to 5 (excellent). Subjective but captures nuance. The judge model honors the dashboard-pinned judge config (`/dashboard/routing` → Judge); falls back to cheapest-available if the pin isn't registered.
- Exact match (pass/fail) — the target's output must equal the case's `expected` field exactly (whitespace-trimmed, first token). 5 = pass, 1 = fail.
- Regex match (pass/fail) — the target's output must match a JS regex stored in `expected`. 5 = pass, 1 = fail.
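The two match-based scorers are simple enough to sketch. Below is a minimal Python approximation of the documented rules (whitespace-trimmed, first-token comparison for exact match; a regex stored in `expected` for regex match); note Provara evaluates JS regexes, which Python's `re` only approximates, and the function names here are illustrative:

```python
import re

def exact_match_score(output: str, expected: str) -> int:
    # Documented semantics: trim whitespace, compare the first token only.
    # 5 = pass, 1 = fail.
    stripped = output.strip()
    first_token = stripped.split()[0] if stripped else ""
    return 5 if first_token == expected.strip() else 1

def regex_match_score(output: str, expected_pattern: str) -> int:
    # `expected` holds the pattern; any match anywhere in the output passes.
    return 5 if re.search(expected_pattern, output) else 1
```
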
Which scorer should I use?
| Situation | Scorer | Why |
|---|---|---|
| Open-ended chat response, subjective quality | LLM judge | There's no single "right" answer — the judge approximates human taste. |
| Classifier, label extraction, category | Exact match | There IS a right answer. Match-based is faster, cheaper, deterministic. |
| Structured field extraction with variation | Regex match | Accept multiple valid forms (e.g. `\$?1,?247(\.\d{2})?`). |
| Provara classifier as target | Exact match | Classifier output is a label. LLM-judging a label always fails (see below). |
| Code output compared to a reference | LLM judge | Exact match rarely helps (whitespace, variable names). Judge with a clear system prompt on the dataset. |
The trap everyone hits once: picking Provara classifier as the target and leaving the scorer on LLM judge. The classifier outputs a label like `coding/medium`, not a chat response. The judge (correctly) sees that label as a terrible answer to "refactor this function" and scores it 1/5 every time. The dashboard auto-switches to Exact match when you pick the classifier target specifically to prevent this — but if your JSONL has cases where `expected` is missing, match-scored cases will all fail too, so always set an `expected` label per case.
Quickstart
- Open `/dashboard/evals`.
- Click Create dataset, paste JSONL (one case per line):

  ```
  {"input":[{"role":"user","content":"what is the capital of France"}],"expected":"Paris"}
  {"input":[{"role":"user","content":"is this spam: congrats you won"}],"expected":"yes"}
  ```

- Pick the dataset, target, and scorer. Click Run.
- The run executes cases in batches of 4. The run detail page updates live — each case lands with its score and the raw output for inspection.
- When complete, you get an aggregate: an average 1–5 for LLM-judge runs, or a pass rate (% of cases where score = 5) for match-based runs.
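Before pasting, a quick local sanity check of the JSONL can catch malformed cases early. A minimal sketch, with the required shape inferred from the dataset field table above (Provara's real parser may enforce more):

```python
import json

def validate_dataset(jsonl_text: str) -> list:
    """Return a list of human-readable problems; empty means the file looks OK."""
    problems = []
    for lineno, line in enumerate(jsonl_text.strip().splitlines(), start=1):
        try:
            case = json.loads(line)
        except json.JSONDecodeError as e:
            problems.append(f"line {lineno}: invalid JSON ({e.msg})")
            continue
        msgs = case.get("input")
        if not isinstance(msgs, list) or not msgs:
            problems.append(f"line {lineno}: 'input' must be a non-empty message array")
        elif not all(isinstance(m, dict) and "role" in m and "content" in m for m in msgs):
            problems.append(f"line {lineno}: each message needs 'role' and 'content'")
    return problems
```
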
Evaluating the Provara classifier
Build a dataset where each case's `expected` is the correct `taskType`/`complexity` label:

```
{"input":[{"role":"user","content":"write me a python one-liner to reverse a string"}],"expected":"coding/simple"}
{"input":[{"role":"user","content":"summarize this research paper and extract the three main claims"}],"expected":"summarization/medium"}
{"input":[{"role":"user","content":"design a sharded queue with exactly once semantics and backpressure"}],"expected":"coding/complex"}
```

Pick Provara classifier as the target (the scorer will auto-switch to Exact match). The aggregate becomes a classifier-accuracy number on your labeled set; the per-case view shows the predicted label plus the confidence the classifier had. Cases where the classifier got it wrong are the exact examples worth studying to improve the heuristic keywords or LLM-fallback prompt.
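Once a classifier run completes, the per-case results can be sliced locally to surface the misses worth studying. An illustrative sketch: the `label (conf=...)` output format follows the description above, but the `expected`/`output` field names on each result are an assumption:

```python
def classifier_accuracy(results: list) -> tuple:
    """results: [{"expected": "coding/simple", "output": "coding/medium (conf=0.84/0.71)"}, ...]

    Returns (accuracy, misses) where each miss carries the predicted label.
    """
    misses = []
    for r in results:
        predicted = r["output"].split(" ")[0]  # strip the "(conf=...)" suffix
        if predicted != r["expected"]:
            misses.append({**r, "predicted": predicted})
    return 1 - len(misses) / len(results), misses
```
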
Aggregate semantics
- LLM judge runs — the aggregate is the simple average across cases with a non-null score. Failed parses or unreachable judges leave `score: null`; those count toward progress but not the average.
- Match-based runs — the aggregate is `((avgScore - 1) / 4) × 100%`, which collapses to the pass rate when scores are always 1 or 5. Displayed as e.g. `64% pass`.
Both update live during the run so long datasets don't wait for a final reveal.
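Both aggregate rules collapse into one small function. A sketch of the documented formulas, not Provara's code:

```python
def aggregate(scores, match_based):
    # Null scores (failed parses, unreachable judge) count toward run
    # progress but are excluded from the aggregate.
    valid = [s for s in scores if s is not None]
    if not valid:
        return None
    avg = sum(valid) / len(valid)
    if match_based:
        # ((avgScore - 1) / 4) × 100 — collapses to the pass rate
        # when every score is 1 or 5.
        return (avg - 1) / 4 * 100
    return avg  # plain 1–5 average for LLM-judge runs
```
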
API
Same endpoints the dashboard uses:
```
POST   /v1/evals/datasets            { name, description?, jsonl }
GET    /v1/evals/datasets            list
GET    /v1/evals/datasets/:id        detail + first 5 cases preview
POST   /v1/evals/datasets/:id/cases  append a case (used by "Save as eval case")
DELETE /v1/evals/datasets/:id        cascades to dependent runs + results
POST   /v1/evals/runs                { datasetId, provider, model, scorer? }
GET    /v1/evals/runs                list (last 50)
GET    /v1/evals/runs/:id            run + all results + completedCases
```

`scorer` is optional and defaults to `llm-judge`. Valid values: `llm-judge`, `exact-match`, `regex-match`.
`POST /v1/evals/runs` returns immediately (202 Accepted with `{ runId, status: "queued" }`); the executor works in the background. Poll `/runs/:id` to track progress.

To run against the Provara classifier: `{"provider": "provara", "model": "classifier", "scorer": "exact-match"}`.
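Putting the run endpoints together, here is a hedged Python sketch of create-then-poll. The `Authorization: Bearer` header and any status values beyond the documented `"queued"` are assumptions; `fetch` is injectable so the flow can be exercised without a live deployment:

```python
import json
import time
import urllib.request

def create_and_poll_run(base_url, api_key, payload, fetch=None, interval_s=2.0):
    """Create an eval run, then poll /v1/evals/runs/:id until it settles."""
    if fetch is None:
        def fetch(method, path, body=None):
            req = urllib.request.Request(
                base_url + path,
                data=json.dumps(body).encode() if body is not None else None,
                headers={"Authorization": f"Bearer {api_key}",
                         "Content-Type": "application/json"},
                method=method)
            with urllib.request.urlopen(req) as resp:
                return json.loads(resp.read())

    run = fetch("POST", "/v1/evals/runs", payload)  # 202 { runId, status: "queued" }
    while True:
        detail = fetch("GET", f"/v1/evals/runs/{run['runId']}")
        if detail["status"] not in ("queued", "running"):  # assumed non-terminal states
            return detail
        time.sleep(interval_s)
```
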
Cost accounting
Model-target runs write rows to `cost_logs` with a `requestId` of `eval-<runId>-<caseIndex>`. They appear on your spend dashboard alongside production traffic but don't contaminate the adaptive router's EMA — evals deliberately skip the `/v1/chat/completions` path so eval feedback doesn't mix with live feedback.
Classifier-target runs cost $0 — the classifier runs in-process.
Building datasets from production traffic
The fastest way to seed a dataset is from real requests that went through Provara:
- Open `/dashboard/logs`, find the request you want to use as a test case.
- Click Fork to Playground (top-right of the detail page).
- Edit the prompt / settings / model if you want, rerun.
- Click Save as eval case in the top bar — pick an existing dataset or create a new one inline.

Each saved case stores the user→assistant exchange as a single JSONL line with `expected` set to the observed response. Run the dataset later against a candidate model to catch regressions before they ship.
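The shape of a saved case, as described above, is easy to reproduce by hand when scripting dataset construction. An illustrative helper (not Provara's code) that builds the same single-line JSONL from a logged exchange:

```python
import json

def to_eval_case(user_msg, assistant_reply):
    # One JSONL line: the user turn as `input`, the observed reply as `expected`.
    return json.dumps({
        "input": [{"role": "user", "content": user_msg}],
        "expected": assistant_reply,
    })
```
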
Roadmap
Shipped:
- Dataset CRUD + JSONL parsing
- Run executor with bounded concurrency
- LLM judge, exact-match, regex-match scorers
- Classifier target
- Live aggregate updates
- Save-from-playground flow
Deferred:
- CLI + GitHub Action — run a dataset against a candidate target in CI, block the PR on regression. Primary distribution unlock.
- Prompt-version variants — same dataset, two prompt versions, compare scores side-by-side.
- Side-by-side run comparison UI — two completed runs, per-case delta + aggregate delta.
- Fuzzy match scorer — between exact and LLM judge; useful when "close enough" counts (token-level similarity, BLEU, etc.).
Related
- Adaptive routing — judge scoring for live traffic
- Silent-regression detection — replay bank for post-hoc regression catch; evals are the pre-deploy counterpart
- Analytics — where eval cost shows up on the dashboard