Guardrails
PII detection, prompt injection firewalling, content policies, and custom regex rules with redact / flag / block actions.
Guardrails sit in the request path alongside the classifier and rate limiter. Rules are stored per-tenant; each rule can target input, output, or both.
Built-in detectors
- PII — SSN, credit card, email, phone, IP address
- Prompt Injection Firewall — signature-based checks for instruction override, system prompt extraction, role takeover, and delimiter injection attempts
- Content — built-in policy presets
- Regex — custom tenant-defined patterns
- Token limit — refuse requests exceeding a max token count
Prompt Injection Firewall
The Prompt Injection Firewall is an opt-in preset backed by built-in input rules. It blocks common jailbreak and prompt-leakage signatures before the request reaches provider routing.
The default path is intentionally fast and deterministic. It catches obvious attacks such as "ignore previous instructions", "show your system prompt", role takeover prompts, and fake system-message delimiters. Semantic context scans and hybrid judge mode are available through the scan API.
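A signature pass of this kind can be sketched with a handful of case-insensitive regular expressions. This is a minimal sketch only: the patterns and the `scan_signatures` name are illustrative assumptions, not the rule set Provara ships.

```python
import re

# Hypothetical signatures, one per attack family named above.
# Real rule sets would be broader and tuned against false positives.
SIGNATURES = {
    "instruction_override": re.compile(
        r"\b(ignore|disregard|forget)\s+(all\s+)?(previous|prior|above)\s+instructions\b",
        re.I,
    ),
    "system_prompt_extraction": re.compile(
        r"\b(show|reveal|print|repeat)\s+(me\s+)?(your|the)\s+system\s+prompt\b",
        re.I,
    ),
    "role_takeover": re.compile(r"\byou\s+are\s+now\s+\w+", re.I),
    "delimiter_injection": re.compile(
        r"<\|im_start\|>\s*system|\[/?INST\]", re.I
    ),
}


def scan_signatures(text: str) -> list[str]:
    """Return the name of every signature family that matches the text."""
    return [name for name, pattern in SIGNATURES.items() if pattern.search(text)]
```

Because the check is pure pattern matching, the same input always yields the same result, which is what keeps the default path fast and deterministic.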
Context scanning
Use POST /v1/admin/guardrails/scan to check untrusted text before adding it to model context. The scan endpoint accepts a source so applications can distinguish direct user input from retrieved content and tool output:
{
  "source": "retrieved_context",
  "content": "Document text or tool output to inspect"
}
Supported sources are user_input, retrieved_context, tool_output, and model_output. Retrieved context and tool output are never rewritten by the scan endpoint; a blocking match returns decision: "quarantine" so callers can drop that chunk without corrupting JSON or structured tool results. Direct user input and model output can still return decision: "redact" when a redact rule applies.
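The sketch below shows one way a client might call the scan endpoint and drop quarantined chunks before building model context. The base URL, the bearer-token header, and the helper names are assumptions for illustration; only the path, the request fields, and the decision value come from the description above.

```python
import json
import urllib.request

SOURCES = {"user_input", "retrieved_context", "tool_output", "model_output"}


def build_scan_request(base_url: str, api_key: str, source: str, content: str) -> urllib.request.Request:
    """Build the POST request for the scan endpoint (auth scheme is assumed)."""
    if source not in SOURCES:
        raise ValueError(f"unsupported source: {source}")
    body = json.dumps({"source": source, "content": content}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/admin/guardrails/scan",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # assumption: bearer auth
        },
    )


def scan(base_url: str, api_key: str, source: str, content: str) -> dict:
    """Send the scan request and return the decoded JSON response."""
    request = build_scan_request(base_url, api_key, source, content)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def keep_safe_chunks(chunks: list[str], scan_fn) -> list[str]:
    """Drop every chunk whose scan decision is "quarantine"."""
    return [c for c in chunks if scan_fn(c).get("decision") != "quarantine"]
```

In use, `scan_fn` would typically be a small wrapper such as `lambda c: scan(base_url, api_key, "retrieved_context", c)`, so that each retrieved chunk is checked on its own and a single poisoned document does not discard the whole retrieval batch.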
By default scans run in signature mode. Callers can opt into semantic classification by sending mode: "semantic" or mode: "hybrid". Semantic mode asks the configured judge model to classify prompt-injection risk and returns semantic.flagged, semantic.confidence, semantic.riskLevel, semantic.category, semantic.evidence, and semantic.recommendedAction. Hybrid mode runs the judge only when the signature scan already found something suspicious.
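The difference between the three modes comes down to when the judge is invoked. The following is a minimal sketch of that dispatch, assuming the signature scan always runs first and that `signature_scan` and `judge` are caller-supplied callables; the `signatureMatches` field name is hypothetical.

```python
from typing import Callable

MODES = {"signature", "semantic", "hybrid"}


def run_scan(
    text: str,
    mode: str,
    signature_scan: Callable[[str], list],
    judge: Callable[[str], dict],
) -> dict:
    """Run the signature scan, then call the judge only when the mode requires it."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    matches = signature_scan(text)
    result = {"mode": mode, "signatureMatches": matches, "semantic": None}
    # semantic: always ask the judge.
    # hybrid: ask the judge only if the signature scan found something.
    if mode == "semantic" or (mode == "hybrid" and matches):
        result["semantic"] = judge(text)
    return result
```

Hybrid mode therefore spends a judge-model call only on text that already looks suspicious, while clean text stays on the deterministic path.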
Tool-call alignment
For non-streaming chat completions, Provara checks model-requested tool_calls before returning them to the client. The alignment check blocks undeclared tools, invalid JSON arguments, and arguments that appear to smuggle prompt-injection or secret-exfiltration instructions. Obvious lexical intent mismatches are flagged in the logs but allowed through, so the MVP avoids blocking legitimate short prompts.
When a dangerous tool call is blocked, the chat response returns HTTP 400 with error.code: "tool_call_alignment_blocked" and writes a Tool-call alignment guardrail log. For streaming responses, enforcement buffers tool-call deltas until the alignment checks pass, so a blocked call is never emitted to the client.
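The three blocking conditions can be sketched as a single check over one tool call. This is an illustration under stated assumptions: the tool call uses the common `{"function": {"name", "arguments"}}` shape with arguments as a JSON string, and the `SUSPICIOUS` pattern, the reason strings, and the `check_tool_call` name are hypothetical rather than Provara's implementation.

```python
import json
import re

# Hypothetical pattern for injected instructions or secret-exfiltration hints.
SUSPICIOUS = re.compile(
    r"ignore\s+(all\s+)?previous\s+instructions|system\s+prompt|api[_\s-]?key",
    re.I,
)


def check_tool_call(tool_call: dict, declared_tools: set[str]) -> tuple[str, str]:
    """Return (decision, reason) for one model-requested tool call."""
    function = tool_call.get("function", {})
    name = function.get("name", "")
    if name not in declared_tools:
        return ("block", "undeclared_tool")
    try:
        arguments = json.loads(function.get("arguments", ""))
    except (json.JSONDecodeError, TypeError):
        return ("block", "invalid_json_arguments")
    # Re-serialize so nested string values are searched as well.
    if SUSPICIOUS.search(json.dumps(arguments)):
        return ("block", "suspicious_arguments")
    return ("allow", "ok")
```

For a streaming response, the same check would run once the buffered deltas for a tool call form a complete argument string, which is why nothing is emitted before the check passes.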
Advanced firewall activity is also written to firewall_events for audit and analytics. Events record the surface (scan or tool_call_alignment), source, mode, decision, action, confidence/category fields when a semantic judge runs, tool name when applicable, and request/tenant metadata. Admins can fetch recent events from GET /v1/admin/guardrails/firewall/events.
Semantic and hybrid scan modes are gated to tenants with Intelligence access. Signature scans remain the default deterministic path.
Tenant admins can tune firewall defaults through GET/PATCH /v1/admin/guardrails/firewall/settings. The defaults preserve the safe path: defaultScanMode: "signature", toolCallAlignment: "block", and streamingEnforcement: true. Teams can switch tool-call alignment to flag for observation mode or off for a temporary bypass while leaving scan defaults unchanged.
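A client-side helper can validate a settings change before sending it. The sketch below assumes that PATCH accepts a partial body and that defaultScanMode takes the three scan modes described earlier; the `build_settings_patch` name is hypothetical.

```python
# Documented defaults for the firewall settings.
DEFAULTS = {
    "defaultScanMode": "signature",
    "toolCallAlignment": "block",
    "streamingEnforcement": True,
}

# Allowed values per field (scan modes are an assumption based on the scan API).
ALLOWED = {
    "defaultScanMode": {"signature", "semantic", "hybrid"},
    "toolCallAlignment": {"block", "flag", "off"},
}


def build_settings_patch(**changes) -> dict:
    """Validate the requested changes and return the PATCH body."""
    for key, value in changes.items():
        if key not in DEFAULTS:
            raise KeyError(f"unknown setting: {key}")
        if key in ALLOWED and value not in ALLOWED[key]:
            raise ValueError(f"invalid value for {key}: {value!r}")
        if key == "streamingEnforcement" and not isinstance(value, bool):
            raise TypeError("streamingEnforcement must be a boolean")
    return dict(changes)
```

For example, `build_settings_patch(toolCallAlignment="flag")` yields a body that switches tool-call alignment to observation mode and leaves the scan defaults untouched.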
Actions
- block — refuse the request (HTTP 400)
- redact — replace the matched span with [REDACTED] before forwarding upstream
- flag — log but allow through
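How the three actions combine on a single piece of text can be sketched as follows. The `Rule` class, the `apply_rules` function, and the shape of the returned dictionary are hypothetical; the sketch assumes rules are evaluated in order and that the first matching block rule short-circuits the rest.

```python
import re
from dataclasses import dataclass


@dataclass
class Rule:
    name: str
    pattern: re.Pattern
    action: str  # "block", "redact", or "flag"


def apply_rules(text: str, rules: list[Rule]) -> dict:
    """Apply rules in order; the first matching block rule stops processing."""
    flags = []
    for rule in rules:
        if not rule.pattern.search(text):
            continue
        if rule.action == "block":
            # Refuse the request outright.
            return {"status": 400, "text": None, "blockedBy": rule.name, "flags": flags}
        if rule.action == "redact":
            # Replace every matched span before forwarding upstream.
            text = rule.pattern.sub("[REDACTED]", text)
        elif rule.action == "flag":
            # Record the hit but leave the text unchanged.
            flags.append(rule.name)
    return {"status": 200, "text": text, "blockedBy": None, "flags": flags}
```

With a redact rule for SSNs, an input such as "SSN 123-45-6789" would be forwarded as "SSN [REDACTED]", while a flag rule on the same request only records the rule name.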
Logging
Every guardrail hit writes an entry to guardrail_logs with the requestId, tenantId, rule name, matched text (redacted), and action taken. Logs are viewable at /dashboard/guardrails.
Tenant-scoped vs global
Built-in detectors are stored as tenant-scoped rules flagged builtIn=true, so each tenant can enable or disable the set Provara ships. Custom regex rules are created entirely by the tenant.