# Vision / multimodal
Send image content to vision-capable models through the same `/v1/chat/completions` endpoint.

Provara accepts OpenAI-shape multimodal content arrays: a user message's `content` can be a plain string or an array of parts, where each part is either text or an image URL. The gateway translates these parts into each provider's native format before making the upstream call.
`POST /v1/chat/completions`

```json
{
  "messages": [
    {
      "role": "user",
      "content": [
        { "type": "text", "text": "What's wrong with this breaker panel?" },
        { "type": "image_url", "image_url": { "url": "data:image/jpeg;base64,..." } }
      ]
    }
  ]
}
```

Both `http(s)` URLs and `data:<mime>;base64,<payload>` URIs are accepted. Data URIs are the portable option: they work across every provider without a pre-upload step.
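To send a local file, base64-encode its bytes into a data URI and place it in an `image_url` part. A minimal sketch — the `buildVisionRequest` helper is ours, not part of the gateway:

```typescript
// Hypothetical helper: build a vision chat request body from raw image bytes.
function buildVisionRequest(imageBytes: Buffer, mime: string, question: string) {
  const b64 = imageBytes.toString("base64");
  return {
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: question },
          // Data URI form: works across every provider without pre-upload.
          { type: "image_url", image_url: { url: `data:${mime};base64,${b64}` } },
        ],
      },
    ],
  };
}
```

The resulting object is the body of a normal `POST /v1/chat/completions` call; nothing else about the request changes for vision traffic.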
## Routing

When any message carries an `image_url` part, the routing engine restricts candidates to models known to accept images. Text-only models are excluded even if they are cheaper. If no vision-capable provider is registered, the request returns:

```
502 no_capable_provider
Request contains image content but no registered vision-capable model is available.
```

Vision requests are classified into their own adaptive-routing cell (`taskType: "vision"`, `complexity: "complex"`), so learning on vision workloads doesn't contaminate text cells and vice versa. A/B tests are skipped entirely for vision requests, because their variants can't be safely routed when a model might be text-only.
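The capability filter can be sketched as follows. The model names and type shapes here are illustrative; the real set lives in `packages/gateway/src/routing/model-capabilities.ts`:

```typescript
// Illustrative subset — the gateway's actual VISION_CAPABLE set is larger.
const VISION_CAPABLE = new Set(["gpt-4o", "gpt-4o-mini", "claude-sonnet-4-6", "gemini-2.5-flash"]);

type Part = { type: "text"; text: string } | { type: "image_url"; image_url: { url: string } };
type Message = { role: string; content: string | Part[] };

// True when any message carries an image_url part.
function hasImage(messages: Message[]): boolean {
  return messages.some(
    (m) => Array.isArray(m.content) && m.content.some((p) => p.type === "image_url"),
  );
}

function filterCandidates(messages: Message[], candidates: string[]): string[] {
  if (!hasImage(messages)) return candidates; // text-only: no restriction
  const capable = candidates.filter((m) => VISION_CAPABLE.has(m));
  if (capable.length === 0) throw new Error("no_capable_provider"); // surfaces as HTTP 502
  return capable;
}
```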
## Supported providers and models
| Provider | Vision-capable models |
|---|---|
| OpenAI | gpt-4o, gpt-4o-mini, gpt-4.1, gpt-4.1-mini |
| Anthropic | claude-opus-4-6, claude-sonnet-4-6, claude-haiku-4-5-* |
| Google | gemini-2.5-pro, gemini-2.5-flash, gemini-2.0-flash |
| Z.ai | glm-5v-turbo |
| Custom / Ollama | Pass-through — model-gated; see "Custom models" below |
The list lives in `packages/gateway/src/routing/model-capabilities.ts` (the `VISION_CAPABLE` set). User-pinned model overrides bypass this filter on the assumption that you know what you're doing; the upstream will reject with its own error if the model can't actually see.
## Custom models

For custom (OpenAI-compatible) providers registered via `/dashboard/providers`, set the `modalities` column on the `model_registry` row to `["text", "image"]` to opt the model into the vision candidate pool. The default is `["text"]`.
## Provider translation

The gateway translates our OpenAI-shaped `image_url` parts into each provider's native format:

- OpenAI and OpenAI-compatible providers (Mistral, xAI, Z.ai, Ollama): passed through unchanged. The OpenAI SDK handles the array shape natively; whether the target model actually supports vision is gated upstream.
- Anthropic: `data:<mime>;base64,<data>` → `{type: "image", source: {type: "base64", media_type, data}}`; plain URLs → `{type: "image", source: {type: "url", url}}`.
- Google / Gemini: data URIs → `{inlineData: {mimeType, data}}`; plain URLs → `{fileData: {mimeType, fileUri}}`. Gemini's `fileData` expects a URI it can fetch; use the Files API upstream if you need long-lived references.
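The Anthropic branch of this translation can be sketched like so. The output field names follow Anthropic's Messages API as described above; the helper function itself is illustrative, not the gateway's actual code:

```typescript
type ImagePart = { type: "image_url"; image_url: { url: string } };

// Split a data URI into media type and base64 payload; fall back to the
// plain-URL source shape for anything that isn't a data URI.
function toAnthropicImage(part: ImagePart) {
  const m = part.image_url.url.match(/^data:([^;]+);base64,(.*)$/);
  if (m) {
    return { type: "image", source: { type: "base64", media_type: m[1], data: m[2] } };
  }
  return { type: "image", source: { type: "url", url: part.image_url.url } };
}
```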
## Cache interaction
Image-bearing requests skip both exact-match and semantic caches:
- Exact-match would never hit — image bytes make every prompt unique.
- Semantic embeddings are text-only in the current release, so a match would ignore the image content entirely.
This means the "Tokens Saved" dashboard metric always stays at 0 for vision traffic. That's expected, not a bug. See analytics for context.
## Guardrails
Input guardrails run on the text portion of each message only. If a rule fires:
- Block — the whole request is rejected (including the image).
- Redact — text parts are replaced with the redacted string; image parts pass through unchanged.
If you need PII scanning inside image content, that's an OCR-first pipeline on the client side before the request reaches the gateway.
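The redact behavior over a parts array can be sketched as follows — text parts are rewritten, image parts pass through untouched. Names are illustrative:

```typescript
type Part = { type: "text"; text: string } | { type: "image_url"; image_url: { url: string } };

// Apply a redaction function to every text part; leave image parts as-is.
function redactParts(parts: Part[], redact: (t: string) => string): Part[] {
  return parts.map((p) => (p.type === "text" ? { ...p, text: redact(p.text) } : p));
}
```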
## Regression detection

Image-bearing requests are excluded from the replay bank by default. Running regression on vision costs roughly 10–15× the token cost of text (image tokens across candidate replays plus the judge-model call), and the heuristic judge struggles to grade subjective visual outputs reliably.

The filter operates at two levels: whole cells with `taskType = "vision"` are skipped in the bank-population cycle, and any individual stored prompt that contains an `image_url` part is also skipped (defense in depth against rows that pre-date the cell-level filter).

Set `PROVARA_REGRESSION_INCLUDE_VISION=true` to opt back in. Budget accordingly: a single vision replay event can cost more than a day of text regression on a busy cell.
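The two-level filter amounts to a predicate like the following. The row shape and function name are ours, assumed for illustration:

```typescript
type StoredPrompt = { taskType: string; promptJson: string };

// Defense-in-depth eligibility check for bank population:
// cell-level skip first, then a row-level scan of the stored prompt JSON.
function eligibleForReplayBank(row: StoredPrompt, includeVision: boolean): boolean {
  if (includeVision) return true; // PROVARA_REGRESSION_INCLUDE_VISION=true
  if (row.taskType === "vision") return false; // cell-level skip
  if (row.promptJson.includes('"image_url"')) return false; // row-level skip
  return true;
}
```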
## Known limitations

- Pricing: the gateway's cost calculation uses the model's per-token text pricing; vision-specific image-token pricing (OpenAI's "detail" tiers, Anthropic's image-tile counts) isn't yet factored in. Cost figures for vision requests are lower bounds, not exact.
- Classifier heuristics: the keyword classifier is text-only. Image-bearing requests short-circuit to `taskType: "vision"` without consulting the classifier. That's correct routing, but it does mean mixed image+code requests don't get the "coding" label.
- Replay bank: regression detection replays stored prompts against candidate models. Image messages can still be replayed (the `data:` URI is preserved in the stored prompt JSON), but the stored prompt row can grow large; operators with high image volume should monitor the `requests.prompt` column size.
- Playground: no web upload UI yet; use the API directly for vision testing.
## Related

- Analytics — where vision routing shows up on the dashboard
- Adaptive routing — how the `vision` cell learns separately
- Semantic cache — cache behavior (skipped for vision)