Features
Semantic cache
Two-layer response caching — exact-match (in-memory) + semantic similarity via embeddings.
Every cacheable request (temperature = 0 and not part of an active A/B test) checks two caches before hitting a provider:
- Exact match — in-memory hash of `(messages, provider, model)`. Sub-millisecond.
- Semantic match — OpenAI embedding of the messages, compared against prior cached embeddings via cosine similarity. Hit when similarity >= `PROVARA_SEMANTIC_CACHE_THRESHOLD` (default 0.97).
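The two-layer lookup can be sketched as follows. This is an illustrative Python sketch, not the gateway's actual implementation: the cache data structures, the `embed` callable, and all function names here are assumptions, and only the `(messages, provider, model)` key and the 0.97 cosine threshold come from this document.

```python
import hashlib
import json
import math

def exact_key(messages, provider, model):
    # Deterministic hash of (messages, provider, model) for the exact-match layer.
    payload = json.dumps(
        {"messages": messages, "provider": provider, "model": model},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def lookup(messages, provider, model, exact_cache, semantic_cache, embed,
           threshold=0.97):
    # Layer 1: exact in-memory hash lookup (sub-millisecond).
    key = exact_key(messages, provider, model)
    if key in exact_cache:
        return exact_cache[key]
    # Layer 2: embed the messages (e.g. via an OpenAI embedding call) and
    # compare against prior cached embeddings; hit when similarity >= threshold.
    query = embed(messages)
    for cached_embedding, response in semantic_cache:
        if cosine(query, cached_embedding) >= threshold:
            return response
    return None  # miss on both layers: fall through to the provider
```

On a miss, the gateway would call the provider and store both the exact key and the embedding for future requests.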
Hits return immediately; the gateway logs `cached=true` and reports `tokensSavedInput` / `tokensSavedOutput` for the dashboard's savings calculation.
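The savings report on a hit might look like the sketch below, assuming each cache entry stores the original call's token usage. The entry field names (`inputTokens`, `outputTokens`) are illustrative; only `cached`, `tokensSavedInput`, and `tokensSavedOutput` come from this document.

```python
def hit_report(cached_entry):
    # On a cache hit, report the tokens the provider call would have consumed.
    # `inputTokens` / `outputTokens` are hypothetical per-entry usage fields.
    return {
        "cached": True,
        "tokensSavedInput": cached_entry["inputTokens"],
        "tokensSavedOutput": cached_entry["outputTokens"],
    }
```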
Tuning
| Env var | Default | Purpose |
|---|---|---|
| `PROVARA_SEMANTIC_CACHE_ENABLED` | `true` | Hard off-switch for the semantic layer |
| `PROVARA_SEMANTIC_CACHE_THRESHOLD` | `0.97` | Cosine similarity required for a match |
| `PROVARA_EMBEDDING_MODEL` | `text-embedding-3-small` | Must be one of OpenAI's embedding models |
| `PROVARA_EMBEDDING_PROVIDER` | `openai` | Only OpenAI is supported in MVP |
The exact-match cache is always on; the semantic cache depends on having an OpenAI API key (env or DB-stored) for embeddings.
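Reading these knobs with their documented defaults could look like this minimal Python sketch; the function name is an assumption, but the variable names and defaults match the table above.

```python
import os

def semantic_cache_config():
    # Read the tuning env vars, falling back to the documented defaults.
    return {
        "enabled": os.getenv("PROVARA_SEMANTIC_CACHE_ENABLED", "true").lower() == "true",
        "threshold": float(os.getenv("PROVARA_SEMANTIC_CACHE_THRESHOLD", "0.97")),
        "embedding_model": os.getenv("PROVARA_EMBEDDING_MODEL", "text-embedding-3-small"),
        "embedding_provider": os.getenv("PROVARA_EMBEDDING_PROVIDER", "openai"),
    }
```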