Usage
The TensorZero Gateway supports the following cache modes:

- `write_only` (default): Only write to the cache but don’t serve cached responses
- `read_only`: Only read from the cache but don’t write new entries
- `on`: Both read from and write to the cache
- `off`: Disable caching completely
Example
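The following is a minimal sketch of an inference request with caching enabled, using the gateway's HTTP inference endpoint from Python. It assumes a gateway running at `http://localhost:3000`, a hypothetical `generate_haiku` function defined in your configuration, and that the cache mode is passed via a `cache_options.enabled` field on the request:

```python
import requests

# Assumptions: a TensorZero Gateway running locally and a `generate_haiku`
# function defined in your configuration (both hypothetical here).
response = requests.post(
    "http://localhost:3000/inference",
    json={
        "function_name": "generate_haiku",
        "input": {
            "messages": [
                {"role": "user", "content": "Write a haiku about caching."}
            ]
        },
        # "on" reads from and writes to the cache; the other modes are
        # "write_only" (default), "read_only", and "off".
        "cache_options": {"enabled": "on"},
    },
)
response.raise_for_status()
print(response.json())
```

Setting `enabled` to `"on"` lets the same request be served from the cache on subsequent calls; the default `write_only` mode would populate the cache without ever reading from it.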
Technical Notes
- The cache applies to individual model requests, not inference requests. This means that the following will be cached separately: multiple variants of the same function; multiple calls to the same function with different parameters; individual model requests for inference-time optimizations; and so on.
- The `max_age_s` parameter applies to the retrieval of cached responses. The cache does not automatically delete old entries (i.e. it is not a TTL). See the sketch after this list.
- When the gateway serves a cached response, the usage fields are set to zero.
- The cache data is stored in ClickHouse.
- For batch inference, the gateway only writes to the cache but does not serve cached responses.
- Inference caching also works for embeddings, using the same cache modes and options as chat completion inference. Caching works for single-input embedding requests; batch embedding requests (multiple inputs) will write to the cache but won’t serve cached responses.
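To illustrate the `max_age_s` note above, here is a hedged sketch (same assumed gateway URL and function name as the earlier example) that reads from the cache only if the cached entry is less than an hour old; older entries remain stored in ClickHouse but are ignored at retrieval time:

```python
import requests

GATEWAY_URL = "http://localhost:3000/inference"  # assumed local gateway

payload = {
    "function_name": "generate_haiku",  # hypothetical function name
    "input": {
        "messages": [
            {"role": "user", "content": "Write a haiku about caching."}
        ]
    },
    # Read and write the cache, but only serve entries written within
    # the last hour; older entries are ignored (not deleted).
    "cache_options": {"enabled": "on", "max_age_s": 3600},
}

response = requests.post(GATEWAY_URL, json=payload)
response.raise_for_status()
print(response.json()["usage"])  # zeroed usage indicates a cache hit
```

On a cache hit, the usage fields in the response are zero, which is a convenient way to confirm that the response was served from the cache rather than a fresh model call.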