Prompt Caching
Reduce latency and cost by caching frequently reused prompt content. When the same content appears across requests, the provider can skip reprocessing cached tokens and charge a reduced rate.
Prompt caching is handled by the underlying providers. RouterHub passes through caching directives and reports cache usage in the response.
How It Works
Each provider implements prompt caching differently, but the core idea is the same: prefix content that is identical across requests is processed once and reused from cache on subsequent calls.
| Provider | Mechanism | Minimum Cacheable Tokens | Cache Lifetime |
|---|---|---|---|
| Claude | Explicit breakpoints via cache_control | 1,024 tokens (Haiku 4.5, Opus 4.5, Opus 4.6: 4,096) | 5 minutes (default) or 1 hour (where supported) |
| Gemini | Automatic; the longest matching prefix is cached | Provider-managed | Automatic (provider-managed) |
| GPT | Automatic prefix caching; optional prompt_cache_key for explicit control | 1,024 tokens | Automatic (minutes); up to 24 hours with prompt_cache_retention on supported models |
Claude: Explicit Cache Breakpoints
Claude uses explicit cache_control breakpoints. You mark the point in your prompt where the cache should end using cache_control on content blocks, system blocks, and tool definitions. Everything up to and including the marked block becomes a cacheable prefix.
Supported Locations
cache_control can be placed on:
- System text blocks
- Tool definitions
- User message content blocks (text, image, document, tool_result)
- Assistant message content blocks (text, tool_use)
RouterHub applies cache_control as per-block breakpoints on individual content blocks, system blocks, and tools. This approach is supported by all Claude backends (AWS Bedrock, GCP Vertex, and direct Anthropic API). The top-level cache_control convenience field (which auto-applies a breakpoint to the last cacheable block) is not used, as it is not supported on Bedrock and Vertex.
cache_control Object
| Field | Type | Required | Description |
|---|---|---|---|
| type | string | Required | "ephemeral" |
| ttl | string | Optional | "5m" (default) or "1h" (where supported by the model and backend) |
Anthropic Format (/v1/messages)
Add cache_control directly on content blocks, system blocks, and tools. RouterHub passes these through to Claude.
Caching a system prompt
curl https://api.routerhub.ai/v1/messages \
-H "x-api-key: $ROUTERHUB_API_KEY" \
-H "anthropic-version: 2023-06-01" \
-H "Content-Type: application/json" \
-d '{
"model": "anthropic/claude-sonnet-4",
"max_tokens": 1024,
"system": [
{
"type": "text",
"text": "You are a legal assistant. Here is the full text of the contract to analyze: ... (long document) ...",
"cache_control": {"type": "ephemeral"}
}
],
"messages": [
{"role": "user", "content": "Summarize the key terms."}
]
}'
from anthropic import Anthropic
client = Anthropic(
base_url="https://api.routerhub.ai",
api_key="YOUR_API_KEY",
)
message = client.messages.create(
model="anthropic/claude-sonnet-4",
max_tokens=1024,
system=[
{
"type": "text",
"text": "You are a legal assistant. Here is the full text of the contract to analyze: ... (long document) ...",
"cache_control": {"type": "ephemeral"},
}
],
messages=[
{"role": "user", "content": "Summarize the key terms."}
],
)
print(message.content[0].text)
Caching tools and conversation history
{
"model": "anthropic/claude-sonnet-4",
"max_tokens": 1024,
"tools": [
{
"name": "search_database",
"description": "Search the product database",
"input_schema": {
"type": "object",
"properties": {
"query": {"type": "string"}
},
"required": ["query"]
},
"cache_control": {"type": "ephemeral"}
}
],
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Here is the product catalog: ... (large text) ...",
"cache_control": {"type": "ephemeral"}
},
{
"type": "text",
"text": "Find products related to wireless charging."
}
]
}
]
}
Using 1-hour TTL
For content that should remain cached longer, set the TTL to "1h" (where supported by the model and backend):
{
"type": "text",
"text": "Reference documentation that rarely changes...",
"cache_control": {"type": "ephemeral", "ttl": "1h"}
}
OpenAI Format (/v1/chat/completions)
When using the OpenAI-compatible endpoint with Claude models, explicit cache breakpoints are not available and automatic prefix caching is not supported. To use prompt caching with Claude, use the /v1/messages endpoint with explicit cache_control on your content blocks.
Gemini: Automatic Caching
Gemini models automatically cache the longest matching prefix of your prompt. No explicit markup is needed. If successive requests share the same leading content (system prompt, initial messages, etc.), Gemini reuses the cached prefix.
Best Practices
- Place static content (system prompts, reference documents) at the beginning of the conversation
- Keep the order of messages consistent across requests
- Append new messages at the end rather than modifying earlier ones
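Because automatic caching depends on an exact leading match, it can be useful to verify client-side that successive payloads really share a prefix before relying on it. A minimal sketch in plain Python (no API call involved; the helper name is our own):

```python
import json

def shared_prefix_length(messages_a, messages_b):
    """Count how many leading messages are identical between two
    request payloads. Automatic prefix caching only applies to this
    shared leading portion."""
    count = 0
    for a, b in zip(messages_a, messages_b):
        # Serialize with sorted keys so dict key ordering doesn't matter.
        if json.dumps(a, sort_keys=True) != json.dumps(b, sort_keys=True):
            break
        count += 1
    return count

first = [
    {"role": "system", "content": "You are an expert code reviewer. ..."},
    {"role": "user", "content": "Review the authentication module."},
]
second = [
    {"role": "system", "content": "You are an expert code reviewer. ..."},
    {"role": "user", "content": "Now review the database layer."},
]

print(shared_prefix_length(first, second))  # prints 1: only the system message is shared
```

If this reports 0 when you expected a shared prefix, look for nondeterministic content (timestamps, request IDs) leaking into the early messages.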
Example
curl https://api.routerhub.ai/v1/chat/completions \
-H "Authorization: Bearer $ROUTERHUB_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "google/gemini-2.5-pro",
"messages": [
{"role": "system", "content": "You are an expert code reviewer. Here is the full codebase: ... (large context) ..."},
{"role": "user", "content": "Review the authentication module."}
]
}'
On the next request, keep the system message identical and only change the user message to benefit from automatic caching:
curl https://api.routerhub.ai/v1/chat/completions \
-H "Authorization: Bearer $ROUTERHUB_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "google/gemini-2.5-pro",
"messages": [
{"role": "system", "content": "You are an expert code reviewer. Here is the full codebase: ... (large context) ..."},
{"role": "user", "content": "Now review the database layer."}
]
}'
GPT: Automatic and Explicit Caching
GPT models automatically cache prompt prefixes of 1,024 tokens or more. No explicit markup is needed for basic caching. Cached input tokens are billed at a reduced rate (up to 90% less than base input price) with no separate cache-write fee.
For more control, RouterHub passes through two optional parameters:
| Field | Type | Description |
|---|---|---|
| prompt_cache_key | string | A hint to improve cache routing. Requests with the same key are more likely to hit the same cached prefix, but cache hits still require an exact prefix match. |
| prompt_cache_retention | string | Cache retention policy: "in_memory" (default automatic behavior) or "24h" (retain for up to 24 hours, on supported models). |
Best Practices
- Structure prompts so that static content (system instructions, few-shot examples, reference documents) comes first
- Keep the shared prefix identical across requests — even small changes invalidate automatic caching
- Dynamic content (the actual user query) should go at the end
- Use prompt_cache_key to improve cache routing for related requests that reuse the same exact prompt prefix
- Set prompt_cache_retention to "24h" for workloads where the same prompt is reused over longer periods (on supported models)
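One convenient way to follow these practices is to derive the cache key from the static prefix itself, so related requests automatically share a key. This is a sketch of one possible scheme, not an official convention; the function name and versioning prefix are our own:

```python
import hashlib

def derive_cache_key(static_prefix: str, version: str = "v1") -> str:
    """Derive a stable prompt_cache_key from the static part of the
    prompt. Requests built from the same prefix get the same key,
    which helps them route to the same cached prefix. The key is a
    routing hint only; cache hits still need an exact prefix match."""
    digest = hashlib.sha256(static_prefix.encode("utf-8")).hexdigest()[:16]
    return f"{version}-{digest}"

policies = "You are a customer support agent. Here are the support policies: ..."
key = derive_cache_key(policies)
# Pass `key` as prompt_cache_key in the request body; identical
# policies always yield an identical key.
print(key)
```

Bumping the version string when the policies document changes keeps old and new prompt variants from competing for the same cache routing.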
Example: Automatic Caching
curl https://api.routerhub.ai/v1/chat/completions \
-H "Authorization: Bearer $ROUTERHUB_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "openai/gpt-4.1",
"messages": [
{"role": "system", "content": "You are a customer support agent. Here are the support policies: ... (large document) ..."},
{"role": "user", "content": "A customer wants to return a product after 45 days."}
]
}'
Example: Explicit Cache Key with 24h Retention
curl https://api.routerhub.ai/v1/chat/completions \
-H "Authorization: Bearer $ROUTERHUB_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "openai/gpt-4.1",
"prompt_cache_key": "support-policies-v2",
"prompt_cache_retention": "24h",
"messages": [
{"role": "system", "content": "You are a customer support agent. Here are the support policies: ... (large document) ..."},
{"role": "user", "content": "A customer wants to return a product after 45 days."}
]
}'
Cache Usage in Responses
RouterHub reports cache hit and creation metrics in the response usage object so you can verify caching is working.
OpenAI Format
Cache usage appears in usage.prompt_tokens_details:
{
"usage": {
"prompt_tokens": 12500,
"completion_tokens": 150,
"total_tokens": 12650,
"prompt_tokens_details": {
"cached_tokens": 12000
}
}
}
| Field | Description |
|---|---|
| prompt_tokens | Total input tokens (inclusive of cached and cache-creation tokens) |
| prompt_tokens_details.cached_tokens | Tokens read from cache (reduced cost) |
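As a quick client-side check on these fields, the cache hit ratio can be computed directly from the usage object. A minimal sketch using the example response above (the helper name is our own):

```python
def cache_hit_ratio(usage: dict) -> float:
    """Fraction of prompt tokens served from cache, given an
    OpenAI-format usage object."""
    prompt = usage.get("prompt_tokens", 0)
    cached = usage.get("prompt_tokens_details", {}).get("cached_tokens", 0)
    return cached / prompt if prompt else 0.0

usage = {
    "prompt_tokens": 12500,
    "completion_tokens": 150,
    "total_tokens": 12650,
    "prompt_tokens_details": {"cached_tokens": 12000},
}
print(f"{cache_hit_ratio(usage):.0%}")  # prints 96%
```

A ratio near zero across repeated requests usually means the prefix is changing between calls or is below the minimum cacheable length.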
Anthropic Format
Cache usage appears directly on the usage object:
{
"usage": {
"input_tokens": 500,
"output_tokens": 150,
"cache_creation_input_tokens": 12000,
"cache_read_input_tokens": 0
}
}
On the first request, cache_creation_input_tokens reflects the tokens written to cache. Subsequent requests with the same prefix will show cache_read_input_tokens instead:
{
"usage": {
"input_tokens": 500,
"output_tokens": 140,
"cache_creation_input_tokens": 0,
"cache_read_input_tokens": 12000,
"cache_creation": {
"ephemeral_5m_input_tokens": 0,
"ephemeral_1h_input_tokens": 0
}
}
}
| Field | Description |
|---|---|
| input_tokens | Fresh (non-cached) input tokens |
| cache_creation_input_tokens | Tokens written to cache on this request |
| cache_read_input_tokens | Tokens read from cache (reduced cost) |
| cache_creation.ephemeral_5m_input_tokens | Cache creation tokens with 5-minute TTL |
| cache_creation.ephemeral_1h_input_tokens | Cache creation tokens with 1-hour TTL |
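Since input_tokens counts only fresh tokens here, the total input is the sum of the fresh, cache-write, and cache-read fields. A minimal sketch that summarizes a usage object on that basis (the helper name is our own):

```python
def summarize_claude_usage(usage: dict) -> dict:
    """Break down an Anthropic-format usage object into total input
    tokens and the shares written to and read from cache."""
    fresh = usage.get("input_tokens", 0)
    written = usage.get("cache_creation_input_tokens", 0)
    read = usage.get("cache_read_input_tokens", 0)
    total = fresh + written + read
    return {
        "total_input_tokens": total,
        "cache_read_share": read / total if total else 0.0,
        "cache_write_share": written / total if total else 0.0,
    }

# Second request from the example above: the prefix is read from cache.
usage = {
    "input_tokens": 500,
    "output_tokens": 140,
    "cache_creation_input_tokens": 0,
    "cache_read_input_tokens": 12000,
}
print(summarize_claude_usage(usage))
```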
Pricing Impact
Cached tokens are billed at reduced rates compared to fresh input tokens. The exact discount depends on the provider:
| Provider | Cache Write Cost | Cache Read Cost |
|---|---|---|
| Claude | 25% more than base input price | 90% less than base input price |
| Gemini | No separate write fee | Varies by caching mode; implicit cache hits are typically discounted (check Google pricing for current rates) |
| GPT | No separate write fee | Up to 90% less than base input price |
For high-volume use cases with long, repeated prompts, caching can significantly reduce both cost and latency.
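To make the trade-off concrete, the Claude multipliers from the table (1.25x base price for cache writes, 0.10x for cache reads) can be plugged into a quick cost estimate. The base price below is an assumed illustrative figure, not a RouterHub rate:

```python
def claude_input_cost(fresh, cache_write, cache_read, base_price_per_mtok):
    """Estimate input cost under the multipliers from the table above:
    cache writes at 1.25x the base input price, cache reads at 0.10x.
    base_price_per_mtok is an assumed illustrative price per million
    input tokens, not a real rate."""
    per_tok = base_price_per_mtok / 1_000_000
    return (fresh * per_tok
            + cache_write * per_tok * 1.25
            + cache_read * per_tok * 0.10)

# First request: a 12,000-token prefix is written to cache.
first = claude_input_cost(500, 12_000, 0, base_price_per_mtok=3.0)
# Later requests: the prefix is read back at 10% of base price.
later = claude_input_cost(500, 0, 12_000, base_price_per_mtok=3.0)
# Without caching, every request would pay for all 12,500 tokens fresh.
uncached = claude_input_cost(12_500, 0, 0, base_price_per_mtok=3.0)
print(first, later, uncached)
```

With these assumed numbers the first request costs slightly more than an uncached one (the write premium), but every subsequent request within the TTL costs a small fraction of it, so the break-even point arrives after the second request.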
Tips
- Put static content first. System prompts, reference documents, and tool definitions should appear before dynamic user queries.
- Avoid modifying cached prefixes. Even a single character change in a cached block invalidates the cache for that block and everything after it.
- Use longer TTLs for stable content. For Claude, set "ttl": "1h" on content that remains constant across many requests (where supported by the model and backend). For GPT, set prompt_cache_retention to "24h" on supported models.
- Place breakpoints strategically. For Claude, put cache_control on the last block you want included in the cached prefix. You can use up to 4 breakpoints per request.
- Monitor cache usage. Check the cached_tokens (OpenAI format) or cache_read_input_tokens (Anthropic format) fields to verify caching is working.