Prompt Caching

Reduce latency and cost by caching frequently reused prompt content. When the same content appears across requests, the provider can skip reprocessing cached tokens and charge a reduced rate.

Prompt caching is handled by the underlying providers. RouterHub passes through caching directives and reports cache usage in the response.


How It Works

Each provider implements prompt caching differently, but the core idea is the same: prefix content that is identical across requests is processed once and reused from cache on subsequent calls.

Claude
  Mechanism: explicit breakpoints via cache_control
  Minimum cacheable tokens: 1,024 (4,096 for Haiku 4.5, Opus 4.5, and Opus 4.6)
  Cache lifetime: 5 minutes (default) or 1 hour (where supported)

Gemini
  Mechanism: automatic; the longest matching prefix is cached
  Minimum cacheable tokens: provider-managed
  Cache lifetime: automatic (provider-managed)

GPT
  Mechanism: automatic prefix caching; optional prompt_cache_key for explicit control
  Minimum cacheable tokens: 1,024
  Cache lifetime: automatic (minutes); up to 24 hours with prompt_cache_retention on supported models

Claude: Explicit Cache Breakpoints

Claude uses explicit cache_control breakpoints. You mark the point in your prompt where the cache should end using cache_control on content blocks, system blocks, and tool definitions. Everything up to and including the marked block becomes a cacheable prefix.

Supported Locations

cache_control can be placed on:

- Individual content blocks within messages
- System blocks
- Tool definitions

RouterHub applies cache_control as per-block breakpoints on individual content blocks, system blocks, and tools. This approach is supported by all Claude backends (AWS Bedrock, GCP Vertex, and the direct Anthropic API). The top-level cache_control convenience field (which auto-applies a breakpoint to the last cacheable block) is not used, as it is not supported on Bedrock and Vertex.

cache_control Object

type (string, required): Must be "ephemeral".
ttl (string, optional): "5m" (the default) or "1h" (where supported by the model and backend).

Anthropic Format (/v1/messages)

Add cache_control directly on content blocks, system blocks, and tools. RouterHub passes these through to Claude.

Caching a system prompt

curl https://api.routerhub.ai/v1/messages \
  -H "x-api-key: $ROUTERHUB_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "anthropic/claude-sonnet-4",
    "max_tokens": 1024,
    "system": [
      {
        "type": "text",
        "text": "You are a legal assistant. Here is the full text of the contract to analyze: ... (long document) ...",
        "cache_control": {"type": "ephemeral"}
      }
    ],
    "messages": [
      {"role": "user", "content": "Summarize the key terms."}
    ]
  }'

The same request using the Anthropic Python SDK:

from anthropic import Anthropic

client = Anthropic(
    base_url="https://api.routerhub.ai",
    api_key="YOUR_API_KEY",
)

message = client.messages.create(
    model="anthropic/claude-sonnet-4",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a legal assistant. Here is the full text of the contract to analyze: ... (long document) ...",
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[
        {"role": "user", "content": "Summarize the key terms."}
    ],
)
print(message.content[0].text)

Caching tools and conversation history

{
  "model": "anthropic/claude-sonnet-4",
  "max_tokens": 1024,
  "tools": [
    {
      "name": "search_database",
      "description": "Search the product database",
      "input_schema": {
        "type": "object",
        "properties": {
          "query": {"type": "string"}
        },
        "required": ["query"]
      },
      "cache_control": {"type": "ephemeral"}
    }
  ],
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Here is the product catalog: ... (large text) ...",
          "cache_control": {"type": "ephemeral"}
        },
        {
          "type": "text",
          "text": "Find products related to wireless charging."
        }
      ]
    }
  ]
}

Using 1-hour TTL

For content that should remain cached longer, set the TTL to "1h" (where supported by the model and backend):

{
  "type": "text",
  "text": "Reference documentation that rarely changes...",
  "cache_control": {"type": "ephemeral", "ttl": "1h"}
}

OpenAI Format (/v1/chat/completions)

When using the OpenAI-compatible endpoint with Claude models, explicit cache breakpoints are not available and automatic prefix caching is not supported. To use prompt caching with Claude, use the /v1/messages endpoint with explicit cache_control on your content blocks.


Gemini: Automatic Caching

Gemini models automatically cache the longest matching prefix of your prompt. No explicit markup is needed. If successive requests share the same leading content (system prompt, initial messages, etc.), Gemini reuses the cached prefix.

Best Practices

- Keep the shared prefix byte-identical across requests; any change to the leading content prevents a prefix match.
- Put large, stable content (system prompts, reference documents) first and variable content (the user question) last, so the cacheable prefix is as long as possible.

Example

curl https://api.routerhub.ai/v1/chat/completions \
  -H "Authorization: Bearer $ROUTERHUB_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemini-2.5-pro",
    "messages": [
      {"role": "system", "content": "You are an expert code reviewer. Here is the full codebase: ... (large context) ..."},
      {"role": "user", "content": "Review the authentication module."}
    ]
  }'

On the next request, keep the system message identical and only change the user message to benefit from automatic caching:

curl https://api.routerhub.ai/v1/chat/completions \
  -H "Authorization: Bearer $ROUTERHUB_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemini-2.5-pro",
    "messages": [
      {"role": "system", "content": "You are an expert code reviewer. Here is the full codebase: ... (large context) ..."},
      {"role": "user", "content": "Now review the database layer."}
    ]
  }'
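Because cache hits require a byte-identical leading section, it helps to build the shared portion of the request once and reuse it verbatim. A minimal sketch of that pattern (payload construction only; model name and message shapes taken from the curl examples above):

```python
# Reuse an identical shared prefix across requests so Gemini's automatic
# caching can match it on the second and later calls.
SHARED_SYSTEM = (
    "You are an expert code reviewer. "
    "Here is the full codebase: ... (large context) ..."
)

def build_request(user_message: str) -> dict:
    """Build a chat payload whose leading content is byte-identical
    across calls, so the provider can serve the prefix from cache."""
    return {
        "model": "google/gemini-2.5-pro",
        "messages": [
            {"role": "system", "content": SHARED_SYSTEM},
            {"role": "user", "content": user_message},
        ],
    }

first = build_request("Review the authentication module.")
second = build_request("Now review the database layer.")

# The cacheable prefix (everything before the user turn) is identical.
assert first["messages"][0] == second["messages"][0]
```

Constructing the prefix from a single constant, rather than re-concatenating it per request, avoids accidental whitespace or ordering differences that would invalidate the cached prefix.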

GPT: Automatic and Explicit Caching

GPT models automatically cache prompt prefixes of 1,024 tokens or more. No explicit markup is needed for basic caching. Cached input tokens are billed at a reduced rate (up to 90% less than base input price) with no separate cache-write fee.
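Because prompts below the 1,024-token minimum never hit the cache, it can be useful to sanity-check prompt length before relying on caching. A rough sketch using a crude characters-per-token heuristic (about 4 characters per token for English text; this is an approximation, not the model's real tokenizer):

```python
# Rough check of whether a prompt is long enough to benefit from GPT's
# automatic prefix caching (1,024-token minimum). The ~4 chars/token
# ratio is a heuristic assumption, not an exact count.
MIN_CACHEABLE_TOKENS = 1024

def estimated_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def likely_cacheable(prompt: str) -> bool:
    return estimated_tokens(prompt) >= MIN_CACHEABLE_TOKENS

print(likely_cacheable("short prompt"))         # False: far below 1,024 tokens
print(likely_cacheable("policy text " * 1000))  # True: well above the threshold
```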

For more control, RouterHub passes through two optional parameters:

prompt_cache_key (string): A hint to improve cache routing. Requests with the same key are more likely to hit the same cached prefix, but cache hits still require an exact prefix match.
prompt_cache_retention (string): Retention policy, either "in_memory" (the default automatic behavior) or "24h" (retain for up to 24 hours, on supported models).

Best Practices

- Place static content (instructions, reference documents) at the start of the prompt and variable content at the end, so the prefix stays identical across requests.
- Set prompt_cache_key when many clients send variants of the same prompt, to route them toward the same cached prefix; an exact prefix match is still required for a hit.
- Use prompt_cache_retention: "24h" for prompts that are reused over longer periods, on supported models.

Example: Automatic Caching

curl https://api.routerhub.ai/v1/chat/completions \
  -H "Authorization: Bearer $ROUTERHUB_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4.1",
    "messages": [
      {"role": "system", "content": "You are a customer support agent. Here are the support policies: ... (large document) ..."},
      {"role": "user", "content": "A customer wants to return a product after 45 days."}
    ]
  }'

Example: Explicit Cache Key with 24h Retention

curl https://api.routerhub.ai/v1/chat/completions \
  -H "Authorization: Bearer $ROUTERHUB_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4.1",
    "prompt_cache_key": "support-policies-v2",
    "prompt_cache_retention": "24h",
    "messages": [
      {"role": "system", "content": "You are a customer support agent. Here are the support policies: ... (large document) ..."},
      {"role": "user", "content": "A customer wants to return a product after 45 days."}
    ]
  }'

Cache Usage in Responses

RouterHub reports cache hit and creation metrics in the response usage object so you can verify caching is working.

OpenAI Format

Cache usage appears in usage.prompt_tokens_details:

{
  "usage": {
    "prompt_tokens": 12500,
    "completion_tokens": 150,
    "total_tokens": 12650,
    "prompt_tokens_details": {
      "cached_tokens": 12000
    }
  }
}
prompt_tokens: Total input tokens, inclusive of cached and cache-creation tokens.
prompt_tokens_details.cached_tokens: Tokens read from cache (billed at the reduced rate).
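To verify caching is working, you can compute the fraction of prompt tokens served from cache. A short sketch using the usage object shown above:

```python
# Compute the cache hit rate from an OpenAI-format usage object.
usage = {
    "prompt_tokens": 12500,
    "completion_tokens": 150,
    "total_tokens": 12650,
    "prompt_tokens_details": {"cached_tokens": 12000},
}

cached = usage["prompt_tokens_details"].get("cached_tokens", 0)
hit_rate = cached / usage["prompt_tokens"]
print(f"cache hit rate: {hit_rate:.0%}")  # → cache hit rate: 96%
```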

Anthropic Format

Cache usage appears directly on the usage object:

{
  "usage": {
    "input_tokens": 500,
    "output_tokens": 150,
    "cache_creation_input_tokens": 12000,
    "cache_read_input_tokens": 0
  }
}

On the first request, cache_creation_input_tokens reflects the tokens written to cache. Subsequent requests with the same prefix will show cache_read_input_tokens instead:

{
  "usage": {
    "input_tokens": 500,
    "output_tokens": 140,
    "cache_creation_input_tokens": 0,
    "cache_read_input_tokens": 12000,
    "cache_creation": {
      "ephemeral_5m_input_tokens": 0,
      "ephemeral_1h_input_tokens": 0
    }
  }
}
input_tokens: Fresh (non-cached) input tokens.
cache_creation_input_tokens: Tokens written to cache on this request.
cache_read_input_tokens: Tokens read from cache (reduced cost).
cache_creation.ephemeral_5m_input_tokens: Cache-creation tokens with a 5-minute TTL.
cache_creation.ephemeral_1h_input_tokens: Cache-creation tokens with a 1-hour TTL.
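The write-then-read pattern above can be checked programmatically. A minimal sketch that classifies an Anthropic-format usage object (the helper name is illustrative):

```python
# Interpret an Anthropic-format usage object: a first request writes the
# prefix to cache; later identical requests read it back.
def cache_status(usage: dict) -> str:
    if usage.get("cache_read_input_tokens", 0) > 0:
        return "cache hit"
    if usage.get("cache_creation_input_tokens", 0) > 0:
        return "cache write"
    return "no caching"

first = {"input_tokens": 500, "output_tokens": 150,
         "cache_creation_input_tokens": 12000, "cache_read_input_tokens": 0}
later = {"input_tokens": 500, "output_tokens": 140,
         "cache_creation_input_tokens": 0, "cache_read_input_tokens": 12000}

print(cache_status(first))  # → cache write
print(cache_status(later))  # → cache hit
```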

Pricing Impact

Cached tokens are billed at reduced rates compared to fresh input tokens. The exact discount depends on the provider:

Claude: cache writes cost 25% more than the base input price; cache reads cost 90% less.
Gemini: no separate write fee; read discounts vary by caching mode (implicit cache hits are typically discounted; check Google pricing for current rates).
GPT: no separate write fee; cache reads cost up to 90% less than the base input price.

For high-volume use cases with long, repeated prompts, caching can significantly reduce both cost and latency.
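As a worked example of the Claude discount above, here is a sketch of the cost of reusing one cached prefix across several requests. The base price is a placeholder, not a real rate; only the 1.25x write and 0.10x read multipliers come from the table:

```python
# Cost of a repeated prefix with Claude-style caching: the first request
# pays 25% extra to write the cache, later requests pay 90% less to read it.
BASE = 3.00 / 1_000_000   # hypothetical $ per input token (placeholder)
WRITE = 1.25 * BASE       # +25% on the first (cache-creating) request
READ = 0.10 * BASE        # -90% on subsequent cache hits

def total_cost(prefix_tokens: int, requests: int) -> float:
    """Cost of sending the same prefix `requests` times with caching."""
    return prefix_tokens * (WRITE + (requests - 1) * READ)

def uncached_cost(prefix_tokens: int, requests: int) -> float:
    return prefix_tokens * BASE * requests

# A 12,000-token prefix reused across 10 requests:
cached = total_cost(12_000, 10)
plain = uncached_cost(12_000, 10)
print(f"cached: ${cached:.4f}, uncached: ${plain:.4f}")
# → cached: $0.0774, uncached: $0.3600
```

Note that with these multipliers caching pays for itself from the second request onward: one write plus one read (1.35x) already costs less than two uncached passes (2.0x).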


Tips