Prompt Caching

Every model on Ember Cloud includes built-in prompt caching. When you send a request whose prompt shares a prefix with a recent request, the cached portion is served faster and at a reduced cost.

Caching is fully automatic — no opt-in, no configuration, no changes to your requests. Ember Cloud manages cache behavior across all models so you don't have to.

Cached input tokens are billed at a discounted cache_read rate. Check the /v1/models endpoint for per-model pricing.
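For example, you can pull a model's cache_read rate out of the /v1/models response in code. The `pricing` and `cache_read` field names below are illustrative assumptions, not a documented schema, so inspect the live response for your account:

```python
# Sketch: reading a per-model cache_read rate from a /v1/models response.
# The "pricing" / "cache_read" field names are assumptions, not a
# documented schema -- check the actual response shape.

def cache_read_rate(models_payload: dict, model_id: str):
    """Return one model's cache_read price, or None if absent."""
    for model in models_payload.get("data", []):
        if model.get("id") == model_id:
            return model.get("pricing", {}).get("cache_read")
    return None

# Example payload shaped like an OpenAI-style model list:
sample = {"data": [{"id": "glm-4.5", "pricing": {"cache_read": 0.1}}]}
print(cache_read_rate(sample, "glm-4.5"))  # → 0.1
```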

How It Works

When you send a chat completion request, Ember Cloud tokenizes your prompt and checks whether a prefix of those tokens already exists in the KV-cache from a recent request.

Cache miss

The full prompt is processed from scratch. This happens on the first request, after a cache entry expires, or when the prompt's prefix no longer matches any cached request.

Cache hit

The cached prefix is reused and only new tokens are computed. This is faster and cheaper.

Cache entries expire after a short TTL. Sending identical or prefix-sharing requests in quick succession maximizes cache reuse.
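A cache hit, then, comes down to the length of the shared token prefix. As a toy illustration (not Ember Cloud's actual implementation), the reusable portion of two tokenized prompts can be computed like this:

```python
def shared_prefix_len(cached_tokens: list[int], new_tokens: list[int]) -> int:
    """Return how many leading tokens two tokenized prompts share."""
    n = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n

# A prompt that differs only in its tail reuses most of the cache.
previous = [101, 7592, 2088, 999, 2023, 2003, 1037]  # earlier request
current  = [101, 7592, 2088, 999, 2023, 2003, 2047]  # same prefix, new tail
print(shared_prefix_len(previous, current))  # → 6: only the last token is recomputed
```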

Supported Models

All models on Ember Cloud support prompt caching. Each model has a minimum activation threshold, shown below.

Model            Min Threshold
glm-4.5          ~32 tokens
glm-4.5-air      ~64 tokens
glm-4.7-flash    ~32 tokens
glm-4.7          ~300 tokens
glm-5            ~256 tokens
minimax-m2.5     ~100 tokens
kimi-k2.5        ~256 tokens

Prompts below the listed threshold are processed without caching.
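If you want a quick sense of whether a prompt clears a threshold, a rough rule of thumb is about four characters per token for English text. This heuristic is an assumption for illustration, not any of these models' real tokenizers:

```python
def rough_token_estimate(text: str) -> int:
    """Very rough token estimate (~4 characters per token for English).
    A heuristic for illustration, not any model's actual tokenizer."""
    return max(1, len(text) // 4)

prompt = "You are a helpful assistant. " * 10
estimate = rough_token_estimate(prompt)
print(estimate, estimate >= 64)  # → 72 True: clears e.g. a ~64-token threshold
```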

Detecting Cache Hits

The response includes a usage.prompt_tokens_details.cached_tokens field that tells you how many input tokens were served from cache.

Response with cached tokens
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "model": "glm-4.5",
  "choices": [...],
  "usage": {
    "prompt_tokens": 1553,
    "completion_tokens": 12,
    "total_tokens": 1565,
    "prompt_tokens_details": {
      "cached_tokens": 1552
    }
  }
}

You can check for cache hits in your application code:

from openai import OpenAI

client = OpenAI(
    api_key="your-api-key",
    base_url="https://api.embercloud.ai/v1",
)

response = client.chat.completions.create(
    model="glm-4.5",
    messages=[
        {"role": "system", "content": "You are a helpful assistant..."},
        {"role": "user", "content": "What is prompt caching?"},
    ],
)

usage = response.usage
details = getattr(usage, "prompt_tokens_details", None)
if details and details.cached_tokens:
    print(f"Cache hit: {details.cached_tokens}/{usage.prompt_tokens} tokens cached")
else:
    print("Cache miss")

Maximizing Cache Hits

Use a consistent system prompt

Place your system instructions at the beginning of the message array. Since caching works on token prefixes, a stable system prompt ensures subsequent requests reuse the cached prefix.

Keep variable content at the end

Put user-specific or changing content at the end of your messages. The more tokens that match from the beginning, the higher the cache hit ratio.

Send requests in quick succession

Cache entries have a limited TTL (typically seconds to minutes). Batching requests close together maximizes reuse.

Use prompts above the threshold

Some models require a few hundred tokens before caching activates. If your prompt is very short, adding a system prompt can push it above the minimum.

FAQ

Is there a minimum token count for caching to trigger?

Yes — it varies by model, roughly in the 32-300 token range. See the table above for exact thresholds. Most models cache at well under 300 tokens, so virtually any real-world request benefits.

Do I need to enable caching?

No. Caching is always on and fully automatic. There is no flag or parameter to set.