Prompt Caching for Agents

Prompt caching for agents is the practice of arranging stable instructions, policies, and tool schemas so providers can reuse repeated prompt prefixes and reduce latency and token cost.

Key points

  • Datadog found that 69 percent of input tokens in its customer traces were system-prompt tokens: internal instructions, policies, and tool guidance repeated through agent chains [src-037].
  • Heavily scaffolded agents often repeat guardrails and tool guidance verbatim, creating a cost and latency bottleneck [src-037].
  • Prompt caching can reduce cost and increase speed without changing model behavior when stable scaffolding is reusable across calls [src-037].
  • Datadog found that even among models supporting prompt caching, only 28 percent of LLM call spans showed cached-read input tokens [src-037].
  • Low cache-hit rates often stem from prompt layout problems: dynamic content appears too early, stable blocks are reordered, or reusable prefixes are rewritten between requests [src-037].
  • Practical caching therefore requires modular prompt design: stable system instructions, policy blocks, and tool schemas stay in a fixed order at the front, while dynamic state is injected later (see the layout sketch after this list) [src-037].
  • Pope explains the hardware reason cache hits are cheaper: reusing stored KV-cache state avoids rematerializing that context from token IDs with another model forward pass (see the toy prefix-cache sketch after this list) [src-042].
  • Cache-duration pricing can reflect memory-tier choices: short-lived caches can sit in faster memory, while longer-lived caches may move to slower, cheaper tiers [src-042].
  • OpenAI's Build Hour gives concrete API mechanics: prompt caching starts at 1,024 tokens, extends in additional 128-token blocks, requires an exact contiguous prefix, works across text, images, and audio, and is automatic on supported endpoints (see the arithmetic sketch after this list) [src-084].
  • The session describes default ephemeral caches lasting roughly 5-10 minutes, plus extended prompt caching that can retain a cache for up to 24 hours [src-084].
  • OpenAI frames cached-input discounts as steep for current models: about 90% off GPT-5-family cached tokens and nearly 99% off realtime-audio cached tokens [src-084].
  • The prompt_cache_key parameter helps route related requests to the same backend by combining the prefix hash with a developer-supplied key, improving cache locality for high-volume or branching agent workloads (see the request sketch after this list) [src-084].
  • Warp's customer spotlight adds practical scoping: stable global prompts and tools come first, user-specific context follows, task-level traces are preserved turn by turn, and edits are appended rather than rewriting earlier cached messages, matching the layout sketch after this list [src-084].
  • OpenAI notes that caching itself has no intelligence tradeoff for identical prefixes; the tradeoff is architectural when teams over-optimize cache hits at the expense of useful context curation [src-084].
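
To make the layout guidance concrete, here is a minimal sketch of assembling an agent request so the stable scaffolding forms a reusable prefix. The instruction text, policy block, and tool schema are hypothetical stand-ins, and the message shape assumes an OpenAI-style chat API; the point is that stable blocks keep a fixed order at the front while dynamic state lands at the end.

```python
# Minimal layout sketch: stable scaffolding first, dynamic state last.
# All names and content here are hypothetical stand-ins.

SYSTEM_INSTRUCTIONS = "You are a support agent. Follow the escalation policy."
POLICY_BLOCK = "Policy: never share account credentials; cite KB articles."

# Tool schemas belong to the stable prefix: keep them ordered and unchanged.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "lookup_order",
            "description": "Fetch an order record by ID.",
            "parameters": {
                "type": "object",
                "properties": {"order_id": {"type": "string"}},
                "required": ["order_id"],
            },
        },
    }
]

def build_messages(history: list[dict], dynamic_state: str) -> list[dict]:
    """Stable blocks first (the cacheable prefix), dynamic state last."""
    return [
        # 1. Byte-identical across requests, so the provider can cache it.
        {"role": "system", "content": SYSTEM_INSTRUCTIONS + "\n\n" + POLICY_BLOCK},
        # 2. Turn history is appended, never rewritten, so earlier turns
        #    stay inside the cached prefix on the next call.
        *history,
        # 3. Volatile context (timestamps, retrieved docs, user state) goes
        #    last so it never invalidates the stable blocks above it.
        {"role": "user", "content": dynamic_state},
    ]
```

On the next turn, the assistant reply and the new user message are appended to history, so everything above the new tail remains an exact contiguous prefix of the previous request and stays cache-eligible.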
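
As a toy illustration of the hardware point, the sketch below stores "KV state" keyed by a hash of the prefix token IDs: a hit reuses the stored state, while a miss pays for the prefill pass over the prefix. Real serving stacks hold KV tensors in accelerator memory (or cheaper tiers for longer-lived caches, per the memory-tier point above); this dict-based version only mirrors the control flow.

```python
import hashlib

# Toy store: prefix hash -> "KV state". Real servers keep KV tensors
# in accelerator memory, not Python dicts.
kv_cache: dict[str, dict] = {}

def prefix_key(token_ids: list[int]) -> str:
    return hashlib.sha256(repr(token_ids).encode()).hexdigest()

def expensive_prefill(token_ids: list[int]) -> dict:
    """Stand-in for the model forward pass that materializes KV state."""
    return {"context_len": len(token_ids)}

def serve(prefix_ids: list[int], suffix_ids: list[int]) -> dict:
    key = prefix_key(prefix_ids)
    state = kv_cache.get(key)
    if state is None:
        # Cache miss: rematerialize the prefix context from token IDs.
        state = expensive_prefill(prefix_ids)
        kv_cache[key] = state
    # Hit or miss, only the suffix still needs fresh computation here.
    return {**state, "new_tokens": len(suffix_ids)}

serve([1, 2, 3], [4])   # miss: pays for the prefix prefill
serve([1, 2, 3], [5])   # hit: reuses stored state, skips the prefix pass
```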
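
A small helper makes the threshold and block mechanics easy to check, along with what the roughly 90% cached-token discount does to input cost. The rounding rule and discount rate come from the session's description; treat this as illustrative arithmetic, not provider billing logic.

```python
def cached_prefix_tokens(prefix_tokens: int) -> int:
    """Tokens eligible for a cache hit: caching starts at 1,024 tokens
    and grows in additional 128-token blocks."""
    if prefix_tokens < 1024:
        return 0
    return 1024 + ((prefix_tokens - 1024) // 128) * 128

def effective_input_cost(total_tokens: int, stable_prefix: int,
                         price_per_token: float,
                         cached_discount: float = 0.90) -> float:
    """Blend full-price and discounted tokens, assuming the stable
    prefix hits the cache at the ~90% cached-token discount."""
    cached = min(cached_prefix_tokens(stable_prefix), total_tokens)
    uncached = total_tokens - cached
    return uncached * price_per_token + cached * price_per_token * (1 - cached_discount)

# A 4,100-token stable prefix rounds down to 4,096 cacheable tokens:
# 1,024 + ((4,100 - 1,024) // 128) * 128 = 4,096.
print(cached_prefix_tokens(4100))   # 4096
print(cached_prefix_tokens(1000))   # 0: below the 1,024-token threshold
```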
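
The prompt_cache_key routing can be sketched as below, reusing build_messages and TOOLS from the layout sketch. The parameter itself is described in the session; the call shape follows the official openai Python SDK, while the model name and the per-conversation key scheme are illustrative assumptions.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# One key per long-lived conversation steers its requests toward the same
# backend, keeping the shared prefix warm in that machine's cache.
response = client.chat.completions.create(
    model="gpt-5",                        # assumed caching-enabled model name
    messages=build_messages(history=[], dynamic_state="Where is order 42?"),
    tools=TOOLS,
    prompt_cache_key="support-agent:conv-8841",  # hypothetical key scheme
)
print(response.choices[0].message)
```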

Source references

  • [src-037] Datadog — "State of AI Engineering" (2026-04-21)
  • [src-042] Dwarkesh Patel — "How GPT, Claude, and Gemini are actually trained and served – Reiner Pope" (2026-04-29)
  • [src-084] OpenAI Codex, Workspace Agents, Prompt Caching, and Superintelligence Policy cluster (2026-02-09 to 2026-05-08)