The practice of converting source documents to plain markdown before feeding them to Claude, exploiting the tokeniser’s efficiency on clean text versus format-heavy file types.
Reduction ratios
| Format | Token reduction |
|---|---|
| HTML → markdown | ~90% |
| PDF → markdown | 65–70% |
| DOCX → markdown | ~33% |
A 40-page PDF can occupy the same token space as a 130-page markdown file. [011]
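The page equivalence follows directly from the reduction ratio: a 65–70% reduction means markdown needs only 30–35% of the PDF's tokens, so the same token budget holds roughly three times as many markdown pages. A quick sanity check (illustrative arithmetic only, no real tokeniser involved):

```python
# A 65-70% token reduction means markdown uses only 30-35% of the
# PDF's tokens, so the same budget fits roughly 3x as many pages.
def equivalent_md_pages(pdf_pages: int, reduction: float) -> float:
    """Markdown pages that fit in the token budget of `pdf_pages` of PDF."""
    return pdf_pages / (1.0 - reduction)

low = equivalent_md_pages(40, 0.65)   # ~114 pages
high = equivalent_md_pages(40, 0.70)  # ~133 pages
```

Both ends of the range bracket the ~130-page figure cited above.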
Key points
- PDFs and HTML carry layout metadata, CSS, and formatting noise the model does not need for most tasks — only the text content matters [011]
- Recommended conversion tool: Docling (and similar converters) for fast automated conversion [011]
- Exception: OCR and vision tasks require the original file format [011]
- Pairs naturally with Claude Code Memory best practices: CLAUDE.md should route to separate files rather than inlining all context [011]
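The first point can be illustrated with a minimal standard-library sketch: strip tags, attributes, scripts, and styles from an HTML page and keep only the text payload. Real converters such as Docling do far more (heading and table reconstruction, reading-order recovery); this only shows where the token savings come from.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Keep only text content; drop tags, attributes, scripts, and styles."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # depth inside <script>/<style> blocks

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        # Everything outside script/style that is not pure whitespace
        if not self._skip and data.strip():
            self.parts.append(data.strip())

def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return "\n".join(parser.parts)

page = ('<html><head><style>p{color:red}</style></head>'
        '<body><p class="x">Only the text matters.</p></body></html>')
text = html_to_text(page)  # the payload is a small fraction of the raw HTML
```

Here the extracted text is a fraction of the raw markup's length, which is the whole effect the table above quantifies.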
Related concepts
- Context Rot — high-format documents accelerate context rot when ingested raw
- Token Economics — document format is a significant token cost lever
- Context Management — pre-processing documents is a standard context hygiene step
Source references
- [011] Nate Herk — Claude Code power features cluster (2026-04-20 to 2026-04-27)