Retrieval-Augmented Generation (RAG)

The mainstream pattern for answering questions over a document collection: index the documents as vector embeddings, retrieve the most similar chunks at query time, inject them into the LLM’s context, and generate an answer grounded in the retrieved fragments. It is the default architecture for enterprise knowledge systems and the baseline against which every newer pattern (e.g. LLM Knowledge Bases (Karpathy pattern)) is measured.

Key points

  • Classic RAG architecture has four parts: a chunking pipeline (splits documents into passages), an embedding model (converts passages to vectors), a vector database (stores and searches them), and a retrieval-then-generate step at query time. Each part is a potential failure point (see the pipeline sketch after this list).
  • The “95% token reduction” claim often attached to LLM wikis is misleading [src-002]. That number compares the wiki pattern against naive full-document loading, not against well-tuned RAG with proper chunking. Against optimised RAG, both approaches inject similar token volumes — roughly 2,000–5,000 tokens per query for small knowledge bases. The real differentiator is infrastructure cost and retrieval reliability, not raw token count.
  • Where RAG still dominates: enterprise scale (millions of documents), heterogeneous sources that don’t fit neatly into a single wiki, and use cases where the retrieval corpus must stay immutable and independently updatable [src-002].
  • Where RAG is overkill: personal knowledge bases under ~100–200 pages. At that scale, the LLM Knowledge Bases (Karpathy pattern) approach (markdown + index) matches RAG’s retrieval quality without any of the infrastructure.
  • Hybrid approaches exist and are worth considering once a wiki grows past its context-window ceiling: keep the wiki for conceptual navigation, use RAG for long-tail document retrieval (a routing sketch follows the trade-offs table below).
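
A minimal sketch of the four-part pipeline described in the first key point, assuming a toy in-memory store. `embed()` and `generate()` are placeholders for whichever embedding model and LLM a deployment actually uses; the chunk size, overlap, and `top_k` values are illustrative defaults, not recommendations.

```python
# Minimal RAG pipeline sketch: chunk -> embed -> store -> retrieve -> generate.
# embed() and generate() are stubs standing in for real model calls.
from typing import List, Tuple
import numpy as np

def chunk(text: str, size: int = 500, overlap: int = 50) -> List[str]:
    """Split a document into overlapping character-window passages."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def embed(passages: List[str]) -> np.ndarray:
    """Placeholder: call the embedding model here; returns one vector per passage."""
    raise NotImplementedError

def generate(prompt: str) -> str:
    """Placeholder: call the LLM here."""
    raise NotImplementedError

class VectorStore:
    """Toy in-memory vector store with cosine-similarity search."""
    def __init__(self) -> None:
        self.vectors: List[np.ndarray] = []
        self.passages: List[str] = []

    def add(self, passages: List[str]) -> None:
        for text, vec in zip(passages, embed(passages)):
            self.vectors.append(vec / np.linalg.norm(vec))
            self.passages.append(text)

    def search(self, query: str, top_k: int = 5) -> List[Tuple[float, str]]:
        q = embed([query])[0]
        q = q / np.linalg.norm(q)
        scores = np.array(self.vectors) @ q
        best = np.argsort(scores)[::-1][:top_k]
        return [(float(scores[i]), self.passages[i]) for i in best]

def answer(store: VectorStore, question: str) -> str:
    """Retrieve-then-generate: inject the top chunks into the prompt."""
    context = "\n\n".join(text for _, text in store.search(question))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return generate(prompt)
```

Each stage in the sketch maps to one of the failure points noted above: bad chunk boundaries, a weak embedding model, a stale store, or a retrieval step that simply misses the relevant passage.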

Trade-offs vs LLM Knowledge Bases (Karpathy pattern) [src-002]

| Factor | RAG | LLM Wiki |
|---|---|---|
| Infrastructure | Embedding model + vector DB + chunking | Just markdown files |
| Scale ceiling | Millions of documents | ~200 pages / 100K tokens |
| Retrieval reliability | Can miss relevant passages | Reads index directly — no misses |
| Maintenance | Re-embed when sources change | Periodic LLM lint |
| Human readability | None (vectors are opaque) | Excellent (markdown) |
| Deduplication | Not applicable (chunks independent) | LLM-dependent, fragile at scale |
| Time-series analysis | Requires separate analytics layer | Not built in |
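
The hybrid approach from the key points can be sketched as a two-step router: check the wiki index first, fall back to chunk retrieval for documents the wiki does not cover. `wiki_index_lookup()` and `vector_search()` are hypothetical helpers, not functions from any specific tool.

```python
# Hybrid routing sketch for the "wiki + RAG" split described in the key points.
# Both helpers are stubs: the first consults the markdown index, the second
# falls back to chunk-based retrieval over the long-tail document corpus.
from typing import Optional

def wiki_index_lookup(question: str) -> Optional[str]:
    """Return the full markdown page if the wiki index has a relevant entry."""
    ...  # e.g. ask the LLM to pick a page title from index.md, then read that file

def vector_search(question: str, top_k: int = 5) -> str:
    """Chunk-based retrieval over documents not covered by the wiki."""
    ...  # e.g. embed the question and query the vector store

def build_context(question: str) -> str:
    page = wiki_index_lookup(question)
    if page is not None:
        return page                  # conceptual questions: whole wiki page
    return vector_search(question)   # long-tail questions: retrieved chunks
```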

Related entities

_(none yet — no specific RAG tool has been ingested as an entity in the wiki)_

Related concepts

Updates from bulk ingest

From src-006 (cluster 3)

  • RAG is not a synonym for vector search — the four retrieval methods (filters, SQL, full-context, chunk-based vectors) should be chosen by asking how a human would answer the same question and picking the method that mirrors it (see the routing sketch after this list) [kOKavHnlPik].
  • Chunk-based retrieval fails on tabular data and on whole-document questions like ‘how many rules total’ because the agent never sees the full table or document [kOKavHnlPik, irg-2IfAjpo].
  • Full-context stuffing is increasingly viable — GPT-5 Mini’s 400k-token window makes cramming entire PDFs into the system prompt cheap enough for many use cases, and Nate used this in the Agentic Arena RAG challenge [kOKavHnlPik].
  • Multimodal RAG is now possible with Gemini Embedding 2, which embeds text, images, video, audio, and documents into one vector space — enabling unified retrieval across modalities from a single query [hem5D1uvy-w].
  • Managed alternatives to self-hosted vector stores (Gemini File Search, Pinecone Assistant, OpenAI Vector Store) strip out the pipeline plumbing but introduce their own trade-offs around deduplication, chunk granularity, and data residency [irg-2IfAjpo].
  • Operational hygiene (metadata tagging, delete flows, recycling-bin workarounds) is as important as retrieval quality — stale or duplicate vectors silently degrade agent answers [5uw1wE6niGc].
  • Claude Code plus Gemini Embedding 2 plus Pinecone compresses a full multimodal RAG build from hours in n8n to roughly 30 minutes of natural-language prompting [hem5D1uvy-w].
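
The "match the human equivalent" heuristic above reads naturally as a routing decision. A hedged sketch, where `classify()` is a hypothetical router (often itself an LLM call) and the four retrieval helpers are stubs named for illustration only.

```python
# Sketch of picking a retrieval method by matching how a human would answer,
# mirroring the four options listed above. All functions here are stubs.
from enum import Enum, auto

class Method(Enum):
    METADATA_FILTER = auto()  # "find the invoice from March" -> filter on fields
    SQL = auto()              # "how many rules total" -> aggregate over the whole table
    FULL_CONTEXT = auto()     # "summarise this contract" -> load the entire document
    CHUNK_VECTORS = auto()    # "what does the policy say about X" -> semantic chunks

def classify(question: str) -> Method:
    """Hypothetical router: decide how a human would answer, then pick a method."""
    ...

def filter_by_metadata(question: str) -> str: ...
def run_sql_aggregate(question: str) -> str: ...
def load_full_documents(question: str) -> str: ...  # viable with very large context windows
def chunk_vector_search(question: str) -> str: ...

def retrieve(question: str) -> str:
    dispatch = {
        Method.METADATA_FILTER: filter_by_metadata,
        Method.SQL: run_sql_aggregate,
        Method.FULL_CONTEXT: load_full_documents,
        Method.CHUNK_VECTORS: chunk_vector_search,
    }
    return dispatch[classify(question)](question)
```

The tabular-data and whole-document failure cases in the second bullet are exactly the queries this router would send to the SQL or full-context branches instead of chunk-based vectors.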

Source references

  • [src-002] Robin Cartier — “Karpathy’s LLM Knowledge Base: A Practitioner’s Verdict” (2026-04-08)
  • [src-006] Nate Herk cluster (see summaries/src-006-*.md)