RAG & Retrieval

A practitioner's guide to retrieval-augmented generation: embeddings, hybrid search, reranking, evaluation, and the optimizations that decide whether your RAG system works.

Retrieval-augmented generation is easy to demo and hard to ship. The demo works because the answer sits in the top result. Production breaks because the right chunk lands at rank 14, the embedding misses the intent, and nobody measured recall before launch.

The posts below cover the retrieval stack end to end, from how embeddings encode meaning to reranking, hybrid search, caching, and the evaluation metrics that tell you whether any of it works.

Foundations

How retrieval encodes meaning and where it fits next to fine-tuning and memory.

Vector Embeddings: A Guide to the Geometry of Meaning in AI

Everything in AI starts with a vector. Here is how embedding models turn human language into high-dimensional geometry, why dimensionality reduction matters, and how to choose between OpenAI, Cohere, and self-hosted models.

16 min read ai llm rag

RAG vs Fine-Tuning: A Better Decision Framework

Picking the wrong side of the Retrieval-Augmented Generation (RAG) versus fine-tuning choice is one of the most common architectural mistakes in AI. Here is how to decide based on how often your knowledge changes, data privacy, and behavior requirements.

7 min read ai llm rag

RAG vs Memory: What AI Developers Need to Know

Understand the fundamental differences between RAG and memory systems for LLM applications, when to use each, and how to combine them in production.

14 min read ai rag memory

Retrieval quality

Getting the right chunk to the top, not just a similar one.

Hybrid Search: Combining BM25 and Vector Search for Better Retrieval

Hybrid search combines BM25 sparse retrieval with dense vector search. Here's how reciprocal rank fusion works, what it costs, and when the combination actually beats either method alone.

11 min read ai rag vector-search

Reranking in RAG: Why Your Top-K Results Are Probably Wrong

Vector databases return results based on semantic similarity. I explain why that is rarely enough for production RAG and how a cross-encoder reranker fixes the problem.

11 min read ai rag vector-search

How Anthropic's Contextual Retrieval Changes RAG Architecture

Anthropic says Contextual Retrieval cut top-20 retrieval failure by 49% with contextual embeddings plus contextual BM25. I walk through the mechanism, the benchmark, and the part of the RAG pipeline it changes.

11 min read ai rag infrastructure

Optimization and cost

Cutting latency and spend without hurting answer quality.

Semantic Caching: The RAG Optimization Nobody Talks About

Semantic caching returns cached LLM responses for semantically similar queries, cutting API costs by 40-70% on the right workloads. Here's how the mechanism works and where it fails.

10 min read ai rag infrastructure

Structured Outputs with LLMs: JSON Mode, Function Calling, and When to Use Each

JSON mode, function calling, and structured outputs solve different problems. Here's when each one actually makes sense and what they cost you.

12 min read ai llm infrastructure

Evaluation

The metrics that separate a production RAG system from a demo.

RAG Evaluation Metrics: What Actually Matters

A practical guide to RAGAs, recall, precision, and the metrics that separate production RAG systems from prototypes.

11 min read rag evaluation llm

Benchmarks

Real numbers on retrieval running close to the user.

100ms Vector Search in the Browser: PGlite vs. SQLite-vec Head-to-Head

I benchmarked PGlite and SQLite-vec on a MacBook Air M2 to find the fastest WASM vector database for local-first AI applications.

10 min read vector-search wasm pglite

Frequently asked questions

What is the difference between RAG and fine-tuning?

RAG adds knowledge at query time by retrieving documents into the prompt. Fine-tuning bakes behavior or knowledge into the model weights through training. RAG suits facts that change often, and fine-tuning suits stable behavior and format. The decision-framework post covers how to choose.

Why are my top-k retrieval results wrong?

Vector similarity ranks by semantic closeness, which is not the same as relevance to the question. The fix is a reranking step that re-scores candidates with a model built for relevance. The reranking post shows the measured difference.
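
As a rough illustration of that re-scoring step, here is a minimal sketch. The names `rerank` and `toy_overlap_score` are hypothetical, and the word-overlap scorer is only a stand-in so the example runs without dependencies. In a real pipeline you would pass in a scoring function backed by a cross-encoder or a hosted reranking API instead.

```python
from typing import Callable, List, Tuple


def rerank(
    query: str,
    candidates: List[str],
    score_fn: Callable[[str, str], float],
    top_n: int = 3,
) -> List[Tuple[str, float]]:
    """Re-score retrieved candidates against the query and keep the best top_n."""
    scored = [(doc, score_fn(query, doc)) for doc in candidates]
    # Highest relevance first; sort is stable, so ties keep retrieval order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


def toy_overlap_score(query: str, doc: str) -> float:
    """Stand-in scorer (assumption): fraction of query words found in the doc.

    A real reranker would score the (query, doc) pair with a trained model.
    """
    query_words = set(query.lower().split())
    doc_words = set(doc.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & doc_words) / len(query_words)


# Candidates in the order a vector search might have returned them.
candidates = [
    "API keys authenticate requests to the service",
    "To rotate an API key create a new key and revoke the old one",
    "Billing is calculated per request",
]
reranked = rerank("how do I rotate an api key", candidates, toy_overlap_score, top_n=2)
```

The second candidate moves to the top because it answers the question, while the first is merely about the same subject. The retriever stays cheap and broad, and the more expensive scorer only sees the short candidate list.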

What is hybrid search?

Hybrid search combines keyword retrieval (BM25) with dense vector search and fuses the rankings, usually with reciprocal rank fusion. It catches exact terms that embeddings miss and meaning that keywords miss.
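
Reciprocal rank fusion can be sketched in a few lines. Each document earns 1 / (k + rank) from every ranked list it appears in, and the sums decide the fused order. The function name and the sample document IDs are made up for illustration, and k = 60 is the commonly used default rather than a tuned value.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document IDs into a single ranking."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top hit contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from a keyword index and a vector index.
bm25_hits = ["doc_a", "doc_b", "doc_c"]
vector_hits = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
print(fused)  # ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

Documents that both retrievers agree on rise to the top, and because only ranks are used, the BM25 and cosine scores never need to be put on the same scale.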

How do I evaluate a RAG system?

Measure retrieval and generation separately. Track recall and precision on retrieval, and faithfulness and answer relevance on generation. The evaluation-metrics post covers which numbers actually predict production quality.
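
For the retrieval half, a minimal sketch of precision@k and recall@k looks like this. It assumes you have a labeled set of relevant document IDs per query and that the retrieved IDs are unique. The function names and sample IDs are illustrative, not taken from any particular evaluation library.

```python
from typing import List, Set


def precision_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Share of the top-k slots filled by relevant documents."""
    if k <= 0:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    # Divides by k, so returning fewer than k results counts against precision.
    return hits / k


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Share of all relevant documents that appear in the top k."""
    if not relevant or k <= 0:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)


# Hypothetical labels for one query.
retrieved = ["d1", "d7", "d3", "d9", "d4"]
relevant = {"d3", "d4", "d8"}

p_at_3 = precision_at_k(retrieved, relevant, 3)  # 1 hit in 3 slots
r_at_5 = recall_at_k(retrieved, relevant, 5)     # 2 of 3 relevant docs found
```

Averaging these over a held-out query set gives you a baseline to compare against before and after adding a reranker or hybrid search. Faithfulness and answer relevance need a separate check on the generated answer.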
