// Writing

Thinking out loud.

Essays on AI infrastructure, developer tooling, and technical content strategy. Practitioner-level posts from someone who builds the systems and writes about them.

Reranking in RAG: Why Your Top-k Results Are Probably Wrong
Vector databases return results based on semantic similarity. I explain why that is rarely enough for production RAG and how a cross-encoder reranker fixes the problem.
9 min read ai rag vector-search
Vector Embeddings: A Guide to the Geometry of Meaning in AI
Everything in AI starts with a vector. Here is how embedding models turn human language into high-dimensional geometry, why dimensionality reduction matters, and how to choose between OpenAI, Cohere, and self-hosted models.
15 min read ai llm rag
The State of Open Source AI Memory in 2026: Beyond the Context Window Myth
Model reasoning has plateaued. The next frontier of differentiation lives in stateful memory systems that solve identity fragmentation at production scale.
8 min read ai agents infrastructure
LLM Context Windows Explained: Why More Is Not Always Better
Context windows are expanding to millions of tokens. Here is why the middle of your context still gets ignored, what long context actually costs, and how to build production systems that use these massive windows effectively.
4 min read ai llm infrastructure
Prompt Caching: What It Is and When the Math Works
Prompt caching can reduce LLM costs by up to 90% and cut latency by half. Here is the engineering guide to how it works, why prefix matching matters, and how to calculate your ROI.
3 min read ai llm infrastructure
Mixture of Experts: How MoE Models Are Cheap to Run but Expensive to Host
DeepSeek V3 has 671B parameters but activates only 37B per token. Here is how mixture of experts works, why it cuts inference costs, and the catch nobody puts in the headline.
7 min read ai llm inference
The Best LLMs for Coding in 2026: An Engineering Review
Not all models are created equal for software development. Here is a benchmark-backed guide to choosing the right LLM for autonomous agents, algorithmic logic, and repository-scale refactoring as of March 2026.
4 min read ai llm technical-writing
The Model Context Protocol (MCP) Explained: A Universal Language for AI Tools
The Model Context Protocol (MCP) is an open standard for connecting AI models to data sources and tools. Here is why it matters, how it works, and why it is the missing link for agentic infrastructure.
4 min read ai agents infrastructure
Agent Harnesses: The Infrastructure Layer Your LLM Agent Actually Needs
Every production AI agent needs a harness. Here is what one contains, why frameworks alone are often not enough, and how to build the layer that actually determines reliability.
9 min read ai agents infrastructure
RAG vs. Fine-Tuning: A Better Decision Framework
Choosing wrong between Retrieval-Augmented Generation (RAG) and fine-tuning is one of the most common architectural mistakes in AI. Here is how to decide based on how often your knowledge changes, data privacy, and behavior requirements.
2 min read ai llm rag
Time to First Token (TTFT): The Metric That Determines AI Snappiness
Users do not care about total throughput. They care about how fast the first word appears. Here is the engineering guide to measuring and optimizing Time to First Token (TTFT) in production.
3 min read ai llm infrastructure
Speculative Decoding: How to Speed Up LLM Inference for Free
LLM inference is memory-bound, not compute-bound. Speculative decoding uses this fact to speed up generation by 2-3x using a smaller draft model to predict tokens for a larger one.
2 min read ai llm infrastructure