// Writing

Thinking out loud.

Essays on AI infrastructure, developer tooling, and technical content strategy. Practitioner-level posts from someone who builds the systems and writes about them.

Reranking in RAG: Why Your Top-k Results Are Probably Wrong
Vector databases return results based on semantic similarity. I explain why that is rarely enough for production RAG and how a cross-encoder reranker fixes the problem.
9 min read ai rag vector-search
Vector Embeddings: A Guide to the Geometry of Meaning in AI
Everything in AI starts with a vector. Here is how embedding models turn human language into high-dimensional geometry, why dimensionality reduction matters, and how to choose between OpenAI, Cohere, and self-hosted models.
15 min read ai llm rag
The State of Open Source AI Memory in 2026: Beyond the Context Window Myth
Model reasoning has plateaued. The next frontier of differentiation lives in stateful memory systems that solve identity fragmentation at production scale.
8 min read ai agents infrastructure
LLM Context Windows Explained: Why More Is Not Always Better
Context windows are expanding to millions of tokens. Here is why the middle of your context still gets ignored, what long context actually costs, and how to build production systems that use these massive windows effectively.
4 min read ai llm infrastructure
Prompt Caching: What It Is and When the Math Works
Prompt caching can reduce LLM costs by up to 90% and cut latency by half. Here is the engineering guide to how it works, why prefix matching matters, and how to calculate your ROI.
3 min read ai llm infrastructure
Mixture of Experts: How MoE Models Are Cheap to Run but Expensive to Host
DeepSeek V3 has 671B parameters but activates only 37B per token. Here is how mixture of experts works, why it cuts inference costs, and the catch nobody puts in the headline.
7 min read ai llm inference
The Best LLMs for Coding in 2026: An Engineering Review
Not all models are created equal for software development. Here is a benchmark-backed guide to choosing the right LLM for autonomous agents, algorithmic logic, and repository-scale refactoring as of March 2026.
4 min read ai llm technical-writing
The Model Context Protocol (MCP) Explained: A Universal Language for AI Tools
The Model Context Protocol (MCP) is an open standard for connecting AI models to data sources and tools. Here is why it matters, how it works, and why it is the missing link for agentic infrastructure.
4 min read ai agents infrastructure
Agent Harnesses: The Infrastructure Layer Your LLM Agent Actually Needs
Every production AI agent needs a harness. Here is what one contains, why frameworks alone are often not enough, and how to build the layer that actually determines reliability.
9 min read ai agents infrastructure
RAG vs. Fine-Tuning: A Better Decision Framework
Choosing wrong between Retrieval-Augmented Generation (RAG) and fine-tuning is one of the most common architectural mistakes in AI. Here is how to decide based on how often your knowledge changes, data privacy, and behavior requirements.
2 min read ai llm rag
Time to First Token (TTFT): The Metric That Determines AI Snappiness
Users do not care about total throughput. They care about how fast the first word appears. Here is the engineering guide to measuring and optimizing Time to First Token (TTFT) in production.
3 min read ai llm infrastructure
Speculative Decoding: How to Speed Up LLM Inference for Free
LLM inference is memory-bound, not compute-bound. Speculative decoding uses this fact to speed up generation by 2-3x using a smaller draft model to predict tokens for a larger one.
2 min read ai llm infrastructure