// Writing

Thinking out loud.

Writing on AI infrastructure, developer tooling, and technical content strategy. Practitioner-level posts from someone who builds the systems and writes about them.

2026-03-13
LLM context windows: what the number on the spec sheet actually means
A 1M token context window sounds impressive. Here's why the model often can't reliably use most of it, and what actually happens to your data inside a full context window.
6 min read · ai · llm · inference
2026-03-13
Mixture of experts: how MoE models are cheap to run but expensive to host
DeepSeek V3 has 671B parameters but only activates 37B per token. Here's how mixture of experts works, why it cuts inference costs, and the catch nobody puts in the headline.
7 min read · ai · llm · inference
2026-03-12
Speculative decoding explained: how LLMs generate tokens faster without changing the output
Speculative decoding cuts LLM latency by 2-3x without changing output quality. Here's how the draft-verify loop works, why acceptance rate is the only number that matters, and when it actually hurts.
7 min read · ai · llm · inference
2026-03-11
Best LLMs for coding in 2026 (it's not what benchmarks say)
HumanEval scores are nearly meaningless in 2026. Here's how to pick the right coding LLM for your actual task: agentic coding, autocomplete, or cost-sensitive production.
10 min read · ai · llm · coding
2026-03-10
Agent harnesses: the infrastructure layer your LLM agent actually needs
Every production AI agent needs a harness. Here's what one contains, why frameworks often aren't enough, and how to build the layer that actually determines reliability.
12 min read · ai · agents · infrastructure
2026-03-09
The Model Context Protocol explained: why MCP matters architecturally
A practical explanation of the Model Context Protocol, why it matters architecturally, how FastAPI MCP fits in, and how to avoid framework lock-in in agent systems.
10 min read · ai · agents · mcp
2026-03-08
Time to first token: what TTFT measures and how to reduce it
TTFT is the latency metric that matters most for chat, copilots, and interactive AI apps. Here is what time to first token measures and how to reduce it without optimizing the wrong thing.
13 min read · ai · llm · latency
2026-03-07
Prompt caching: what it is and when the math works in your favor
Prompt caching can cut latency and input cost, but only when your prompts share a stable prefix often enough that cache hits actually happen. Here is how I think about it in production.
10 min read · ai · llm · prompt-caching
2026-03-06
RAG vs fine-tuning: how to actually decide
RAG and fine-tuning solve different problems. Here is how I decide between them, when to combine them, and where teams waste time choosing the wrong one.
11 min read · ai · rag · fine-tuning