// Writing
Thinking out loud.
Writing on AI infrastructure, developer tooling, and technical content strategy. Practitioner-level posts from someone who builds the systems and writes about them.
2026-03-13
LLM context windows: what the number on the spec sheet actually means
A 1M-token context window sounds impressive. Here's why the model often can't reliably use most of it, and what actually happens to your data inside a full context window.
2026-03-13
Mixture of experts: how MoE models are cheap to run but expensive to host
DeepSeek V3 has 671B parameters but only activates 37B per token. Here's how mixture of experts works, why it cuts inference costs, and the catch nobody puts in the headline.
2026-03-12
Speculative decoding explained: how LLMs generate tokens faster without changing the output
Speculative decoding cuts LLM latency by 2-3x without changing output quality. Here's how the draft-verify loop works, why acceptance rate is the only number that matters, and when it actually hurts.
2026-03-11
Best LLMs for coding in 2026 (it's not what the benchmarks say)
HumanEval scores are nearly meaningless in 2026. Here's how to pick the right coding LLM based on your actual task: agentic coding, autocomplete, or cost-sensitive production.
2026-03-10
Agent harnesses: the infrastructure layer your LLM agent actually needs
Every production AI agent needs a harness. Here's what one contains, why frameworks often aren't enough, and how to build the layer that actually determines reliability.
2026-03-09
The Model Context Protocol explained: why MCP matters architecturally
A practical explanation of the Model Context Protocol, why it matters architecturally, how FastAPI MCP fits in, and how to avoid framework lock-in in agent systems.
2026-03-08
Time to first token: what TTFT measures and how to reduce it
TTFT is the latency metric that matters most for chat, copilots, and interactive AI apps. Here is what time to first token measures and how to reduce it without optimizing the wrong thing.
2026-03-07
Prompt caching: what it is and when the math works in your favor
Prompt caching can cut latency and input cost, but only when your prompts share a stable prefix often enough to make cache reuse real. Here is how I think about it in production.
2026-03-06
RAG vs fine-tuning: how to actually decide
RAG and fine-tuning solve different problems. Here is how I decide between them, when to combine them, and where teams waste time choosing the wrong one.