// Writing
Thinking out loud.
Writing on AI infrastructure, developer tooling, and technical content strategy. Practitioner-level posts from someone who builds the systems and writes about them.
Why AI Agents Keep Failing in Production and What the Field Is Doing About It
I have spent two years watching agents fail in production. Here is what I keep seeing and what the field is starting to do about it.
A Taxonomy of AI Agents That Actually Explains What You Are Building
Most AI agent taxonomies are either too academic or too vague to be useful. Here is the classification I use when I need to decide what kind of agent to build.
Python model.predict(): The Function That Turns Data Into Decisions
A practical guide to model.predict() across scikit-learn, Keras, PyTorch, and XGBoost—what it does, how it behaves differently across frameworks, and the gotchas that will bite you in production.
RAG vs Memory: What AI Developers Need to Know
Understand the fundamental differences between RAG and memory systems for LLM applications, when to use each, and how to combine them in production.
Short-Term Memory for AI Agents: A Practical Guide
Context windows are not memory. Here is what every engineer building production AI agents needs to understand about token budgets, overflow handling, and how short-term and long-term memory actually work together.
AI Memory Management for LLMs: What Actually Works
A senior engineer's breakdown of what memory management for LLMs actually looks like in production: eviction strategies, KV cache management, importance-weighted retention, and why your agent keeps forgetting things.
Context Windows vs Memory: Why They Are Not the Same Thing
A 1M token context window is not memory. Treating it like one is how you build expensive systems that still forget what they were doing last Tuesday.
State of AI Agent Memory in 2026
The memory stack for AI agents has exploded into a fragmented mess of competing approaches. Here is what actually works, what is still research, and why the next 18 months will sort the winners from the wreckage.
The BEAM Memory Benchmark: Why 1M Context Windows Are Not Enough
The BEAM benchmark reveals that LLMs fail catastrophically at retrieving facts from the middle of long contexts. Here is what the data actually shows, why it happens, and what matters for real deployments.
How Memory Works in HyperAgents
A deep dive into how HyperAgents retain context across interactions, structure layered memory, and handle session continuity in production.
Memory for Voice AI Agents: What Text Chatbots Cannot Do
Voice AI agents live or die by how they manage memory across a real-time streaming pipeline. Text chatbots solve memory with RAG. Voice agents need something different.
Memory Hierarchy in AI Systems: From Sensory to Semantic
How layered memory architecture helps AI systems achieve long-term context, personalization, and continuous learning — and why flat memory fails.
How Memory Works in Claude Code
A practical guide to understanding how Claude Code retains context across sessions, uses project files, and manages long-term memory for coding tasks.
How Memory Works in DeerFlow
A deep dive into the memory architecture of DeerFlow: layered context passing, session state files, sub-agent isolation, and how it compares to Letta, AutoGen, and CrewAI.
Technical Writing for Engineers: The 80/20 Guide
Most engineering documentation fails for the same reasons. Here is what actually moves the needle.
LLM Token Budgets: A Practical Guide to Cost Control
Real numbers, real pricing, and concrete strategies for keeping your LLM spend predictable.
RAG Evaluation Metrics: What Actually Matters
A practical guide to RAGAs, recall, precision, and the metrics that separate production RAG systems from prototypes.
What Nobody Tells You About Error Handling in Production AI Agents
Hard-won lessons from running AI agents in production: the error patterns that actually break systems, and the patterns that fix them.
The 800ms Barrier: Profiling the Latency Chain of a Real-Time Gemini 3.1 Voice Agent
I built a sub-second latency voice assistant and profiled every millisecond of the Audio-to-Audio request/response loop on a MacBook Air M2. Here is the bottleneck analysis.
Context Engineering as Heap Management: Measuring Accuracy vs. KV Cache Eviction
VRAM is too expensive to waste on low-attention tokens. I benchmarked KV cache eviction strategies to treat LLM context like a managed heap, reaching 90% pruning with zero recall loss.