// Topic

LLM Inference & Cost

How LLMs actually run, and how to make them faster and cheaper. Context windows, token budgets, prompt caching, speculative decoding, TTFT, and KV-cache internals, all benchmarked.

Every LLM feature comes down to tokens moving through memory. Latency, cost, and quality are all downstream of how many tokens you send, how the model caches them, and how fast the first one comes back.

I benchmark this on a MacBook Air M2 with real workloads and real pricing. The posts below explain the internals that drive your bill, then show the levers that cut it, from prompt caching to speculative decoding.

Context and tokens

What you are actually paying for, and why bigger windows do not fix everything.

LLM Context Windows Explained: Why More Is Not Always Better

Context windows are expanding to millions of tokens. Here is why the middle of your context still gets ignored, what long context actually costs, and how to build production systems that use these massive windows effectively.

7 min read ai llm infrastructure

Context Windows vs Memory: Why They Are Not the Same Thing

A 1M token context window is not memory. Treating it like one is how you build expensive systems that still forget what they were doing last Tuesday.

18 min read ai llm memory

Token Counting Isn't Optional: A Practical Guide to LLM Cost Control

I explain the mechanics of LLM tokenization, why JSON burns your API budget, and how to architect systems for strict token efficiency.

8 min read llm infrastructure ai

LLM token budgets: a practical guide to cost control

Real numbers, real pricing, and concrete strategies for keeping your LLM spend predictable.

10 min read ai cost backend

Speed and cost levers

The techniques that cut latency and spend without changing the output.

Prompt Caching: What It Is and When the Math Works

Prompt caching can reduce LLM costs by up to 90% and cut latency by half. Here is the engineering guide to how it works, why prefix matching matters, and how to calculate your ROI.

5 min read ai llm infrastructure

Speculative Decoding: How to Speed Up LLM Inference for Free

LLM inference is memory-bound, not compute-bound. Speculative decoding uses this fact to speed up generation by 2-3x using a smaller draft model to predict tokens for a larger one.

8 min read ai llm infrastructure

Time to First Token (TTFT): The Metric That Determines AI Snappiness

Users do not care about total throughput. They care about how fast the first word appears. Here is the engineering guide to measuring and optimizing Time to First Token (TTFT) in production.

8 min read ai llm infrastructure

LLM Inference Optimization: What Actually Works in Production

A practical breakdown of the inference optimization techniques that move the needle (batching, quantization, caching, and attention kernels), with concrete numbers and the tradeoffs between them.

7 min read ai llm infrastructure

Internals

What happens inside the model and the cache while it runs.

Context Engineering as Heap Management: Measuring Accuracy vs. KV Cache Eviction

VRAM is too expensive to waste on low-attention tokens. I benchmarked KV cache eviction strategies to treat LLM context like a managed heap, reaching 90% pruning with zero recall loss.

17 min read llm kv-cache memory-optimization

Mixture of Experts: How MoE Models Are Cheap to Run but Expensive to Host

DeepSeek V3 has 671B parameters but only activates 37B per token. Here's how mixture of experts works, why it cuts inference costs, and the catch nobody puts in the headline.

8 min read ai llm inference

Model selection and reasoning

Choosing the right model and testing whether it can actually reason.

The Best LLMs for Coding in 2026: An Engineering Review

Not all models are created equal for software development. Here is a benchmark-backed guide to choosing the right LLM for autonomous agents, algorithmic logic, and repository-scale refactoring as of March 2026.

7 min read ai llm technical-writing

Lambda Calculus as an AI Reasoning Benchmark

I have used lambda calculus to test whether AI systems can actually reason through composition, or whether they are just pattern-matching their way to plausible outputs.

7 min read ai reasoning benchmarking formal methods

Latency benchmarks

Profiling the full chain on real-time workloads.

The 800ms Barrier: Profiling the Latency Chain of a Real-Time Gemini 3.1 Voice Agent

I built a sub-second latency voice assistant and profiled every millisecond of the Audio-to-Audio request/response loop on a MacBook Air M2. Here is the bottleneck analysis.

21 min read voice-ai real-time gemini

Frequently asked questions

What is time to first token (TTFT)?

TTFT is how long a model takes to produce its first output token after a request. It is the latency users actually feel, because it controls how fast a response starts streaming. The TTFT post breaks down the chain that determines it.
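
To make the definition concrete, here is a minimal sketch of measuring TTFT on any token stream. It assumes the response arrives as a Python iterable of tokens; `measure_ttft` and `fake_stream` are hypothetical names for illustration, and `fake_stream` only simulates a streaming API with sleeps rather than calling a real one.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_ttft(stream: Iterable[str]) -> Tuple[float, float, int]:
    """Return (ttft_seconds, total_seconds, token_count) for a token stream."""
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in stream:
        if ttft is None:
            # The first token has arrived: this gap is the TTFT.
            ttft = time.perf_counter() - start
        count += 1
    total = time.perf_counter() - start
    if ttft is None:
        raise ValueError("stream produced no tokens")
    return ttft, total, count


def fake_stream(first_delay: float, per_token: float, n: int) -> Iterator[str]:
    """Hypothetical stand-in for a streaming response (not a real API)."""
    time.sleep(first_delay)  # simulated queueing + prompt processing
    for i in range(n):
        if i:
            time.sleep(per_token)  # simulated per-token decode time
        yield f"tok{i}"


if __name__ == "__main__":
    ttft, total, count = measure_ttft(fake_stream(0.05, 0.01, 5))
    print(f"TTFT {ttft * 1000:.0f} ms, total {total * 1000:.0f} ms, {count} tokens")
```

Swapping `fake_stream` for a real streaming client should leave `measure_ttft` unchanged, as long as the client exposes the response as an iterable.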

How does prompt caching reduce cost?

Prompt caching stores the processed form of a repeated prefix so the model skips recomputing it. When a large system prompt or document is reused across calls, caching can cut both cost and latency sharply. The prompt-caching post shows when the math works.
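
A rough sketch of that math, under stated assumptions: cached prefix reads are billed at a fraction of the normal input price, the first call pays a premium to write the cache, and every call lands while the cache is still warm. The multipliers, the price, and the function name `input_cost_usd` are illustrative placeholders, not any provider's actual pricing or API.

```python
def input_cost_usd(
    prefix_tokens: int,
    suffix_tokens: int,
    calls: int,
    price_per_mtok: float,
    cache_read_mult: float = 0.10,   # assumption: cached reads cost 10% of normal input
    cache_write_mult: float = 1.25,  # assumption: the first (write) call costs 25% extra
    use_cache: bool = True,
) -> float:
    """Input-token cost for `calls` requests sharing one prefix."""
    if calls < 1:
        raise ValueError("calls must be >= 1")
    if not use_cache:
        billed = calls * (prefix_tokens + suffix_tokens)
    else:
        billed = (
            prefix_tokens * cache_write_mult                   # first call writes the cache
            + (calls - 1) * prefix_tokens * cache_read_mult    # later calls read it
            + calls * suffix_tokens                            # the unique part is never cached
        )
    return billed * price_per_mtok / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload: 10k-token shared prefix, 200-token question, 100 calls.
    base = input_cost_usd(10_000, 200, 100, price_per_mtok=3.0, use_cache=False)
    cached = input_cost_usd(10_000, 200, 100, price_per_mtok=3.0)
    print(f"uncached ${base:.4f}, cached ${cached:.4f}, saved {1 - cached / base:.0%}")
```

Under these assumptions a single call costs more with caching than without, because of the write premium. The saving only appears once the prefix is reused.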

Why does context in the middle of a long window get ignored?

Models attend unevenly across a long context and tend to lose facts placed in the middle. A larger window does not fix this on its own. The context-windows and BEAM benchmark posts show the measured drop.

What is speculative decoding?

Speculative decoding uses a small fast model to draft several tokens that a larger model then verifies in one pass. It speeds up generation without changing the output, because the large model still has the final say.
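
The draft-then-verify loop can be sketched with toy models. This is a simplified greedy variant for illustration only: `target_model` and `draft_model` are made-up deterministic functions, not real LLMs, and production systems that sample tokens use a probabilistic accept/reject step instead of the exact-match check here. The verification loop stands in for what would be a single batched forward pass of the large model.

```python
from typing import Callable, List, Tuple

Model = Callable[[List[int]], int]  # maps a context to its greedy next token


def target_model(ctx: List[int]) -> int:
    """Toy 'large' model: an arbitrary deterministic next-token rule."""
    return (sum(ctx) * 31 + len(ctx)) % 50


def draft_model(ctx: List[int]) -> int:
    """Toy 'small' model: matches the target except at every 4th position."""
    t = target_model(ctx)
    return t if len(ctx) % 4 else (t + 1) % 50


def greedy_decode(model: Model, prompt: List[int], n: int) -> List[int]:
    out = list(prompt)
    for _ in range(n):
        out.append(model(out))
    return out[len(prompt):]


def speculative_decode(
    target: Model, draft: Model, prompt: List[int], n: int, k: int = 4
) -> Tuple[List[int], int]:
    """Return (generated_tokens, number_of_target_verification_passes)."""
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < n:
        # 1. The draft model proposes k tokens, one after another.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2. The target checks every proposed position (one pass in a real system).
        passes += 1
        accepted: List[int] = []
        for i, tok in enumerate(proposal):
            expected = target(out + proposal[:i])
            accepted.append(expected)  # the target's token is always the one kept
            if tok != expected:
                break  # first mismatch: discard the rest of the draft
        out.extend(accepted)
    return out[len(prompt):len(prompt) + n], passes


if __name__ == "__main__":
    prompt = [1, 2, 3]
    spec, passes = speculative_decode(target_model, draft_model, prompt, 20)
    assert spec == greedy_decode(target_model, prompt, 20)
    print(f"20 tokens in {passes} target passes, output identical to plain decoding")
```

Every kept token is the target model's own choice, so the output matches plain greedy decoding. The speedup comes from needing fewer target passes when the draft guesses well.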

Need a technical writer who gets this?

I write for DevTools and B2B SaaS companies. Let's talk.