What is time to first token (TTFT)?
TTFT is how long a model takes to produce its first output token after a request. It is the latency users actually feel, because it controls how fast a response starts streaming. The TTFT post breaks down the chain that determines it.
How does prompt caching reduce cost?
Prompt caching stores the processed form of a repeated prefix so the model skips recomputing it. When a large system prompt or document is reused across calls, caching can cut both cost and latency sharply. The prompt-caching post shows when the math works.
Why does context in the middle of a long window get ignored?
Models attend unevenly across a long context and tend to lose facts placed in the middle. A larger window does not fix this on its own. The context-windows and BEAM benchmark posts show the measured drop.
What is speculative decoding?
Speculative decoding uses a small fast model to draft several tokens that a larger model then verifies in one pass. It speeds up generation without changing the output, because the large model still has the final say.