If your assistant feels slow, users notice it before they care why. They are not thinking about throughput or model size. They are waiting for the first useful sign that the system is alive.
That delay has a name: time to first token, or TTFT.
It measures how long the system takes to start responding. A model can stream quickly once it gets going and still feel slow if it waits too long before the first token.
For interactive products, TTFT is usually the first latency metric I would fix. The main levers are smaller models, shorter prompts, cache reuse, better placement, and less queueing. Throughput work comes later.
What TTFT actually measures
TTFT is the time between sending a request and receiving the first generated token back from the model.
That definition is tidy because it collapses several steps into one number. The actual path has more stages:
- the request travels from the client to the inference service
- the request may wait in a queue
- the system tokenizes and validates the prompt
- the model runs the prefill pass over the input
- the server schedules the first decode step
- the first token travels back to the client
The OpenAI latency optimization guide treats latency as a mix of model choice, prompt size, and infrastructure behavior. That framing is the right one. TTFT is not one operation. It is a visible summary of several operations that happen before generation starts.
When the first token arrives late, the delay can come from network distance, queueing, prompt size, model size, cache misses, or server scheduling. The top-line number alone does not tell you which one is responsible.
TTFT is not the same as throughput
These metrics get blurred together too often.
| Metric | What it measures | What it changes in the product |
|---|---|---|
| TTFT | Delay before the first token appears | Whether the app feels responsive at the start |
| Throughput | Tokens generated per second after generation begins | How fast long answers stream |
| Total latency | Time until the entire response completes | End-to-end wait time |
A system can do well on throughput and still feel slow.
Suppose one path starts streaming in 350 milliseconds and emits 65 tokens per second. Another path starts in 2.1 seconds and emits 140 tokens per second. The second path wins on raw generation speed. The first path usually feels better in chat because the user sees output immediately.
That is the practical distinction. Once output is flowing, the interface has already crossed an important threshold. A long pause before that point creates more friction than a slightly slower stream after it.
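A quick back-of-envelope check makes the comparison concrete. This is a minimal sketch that assumes total time is just TTFT plus output tokens divided by a constant decode rate, which ignores jitter and rate changes mid-stream. The function names are mine, not from any library.

```python
def total_seconds(ttft_s: float, tokens_per_s: float, output_tokens: int) -> float:
    """Simplified model: wait for the first token, then stream at a constant rate."""
    return ttft_s + output_tokens / tokens_per_s


def break_even_tokens(ttft_a: float, rate_a: float, ttft_b: float, rate_b: float) -> float:
    """Output length at which both paths finish at the same time.

    Solves ttft_a + n / rate_a == ttft_b + n / rate_b for n.
    """
    return (ttft_b - ttft_a) / (1 / rate_a - 1 / rate_b)


# Path A: 350 ms TTFT at 65 tok/s. Path B: 2.1 s TTFT at 140 tok/s.
n_star = break_even_tokens(0.35, 65, 2.1, 140)

for n in (50, 200, 1000):
    a = total_seconds(0.35, 65, n)
    b = total_seconds(2.1, 140, n)
    print(f"{n:>5} tokens: path A {a:.2f}s, path B {b:.2f}s")
print(f"break-even at about {n_star:.0f} output tokens")
```

Under these assumptions the break-even point is roughly 212 output tokens. For shorter answers, the path with the faster start also finishes first, so the higher-throughput path only pays off on long responses.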
Why interactive apps should optimize TTFT first
Interactive products run on short loops. A user asks a question. A coding assistant suggests a fix. A support tool drafts a reply. An internal agent reports status. In each case, the model is part of a live workflow, not an offline batch job.
That changes the performance target.
For a batch summarization job, I care more about total runtime, cost, and tokens per second. For a streamed assistant, I care first about whether it starts quickly enough to keep the workflow moving.
TTFT usually matters most in products like these:
- chat assistants
- code copilots
- agent dashboards with live progress
- support tools that draft replies
- search products that stream synthesized answers
- internal AI tools used many times per hour
Throughput usually matters more in:
- offline summarization pipelines
- report generation jobs
- document processing queues
- evaluation runs
- asynchronous workloads where nobody watches the response stream live
The metric should match the workload. That sounds obvious, but a lot of latency work goes wrong right here.
What sits inside TTFT
When I am debugging TTFT, I split it into stages.
| Stage | What happens | Common source of delay |
|---|---|---|
| Network ingress | Request reaches the model service | Long physical distance, slow gateways |
| Queueing | Request waits for compute | Load spikes, oversubscribed GPUs |
| Prompt processing | Input is tokenized and prepared | Large prompts, tool-heavy contexts |
| Prefill | Model reads the input prompt | Large context windows, large models |
| First decode step | Model generates the first output token | Model compute cost, server scheduling |
| Network egress | First token reaches the client | Streaming path overhead |
This breakdown matters because each stage has different fixes. Prompt compression helps prefill. Regional placement helps network time. Warm capacity helps queueing. Speculative decoding helps when decode is the limiting factor.
Without that split, optimization turns into guesswork.
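One way to keep the split honest is to record each stage separately and ask which one dominates. The sketch below assumes you can get per-stage timings from your own tracing. The field names mirror the table above, and the sample numbers are invented for illustration.

```python
from dataclasses import dataclass, fields


@dataclass
class TtftBreakdown:
    """Per-stage timings in milliseconds for one request (hypothetical fields)."""
    network_ingress_ms: float
    queue_ms: float
    prompt_processing_ms: float
    prefill_ms: float
    first_decode_ms: float
    network_egress_ms: float

    def total_ms(self) -> float:
        """TTFT is the sum of every stage before the first token arrives."""
        return sum(getattr(self, f.name) for f in fields(self))

    def dominant_stage(self) -> tuple[str, float]:
        """Return the slowest stage and its share of total TTFT."""
        name = max(fields(self), key=lambda f: getattr(self, f.name)).name
        return name, getattr(self, name) / self.total_ms()


# Made-up numbers for illustration only.
sample = TtftBreakdown(40, 620, 15, 310, 25, 40)
stage, share = sample.dominant_stage()
print(f"TTFT {sample.total_ms():.0f} ms, dominated by {stage} ({share:.0%})")
```

In this invented sample, queueing accounts for more than half of the delay, so prompt compression or a smaller model would be the wrong first move.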
Model size is usually the first real lever
The cleanest TTFT improvement is often model selection.
Before a model can emit anything, it has to read the prompt. Larger models generally make that first pass more expensive. Long prompts make the effect worse. A heavier model may still be the right choice for deep analysis, but that does not make it the right default for a short interactive turn.
I usually frame the decision with three questions:
- does this request need the strongest model available
- does it need that model on the first turn
- can a faster model handle the interaction and escalate only when necessary
That is a better starting point than treating model quality as one ranking.
The OpenAI latency guide connects model choice directly to latency. That is not just an API concern. It is an architectural one. If every request routes to the biggest model by default, TTFT usually pays for it.
For interactive paths, I prefer a tiered setup:
| Request type | Better default |
|---|---|
| Short Q&A | Small or medium model with strong instruction following |
| Code suggestions | Fast coding model tuned for short turns |
| Agent planning | Mid-tier model, escalate when needed |
| Long analysis | Larger model where total quality matters more than start time |
A lot of TTFT work disappears once routing becomes sane.
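A tiered setup does not need to be elaborate. Here is a minimal routing sketch, assuming the request type is already known and that some upstream signal decides when to escalate. The tier names and the `needs_escalation` flag are placeholders, not real model identifiers.

```python
from dataclasses import dataclass

# Hypothetical tier names; substitute whatever models your stack exposes.
DEFAULT_TIER = {
    "short_qa": "small",
    "code_suggestion": "fast-code",
    "agent_planning": "mid",
    "long_analysis": "large",
}


@dataclass
class Route:
    model_tier: str
    escalated: bool


def choose_route(request_type: str, needs_escalation: bool = False) -> Route:
    """Pick the fast default for the request type; escalate only on demand."""
    tier = DEFAULT_TIER.get(request_type, "mid")  # unknown types get the mid tier
    if needs_escalation and tier != "large":
        return Route("large", escalated=True)
    return Route(tier, escalated=False)


print(choose_route("short_qa"))
print(choose_route("agent_planning", needs_escalation=True))
```

The point of the sketch is the default: the largest model is something a request opts into, not something every request pays for.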
Prompt size is a TTFT tax
Prompt length is one of the easiest TTFT costs to introduce and one of the hardest to notice once it becomes normal.
Conversation history grows. Retrieval systems add more chunks. Tool descriptions expand. System prompts accumulate policy, style, product context, and fallback instructions. Each change looks minor in isolation. The model still has to read all of it before it can answer.
When I look at a slow interactive path, these are the first questions I ask:
- does the request need the full conversation history
- are retrieved chunks ranked tightly enough
- are tool descriptions longer than they need to be
- is the system prompt doing work that belongs elsewhere
- is stale context hanging around because nobody pruned it
Shorter prompts are not automatically better. If you remove the context that makes the answer correct, the product gets faster and worse. The goal is to stop rereading irrelevant material on every turn.
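As one example of pruning, here is a sketch that keeps only the most recent conversation turns that fit a token budget. The four-characters-per-token estimate is a crude assumption standing in for a real tokenizer, and recency is only one possible relevance signal.

```python
def estimate_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: roughly 4 characters per token."""
    return max(1, len(text) // 4)


def trim_history(turns: list[str], budget_tokens: int) -> list[str]:
    """Keep the most recent turns that fit in the budget, preserving order."""
    kept: list[str] = []
    used = 0
    for turn in reversed(turns):  # walk from newest to oldest
        cost = estimate_tokens(turn)
        if used + cost > budget_tokens:
            break
        kept.append(turn)
        used += cost
    return list(reversed(kept))  # restore chronological order


# Four turns costing about 100, 50, 20, and 10 estimated tokens.
history = ["a" * 400, "b" * 200, "c" * 80, "d" * 40]
print([len(t) for t in trim_history(history, budget_tokens=40)])
```

A recency cutoff like this is blunt. If older turns carry facts the answer depends on, summarizing them is safer than dropping them.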
Prompt caching changes the economics of repeated context
Once a prefix repeats across requests, caching becomes one of the strongest practical TTFT levers.
Both Anthropic prompt caching and OpenAI prompt caching are built around the same idea: if the stable part of the prompt has already been processed, the platform can reuse that work instead of paying the full prefill cost again.
On the self-hosted side, vLLM documents automatic prefix caching, which applies the same idea at the serving layer.
This matters most when you have:
- long stable system prompts
- repeated tool definitions
- shared policy text across requests
- similar multi-turn structures where only the tail changes
Caching rewards prompt discipline. If the stable instructions sit in a consistent prefix, the system can reuse them. If the prompt gets rebuilt differently every time, the opportunity disappears.
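The ordering point is easy to demonstrate without calling any provider. The sketch below builds the same prompt two ways and measures how many leading characters two requests share. It is only an approximation: real caches match on tokens and apply their own granularity and minimum-length rules. The prompt text, tool definition, and function names are all hypothetical.

```python
import json
from os.path import commonprefix

# Hypothetical stable content; in practice this is your system prompt and tools.
SYSTEM_PROMPT = "You are a support assistant. Follow the refund policy."
TOOLS = [{"name": "lookup_order", "description": "Fetch an order by id."}]
# sort_keys keeps the serialized tool block byte-identical across requests.
STABLE_PREFIX = SYSTEM_PROMPT + "\n" + json.dumps(TOOLS, sort_keys=True)


def build_prompt(user_message: str, request_id: str) -> str:
    """Stable content first, per-request content last."""
    return f"{STABLE_PREFIX}\nrequest_id={request_id}\nuser: {user_message}"


def build_prompt_unstable(user_message: str, request_id: str) -> str:
    """Anti-pattern: a per-request value at the top breaks prefix reuse."""
    return f"request_id={request_id}\n{STABLE_PREFIX}\nuser: {user_message}"


def shared_prefix_chars(a: str, b: str) -> int:
    """Number of leading characters the two prompts have in common."""
    return len(commonprefix([a, b]))


good = shared_prefix_chars(build_prompt("hi", "r1"), build_prompt("help", "r2"))
bad = shared_prefix_chars(
    build_prompt_unstable("hi", "r1"), build_prompt_unstable("help", "r2")
)
print(f"shared prefix: stable ordering {good} chars, unstable ordering {bad} chars")
```

With the per-request value at the top, the two prompts diverge almost immediately, so there is nothing for a prefix cache to reuse even though most of the content is identical.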
Speculative decoding helps, but not with every TTFT problem
Speculative decoding works by having a smaller draft model propose several candidate tokens, which the larger target model then verifies together. When the guesses line up, several tokens are accepted at once and generation moves faster.
vLLM’s speculative decoding documentation is a useful reference for how the serving path works.
The practical limit is simple. Speculative decoding helps when decode speed is the bottleneck. It does not fix queueing, long cross-region hops, or prompt-heavy prefill.
I would reach for it after answering two questions:
- Is decode a meaningful part of the delay before the response feels usable?
- Have I already reduced prompt and placement overhead?
If you have not yet reduced prompt and placement overhead, there are usually easier wins first.
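To show why this is a decode-side technique, here is a toy greedy version of draft-and-verify. It is a sketch under heavy assumptions: the "models" are plain functions over integer tokens, verification is a loop where a real server would score all drafted positions in one batched pass, and the bonus token that real implementations emit after a fully accepted draft is left out. Nothing here reflects vLLM's actual API.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def greedy_decode(target: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one target call per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target(seq))
    return seq


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], n_new: int, k: int = 4) -> List[int]:
    """Greedy draft-and-verify; output matches greedy decoding with `target`."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    while len(seq) < goal:
        # 1. The draft model proposes up to k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(seq))):
            proposal.append(draft(seq + proposal))
        # 2. The target checks each drafted position in order.
        for i, tok in enumerate(proposal):
            expected = target(seq + proposal[:i])
            if tok != expected:
                # Keep the accepted prefix and substitute the target's token.
                proposal = proposal[:i] + [expected]
                break
        seq.extend(proposal)
    return seq


def target_model(seq: List[int]) -> int:
    """Toy target: counts upward modulo 10."""
    return (seq[-1] + 1) % 10


def draft_model(seq: List[int]) -> int:
    """Toy draft: agrees with the target except after tokens 3 and 7."""
    return 0 if seq[-1] % 4 == 3 else (seq[-1] + 1) % 10


out = speculative_decode(target_model, draft_model, [0], n_new=12)
print(out == greedy_decode(target_model, [0], n_new=12))
```

Note what the loop never touches: the prompt still has to be prefilled before the first drafted token can be verified, which is why this technique does little for queueing or prompt-heavy delay.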
Placement still matters
TTFT is shaped by distance and contention.
If the user is in one region and the model runs in another, the request has to cross that distance on the way in and the first token has to cross it again on the way back. If the serving fleet is saturated, the request waits. If the system has to cold-start a worker or load weights under pressure, the user waits longer.
That is why placement and capacity planning belong in any serious TTFT discussion.
These checks are worth doing early:
- is the inference endpoint close to the users who need low latency
- is the traffic pattern creating queue spikes at peak time
- is autoscaling slow enough to expose cold paths
- is batching tuned for utilization at the expense of per-request responsiveness
- is a gateway or proxy adding work that the user does not benefit from
Many serving optimizations improve machine efficiency. That does not mean they improve the first moment of the product experience.
Throughput optimizations and TTFT optimizations are different work
Some changes improve the time before the first token. Some improve generation after it starts. A few help both.
| Optimization | Helps TTFT | Helps throughput | Notes |
|---|---|---|---|
| Smaller model | Yes | Often | Best first check for interactive paths |
| Shorter prompts | Yes | Sometimes | Mainly reduces prefill cost |
| Prompt or prefix caching | Yes | Sometimes | Strong when prompts share a stable prefix |
| Regional placement | Yes | No | Cuts network delay and can reduce queueing |
| Warm capacity | Yes | No | Avoids queue spikes and cold starts |
| Speculative decoding | Sometimes | Yes | Mainly speeds up tokens after the first; best when decode is the constraint |
| Large dynamic batching | Often no | Yes | Can make TTFT worse |
For interactive systems, I start on the TTFT side of that table before I chase throughput.
A practical order for TTFT work
When I need to reduce TTFT, I use this order:
1. Measure TTFT separately from total latency. A blended number hides too much.
2. Check queueing and regional placement. If the request is waiting before execution, model tweaks will not save it.
3. Measure prompt size and prefill cost. Large contexts explain more latency than teams expect.
4. Test a smaller or faster model. This can cut both prefill and decode cost.
5. Introduce prompt or prefix caching where the prefix is stable.
6. Explore speculative decoding if decode still limits responsiveness.
This order avoids a common mistake: reaching for a sophisticated serving trick before confirming that the prompt is twice as large as it should be.
How to measure TTFT without fooling yourself
TTFT sounds easy to measure, but client-side timing and server-side timing get mixed together quickly.
A useful measurement setup records:
- request sent timestamp
- first byte received timestamp
- first token rendered timestamp in the client
- total response completion timestamp
- queue time if the serving system exposes it
- prompt token count and output token count
- model name and region
Those fields let you answer concrete questions. If TTFT rises with prompt size, prefill is likely involved. If TTFT spikes only during high concurrency, queueing is a better suspect. If the first-byte timestamp is stable but client render time is erratic, the issue may sit in the streaming path or frontend.
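On the client side, the core of that setup is a thin wrapper around the response stream. The sketch below assumes the stream is any iterable of text chunks. The `fake_model_stream` generator stands in for a real streaming response, and the class and function names are my own.

```python
import time
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass
class StreamTiming:
    sent_at: float
    first_chunk_at: Optional[float] = None
    done_at: Optional[float] = None
    chunks: int = 0

    @property
    def ttft_s(self) -> Optional[float]:
        if self.first_chunk_at is None:
            return None
        return self.first_chunk_at - self.sent_at

    @property
    def total_s(self) -> Optional[float]:
        if self.done_at is None:
            return None
        return self.done_at - self.sent_at


def timed_stream(stream: Iterable[str], timing: StreamTiming) -> Iterator[str]:
    """Yield chunks unchanged while recording first-content and completion times."""
    for chunk in stream:
        # Count the first non-empty chunk so keep-alives do not register as a token.
        if chunk and timing.first_chunk_at is None:
            timing.first_chunk_at = time.perf_counter()
        timing.chunks += 1
        yield chunk
    timing.done_at = time.perf_counter()


def fake_model_stream() -> Iterator[str]:
    """Stand-in for a real streaming response (hypothetical)."""
    time.sleep(0.05)  # simulated wait before the first token
    for word in ("hello", " ", "world"):
        yield word
        time.sleep(0.01)


timing = StreamTiming(sent_at=time.perf_counter())
text = "".join(timed_stream(fake_model_stream(), timing))
print(f"TTFT {timing.ttft_s:.3f}s, total {timing.total_s:.3f}s, chunks {timing.chunks}")
```

This captures what the client observes, not what the server did. Queue time and prefill time still have to come from server-side telemetry if the serving system exposes them.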
For products with multiple model routes, I also want TTFT broken out by route. One slow path can disappear inside an average.
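A per-route report can be as simple as grouping samples and computing percentiles. This sketch uses the standard library's `statistics.quantiles`, which needs at least two samples per route on most Python versions. The route names and TTFT values are invented.

```python
from collections import defaultdict
from statistics import quantiles


def ttft_by_route(samples):
    """Group (route, ttft_ms) pairs and report p50 and p95 per route."""
    grouped = defaultdict(list)
    for route, ttft_ms in samples:
        grouped[route].append(ttft_ms)
    report = {}
    for route, values in grouped.items():
        # n=100 returns 99 cut points; index 49 is p50 and index 94 is p95.
        cuts = quantiles(values, n=100, method="inclusive")
        report[route] = {"p50": cuts[49], "p95": cuts[94], "count": len(values)}
    return report


# Invented samples: a fast route and a much slower one.
samples = [("small", v) for v in (300, 310, 320, 330, 340)]
samples += [("large", v) for v in (900, 1100, 2400, 2600, 3000)]

report = ttft_by_route(samples)
for route, stats in report.items():
    print(route, stats)
```

Percentiles matter more than means here, because a handful of multi-second waits is exactly what users remember.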
Where teams usually waste time
One failure mode shows up again and again. The team sees latency complaints and optimizes the easiest number to benchmark. That is often output tokens per second. The chart improves. The app still feels slow because the painful part was the wait before the first visible response.
Prompt growth is another trap. Retrieval adds more context. Tools add more schemas. Product logic adds more instructions. Quality can improve for a while, but TTFT drifts upward until the product feels heavier than it should.
The third trap is sending every request through one model path. That keeps routing simple, but it forces very different workloads into the same latency budget.
These are architectural mistakes, not tuning details.
What I would optimize for different workloads
| Workload | Primary metric | Secondary metric |
|---|---|---|
| Chat assistant | TTFT | Total latency |
| Coding copilot | TTFT | Suggestion quality |
| Agent progress UI | TTFT | Reliability |
| Async document summarization | Throughput | Cost |
| Batch evaluation pipeline | Throughput | Total runtime |
| Long report generation | Total latency | Cost |
A single stack can serve all of these workloads, but it should not treat them the same way. If one latency strategy gets applied to all of them, one class of workload will get a worse tradeoff than it needs.
The practical takeaway
For interactive systems, TTFT is usually the first metric to fix.
The checklist is short:
- choose a model that fits the interaction, not just the benchmark
- keep prompts smaller and more stable than they naturally become over time
- use caching when repeated prefixes make it worthwhile
- keep inference close to the user and avoid avoidable queueing
- treat speculative decoding as a targeted optimization, not a default answer
Once those pieces are in place, throughput work becomes more useful.
Frequently asked questions
What is a good TTFT for an LLM app?
There is no single threshold that fits every product. For interactive apps, lower is usually better. Once the delay pushes past about a second, the pause becomes much more noticeable. The acceptable number depends on the value of the answer and how often the user has to wait for it.
Is TTFT the same as first byte latency?
They are close, but not always identical. First byte latency refers to when the first byte of the response arrives from the server, which can be headers or a non-token event rather than generated text. TTFT refers to when the first generated token becomes available. Buffering in the streaming path and client rendering can make the two differ.
Does prompt caching always reduce TTFT?
No. It helps when requests share a stable prefix that the system can reuse. If the prompt structure changes significantly each time, the cache hit rate may be too low to matter.
Does speculative decoding always improve responsiveness?
No. It helps when decode speed is a meaningful bottleneck. If the delay is mostly queueing, network time, or prefill, the impact on TTFT will be limited.
Should I optimize TTFT or total latency first?
For interactive apps, I would usually start with TTFT because it drives perceived responsiveness. For offline or asynchronous workloads, total latency and throughput are often more important.