[Figure: sequence diagram. Participants: User, VAD + STT, LLM, TTS. Steps in order: Audio Input; VAD fires (+510 ms); Transcript; TTFT (+320 ms); Text → Audio.]
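As a rough illustration of how the two delays in the diagram combine, the sketch below sums them in order. It assumes the offsets are additive, with each one measured from the end of the previous step, which the diagram does not state explicitly. `Stage` and `cumulative_latency` are hypothetical names, and the TTS step is left out because the diagram gives no figure for it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    delay_ms: int  # delay added by this stage, in milliseconds


def cumulative_latency(stages):
    """Return (stage name, running total in ms) for each stage, in order."""
    total = 0
    timeline = []
    for stage in stages:
        total += stage.delay_ms
        timeline.append((stage.name, total))
    return timeline


# Values taken from the diagram; treating them as additive is an assumption.
STAGES = [
    Stage("VAD fires", 510),
    Stage("LLM first token (TTFT)", 320),
]

for name, elapsed_ms in cumulative_latency(STAGES):
    print(f"{name}: {elapsed_ms} ms")
```

Under that assumption, the first LLM token arrives about 830 ms after the point from which the VAD delay is measured.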