[Figure: sequence diagram of the voice pipeline. Participants: User, VAD + STT, LLM, TTS. Labeled steps, in order: Audio Input; VAD fires (+510ms); Transcript; TTFT (+320ms); Text → Audio.]