If your AI pauses for even a second before answering, users will instinctively repeat themselves, assuming it didn't hear them. Breaking the 500-millisecond response barrier is the holy grail of conversational AI.
Why 500ms Matters
Human conversational turn-taking happens with gaps of roughly 200-300 milliseconds. Sometimes, the gap is negative (we interrupt or talk over each other). If an AI takes longer than 500ms to respond, the illusion of a fluid conversation is broken. The interaction feels like a walkie-talkie rather than a phone call.
The Stack for Speed
To achieve sub-500ms latency from the moment the user stops speaking to the moment the AI's first audio byte hits their speaker, you must optimize every hop in the network:
- VAD Tuning: Standard VAD waits 500ms to ensure the user is actually done speaking. You must tune this down to ~250ms and accept that the AI might occasionally cut the user off early, mitigating this with proactive "listening" sounds.
- STT: Deepgram Nova. Currently among the fastest streaming STT options, capable of returning a final transcript within tens of milliseconds of the end of speech.
- LLM: Claude 3.5 Haiku or Groq. You need an LLM with minimal Time-To-First-Token (TTFT). Groq's LPU hardware streaming Llama 3 is incredibly fast, while Claude 3.5 Haiku offers an excellent balance of speed and reasoning.
- TTS: Cartesia Sonic. Cartesia was built specifically for real-time streaming, achieving synthesis in under 150ms.
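Putting the stages together, the end-to-end target can be sketched as a simple per-stage budget. The numbers below are illustrative targets consistent with the figures above, not measurements from any specific deployment:

```python
# Illustrative latency budget for one voice-agent turn.
# All values are assumed targets, not measured figures.
BUDGET_MS = {
    "vad_endpointing": 250,  # tuned-down silence window before committing
    "stt_final": 30,         # streaming STT finalization after end of speech
    "llm_ttft": 80,          # time to first token from the LLM
    "tts_first_audio": 100,  # time to first synthesized audio chunk
    "network_overhead": 30,  # WebRTC + inter-service hops (assumed)
}

def total_latency_ms(budget: dict) -> int:
    """Sum the per-stage budgets into an end-to-end figure."""
    return sum(budget.values())

if __name__ == "__main__":
    total = total_latency_ms(BUDGET_MS)
    print(f"End-to-end budget: {total}ms (target: <500ms)")
```

Note that the VAD endpointing window alone consumes half the budget, which is why every downstream stage has to be streaming-first.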
The Geographic Constraint
You cannot beat physics. If your WebRTC server is in Virginia, your LLM endpoint is in California, and your user is in London, you will fail. True sub-500ms latency requires co-locating your orchestration servers in the same AWS/GCP regions as the LLM and TTS endpoints you are querying.
Build High-Performance AI
We deploy globally distributed Voice AI infrastructure configured for maximum speed.
Discuss Global Deployment
