The Tech Behind Our Low-Latency Voice Engine
A deep dive into how we achieved sub-400ms latency using Deepgram and Fish Audio deployed on edge networks.
When building a voice-first AI tutor, latency is everything. A 2-second delay between a student's question and the AI's response breaks the conversational flow and destroys engagement. Here's how we got it under 400ms.
The Architecture
Our voice pipeline has three stages, each optimized independently:
- Speech-to-Text (STT): We use Deepgram's Nova-2 model with streaming transcription. The model begins processing audio chunks as they arrive, rather than waiting for the full utterance.
- LLM Processing: Our teaching agent runs on optimized inference endpoints with streaming responses. We begin TTS synthesis on the first sentence while the LLM is still generating the rest.
- Text-to-Speech (TTS): Fish Audio provides natural-sounding voices with support for multiple African accents. We pre-warm the TTS connection to eliminate cold-start delays.
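The LLM-to-TTS handoff in stage two can be sketched roughly as follows. This is a minimal illustration of sentence-level chunking, not our production code: the token stream and the sentence-boundary heuristic are simplified assumptions (a naive punctuation check will mis-split abbreviations like "Dr.").

```python
import re

# Naive sentence boundary: buffer ends with ., !, or ? (illustrative only).
SENTENCE_END = re.compile(r"[.!?]\s*$")

def stream_sentences(token_stream):
    """Group streamed LLM tokens into sentences, so TTS synthesis can
    start on the first sentence while the rest is still generating."""
    buffer = ""
    for token in token_stream:
        buffer += token
        if SENTENCE_END.search(buffer):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():  # flush any trailing partial sentence
        yield buffer.strip()

# Tokens as an LLM might emit them, a few characters at a time (made up).
tokens = ["Photosynthesis", " converts", " light", " into", " energy.",
          " Plants", " use", " chlorophyll", " to", " absorb", " it."]
sentences = list(stream_sentences(tokens))
# The first sentence is ready for TTS before the second finishes streaming.
```

In a real pipeline each yielded sentence would be dispatched to the TTS service immediately, which is what hides most of the LLM's generation time behind audio playback.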
Edge Deployment
We deploy our voice proxy servers on edge networks close to our primary user base in West Africa. This alone shaved 150ms off the round-trip time compared to US-based servers.
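Latency-based routing of this kind can be sketched as below. The hostnames and the TCP-handshake probe are hypothetical stand-ins for illustration, not our actual infrastructure or measurement method.

```python
import socket
import time

def measure_rtt_ms(host, port=443, timeout=2.0):
    """Rough round-trip estimate: time one TCP handshake to the host."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass
    return (time.perf_counter() - start) * 1000

def pick_nearest(hosts, probe=measure_rtt_ms):
    """Route the client to the edge region with the lowest measured RTT."""
    return min(hosts, key=probe)

# Hypothetical regions; the probe is injectable so it can be stubbed in tests.
regions = ["us-east.example.com", "eu-west.example.com", "lagos-edge.example.com"]
```

Making the probe injectable keeps the selection logic testable without live network calls; in production you would also cache the choice per client rather than probing on every session.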
The Result
Our p95 end-to-end latency (from end of student speech to start of AI audio response) is now 380ms. This makes conversations feel genuinely natural and keeps students engaged for longer sessions.
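For readers unfamiliar with the metric, p95 is the latency that 95% of requests beat. Here is a minimal nearest-rank sketch with synthetic sample data (interpolating percentile definitions, as in NumPy's default, differ slightly):

```python
def p95(samples_ms):
    """Nearest-rank 95th percentile of a list of latency samples."""
    ordered = sorted(samples_ms)
    rank = int(0.95 * (len(ordered) - 1))
    return ordered[rank]

# 100 synthetic latency samples: 300, 301, ..., 399 ms (illustrative only).
samples = list(range(300, 400))
print(p95(samples))  # → 394
```

We track p95 rather than the mean because conversational feel is set by the worst exchanges a student regularly hits, not the average one.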