The Tech Behind Our Low-Latency Voice Engine
A deep dive into how we achieved sub-400ms latency using Deepgram and Fish Audio deployed on edge networks.
When building a voice-first AI tutor, latency is everything. A 2-second delay between a student's question and the AI's response breaks the conversational flow and destroys engagement. Here's how we got it under 400ms.
The Architecture
Our voice pipeline has three stages, each optimized independently:
- Speech-to-Text (STT): We use Deepgram's Nova-2 model with streaming transcription. The model begins processing audio chunks as they arrive, rather than waiting for the full utterance.
- LLM Processing: Our teaching agent runs on optimized inference endpoints with streaming responses. We begin TTS synthesis on the first sentence while the LLM is still generating the rest.
- Text-to-Speech (TTS): Fish Audio provides natural-sounding voices with support for multiple African accents. We pre-warm the TTS connection to eliminate cold-start delays.
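The LLM-to-TTS handoff in stage two can be sketched roughly as follows. This is a minimal illustration of sentence-level chunking, not our production code: the token stream and the sentence-boundary heuristic are simplified assumptions (a naive punctuation check will mis-split abbreviations like "Dr.").

```python
import re

# Naive sentence boundary: buffer ends with ., !, or ? (illustrative only).
SENTENCE_END = re.compile(r"[.!?]\s*$")

def stream_sentences(token_stream):
    """Group streamed LLM tokens into sentences, so TTS synthesis can
    start on the first sentence while the rest is still generating."""
    buffer = ""
    for token in token_stream:
        buffer += token
        if SENTENCE_END.search(buffer):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():  # flush any trailing partial sentence
        yield buffer.strip()

# Tokens as an LLM might emit them, a few characters at a time (made up).
tokens = ["Photosynthesis", " converts", " light", " into", " energy.",
          " Plants", " use", " chlorophyll", " to", " absorb", " it."]
sentences = list(stream_sentences(tokens))
# The first sentence is ready for TTS before the second finishes streaming.
```

In a real pipeline each yielded sentence would be dispatched to the TTS service immediately, which is what hides most of the LLM's generation time behind audio playback.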
Edge Deployment
We deploy our voice proxy servers on edge networks close to our primary user base in West Africa. This alone shaved 150ms off the round-trip time compared to US-based servers.
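Latency-based routing of this kind can be sketched as below. The hostnames and the TCP-handshake probe are hypothetical stand-ins for illustration, not our actual infrastructure or measurement method.

```python
import socket
import time

def measure_rtt_ms(host, port=443, timeout=2.0):
    """Rough round-trip estimate: time one TCP handshake to the host."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass
    return (time.perf_counter() - start) * 1000

def pick_nearest(hosts, probe=measure_rtt_ms):
    """Route the client to the edge region with the lowest measured RTT."""
    return min(hosts, key=probe)

# Hypothetical regions; the probe is injectable so it can be stubbed in tests.
regions = ["us-east.example.com", "eu-west.example.com", "lagos-edge.example.com"]
```

Making the probe injectable keeps the selection logic testable without live network calls; in production you would also cache the choice per client rather than probing on every session.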
The Result
Our p95 end-to-end latency (from end of student speech to start of AI audio response) is now 380ms. This makes conversations feel genuinely natural and keeps students engaged for longer sessions.
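For readers unfamiliar with the metric, p95 is the latency that 95% of requests beat. Here is a minimal nearest-rank sketch with synthetic sample data (interpolating percentile definitions, as in NumPy's default, differ slightly):

```python
def p95(samples_ms):
    """Nearest-rank 95th percentile of a list of latency samples."""
    ordered = sorted(samples_ms)
    rank = int(0.95 * (len(ordered) - 1))
    return ordered[rank]

# 100 synthetic latency samples: 300, 301, ..., 399 ms (illustrative only).
samples = list(range(300, 400))
print(p95(samples))  # → 394
```

We track p95 rather than the mean because conversational feel is set by the worst exchanges a student regularly hits, not the average one.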