How I built a sub-500ms latency voice agent from scratch | Nick Tikhonov

Hacker News
March 2, 2026
AI-Generated Deep Dive Summary
Nick Tikhonov shares his experience building a voice agent from scratch, achieving sub-500ms latency, roughly 2× faster than his equivalent setup on Vapi. Having started with off-the-shelf platforms such as ElevenLabs and Vapi, he found that these all-in-one solutions often mask the underlying complexity, leaving developers in the dark about where latency actually comes from. His experiment wired speech-to-text (STT), a language model (LLM), and text-to-speech (TTS) into a real-time pipeline of his own.

Voice agents pose challenges that text-based agents do not, because conversation is continuous and real-time. The orchestration layer must seamlessly manage transitions between states, detecting when the user is speaking versus listening and responding with minimal delay. That includes canceling in-flight LLM generation and speech synthesis the moment the user interrupts, while tolerating background noise and filler sounds that should not count as interruptions.

By building his own orchestration layer, Tikhonov achieved response times of around 400ms and gained insight into how model selection and the geography of hosted services affect end-to-end performance. His build underscores that real-time coordination between components, not any single model, is the hard part of voice agents. For developers, understanding this architecture rather than relying solely on an all-in-one SDK enables better decision-making and improved user experiences.
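The orchestration behavior described above can be sketched as a small state machine: the agent listens, thinks, and speaks, and a barge-in event must cancel any in-flight generation or synthesis. The sketch below is a minimal illustration using Python's asyncio; the component names (`fake_llm`, `fake_tts`) are placeholders standing in for real STT/LLM/TTS services, not any actual SDK's API.

```python
import asyncio

class Orchestrator:
    """Minimal voice-agent orchestration loop: LISTENING -> THINKING -> SPEAKING,
    with barge-in cancellation of in-flight LLM + TTS work."""

    def __init__(self):
        self.state = "LISTENING"
        self._task = None  # in-flight respond() pipeline, if any

    async def on_final_transcript(self, text):
        # STT finished an utterance: start generating a reply.
        self.state = "THINKING"
        self._task = asyncio.create_task(self._respond(text))

    async def on_user_speech_started(self):
        # Barge-in: the user spoke while we were thinking or speaking,
        # so cancel generation and synthesis immediately.
        if self._task and not self._task.done():
            self._task.cancel()
            try:
                await self._task
            except asyncio.CancelledError:
                pass
        self.state = "LISTENING"

    async def _respond(self, text):
        reply = await fake_llm(text)   # placeholder for a streaming LLM call
        self.state = "SPEAKING"
        await fake_tts(reply)          # placeholder for TTS audio playback
        self.state = "LISTENING"

async def fake_llm(text):
    await asyncio.sleep(0.05)          # stand-in for model latency
    return f"echo: {text}"

async def fake_tts(reply):
    await asyncio.sleep(0.2)           # stand-in for audio playback time

async def demo():
    orch = Orchestrator()
    await orch.on_final_transcript("hello")
    await asyncio.sleep(0.1)             # LLM done, TTS now "playing"
    assert orch.state == "SPEAKING"
    await orch.on_user_speech_started()  # user interrupts mid-playback
    assert orch.state == "LISTENING"

asyncio.run(demo())
```

A real implementation would drive these callbacks from streaming STT events and voice-activity detection; the key design point the post highlights is that cancellation must propagate through the whole pipeline, not just stop audio output.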