We’ve been developing SABOT, our intelligent voice assistant tailored for industrial machines. With SABOT, machine operators can speak naturally to their equipment: give commands, ask questions, receive updates, all through voice.
But to make SABOT truly viable in a factory setting, we needed high-quality speech capabilities that could function offline, be hosted on-premise for privacy and latency reasons, and provide our customers with full control over the solution.
And not just any solution: we needed one that could run locally, perform reliably in industrial environments, and integrate seamlessly with our assistant's natural language interface. As a team, we decided to build it ourselves. In this post, I will walk you through how we developed our real-time Speech API, what its architecture looks like, and what we learned along the way.
That’s how the project began. What started as a requirement for SABOT evolved into a standalone component: a modular Speech API that performs both real-time transcription and voice synthesis. We now use it in SABOT and plan to reuse it in future customer projects.
Here’s a high-level overview of the system:
We designed the API to transcribe live microphone input in real time, convert text responses into natural-sounding speech, operate efficiently with low latency and strong language support, integrate easily with systems like SABOT, and run entirely on local infrastructure. It’s now part of SABOT’s speech pipeline, but we’ve built it to be reusable across different setups.
Under the hood, the system is modular and scalable. Here’s a full architectural breakdown:
Our speech processing system is divided into three distinct layers, each optimized for a specific part of the voice interaction pipeline.
The client layer runs on the user’s device and manages all user-facing interactions through a coordinated set of components. The Audio Recorder captures and formats raw microphone input, converting it to PCM (Pulse-Code Modulation) for processing. Real-time communication is handled by our WebSocket Client, which streams audio chunks to the backend, while the Recognition Result Receiver processes the incoming transcribed text. For the reverse flow, the Request Processor and REST Client send text for synthesis and manage the resulting audio playback through the Audio Player, which delivers speech directly to the device’s speaker.
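To make this concrete, here is a minimal sketch of the client-side streaming flow. It uses sounddevice for microphone capture and the websockets library purely as illustrative choices; the endpoint URL and the JSON message format are assumptions, not our exact protocol.

```python
# Minimal sketch of the client layer: capture PCM from the microphone,
# stream it over a WebSocket, and print transcripts as they arrive.
# The ws://localhost:8000/stt endpoint and message shape are hypothetical.
import asyncio
import json

import sounddevice as sd
import websockets

SAMPLE_RATE = 16_000                       # 16 kHz mono, 16-bit PCM
CHUNK_MS = 100                             # stream audio in 100 ms chunks
CHUNK_FRAMES = SAMPLE_RATE * CHUNK_MS // 1000

async def stream_microphone(url: str = "ws://localhost:8000/stt") -> None:
    audio_queue: asyncio.Queue[bytes] = asyncio.Queue()
    loop = asyncio.get_running_loop()

    def on_audio(indata, frames, time_info, status):
        # Runs on the audio thread: hand the raw PCM bytes to the event loop.
        loop.call_soon_threadsafe(audio_queue.put_nowait, bytes(indata))

    async with websockets.connect(url) as ws:
        with sd.RawInputStream(samplerate=SAMPLE_RATE, channels=1,
                               dtype="int16", blocksize=CHUNK_FRAMES,
                               callback=on_audio):
            async def sender():
                while True:
                    await ws.send(await audio_queue.get())

            async def receiver():
                async for message in ws:
                    result = json.loads(message)
                    print("Transcript:", result.get("text", ""))

            await asyncio.gather(sender(), receiver())

if __name__ == "__main__":
    asyncio.run(stream_microphone())
```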
The Speech API is the computational core of our system. Voice Activity Detection (VAD) acts as an intelligent gatekeeper, filtering out silence and background noise to ensure that only relevant speech reaches the recognition engine. This keeps the system efficient and accurate. Our Speech Processor and Recognition Engine rely on OpenAI’s Whisper models, chosen for their speed, accuracy, and multilingual support. These models can also be fine-tuned for domain-specific vocabularies, making them ideal for specialized applications.
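As a rough illustration of how the VAD gate sits in front of recognition, here is a simplified sketch using webrtcvad and the openai-whisper package. Our production engine, buffering, and thresholds differ, so treat this as a sketch of the idea rather than our implementation.

```python
# Sketch: only run Whisper when the buffered audio actually contains speech.
import numpy as np
import webrtcvad
import whisper

SAMPLE_RATE = 16_000
FRAME_MS = 30                                       # webrtcvad accepts 10/20/30 ms frames
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2    # 16-bit mono

vad = webrtcvad.Vad(2)                              # aggressiveness: 0 (lenient) to 3 (strict)
model = whisper.load_model("base")                  # multilingual Whisper checkpoint

def transcribe_if_speech(pcm: bytes) -> str | None:
    """Return a transcript, or None if the audio is mostly silence/noise."""
    frames = [pcm[i:i + FRAME_BYTES]
              for i in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES)]
    voiced = sum(vad.is_speech(frame, SAMPLE_RATE) for frame in frames)
    if not frames or voiced / len(frames) < 0.3:
        return None                                 # skip recognition entirely

    # Whisper expects float32 samples in [-1, 1].
    audio = np.frombuffer(pcm, dtype=np.int16).astype(np.float32) / 32768.0
    return model.transcribe(audio, fp16=False)["text"].strip()
```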
For text-to-speech, our Text Processor and Synthesizer Engine use the StyleTTS2 architecture — an open-source solution supporting multiple voice profiles across genders and speaking styles. This produces a natural, pleasant audio experience. The Audio Streamer completes the cycle by streaming the generated audio back to the client for immediate playback.
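To give a feel for the synthesis path, here is a hedged sketch of a streaming TTS endpoint built with FastAPI. The synthesize() function is a placeholder standing in for the StyleTTS2-based engine, whose exact API we won't reproduce here.

```python
# Sketch of the Audio Streamer: wrap synthesized samples as WAV and stream them back.
import io

import numpy as np
import soundfile as sf
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()

class SynthesisRequest(BaseModel):
    text: str
    voice: str = "default"

def synthesize(text: str, voice: str) -> np.ndarray:
    """Placeholder for the StyleTTS2-based engine: float32 samples at 24 kHz."""
    raise NotImplementedError

def chunked(stream: io.BytesIO, size: int = 4096):
    # Yield the encoded audio in small chunks so playback can start early.
    while chunk := stream.read(size):
        yield chunk

@app.post("/tts")
def tts_endpoint(request: SynthesisRequest) -> StreamingResponse:
    samples = synthesize(request.text, request.voice)
    wav = io.BytesIO()
    sf.write(wav, samples, samplerate=24_000, format="WAV")
    wav.seek(0)
    return StreamingResponse(chunked(wav), media_type="audio/wav")
```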
Our Model Registry maintains all models in an organized, easily accessible format, enabling seamless switching and updates. This design choice keeps the system highly modular, whether we’re adding new voices, switching languages, or optimizing for specific hardware setups. The registry manages both STT and TTS models, each supporting different voices and styles for diverse application needs.
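The registry itself can stay quite small. The sketch below shows the general idea, with illustrative names rather than our actual implementation: models are registered with their metadata and loaded lazily on first use, so swapping a Whisper size or a TTS voice is just another register() call.

```python
# Sketch of a lazy-loading model registry for STT and TTS models.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class ModelEntry:
    kind: str                                   # "stt" or "tts"
    language: str                               # e.g. "en", "de"
    voice: Optional[str] = None                 # TTS voice profile, if applicable
    loader: Callable[[], object] = lambda: None # factory that loads the model

class ModelRegistry:
    """Keeps model metadata in one place and loads models on first use."""

    def __init__(self) -> None:
        self._entries: dict[str, ModelEntry] = {}
        self._cache: dict[str, object] = {}

    def register(self, name: str, entry: ModelEntry) -> None:
        self._entries[name] = entry

    def get(self, name: str) -> object:
        if name not in self._cache:
            self._cache[name] = self._entries[name].loader()
        return self._cache[name]

# Example (illustrative):
# registry = ModelRegistry()
# registry.register("stt-en-base",
#                   ModelEntry(kind="stt", language="en",
#                              loader=lambda: whisper.load_model("base")))
```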
From the outset, we aimed for a local-first system. In industrial settings, this means no cloud dependency — machines can still understand and respond to voice commands even without internet access. It also means full data privacy, with all audio and text staying on-site, lower latency for faster responses, and complete control over models, voices, and behavior. This decision shaped every technical choice we made and led us to develop a robust, self-contained solution.
The system operates seamlessly from the moment an operator speaks to when the response is heard. For speech-to-text, the operator speaks into a microphone, and the audio is streamed to the backend. VAD filters out silence and noise, and the recognition engine transcribes the relevant parts. The transcribed text is then returned to the client.
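Put together, the server side of this flow can be sketched as a single WebSocket endpoint, here assuming FastAPI and reusing the hypothetical transcribe_if_speech() helper from the VAD example above; real buffering and endpointing are more involved.

```python
# Sketch of the STT path on the backend: receive PCM chunks, gate with VAD,
# transcribe, and send the text back over the same WebSocket.
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()

@app.websocket("/stt")
async def stt_endpoint(websocket: WebSocket) -> None:
    await websocket.accept()
    buffer = b""
    try:
        while True:
            buffer += await websocket.receive_bytes()
            # Roughly one second of 16 kHz 16-bit mono audio per pass.
            if len(buffer) >= 16_000 * 2:
                text = transcribe_if_speech(buffer)   # VAD gate + Whisper
                buffer = b""
                if text:
                    await websocket.send_json({"text": text})
    except WebSocketDisconnect:
        pass
```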
For text-to-speech, once the client sends the text to the backend, it’s processed and passed to the synthesizer engine. The generated audio is streamed back to the client and played through the speaker in real time.
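On the client side, that round trip might look like the following sketch, assuming the hypothetical /tts endpoint above and using requests and sounddevice as illustrative libraries.

```python
# Sketch: send text to the TTS endpoint, decode the WAV reply, and play it.
import io

import requests
import sounddevice as sd
import soundfile as sf

def speak(text: str, url: str = "http://localhost:8000/tts") -> None:
    response = requests.post(url, json={"text": text, "voice": "default"}, timeout=30)
    response.raise_for_status()

    samples, sample_rate = sf.read(io.BytesIO(response.content))
    sd.play(samples, samplerate=sample_rate)
    sd.wait()                                   # block until playback finishes

speak("Spindle speed set to 1200 RPM.")         # example prompt, not a real command set
```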
Building this system taught us several important lessons that shaped our approach to real-time voice processing. VAD (Voice Activity Detection) proved essential — without it, the STT engine wasted resources on background noise and silence. The right VAD implementation significantly boosted performance. Streaming via WebSockets ensured the lowest latency, which was crucial for maintaining a smooth user experience. Our decision to decouple models via a registry pattern made upgrades seamless and kept the core API logic clean and modular.
Surprisingly, voice style has a strong impact on user experience even in industrial environments. A well-chosen voice profile makes the assistant feel more natural and less robotic — something users appreciated more than we expected. We also found that fine-tuning STT and TTS models for specific vocabularies or accents is extremely effective, opening up promising possibilities we’ll explore in future posts.
What started as a component for SABOT evolved into a real-time Speech API service that we now reuse across projects and tailor for various industrial applications. We’re continuing to enhance and expand the system, with plans to fine-tune STT models for sectors like automotive and manufacturing, introduce branded voice personalities for machines, and integrate more deeply with multimodal assistants that combine voice, screen, and gesture inputs.
Building this Speech API from scratch using open-source models was an incredibly rewarding experience. It gave us the flexibility and control we needed, while enabling us to create a solution tailored to our exact requirements. I hope this post gave you insight into how we built our Speech API and the challenges we overcame. If you’re working on similar systems or want to learn more about our solution, feel free to reach out.
Thanks for reading!