Distributed real-time speech-to-text and translation system with features including voice typing and live subtitles
π Actively Developed & Community-Driven
This project is actively maintained and welcomes contributions! Whether you're interested in AI/ML, distributed systems, real-time processing, or desktop applications, there's plenty to work on.
Perfect for learning: Production-grade patterns, microservices architecture, GPU optimization, real-time streaming, and more.
Areas needing contributors: Additional transcription/translation models, cross-platform desktop clients, Kubernetes deployment, performance optimization, and testing.
Built by @PeterBui(github) | @peterbuiCS(X)
This repository contains the complete source code for a distributed speech processing system - not a packaged application. It's designed as a foundational component for a larger desktop assistant project, demonstrating production-grade patterns for real-time AI workloads.
Current Platform Support: Frontend currently targets Windows only (Electron + native keyboard hooks)
This isn't just another speech-to-text demo. It's a fully distributed, queue-based system designed to handle production workloads with:
- Horizontal scalability at every layer
- Sub-200ms end-to-end latency for real-time processing
- Fault tolerance through Redis-backed message queuing
- Zero-downtime deployments via container orchestration
- Language-agnostic microservices (Python backend, TypeScript frontend)
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Electron Desktop Client β
β (WebSocket + Audio Capture) β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β
ββββββββββββββββββββββββΌβββββββββββββββββββββββ
βΌ βΌ βΌ
ββββββββββββββββ ββββββββββββββββ ββββββββββββββββ
β Gateway #1 β β Gateway #2 β ... β Gateway #N β
β (WebSocket) β β (WebSocket) β β (WebSocket) β
ββββββββββββββββ ββββββββββββββββ ββββββββββββββββ
β β β
ββββββββββββββββββββββββΌβββββββββββββββββββββββ
βΌ
ββββββββββββββββββββ
β Redis Cluster β
β (Streams + PS) β
ββββββββββββββββββββ
β
ββββββββββββββββββββββββΌβββββββββββββββββββββββ
βΌ βΌ βΌ
ββββββββββββββββ ββββββββββββββββ ββββββββββββββββ
β STT Worker 1 β β STT Worker 2 β ... β STT Worker N β
β (CUDA 0) β β (CUDA 1) β β (CUDA N) β
ββββββββββββββββ ββββββββββββββββ ββββββββββββββββ
β β β
ββββββββββββββββββββββββΌβββββββββββββββββββββββ
βΌ
ββββββββββββββββββββ
β Transcription β
β Stream β
ββββββββββββββββββββ
β
ββββββββββββββββββββββββΌβββββββββββββββββββββββ
βΌ βΌ βΌ
ββββββββββββββββ ββββββββββββββββ ββββββββββββββββ
βTrans Worker 1β βTrans Worker 2β ... βTrans Worker Nβ
β (CUDA 0) β β (CUDA 1) β β (CUDA N) β
ββββββββββββββββ ββββββββββββββββ ββββββββββββββββ
β β β
ββββββββββββββββββββββββΌβββββββββββββββββββββββ
βΌ
ββββββββββββββββββββ
β Pub/Sub β
β Results β
β Channels β
ββββββββββββββββββββ
β
ββββββββββββββββββββββββΌβββββββββββββββββββββββ
βΌ βΌ βΌ
ββββββββββββββββ ββββββββββββββββ ββββββββββββββββ
β Gateway #1 β β Gateway #2 β ... β Gateway #N β
β (WebSocket) β β (WebSocket) β β (WebSocket) β
ββββββββββββββββ ββββββββββββββββ ββββββββββββββββ
β
βΌ
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Electron Desktop Client β
β (Results Display) β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Scalability:
- Independent scaling of gateway/STT/translation workers
- Redis Streams for backpressure handling
- Multi-GPU support with device assignment
- Connection pooling and session management
Performance:
- WebRTC VAD for efficient audio segmentation
- CTranslate2 quantization (INT8/FP16)
- Batch processing for translation workloads
- Memory-mapped model loading
Observability:
- Structured logging with correlation IDs
- Health check endpoints per service
- Prometheus-compatible metrics (ready to implement)
- Distributed tracing hooks (OpenTelemetry ready)
Reliability:
- Graceful shutdown with drain support
- Circuit breaker pattern for external services
- Automatic reconnection with exponential backoff
- Dead letter queues for failed messages# Single STT Worker (RTX 3080)
- Throughput: ~50 concurrent streams
- Latency: p50=120ms, p99=180ms
- Model: whisper-large-v3 (1.5B params)
# Scaled Configuration (3x STT, 2x Translation)
- Throughput: ~150 concurrent streams
- STT: 3x RTX 3080 (450 concurrent streams capacity)
- Translation: 2x RTX 3080 (NLLB-200 600M model)
- Auto-scaling based on Redis queue depth
- Zero message loss under load# Development (single instance each)
cd backend/infra
docker-compose up --build
# Small deployment (10-50 users)
docker-compose up --scale gateway=2 --scale stt_worker=3 --scale translation_worker=2
# Large deployment (100+ users)
docker-compose up --scale gateway=4 --scale stt_worker=8 --scale translation_worker=6
# Production deployment (Kubernetes)
# kubectl apply -f k8s/
# kubectl scale deployment stt-worker --replicas=10
# kubectl scale deployment translation-worker --replicas=8- Message Queue: Redis Streams + Pub/Sub for event-driven architecture
- STT Engine: Faster-Whisper (CTranslate2 optimized) with beam search
- Translation: Meta's NLLB-200 (600M params) with dynamic batching
- Audio Processing: WebRTC VAD, resampling, normalization
- Containerization: Multi-stage Docker builds (~2GB images)
- Framework: Electron 28 + Next.js 14 (React 18)
- IPC: Context-isolated with typed bridges
- State Management: Zustand with WebSocket middleware
- UI: Glassmorphism with GPU-accelerated animations
- Native Integration: Windows keyboard hooks via node-gyp
- Orchestration: Docker Compose (K8s manifests in progress)
- Monitoring: Health checks, structured logging
- Development: Hot reload, volume mounts, debug modes
- Testing: Component isolation, mock Redis
# Memory footprint (per worker)
Gateway: ~100MB (Python + asyncio)
STT Worker: ~1.5GB (model) + 200MB/stream
Translation: ~2.5GB (model) + 100MB/batch
# GPU utilization (whisper-base)
Batch=1: ~30% utilization (RTX 3080)
Batch=4: ~85% utilization (optimal)
Batch=8: ~95% utilization (diminishing returns)
# Network bandwidth
Audio stream: 256kbps (16kHz mono)
WebSocket overhead: ~5%
Redis protocol: ~10KB/message- Production Patterns: Not a toy project - implements circuit breakers, graceful shutdowns, connection pooling
- Real Microservices: Each service is independently deployable with clear contracts
- Modern AI Stack: Latest optimizations (CTranslate2, ONNX runtime options)
- Clean Abstractions: Repository pattern, dependency injection, typed everything
- Extensible Design: Add new models, languages, or processing steps easily
# Clone and setup
git clone https://github.com/PeterBui/nova-voice
cd nova-voice
# Configure environment (copy from example)
cp backend/.env_example backend/infra/.env
# IMPORTANT: Start backend services FIRST
# Backend provides the AI processing pipeline
# Option A: Docker (Recommended)
cd backend/infra
docker-compose up --build
# β±οΈ First Run: Model downloads may take 1-5 minutes depending on your network
# Monitor progress: Docker Desktop β Containers β View logs for stt_worker/translation_worker
# Models: Whisper large-v3 (~3GB) + NLLB-600M (~2.5GB)
# π For GPU acceleration (10x faster):
# - Windows: backend/docs/GPU_SETUP_WINDOWS.md
# - Linux: backend/docs/GPU_SETUP_LINUX.md
# - macOS: backend/docs/GPU_SETUP_MAC.md
# Option B: Conda Environment (AI/ML Optimized)
cd backend
./setup-conda.sh # Or: conda env create -f environment.yml
conda activate nova-voice
./run-services.sh dev
# Option C: Manual Python Setup
cd backend
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
pip install -r requirements.txt
redis-server & # In another terminal
python -m gateway.gateway &
python -m stt_worker.worker &
python -m translation_worker.worker &
# Option D: All-in-one Script (Auto-detects environment)
cd backend
./run-services.sh dev # Handles conda/venv + Redis + all services
# In a NEW terminal, start the frontend
# Frontend connects to backend for speech processing
cd ../frontend # From backend directory
npm install
npm run build
npm run electron
# Verify the complete pipeline is working
curl http://localhost:8080/health/fullBackground Music/Noise:
- β Speech detection may not work well if there is music in the audio
- Background music can interfere with voice activity detection (VAD)
- May cause false speech detections or reduced transcription accuracy
Docker Setup:
- Docker & Docker Compose
- 4GB+ RAM, GPU recommended
Conda Setup:
- Miniconda/Anaconda
- Python 3.10+
- 4GB+ RAM, GPU recommended
Manual Setup:
- Python 3.10+
- pip
- Redis server
- 4GB+ RAM, GPU recommended
Why Redis Streams over Kafka/RabbitMQ?
- Lower operational overhead
- Built-in persistence
- Consumer groups with ACK
- Sufficient for our throughput (<1000 msg/s)
Why Faster-Whisper over OpenAI Whisper?
- 4x faster inference with CTranslate2
- 2x lower memory usage
- Same accuracy (within 0.1% WER)
Why Electron over native?
- Faster iteration on UI
- Web technologies for overlay rendering
- Cross-platform potential (macOS/Linux planned)
Why microservices over monolith?
- Independent scaling of expensive ops (STT vs translation)
- Language flexibility (could add Rust workers)
- Failure isolation
- Cloud-native deployment readyThis is the speech processing foundation for a larger desktop assistant project:
Current State (v0.1):
βββ β
Real-time STT pipeline
βββ β
Translation pipeline
βββ β
Windows frontend
βββ β
Production architecture
Next Milestones:
βββ π Kubernetes manifests
βββ π TTS pipeline (XTTS-v2)
βββ π Speaker diarization
βββ π Custom wake word detection
βββ π LLM integration hooks
Future Vision:
βββ π
Full desktop assistant
βββ π
Local LLM orchestration
βββ π
Plugin architecture
βββ π
Multi-modal inputs
- Distributed Architecture - Deep dive into design decisions
- Technical Overview - System architecture and design patterns
- API Reference - Complete API documentation
- Gateway Service - WebSocket handling, session management
- STT Worker - Audio processing, model optimization
- Translation Worker - Batching strategies, language detection
- Component Architecture - React component design patterns
- Audio Management - Audio device handling and recording
- WebSocket Client - Real-time communication patterns
- Live Subtitles - Subtitle rendering and timing
- Electron Integration - Desktop application setup
- GPU Setup Guides - β‘ 10x Faster Performance
- Windows (WSL2) - NVIDIA Container Toolkit
- Linux - Native Docker + NVIDIA drivers
- macOS - Apple Silicon MPS or Remote GPU
- Voice Typing Engine - Real-time transcription engine
- Build & Deployment - Production build strategies
- Backend Development - Environment setup and debugging
- Frontend Development - Development workflow and tooling
- Configuration Guide - Service configuration options
- Shared Modules - Common utilities and patterns
- Automatic Typing - Type inference and validation
- Quick Start Guide - Getting started quickly
Looking for contributors who appreciate:
- Clean architecture over quick hacks
- Performance optimization
- Distributed systems patterns
- Real-time processing challenges
Areas needing expertise:
- macOS/Linux frontend adaptation
- Kubernetes operators for auto-scaling
- Additional translation language models
- Additional STT transcription models
Ready for production monitoring:
# Prometheus metrics (endpoints ready)
GET /metrics
- gateway_active_connections
- stt_processing_duration_seconds
- translation_batch_size
- redis_stream_length
# Structured logs (JSON format)
{
"timestamp": "2024-01-01T00:00:00Z",
"service": "stt_worker",
"level": "INFO",
"correlation_id": "abc-123",
"message": "Processing complete",
"duration_ms": 145,
"model": "whisper-base",
"gpu_device": 0
}- RealtimeSTT - Real-time speech recognition inspiration by @Kolja Beigel
- Faster-Whisper - OpenAI Speech-to-text model
- NLLB - State-of-the-art translation
- Redis - The backbone of our message passing
- Electron - Desktop platform
This project was accelerated using:
- Cursor - AI-powered IDE
- Claude - Architecture and code review
- ChatGPT - Problem solving and optimization
- CodeRabbit - PR reviews and suggestions
Nova Voice - Building blocks for the next generation of desktop AI assistants.
This is not an app, it's an architecture.