feat: Adding Audio APIs [waiting for vllm-proxy change] #377

ilblackdragon wants to merge 5 commits into main
Conversation
Summary of Changes

Hello @ilblackdragon, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly expands the platform's capabilities by introducing robust audio processing features. It provides both traditional RESTful endpoints for speech-to-text and text-to-speech, and a real-time WebSocket API designed for interactive voice conversations. The changes involve deep integration with the existing inference provider infrastructure, the creation of new service layers for managing audio workflows and usage, and the establishment of dedicated API routes, all while adhering to strict privacy guidelines.
Code Review
This pull request introduces comprehensive audio APIs for speech-to-text (STT) and text-to-speech (TTS), including both REST endpoints and a real-time WebSocket API. The implementation integrates well with the existing inference provider pool and includes usage tracking. The code is generally well-structured with clear separation of concerns. However, there are a few areas that need attention, particularly regarding the real-time streaming behavior and consistent service layer utilization. One critical point highlighted by existing rules is the handling of background tasks for billing operations to ensure data integrity during graceful shutdowns.
Audio API Implementation Plan
Overview
Implement comprehensive audio APIs including file-based endpoints and real-time WebSocket support for voice conversations.
REST Endpoints:

- `POST /v1/audio/transcriptions` - File-based speech-to-text
- `POST /v1/audio/speech` - File-based text-to-speech (supports streaming)

WebSocket Realtime API:

- `wss://host/v1/realtime` - Real-time bidirectional audio streaming for voice-to-voice conversations

Voice-to-Voice Pipeline: STT → LLM → TTS (works with any model combination)

Providers: Both vLLM-hosted models and external providers (OpenAI, etc.)
Implementation Steps
1. Add Audio Models to Inference Providers

File: `crates/inference_providers/src/models.rs`

Add new types:
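A minimal sketch of what these types might look like; the struct and field names here are illustrative assumptions, not the PR's actual definitions:

```rust
use serde::{Deserialize, Serialize};

/// Speech-to-text request parameters (names are illustrative).
#[derive(Debug, Clone)]
pub struct AudioTranscriptionParams {
    pub model: String,
    /// Raw audio bytes as uploaded by the caller.
    pub audio: Vec<u8>,
    /// Filename forwarded in the multipart upload, e.g. "audio.wav".
    pub file_name: String,
    /// Optional language hint for the STT model.
    pub language: Option<String>,
}

/// Transcription result returned by a provider.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct AudioTranscriptionResult {
    pub text: String,
}

/// Text-to-speech request parameters.
#[derive(Debug, Clone, Serialize)]
pub struct AudioSpeechParams {
    pub model: String,
    pub input: String,
    pub voice: String,
    /// Requested output format, e.g. "mp3" or "wav".
    pub response_format: Option<String>,
}
```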
2. Extend InferenceProvider Trait

File: `crates/inference_providers/src/lib.rs`

Add methods with default implementations (returning `ModelNotSupported`):
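For illustration, the trait additions could look like this. Only the `ModelNotSupported` behavior comes from the plan text; the trait shape, error enum, and use of `async_trait` are assumptions:

```rust
use async_trait::async_trait;

#[async_trait]
pub trait InferenceProvider: Send + Sync {
    // ...existing completion/chat methods...

    /// Speech-to-text; providers without audio support inherit this default.
    async fn audio_transcription(
        &self,
        _params: AudioTranscriptionParams,
    ) -> Result<AudioTranscriptionResult, ProviderError> {
        Err(ProviderError::ModelNotSupported)
    }

    /// Text-to-speech returning the complete audio body.
    async fn audio_speech(
        &self,
        _params: AudioSpeechParams,
    ) -> Result<Vec<u8>, ProviderError> {
        Err(ProviderError::ModelNotSupported)
    }
}
```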
3. Implement vLLM Provider Audio Methods
File: `crates/inference_providers/src/vllm/mod.rs`

Implement:

- `audio_transcription()` - POST multipart form to `{base_url}/v1/audio/transcriptions`
- `audio_speech()` - POST JSON to `{base_url}/v1/audio/speech`, return binary audio
- `audio_speech_stream()` - POST with streaming response, return audio chunks

A sketch of the multipart transcription call follows below.
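One way the multipart call could look, using `reqwest` with its `multipart` and `json` features enabled; the `client`/`base_url` fields and the `From<reqwest::Error>` conversion for `ProviderError` are assumptions about the provider struct:

```rust
use reqwest::multipart;

// Inside the vLLM provider's InferenceProvider impl.
async fn audio_transcription(
    &self,
    params: AudioTranscriptionParams,
) -> Result<AudioTranscriptionResult, ProviderError> {
    // Build the multipart form expected by /v1/audio/transcriptions.
    let form = multipart::Form::new()
        .text("model", params.model)
        .part(
            "file",
            multipart::Part::bytes(params.audio).file_name(params.file_name),
        );

    let response = self
        .client
        .post(format!("{}/v1/audio/transcriptions", self.base_url))
        .multipart(form)
        .send()
        .await?
        .error_for_status()?;

    Ok(response.json::<AudioTranscriptionResult>().await?)
}
```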
3b. Implement External Provider Audio Methods

File: `crates/inference_providers/src/external/mod.rs`

Add audio support to `ExternalProvider`:

- Support `/v1/audio/transcriptions` and `/v1/audio/speech`
- Follow the `ExternalBackend` pattern - add `audio_transcription()` and `audio_speech()` methods to the trait
4. Create Audio Service

New files:

- `crates/services/src/audio/mod.rs`
- `crates/services/src/audio/ports.rs`

`ports.rs`: the service trait (port) consumed by the API layer.

`mod.rs`: `AudioServiceImpl` with `inference_pool` and `usage_service` dependencies.
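A sketch of both files under assumed names (`AudioService`, `AudioError`, `provider_for`, `record_audio_usage` are all hypothetical; only `AudioServiceImpl`, `inference_pool`, and `usage_service` come from the plan). Note the usage call is awaited inline rather than spawned, per the review note about billing and graceful shutdown:

```rust
// ports.rs — port consumed by the API layer.
use std::sync::Arc;

use async_trait::async_trait;

#[async_trait]
pub trait AudioService: Send + Sync {
    async fn transcribe(
        &self,
        params: AudioTranscriptionParams,
    ) -> Result<AudioTranscriptionResult, AudioError>;

    async fn speak(&self, params: AudioSpeechParams) -> Result<Vec<u8>, AudioError>;
}

// mod.rs — implementation backed by the inference pool.
pub struct AudioServiceImpl {
    inference_pool: Arc<InferencePool>,
    usage_service: Arc<dyn UsageService>,
}

#[async_trait]
impl AudioService for AudioServiceImpl {
    async fn transcribe(
        &self,
        params: AudioTranscriptionParams,
    ) -> Result<AudioTranscriptionResult, AudioError> {
        let provider = self.inference_pool.provider_for(&params.model)?;
        let result = provider.audio_transcription(params).await?;
        // Await usage recording instead of spawning a background task,
        // so billing is not lost on graceful shutdown.
        self.usage_service.record_audio_usage(&result).await?;
        Ok(result)
    }

    async fn speak(&self, params: AudioSpeechParams) -> Result<Vec<u8>, AudioError> {
        let provider = self.inference_pool.provider_for(&params.model)?;
        Ok(provider.audio_speech(params).await?)
    }
}
```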
Update: `crates/services/src/lib.rs` - add `pub mod audio;` and `pub mod realtime;`
5. Add API Request/Response Models

File: `crates/api/src/models.rs`

Add:
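A sketch of the request/response models. The `stream` field is called out in the files summary below; the remaining field names follow the OpenAI audio API shape and are assumptions:

```rust
use serde::{Deserialize, Serialize};

/// Body of POST /v1/audio/speech.
#[derive(Debug, Deserialize)]
pub struct CreateSpeechRequest {
    pub model: String,
    pub input: String,
    pub voice: String,
    /// Output format, e.g. "mp3"; the handler picks a default when absent.
    #[serde(default)]
    pub response_format: Option<String>,
    /// When true, the handler streams chunked audio (see step 6).
    #[serde(default)]
    pub stream: bool,
}

/// Response body for POST /v1/audio/transcriptions.
#[derive(Debug, Serialize)]
pub struct CreateTranscriptionResponse {
    pub text: String,
}
```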
6. Create Audio Routes

New file: `crates/api/src/routes/audio.rs`

For streaming TTS, use `Transfer-Encoding: chunked` and `Content-Type: audio/mpeg` (or the requested format), as in the handler sketch below.
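A sketch assuming axum 0.7; `AppState` and the `speak`/`speak_stream` service methods are hypothetical. axum emits a chunked transfer encoding automatically when the body is a stream of unknown length:

```rust
use axum::{
    body::Body,
    extract::State,
    http::{header, StatusCode},
    response::{IntoResponse, Response},
    routing::post,
    Json, Router,
};

pub fn build_audio_routes(state: AppState) -> Router {
    Router::new()
        .route("/audio/speech", post(create_speech))
        .with_state(state)
}

async fn create_speech(
    State(state): State<AppState>,
    Json(req): Json<CreateSpeechRequest>,
) -> Result<Response, StatusCode> {
    if req.stream {
        // Stream audio chunks as the provider produces them;
        // `chunks` is assumed to be a TryStream of Bytes.
        let chunks = state
            .audio_service
            .speak_stream(req)
            .await
            .map_err(|_| StatusCode::BAD_GATEWAY)?;
        Ok(Response::builder()
            .header(header::CONTENT_TYPE, "audio/mpeg")
            .body(Body::from_stream(chunks))
            .expect("valid response"))
    } else {
        // Buffered path: return the full binary audio body at once.
        let audio = state
            .audio_service
            .speak(req)
            .await
            .map_err(|_| StatusCode::BAD_GATEWAY)?;
        Ok(([(header::CONTENT_TYPE, "audio/mpeg")], audio).into_response())
    }
}
```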
Update: `crates/api/src/routes/mod.rs` - add `pub mod audio;`

7. Register Routes and Service
File: `crates/api/src/lib.rs`

Add:

- Update the `DomainServices` struct to include `audio_service`.
- Update `init_domain_services_with_pool` to create `AudioServiceImpl`.
- Update `build_app_with_config` to call `build_audio_routes` and merge it into the v1 router.

A sketch of this wiring follows below.
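Shown as fragments rather than a complete file; the struct, function, and variable names other than those listed in the step are assumptions about the existing wiring:

```rust
// DomainServices gains the new service handle.
pub struct DomainServices {
    // ...existing services...
    pub audio_service: Arc<dyn AudioService>,
}

// In init_domain_services_with_pool (constructor shape assumed):
let audio_service: Arc<dyn AudioService> = Arc::new(AudioServiceImpl::new(
    inference_pool.clone(),
    usage_service.clone(),
));

// In build_app_with_config, merge the new routes into the v1 router:
let v1_router = v1_router.merge(build_audio_routes(app_state.clone()));
```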
8. Add E2E Tests

New file: `crates/api/tests/e2e_audio_api.rs`

Tests:

- `test_audio_transcription` - POST multipart with audio file
- `test_audio_speech` - POST JSON, verify binary audio response
- `test_audio_transcription_validation` - missing file/model errors
- `test_audio_speech_validation` - missing voice, input too long errors

One of these tests might look like the sketch below.
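A sketch of `test_audio_speech`; the base URL, model, and voice values are assumptions about the local test setup:

```rust
// crates/api/tests/e2e_audio_api.rs (sketch)
#[tokio::test]
async fn test_audio_speech() {
    let client = reqwest::Client::new();
    let resp = client
        .post("http://localhost:8080/v1/audio/speech")
        .json(&serde_json::json!({
            "model": "tts-1",
            "input": "Hello from the e2e test",
            "voice": "alloy"
        }))
        .send()
        .await
        .expect("request should succeed");

    assert!(resp.status().is_success());
    let content_type = resp
        .headers()
        .get("content-type")
        .and_then(|v| v.to_str().ok())
        .unwrap_or_default()
        .to_string();
    let body = resp.bytes().await.expect("body should be readable");
    assert!(content_type.starts_with("audio/"));
    assert!(!body.is_empty(), "expected binary audio in the response");
}
```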
Part 1: Critical Files Summary

- `crates/inference_providers/src/models.rs`
- `crates/inference_providers/src/lib.rs`
- `crates/inference_providers/src/vllm/mod.rs`
- `crates/inference_providers/src/external/mod.rs`
- `crates/inference_providers/src/external/backend.rs`
- `crates/services/src/audio/ports.rs`
- `crates/services/src/audio/mod.rs`
- `crates/services/src/lib.rs`
- `crates/api/src/models.rs` (add `stream` field to speech request)
- `crates/api/src/routes/audio.rs`
- `crates/api/src/routes/mod.rs`
- `crates/api/src/lib.rs`
- `crates/api/tests/e2e_audio_api.rs`

Part 2: WebSocket Realtime API
9. Add WebSocket Dependencies

File: `crates/api/Cargo.toml`

Add:
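Something along these lines; the crate versions are illustrative and the exact dependency set depends on the workspace:

```toml
axum = { version = "0.7", features = ["ws"] }  # enables axum::extract::ws
futures-util = "0.3"                           # stream combinators for the socket
```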
10. Create Realtime Session Types

New file: `crates/services/src/realtime/mod.rs`

A sketch of the session event types follows below.
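The event enums might be shaped like this; the variant names are loosely modeled on OpenAI's Realtime API and are assumptions, not the PR's wire format:

```rust
use serde::{Deserialize, Serialize};

/// Client → server events.
#[derive(Debug, Deserialize)]
#[serde(tag = "type", rename_all = "snake_case")]
pub enum ClientEvent {
    /// Append base64-encoded audio to the input buffer.
    InputAudioBufferAppend { audio: String },
    /// Commit the buffer: triggers the STT → LLM → TTS pipeline.
    InputAudioBufferCommit,
    /// Adjust session settings mid-conversation.
    SessionUpdate { voice: Option<String> },
}

/// Server → client events.
#[derive(Debug, Serialize)]
#[serde(tag = "type", rename_all = "snake_case")]
pub enum ServerEvent {
    TranscriptDelta { text: String },
    ResponseAudioDelta { audio: String },
    ResponseDone,
    Error { message: String },
}
```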
11. Create Realtime Service

File: `crates/services/src/realtime/ports.rs`
File: `crates/services/src/realtime/session.rs`

Implements the STT → LLM → TTS pipeline:

- `commit_audio_buffer()` - sends accumulated audio to the STT model, returns a transcript
- `generate_response()` - sends the transcript plus context to the LLM, streams the response

The STT step might look like the sketch below.
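A sketch of `commit_audio_buffer()`; the session fields, `RealtimeError`, and the `From<AudioError>` conversion are assumptions, and error plumbing is elided:

```rust
impl RealtimeSession {
    /// STT step: flush the buffered audio to the transcription model.
    pub async fn commit_audio_buffer(&mut self) -> Result<String, RealtimeError> {
        // Take the accumulated audio, leaving an empty buffer behind.
        let audio = std::mem::take(&mut self.audio_buffer);
        let transcript = self
            .audio_service
            .transcribe(AudioTranscriptionParams {
                model: self.stt_model.clone(),
                audio,
                file_name: "buffer.wav".into(),
                language: None,
            })
            .await?
            .text;
        // Privacy: the transcript stays in in-memory conversation context
        // only (as LLM input); it must never be logged.
        self.history.push(transcript.clone());
        Ok(transcript)
    }
}
```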
12. Create WebSocket Route Handler

New file: `crates/api/src/routes/realtime.rs`

A handler sketch follows below.
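A sketch assuming axum 0.7 with the `ws` feature; `AppState`, `RealtimeSession::new`, and a `handle()` method returning the server events to emit are hypothetical:

```rust
use axum::{
    extract::{
        ws::{Message, WebSocket, WebSocketUpgrade},
        State,
    },
    response::Response,
    routing::get,
    Router,
};

pub fn build_realtime_routes(state: AppState) -> Router {
    Router::new()
        .route("/realtime", get(realtime_upgrade))
        .with_state(state)
}

async fn realtime_upgrade(ws: WebSocketUpgrade, State(state): State<AppState>) -> Response {
    ws.on_upgrade(move |socket| handle_socket(socket, state))
}

async fn handle_socket(mut socket: WebSocket, state: AppState) {
    let mut session = RealtimeSession::new(state);
    while let Some(Ok(msg)) = socket.recv().await {
        if let Message::Text(text) = msg {
            // Decode a ClientEvent, drive the session, stream ServerEvents back.
            if let Ok(event) = serde_json::from_str::<ClientEvent>(&text) {
                for reply in session.handle(event).await {
                    let json = serde_json::to_string(&reply).expect("serializable event");
                    if socket.send(Message::Text(json)).await.is_err() {
                        return; // client disconnected
                    }
                }
            }
        }
    }
}
```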
13. Register WebSocket Route

File: `crates/api/src/lib.rs`

Update `build_app_with_config` to include the realtime routes.

Updated Critical Files Summary
- `crates/api/Cargo.toml`
- `crates/services/src/realtime/mod.rs`
- `crates/services/src/realtime/ports.rs`
- `crates/services/src/realtime/session.rs`
- `crates/api/src/routes/realtime.rs`

Privacy Requirements
NEVER log:
OK to log:
Verification
Part 1: REST Endpoints
- `cargo build`
- `cargo test --lib --bins`
- `cargo test --test e2e_audio_api` (requires PostgreSQL)

Part 2: WebSocket Realtime API
WebSocket test with websocat:
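For a quick manual check, connect and type JSON events interactively, one per line; the URL and port are assumptions about the local dev setup:

```sh
websocat "ws://localhost:8080/v1/realtime"
```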
E2E WebSocket tests:

- `cargo test --test e2e_realtime_api`