
feat: Adding Audio APIs [waiting for vllm-proxy change] #377

Draft
ilblackdragon wants to merge 5 commits into main from whisper

Conversation

@ilblackdragon
Member

Audio API Implementation Plan

Overview

Implement comprehensive audio APIs including file-based endpoints and real-time WebSocket support for voice conversations.

REST Endpoints:

  • POST /v1/audio/transcriptions - File-based speech-to-text
  • POST /v1/audio/speech - File-based text-to-speech (supports streaming)

WebSocket Realtime API:

  • wss://host/v1/realtime - Real-time bidirectional audio streaming for voice-to-voice conversations

Voice-to-Voice Pipeline: STT → LLM → TTS (works with any model combination)

Providers: Both vLLM-hosted models and external providers (OpenAI, etc.)


Implementation Steps

1. Add Audio Models to Inference Providers

File: crates/inference_providers/src/models.rs

Add new types:

// Audio transcription params and response
pub struct AudioTranscriptionParams {
    pub model: String,
    pub audio_data: Vec<u8>,
    pub filename: String,
    pub language: Option<String>,
    pub prompt: Option<String>,
    pub response_format: Option<String>,  // json, text, srt, verbose_json, vtt
    pub temperature: Option<f32>,
    pub timestamp_granularities: Option<Vec<String>>,
}

pub struct AudioTranscriptionResponse { text, task, language, duration, words, segments }
pub struct AudioTranscriptionResponseWithBytes { response, raw_bytes, audio_duration_seconds }

// Text-to-speech params and response
pub struct AudioSpeechParams {
    pub model: String,
    pub input: String,
    pub voice: String,
    pub response_format: Option<String>,  // mp3, opus, aac, flac, wav, pcm
    pub speed: Option<f32>,
}

pub struct AudioSpeechResponseWithBytes { audio_data, content_type, raw_bytes, character_count }

// Error type
pub enum AudioError { InvalidAudioFormat, TranscriptionFailed, SynthesisFailed, ModelNotSupported, HttpError }

// Streaming TTS result
pub type AudioStreamingResult = Pin<Box<dyn Stream<Item = Result<AudioChunk, AudioError>> + Send>>;
pub struct AudioChunk { data: Vec<u8>, is_final: bool }
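
For reference, the shorthand one-liners above might expand to something like the following once field types are pinned down (the exact types and serde derives here are assumptions, not final definitions):

use serde::{Deserialize, Serialize};

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct AudioTranscriptionResponse {
    pub text: String,
    pub task: Option<String>,                      // e.g. "transcribe"
    pub language: Option<String>,
    pub duration: Option<f64>,                     // seconds of input audio
    pub words: Option<Vec<serde_json::Value>>,     // word-level timestamps (verbose_json)
    pub segments: Option<Vec<serde_json::Value>>,  // segment-level timestamps (verbose_json)
}

#[derive(Debug, Clone)]
pub struct AudioTranscriptionResponseWithBytes {
    pub response: AudioTranscriptionResponse,
    pub raw_bytes: Vec<u8>,            // untouched provider payload (for text/srt/vtt formats)
    pub audio_duration_seconds: f64,   // used for usage recording
}

#[derive(Debug, Clone)]
pub struct AudioSpeechResponseWithBytes {
    pub audio_data: Vec<u8>,
    pub content_type: String,          // e.g. "audio/mpeg"
    pub raw_bytes: Vec<u8>,
    pub character_count: usize,        // used for usage recording
}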

2. Extend InferenceProvider Trait

File: crates/inference_providers/src/lib.rs

Add methods with default implementations that return AudioError::ModelNotSupported (see the sketch after the signatures):

async fn audio_transcription(&self, params: AudioTranscriptionParams, request_hash: String)
    -> Result<AudioTranscriptionResponseWithBytes, AudioError>;

async fn audio_speech(&self, params: AudioSpeechParams, request_hash: String)
    -> Result<AudioSpeechResponseWithBytes, AudioError>;

async fn audio_speech_stream(&self, params: AudioSpeechParams, request_hash: String)
    -> Result<AudioStreamingResult, AudioError>;
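
A minimal sketch of those defaults, assuming the trait already carries #[async_trait]; providers that don't support audio simply fall through to these bodies:

// Inside the existing #[async_trait] trait InferenceProvider { ... }:

async fn audio_transcription(
    &self,
    _params: AudioTranscriptionParams,
    _request_hash: String,
) -> Result<AudioTranscriptionResponseWithBytes, AudioError> {
    // Default: this provider does not support speech-to-text.
    Err(AudioError::ModelNotSupported)
}

async fn audio_speech(
    &self,
    _params: AudioSpeechParams,
    _request_hash: String,
) -> Result<AudioSpeechResponseWithBytes, AudioError> {
    // Default: this provider does not support text-to-speech.
    Err(AudioError::ModelNotSupported)
}

async fn audio_speech_stream(
    &self,
    _params: AudioSpeechParams,
    _request_hash: String,
) -> Result<AudioStreamingResult, AudioError> {
    // Default: this provider does not support streaming text-to-speech.
    Err(AudioError::ModelNotSupported)
}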

3. Implement vLLM Provider Audio Methods

File: crates/inference_providers/src/vllm/mod.rs

Implement:

  • audio_transcription() - POST multipart form to {base_url}/v1/audio/transcriptions
  • audio_speech() - POST JSON to {base_url}/v1/audio/speech, return binary audio
  • audio_speech_stream() - POST with streaming response, return audio chunks
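
A rough sketch of the transcription call, assuming the vLLM provider struct already holds a reqwest::Client (self.client) and a base_url field; adjust to the actual provider internals:

// Inside the #[async_trait] impl InferenceProvider for the vLLM provider:
async fn audio_transcription(
    &self,
    params: AudioTranscriptionParams,
    _request_hash: String,
) -> Result<AudioTranscriptionResponseWithBytes, AudioError> {
    let AudioTranscriptionParams {
        model, audio_data, filename, language, response_format, ..
    } = params;

    // Build the multipart form expected by /v1/audio/transcriptions.
    let mut form = reqwest::multipart::Form::new()
        .text("model", model)
        .part("file", reqwest::multipart::Part::bytes(audio_data).file_name(filename));
    if let Some(language) = language {
        form = form.text("language", language);
    }
    if let Some(format) = response_format {
        form = form.text("response_format", format);
    }

    let url = format!("{}/v1/audio/transcriptions", self.base_url);
    let resp = self.client.post(&url).multipart(form).send().await
        .map_err(|_| AudioError::HttpError)?;
    if !resp.status().is_success() {
        return Err(AudioError::TranscriptionFailed);
    }

    // Keep the raw payload for non-JSON formats; parse JSON for the structured response.
    let raw_bytes = resp.bytes().await.map_err(|_| AudioError::HttpError)?.to_vec();
    let response: AudioTranscriptionResponse =
        serde_json::from_slice(&raw_bytes).map_err(|_| AudioError::TranscriptionFailed)?;

    Ok(AudioTranscriptionResponseWithBytes {
        audio_duration_seconds: response.duration.unwrap_or(0.0),
        response,
        raw_bytes,
    })
}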

3b. Implement External Provider Audio Methods

File: crates/inference_providers/src/external/mod.rs

Add audio support to ExternalProvider:

  • For OpenAI-compatible backends: forward requests to /v1/audio/transcriptions and /v1/audio/speech
  • Use the existing ExternalBackend pattern: add audio_transcription() and audio_speech() methods to the trait (a hedged sketch follows this list)
  • Handle API key and base URL from provider config
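
For the OpenAI-compatible path this is mostly pass-through. A sketch of audio_speech(); self.client, self.config.base_url, and self.config.api_key are assumed fields and should be aligned with the existing backend struct:

async fn audio_speech(
    &self,
    params: AudioSpeechParams,
    _request_hash: String,
) -> Result<AudioSpeechResponseWithBytes, AudioError> {
    let character_count = params.input.chars().count();

    // Optional fields serialize as null here; a real implementation may want to omit them.
    let body = serde_json::json!({
        "model": params.model,
        "input": params.input,
        "voice": params.voice,
        "response_format": params.response_format,
        "speed": params.speed,
    });

    let resp = self
        .client
        .post(format!("{}/v1/audio/speech", self.config.base_url))
        .bearer_auth(&self.config.api_key)
        .json(&body)
        .send()
        .await
        .map_err(|_| AudioError::HttpError)?;
    if !resp.status().is_success() {
        return Err(AudioError::SynthesisFailed);
    }

    let content_type = resp
        .headers()
        .get(reqwest::header::CONTENT_TYPE)
        .and_then(|v| v.to_str().ok())
        .unwrap_or("audio/mpeg")
        .to_string();
    let audio_data = resp.bytes().await.map_err(|_| AudioError::HttpError)?.to_vec();

    Ok(AudioSpeechResponseWithBytes {
        raw_bytes: audio_data.clone(),
        audio_data,
        content_type,
        character_count,
    })
}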

4. Create Audio Service

New files:

  • crates/services/src/audio/mod.rs
  • crates/services/src/audio/ports.rs

ports.rs:

pub struct TranscribeRequest { model, audio_data, filename, language, prompt, response_format, temperature, timestamp_granularities, organization_id, workspace_id, api_key_id, model_id, request_hash }
pub struct SpeechRequest { model, input, voice, response_format, speed, organization_id, workspace_id, api_key_id, model_id, request_hash }
pub struct TranscribeResponse { text, language, duration, words, segments }
pub struct SpeechResponse { audio_data, content_type }
pub enum AudioServiceError { ModelNotFound, ProviderError, InvalidRequest, UsageError, InternalError }

#[async_trait]
pub trait AudioServiceTrait: Send + Sync {
    async fn transcribe(&self, request: TranscribeRequest) -> Result<TranscribeResponse, AudioServiceError>;
    async fn synthesize(&self, request: SpeechRequest) -> Result<SpeechResponse, AudioServiceError>;
    async fn synthesize_stream(&self, request: SpeechRequest)
        -> Result<Pin<Box<dyn Stream<Item = Result<Vec<u8>, AudioServiceError>> + Send>>, AudioServiceError>;
}

mod.rs:

  • AudioServiceImpl with inference_pool and usage_service dependencies
  • Transcribe: get provider, call audio_transcription, record usage (audio_seconds)
  • Synthesize: get provider, call audio_speech, record usage (character_count)

Update: crates/services/src/lib.rs - add pub mod audio; and pub mod realtime;
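
A hedged sketch of the transcribe flow from the list above; provider_for_model() and record_audio_seconds() are placeholder names for whatever the pool and usage services actually expose. Synthesize follows the same shape with character_count-based usage recording.

impl AudioServiceImpl {
    pub async fn transcribe(
        &self,
        request: TranscribeRequest,
    ) -> Result<TranscribeResponse, AudioServiceError> {
        // 1. Resolve a provider for the requested model.
        let provider = self
            .inference_pool
            .provider_for_model(&request.model)
            .await
            .ok_or(AudioServiceError::ModelNotFound)?;

        // 2. Call the provider's transcription method.
        let params = AudioTranscriptionParams {
            model: request.model,
            audio_data: request.audio_data,
            filename: request.filename,
            language: request.language,
            prompt: request.prompt,
            response_format: request.response_format,
            temperature: request.temperature,
            timestamp_granularities: request.timestamp_granularities,
        };
        let result = provider
            .audio_transcription(params, request.request_hash)
            .await
            .map_err(|_| AudioServiceError::ProviderError)?;

        // 3. Record usage in audio seconds (never log the transcript itself).
        self.usage_service
            .record_audio_seconds(
                request.organization_id,
                request.workspace_id,
                request.api_key_id,
                request.model_id,
                result.audio_duration_seconds,
            )
            .await
            .map_err(|_| AudioServiceError::UsageError)?;

        Ok(TranscribeResponse {
            text: result.response.text,
            language: result.response.language,
            duration: result.response.duration,
            words: result.response.words,
            segments: result.response.segments,
        })
    }
}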

5. Add API Request/Response Models

File: crates/api/src/models.rs

Add:

pub struct AudioTranscriptionRequest { model, language, prompt, response_format, temperature, timestamp_granularities }
pub struct AudioTranscriptionResponse { text, task, language, duration, words, segments }
pub struct AudioSpeechRequest { model, input, voice, response_format, speed, stream: Option<bool> }

impl AudioSpeechRequest {
    pub fn validate(&self) -> Result<(), String> { /* validate model, input length <= 4096, voice, speed 0.25-4.0, format */ }
}
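
A sketch of that validation, assuming model, input, and voice are String and speed is Option<f32>; the allowed format list mirrors the response_format comment in step 1, and the accepted voices should follow whatever the upstream providers support:

impl AudioSpeechRequest {
    pub fn validate(&self) -> Result<(), String> {
        if self.model.trim().is_empty() {
            return Err("model is required".to_string());
        }
        if self.input.is_empty() {
            return Err("input is required".to_string());
        }
        if self.input.chars().count() > 4096 {
            return Err("input must be 4096 characters or fewer".to_string());
        }
        if self.voice.trim().is_empty() {
            return Err("voice is required".to_string());
        }
        if let Some(speed) = self.speed {
            if !(0.25..=4.0).contains(&speed) {
                return Err("speed must be between 0.25 and 4.0".to_string());
            }
        }
        if let Some(format) = &self.response_format {
            const ALLOWED: [&str; 6] = ["mp3", "opus", "aac", "flac", "wav", "pcm"];
            if !ALLOWED.contains(&format.as_str()) {
                return Err(format!("unsupported response_format: {format}"));
            }
        }
        Ok(())
    }
}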

6. Create Audio Routes

New file: crates/api/src/routes/audio.rs

pub struct AudioRouteState { audio_service, models_service }

// POST /v1/audio/transcriptions (multipart form)
pub async fn transcribe_audio(State, Extension<WorkspaceContext>, Extension<RequestBodyHash>, Multipart)
    -> Result<Json<AudioTranscriptionResponse>, (StatusCode, Json<ErrorResponse>)>

// POST /v1/audio/speech (JSON body, returns binary audio or streaming audio)
// If request has stream: true, returns chunked audio stream
// Otherwise returns complete binary audio
pub async fn generate_speech(State, Extension<WorkspaceContext>, Extension<RequestBodyHash>, Json<AudioSpeechRequest>)
    -> Result<Response, (StatusCode, Json<ErrorResponse>)>

For streaming TTS, use Transfer-Encoding: chunked and Content-Type: audio/mpeg (or requested format).
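
One way to build that streaming response with axum 0.7's Body::from_stream; stream_speech_response is an illustrative helper rather than an existing function, and AudioServiceError is assumed to derive Debug:

use axum::{
    body::Body,
    http::{header, StatusCode},
    response::Response,
};
use futures::{Stream, StreamExt};

fn stream_speech_response(
    audio_stream: impl Stream<Item = Result<Vec<u8>, AudioServiceError>> + Send + 'static,
    content_type: &str,
) -> Response {
    // axum emits Transfer-Encoding: chunked for a streaming body of unknown length.
    let body = Body::from_stream(audio_stream.map(|chunk| {
        chunk.map_err(|e| std::io::Error::new(std::io::ErrorKind::Other, format!("{e:?}")))
    }));

    Response::builder()
        .status(StatusCode::OK)
        .header(header::CONTENT_TYPE, content_type)
        .body(body)
        .unwrap()
}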

Update: crates/api/src/routes/mod.rs - add pub mod audio;

7. Register Routes and Service

File: crates/api/src/lib.rs

Add:

pub fn build_audio_routes(audio_service, models_service, auth_state, usage_state, rate_limit_state) -> Router {
    Router::new()
        .route("/audio/transcriptions", post(transcribe_audio))
        .route("/audio/speech", post(generate_speech))
        .layer(DefaultBodyLimit::max(25 * 1024 * 1024))  // 25MB for audio
        .with_state(audio_state)
        .layer(usage_check_middleware)
        .layer(rate_limit_middleware)
        .layer(auth_middleware_with_workspace_context)
        .layer(body_hash_middleware)
}

Update DomainServices struct to include audio_service.

Update init_domain_services_with_pool to create AudioServiceImpl.

Update build_app_with_config to call build_audio_routes and merge into v1 router.

8. Add E2E Tests

New file: crates/api/tests/e2e_audio_api.rs

Tests:

  • test_audio_transcription - POST multipart with audio file
  • test_audio_speech - POST JSON, verify binary audio response
  • test_audio_transcription_validation - missing file/model errors
  • test_audio_speech_validation - missing voice, input too long errors
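
One of the validation tests might look like the following; spawn_test_app() and its base_url/api_key fields stand in for whatever helper the existing e2e suite uses to boot the API against a test database:

#[tokio::test]
async fn test_audio_speech_validation() {
    let app = spawn_test_app().await;
    let client = reqwest::Client::new();

    // Missing "voice" must be rejected before any provider is called.
    let resp = client
        .post(format!("{}/v1/audio/speech", app.base_url))
        .bearer_auth(&app.api_key)
        .json(&serde_json::json!({ "model": "tts-1", "input": "Hello world" }))
        .send()
        .await
        .expect("request failed");
    assert!(resp.status().is_client_error());

    // Input longer than 4096 characters must also be rejected.
    let resp = client
        .post(format!("{}/v1/audio/speech", app.base_url))
        .bearer_auth(&app.api_key)
        .json(&serde_json::json!({
            "model": "tts-1",
            "input": "a".repeat(5000),
            "voice": "alloy"
        }))
        .send()
        .await
        .expect("request failed");
    assert!(resp.status().is_client_error());
}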

Part 1: Critical Files Summary

File                                                   Action
crates/inference_providers/src/models.rs               Add audio params/response types
crates/inference_providers/src/lib.rs                  Extend InferenceProvider trait
crates/inference_providers/src/vllm/mod.rs             Implement audio methods for vLLM
crates/inference_providers/src/external/mod.rs         Implement audio methods for external providers
crates/inference_providers/src/external/backend.rs     Extend ExternalBackend trait with audio methods
crates/services/src/audio/ports.rs                     NEW: Service traits
crates/services/src/audio/mod.rs                       NEW: Service implementation
crates/services/src/lib.rs                             Export audio module
crates/api/src/models.rs                               Add API models (add stream field to speech request)
crates/api/src/routes/audio.rs                         NEW: Route handlers
crates/api/src/routes/mod.rs                           Export audio routes
crates/api/src/lib.rs                                  Register routes, init service
crates/api/tests/e2e_audio_api.rs                      NEW: E2E tests

Part 2: WebSocket Realtime API

9. Add WebSocket Dependencies

File: crates/api/Cargo.toml

Add:

axum = { version = "0.7", features = ["ws"] }
tokio-tungstenite = "0.21"

10. Create Realtime Session Types

New file: crates/services/src/realtime/mod.rs

pub mod ports;
pub mod session;

// Session state for a realtime connection
pub struct RealtimeSession {
    pub session_id: String,
    pub conversation_id: Option<Uuid>,
    pub stt_model: String,      // e.g., "whisper-1"
    pub llm_model: String,      // e.g., "gpt-4"
    pub tts_model: String,      // e.g., "tts-1"
    pub tts_voice: String,      // e.g., "alloy"
    pub audio_buffer: Vec<u8>,  // Accumulated audio input
    pub context: Vec<Message>,  // Conversation history
}

// Client → Server events
pub enum ClientEvent {
    SessionUpdate { session: SessionConfig },
    InputAudioBufferAppend { audio: String },  // base64 audio chunk
    InputAudioBufferCommit,
    InputAudioBufferClear,
    ConversationItemCreate { item: ConversationItem },
    ResponseCreate { response: Option<ResponseConfig> },
    ResponseCancel,
}

// Server → Client events
pub enum ServerEvent {
    SessionCreated { session: Session },
    SessionUpdated { session: Session },
    InputAudioBufferCommitted { item_id: String },
    InputAudioBufferCleared,
    InputAudioBufferSpeechStarted { audio_start_ms: i32, item_id: String },
    InputAudioBufferSpeechStopped { audio_end_ms: i32, item_id: String },
    ConversationItemCreated { item: ConversationItem },
    ConversationItemInputAudioTranscriptionCompleted { item_id: String, transcript: String },
    ResponseCreated { response: Response },
    ResponseOutputItemAdded { item: ConversationItem },
    ResponseOutputItemDone { item: ConversationItem },
    ResponseTextDelta { item_id: String, delta: String },
    ResponseTextDone { item_id: String, text: String },
    ResponseAudioDelta { item_id: String, delta: String },  // base64 audio chunk
    ResponseAudioDone { item_id: String },
    ResponseDone { response: Response },
    Error { type_: String, code: String, message: String },
}
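
Since the wire format uses dotted event names (e.g. "session.update", "input_audio_buffer.append", as used in the verification section), the event enums will need explicit serde tags. A sketch for ClientEvent; the renames follow the naming the verification examples use, and ServerEvent would get the same treatment:

use serde::{Deserialize, Serialize};

#[derive(Debug, Serialize, Deserialize)]
#[serde(tag = "type")]
pub enum ClientEvent {
    #[serde(rename = "session.update")]
    SessionUpdate { session: SessionConfig },
    #[serde(rename = "input_audio_buffer.append")]
    InputAudioBufferAppend { audio: String },  // base64 audio chunk
    #[serde(rename = "input_audio_buffer.commit")]
    InputAudioBufferCommit,
    #[serde(rename = "input_audio_buffer.clear")]
    InputAudioBufferClear,
    #[serde(rename = "conversation.item.create")]
    ConversationItemCreate { item: ConversationItem },
    #[serde(rename = "response.create")]
    ResponseCreate { response: Option<ResponseConfig> },
    #[serde(rename = "response.cancel")]
    ResponseCancel,
}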

11. Create Realtime Service

File: crates/services/src/realtime/ports.rs

#[async_trait]
pub trait RealtimeServiceTrait: Send + Sync {
    async fn create_session(&self, config: SessionConfig, ctx: &WorkspaceContext)
        -> Result<RealtimeSession, RealtimeError>;

    async fn handle_audio_chunk(&self, session: &mut RealtimeSession, audio_base64: &str)
        -> Result<(), RealtimeError>;

    async fn commit_audio_buffer(&self, session: &mut RealtimeSession)
        -> Result<TranscriptionResult, RealtimeError>;

    async fn generate_response(&self, session: &mut RealtimeSession)
        -> Result<Pin<Box<dyn Stream<Item = ServerEvent> + Send>>, RealtimeError>;
}

File: crates/services/src/realtime/session.rs

Implements the STT → LLM → TTS pipeline:

  1. commit_audio_buffer() - Sends accumulated audio to STT model, returns transcript
  2. generate_response() - Sends transcript + context to LLM, streams response
  3. For each LLM text chunk, generates TTS audio chunk and streams both
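
A loose sketch of steps 2 and 3, assuming the async-stream crate is added as a dependency; RealtimeSessionService, stream_chat(), and synthesize_chunk() are placeholder names for the real completion/audio service handles, not existing APIs:

use std::pin::Pin;
use async_stream::stream;            // assumed extra dependency
use futures::{Stream, StreamExt};

impl RealtimeSessionService {
    pub async fn generate_response(
        &self,
        session: &mut RealtimeSession,
    ) -> Result<Pin<Box<dyn Stream<Item = ServerEvent> + Send>>, RealtimeError> {
        let item_id = uuid::Uuid::new_v4().to_string();
        let context = session.context.clone();
        let llm = self.completion_service.clone();   // assumed Arc<dyn ...>
        let tts = self.audio_service.clone();        // assumed Arc<dyn ...>
        let (tts_model, tts_voice) = (session.tts_model.clone(), session.tts_voice.clone());

        let events = stream! {
            // 1. Stream text deltas from the LLM.
            let llm_stream = llm.stream_chat(&context).await;
            futures::pin_mut!(llm_stream);
            let mut pending = String::new();
            while let Some(delta) = llm_stream.next().await {
                pending.push_str(&delta);
                yield ServerEvent::ResponseTextDelta { item_id: item_id.clone(), delta };

                // 2. On sentence boundaries, synthesize a TTS chunk and emit it as base64.
                if pending.ends_with(|c: char| matches!(c, '.' | '!' | '?')) {
                    if let Ok(audio) = tts.synthesize_chunk(&tts_model, &tts_voice, &pending).await {
                        yield ServerEvent::ResponseAudioDelta {
                            item_id: item_id.clone(),
                            delta: base64::encode(&audio),
                        };
                    }
                    pending.clear();
                }
            }
            yield ServerEvent::ResponseAudioDone { item_id: item_id.clone() };
        };

        Ok(Box::pin(events))
    }
}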

12. Create WebSocket Route Handler

New file: crates/api/src/routes/realtime.rs

use axum::{
    extract::{ws::{Message, WebSocket, WebSocketUpgrade}, Extension, State},
    response::IntoResponse,
};
use futures::StreamExt; // for stream.next() in the response loop

pub async fn realtime_handler(
    ws: WebSocketUpgrade,
    State(state): State<RealtimeRouteState>,
    Extension(workspace_ctx): Extension<WorkspaceContext>,
) -> impl IntoResponse {
    ws.on_upgrade(move |socket| handle_realtime_socket(socket, state, workspace_ctx))
}

async fn handle_realtime_socket(
    mut socket: WebSocket,
    state: RealtimeRouteState,
    ctx: WorkspaceContext,
) {
    // Create session
    let mut session = state.realtime_service
        .create_session(SessionConfig::default(), &ctx)
        .await
        .expect("Failed to create session");

    // Send session.created event
    let created_event = ServerEvent::SessionCreated { session: session.to_api() };
    socket.send(Message::Text(serde_json::to_string(&created_event).unwrap())).await.ok();

    // Main event loop
    while let Some(msg) = socket.recv().await {
        match msg {
            Ok(Message::Text(text)) => {
                let event: ClientEvent = match serde_json::from_str(&text) {
                    Ok(event) => event,
                    // Ignore malformed client events (or send an Error event back).
                    Err(_) => continue,
                };
                match event {
                    ClientEvent::InputAudioBufferAppend { audio } => {
                        state.realtime_service.handle_audio_chunk(&mut session, &audio).await.ok();
                    }
                    ClientEvent::InputAudioBufferCommit => {
                        match state.realtime_service.commit_audio_buffer(&mut session).await {
                            Ok(transcript) => {
                                // Send transcription event
                                let event = ServerEvent::ConversationItemInputAudioTranscriptionCompleted {
                                    item_id: transcript.item_id,
                                    transcript: transcript.text,
                                };
                                socket.send(Message::Text(serde_json::to_string(&event).unwrap())).await.ok();
                            }
                            Err(e) => { /* send error event */ }
                        }
                    }
                    ClientEvent::ResponseCreate { .. } => {
                        // Generate LLM response and stream TTS audio
                        let mut stream = state.realtime_service.generate_response(&mut session).await.unwrap();
                        while let Some(event) = stream.next().await {
                            socket.send(Message::Text(serde_json::to_string(&event).unwrap())).await.ok();
                        }
                    }
                    // Handle other events (session.update, buffer clear, item create, cancel)...
                    _ => {}
                }
            }
            Ok(Message::Binary(audio)) => {
                // Direct binary audio input (alternative to base64)
                state.realtime_service.handle_audio_chunk(&mut session,
                    &base64::encode(&audio)).await.ok();
            }
            Ok(Message::Close(_)) => break,
            _ => {}
        }
    }
}

13. Register WebSocket Route

File: crates/api/src/lib.rs

pub fn build_realtime_routes(
    realtime_service: Arc<dyn RealtimeServiceTrait>,
    audio_service: Arc<dyn AudioServiceTrait>,
    completion_service: Arc<dyn CompletionServiceTrait>,
    auth_state: &AuthState,
) -> Router {
    let state = RealtimeRouteState { realtime_service, audio_service, completion_service };

    Router::new()
        .route("/realtime", get(realtime_handler))
        .with_state(state)
        .layer(auth_middleware_with_workspace_context)  // Auth via query param or header
}

Update build_app_with_config to include realtime routes.


Updated Critical Files Summary

File                                          Action
crates/api/Cargo.toml                         Add WebSocket dependencies
crates/services/src/realtime/mod.rs           NEW: Realtime session types and events
crates/services/src/realtime/ports.rs         NEW: Realtime service trait
crates/services/src/realtime/session.rs       NEW: STT → LLM → TTS pipeline
crates/api/src/routes/realtime.rs             NEW: WebSocket handler

Privacy Requirements

NEVER log:

  • Audio file contents
  • Transcription text
  • TTS input text

OK to log:

  • Request IDs, model names
  • Audio duration (seconds), character count
  • Error codes, latency metrics

Verification

Part 1: REST Endpoints

  1. Build: cargo build
  2. Unit tests: cargo test --lib --bins
  3. E2E tests: cargo test --test e2e_audio_api (requires PostgreSQL)
  4. Manual test transcription:
    curl -X POST http://localhost:3000/v1/audio/transcriptions \
      -H "Authorization: Bearer sk-test-xxx" \
      -F "file=@test.wav" \
      -F "model=whisper-1"
  5. Manual test speech:
    curl -X POST http://localhost:3000/v1/audio/speech \
      -H "Authorization: Bearer sk-test-xxx" \
      -H "Content-Type: application/json" \
      -d '{"model":"tts-1","input":"Hello world","voice":"alloy"}' \
      --output output.mp3

Part 2: WebSocket Realtime API

  1. WebSocket test with websocat:

    # Connect to realtime endpoint
    websocat "ws://localhost:3000/v1/realtime?authorization=sk-test-xxx"
    
    # Send session.update to configure models
    {"type":"session.update","session":{"stt_model":"whisper-1","llm_model":"gpt-4","tts_model":"tts-1","voice":"alloy"}}
    
    # Send audio chunk (base64 encoded)
    {"type":"input_audio_buffer.append","audio":"<base64_audio_data>"}
    
    # Commit audio for transcription
    {"type":"input_audio_buffer.commit"}
    
    # Request LLM response with TTS
    {"type":"response.create"}
  2. E2E WebSocket tests: cargo test --test e2e_realtime_api

@gemini-code-assist
Contributor

Summary of Changes

Hello @ilblackdragon, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands the platform's capabilities by introducing robust audio processing features. It provides both traditional RESTful endpoints for speech-to-text and text-to-speech, and a sophisticated real-time WebSocket API designed for interactive voice conversations. The changes involve deep integration with existing inference provider infrastructure, the creation of new service layers for managing audio workflows and usage, and the establishment of dedicated API routes, all while adhering to strict privacy guidelines.

Highlights

  • New Audio APIs (REST): Introduced REST endpoints for file-based speech-to-text (POST /v1/audio/transcriptions) and text-to-speech (POST /v1/audio/speech), including support for streaming TTS responses.
  • Real-time WebSocket API: Implemented a wss://host/v1/realtime WebSocket endpoint for bidirectional audio streaming, enabling real-time voice-to-voice conversations through an integrated STT → LLM → TTS pipeline.
  • Inference Provider Integration: Extended both vLLM-hosted models and external providers (OpenAI-compatible, Google Gemini) to support the new audio transcription and speech synthesis functionalities.
  • Dedicated Service Layer: Created new AudioService and RealtimeService modules within the service layer to encapsulate business logic, provider routing, and usage tracking for audio operations.
  • API Route and Model Definitions: Added new API models for audio requests and responses, along with dedicated Axum routes and middleware for handling audio-related HTTP and WebSocket traffic, including body size limits and authentication.
  • Comprehensive E2E Testing: Included new end-to-end tests for both the RESTful audio endpoints and the WebSocket real-time API to ensure functionality and integration.
  • Privacy Considerations: Explicitly defined privacy requirements to prevent logging of sensitive audio file contents, transcription text, or TTS input text, while allowing logging of metadata for usage tracking.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces comprehensive audio APIs for speech-to-text (STT) and text-to-speech (TTS), including both REST endpoints and a real-time WebSocket API. The implementation integrates well with the existing inference provider pool and includes usage tracking. The code is generally well-structured with clear separation of concerns. However, there are a few areas that need attention, particularly regarding the real-time streaming behavior and consistent service layer utilization. One critical point highlighted by existing rules is the handling of background tasks for billing operations to ensure data integrity during graceful shutdowns.

@nickpismenkov nickpismenkov self-assigned this Jan 27, 2026
@nickpismenkov nickpismenkov changed the title feat: Adding Audio APIs feat: Adding Audio APIs [waiting for vllm-proxy change] Jan 28, 2026