
feat: Adding Audio APIs [waiting for vllm-proxy change] #377

Draft
ilblackdragon wants to merge 5 commits into main from whisper

Conversation

@ilblackdragon
Member

Audio API Implementation Plan

Overview

Implement comprehensive audio APIs including file-based endpoints and real-time WebSocket support for voice conversations.

REST Endpoints:

  • POST /v1/audio/transcriptions - File-based speech-to-text
  • POST /v1/audio/speech - File-based text-to-speech (supports streaming)

WebSocket Realtime API:

  • wss://host/v1/realtime - Real-time bidirectional audio streaming for voice-to-voice conversations

Voice-to-Voice Pipeline: STT → LLM → TTS (works with any model combination)

Providers: Both vLLM-hosted models and external providers (OpenAI, etc.)


Implementation Steps

1. Add Audio Models to Inference Providers

File: crates/inference_providers/src/models.rs

Add new types:

// Audio transcription params and response
pub struct AudioTranscriptionParams {
    pub model: String,
    pub audio_data: Vec<u8>,
    pub filename: String,
    pub language: Option<String>,
    pub prompt: Option<String>,
    pub response_format: Option<String>,  // json, text, srt, verbose_json, vtt
    pub temperature: Option<f32>,
    pub timestamp_granularities: Option<Vec<String>>,
}

pub struct AudioTranscriptionResponse { text, task, language, duration, words, segments }
pub struct AudioTranscriptionResponseWithBytes { response, raw_bytes, audio_duration_seconds }

// Text-to-speech params and response
pub struct AudioSpeechParams {
    pub model: String,
    pub input: String,
    pub voice: String,
    pub response_format: Option<String>,  // mp3, opus, aac, flac, wav, pcm
    pub speed: Option<f32>,
}

pub struct AudioSpeechResponseWithBytes { audio_data, content_type, raw_bytes, character_count }

// Error type
pub enum AudioError { InvalidAudioFormat, TranscriptionFailed, SynthesisFailed, ModelNotSupported, HttpError }

// Streaming TTS result
pub type AudioStreamingResult = Pin<Box<dyn Stream<Item = Result<AudioChunk, AudioError>> + Send>>;
pub struct AudioChunk { data: Vec<u8>, is_final: bool }
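
For reference, the shorthand one-liners above might expand to something like the following once field types are pinned down (the exact types and serde derives here are assumptions, not final definitions):

use serde::{Deserialize, Serialize};

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct AudioTranscriptionResponse {
    pub text: String,
    pub task: Option<String>,                      // e.g. "transcribe"
    pub language: Option<String>,
    pub duration: Option<f64>,                     // seconds of input audio
    pub words: Option<Vec<serde_json::Value>>,     // word-level timestamps (verbose_json)
    pub segments: Option<Vec<serde_json::Value>>,  // segment-level timestamps (verbose_json)
}

#[derive(Debug, Clone)]
pub struct AudioTranscriptionResponseWithBytes {
    pub response: AudioTranscriptionResponse,
    pub raw_bytes: Vec<u8>,            // untouched provider payload (for text/srt/vtt formats)
    pub audio_duration_seconds: f64,   // used for usage recording
}

#[derive(Debug, Clone)]
pub struct AudioSpeechResponseWithBytes {
    pub audio_data: Vec<u8>,
    pub content_type: String,          // e.g. "audio/mpeg"
    pub raw_bytes: Vec<u8>,
    pub character_count: usize,        // used for usage recording
}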

2. Extend InferenceProvider Trait

File: crates/inference_providers/src/lib.rs

Add methods with default implementations that return AudioError::ModelNotSupported (see the sketch after the signatures):

async fn audio_transcription(&self, params: AudioTranscriptionParams, request_hash: String)
    -> Result<AudioTranscriptionResponseWithBytes, AudioError>;

async fn audio_speech(&self, params: AudioSpeechParams, request_hash: String)
    -> Result<AudioSpeechResponseWithBytes, AudioError>;

async fn audio_speech_stream(&self, params: AudioSpeechParams, request_hash: String)
    -> Result<AudioStreamingResult, AudioError>;
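
A minimal sketch of those defaults, assuming the trait already carries #[async_trait]; providers that don't support audio simply fall through to these bodies:

// Inside the existing #[async_trait] trait InferenceProvider { ... }:

async fn audio_transcription(
    &self,
    _params: AudioTranscriptionParams,
    _request_hash: String,
) -> Result<AudioTranscriptionResponseWithBytes, AudioError> {
    // Default: this provider does not support speech-to-text.
    Err(AudioError::ModelNotSupported)
}

async fn audio_speech(
    &self,
    _params: AudioSpeechParams,
    _request_hash: String,
) -> Result<AudioSpeechResponseWithBytes, AudioError> {
    // Default: this provider does not support text-to-speech.
    Err(AudioError::ModelNotSupported)
}

async fn audio_speech_stream(
    &self,
    _params: AudioSpeechParams,
    _request_hash: String,
) -> Result<AudioStreamingResult, AudioError> {
    // Default: this provider does not support streaming text-to-speech.
    Err(AudioError::ModelNotSupported)
}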

3. Implement vLLM Provider Audio Methods

File: crates/inference_providers/src/vllm/mod.rs

Implement:

  • audio_transcription() - POST multipart form to {base_url}/v1/audio/transcriptions
  • audio_speech() - POST JSON to {base_url}/v1/audio/speech, return binary audio
  • audio_speech_stream() - POST with streaming response, return audio chunks
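
A rough sketch of the transcription call, assuming the vLLM provider struct already holds a reqwest::Client (self.client) and a base_url field; adjust to the actual provider internals:

// Inside the #[async_trait] impl InferenceProvider for the vLLM provider:
async fn audio_transcription(
    &self,
    params: AudioTranscriptionParams,
    _request_hash: String,
) -> Result<AudioTranscriptionResponseWithBytes, AudioError> {
    let AudioTranscriptionParams {
        model, audio_data, filename, language, response_format, ..
    } = params;

    // Build the multipart form expected by /v1/audio/transcriptions.
    let mut form = reqwest::multipart::Form::new()
        .text("model", model)
        .part("file", reqwest::multipart::Part::bytes(audio_data).file_name(filename));
    if let Some(language) = language {
        form = form.text("language", language);
    }
    if let Some(format) = response_format {
        form = form.text("response_format", format);
    }

    let url = format!("{}/v1/audio/transcriptions", self.base_url);
    let resp = self.client.post(&url).multipart(form).send().await
        .map_err(|_| AudioError::HttpError)?;
    if !resp.status().is_success() {
        return Err(AudioError::TranscriptionFailed);
    }

    // Keep the raw payload for non-JSON formats; parse JSON for the structured response.
    let raw_bytes = resp.bytes().await.map_err(|_| AudioError::HttpError)?.to_vec();
    let response: AudioTranscriptionResponse =
        serde_json::from_slice(&raw_bytes).map_err(|_| AudioError::TranscriptionFailed)?;

    Ok(AudioTranscriptionResponseWithBytes {
        audio_duration_seconds: response.duration.unwrap_or(0.0),
        response,
        raw_bytes,
    })
}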

3b. Implement External Provider Audio Methods

File: crates/inference_providers/src/external/mod.rs

Add audio support to ExternalProvider:

  • For OpenAI-compatible backends: forward requests to /v1/audio/transcriptions and /v1/audio/speech
  • Use the existing ExternalBackend pattern: add audio_transcription() and audio_speech() methods to the trait (a hedged sketch follows this list)
  • Handle API key and base URL from provider config
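
For the OpenAI-compatible path this is mostly pass-through. A sketch of audio_speech(); self.client, self.config.base_url, and self.config.api_key are assumed fields and should be aligned with the existing backend struct:

async fn audio_speech(
    &self,
    params: AudioSpeechParams,
    _request_hash: String,
) -> Result<AudioSpeechResponseWithBytes, AudioError> {
    let character_count = params.input.chars().count();

    // Optional fields serialize as null here; a real implementation may want to omit them.
    let body = serde_json::json!({
        "model": params.model,
        "input": params.input,
        "voice": params.voice,
        "response_format": params.response_format,
        "speed": params.speed,
    });

    let resp = self
        .client
        .post(format!("{}/v1/audio/speech", self.config.base_url))
        .bearer_auth(&self.config.api_key)
        .json(&body)
        .send()
        .await
        .map_err(|_| AudioError::HttpError)?;
    if !resp.status().is_success() {
        return Err(AudioError::SynthesisFailed);
    }

    let content_type = resp
        .headers()
        .get(reqwest::header::CONTENT_TYPE)
        .and_then(|v| v.to_str().ok())
        .unwrap_or("audio/mpeg")
        .to_string();
    let audio_data = resp.bytes().await.map_err(|_| AudioError::HttpError)?.to_vec();

    Ok(AudioSpeechResponseWithBytes {
        raw_bytes: audio_data.clone(),
        audio_data,
        content_type,
        character_count,
    })
}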

4. Create Audio Service

New files:

  • crates/services/src/audio/mod.rs
  • crates/services/src/audio/ports.rs

ports.rs:

pub struct TranscribeRequest { model, audio_data, filename, language, prompt, response_format, temperature, timestamp_granularities, organization_id, workspace_id, api_key_id, model_id, request_hash }
pub struct SpeechRequest { model, input, voice, response_format, speed, organization_id, workspace_id, api_key_id, model_id, request_hash }
pub struct TranscribeResponse { text, language, duration, words, segments }
pub struct SpeechResponse { audio_data, content_type }
pub enum AudioServiceError { ModelNotFound, ProviderError, InvalidRequest, UsageError, InternalError }

#[async_trait]
pub trait AudioServiceTrait: Send + Sync {
    async fn transcribe(&self, request: TranscribeRequest) -> Result<TranscribeResponse, AudioServiceError>;
    async fn synthesize(&self, request: SpeechRequest) -> Result<SpeechResponse, AudioServiceError>;
    async fn synthesize_stream(&self, request: SpeechRequest)
        -> Result<Pin<Box<dyn Stream<Item = Result<Vec<u8>, AudioServiceError>> + Send>>, AudioServiceError>;
}

mod.rs:

  • AudioServiceImpl with inference_pool and usage_service dependencies
  • Transcribe: get provider, call audio_transcription, record usage (audio_seconds)
  • Synthesize: get provider, call audio_speech, record usage (character_count)

Update: crates/services/src/lib.rs - add pub mod audio; and pub mod realtime;
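
A hedged sketch of the transcribe flow from the list above; provider_for_model() and record_audio_seconds() are placeholder names for whatever the pool and usage services actually expose. Synthesize follows the same shape with character_count-based usage recording.

impl AudioServiceImpl {
    pub async fn transcribe(
        &self,
        request: TranscribeRequest,
    ) -> Result<TranscribeResponse, AudioServiceError> {
        // 1. Resolve a provider for the requested model.
        let provider = self
            .inference_pool
            .provider_for_model(&request.model)
            .await
            .ok_or(AudioServiceError::ModelNotFound)?;

        // 2. Call the provider's transcription method.
        let params = AudioTranscriptionParams {
            model: request.model,
            audio_data: request.audio_data,
            filename: request.filename,
            language: request.language,
            prompt: request.prompt,
            response_format: request.response_format,
            temperature: request.temperature,
            timestamp_granularities: request.timestamp_granularities,
        };
        let result = provider
            .audio_transcription(params, request.request_hash)
            .await
            .map_err(|_| AudioServiceError::ProviderError)?;

        // 3. Record usage in audio seconds (never log the transcript itself).
        self.usage_service
            .record_audio_seconds(
                request.organization_id,
                request.workspace_id,
                request.api_key_id,
                request.model_id,
                result.audio_duration_seconds,
            )
            .await
            .map_err(|_| AudioServiceError::UsageError)?;

        Ok(TranscribeResponse {
            text: result.response.text,
            language: result.response.language,
            duration: result.response.duration,
            words: result.response.words,
            segments: result.response.segments,
        })
    }
}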

5. Add API Request/Response Models

File: crates/api/src/models.rs

Add:

pub struct AudioTranscriptionRequest { model, language, prompt, response_format, temperature, timestamp_granularities }
pub struct AudioTranscriptionResponse { text, task, language, duration, words, segments }
pub struct AudioSpeechRequest { model, input, voice, response_format, speed, stream: Option<bool> }

impl AudioSpeechRequest {
    pub fn validate(&self) -> Result<(), String> { /* validate model, input length <= 4096, voice, speed 0.25-4.0, format */ }
}
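
A sketch of that validation, assuming model, input, and voice are String and speed is Option<f32>; the allowed format list mirrors the response_format comment in step 1, and the accepted voices should follow whatever the upstream providers support:

impl AudioSpeechRequest {
    pub fn validate(&self) -> Result<(), String> {
        if self.model.trim().is_empty() {
            return Err("model is required".to_string());
        }
        if self.input.is_empty() {
            return Err("input is required".to_string());
        }
        if self.input.chars().count() > 4096 {
            return Err("input must be 4096 characters or fewer".to_string());
        }
        if self.voice.trim().is_empty() {
            return Err("voice is required".to_string());
        }
        if let Some(speed) = self.speed {
            if !(0.25..=4.0).contains(&speed) {
                return Err("speed must be between 0.25 and 4.0".to_string());
            }
        }
        if let Some(format) = &self.response_format {
            const ALLOWED: [&str; 6] = ["mp3", "opus", "aac", "flac", "wav", "pcm"];
            if !ALLOWED.contains(&format.as_str()) {
                return Err(format!("unsupported response_format: {format}"));
            }
        }
        Ok(())
    }
}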

6. Create Audio Routes

New file: crates/api/src/routes/audio.rs

pub struct AudioRouteState { audio_service, models_service }

// POST /v1/audio/transcriptions (multipart form)
pub async fn transcribe_audio(State, Extension<WorkspaceContext>, Extension<RequestBodyHash>, Multipart)
    -> Result<Json<AudioTranscriptionResponse>, (StatusCode, Json<ErrorResponse>)>

// POST /v1/audio/speech (JSON body, returns binary audio or streaming audio)
// If request has stream: true, returns chunked audio stream
// Otherwise returns complete binary audio
pub async fn generate_speech(State, Extension<WorkspaceContext>, Extension<RequestBodyHash>, Json<AudioSpeechRequest>)
    -> Result<Response, (StatusCode, Json<ErrorResponse>)>

For streaming TTS, use Transfer-Encoding: chunked and Content-Type: audio/mpeg (or requested format).
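
One way to build that streaming response with axum 0.7's Body::from_stream; stream_speech_response is an illustrative helper rather than an existing function, and AudioServiceError is assumed to derive Debug:

use axum::{
    body::Body,
    http::{header, StatusCode},
    response::Response,
};
use futures::{Stream, StreamExt};

fn stream_speech_response(
    audio_stream: impl Stream<Item = Result<Vec<u8>, AudioServiceError>> + Send + 'static,
    content_type: &str,
) -> Response {
    // axum emits Transfer-Encoding: chunked for a streaming body of unknown length.
    let body = Body::from_stream(audio_stream.map(|chunk| {
        chunk.map_err(|e| std::io::Error::new(std::io::ErrorKind::Other, format!("{e:?}")))
    }));

    Response::builder()
        .status(StatusCode::OK)
        .header(header::CONTENT_TYPE, content_type)
        .body(body)
        .unwrap()
}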

Update: crates/api/src/routes/mod.rs - add pub mod audio;

7. Register Routes and Service

File: crates/api/src/lib.rs

Add:

pub fn build_audio_routes(audio_service, models_service, auth_state, usage_state, rate_limit_state) -> Router {
    Router::new()
        .route("/audio/transcriptions", post(transcribe_audio))
        .route("/audio/speech", post(generate_speech))
        .layer(DefaultBodyLimit::max(25 * 1024 * 1024))  // 25MB for audio
        .with_state(audio_state)
        .layer(usage_check_middleware)
        .layer(rate_limit_middleware)
        .layer(auth_middleware_with_workspace_context)
        .layer(body_hash_middleware)
}

Update DomainServices struct to include audio_service.

Update init_domain_services_with_pool to create AudioServiceImpl.

Update build_app_with_config to call build_audio_routes and merge into v1 router.

8. Add E2E Tests

New file: crates/api/tests/e2e_audio_api.rs

Tests:

  • test_audio_transcription - POST multipart with audio file
  • test_audio_speech - POST JSON, verify binary audio response
  • test_audio_transcription_validation - missing file/model errors
  • test_audio_speech_validation - missing voice, input too long errors
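
One of the validation tests might look like the following; spawn_test_app() and its base_url/api_key fields stand in for whatever helper the existing e2e suite uses to boot the API against a test database:

#[tokio::test]
async fn test_audio_speech_validation() {
    let app = spawn_test_app().await;
    let client = reqwest::Client::new();

    // Missing "voice" must be rejected before any provider is called.
    let resp = client
        .post(format!("{}/v1/audio/speech", app.base_url))
        .bearer_auth(&app.api_key)
        .json(&serde_json::json!({ "model": "tts-1", "input": "Hello world" }))
        .send()
        .await
        .expect("request failed");
    assert!(resp.status().is_client_error());

    // Input longer than 4096 characters must also be rejected.
    let resp = client
        .post(format!("{}/v1/audio/speech", app.base_url))
        .bearer_auth(&app.api_key)
        .json(&serde_json::json!({
            "model": "tts-1",
            "input": "a".repeat(5000),
            "voice": "alloy"
        }))
        .send()
        .await
        .expect("request failed");
    assert!(resp.status().is_client_error());
}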

Part 1: Critical Files Summary

File                                                   Action
crates/inference_providers/src/models.rs               Add audio params/response types
crates/inference_providers/src/lib.rs                  Extend InferenceProvider trait
crates/inference_providers/src/vllm/mod.rs             Implement audio methods for vLLM
crates/inference_providers/src/external/mod.rs         Implement audio methods for external providers
crates/inference_providers/src/external/backend.rs     Extend ExternalBackend trait with audio methods
crates/services/src/audio/ports.rs                     NEW: Service traits
crates/services/src/audio/mod.rs                       NEW: Service implementation
crates/services/src/lib.rs                             Export audio module
crates/api/src/models.rs                               Add API models (add stream field to speech request)
crates/api/src/routes/audio.rs                         NEW: Route handlers
crates/api/src/routes/mod.rs                           Export audio routes
crates/api/src/lib.rs                                  Register routes, init service
crates/api/tests/e2e_audio_api.rs                      NEW: E2E tests

Part 2: WebSocket Realtime API

9. Add WebSocket Dependencies

File: crates/api/Cargo.toml

Add:

axum = { version = "0.7", features = ["ws"] }
tokio-tungstenite = "0.21"

10. Create Realtime Session Types

New file: crates/services/src/realtime/mod.rs

pub mod ports;
pub mod session;

// Session state for a realtime connection
pub struct RealtimeSession {
    pub session_id: String,
    pub conversation_id: Option<Uuid>,
    pub stt_model: String,      // e.g., "whisper-1"
    pub llm_model: String,      // e.g., "gpt-4"
    pub tts_model: String,      // e.g., "tts-1"
    pub tts_voice: String,      // e.g., "alloy"
    pub audio_buffer: Vec<u8>,  // Accumulated audio input
    pub context: Vec<Message>,  // Conversation history
}

// Client → Server events
pub enum ClientEvent {
    SessionUpdate { session: SessionConfig },
    InputAudioBufferAppend { audio: String },  // base64 audio chunk
    InputAudioBufferCommit,
    InputAudioBufferClear,
    ConversationItemCreate { item: ConversationItem },
    ResponseCreate { response: Option<ResponseConfig> },
    ResponseCancel,
}

// Server → Client events
pub enum ServerEvent {
    SessionCreated { session: Session },
    SessionUpdated { session: Session },
    InputAudioBufferCommitted { item_id: String },
    InputAudioBufferCleared,
    InputAudioBufferSpeechStarted { audio_start_ms: i32, item_id: String },
    InputAudioBufferSpeechStopped { audio_end_ms: i32, item_id: String },
    ConversationItemCreated { item: ConversationItem },
    ConversationItemInputAudioTranscriptionCompleted { item_id: String, transcript: String },
    ResponseCreated { response: Response },
    ResponseOutputItemAdded { item: ConversationItem },
    ResponseOutputItemDone { item: ConversationItem },
    ResponseTextDelta { item_id: String, delta: String },
    ResponseTextDone { item_id: String, text: String },
    ResponseAudioDelta { item_id: String, delta: String },  // base64 audio chunk
    ResponseAudioDone { item_id: String },
    ResponseDone { response: Response },
    Error { type_: String, code: String, message: String },
}
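
Since the wire format uses dotted event names (e.g. "session.update", "input_audio_buffer.append", as used in the verification section), the event enums will need explicit serde tags. A sketch for ClientEvent; the renames follow the naming the verification examples use, and ServerEvent would get the same treatment:

use serde::{Deserialize, Serialize};

#[derive(Debug, Serialize, Deserialize)]
#[serde(tag = "type")]
pub enum ClientEvent {
    #[serde(rename = "session.update")]
    SessionUpdate { session: SessionConfig },
    #[serde(rename = "input_audio_buffer.append")]
    InputAudioBufferAppend { audio: String },  // base64 audio chunk
    #[serde(rename = "input_audio_buffer.commit")]
    InputAudioBufferCommit,
    #[serde(rename = "input_audio_buffer.clear")]
    InputAudioBufferClear,
    #[serde(rename = "conversation.item.create")]
    ConversationItemCreate { item: ConversationItem },
    #[serde(rename = "response.create")]
    ResponseCreate { response: Option<ResponseConfig> },
    #[serde(rename = "response.cancel")]
    ResponseCancel,
}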

11. Create Realtime Service

File: crates/services/src/realtime/ports.rs

#[async_trait]
pub trait RealtimeServiceTrait: Send + Sync {
    async fn create_session(&self, config: SessionConfig, ctx: &WorkspaceContext)
        -> Result<RealtimeSession, RealtimeError>;

    async fn handle_audio_chunk(&self, session: &mut RealtimeSession, audio_base64: &str)
        -> Result<(), RealtimeError>;

    async fn commit_audio_buffer(&self, session: &mut RealtimeSession)
        -> Result<TranscriptionResult, RealtimeError>;

    async fn generate_response(&self, session: &mut RealtimeSession)
        -> Result<Pin<Box<dyn Stream<Item = ServerEvent> + Send>>, RealtimeError>;
}

File: crates/services/src/realtime/session.rs

Implements the STT → LLM → TTS pipeline:

  1. commit_audio_buffer() - Sends accumulated audio to STT model, returns transcript
  2. generate_response() - Sends transcript + context to LLM, streams response
  3. For each LLM text chunk, generates TTS audio chunk and streams both
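
A loose sketch of steps 2 and 3, assuming the async-stream crate is added as a dependency; RealtimeSessionService, stream_chat(), and synthesize_chunk() are placeholder names for the real completion/audio service handles, not existing APIs:

use std::pin::Pin;
use async_stream::stream;            // assumed extra dependency
use futures::{Stream, StreamExt};

impl RealtimeSessionService {
    pub async fn generate_response(
        &self,
        session: &mut RealtimeSession,
    ) -> Result<Pin<Box<dyn Stream<Item = ServerEvent> + Send>>, RealtimeError> {
        let item_id = uuid::Uuid::new_v4().to_string();
        let context = session.context.clone();
        let llm = self.completion_service.clone();   // assumed Arc<dyn ...>
        let tts = self.audio_service.clone();        // assumed Arc<dyn ...>
        let (tts_model, tts_voice) = (session.tts_model.clone(), session.tts_voice.clone());

        let events = stream! {
            // 1. Stream text deltas from the LLM.
            let llm_stream = llm.stream_chat(&context).await;
            futures::pin_mut!(llm_stream);
            let mut pending = String::new();
            while let Some(delta) = llm_stream.next().await {
                pending.push_str(&delta);
                yield ServerEvent::ResponseTextDelta { item_id: item_id.clone(), delta };

                // 2. On sentence boundaries, synthesize a TTS chunk and emit it as base64.
                if pending.ends_with(|c: char| matches!(c, '.' | '!' | '?')) {
                    if let Ok(audio) = tts.synthesize_chunk(&tts_model, &tts_voice, &pending).await {
                        yield ServerEvent::ResponseAudioDelta {
                            item_id: item_id.clone(),
                            delta: base64::encode(&audio),
                        };
                    }
                    pending.clear();
                }
            }
            yield ServerEvent::ResponseAudioDone { item_id: item_id.clone() };
        };

        Ok(Box::pin(events))
    }
}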

12. Create WebSocket Route Handler

New file: crates/api/src/routes/realtime.rs

use axum::{
    extract::{ws::{Message, WebSocket, WebSocketUpgrade}, Extension, State},
    response::IntoResponse,
};
use futures::StreamExt; // for stream.next() in the response loop

pub async fn realtime_handler(
    ws: WebSocketUpgrade,
    State(state): State<RealtimeRouteState>,
    Extension(workspace_ctx): Extension<WorkspaceContext>,
) -> impl IntoResponse {
    ws.on_upgrade(move |socket| handle_realtime_socket(socket, state, workspace_ctx))
}

async fn handle_realtime_socket(
    mut socket: WebSocket,
    state: RealtimeRouteState,
    ctx: WorkspaceContext,
) {
    // Create session
    let mut session = state.realtime_service
        .create_session(SessionConfig::default(), &ctx)
        .await
        .expect("Failed to create session");

    // Send session.created event
    let created_event = ServerEvent::SessionCreated { session: session.to_api() };
    socket.send(Message::Text(serde_json::to_string(&created_event).unwrap())).await.ok();

    // Main event loop
    while let Some(msg) = socket.recv().await {
        match msg {
            Ok(Message::Text(text)) => {
                let event: ClientEvent = match serde_json::from_str(&text) {
                    Ok(event) => event,
                    // Ignore malformed client events (or send an Error event back).
                    Err(_) => continue,
                };
                match event {
                    ClientEvent::InputAudioBufferAppend { audio } => {
                        state.realtime_service.handle_audio_chunk(&mut session, &audio).await.ok();
                    }
                    ClientEvent::InputAudioBufferCommit => {
                        match state.realtime_service.commit_audio_buffer(&mut session).await {
                            Ok(transcript) => {
                                // Send transcription event
                                let event = ServerEvent::ConversationItemInputAudioTranscriptionCompleted {
                                    item_id: transcript.item_id,
                                    transcript: transcript.text,
                                };
                                socket.send(Message::Text(serde_json::to_string(&event).unwrap())).await.ok();
                            }
                            Err(e) => { /* send error event */ }
                        }
                    }
                    ClientEvent::ResponseCreate { .. } => {
                        // Generate LLM response and stream TTS audio
                        let mut stream = state.realtime_service.generate_response(&mut session).await.unwrap();
                        while let Some(event) = stream.next().await {
                            socket.send(Message::Text(serde_json::to_string(&event).unwrap())).await.ok();
                        }
                    }
                    // Handle other events (session.update, buffer clear, item create, cancel)...
                    _ => {}
                }
            }
            Ok(Message::Binary(audio)) => {
                // Direct binary audio input (alternative to base64)
                state.realtime_service.handle_audio_chunk(&mut session,
                    &base64::encode(&audio)).await.ok();
            }
            Ok(Message::Close(_)) => break,
            _ => {}
        }
    }
}

13. Register WebSocket Route

File: crates/api/src/lib.rs

pub fn build_realtime_routes(
    realtime_service: Arc<dyn RealtimeServiceTrait>,
    audio_service: Arc<dyn AudioServiceTrait>,
    completion_service: Arc<dyn CompletionServiceTrait>,
    auth_state: &AuthState,
) -> Router {
    let state = RealtimeRouteState { realtime_service, audio_service, completion_service };

    Router::new()
        .route("/realtime", get(realtime_handler))
        .with_state(state)
        .layer(auth_middleware_with_workspace_context)  // Auth via query param or header
}

Update build_app_with_config to include realtime routes.


Updated Critical Files Summary

File                                          Action
crates/api/Cargo.toml                         Add WebSocket dependencies
crates/services/src/realtime/mod.rs           NEW: Realtime session types and events
crates/services/src/realtime/ports.rs         NEW: Realtime service trait
crates/services/src/realtime/session.rs       NEW: STT → LLM → TTS pipeline
crates/api/src/routes/realtime.rs             NEW: WebSocket handler

Privacy Requirements

NEVER log:

  • Audio file contents
  • Transcription text
  • TTS input text

OK to log:

  • Request IDs, model names
  • Audio duration (seconds), character count
  • Error codes, latency metrics

Verification

Part 1: REST Endpoints

  1. Build: cargo build
  2. Unit tests: cargo test --lib --bins
  3. E2E tests: cargo test --test e2e_audio_api (requires PostgreSQL)
  4. Manual test transcription:
    curl -X POST http://localhost:3000/v1/audio/transcriptions \
      -H "Authorization: Bearer sk-test-xxx" \
      -F "file=@test.wav" \
      -F "model=whisper-1"
  5. Manual test speech:
    curl -X POST http://localhost:3000/v1/audio/speech \
      -H "Authorization: Bearer sk-test-xxx" \
      -H "Content-Type: application/json" \
      -d '{"model":"tts-1","input":"Hello world","voice":"alloy"}' \
      --output output.mp3

Part 2: WebSocket Realtime API

  1. WebSocket test with websocat:

    # Connect to realtime endpoint
    websocat "ws://localhost:3000/v1/realtime?authorization=sk-test-xxx"
    
    # Send session.update to configure models
    {"type":"session.update","session":{"stt_model":"whisper-1","llm_model":"gpt-4","tts_model":"tts-1","voice":"alloy"}}
    
    # Send audio chunk (base64 encoded)
    {"type":"input_audio_buffer.append","audio":"<base64_audio_data>"}
    
    # Commit audio for transcription
    {"type":"input_audio_buffer.commit"}
    
    # Request LLM response with TTS
    {"type":"response.create"}
  2. E2E WebSocket tests: cargo test --test e2e_realtime_api

@gemini-code-assist
Contributor

Summary of Changes

Hello @ilblackdragon, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands the platform's capabilities by introducing robust audio processing features. It provides both traditional RESTful endpoints for speech-to-text and text-to-speech, and a sophisticated real-time WebSocket API designed for interactive voice conversations. The changes involve deep integration with existing inference provider infrastructure, the creation of new service layers for managing audio workflows and usage, and the establishment of dedicated API routes, all while adhering to strict privacy guidelines.

Highlights

  • New Audio APIs (REST): Introduced REST endpoints for file-based speech-to-text (POST /v1/audio/transcriptions) and text-to-speech (POST /v1/audio/speech), including support for streaming TTS responses.
  • Real-time WebSocket API: Implemented a wss://host/v1/realtime WebSocket endpoint for bidirectional audio streaming, enabling real-time voice-to-voice conversations through an integrated STT → LLM → TTS pipeline.
  • Inference Provider Integration: Extended both vLLM-hosted models and external providers (OpenAI-compatible, Google Gemini) to support the new audio transcription and speech synthesis functionalities.
  • Dedicated Service Layer: Created new AudioService and RealtimeService modules within the service layer to encapsulate business logic, provider routing, and usage tracking for audio operations.
  • API Route and Model Definitions: Added new API models for audio requests and responses, along with dedicated Axum routes and middleware for handling audio-related HTTP and WebSocket traffic, including body size limits and authentication.
  • Comprehensive E2E Testing: Included new end-to-end tests for both the RESTful audio endpoints and the WebSocket real-time API to ensure functionality and integration.
  • Privacy Considerations: Explicitly defined privacy requirements to prevent logging of sensitive audio file contents, transcription text, or TTS input text, while allowing logging of metadata for usage tracking.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces comprehensive audio APIs for speech-to-text (STT) and text-to-speech (TTS), including both REST endpoints and a real-time WebSocket API. The implementation integrates well with the existing inference provider pool and includes usage tracking. The code is generally well-structured with clear separation of concerns. However, there are a few areas that need attention, particularly regarding the real-time streaming behavior and consistent service layer utilization. One critical point highlighted by existing rules is the handling of background tasks for billing operations to ensure data integrity during graceful shutdowns.

@nickpismenkov nickpismenkov self-assigned this Jan 27, 2026
@nickpismenkov nickpismenkov changed the title feat: Adding Audio APIs feat: Adding Audio APIs [waiting for vllm-proxy change] Jan 28, 2026