A modular Swift SDK for audio processing with MLX on Apple Silicon
MLXAudio follows a modular design, allowing you to import only what you need:
- MLXAudioCore: Base types, protocols, and utilities
- MLXAudioCodecs: Audio codec implementations (SNAC, Encodec, Vocos, Mimi, DACVAE)
- MLXAudioTTS: Text-to-Speech models (Qwen3-TTS, Soprano, VyvoTTS, Orpheus, Marvis TTS, Pocket TTS)
- MLXAudioSTT: Speech-to-Text models (Qwen3-ASR, Voxtral Realtime, Parakeet, GLMASR)
- MLXAudioVAD: Voice Activity Detection & Speaker Diarization (Sortformer, SmartTurn)
- MLXAudioSTS: Speech-to-Speech models (LFM2.5-Audio, SAM-Audio, MossFormer2-SE)
- MLXAudioUI: SwiftUI components for audio interfaces
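The module split above maps one-to-one onto Swift Package Manager products. For context, a complete `Package.swift` for a TTS-only target might look like the following sketch; the app name `MyAudioApp` is illustrative, and the platform and tools versions follow the requirements listed later in this README:

```swift
// swift-tools-version:5.9
import PackageDescription

let package = Package(
    name: "MyAudioApp", // illustrative target name
    platforms: [.macOS(.v14), .iOS(.v17)],
    dependencies: [
        .package(url: "https://github.com/Blaizzy/mlx-audio-swift.git", branch: "main")
    ],
    targets: [
        .executableTarget(
            name: "MyAudioApp",
            dependencies: [
                // Declare only the modules you actually use to keep app size minimal.
                .product(name: "MLXAudioTTS", package: "mlx-audio-swift"),
                .product(name: "MLXAudioCore", package: "mlx-audio-swift")
            ]
        )
    ]
)
```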
Add MLXAudio to your project using Swift Package Manager:

```swift
dependencies: [
    .package(url: "https://github.com/Blaizzy/mlx-audio-swift.git", branch: "main")
]
```

Then add only the products you need to your target dependencies:

```swift
// Import only what you need
.product(name: "MLXAudioTTS", package: "mlx-audio-swift"),
.product(name: "MLXAudioCore", package: "mlx-audio-swift")
```

Generate speech from text:

```swift
import MLXAudioTTS
import MLXAudioCore

// Load a TTS model from HuggingFace
let model = try await SopranoModel.fromPretrained("mlx-community/Soprano-80M-bf16")

// Generate audio
let audio = try await model.generate(
    text: "Hello from MLX Audio Swift!",
    parameters: GenerateParameters(
        maxTokens: 200,
        temperature: 0.7,
        topP: 0.95
    )
)

// Save to file
try saveAudioArray(audio, sampleRate: Double(model.sampleRate), to: outputURL)
```

Transcribe speech to text:

```swift
import MLXAudioSTT
import MLXAudioCore

// Load audio file
let (sampleRate, audioData) = try loadAudioArray(from: audioURL)

// Load STT model
let model = try await GLMASRModel.fromPretrained("mlx-community/GLM-ASR-Nano-2512-4bit")

// Transcribe
let output = model.generate(audio: audioData)
print(output.text)
```

Detect and diarize speakers:

```swift
import MLXAudioVAD
import MLXAudioCore

// Load audio file
let (sampleRate, audioData) = try loadAudioArray(from: audioURL)

// Load diarization model
let model = try await SortformerModel.fromPretrained(
    "mlx-community/diar_streaming_sortformer_4spk-v2.1-fp16"
)

// Detect who is speaking when
let output = try await model.generate(audio: audioData, threshold: 0.5)
for segment in output.segments {
    print("Speaker \(segment.speaker): \(segment.start)s - \(segment.end)s")
}
```

Stream TTS output as it is generated:

```swift
for try await event in model.generateStream(text: text, parameters: parameters) {
    switch event {
    case .token(let token):
        print("Generated token: \(token)")
    case .audio(let audio):
        print("Final audio shape: \(audio.shape)")
    case .info(let info):
        print(info.summary)
    }
}
```

Supported TTS models:

| Model | Model README | HuggingFace Repo |
|---|---|---|
| Qwen3-TTS | — | mlx-community/Qwen3-TTS-12Hz-0.6B-Base-8bit |
| Soprano | Soprano README | mlx-community/Soprano-80M-bf16 |
| VyvoTTS | VyvoTTS README | mlx-community/VyvoTTS-EN-Beta-4bit |
| Orpheus | Orpheus README | mlx-community/orpheus-3b-0.1-ft-bf16 |
| Marvis TTS | Marvis TTS README | Marvis-AI/marvis-tts-250m-v0.2-MLX-8bit |
| Pocket TTS | Pocket TTS README | mlx-community/pocket-tts |

Supported STT models:

| Model | Model README | HuggingFace Repo |
|---|---|---|
| Qwen3-ASR | Qwen3-ASR README | mlx-community/Qwen3-ASR-1.7B-bf16 |
| Qwen3-ForcedAligner | Qwen3-ASR README | mlx-community/Qwen3-ForcedAligner-0.6B-bf16 |
| Voxtral Realtime | Voxtral README | mlx-community/Voxtral-Mini-4B-Realtime-2602-fp16 |
| Parakeet | Parakeet README | mlx-community/parakeet-tdt-0.6b-v3 |
| GLMASR | GLMASR README | mlx-community/GLM-ASR-Nano-2512-4bit |

Supported STS models:

| Model | Model README | HuggingFace Repo |
|---|---|---|
| LFM2.5-Audio | LFM Audio README | mlx-community/LFM2.5-Audio-1.5B-6bit |
| SAM-Audio | SAM Audio README | mlx-community/sam-audio-large-fp16 |
| MossFormer2-SE | — | starkdmi/MossFormer2-SE-fp16 |

Supported VAD & diarization models:

| Model | Model README | HuggingFace Repo |
|---|---|---|
| Sortformer | Sortformer README | mlx-community/diar_streaming_sortformer_4spk-v2.1-fp16 |
| SmartTurn | SmartTurn README | mlx-community/smart-turn-v3 |

Key features:

- Modular architecture for minimal app size: import only what you need
- Automatic model downloading from HuggingFace Hub
- Native async/await support for seamless Swift integration
- Streaming audio generation for real-time TTS
- Type-safe Swift API with comprehensive error handling
- Optimized for Apple Silicon with MLX framework
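The error-handling story can be sketched as follows: every loading, generation, and file call used in the examples above is throwing, so a single `do`/`catch` surfaces download, inference, and I/O failures alike. The output path and the wrapping function are illustrative; the model ID and APIs follow the quick-start examples:

```swift
import Foundation
import MLXAudioTTS
import MLXAudioCore

// A minimal sketch of defensive synthesis; runs in any async context.
func synthesize(_ text: String) async {
    do {
        // Model download, generation, and saving all propagate errors here.
        let model = try await SopranoModel.fromPretrained("mlx-community/Soprano-80M-bf16")
        let audio = try await model.generate(
            text: text,
            parameters: GenerateParameters(maxTokens: 200, temperature: 0.7, topP: 0.95)
        )
        let outputURL = URL(fileURLWithPath: "hello.wav") // illustrative path
        try saveAudioArray(audio, sampleRate: Double(model.sampleRate), to: outputURL)
    } catch {
        print("Audio generation failed: \(error)")
    }
}
```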
Customize generation with `GenerateParameters`:

```swift
let parameters = GenerateParameters(
    maxTokens: 1200,
    temperature: 0.7,
    topP: 0.95,
    repetitionPenalty: 1.5,
    repetitionContextSize: 30
)
let audio = try await model.generate(text: "Your text here", parameters: parameters)
```

Use the audio codecs directly:

```swift
import MLXAudioCodecs

// Load SNAC codec
let snac = try await SNAC.fromPretrained("mlx-community/snac_24khz")

// Encode audio to tokens
let tokens = try snac.encode(audio)

// Decode tokens back to audio
let reconstructed = try snac.decode(tokens)
```

Select a voice on models that support it:

```swift
// For models supporting multiple voices (like LlamaTTS/Orpheus)
let audio = try await model.generate(
    text: "Hello!",
    voice: "tara", // Options: tara, leah, jess, leo, dan, mia, zac, zoe
    parameters: parameters
)
```

Requirements:

- macOS 14+ or iOS 17+
- Apple Silicon (M1 or later) recommended for optimal performance
- Xcode 15+
- Swift 5.9+
Check out the Examples/VoicesApp directory for a complete SwiftUI application demonstrating:
- Loading and running TTS models
- Playing generated audio
- UI components for model interaction
Additional usage examples can be found in the test files.
- Built on MLX Swift
- Uses swift-transformers
- Inspired by MLX Audio (Python)
MIT License - see LICENSE file for details.