A modular Swift SDK for audio processing with MLX on Apple Silicon
MLXAudio follows a modular design, allowing you to import only what you need:
- MLXAudioCore: Base types, protocols, and utilities
- MLXAudioCodecs: Audio codec implementations (SNAC, Encodec, Vocos, Mimi, DACVAE)
- MLXAudioTTS: Text-to-Speech models (Qwen3-TTS, Soprano, VyvoTTS, Orpheus, Marvis TTS, Pocket TTS)
- MLXAudioSTT: Speech-to-Text models (Qwen3-ASR, Voxtral Realtime, Parakeet, GLMASR)
- MLXAudioVAD: Voice Activity Detection & Speaker Diarization (Sortformer, SmartTurn)
- MLXAudioSTS: Speech-to-Speech models (LFM2.5-Audio, SAM-Audio, MossFormer2-SE)
- MLXAudioUI: SwiftUI components for audio interfaces
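The module split above maps one-to-one onto Swift Package Manager products. For context, a complete `Package.swift` for a TTS-only target might look like the following sketch; the app name `MyAudioApp` is illustrative, and the platform and tools versions follow the requirements listed later in this README:

```swift
// swift-tools-version:5.9
import PackageDescription

let package = Package(
    name: "MyAudioApp", // illustrative target name
    platforms: [.macOS(.v14), .iOS(.v17)],
    dependencies: [
        .package(url: "https://github.com/Blaizzy/mlx-audio-swift.git", branch: "main")
    ],
    targets: [
        .executableTarget(
            name: "MyAudioApp",
            dependencies: [
                // Declare only the modules you actually use to keep app size minimal.
                .product(name: "MLXAudioTTS", package: "mlx-audio-swift"),
                .product(name: "MLXAudioCore", package: "mlx-audio-swift")
            ]
        )
    ]
)
```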
Add MLXAudio to your project using Swift Package Manager:

```swift
dependencies: [
    .package(url: "https://github.com/Blaizzy/mlx-audio-swift.git", branch: "main")
]
```

Then add only the products you need to your target dependencies:

```swift
// Import only what you need
.product(name: "MLXAudioTTS", package: "mlx-audio-swift"),
.product(name: "MLXAudioCore", package: "mlx-audio-swift")
```

Generate speech from text:

```swift
import MLXAudioTTS
import MLXAudioCore

// Load a TTS model from HuggingFace
let model = try await SopranoModel.fromPretrained("mlx-community/Soprano-80M-bf16")

// Generate audio
let audio = try await model.generate(
    text: "Hello from MLX Audio Swift!",
    parameters: GenerateParameters(
        maxTokens: 200,
        temperature: 0.7,
        topP: 0.95
    )
)

// Save to file
try saveAudioArray(audio, sampleRate: Double(model.sampleRate), to: outputURL)
```

Transcribe speech to text:

```swift
import MLXAudioSTT
import MLXAudioCore

// Load audio file
let (sampleRate, audioData) = try loadAudioArray(from: audioURL)

// Load STT model
let model = try await GLMASRModel.fromPretrained("mlx-community/GLM-ASR-Nano-2512-4bit")

// Transcribe
let output = model.generate(audio: audioData)
print(output.text)
```

Detect and diarize speakers:

```swift
import MLXAudioVAD
import MLXAudioCore

// Load audio file
let (sampleRate, audioData) = try loadAudioArray(from: audioURL)

// Load diarization model
let model = try await SortformerModel.fromPretrained(
    "mlx-community/diar_streaming_sortformer_4spk-v2.1-fp16"
)

// Detect who is speaking when
let output = try await model.generate(audio: audioData, threshold: 0.5)
for segment in output.segments {
    print("Speaker \(segment.speaker): \(segment.start)s - \(segment.end)s")
}
```

Stream TTS output as it is generated:

```swift
for try await event in model.generateStream(text: text, parameters: parameters) {
    switch event {
    case .token(let token):
        print("Generated token: \(token)")
    case .audio(let audio):
        print("Final audio shape: \(audio.shape)")
    case .info(let info):
        print(info.summary)
    }
}
```

Supported TTS models:

| Model | Model README | HuggingFace Repo |
|---|---|---|
| Qwen3-TTS | — | mlx-community/Qwen3-TTS-12Hz-0.6B-Base-8bit |
| Soprano | Soprano README | mlx-community/Soprano-80M-bf16 |
| VyvoTTS | VyvoTTS README | mlx-community/VyvoTTS-EN-Beta-4bit |
| Orpheus | Orpheus README | mlx-community/orpheus-3b-0.1-ft-bf16 |
| Marvis TTS | Marvis TTS README | Marvis-AI/marvis-tts-250m-v0.2-MLX-8bit |
| Pocket TTS | Pocket TTS README | mlx-community/pocket-tts |

Supported STT models:

| Model | Model README | HuggingFace Repo |
|---|---|---|
| Qwen3-ASR | Qwen3-ASR README | mlx-community/Qwen3-ASR-1.7B-bf16 |
| Qwen3-ForcedAligner | Qwen3-ASR README | mlx-community/Qwen3-ForcedAligner-0.6B-bf16 |
| Voxtral Realtime | Voxtral README | mlx-community/Voxtral-Mini-4B-Realtime-2602-fp16 |
| Parakeet | Parakeet README | mlx-community/parakeet-tdt-0.6b-v3 |
| GLMASR | GLMASR README | mlx-community/GLM-ASR-Nano-2512-4bit |

Supported STS models:

| Model | Model README | HuggingFace Repo |
|---|---|---|
| LFM2.5-Audio | LFM Audio README | mlx-community/LFM2.5-Audio-1.5B-6bit |
| SAM-Audio | SAM Audio README | mlx-community/sam-audio-large-fp16 |
| MossFormer2-SE | — | starkdmi/MossFormer2-SE-fp16 |

Supported VAD & diarization models:

| Model | Model README | HuggingFace Repo |
|---|---|---|
| Sortformer | Sortformer README | mlx-community/diar_streaming_sortformer_4spk-v2.1-fp16 |
| SmartTurn | SmartTurn README | mlx-community/smart-turn-v3 |

Key features:

- Modular architecture for minimal app size: import only what you need
- Automatic model downloading from HuggingFace Hub
- Native async/await support for seamless Swift integration
- Streaming audio generation for real-time TTS
- Type-safe Swift API with comprehensive error handling
- Optimized for Apple Silicon with MLX framework
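The error-handling story can be sketched as follows: every loading, generation, and file call used in the examples above is throwing, so a single `do`/`catch` surfaces download, inference, and I/O failures alike. The output path and the wrapping function are illustrative; the model ID and APIs follow the quick-start examples:

```swift
import Foundation
import MLXAudioTTS
import MLXAudioCore

// A minimal sketch of defensive synthesis; runs in any async context.
func synthesize(_ text: String) async {
    do {
        // Model download, generation, and saving all propagate errors here.
        let model = try await SopranoModel.fromPretrained("mlx-community/Soprano-80M-bf16")
        let audio = try await model.generate(
            text: text,
            parameters: GenerateParameters(maxTokens: 200, temperature: 0.7, topP: 0.95)
        )
        let outputURL = URL(fileURLWithPath: "hello.wav") // illustrative path
        try saveAudioArray(audio, sampleRate: Double(model.sampleRate), to: outputURL)
    } catch {
        print("Audio generation failed: \(error)")
    }
}
```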
Customize generation with `GenerateParameters`:

```swift
let parameters = GenerateParameters(
    maxTokens: 1200,
    temperature: 0.7,
    topP: 0.95,
    repetitionPenalty: 1.5,
    repetitionContextSize: 30
)
let audio = try await model.generate(text: "Your text here", parameters: parameters)
```

Use the audio codecs directly:

```swift
import MLXAudioCodecs

// Load SNAC codec
let snac = try await SNAC.fromPretrained("mlx-community/snac_24khz")

// Encode audio to tokens
let tokens = try snac.encode(audio)

// Decode tokens back to audio
let reconstructed = try snac.decode(tokens)
```

Select a voice on models that support it:

```swift
// For models supporting multiple voices (like LlamaTTS/Orpheus)
let audio = try await model.generate(
    text: "Hello!",
    voice: "tara", // Options: tara, leah, jess, leo, dan, mia, zac, zoe
    parameters: parameters
)
```

Requirements:

- macOS 14+ or iOS 17+
- Apple Silicon (M1 or later) recommended for optimal performance
- Xcode 15+
- Swift 5.9+
Check out the Examples/VoicesApp directory for a complete SwiftUI application demonstrating:
- Loading and running TTS models
- Playing generated audio
- UI components for model interaction
Additional usage examples can be found in the test files.
- Built on MLX Swift
- Uses swift-transformers
- Inspired by MLX Audio (Python)
MIT License - see LICENSE file for details.