A machine learning pipeline for unsupervised speech representation learning. This project converts continuous speech (specifically Bible recordings) into discrete acoustic tokens and learns to resynthesize speech from those tokens.
The goal is to enable NLP-like capabilities (translation, search, pattern discovery) directly on audio data for low-resource languages such as Sateré-Mawé.
Traditional speech processing relies on text transcripts (ASR). We don't. We use self-supervised learning to discover the "units" of a language purely from audio.
flowchart LR
subgraph Input
A[🗣️ Raw Audio]
end
subgraph Discretization
B[XLSR-53<br/>Features] --> C[K-Means<br/>Clustering] --> D[Discrete Tokens<br/>31 87 14...]
end
subgraph Applications
D --> E["Language Modeling<br/>(BPE, GPT)"]
D --> F["Vocoder<br/>(Synthesis)"]
end
A --> B
F --> G[🗣️ Resynthesized Audio]
style D fill:#f9f,stroke:#333,stroke-width:2px
After completing all 3 phases, you have a Speech Synthesis (Vocoder) System for the target language.
flowchart LR
subgraph Input
A[Acoustic Units<br/>31 87 14 43...]
end
subgraph TrainedModel ["Trained Vocoder (Phase 3)"]
B[Generator V2<br/>HiFi-GAN + Pitch]
end
subgraph Output
C[Natural Audio<br/>16kHz WAV]
end
A --> B --> C
style B fill:#f9f,stroke:#333,stroke-width:2px
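Once a checkpoint is trained, resynthesis is a single forward pass. The sketch below is illustrative only: the actual class name, constructor arguments, and checkpoint keys are defined in src/models/generator_v2.py and the Phase 3 script, so adapt it to match.

```python
# Illustrative inference sketch (assumed interface, not the repo's exact API):
# load a trained V2 checkpoint and turn a unit sequence back into a 16 kHz waveform.
import torch
import soundfile as sf

from src.models.generator_v2 import Generator  # class name is an assumption

ckpt = torch.load("modal_downloads/vocoder_v2/v2_latest.pt", map_location="cpu")
generator = Generator(**ckpt.get("config", {}))      # "config" key is an assumption
generator.load_state_dict(ckpt["generator"])         # "generator" key is an assumption
generator.eval()

units = torch.tensor([[31, 87, 14, 43, 43, 12]], dtype=torch.long)  # toy unit sequence
with torch.no_grad():
    audio = generator(units).squeeze().numpy()

sf.write("resynthesized.wav", audio, 16000)
```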
| Phase | Output | Purpose |
|---|---|---|
| Phase 1 | K-Means model + Unit sequences | Creates a "vocabulary" of ~100 speech sounds (learned phonemes) |
| Phase 2 | BPE tokenizer + Motif analysis | Discovers common sound patterns (optional, for linguistic analysis) |
| Phase 3 | Vocoder model (.pt file) | Neural network that converts unit sequences → audio waveforms |
Source Audio → ASR → Translation → Unit Prediction → [Your Vocoder] → Target Audio
                                                           ↑
                                                 (This is what you train)
Translate from Portuguese to Sateré, then synthesize natural Sateré speech.
- Train on native speaker recordings → Generate new speech in that voice/language
- Preserve acoustic characteristics of indigenous languages like Sateré-Mawé
- Convert text (via text-to-unit model) → audio narration in the target language
- Generate consistent, natural-sounding Bible readings
- The vocoder is the acoustic backend for any TTS system
- Pair with a Text-to-Unit model for complete text-to-speech
This pipeline trains the acoustic model (vocoder): Units → Audio
For complete Text → Speech, you also need: Text → Units (not included here)
Options for Text-to-Unit:
- Tacotron-style sequence-to-sequence model
- Translation model that outputs units instead of text
- GPT-style language model trained on unit sequences
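The last option above (a GPT-style language model over unit sequences) can be prototyped with an off-the-shelf decoder-only transformer. A minimal sketch, assuming Hugging Face transformers is available and that units are plain integer IDs in [0, 99]; the hyperparameters are illustrative, not tuned:

```python
# Sketch: a small decoder-only LM over acoustic units (Hugging Face transformers).
# The vocabulary is just the 100 unit IDs plus BOS/EOS tokens.
import torch
from transformers import GPT2Config, GPT2LMHeadModel

NUM_UNITS = 100
BOS, EOS = NUM_UNITS, NUM_UNITS + 1

config = GPT2Config(
    vocab_size=NUM_UNITS + 2,
    n_positions=1024,
    n_embd=256, n_layer=6, n_head=4,
    bos_token_id=BOS, eos_token_id=EOS,
)
model = GPT2LMHeadModel(config)

# One toy utterance: BOS, Phase 1 unit IDs, EOS.
ids = torch.tensor([[BOS, 31, 87, 14, 43, 43, 12, EOS]])
loss = model(input_ids=ids, labels=ids).loss   # next-unit prediction loss
loss.backward()                                # plug into a standard training loop
```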
The pipeline consists of three distinct training phases.
Extracts features from audio and learns a discrete vocabulary of 100 sounds.
flowchart LR
A[Audio Segment] -->|XLSR-53 Layer 14| B[1024-dim Vectors]
B -->|K-Means| C[Cluster Centroids]
B -->|Nearest Neighbor| D[Unit Sequence]
Note: We currently use XLSR-53 for feature extraction. Meta's newer MMS model (1,400+ languages) may provide better representations for low-resource languages. See docs/MMS_VS_XLSR53.md for a detailed comparison and migration guide.
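For orientation, the core Phase 1 idea fits in a few lines. This is a sketch, not the repo's phase1_acoustic.py: it loads XLSR-53 from Hugging Face, takes layer-14 hidden states, and quantizes them with k-means (k=100). The file paths are illustrative, and the real pipeline fits k-means on features pooled from the whole corpus rather than a single segment.

```python
# Sketch of Phase 1: XLSR-53 layer-14 features -> k-means (k=100) -> discrete unit IDs.
import torch
import torchaudio
import joblib
from sklearn.cluster import MiniBatchKMeans
from transformers import Wav2Vec2Model

model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-large-xlsr-53").eval()

def layer14_features(wav_path: str) -> torch.Tensor:
    """Return (frames, 1024) hidden states from transformer layer 14 (~20 ms per frame)."""
    wav, sr = torchaudio.load(wav_path)
    wav = torchaudio.functional.resample(wav.mean(dim=0), sr, 16000)
    wav = (wav - wav.mean()) / (wav.std() + 1e-7)          # simple input normalization
    with torch.no_grad():
        hidden = model(wav.unsqueeze(0), output_hidden_states=True).hidden_states
    return hidden[14].squeeze(0)

feats = layer14_features("audio_data/example_segment.wav").numpy()  # path is illustrative
kmeans = MiniBatchKMeans(n_clusters=100, random_state=0).fit(feats)
units = kmeans.predict(feats)                              # e.g. array([31, 87, 14, ...])
joblib.dump(kmeans, "portuguese_kmeans.pkl")
```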
Analyzes the sequence of units to find recurring motifs (acoustic "words").
flowchart LR
A[Unit Sequence<br/>31 31 87 43] -->|SentencePiece| B[BPE Training]
B --> C[Vocabulary<br/>Motifs]
C --> D[Tokenized Sequence<br/>105 43]
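A minimal sketch of the BPE step, assuming Phase 1 wrote one utterance per line of space-separated unit IDs to all_units_for_bpe.txt (the file name used elsewhere in this README). The vocabulary size and the split_by_whitespace=False setting are illustrative choices that let motifs span several units:

```python
# Sketch of Phase 2: learn BPE motifs over unit sequences with SentencePiece.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="all_units_for_bpe.txt",     # one utterance per line, e.g. "31 31 87 43"
    model_prefix="unit_bpe",
    model_type="bpe",
    vocab_size=500,                    # base units + learned motifs (illustrative)
    character_coverage=1.0,
    split_by_whitespace=False,         # allow merges across unit boundaries
)

sp = spm.SentencePieceProcessor(model_file="unit_bpe.model")
print(sp.encode("31 31 87 43", out_type=str))   # recurring runs merge into motif tokens
```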
Trains a generative model to convert discrete units back into continuous audio waveforms.
flowchart TB
A[Unit Sequence] -->|Generator| B[Fake Audio]
C[Real Audio] -->|Discriminator| D[Real/Fake Score]
B -->|Discriminator| D
B -->|Mel Loss| C
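The diagram above is a standard GAN loop: the discriminator scores real versus generated audio, and the generator is trained on an adversarial loss plus a mel-spectrogram reconstruction loss. The skeleton below is a simplified illustration; the actual phase3_vocoder_v2.py adds MPD + MSD discriminators, STFT and feature-matching losses, and pitch conditioning.

```python
# Skeleton of one Phase 3 training step (simplified; generator/discriminator are the
# models from src/models, passed in along with their optimizers).
import torch
import torch.nn.functional as F
from torchaudio.transforms import MelSpectrogram

mel = MelSpectrogram(sample_rate=16000, n_fft=1024, hop_length=256, n_mels=80)

def train_step(generator, discriminator, opt_g, opt_d, units, real_audio):
    fake_audio = generator(units)                              # units -> waveform

    # Discriminator: push real scores toward 1 and generated scores toward 0 (LSGAN).
    opt_d.zero_grad()
    loss_d = (torch.mean((discriminator(real_audio) - 1) ** 2)
              + torch.mean(discriminator(fake_audio.detach()) ** 2))
    loss_d.backward()
    opt_d.step()

    # Generator: fool the discriminator and match the mel spectrogram of real audio.
    opt_g.zero_grad()
    loss_adv = torch.mean((discriminator(fake_audio) - 1) ** 2)
    loss_mel = F.l1_loss(mel(fake_audio), mel(real_audio))
    loss_g = loss_adv + 45.0 * loss_mel                        # mel weight as in HiFi-GAN
    loss_g.backward()
    opt_g.step()
    return loss_g.item(), loss_d.item()
```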
- Python 3.10+
- Modal account (for cloud GPU training)
- ffmpeg installed locally
pip install -r requirements.txt
python3 -m modal token set --token-id <id> --token-secret <secret>
1. Segment Audio Locally (one-time, run on your machine)
# For Portuguese
python scripts/segment_audio.py --language portuguese
# For Sateré-Mawé
python scripts/segment_audio.py --language satere
2. Upload to Cloud Storage (one-time, run on your machine)
python3 -m modal run scripts/upload_to_modal.py --language portuguese
3. Run Training Pipeline (on Modal)
# Run full pipeline with V2 vocoder (best quality)
python3 -m modal run src/training/run_full_pipeline.py::main
# Run full pipeline with V1 vocoder (faster, simpler)
python3 -m modal run src/training/run_full_pipeline.py::main --vocoder-version v1
# Phase 1: Discover acoustic units (~2-4 hours)
python3 -m modal run --detach src/training/phase1_acoustic.py
# Phase 2: Learn motifs/BPE (~30 min)
python3 -m modal run --detach src/training/phase2_bpe.py
# Phase 3 V2: Train Enhanced Vocoder (~4-8 hours) β RECOMMENDED
python3 -m modal run --detach src/training/phase3_vocoder_v2.py
# OR Phase 3 V1: Train Original Vocoder (~2-4 hours)
python3 -m modal run --detach src/training/phase3_vocoder.py
# Skip Phase 1 (already done), run 2 and 3
python3 -m modal run src/training/run_full_pipeline.py::main --phases 2,3
# Run only vocoder training (Phases 1 & 2 already done)
python3 -m modal run src/training/run_full_pipeline.py::main --phases 3
All training scripts support multiple languages via the --language parameter.
| Language | Code | Segmented Audio | Units Output | Vocoder Checkpoints |
|---|---|---|---|---|
| Portuguese | portuguese | segmented_audio/ | portuguese_units/ | vocoder_v2_checkpoints/ |
| Sateré-Mawé | satere | segmented_audio_satere/ | satere_units/ | vocoder_v2_satere_checkpoints/ |
# 1. Segment locally
python scripts/segment_audio.py --language satere
# 2. Upload to Modal
python3 -m modal run scripts/upload_to_modal.py --language satere
# 3. Run full pipeline for Sateré
python3 -m modal run src/training/run_full_pipeline.py::main --language satere
# OR run phases individually
python3 -m modal run --detach src/training/phase1_acoustic.py::main_skip_segmentation --language satere
python3 -m modal run --detach src/training/phase2_bpe.py::main --language satere
python3 -m modal run --detach src/training/phase3_vocoder_v2.py::main --language satere
# Test Sateré vocoder
python3 -m modal run src/training/vocoder_test_v2.py::main --language satere --num-samples 50
# Download Sateré checkpoints
modal volume get bible-audio-data vocoder_v2_satere_checkpoints/ ./modal_downloads/vocoder_v2_satere/
# Download test results
modal volume get bible-audio-data vocoder_v2_satere_test_output/ ./modal_downloads/vocoder_v2_satere_test/
To add support for a new language, update the LANGUAGE_CONFIGS dictionary in each training script:
LANGUAGE_CONFIGS = {
# ... existing languages ...
"new_language": {
"segmented_dir": f"{AUDIO_MOUNT}/segmented_audio_new",
"output_dir": f"{AUDIO_MOUNT}/new_language_units",
"vocoder_dir": f"{AUDIO_MOUNT}/vocoder_v2_new_checkpoints",
"corpus_file": "new_language_corpus_timestamped.json",
},
}
The V2 vocoder addresses the "robotic audio" problem with several improvements.
Note: V2 only affects Phase 3 (vocoder training); Phases 1 and 2 remain the same and must be completed before training V2.
# Ensure Phases 1 and 2 are complete (check Modal volume)
modal volume ls bible-audio-data portuguese_units/
# Should show: portuguese_kmeans.pkl, all_units_for_bpe.txt, portuguese_corpus_timestamped.json
| Feature | V1 | V2 |
|---|---|---|
| Pitch conditioning | ❌ | ✅ 32-bin F0 embedding |
| Generator | Simple TransConv | HiFi-GAN with MRF |
| Discriminator | MSD only | MPD + MSD |
| Losses | Mel + Adversarial | Mel + STFT + FM + Adversarial |
| Segment length | 1 second | 2 seconds |
| Audio quality | Robotic | Natural prosody |
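For intuition, 32-bin pitch conditioning typically means quantizing the frame-level F0 track into 32 bins, embedding the bin index, and summing it with the unit embedding. The snippet below illustrates that idea only; it is not the code in generator_v2.py, and the 50-400 Hz range is an assumed choice.

```python
# Illustration of 32-bin F0 conditioning: unit embedding + embedded pitch bin per frame.
import torch
import torch.nn as nn

NUM_UNITS, F0_BINS, DIM = 100, 32, 256
unit_emb = nn.Embedding(NUM_UNITS, DIM)
f0_emb = nn.Embedding(F0_BINS + 1, DIM)        # extra bin 0 reserved for unvoiced frames

def condition(units: torch.LongTensor, f0_hz: torch.Tensor) -> torch.Tensor:
    """units: (B, T) unit IDs; f0_hz: (B, T) F0 per frame in Hz, 0 where unvoiced."""
    edges = torch.linspace(50.0, 400.0, F0_BINS - 1)           # assumed speech F0 range
    bins = torch.bucketize(f0_hz, edges) + 1                   # voiced frames -> 1..32
    bins = torch.where(f0_hz > 0, bins, torch.zeros_like(bins))
    return unit_emb(units) + f0_emb(bins)                      # (B, T, DIM) conditioning

h = condition(torch.randint(0, NUM_UNITS, (1, 50)), torch.rand(1, 50) * 300)
```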
# RECOMMENDED: Full pipeline with V2
python3 -m modal run src/training/run_full_pipeline.py::main
# OR: Only V2 vocoder (if Phases 1 & 2 done)
python3 -m modal run --detach src/training/phase3_vocoder_v2.py::main
# Custom parameters
python3 -m modal run --detach src/training/phase3_vocoder_v2.py::main \
--epochs 1000 --segment-length 32000 --patience 100
# Resume training from checkpoint
python3 -m modal run --detach src/training/phase3_vocoder_v2.py::main --resume v2_latest.pt
# Test quality metrics (after training)
python3 -m modal run src/training/vocoder_test_v2.py::main --num-samples 50
# Download test results
modal volume get bible-audio-data vocoder_v2_test_output/ ./modal_downloads/vocoder_v2_test/
- docs/VOCODER_V2_ARCHITECTURE.md - Complete technical guide with diagrams
- docs/ROBOTIC_AUDIO_ANALYSIS.md - Why V1 sounds robotic + solutions
- docs/SEGMENT_PREPARATION.md - Segment size impact and recommendations
model-training/
├── src/
│   ├── models/                      # Neural Network Architectures (Jupytext)
│   │   ├── generator.py             # V1 Vocoder (basic upsampling)
│   │   ├── generator_v2.py          # V2 Vocoder (HiFi-GAN + pitch conditioning)
│   │   ├── discriminator.py         # V1 Multi-Scale Discriminator
│   │   └── discriminator_v2.py      # V2 MPD + MSD with spectral norm
│   └── training/                    # Cloud Training Scripts
│       ├── run_full_pipeline.py     # Run all phases at once
│       ├── phase1_acoustic.py       # Feature extraction & clustering
│       ├── phase2_bpe.py            # BPE motif discovery
│       ├── phase3_vocoder.py        # V1 GAN training (simpler)
│       ├── phase3_vocoder_v2.py     # V2 GAN training (enhanced)
│       ├── vocoder_test.py          # V1 quality testing
│       ├── vocoder_test_v2.py       # V2 quality testing with F0
│       └── validate_units.py        # Unit validation
├── scripts/                         # Local Utilities
│   ├── segment_audio.py             # Silence-based segmentation
│   └── upload_to_modal.py           # Data transfer
├── docs/
│   ├── ARCHITECTURE.md              # V1 design decisions
│   ├── VOCODER_V2_ARCHITECTURE.md   # V2 complete technical guide
│   ├── ROBOTIC_AUDIO_ANALYSIS.md    # Why V1 sounds robotic + solutions
│   ├── SEGMENT_PREPARATION.md       # Segment size impact on training
│   ├── MMS_VS_XLSR53.md             # MMS vs XLSR-53 comparison & migration
│   ├── AUDIOLM_INTEGRATION.md       # AudioLM architecture & integration guide
│   └── PIPELINE.md                  # Step-by-step manual
└── audio_data/                      # Raw input files (gitignored)
See docs/ARCHITECTURE.md for a comprehensive analysis.
Key Highlights:
- Why 100 Units? Balanced trade-off between phonetic granularity and model trainability.
- Why Layer 14? Best layer in XLSR-53 for phonetic content, filtering out speaker identity.
- Why Robotic Audio (V1)? We deliberately discarded pitch (F0) to focus on phonetic content.
- V2 Solution: Re-injects pitch via conditioning, uses HiFi-GAN architecture and enhanced losses.
| Technology | Document | Potential Benefit |
|---|---|---|
| MMS (Meta) | MMS_VS_XLSR53.md | Better low-resource language support (1,400+ languages) |
| AudioLM (Google) | AUDIOLM_INTEGRATION.md | State-of-the-art quality via semantic + acoustic tokens |
These documents provide deep technical analysis of how emerging architectures could enhance our pipeline.
- Compression: ~775x reduction in bitrate (raw audio → discrete tokens); see the arithmetic sketch below.
- Intelligibility (V1): High. The vocoder successfully reconstructs words from tokens.
- Naturalness (V1): Low. Prosody is flat due to F0 loss.
- Intelligibility (V2): High. Same phonetic reconstruction.
- Naturalness (V2): Medium-high. Pitch conditioning restores prosody.
- F0 accuracy (V2): < 20 Hz RMSE (good pitch tracking).
- MCD (V2): < 5.0 (good spectral quality).
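The ~775x figure can be sanity-checked with back-of-the-envelope arithmetic, assuming 16 kHz / 16-bit mono PCM on one side and roughly 50 unit tokens per second at log2(100) bits each on the other (the exact ratio depends on the frame rate and how units are stored):

```python
# Back-of-the-envelope check of the ~775x compression claim (assumptions noted above).
import math

raw_bps = 16_000 * 16                 # 256,000 bits/s of raw 16-bit PCM
token_bps = 50 * math.log2(100)       # ~332 bits/s of unit IDs
print(raw_bps / token_bps)            # ~770x, in line with the ~775x figure
```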
Private - shemaobt organization.