Bible Audio Acoustic Tokenization & Vocoder

A machine learning pipeline for unsupervised speech representation learning. The project converts continuous speech (Bible recordings) into discrete acoustic tokens and learns to resynthesize speech from those tokens.

The goal is to enable NLP-like capabilities (translation, search, pattern discovery) directly on audio data for low-resource languages (like Sateré-Mawé).

🧠 Core Concept

Traditional speech processing relies on text transcripts (ASR). We don't. We use Self-Supervised Learning to discover the "units" of a language purely from audio.

flowchart LR
    subgraph Input
        A[🗣️ Raw Audio]
    end

    subgraph Discretization
        B[XLSR-53<br/>Features] --> C[K-Means<br/>Clustering] --> D[Discrete Tokens<br/>31 87 14...]
    end

    subgraph Applications
        D --> E["Language Modeling<br/>(BPE, GPT)"]
        D --> F["Vocoder<br/>(Synthesis)"]
    end

    A --> B
    F --> G[🗣️ Resynthesized Audio]
    
    style D fill:#f9f,stroke:#333,stroke-width:2px

🎯 What This Pipeline Produces

After completing all three phases, you have a speech synthesis (vocoder) system for the target language.

flowchart LR
    subgraph Input
        A[Acoustic Units<br/>31 87 14 43...]
    end
    
    subgraph TrainedModel ["Trained Vocoder (Phase 3)"]
        B[Generator V2<br/>HiFi-GAN + Pitch]
    end
    
    subgraph Output
        C[🔊 Natural Audio<br/>16kHz WAV]
    end
    
    A --> B --> C
    
    style B fill:#f9f,stroke:#333,stroke-width:2px

Phase Outputs

| Phase | Output | Purpose |
|-------|--------|---------|
| Phase 1 | K-Means model + unit sequences | Creates a "vocabulary" of ~100 speech sounds (learned phonemes) |
| Phase 2 | BPE tokenizer + motif analysis | Discovers common sound patterns (optional, for linguistic analysis) |
| Phase 3 | Vocoder model (.pt file) | Neural network that converts unit sequences → audio waveforms |

Use Cases

1. Speech-to-Speech Translation

Source Audio → ASR → Translation → Unit Prediction → [Your Vocoder] → Target Audio
                                                           ↑
                                                 (This is what you train)

Translate from Portuguese to Sateré, then synthesize natural Sateré speech.

2. Voice Preservation for Endangered Languages

  • Train on native speaker recordings → Generate new speech in that voice/language
  • Preserve acoustic characteristics of indigenous languages like Sateré-Mawé

3. Audio Bible Generation

  • Convert text (via text-to-unit model) → audio narration in the target language
  • Generate consistent, natural-sounding Bible readings

4. Low-Resource TTS Foundation

  • The vocoder is the acoustic backend for any TTS system
  • Pair with a Text-to-Unit model for complete text-to-speech

What's Needed for Full TTS?

This pipeline trains the acoustic model (vocoder): Units → Audio

For complete Text → Speech, you also need: Text → Units (not included here)

Options for Text-to-Unit:

  • Tacotron-style sequence-to-sequence model
  • Translation model that outputs units instead of text
  • GPT-style language model trained on unit sequences (see the sketch below)
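
As a rough illustration of the last option, a unit-level language model can reuse an off-the-shelf decoder architecture. The configuration below is purely hypothetical (102 tokens = 100 acoustic units plus BOS/EOS); it is not part of this repository:

from transformers import GPT2Config, GPT2LMHeadModel

# Hypothetical unit LM: the "words" are the 100 acoustic units from Phase 1.
config = GPT2Config(vocab_size=102, n_positions=1024, n_layer=6, n_head=8, n_embd=256)
unit_lm = GPT2LMHeadModel(config)   # train on Phase 1 unit sequences used as token IDs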

πŸ—οΈ Pipeline Architecture

The pipeline consists of three distinct training phases.

Phase 1: Acoustic Tokenization

Extracts features from audio and learns a discrete vocabulary of 100 sounds.

flowchart LR
    A[Audio Segment] -->|XLSR-53 Layer 14| B[1024-dim Vectors]
    B -->|K-Means| C[Cluster Centroids]
    B -->|Nearest Neighbor| D[Unit Sequence]

Note: We currently use XLSR-53 for feature extraction. Meta's newer MMS model (1,400+ languages) may provide better representations for low-resource languages. See docs/MMS_VS_XLSR53.md for a detailed comparison and migration guide.
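
For orientation, the frame-level tokenization can be reproduced along these lines (a minimal sketch assuming the Hugging Face facebook/wav2vec2-large-xlsr-53 checkpoint and scikit-learn; the file name is a placeholder and the actual phase1_acoustic.py may differ):

import joblib
import soundfile as sf
import torch
from sklearn.cluster import MiniBatchKMeans
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Layer 14 and k=100 follow the README; everything else here is illustrative.
MODEL_NAME = "facebook/wav2vec2-large-xlsr-53"
extractor = Wav2Vec2FeatureExtractor.from_pretrained(MODEL_NAME)
model = Wav2Vec2Model.from_pretrained(MODEL_NAME).eval()

def layer14_features(wav_path: str) -> torch.Tensor:
    wav, sr = sf.read(wav_path)                        # segmented 16 kHz mono audio
    inputs = extractor(wav, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[14].squeeze(0)            # (frames, 1024)

# Fit K-Means on features pooled over many segments (one file here for brevity),
# then map each frame to its nearest centroid to get the discrete unit sequence.
feats = layer14_features("segment_0001.wav").numpy()
kmeans = MiniBatchKMeans(n_clusters=100).fit(feats)
units = kmeans.predict(feats)                          # e.g. array([31, 87, 14, ...])
joblib.dump(kmeans, "portuguese_kmeans.pkl")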

Phase 2: Pattern Discovery (BPE)

Analyzes the sequence of units to find recurring motifs (acoustic "words").

flowchart LR
    A[Unit Sequence<br/>31 31 87 43] -->|SentencePiece| B[BPE Training]
    B --> C[Vocabulary<br/>Motifs]
    C --> D[Tokenized Sequence<br/>105 43]
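
A minimal SentencePiece sketch of this step, assuming the unit sequences have been written one utterance per line to all_units_for_bpe.txt (the artifact listed under the V2 prerequisites); the vocabulary size and trainer flags are illustrative assumptions, not the project's actual settings:

import sentencepiece as spm

# Train BPE over space-separated unit sequences, e.g. "31 31 87 43 ...".
# split_by_whitespace=False lets merges span unit boundaries, so recurring
# unit n-grams become single pieces (the acoustic "motifs").
spm.SentencePieceTrainer.train(
    input="all_units_for_bpe.txt",
    model_prefix="unit_bpe",
    vocab_size=500,               # assumed motif vocabulary size
    model_type="bpe",
    split_by_whitespace=False,
)

sp = spm.SentencePieceProcessor(model_file="unit_bpe.model")
print(sp.encode("31 31 87 43", out_type=str))  # frequent unit patterns appear as single pieces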

Phase 3: Vocoder (Synthesis)

Trains a generative model to convert discrete units back into continuous audio waveforms.

flowchart TB
    A[Unit Sequence] -->|Generator| B[Fake Audio]
    C[Real Audio] -->|Discriminator| D[Real/Fake Score]
    B -->|Discriminator| D
    B -->|Mel Loss| C
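
The diagram corresponds to a fairly standard GAN vocoder update. Below is a minimal sketch of one training step; Generator, Discriminator, the mel_spectrogram transform and the loss weight are placeholders, and the real phase3 scripts add further terms (STFT and feature-matching losses in V2):

import torch.nn.functional as F

def train_step(generator, discriminator, mel_spectrogram, units, real_audio,
               opt_g, opt_d, lambda_mel=45.0):
    fake_audio = generator(units)

    # Discriminator: push scores for real audio toward 1 and for generated audio toward 0.
    opt_d.zero_grad()
    d_real = discriminator(real_audio)
    d_fake = discriminator(fake_audio.detach())
    loss_d = ((d_real - 1) ** 2).mean() + (d_fake ** 2).mean()
    loss_d.backward()
    opt_d.step()

    # Generator: fool the discriminator while matching the real mel spectrogram.
    opt_g.zero_grad()
    loss_adv = ((discriminator(fake_audio) - 1) ** 2).mean()
    loss_mel = F.l1_loss(mel_spectrogram(fake_audio), mel_spectrogram(real_audio))
    loss_g = loss_adv + lambda_mel * loss_mel
    loss_g.backward()
    opt_g.step()
    return loss_g.item(), loss_d.item()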

🚀 Quick Start

Prerequisites

  • Python 3.10+
  • Modal account (for cloud GPU training)
  • ffmpeg installed locally

Installation

pip install -r requirements.txt
python3 -m modal token set --token-id <id> --token-secret <secret>

Execution Flow

1. Segment Audio Locally (one-time, run on your machine)

# For Portuguese
python scripts/segment_audio.py --language portuguese

# For Sateré-Mawé
python scripts/segment_audio.py --language satere

2. Upload to Cloud Storage (one-time, run on your machine)

python3 -m modal run scripts/upload_to_modal.py --language portuguese

3. Run Training Pipeline (on Modal)

Option A: Run All Phases at Once (Recommended)

# Run full pipeline with V2 vocoder (best quality)
python3 -m modal run src/training/run_full_pipeline.py::main

# Run full pipeline with V1 vocoder (faster, simpler)
python3 -m modal run src/training/run_full_pipeline.py::main --vocoder-version v1

Option B: Run Phases Individually

# Phase 1: Discover acoustic units (~2-4 hours)
python3 -m modal run --detach src/training/phase1_acoustic.py

# Phase 2: Learn motifs/BPE (~30 min)
python3 -m modal run --detach src/training/phase2_bpe.py

# Phase 3 V2: Train Enhanced Vocoder (~4-8 hours) ⭐ RECOMMENDED
python3 -m modal run --detach src/training/phase3_vocoder_v2.py

# OR Phase 3 V1: Train Original Vocoder (~2-4 hours)
python3 -m modal run --detach src/training/phase3_vocoder.py

Option C: Run Only Specific Phases

# Skip Phase 1 (already done), run 2 and 3
python3 -m modal run src/training/run_full_pipeline.py::main --phases 2,3

# Run only vocoder training (Phases 1 & 2 already done)
python3 -m modal run src/training/run_full_pipeline.py::main --phases 3

🌍 Multi-Language Support

All training scripts support multiple languages via the --language parameter.

Supported Languages

| Language | Code | Segmented Audio | Units Output | Vocoder Checkpoints |
|----------|------|-----------------|--------------|---------------------|
| Portuguese | portuguese | segmented_audio/ | portuguese_units/ | vocoder_v2_checkpoints/ |
| Sateré-Mawé | satere | segmented_audio_satere/ | satere_units/ | vocoder_v2_satere_checkpoints/ |

Training a New Language

# 1. Segment locally
python scripts/segment_audio.py --language satere

# 2. Upload to Modal
python3 -m modal run scripts/upload_to_modal.py --language satere

# 3. Run full pipeline for Sateré
python3 -m modal run src/training/run_full_pipeline.py::main --language satere

# OR run phases individually
python3 -m modal run --detach src/training/phase1_acoustic.py::main_skip_segmentation --language satere
python3 -m modal run --detach src/training/phase2_bpe.py::main --language satere
python3 -m modal run --detach src/training/phase3_vocoder_v2.py::main --language satere

Testing & Downloading Results

# Test Sateré vocoder
python3 -m modal run src/training/vocoder_test_v2.py::main --language satere --num-samples 50

# Download Sateré checkpoints
modal volume get bible-audio-data vocoder_v2_satere_checkpoints/ ./modal_downloads/vocoder_v2_satere/

# Download test results
modal volume get bible-audio-data vocoder_v2_satere_test_output/ ./modal_downloads/vocoder_v2_satere_test/

Adding a New Language

To add support for a new language, update the LANGUAGE_CONFIGS dictionary in each training script:

LANGUAGE_CONFIGS = {
    # ... existing languages ...
    "new_language": {
        "segmented_dir": f"{AUDIO_MOUNT}/segmented_audio_new",       # Phase 1 input: segmented audio
        "output_dir": f"{AUDIO_MOUNT}/new_language_units",           # Phase 1/2 outputs: K-Means model, units, BPE files
        "vocoder_dir": f"{AUDIO_MOUNT}/vocoder_v2_new_checkpoints",  # Phase 3 output: vocoder checkpoints
        "corpus_file": "new_language_corpus_timestamped.json",       # timestamped corpus metadata (JSON)
    },
}

🎯 V2 Vocoder (Enhanced)

The V2 vocoder addresses the "robotic audio" problem with several improvements.

Note: V2 only affects Phase 3 (vocoder training). Phases 1 and 2 remain the same. You must run Phases 1 and 2 first before training V2.

Prerequisites for V2

# Ensure Phases 1 and 2 are complete (check Modal volume)
modal volume ls bible-audio-data portuguese_units/
# Should show: portuguese_kmeans.pkl, all_units_for_bpe.txt, portuguese_corpus_timestamped.json

Key Improvements

| Feature | V1 | V2 |
|---------|----|----|
| Pitch conditioning | ❌ | ✅ 32-bin F0 embedding |
| Generator | Simple TransConv | HiFi-GAN with MRF |
| Discriminator | MSD only | MPD + MSD |
| Losses | Mel + Adversarial | Mel + STFT + FM + Adversarial |
| Segment length | 1 second | 2 seconds |
| Audio quality | Robotic | Natural prosody |
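
To make the pitch-conditioning row concrete, here is a sketch of how an F0 track could be quantized into 32 bins for an embedding lookup; the pyin extractor and bin edges are assumptions, and generator_v2.py may compute F0 differently:

import numpy as np
import librosa

def f0_to_bins(wav: np.ndarray, sr: int = 16000, n_bins: int = 32) -> np.ndarray:
    # Extract F0; pyin returns NaN on unvoiced frames, which we map to 0 Hz.
    f0, voiced, _ = librosa.pyin(wav, fmin=librosa.note_to_hz("C2"),
                                 fmax=librosa.note_to_hz("C6"), sr=sr)
    f0 = np.where(voiced, f0, 0.0)
    edges = np.linspace(50.0, 400.0, n_bins - 1)   # assumed speech F0 range
    bins = np.digitize(f0, edges)                  # 0 = unvoiced/low, 1..31 = pitch bins
    return bins.astype(np.int64)                   # indexes an nn.Embedding(32, dim) in the generator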

V2 Training Commands

# RECOMMENDED: Full pipeline with V2
python3 -m modal run src/training/run_full_pipeline.py::main

# OR: Only V2 vocoder (if Phases 1 & 2 done)
python3 -m modal run --detach src/training/phase3_vocoder_v2.py::main

# Custom parameters
python3 -m modal run --detach src/training/phase3_vocoder_v2.py::main \
    --epochs 1000 --segment-length 32000 --patience 100

# Resume training from checkpoint
python3 -m modal run --detach src/training/phase3_vocoder_v2.py::main --resume v2_latest.pt

V2 Testing

# Test quality metrics (after training)
python3 -m modal run src/training/vocoder_test_v2.py::main --num-samples 50

# Download test results
modal volume get bible-audio-data vocoder_v2_test_output/ ./modal_downloads/vocoder_v2_test/

V2 Documentation

  • docs/VOCODER_V2_ARCHITECTURE.md (complete technical guide)
  • docs/ROBOTIC_AUDIO_ANALYSIS.md (why V1 sounds robotic + solutions)

📂 Project Structure

model-training/
├── src/
│   ├── models/                    # Neural Network Architectures (Jupytext)
│   │   ├── generator.py           # V1 Vocoder (basic upsampling)
│   │   ├── generator_v2.py        # V2 Vocoder (HiFi-GAN + pitch conditioning)
│   │   ├── discriminator.py       # V1 Multi-Scale Discriminator
│   │   └── discriminator_v2.py    # V2 MPD + MSD with spectral norm
│   └── training/                  # Cloud Training Scripts
│       ├── run_full_pipeline.py   # 🚀 Run all phases at once
│       ├── phase1_acoustic.py     # Feature extraction & clustering
│       ├── phase2_bpe.py          # BPE motif discovery
│       ├── phase3_vocoder.py      # V1 GAN training (simpler)
│       ├── phase3_vocoder_v2.py   # V2 GAN training (enhanced)
│       ├── vocoder_test.py        # V1 quality testing
│       ├── vocoder_test_v2.py     # V2 quality testing with F0
│       └── validate_units.py      # Unit validation
├── scripts/                       # Local Utilities
│   ├── segment_audio.py           # Silence-based segmentation
│   └── upload_to_modal.py         # Data transfer
├── docs/
│   ├── ARCHITECTURE.md            # V1 design decisions
│   ├── VOCODER_V2_ARCHITECTURE.md # V2 complete technical guide
│   ├── ROBOTIC_AUDIO_ANALYSIS.md  # Why V1 sounds robotic + solutions
│   ├── SEGMENT_PREPARATION.md     # Segment size impact on training
│   ├── MMS_VS_XLSR53.md           # MMS vs XLSR-53 comparison & migration
│   ├── AUDIOLM_INTEGRATION.md     # AudioLM architecture & integration guide
│   └── PIPELINE.md                # Step-by-step manual
└── audio_data/                    # Raw input files (gitignored)

🔬 Design Decisions & Trade-offs

See docs/ARCHITECTURE.md for a comprehensive analysis.

Key Highlights:

  • Why 100 Units? Balanced trade-off between phonetic granularity and model trainability.
  • Why Layer 14? Best layer in XLSR-53 for phonetic content, filtering out speaker identity.
  • Why Robotic Audio (V1)? We deliberately discarded pitch (F0) to focus on phonetic content.
  • V2 Solution: Re-injects pitch via conditioning, uses HiFi-GAN architecture and enhanced losses.

Future Directions

| Technology | Document | Potential Benefit |
|------------|----------|-------------------|
| MMS (Meta) | MMS_VS_XLSR53.md | Better low-resource language support (1,400+ languages) |
| AudioLM (Google) | AUDIOLM_INTEGRATION.md | State-of-the-art quality via semantic + acoustic tokens |

These documents provide deep technical analysis of how emerging architectures could enhance our pipeline.

📊 Results

V1 Vocoder

  • Compression: ~775x reduction in bitrate (Raw Audio → Discrete Tokens); see the back-of-envelope check below.
  • Intelligibility: High. The vocoder successfully reconstructs words from tokens.
  • Naturalness: Low. Prosody is flat due to F0 loss.
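
The compression figure follows from a simple calculation, assuming 16-bit PCM at 16 kHz on the raw side and roughly 50 units per second (the XLSR-53 frame rate) drawn from a 100-unit vocabulary on the token side:

import math

raw_bps = 16_000 * 16               # 256,000 bits/s of 16-bit PCM at 16 kHz
token_bps = 50 * math.log2(100)     # ≈ 332 bits/s of discrete units
print(round(raw_bps / token_bps))   # ≈ 771, consistent with the ~775x figure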

V2 Vocoder (Expected)

  • Intelligibility: High. Same phonetic reconstruction.
  • Naturalness: Medium-High. Pitch conditioning restores prosody.
  • F0 Accuracy: < 20 Hz RMSE (good pitch tracking; a measurement sketch follows below).
  • MCD: < 5.0 (good spectral quality).
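
For reference, the F0 RMSE metric can be computed roughly as follows (a sketch only, assuming pyin pitch tracks at 16 kHz; vocoder_test_v2.py may use a different extractor):

import numpy as np
import librosa

def f0_rmse(ref: np.ndarray, syn: np.ndarray, sr: int = 16000) -> float:
    kw = dict(fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr)
    f0_ref, v_ref, _ = librosa.pyin(ref, **kw)
    f0_syn, v_syn, _ = librosa.pyin(syn, **kw)
    n = min(len(f0_ref), len(f0_syn))
    voiced = v_ref[:n] & v_syn[:n]             # only compare frames voiced in both signals
    err = f0_ref[:n][voiced] - f0_syn[:n][voiced]
    return float(np.sqrt(np.mean(err ** 2)))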

License

Private - shemaobt organization.
