
NileTTS

Paper: https://arxiv.org/abs/2602.15675

NileTTS is a large-scale Egyptian Arabic Text-to-Speech dataset and fine-tuned XTTS model. This repository contains the code for data generation, model training, and evaluation as described in our paper.

Resources

| Resource | Link |
| --- | --- |
| Model Weights | KickItLikeShika/NileTTS-XTTS |
| Dataset | KickItLikeShika/NileTTS-dataset |

Overview

NileTTS addresses the lack of high-quality TTS resources for Egyptian Arabic by providing:

  1. 38 hours of transcribed Egyptian Arabic speech across medical, sales, and general conversation domains
  2. A fine-tuned XTTS v2 model optimized for Egyptian Arabic synthesis
  3. A reproducible synthetic data generation pipeline

Results

| Model | WER | CER | Speaker Similarity |
| --- | --- | --- | --- |
| XTTS v2 (Baseline) | 26.8% | 8.1% | 0.713 |
| NileTTS (Ours) | 18.8% | 4.1% | 0.755 |
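For reference, WER and CER are both edit-distance metrics: the Levenshtein distance between the hypothesis and the reference transcript, divided by the reference length, computed over words (WER) or characters (CER). A minimal pure-Python sketch (an illustration, not the repository's evaluate.py):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists of words or strings)."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[-1][-1]

def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by reference length."""
    return edit_distance(reference, hypothesis) / len(reference)
```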

Quick Start

Installation

pip install -r requirements.txt

Using the Model

See playground.ipynb for a complete example of loading and using the model and the dataset.
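If you prefer a script over the notebook, the model can be driven through the Coqui TTS Python API. The sketch below is an illustration rather than the notebook's exact code: the local checkpoint directory `NileTTS-XTTS` and the helper name `synthesize` are assumptions, and the weights must first be downloaded from the HuggingFace repo listed above.

```python
def synthesize(text, speaker_wav, out_path="output.wav"):
    """Generate Egyptian Arabic speech with a fine-tuned XTTS v2 checkpoint.

    `NileTTS-XTTS` is an assumed local directory holding the downloaded
    weights and config; adjust the paths to wherever you placed them.
    """
    # Imported inside the function so the sketch can be read without Coqui TTS installed.
    from TTS.api import TTS

    tts = TTS(model_path="NileTTS-XTTS",
              config_path="NileTTS-XTTS/config.json")
    # XTTS v2 is a voice-cloning model: it conditions on a short reference clip.
    tts.tts_to_file(text=text,
                    speaker_wav=speaker_wav,
                    language="ar",
                    file_path=out_path)
    return out_path
```

Call it as `synthesize("<Egyptian Arabic sentence>", "reference.wav")`, where the reference is a few seconds of the target speaker's audio.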

Repository Structure

NileTTS/
├── generate-data.py
├── evaluate.py
├── playground.ipynb
├── requirements.txt
└── README.md

Data Generation Pipeline

The generate-data.py script processes audio files generated by NotebookLM into training-ready chunks with transcriptions and speaker labels.

Prerequisites

Before running the script, you need:

  1. Audio file: An .m4a or .wav file containing Egyptian Arabic speech (e.g., from NotebookLM)
  2. Speaker centroids: A speaker_centroids.pkl file containing pre-computed speaker embeddings for diarization
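The exact layout of speaker_centroids.pkl is not documented here; a plausible minimal format is a dict mapping speaker labels to mean ECAPA-TDNN embeddings. A sketch of building such a file (the labels and the 192-dimensional size are assumptions; real centroids would be averaged embeddings of reference audio, not random vectors):

```python
import pickle

import numpy as np

# Assumed format: {speaker_label: mean_embedding}. ECAPA-TDNN embeddings
# are typically 192-dimensional; random stand-ins are used here for illustration.
rng = np.random.default_rng(0)
centroids = {
    "speaker_0": rng.standard_normal(192),
    "speaker_1": rng.standard_normal(192),
}

with open("speaker_centroids.pkl", "wb") as f:
    pickle.dump(centroids, f)

# Round-trip to confirm the file loads back as expected.
with open("speaker_centroids.pkl", "rb") as f:
    loaded = pickle.load(f)
```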

How It Works

  1. Transcription: Uses Whisper Large to transcribe the audio with the language set to Arabic
  2. Chunking: Groups transcription segments into chunks of at most 15 seconds
  3. Speaker Diarization: Identifies the speaker for each chunk using ECAPA-TDNN embeddings and cosine similarity to pre-computed centroids
  4. Export: Saves audio chunks as WAV files with corresponding transcriptions and a metadata CSV
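The chunking and diarization steps above can be sketched as follows. This is a simplified illustration, not the exact logic of generate-data.py; the segment dicts mirror Whisper's output format (`start`/`end` timestamps in seconds).

```python
import numpy as np

def group_segments(segments, max_len=15.0):
    """Greedily merge consecutive Whisper segments into chunks of at most max_len seconds."""
    chunks, current = [], []
    for seg in segments:
        if current and seg["end"] - current[0]["start"] > max_len:
            chunks.append(current)
            current = []
        current.append(seg)
    if current:
        chunks.append(current)
    return chunks

def assign_speaker(embedding, centroids):
    """Return the centroid label with the highest cosine similarity to the chunk embedding."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(centroids, key=lambda name: cos(embedding, centroids[name]))
```

In the real pipeline, `embedding` would be an ECAPA-TDNN embedding of the chunk's audio and `centroids` the dict loaded from speaker_centroids.pkl.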

Interactive Demo

The playground.ipynb notebook demonstrates:

  • Loading the NileTTS model from HuggingFace
  • Downloading dataset samples on-demand
  • Generating speech from text
  • Playing and saving generated audio

Citation

If you use NileTTS in your research, please cite: [TO BE ADDED]

License

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.

Acknowledgements
