Paper: https://arxiv.org/abs/2602.15675
NileTTS is a large-scale Egyptian Arabic Text-to-Speech dataset and fine-tuned XTTS model. This repository contains the code for data generation, model training, and evaluation as described in our paper.
| Resource | Link |
|---|---|
| Model Weights | KickItLikeShika/NileTTS-XTTS |
| Dataset | KickItLikeShika/NileTTS-dataset |
NileTTS addresses the lack of high-quality TTS resources for Egyptian Arabic by providing:
- 38 hours of transcribed Egyptian Arabic speech across medical, sales, and general conversation domains
- A fine-tuned XTTS v2 model optimized for Egyptian Arabic synthesis
- A reproducible synthetic data generation pipeline
| Model | WER | CER | Speaker Similarity |
|---|---|---|---|
| XTTS v2 (Baseline) | 26.8% | 8.1% | 0.713 |
| NileTTS (Ours) | 18.8% | 4.1% | 0.755 |
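For reference, WER and CER are both Levenshtein edit distances normalized by reference length. This minimal pure-Python sketch (not the paper's evaluation script, which may use a library such as `jiwer`) shows the calculation:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    m, n = len(ref), len(hyp)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            cur[j] = min(prev[j] + 1,          # deletion
                         cur[j - 1] + 1,       # insertion
                         prev[j - 1] + cost)   # substitution
        prev = cur
    return prev[n]

def wer(ref, hyp):
    """Word error rate: word-level edits / reference word count."""
    r, h = ref.split(), hyp.split()
    return edit_distance(r, h) / len(r)

def cer(ref, hyp):
    """Character error rate: char-level edits / reference char count."""
    return edit_distance(list(ref), list(hyp)) / len(ref)
```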
```bash
pip install -r requirements.txt
```

See `playground.ipynb` for a complete example of loading and using the model and the dataset.
```
NileTTS/
├── generate-data.py
├── evaluate.py
├── playground.ipynb
├── requirements.txt
└── README.md
```
The `generate-data.py` script processes audio files generated by NotebookLM into training-ready chunks with transcriptions and speaker labels.
Before running the script, you need:
- Audio file: An `.m4a` or `.wav` file containing Egyptian Arabic speech (e.g., from NotebookLM)
- Speaker centroids: A `speaker_centroids.pkl` file containing pre-computed speaker embeddings for diarization
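The exact layout of `speaker_centroids.pkl` is not specified here; one plausible format (an assumption, not confirmed by the script) is a dict mapping speaker labels to mean embedding vectors, which could be built like this:

```python
import pickle

def build_centroids(embeddings_by_speaker):
    """Average per-speaker embeddings into one centroid vector each.

    embeddings_by_speaker: dict mapping speaker label -> list of
    equal-length embedding vectors (e.g. ECAPA-TDNN outputs).
    """
    centroids = {}
    for speaker, vectors in embeddings_by_speaker.items():
        dim = len(vectors[0])
        centroids[speaker] = [
            sum(v[d] for v in vectors) / len(vectors) for d in range(dim)
        ]
    return centroids

# Toy 3-dimensional embeddings purely for illustration
centroids = build_centroids({
    "speaker_a": [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]],
    "speaker_b": [[0.0, 1.0, 0.0]],
})
with open("speaker_centroids.pkl", "wb") as f:
    pickle.dump(centroids, f)
```

In practice the embeddings would come from the same ECAPA-TDNN model used at diarization time, so that cosine similarity against these centroids is meaningful.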
- Transcription: Uses Whisper Large to transcribe the audio with Arabic language setting
- Chunking: Groups transcription segments into chunks of max 15 seconds
- Speaker Diarization: Identifies speaker for each chunk using ECAPA-TDNN embeddings and cosine similarity to pre-computed centroids
- Export: Saves audio chunks as WAV files with corresponding transcriptions and metadata CSV
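The chunking and diarization steps above can be sketched as follows. This is a simplified reimplementation, not the actual script; the segment format mirrors Whisper's `{"start", "end", "text"}` output, and the embeddings stand in for ECAPA-TDNN vectors:

```python
import math

MAX_CHUNK_SECONDS = 15.0

def group_segments(segments, max_len=MAX_CHUNK_SECONDS):
    """Greedily merge consecutive transcription segments into
    chunks no longer than max_len seconds."""
    chunks, current = [], []
    for seg in segments:
        if current and seg["end"] - current[0]["start"] > max_len:
            chunks.append(current)
            current = []
        current.append(seg)
    if current:
        chunks.append(current)
    return [{
        "start": c[0]["start"],
        "end": c[-1]["end"],
        "text": " ".join(s["text"].strip() for s in c),
    } for c in chunks]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def assign_speaker(embedding, centroids):
    """Label a chunk with the centroid of highest cosine similarity."""
    return max(centroids, key=lambda s: cosine_similarity(embedding, centroids[s]))
```

Each resulting chunk would then be cut from the source audio, embedded, labeled via `assign_speaker`, and written out alongside the metadata CSV.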
The `playground.ipynb` notebook demonstrates:
- Loading the NileTTS model from HuggingFace
- Downloading dataset samples on-demand
- Generating speech from text
- Playing and saving generated audio
If you use NileTTS in your research, please cite: [TO BE ADDED]
This project is licensed under the Apache 2.0 License - see the LICENSE file for details.
- XTTSv2-Finetuning-for-New-Languages by @anhnh2002 for the training code. We adapted their fine-tuning pipeline and added evaluation metrics (WER, CER, speaker similarity) and Weights & Biases integration.
- Coqui TTS for the XTTS v2 architecture
- OpenAI Whisper for transcription
- SpeechBrain for speaker embeddings
- Google NotebookLM for audio synthesis