Firstly, install all Python package requirements:

```bash
pip install -r requirements.txt
```

Secondly, build `monotonic_align` code (Cython):

```bash
cd model/monotonic_align; python setup.py build_ext --inplace; cd ../..
```

**Note**: code is tested on Python==3.6.9.
- Make filelists of your audio data like the ones included in the `resources/filelists` folder. For single-speaker training refer to the `ljspeech` filelists, and to the `libri-tts` filelists for multispeaker.
- Set the experiment configuration in the `params.py` file.
- Specify your GPU device and run the training script:

```bash
export CUDA_VISIBLE_DEVICES=YOUR_GPU_ID
python train.py                # if single speaker
python train_multi_speaker.py  # if multispeaker
```
- To track your training process, run a TensorBoard server on any available port:

```bash
tensorboard --logdir=YOUR_LOG_DIR --port=8888
```

During training, all logging information and checkpoints are stored in `YOUR_LOG_DIR`, which you can specify in `params.py` before training.
You can download Grad-TTS and HiFi-GAN checkpoints trained on the LJSpeech* and Libri-TTS datasets (22kHz) from here.

Put the necessary Grad-TTS checkpoints into the `checkpts` folder in the root Grad-TTS directory.
- Create a text file with the sentences you want to synthesize, like `resources/filelists/synthesis.txt`.
- Run the `inference.py` script, providing the path to the text file, the path to the Grad-TTS checkpoint, the number of iterations to be used for reverse diffusion (default: 10), and the speaker id if you want to perform multispeaker inference:

```bash
python inference.py -f <your-text-file> -c <grad-tts-checkpoint> -t <number-of-timesteps> -s <speaker-id-if-multispeaker>
```

- Check out the folder called `out` for the generated audio.
Download the pretrained Grad-TTS checkpoint here.
Install using pip:

```bash
pip install diffwave
```

or from GitHub:

```bash
git clone https://github.com/lmnt-com/diffwave.git
cd diffwave
pip install .
```
Before you start training, you'll need to prepare a training dataset. The dataset can have any directory structure as long as the contained .wav files are 16-bit mono. If you need to change the data processing parameters, edit `params.py`.
```bash
python -m diffwave.preprocess /path/to/dir/containing/wavs
```
```bash
python -m diffwave /path/to/model/dir /path/to/dir/containing/wavs

# in another shell to monitor training progress:
tensorboard --logdir /path/to/model/dir --bind_all
```
You should expect to hear intelligible (but noisy) speech by ~8k steps (~1.5h on a 2080 Ti).
Basic usage:

```python
from diffwave.inference import predict as diffwave_predict

model_dir = '/path/to/model/dir'
spectrogram = ...  # get your hands on a spectrogram in [N,C,W] format
audio, sample_rate = diffwave_predict(spectrogram, model_dir, fast_sampling=True)
# audio is a GPU tensor in [N,T] format.
```

Or from the command line:

```bash
python -m diffwave.inference --fast /path/to/model /path/to/spectrogram -o output.wav
```
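For reference, the `[N,C,W]` layout is batch × mel channels × frames. The snippet below builds a dummy stand-in with NumPy to illustrate the shape only; the 80-channel figure assumes diffwave's default mel configuration, and `diffwave_predict` itself expects a torch tensor, not a NumPy array.

```python
import numpy as np

# Dummy stand-in for a mel spectrogram in [N, C, W] layout:
# N = batch size, C = mel channels (assuming the default 80),
# W = number of spectrogram frames.
N, C, W = 1, 80, 200
spectrogram = np.random.randn(N, C, W).astype(np.float32)
print(spectrogram.shape)  # (1, 80, 200)
```

In practice you would load the `.spec.npy` files produced by `diffwave.preprocess` rather than random data.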
```bash
pip install .
python -m diffwave.batch_inference
```
Download the pretrained DiffWave checkpoint here.
- The Monotonic Alignment Search algorithm is used for unsupervised duration modelling; official GitHub repository: link.
- Phonemization utilizes CMUdict; official GitHub repository: link.
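For intuition, Monotonic Alignment Search is a dynamic program that assigns each mel frame to exactly one text token, monotonically, so as to maximize the total log-likelihood. The NumPy sketch below illustrates the idea under simplifying assumptions (it requires at least as many frames as tokens); it is not the repository's optimized Cython implementation.

```python
import numpy as np

def monotonic_alignment_search(log_p):
    """Sketch of MAS via dynamic programming.

    log_p: array of shape [T_text, T_mel] holding the log-likelihood of
    each mel frame under each text token's distribution. Assumes
    T_mel >= T_text. Returns a 0/1 alignment matrix of the same shape
    where each frame is assigned to exactly one token, monotonically.
    """
    T_text, T_mel = log_p.shape
    Q = np.full((T_text, T_mel), -np.inf)
    Q[0, 0] = log_p[0, 0]
    for j in range(1, T_mel):
        # Token index can never exceed frame index (one token per frame).
        for i in range(min(j + 1, T_text)):
            stay = Q[i, j - 1]                               # same token
            move = Q[i - 1, j - 1] if i > 0 else -np.inf     # next token
            Q[i, j] = log_p[i, j] + max(stay, move)
    # Backtrack from the last token at the last frame.
    align = np.zeros_like(log_p)
    i = T_text - 1
    for j in range(T_mel - 1, -1, -1):
        align[i, j] = 1.0
        # Move to the previous token when it is better (or forced by i == j).
        if j > 0 and (i == j or (i > 0 and Q[i - 1, j - 1] >= Q[i, j - 1])):
            i -= 1
    return align
```

The durations used by Grad-TTS then follow by summing each row of the alignment matrix (frames per token).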
- DiffWave: A Versatile Diffusion Model for Audio Synthesis
- Denoising Diffusion Probabilistic Models
- Code for Denoising Diffusion Probabilistic Models
- Text-To-Speech Synthesis In The Wild