diff --git a/Diffvc-Readme.md b/Diffvc-Readme.md
new file mode 100644
index 00000000..cebe4f53
--- /dev/null
+++ b/Diffvc-Readme.md
@@ -0,0 +1,239 @@
+# Diffvc-Readme
+
+## Team Roles and Workload
+
+吕春吉 (contribution: 50%): team leader; read the paper and its source code and worked out the architecture; integrated the main model, the dataset processing code, the trainer, and the inference code into the framework; trained the model and tuned its parameters.
+
+蔡昕怡 (contribution: 25%): responsible for finding and processing the dataset, training and tuning the model, and integrating the hifi_gan module.
+
+唐欣欣 (contribution: 10%): responsible for adding the encoder training code.
+
+马翊程 (contribution: 10%): participated in integrating the data processing code.
+
+孙嘉成 (contribution: 5%): responsible for finding the dataset.
+
+## Implemented Features
+
+The framework supports training the diffvc model: it can process the dataset into the required form (performing the necessary feature extraction and generation), and entering the specified command on the command line runs training and returns correct training results.
+
+## Training Screenshots
+
+See the readme picture folder.
+
+## Dependencies
+
+See the diffvc_requirements.txt file.
+
+## Training Procedure
+
+1. Download the pretrained HiFi-GAN vocoder.
+
+Taken from the official HiFi-GAN repository: https://drive.google.com/file/d/10khlrM645pTbQ4rc2aNEYPba8RFDBkW-/view?usp=sharing
+
+Place it under checkpts/vocoder/.
+
+2. Download the models trained on LibriTTS and VCTK:
+
+LibriTTS: https://drive.google.com/file/d/18Xbme0CTVo58p2vOHoTQm8PBGW7oEjAy/view?usp=sharing
+
+VCTK: https://drive.google.com/file/d/12s9RPmwp9suleMkBCVetD8pub7wsDAy4/view?usp=sharing
+
+Place them under checkpts/vc/.
+
+3. Set up the environment; see the diffvc_requirements.txt file.
+
+4. Obtain the dataset and process the data:
+
+Dataset link: https://www.openslr.org/60/
+
+Data processing:
+
+① First create a dataset folder data with four subfolders: "wavs", "mels", "filelist" and "embeds". Copy the wav files of the original dataset into "wavs", keeping the original folder structure, and add the training file lists valid.txt and exceptions_libritts.txt to "filelist".
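+
+For reference, the intended layout of the data folder (folder purposes are inferred from steps ①-⑤; the per-speaker grouping under mels/ matches what get_avg_mels.ipynb reads):
+
+```
+data/
+├── wavs/      # original wav files, keeping the source folder structure
+├── mels/      # mel spectrograms generated in steps ③-⑤
+├── embeds/    # speaker embeddings generated in steps ③-⑤
+├── filelist/  # valid.txt, exceptions_libritts.txt
+```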
+
+(The code blocks referred to below are under talkingface/data/dataprocess/.)
+
+② Run the first part of the inference notebook and set up the corresponding environment (note: the librosa version is 0.9.2).
+See readme picture/pic1.jpg.
+
+③ Read the get_mel and get_embed functions in the second code block of the inference notebook and write code that runs both of them over the original wav files: get_mel generates the mel files and get_embed generates the embed files, each saved into its own folder. (Note that this step calls the pretrained spk_encoder model; download it and reference its path.)
+
+See readme picture/pic2.jpg.
+
+④ Add code to the original notebook that loads the pretrained model, walks the wav files, runs the two functions, and saves the results into the two corresponding folders with correct names (a minimal sketch is given after step ⑤):
+
+See readme picture/pic3.jpg.
+
+⑤ Run that code block to obtain the processed embed and mel files.
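+
+A minimal sketch of the loop described in steps ③-④, assuming the get_mel/get_embed helpers and the spk_encoder module from the inference notebook are already defined in the session; the spk_encoder checkpoint path and the `*_embed.npy` naming are illustrative assumptions, while the `*_mel.npy` naming follows what get_avg_mels.ipynb expects:
+
+```python
+import os
+import numpy as np
+
+# load the pretrained speaker encoder referenced in step ③ (path is an example)
+spk_encoder.load_model('checkpts/spk_encoder/pretrained.pt', device='cpu')
+
+wav_dir, mel_dir, emb_dir = 'data/wavs', 'data/mels', 'data/embeds'
+for root, _, files in os.walk(wav_dir):
+    for fname in files:
+        if not fname.endswith('.wav'):
+            continue
+        wav_path = os.path.join(root, fname)
+        # group outputs by the top-level speaker folder, matching data/mels/<speaker>/
+        speaker = os.path.relpath(root, wav_dir).split(os.sep)[0]
+        base = fname[:-len('.wav')]
+        os.makedirs(os.path.join(mel_dir, speaker), exist_ok=True)
+        os.makedirs(os.path.join(emb_dir, speaker), exist_ok=True)
+        # mel spectrogram and speaker embedding for this utterance
+        np.save(os.path.join(mel_dir, speaker, base + '_mel.npy'), get_mel(wav_path))
+        np.save(os.path.join(emb_dir, speaker, base + '_embed.npy'), get_embed(wav_path))
+```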
+
+5. Create a logs_enc folder and download the trained encoder into it.
+
+Encoder download link: https://drive.google.com/file/d/1JdoC5hh7k6Nz_oTcumH0nXNEib-GDbSq/view?usp=sharing
+
+6. Create a new log_dec folder.
+
+7. Start training.
+
+## Framework Overview
+
+### checkpoints
+
+Stores the extra pretrained models needed to train and evaluate the models; see the [README](https://github.com/Academic-Hammer/talkingface-toolkit/blob/main/checkpoints/README.md) in that folder for more details.
+
+### dataset
+
+Stores the datasets and the data produced by preprocessing them; see the [README](https://github.com/Academic-Hammer/talkingface-toolkit/blob/main/dataset/README.md) inside dataset for details.
+
+### saved
+
+Stores the model checkpoints saved during training; created automatically when a model is saved.
+
+### talkingface
+
+The main functional module, containing all of the core code.
+
+#### config
+
+Automatically generates all model, dataset, training and evaluation configuration from the model and dataset names.
+
+```
+config/
+├── configurator.py
+```
+
+#### data
+
+- dataprocess: for this model, mainly inference.ipynb and get_avg_mels.ipynb, which generate the mels and embeds folders of the dataset and their contents.
+- dataset: the data-handling code, mainly diffvc_dataset; the other files handle the data processing needed for encoder training and can be ignored.
+
+```
+data/
+├── dataprocess
+| ├── wav2lip_process.py
+| ├── inference.ipynb
+| ├── get_avg_mels.ipynb
+├── dataset
+| ├── wav2lip_dataset.py
+| ├── diffvc_dataset.py
+```
+
+#### evaluate
+
+Code for model evaluation.
+The LSE metric takes a list of generated videos as input.
+The SSIM metric takes the lists of generated and ground-truth videos as input.
+
+#### model
+
+The networks of the implemented models and their corresponding methods.
+
+diffvc is stored in diffvc.py under the voice_conversion folder; hifi-gan is a module needed for encoder training and can be ignored.
+
+```
+model/
+├── audio_driven_talkingface
+| ├── wav2lip.py
+├── image_driven_talkingface
+| ├── xxxx.py
+├── nerf_based_talkingface
+| ├── xxxx.py
+├── voice_conversion
+| ├── diffvc.py
+├── abstract_talkingface.py
+```
+
+#### properties
+
+Stores the default configuration files, including diffvc.yaml, diffvc_dataset.yaml, diffvc_encoder.yaml and diffvc_encoder_dataset.yaml; diffvc_encoder.yaml and diffvc_encoder_dataset.yaml are only needed for encoder training and can be ignored.
+
+```
+properties/
+├── dataset
+| ├── diffvc_dataset.yaml
+| ├── diffvc_encoder_dataset.yaml
+├── model
+| ├── diffvc.yaml
+| ├── diffvc_encoder.yaml
+├── overall.yaml
+```
+
+#### quick_start
+
+The generic entry point: it automatically configures the dataset and model from the passed arguments, then runs training and evaluation.
+
+```
+quick_start/
+├── quick_start.py
+```
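+
+A minimal sketch of calling it programmatically, assuming quick_start exposes a run_talkingface(model, dataset, config_file_list) helper that mirrors the command shown under "Training and Evaluation" below (the exact function name and signature are assumptions, not confirmed from the source):
+
+```python
+from talkingface.quick_start import run_talkingface  # assumed entry point
+
+# equivalent to: python run_talkingface.py --model=diffvc --dataset=diffvc_data
+run_talkingface(model='diffvc', dataset='diffvc_data', config_file_list=None)
+```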
+
+#### trainer
+
+The main classes for the training and evaluation functions.
+
+```
+trainer/
+├── trainer.py
+```
+
+#### utils
+
+Shared utility classes.
+
+## Usage
+
+### Environment Requirements
+
+See diffvc_requirements.txt; the environment can be set up by running:
+
+```
+pip install -r diffvc_requirements.txt
+```
+
+
+
+### Training and Evaluation
+
+```bash
+python run_talkingface.py --model=diffvc --dataset=diffvc_data
+```
+
+
+
+## Paper and Source Repository Links
+
+Paper: https://arxiv.org/abs/2109.13821
+
+Source repository: https://github.com/huawei-noah/Speech-Backbones/tree/main/DiffVC
+
+
+
diff --git a/diffvc_requirements.txt b/diffvc_requirements.txt
new file mode 100644
index 00000000..6e9219d6
--- /dev/null
+++ b/diffvc_requirements.txt
@@ -0,0 +1,12 @@
+torchaudio==0.5.1
+torch==1.7.1
+einops==0.3.0
+librosa==0.6.0
+webrtcvad==2.0.10
+numpy==1.19.0
+scipy==1.5.1
+matplotlib==3.3.3
+tb-nightly
+future
+tqdm
+tgt
\ No newline at end of file
diff --git a/readme picture/image-20240124225137897.png b/readme picture/image-20240124225137897.png
new file mode 100644
index 00000000..c5b1d6de
Binary files /dev/null and b/readme picture/image-20240124225137897.png differ
diff --git a/readme picture/image-20240124225205428.png b/readme picture/image-20240124225205428.png
new file mode 100644
index 00000000..08489a87
Binary files /dev/null and b/readme picture/image-20240124225205428.png differ
diff --git a/readme picture/image-20240124225219412.png b/readme picture/image-20240124225219412.png
new file mode 100644
index 00000000..dbea7b16
Binary files /dev/null and b/readme picture/image-20240124225219412.png differ
diff --git a/readme picture/image-20240124225229981.png b/readme picture/image-20240124225229981.png
new file mode 100644
index 00000000..3a359772
Binary files /dev/null and b/readme picture/image-20240124225229981.png differ
diff --git a/readme picture/pic1.jpg b/readme picture/pic1.jpg
new file mode 100644
index 00000000..9ac70b00
Binary files /dev/null and b/readme picture/pic1.jpg differ
diff --git a/readme picture/pic2.jpg b/readme picture/pic2.jpg
new file mode 100644
index 00000000..bbbffcc9
Binary files /dev/null and b/readme picture/pic2.jpg differ
diff --git a/readme picture/pic3.jpg b/readme picture/pic3.jpg
new file mode 100644
index 00000000..5daafa3a
Binary files /dev/null and b/readme picture/pic3.jpg differ
diff --git a/talkingface/config/configurator.py b/talkingface/config/configurator.py
index 7b6e21d8..64773dc9 100644
--- a/talkingface/config/configurator.py
+++ b/talkingface/config/configurator.py
@@ -3,7 +3,11 @@
import sys
import yaml
from logging import getLogger
-from typing import Literal
+# typing.Literal requires Python >= 3.8; fall back to the typing_extensions backport
+from typing_extensions import Literal
+
from talkingface.utils import(
get_model,
@@ -203,7 +207,6 @@ def _get_model_and_dataset(self, model, dataset):
)
else:
final_dataset = dataset
-
return final_model, final_model_class, final_dataset
def _update_internal_config_dict(self, file):
diff --git a/talkingface/config/diffVC.py b/talkingface/config/diffVC.py
new file mode 100644
index 00000000..1c21312f
--- /dev/null
+++ b/talkingface/config/diffVC.py
@@ -0,0 +1,45 @@
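+# Dataset directory groupings, given relative to the datasets root
+# (presumably used when preparing data for the encoder training mentioned in the README).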
+librispeech_datasets = {
+ "train": {
+ "clean": ["LibriSpeech/train-clean-100", "LibriSpeech/train-clean-360"],
+ "other": ["LibriSpeech/train-other-500"]
+ },
+ "test": {
+ "clean": ["LibriSpeech/test-clean"],
+ "other": ["LibriSpeech/test-other"]
+ },
+ "dev": {
+ "clean": ["LibriSpeech/dev-clean"],
+ "other": ["LibriSpeech/dev-other"]
+ },
+}
+libritts_datasets = {
+ "train": {
+ "clean": ["LibriTTS/train-clean-100", "LibriTTS/train-clean-360"],
+ "other": ["LibriTTS/train-other-500"]
+ },
+ "test": {
+ "clean": ["LibriTTS/test-clean"],
+ "other": ["LibriTTS/test-other"]
+ },
+ "dev": {
+ "clean": ["LibriTTS/dev-clean"],
+ "other": ["LibriTTS/dev-other"]
+ },
+}
+voxceleb_datasets = {
+ "voxceleb1" : {
+ "train": ["VoxCeleb1/wav"],
+ "test": ["VoxCeleb1/test_wav"]
+ },
+ "voxceleb2" : {
+ "train": ["VoxCeleb2/dev/aac"],
+ "test": ["VoxCeleb2/test_wav"]
+ }
+}
+
+other_datasets = [
+ "LJSpeech-1.1",
+ "VCTK-Corpus/wav48",
+]
+
+anglophone_nationalites = ["australia", "canada", "ireland", "uk", "usa"]
diff --git a/talkingface/data/dataprocess/get_avg_mels.ipynb b/talkingface/data/dataprocess/get_avg_mels.ipynb
new file mode 100644
index 00000000..a5ab8c59
--- /dev/null
+++ b/talkingface/data/dataprocess/get_avg_mels.ipynb
@@ -0,0 +1,2450 @@
+{
+ "cells": [
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {
+ "collapsed": true
+ },
+ "outputs": [],
+ "source": [
+ "import os\n",
+ "import numpy as np\n",
+ "import tgt\n",
+ "from scipy.stats import mode"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {
+ "collapsed": true
+ },
+ "outputs": [],
+ "source": [
+ "phoneme_list = ['AA0', 'AA1', 'AA2', 'AE0', 'AE1', 'AE2', \n",
+ " 'AH0', 'AH1', 'AH2', 'AO0', 'AO1', 'AO2', 'AW0', \n",
+ " 'AW1', 'AW2', 'AY0', 'AY1', 'AY2', 'B', 'CH', 'D', 'DH', \n",
+ " 'EH0', 'EH1', 'EH2', 'ER0', 'ER1', 'ER2', \n",
+ " 'EY0', 'EY1', 'EY2', 'F', 'G', 'HH', 'IH0', 'IH1', 'IH2', \n",
+ " 'IY0', 'IY1', 'IY2', 'JH', 'K', 'L', 'M', 'N', 'NG', \n",
+ " 'OW0', 'OW1', 'OW2', 'OY0', 'OY1', 'OY2', 'P', \n",
+ " 'R', 'S', 'SH', 'T', 'TH', 'UH0', 'UH1', 'UH2', \n",
+ " 'UW0', 'UW1', 'UW2', 'V', 'W', 'Y', 'Z', 'ZH', 'sil', 'sp', 'spn']\n",
+ "phoneme_dict = dict()\n",
+ "for j, p in enumerate(phoneme_list):\n",
+ " phoneme_dict[p] = j"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {
+ "collapsed": false,
+ "scrolled": true
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Speaker 1: 8057\n",
+ "Speaker 2: 4014\n",
+ "Speaker 3: 6415\n",
+ "Speaker 4: 5126\n",
+ "Speaker 5: 3723\n",
+ "Speaker 6: 587\n",
+ "Speaker 7: 8534\n",
+ "Speaker 8: 5322\n",
+ "Speaker 9: 2238\n",
+ "Speaker 10: 1401\n",
+ "Speaker 11: 4427\n",
+ "Speaker 12: 1705\n",
+ "Speaker 13: 561\n",
+ "Speaker 14: 2992\n",
+ "Speaker 15: 8776\n",
+ "Speaker 16: 54\n",
+ "Speaker 17: 806\n",
+ "Speaker 18: 1970\n",
+ "Speaker 19: 302\n",
+ "Speaker 20: 6272\n",
+ "Speaker 21: 1289\n",
+ "Speaker 22: 3807\n",
+ "Speaker 23: 6075\n",
+ "Speaker 24: 329\n",
+ "Speaker 25: 3483\n",
+ "Speaker 26: 1914\n",
+ "Speaker 27: 6499\n",
+ "Speaker 28: 7117\n",
+ "Speaker 29: 5703\n",
+ "Speaker 30: 3032\n",
+ "Speaker 31: 3001\n",
+ "Speaker 32: 5304\n",
+ "Speaker 33: 5012\n",
+ "Speaker 34: 8786\n",
+ "Speaker 35: 3187\n",
+ "Speaker 36: 5935\n",
+ "Speaker 37: 1088\n",
+ "Speaker 38: 783\n",
+ "Speaker 39: 5186\n",
+ "Speaker 40: 7994\n",
+ "Speaker 41: 6078\n",
+ "Speaker 42: 3168\n",
+ "Speaker 43: 6550\n",
+ "Speaker 44: 6701\n",
+ "Speaker 45: 4926\n",
+ "Speaker 46: 1355\n",
+ "Speaker 47: 1337\n",
+ "Speaker 48: 2582\n",
+ "Speaker 49: 8119\n",
+ "Speaker 50: 5767\n",
+ "Speaker 51: 1112\n",
+ "Speaker 52: 6054\n",
+ "Speaker 53: 5583\n",
+ "Speaker 54: 6120\n",
+ "Speaker 55: 4290\n",
+ "Speaker 56: 3440\n",
+ "Speaker 57: 2230\n",
+ "Speaker 58: 5802\n",
+ "Speaker 59: 3448\n",
+ "Speaker 60: 730\n",
+ "Speaker 61: 7011\n",
+ "Speaker 62: 40\n",
+ "Speaker 63: 1845\n",
+ "Speaker 64: 7816\n",
+ "Speaker 65: 4010\n",
+ "Speaker 66: 2823\n",
+ "Speaker 67: 511\n",
+ "Speaker 68: 229\n",
+ "Speaker 69: 5319\n",
+ "Speaker 70: 4830\n",
+ "Speaker 71: 8494\n",
+ "Speaker 72: 1509\n",
+ "Speaker 73: 7285\n",
+ "Speaker 74: 1226\n",
+ "Speaker 75: 2638\n",
+ "Speaker 76: 2920\n",
+ "Speaker 77: 7367\n",
+ "Speaker 78: 2598\n",
+ "Speaker 79: 3686\n",
+ "Speaker 80: 412\n",
+ "Speaker 81: 5538\n",
+ "Speaker 82: 663\n",
+ "Speaker 83: 6683\n",
+ "Speaker 84: 1271\n",
+ "Speaker 85: 5514\n",
+ "Speaker 86: 8699\n",
+ "Speaker 87: 7264\n",
+ "Speaker 88: 816\n",
+ "Speaker 89: 5092\n",
+ "Speaker 90: 7752\n",
+ "Speaker 91: 3008\n",
+ "Speaker 92: 5688\n",
+ "Speaker 93: 3513\n",
+ "Speaker 94: 1224\n",
+ "Speaker 95: 8312\n",
+ "Speaker 96: 8791\n",
+ "Speaker 97: 5104\n",
+ "Speaker 98: 5266\n",
+ "Speaker 99: 1392\n",
+ "Speaker 100: 3549\n",
+ "Speaker 101: 7945\n",
+ "Speaker 102: 272\n",
+ "Speaker 103: 5940\n",
+ "Speaker 104: 6437\n",
+ "Speaker 105: 2531\n",
+ "Speaker 106: 6509\n",
+ "Speaker 107: 4064\n",
+ "Speaker 108: 2167\n",
+ "Speaker 109: 3630\n",
+ "Speaker 110: 4018\n",
+ "Speaker 111: 8770\n",
+ "Speaker 112: 8163\n",
+ "Speaker 113: 5809\n",
+ "Speaker 114: 510\n",
+ "Speaker 115: 5007\n",
+ "Speaker 116: 4967\n",
+ "Speaker 117: 8396\n",
+ "Speaker 118: 359\n",
+ "Speaker 119: 5622\n",
+ "Speaker 120: 3521\n",
+ "Speaker 121: 3923\n",
+ "Speaker 122: 1382\n",
+ "Speaker 123: 1012\n",
+ "Speaker 124: 7939\n",
+ "Speaker 125: 4839\n",
+ "Speaker 126: 1175\n",
+ "Speaker 127: 2836\n",
+ "Speaker 128: 4853\n",
+ "Speaker 129: 639\n",
+ "Speaker 130: 4236\n",
+ "Speaker 131: 2654\n",
+ "Speaker 132: 3866\n",
+ "Speaker 133: 335\n",
+ "Speaker 134: 3551\n",
+ "Speaker 135: 1046\n",
+ "Speaker 136: 6147\n",
+ "Speaker 137: 157\n",
+ "Speaker 138: 3094\n",
+ "Speaker 139: 2427\n",
+ "Speaker 140: 8195\n",
+ "Speaker 141: 4238\n",
+ "Speaker 142: 4854\n",
+ "Speaker 143: 7832\n",
+ "Speaker 144: 1748\n",
+ "Speaker 145: 4586\n",
+ "Speaker 146: 7484\n",
+ "Speaker 147: 1825\n",
+ "Speaker 148: 669\n",
+ "Speaker 149: 512\n",
+ "Speaker 150: 4433\n",
+ "Speaker 151: 3374\n",
+ "Speaker 152: 6064\n",
+ "Speaker 153: 2201\n",
+ "Speaker 154: 6519\n",
+ "Speaker 155: 323\n",
+ "Speaker 156: 7515\n",
+ "Speaker 157: 1316\n",
+ "Speaker 158: 3717\n",
+ "Speaker 159: 4362\n",
+ "Speaker 160: 89\n",
+ "Speaker 161: 5810\n",
+ "Speaker 162: 8050\n",
+ "Speaker 163: 1025\n",
+ "Speaker 164: 7991\n",
+ "Speaker 165: 4495\n",
+ "Speaker 166: 3003\n",
+ "Speaker 167: 1001\n",
+ "Speaker 168: 4243\n",
+ "Speaker 169: 7069\n",
+ "Speaker 170: 593\n",
+ "Speaker 171: 1913\n",
+ "Speaker 172: 1058\n",
+ "Speaker 173: 4363\n",
+ "Speaker 174: 2056\n",
+ "Speaker 175: 4535\n",
+ "Speaker 176: 4138\n",
+ "Speaker 177: 2751\n",
+ "Speaker 178: 6367\n",
+ "Speaker 179: 6904\n",
+ "Speaker 180: 8677\n",
+ "Speaker 181: 5123\n",
+ "Speaker 182: 7520\n",
+ "Speaker 183: 6019\n",
+ "Speaker 184: 6294\n",
+ "Speaker 185: 1811\n",
+ "Speaker 186: 4226\n",
+ "Speaker 187: 6206\n",
+ "Speaker 188: 5062\n",
+ "Speaker 189: 16\n",
+ "Speaker 190: 6877\n",
+ "Speaker 191: 163\n",
+ "Speaker 192: 3114\n",
+ "Speaker 193: 7956\n",
+ "Speaker 194: 5002\n",
+ "Speaker 195: 957\n",
+ "Speaker 196: 8635\n",
+ "Speaker 197: 3977\n",
+ "Speaker 198: 3389\n",
+ "Speaker 199: 1639\n",
+ "Speaker 200: 1552\n",
+ "Speaker 201: 925\n",
+ "Speaker 202: 6115\n",
+ "Speaker 203: 2162\n",
+ "Speaker 204: 8051\n",
+ "Speaker 205: 8098\n",
+ "Speaker 206: 2581\n",
+ "Speaker 207: 4381\n",
+ "Speaker 208: 125\n",
+ "Speaker 209: 3046\n",
+ "Speaker 210: 6544\n",
+ "Speaker 211: 7594\n",
+ "Speaker 212: 2136\n",
+ "Speaker 213: 250\n",
+ "Speaker 214: 9023\n",
+ "Speaker 215: 4438\n",
+ "Speaker 216: 954\n",
+ "Speaker 217: 6189\n",
+ "Speaker 218: 7569\n",
+ "Speaker 219: 5389\n",
+ "Speaker 220: 6924\n",
+ "Speaker 221: 226\n",
+ "Speaker 222: 118\n",
+ "Speaker 223: 476\n",
+ "Speaker 224: 1556\n",
+ "Speaker 225: 4267\n",
+ "Speaker 226: 7316\n",
+ "Speaker 227: 409\n",
+ "Speaker 228: 7398\n",
+ "Speaker 229: 2182\n",
+ "Speaker 230: 8193\n",
+ "Speaker 231: 6782\n",
+ "Speaker 232: 3979\n",
+ "Speaker 233: 8879\n",
+ "Speaker 234: 90\n",
+ "Speaker 235: 850\n",
+ "Speaker 236: 4945\n",
+ "Speaker 237: 7833\n",
+ "Speaker 238: 583\n",
+ "Speaker 239: 6000\n",
+ "Speaker 240: 7030\n",
+ "Speaker 241: 1961\n",
+ "Speaker 242: 288\n",
+ "Speaker 243: 8190\n",
+ "Speaker 244: 7090\n",
+ "Speaker 245: 1867\n",
+ "Speaker 246: 2137\n",
+ "Speaker 247: 2110\n",
+ "Speaker 248: 233\n",
+ "Speaker 249: 5063\n",
+ "Speaker 250: 1776\n",
+ "Speaker 251: 7312\n",
+ "Speaker 252: 724\n",
+ "Speaker 253: 6673\n",
+ "Speaker 254: 1851\n",
+ "Speaker 255: 231\n",
+ "Speaker 256: 3082\n",
+ "Speaker 257: 8772\n",
+ "Speaker 258: 8629\n",
+ "Speaker 259: 32\n",
+ "Speaker 260: 4116\n",
+ "Speaker 261: 487\n",
+ "Speaker 262: 1649\n",
+ "Speaker 263: 1731\n",
+ "Speaker 264: 7909\n",
+ "Speaker 265: 3879\n",
+ "Speaker 266: 1903\n",
+ "Speaker 267: 8123\n",
+ "Speaker 268: 3699\n",
+ "Speaker 269: 1348\n",
+ "Speaker 270: 8118\n",
+ "Speaker 271: 2882\n",
+ "Speaker 272: 3083\n",
+ "Speaker 273: 7085\n",
+ "Speaker 274: 5154\n",
+ "Speaker 275: 1724\n",
+ "Speaker 276: 1446\n",
+ "Speaker 277: 1740\n",
+ "Speaker 278: 6080\n",
+ "Speaker 279: 6880\n",
+ "Speaker 280: 7384\n",
+ "Speaker 281: 1334\n",
+ "Speaker 282: 7540\n",
+ "Speaker 283: 5731\n",
+ "Speaker 284: 7517\n",
+ "Speaker 285: 4145\n",
+ "Speaker 286: 3852\n",
+ "Speaker 287: 4846\n",
+ "Speaker 288: 2812\n",
+ "Speaker 289: 1624\n",
+ "Speaker 290: 764\n",
+ "Speaker 291: 3482\n",
+ "Speaker 292: 2688\n",
+ "Speaker 293: 1422\n",
+ "Speaker 294: 1874\n",
+ "Speaker 295: 3072\n",
+ "Speaker 296: 7982\n",
+ "Speaker 297: 345\n",
+ "Speaker 298: 5115\n",
+ "Speaker 299: 7278\n",
+ "Speaker 300: 7730\n",
+ "Speaker 301: 5975\n",
+ "Speaker 302: 5985\n",
+ "Speaker 303: 1283\n",
+ "Speaker 304: 6492\n",
+ "Speaker 305: 5192\n",
+ "Speaker 306: 5463\n",
+ "Speaker 307: 8063\n",
+ "Speaker 308: 3025\n",
+ "Speaker 309: 5604\n",
+ "Speaker 310: 984\n",
+ "Speaker 311: 1241\n",
+ "Speaker 312: 1417\n",
+ "Speaker 313: 4731\n",
+ "Speaker 314: 8080\n",
+ "Speaker 315: 5672\n",
+ "Speaker 316: 3330\n",
+ "Speaker 317: 7286\n",
+ "Speaker 318: 1052\n",
+ "Speaker 319: 1502\n",
+ "Speaker 320: 4098\n",
+ "Speaker 321: 6925\n",
+ "Speaker 322: 4592\n",
+ "Speaker 323: 596\n",
+ "Speaker 324: 1448\n",
+ "Speaker 325: 6188\n",
+ "Speaker 326: 1841\n",
+ "Speaker 327: 4490\n",
+ "Speaker 328: 5157\n",
+ "Speaker 329: 3967\n",
+ "Speaker 330: 3994\n",
+ "Speaker 331: 1222\n",
+ "Speaker 332: 3228\n",
+ "Speaker 333: 254\n",
+ "Speaker 334: 60\n",
+ "Speaker 335: 3361\n",
+ "Speaker 336: 454\n",
+ "Speaker 337: 7837\n",
+ "Speaker 338: 8465\n",
+ "Speaker 339: 1093\n",
+ "Speaker 340: 2787\n",
+ "Speaker 341: 8097\n",
+ "Speaker 342: 7297\n",
+ "Speaker 343: 711\n",
+ "Speaker 344: 598\n",
+ "Speaker 345: 6330\n",
+ "Speaker 346: 8011\n",
+ "Speaker 347: 1322\n",
+ "Speaker 348: 6494\n",
+ "Speaker 349: 4425\n",
+ "Speaker 350: 374\n",
+ "Speaker 351: 1737\n",
+ "Speaker 352: 216\n",
+ "Speaker 353: 7061\n",
+ "Speaker 354: 2391\n",
+ "Speaker 355: 4051\n",
+ "Speaker 356: 2272\n",
+ "Speaker 357: 6497\n",
+ "Speaker 358: 2499\n",
+ "Speaker 359: 8108\n",
+ "Speaker 360: 7794\n",
+ "Speaker 361: 1349\n",
+ "Speaker 362: 6446\n",
+ "Speaker 363: 3118\n",
+ "Speaker 364: 8468\n",
+ "Speaker 365: 7051\n",
+ "Speaker 366: 2741\n",
+ "Speaker 367: 8705\n",
+ "Speaker 368: 2039\n",
+ "Speaker 369: 7802\n",
+ "Speaker 370: 5750\n",
+ "Speaker 371: 2285\n",
+ "Speaker 372: 4898\n",
+ "Speaker 373: 5606\n",
+ "Speaker 374: 5570\n",
+ "Speaker 375: 1898\n",
+ "Speaker 376: 2827\n",
+ "Speaker 377: 4434\n",
+ "Speaker 378: 3294\n",
+ "Speaker 379: 7825\n",
+ "Speaker 380: 2113\n",
+ "Speaker 381: 8266\n",
+ "Speaker 382: 322\n",
+ "Speaker 383: 6927\n",
+ "Speaker 384: 5652\n",
+ "Speaker 385: 6269\n",
+ "Speaker 386: 887\n",
+ "Speaker 387: 5133\n",
+ "Speaker 388: 4807\n",
+ "Speaker 389: 4733\n",
+ "Speaker 390: 7067\n",
+ "Speaker 391: 6531\n",
+ "Speaker 392: 8008\n",
+ "Speaker 393: 2512\n",
+ "Speaker 394: 1987\n",
+ "Speaker 395: 1195\n",
+ "Speaker 396: 7335\n",
+ "Speaker 397: 8758\n",
+ "Speaker 398: 2691\n",
+ "Speaker 399: 1061\n",
+ "Speaker 400: 5519\n",
+ "Speaker 401: 4680\n",
+ "Speaker 402: 3546\n",
+ "Speaker 403: 8419\n",
+ "Speaker 404: 1958\n",
+ "Speaker 405: 3289\n",
+ "Speaker 406: 5918\n",
+ "Speaker 407: 8887\n",
+ "Speaker 408: 8575\n",
+ "Speaker 409: 210\n",
+ "Speaker 410: 6965\n",
+ "Speaker 411: 258\n",
+ "Speaker 412: 2010\n",
+ "Speaker 413: 1066\n",
+ "Speaker 414: 7926\n",
+ "Speaker 415: 2401\n",
+ "Speaker 416: 3258\n",
+ "Speaker 417: 8742\n",
+ "Speaker 418: 1050\n",
+ "Speaker 419: 548\n",
+ "Speaker 420: 1027\n",
+ "Speaker 421: 8643\n",
+ "Speaker 422: 5513\n",
+ "Speaker 423: 3914\n",
+ "Speaker 424: 1053\n",
+ "Speaker 425: 7511\n",
+ "Speaker 426: 5290\n",
+ "Speaker 427: 4013\n",
+ "Speaker 428: 543\n",
+ "Speaker 429: 369\n",
+ "Speaker 430: 2404\n",
+ "Speaker 431: 4629\n",
+ "Speaker 432: 481\n",
+ "Speaker 433: 625\n",
+ "Speaker 434: 1472\n",
+ "Speaker 435: 7145\n",
+ "Speaker 436: 426\n",
+ "Speaker 437: 7647\n",
+ "Speaker 438: 2397\n",
+ "Speaker 439: 8464\n",
+ "Speaker 440: 6686\n",
+ "Speaker 441: 8222\n",
+ "Speaker 442: 2002\n",
+ "Speaker 443: 7437\n",
+ "Speaker 444: 1547\n",
+ "Speaker 445: 7733\n",
+ "Speaker 446: 8527\n",
+ "Speaker 447: 8194\n",
+ "Speaker 448: 911\n",
+ "Speaker 449: 3357\n",
+ "Speaker 450: 227\n",
+ "Speaker 451: 781\n",
+ "Speaker 452: 122\n",
+ "Speaker 453: 5635\n",
+ "Speaker 454: 362\n",
+ "Speaker 455: 353\n",
+ "Speaker 456: 7962\n",
+ "Speaker 457: 2240\n",
+ "Speaker 458: 8605\n",
+ "Speaker 459: 4044\n",
+ "Speaker 460: 7959\n",
+ "Speaker 461: 3703\n",
+ "Speaker 462: 311\n",
+ "Speaker 463: 2289\n",
+ "Speaker 464: 7704\n",
+ "Speaker 465: 4800\n",
+ "Speaker 466: 2411\n",
+ "Speaker 467: 5724\n",
+ "Speaker 468: 2517\n",
+ "Speaker 469: 606\n",
+ "Speaker 470: 1789\n",
+ "Speaker 471: 5206\n",
+ "Speaker 472: 6139\n",
+ "Speaker 473: 7783\n",
+ "Speaker 474: 1806\n",
+ "Speaker 475: 1028\n",
+ "Speaker 476: 4860\n",
+ "Speaker 477: 4859\n",
+ "Speaker 478: 6006\n",
+ "Speaker 479: 6157\n",
+ "Speaker 480: 2960\n",
+ "Speaker 481: 2092\n",
+ "Speaker 482: 8687\n",
+ "Speaker 483: 968\n",
+ "Speaker 484: 6060\n",
+ "Speaker 485: 8142\n",
+ "Speaker 486: 1943\n",
+ "Speaker 487: 6352\n",
+ "Speaker 488: 79\n",
+ "Speaker 489: 459\n",
+ "Speaker 490: 2364\n",
+ "Speaker 491: 3259\n",
+ "Speaker 492: 6288\n",
+ "Speaker 493: 3235\n",
+ "Speaker 494: 2269\n",
+ "Speaker 495: 1513\n",
+ "Speaker 496: 6341\n",
+ "Speaker 497: 1769\n",
+ "Speaker 498: 8474\n",
+ "Speaker 499: 534\n",
+ "Speaker 500: 1849\n",
+ "Speaker 501: 7294\n",
+ "Speaker 502: 5401\n",
+ "Speaker 503: 6865\n",
+ "Speaker 504: 8591\n",
+ "Speaker 505: 1859\n",
+ "Speaker 506: 56\n",
+ "Speaker 507: 101\n",
+ "Speaker 508: 4813\n",
+ "Speaker 509: 458\n",
+ "Speaker 510: 1498\n",
+ "Speaker 511: 1121\n",
+ "Speaker 512: 8404\n",
+ "Speaker 513: 1323\n",
+ "Speaker 514: 4734\n",
+ "Speaker 515: 6385\n",
+ "Speaker 516: 7874\n",
+ "Speaker 517: 1734\n",
+ "Speaker 518: 7981\n",
+ "Speaker 519: 5239\n",
+ "Speaker 520: 6160\n",
+ "Speaker 521: 201\n",
+ "Speaker 522: 7754\n",
+ "Speaker 523: 2204\n",
+ "Speaker 524: 986\n",
+ "Speaker 525: 2785\n",
+ "Speaker 526: 1018\n",
+ "Speaker 527: 4054\n",
+ "Speaker 528: 3307\n",
+ "Speaker 529: 5339\n",
+ "Speaker 530: 298\n",
+ "Speaker 531: 1296\n",
+ "Speaker 532: 4195\n",
+ "Speaker 533: 1054\n",
+ "Speaker 534: 2074\n",
+ "Speaker 535: 815\n",
+ "Speaker 536: 7665\n",
+ "Speaker 537: 594\n",
+ "Speaker 538: 119\n",
+ "Speaker 539: 8545\n",
+ "Speaker 540: 5261\n",
+ "Speaker 541: 2696\n",
+ "Speaker 542: 1944\n",
+ "Speaker 543: 7949\n",
+ "Speaker 544: 8152\n",
+ "Speaker 545: 820\n",
+ "Speaker 546: 7095\n",
+ "Speaker 547: 2159\n",
+ "Speaker 548: 4289\n",
+ "Speaker 549: 6014\n",
+ "Speaker 550: 7460\n",
+ "Speaker 551: 150\n",
+ "Speaker 552: 2893\n",
+ "Speaker 553: 2769\n",
+ "Speaker 554: 8479\n",
+ "Speaker 555: 8747\n",
+ "Speaker 556: 5789\n",
+ "Speaker 557: 6082\n",
+ "Speaker 558: 1641\n",
+ "Speaker 559: 3825\n",
+ "Speaker 560: 6119\n",
+ "Speaker 561: 7867\n",
+ "Speaker 562: 318\n",
+ "Speaker 563: 39\n",
+ "Speaker 564: 4837\n",
+ "Speaker 565: 7868\n",
+ "Speaker 566: 7498\n",
+ "Speaker 567: 2053\n",
+ "Speaker 568: 14\n",
+ "Speaker 569: 8820\n",
+ "Speaker 570: 7313\n",
+ "Speaker 571: 7383\n",
+ "Speaker 572: 1553\n",
+ "Speaker 573: 5727\n",
+ "Speaker 574: 2494\n",
+ "Speaker 575: 1079\n",
+ "Speaker 576: 2592\n",
+ "Speaker 577: 1678\n",
+ "Speaker 578: 2589\n",
+ "Speaker 579: 696\n",
+ "Speaker 580: 3224\n",
+ "Speaker 581: 4719\n",
+ "Speaker 582: 3972\n",
+ "Speaker 583: 731\n",
+ "Speaker 584: 1290\n",
+ "Speaker 585: 2319\n",
+ "Speaker 586: 1933\n",
+ "Speaker 587: 7126\n",
+ "Speaker 588: 803\n",
+ "Speaker 589: 4057\n",
+ "Speaker 590: 2628\n",
+ "Speaker 591: 2436\n",
+ "Speaker 592: 6788\n",
+ "Speaker 593: 3379\n",
+ "Speaker 594: 948\n",
+ "Speaker 595: 5712\n",
+ "Speaker 596: 5242\n",
+ "Speaker 597: 7140\n",
+ "Speaker 598: 7078\n",
+ "Speaker 599: 5655\n",
+ "Speaker 600: 4957\n",
+ "Speaker 601: 1607\n",
+ "Speaker 602: 78\n",
+ "Speaker 603: 2652\n",
+ "Speaker 604: 5984\n",
+ "Speaker 605: 953\n",
+ "Speaker 606: 7169\n",
+ "Speaker 607: 500\n",
+ "Speaker 608: 5093\n",
+ "Speaker 609: 8300\n",
+ "Speaker 610: 3171\n",
+ "Speaker 611: 8498\n",
+ "Speaker 612: 5561\n",
+ "Speaker 613: 7538\n",
+ "Speaker 614: 7789\n",
+ "Speaker 615: 451\n",
+ "Speaker 616: 4111\n",
+ "Speaker 617: 3180\n",
+ "Speaker 618: 6378\n",
+ "Speaker 619: 2989\n",
+ "Speaker 620: 6574\n",
+ "Speaker 621: 4406\n",
+ "Speaker 622: 208\n",
+ "Speaker 623: 248\n",
+ "Speaker 624: 274\n",
+ "Speaker 625: 3157\n",
+ "Speaker 626: 6836\n",
+ "Speaker 627: 5333\n",
+ "Speaker 628: 7445\n",
+ "Speaker 629: 5456\n",
+ "Speaker 630: 3654\n",
+ "Speaker 631: 5246\n",
+ "Speaker 632: 6099\n",
+ "Speaker 633: 4297\n",
+ "Speaker 634: 7688\n",
+ "Speaker 635: 839\n",
+ "Speaker 636: 1571\n",
+ "Speaker 637: 1390\n",
+ "Speaker 638: 2764\n",
+ "Speaker 639: 7000\n",
+ "Speaker 640: 2294\n",
+ "Speaker 641: 289\n",
+ "Speaker 642: 3922\n",
+ "Speaker 643: 8797\n",
+ "Speaker 644: 6518\n",
+ "Speaker 645: 8630\n",
+ "Speaker 646: 2929\n",
+ "Speaker 647: 7967\n",
+ "Speaker 648: 6286\n",
+ "Speaker 649: 6620\n",
+ "Speaker 650: 1379\n",
+ "Speaker 651: 405\n",
+ "Speaker 652: 8329\n",
+ "Speaker 653: 8838\n",
+ "Speaker 654: 3493\n",
+ "Speaker 655: 6167\n",
+ "Speaker 656: 7241\n",
+ "Speaker 657: 6308\n",
+ "Speaker 658: 4039\n",
+ "Speaker 659: 3982\n",
+ "Speaker 660: 770\n",
+ "Speaker 661: 8324\n",
+ "Speaker 662: 7777\n",
+ "Speaker 663: 2299\n",
+ "Speaker 664: 3446\n",
+ "Speaker 665: 2004\n",
+ "Speaker 666: 6359\n",
+ "Speaker 667: 6818\n",
+ "Speaker 668: 203\n",
+ "Speaker 669: 8609\n",
+ "Speaker 670: 1777\n",
+ "Speaker 671: 2007\n",
+ "Speaker 672: 3185\n",
+ "Speaker 673: 9022\n",
+ "Speaker 674: 3328\n",
+ "Speaker 675: 6763\n",
+ "Speaker 676: 6918\n",
+ "Speaker 677: 637\n",
+ "Speaker 678: 5054\n",
+ "Speaker 679: 4358\n",
+ "Speaker 680: 2758\n",
+ "Speaker 681: 3105\n",
+ "Speaker 682: 5400\n",
+ "Speaker 683: 5337\n",
+ "Speaker 684: 2254\n",
+ "Speaker 685: 3242\n",
+ "Speaker 686: 8176\n",
+ "Speaker 687: 19\n",
+ "Speaker 688: 4481\n",
+ "Speaker 689: 4899\n",
+ "Speaker 690: 1460\n",
+ "Speaker 691: 7657\n",
+ "Speaker 692: 4640\n",
+ "Speaker 693: 6696\n",
+ "Speaker 694: 7402\n",
+ "Speaker 695: 920\n",
+ "Speaker 696: 1311\n",
+ "Speaker 697: 83\n",
+ "Speaker 698: 225\n",
+ "Speaker 699: 224\n",
+ "Speaker 700: 6426\n",
+ "Speaker 701: 7910\n",
+ "Speaker 702: 8848\n",
+ "Speaker 703: 7247\n",
+ "Speaker 704: 408\n",
+ "Speaker 705: 8825\n",
+ "Speaker 706: 2481\n",
+ "Speaker 707: 7938\n",
+ "Speaker 708: 307\n",
+ "Speaker 709: 475\n",
+ "Speaker 710: 4598\n",
+ "Speaker 711: 3119\n",
+ "Speaker 712: 1081\n",
+ "Speaker 713: 6690\n",
+ "Speaker 714: 2012\n",
+ "Speaker 715: 1594\n",
+ "Speaker 716: 192\n",
+ "Speaker 717: 115\n",
+ "Speaker 718: 8088\n",
+ "Speaker 719: 81\n",
+ "Speaker 720: 8095\n",
+ "Speaker 721: 1779\n",
+ "Speaker 722: 5684\n",
+ "Speaker 723: 6510\n",
+ "Speaker 724: 6032\n",
+ "Speaker 725: 3728\n",
+ "Speaker 726: 1116\n",
+ "Speaker 727: 8138\n",
+ "Speaker 728: 5808\n",
+ "Speaker 729: 4397\n",
+ "Speaker 730: 2368\n",
+ "Speaker 731: 5968\n",
+ "Speaker 732: 5588\n",
+ "Speaker 733: 7148\n",
+ "Speaker 734: 6981\n",
+ "Speaker 735: 6395\n",
+ "Speaker 736: 1313\n",
+ "Speaker 737: 200\n",
+ "Speaker 738: 7525\n",
+ "Speaker 739: 8075\n",
+ "Speaker 740: 1212\n",
+ "Speaker 741: 3835\n",
+ "Speaker 742: 1100\n",
+ "Speaker 743: 7314\n",
+ "Speaker 744: 8459\n",
+ "Speaker 745: 7342\n",
+ "Speaker 746: 5147\n",
+ "Speaker 747: 6181\n",
+ "Speaker 748: 6037\n",
+ "Speaker 749: 2473\n",
+ "Speaker 750: 2971\n",
+ "Speaker 751: 2156\n",
+ "Speaker 752: 7705\n",
+ "Speaker 753: 6937\n",
+ "Speaker 754: 22\n",
+ "Speaker 755: 666\n",
+ "Speaker 756: 3945\n",
+ "Speaker 757: 1335\n",
+ "Speaker 758: 5618\n",
+ "Speaker 759: 559\n",
+ "Speaker 760: 340\n",
+ "Speaker 761: 7828\n",
+ "Speaker 762: 7229\n",
+ "Speaker 763: 1165\n",
+ "Speaker 764: 8225\n",
+ "Speaker 765: 2843\n",
+ "Speaker 766: 1536\n",
+ "Speaker 767: 8573\n",
+ "Speaker 768: 3927\n",
+ "Speaker 769: 8875\n",
+ "Speaker 770: 636\n",
+ "Speaker 771: 597\n",
+ "Speaker 772: 4356\n",
+ "Speaker 773: 176\n",
+ "Speaker 774: 434\n",
+ "Speaker 775: 1343\n",
+ "Speaker 776: 332\n",
+ "Speaker 777: 4260\n",
+ "Speaker 778: 2775\n",
+ "Speaker 779: 17\n",
+ "Speaker 780: 3905\n",
+ "Speaker 781: 7717\n",
+ "Speaker 782: 198\n",
+ "Speaker 783: 6529\n",
+ "Speaker 784: 8580\n",
+ "Speaker 785: 1885\n",
+ "Speaker 786: 7932\n",
+ "Speaker 787: 5778\n",
+ "Speaker 788: 7518\n",
+ "Speaker 789: 4519\n",
+ "Speaker 790: 3792\n",
+ "Speaker 791: 5029\n",
+ "Speaker 792: 3857\n",
+ "Speaker 793: 949\n",
+ "Speaker 794: 8421\n",
+ "Speaker 795: 1455\n",
+ "Speaker 796: 5717\n",
+ "Speaker 797: 3781\n",
+ "Speaker 798: 7134\n",
+ "Speaker 799: 7732\n",
+ "Speaker 800: 576\n",
+ "Speaker 801: 8226\n",
+ "Speaker 802: 7226\n",
+ "Speaker 803: 1098\n",
+ "Speaker 804: 7780\n",
+ "Speaker 805: 5723\n",
+ "Speaker 806: 7553\n",
+ "Speaker 807: 5740\n",
+ "Speaker 808: 1992\n",
+ "Speaker 809: 4441\n",
+ "Speaker 810: 28\n",
+ "Speaker 811: 7302\n",
+ "Speaker 812: 4214\n",
+ "Speaker 813: 7957\n",
+ "Speaker 814: 1363\n",
+ "Speaker 815: 6098\n",
+ "Speaker 816: 4848\n",
+ "Speaker 817: 1365\n",
+ "Speaker 818: 2093\n",
+ "Speaker 819: 1265\n",
+ "Speaker 820: 1578\n",
+ "Speaker 821: 4278\n",
+ "Speaker 822: 3816\n",
+ "Speaker 823: 1752\n",
+ "Speaker 824: 7495\n",
+ "Speaker 825: 1183\n",
+ "Speaker 826: 1645\n",
+ "Speaker 827: 698\n",
+ "Speaker 828: 2060\n",
+ "Speaker 829: 7318\n",
+ "Speaker 830: 112\n",
+ "Speaker 831: 4088\n",
+ "Speaker 832: 7859\n",
+ "Speaker 833: 7447\n",
+ "Speaker 834: 3009\n",
+ "Speaker 835: 246\n",
+ "Speaker 836: 1754\n",
+ "Speaker 837: 480\n",
+ "Speaker 838: 403\n",
+ "Speaker 839: 3368\n",
+ "Speaker 840: 446\n",
+ "Speaker 841: 2774\n",
+ "Speaker 842: 2498\n",
+ "Speaker 843: 3540\n",
+ "Speaker 844: 6300\n",
+ "Speaker 845: 6458\n",
+ "Speaker 846: 8401\n",
+ "Speaker 847: 2577\n",
+ "Speaker 848: 7635\n",
+ "Speaker 849: 1040\n",
+ "Speaker 850: 6215\n",
+ "Speaker 851: 6727\n",
+ "Speaker 852: 6406\n",
+ "Speaker 853: 2853\n",
+ "Speaker 854: 6317\n",
+ "Speaker 855: 7113\n",
+ "Speaker 856: 8425\n",
+ "Speaker 857: 6694\n",
+ "Speaker 858: 882\n",
+ "Speaker 859: 100\n",
+ "Speaker 860: 6454\n",
+ "Speaker 861: 3584\n",
+ "Speaker 862: 2817\n",
+ "Speaker 863: 667\n",
+ "Speaker 864: 2229\n",
+ "Speaker 865: 3851\n",
+ "Speaker 866: 4133\n",
+ "Speaker 867: 5656\n",
+ "Speaker 868: 278\n",
+ "Speaker 869: 1535\n",
+ "Speaker 870: 1259\n",
+ "Speaker 871: 7128\n",
+ "Speaker 872: 296\n",
+ "Speaker 873: 2061\n",
+ "Speaker 874: 5393\n",
+ "Speaker 875: 3221\n",
+ "Speaker 876: 30\n",
+ "Speaker 877: 188\n",
+ "Speaker 878: 1445\n",
+ "Speaker 879: 2393\n",
+ "Speaker 880: 6956\n",
+ "Speaker 881: 5163\n",
+ "Speaker 882: 549\n",
+ "Speaker 883: 2518\n",
+ "Speaker 884: 829\n",
+ "Speaker 885: 6567\n",
+ "Speaker 886: 8592\n",
+ "Speaker 887: 303\n",
+ "Speaker 888: 240\n",
+ "Speaker 889: 3380\n",
+ "Speaker 890: 7481\n",
+ "Speaker 891: 5883\n",
+ "Speaker 892: 3214\n",
+ "Speaker 893: 8855\n",
+ "Speaker 894: 3947\n",
+ "Speaker 895: 398\n",
+ "Speaker 896: 55\n",
+ "Speaker 897: 8722\n",
+ "Speaker 898: 8713\n",
+ "Speaker 899: 5868\n",
+ "Speaker 900: 979\n",
+ "Speaker 901: 209\n",
+ "Speaker 902: 2673\n",
+ "Speaker 903: 3340\n",
+ "Speaker 904: 126\n",
+ "Speaker 905: 612\n",
+ "Speaker 906: 580\n",
+ "Speaker 907: 1182\n",
+ "Speaker 908: 664\n",
+ "Speaker 909: 1246\n",
+ "Speaker 910: 5678\n",
+ "Speaker 911: 1487\n",
+ "Speaker 912: 9026\n",
+ "Speaker 913: 2196\n",
+ "Speaker 914: 6235\n",
+ "Speaker 915: 2952\n",
+ "Speaker 916: 3733\n",
+ "Speaker 917: 2790\n",
+ "Speaker 918: 6339\n",
+ "Speaker 919: 5489\n",
+ "Speaker 920: 7505\n",
+ "Speaker 921: 7190\n",
+ "Speaker 922: 7059\n",
+ "Speaker 923: 175\n",
+ "Speaker 924: 5390\n",
+ "Speaker 925: 6373\n",
+ "Speaker 926: 6895\n",
+ "Speaker 927: 4148\n",
+ "Speaker 928: 93\n",
+ "Speaker 929: 339\n",
+ "Speaker 930: 8113\n",
+ "Speaker 931: 7478\n",
+ "Speaker 932: 439\n",
+ "Speaker 933: 6209\n",
+ "Speaker 934: 6553\n",
+ "Speaker 935: 5660\n",
+ "Speaker 936: 716\n",
+ "Speaker 937: 6643\n",
+ "Speaker 938: 4788\n",
+ "Speaker 939: 114\n",
+ "Speaker 940: 492\n",
+ "Speaker 941: 5909\n",
+ "Speaker 942: 1482\n",
+ "Speaker 943: 38\n",
+ "Speaker 944: 5448\n",
+ "Speaker 945: 98\n",
+ "Speaker 946: 159\n",
+ "Speaker 947: 718\n",
+ "Speaker 948: 922\n",
+ "Speaker 949: 7258\n",
+ "Speaker 950: 6388\n",
+ "Speaker 951: 7178\n",
+ "Speaker 952: 7558\n",
+ "Speaker 953: 899\n",
+ "Speaker 954: 373\n",
+ "Speaker 955: 87\n",
+ "Speaker 956: 3526\n",
+ "Speaker 957: 3864\n",
+ "Speaker 958: 3370\n",
+ "Speaker 959: 1826\n",
+ "Speaker 960: 7739\n",
+ "Speaker 961: 6575\n",
+ "Speaker 962: 501\n",
+ "Speaker 963: 909\n",
+ "Speaker 964: 3112\n",
+ "Speaker 965: 7240\n",
+ "Speaker 966: 699\n",
+ "Speaker 967: 4595\n",
+ "Speaker 968: 5746\n",
+ "Speaker 969: 4856\n",
+ "Speaker 970: 1629\n",
+ "Speaker 971: 707\n",
+ "Speaker 972: 589\n",
+ "Speaker 973: 1638\n",
+ "Speaker 974: 830\n",
+ "Speaker 975: 3989\n",
+ "Speaker 976: 8066\n",
+ "Speaker 977: 7416\n",
+ "Speaker 978: 70\n",
+ "Speaker 979: 6993\n",
+ "Speaker 980: 3790\n",
+ "Speaker 981: 3490\n",
+ "Speaker 982: 8684\n",
+ "Speaker 983: 166\n",
+ "Speaker 984: 6505\n",
+ "Speaker 985: 2911\n",
+ "Speaker 986: 2127\n",
+ "Speaker 987: 2146\n",
+ "Speaker 988: 3664\n",
+ "Speaker 989: 7995\n",
+ "Speaker 990: 8725\n",
+ "Speaker 991: 4340\n",
+ "Speaker 992: 8006\n",
+ "Speaker 993: 4973\n",
+ "Speaker 994: 2910\n",
+ "Speaker 995: 497\n",
+ "Speaker 996: 5876\n",
+ "Speaker 997: 6233\n",
+ "Speaker 998: 3537\n",
+ "Speaker 999: 1413\n",
+ "Speaker 1000: 5189\n",
+ "Speaker 1001: 204\n",
+ "Speaker 1002: 836\n",
+ "Speaker 1003: 2618\n",
+ "Speaker 1004: 7276\n",
+ "Speaker 1005: 1264\n",
+ "Speaker 1006: 2045\n",
+ "Speaker 1007: 3215\n",
+ "Speaker 1008: 6555\n",
+ "Speaker 1009: 196\n",
+ "Speaker 1010: 6848\n",
+ "Speaker 1011: 1160\n",
+ "Speaker 1012: 8771\n",
+ "Speaker 1013: 4744\n",
+ "Speaker 1014: 6637\n",
+ "Speaker 1015: 1463\n",
+ "Speaker 1016: 3615\n",
+ "Speaker 1017: 5776\n",
+ "Speaker 1018: 26\n",
+ "Speaker 1019: 7339\n",
+ "Speaker 1020: 249\n",
+ "Speaker 1021: 1034\n",
+ "Speaker 1022: 1743\n",
+ "Speaker 1023: 207\n",
+ "Speaker 1024: 831\n",
+ "Speaker 1025: 4335\n",
+ "Speaker 1026: 7720\n",
+ "Speaker 1027: 2194\n",
+ "Speaker 1028: 688\n",
+ "Speaker 1029: 8619\n",
+ "Speaker 1030: 8718\n",
+ "Speaker 1031: 581\n",
+ "Speaker 1032: 835\n",
+ "Speaker 1033: 7881\n",
+ "Speaker 1034: 3607\n",
+ "Speaker 1035: 7933\n",
+ "Speaker 1036: 708\n",
+ "Speaker 1037: 7188\n",
+ "Speaker 1038: 4246\n",
+ "Speaker 1039: 1926\n",
+ "Speaker 1040: 7766\n",
+ "Speaker 1041: 6538\n",
+ "Speaker 1042: 2149\n",
+ "Speaker 1043: 7434\n",
+ "Speaker 1044: 3230\n",
+ "Speaker 1045: 3983\n",
+ "Speaker 1046: 4152\n",
+ "Speaker 1047: 1336\n",
+ "Speaker 1048: 2388\n",
+ "Speaker 1049: 5139\n",
+ "Speaker 1050: 1473\n",
+ "Speaker 1051: 868\n",
+ "Speaker 1052: 2709\n",
+ "Speaker 1053: 2674\n",
+ "Speaker 1054: 2570\n",
+ "Speaker 1055: 211\n",
+ "Speaker 1056: 4137\n",
+ "Speaker 1057: 472\n",
+ "Speaker 1058: 5022\n",
+ "Speaker 1059: 1263\n",
+ "Speaker 1060: 1801\n",
+ "Speaker 1061: 1963\n",
+ "Speaker 1062: 5386\n",
+ "Speaker 1063: 3274\n",
+ "Speaker 1064: 3070\n",
+ "Speaker 1065: 3436\n",
+ "Speaker 1066: 8347\n",
+ "Speaker 1067: 7245\n",
+ "Speaker 1068: 3240\n",
+ "Speaker 1069: 7555\n",
+ "Speaker 1070: 6081\n",
+ "Speaker 1071: 5914\n",
+ "Speaker 1072: 1827\n",
+ "Speaker 1073: 8238\n",
+ "Speaker 1074: 2256\n",
+ "Speaker 1075: 7139\n",
+ "Speaker 1076: 1668\n",
+ "Speaker 1077: 4108\n",
+ "Speaker 1078: 7809\n",
+ "Speaker 1079: 2384\n",
+ "Speaker 1080: 4806\n",
+ "Speaker 1081: 3830\n",
+ "Speaker 1082: 3889\n",
+ "Speaker 1083: 217\n",
+ "Speaker 1084: 3645\n",
+ "Speaker 1085: 205\n",
+ "Speaker 1086: 6476\n",
+ "Speaker 1087: 4590\n",
+ "Speaker 1088: 6563\n",
+ "Speaker 1089: 2416\n",
+ "Speaker 1090: 8183\n",
+ "Speaker 1091: 8975\n",
+ "Speaker 1092: 4257\n",
+ "Speaker 1093: 1425\n",
+ "Speaker 1094: 8014\n",
+ "Speaker 1095: 5190\n",
+ "Speaker 1096: 4160\n",
+ "Speaker 1097: 1069\n",
+ "Speaker 1098: 1923\n",
+ "Speaker 1099: 4110\n",
+ "Speaker 1100: 1235\n",
+ "Speaker 1101: 5049\n",
+ "Speaker 1102: 479\n",
+ "Speaker 1103: 2573\n",
+ "Speaker 1104: 4331\n",
+ "Speaker 1105: 6828\n",
+ "Speaker 1106: 380\n",
+ "Speaker 1107: 7395\n",
+ "Speaker 1108: 3638\n",
+ "Speaker 1109: 6104\n",
+ "Speaker 1110: 4681\n",
+ "Speaker 1111: 8824\n",
+ "Speaker 1112: 2085\n",
+ "Speaker 1113: 6038\n",
+ "Speaker 1114: 7475\n",
+ "Speaker 1115: 8490\n",
+ "Speaker 1116: 3486\n",
+ "Speaker 1117: 6258\n",
+ "Speaker 1118: 2999\n",
+ "Speaker 1119: 8228\n",
+ "Speaker 1120: 1387\n",
+ "Speaker 1121: 8028\n",
+ "Speaker 1122: 1060\n",
+ "Speaker 1123: 3869\n",
+ "Speaker 1124: 8410\n",
+ "Speaker 1125: 2532\n",
+ "Speaker 1126: 2562\n",
+ "Speaker 1127: 1383\n",
+ "Speaker 1128: 7120\n",
+ "Speaker 1129: 1974\n",
+ "Speaker 1130: 1603\n",
+ "Speaker 1131: 8388\n",
+ "Speaker 1132: 8506\n",
+ "Speaker 1133: 337\n",
+ "Speaker 1134: 4222\n",
+ "Speaker 1135: 6371\n",
+ "Speaker 1136: 5039\n",
+ "Speaker 1137: 5867\n",
+ "Speaker 1138: 3876\n",
+ "Speaker 1139: 5293\n",
+ "Speaker 1140: 103\n",
+ "Speaker 1141: 64\n",
+ "Speaker 1142: 7800\n",
+ "Speaker 1143: 1447\n",
+ "Speaker 1144: 154\n",
+ "Speaker 1145: 460\n",
+ "Speaker 1146: 2514\n",
+ "Speaker 1147: 2816\n",
+ "Speaker 1148: 27\n",
+ "Speaker 1149: 242\n",
+ "Speaker 1150: 3738\n"
+ ]
+ }
+ ],
+ "source": [
+ "data_dir = 'LibriTTS'\n",
+ "mels_mode_dict = dict()\n",
+ "lens_dict = dict()\n",
+ "for p in phoneme_list:\n",
+ " mels_mode_dict[p] = []\n",
+ " lens_dict[p] = []\n",
+ "speakers = os.listdir(os.path.join(data_dir, 'mels'))\n",
+ "for s, speaker in enumerate(speakers):\n",
+ " print('Speaker %d: %s' % (s + 1, speaker))\n",
+ " textgrids = os.listdir(os.path.join(data_dir, 'textgrids', speaker))\n",
+ " for textgrid in textgrids:\n",
+ " t = tgt.io.read_textgrid(os.path.join(data_dir, 'textgrids', speaker, textgrid))\n",
+ " m = np.load(os.path.join(data_dir, 'mels', speaker, textgrid.replace('.TextGrid', '_mel.npy')))\n",
+ " t = t.get_tier_by_name('phones')\n",
+ " for i in range(len(t)):\n",
+ " phoneme = t[i].text\n",
+ " start_frame = int(t[i].start_time * 22050.0) // 256\n",
+ " end_frame = int(t[i].end_time * 22050.0) // 256 + 1\n",
+ " mels_mode_dict[phoneme] += [np.round(np.median(m[:, start_frame:end_frame], 1), 1)]\n",
+ " lens_dict[phoneme] += [end_frame - start_frame]\n",
+ "\n",
+ "mels_mode = dict()\n",
+ "lens = dict()\n",
+ "for p in phoneme_list:\n",
+ " mels_mode[p] = mode(np.asarray(mels_mode_dict[p]), 0).mode[0]\n",
+ " lens[p] = np.mean(np.asarray(lens_dict[p]))\n",
+ "del mels_mode_dict\n",
+ "del lens_dict"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "metadata": {
+ "collapsed": false,
+ "scrolled": true
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Speaker 1: 8057\n",
+ "Speaker 2: 4014\n",
+ "Speaker 3: 6415\n",
+ "Speaker 4: 5126\n",
+ "Speaker 5: 3723\n",
+ "Speaker 6: 587\n",
+ "Speaker 7: 8534\n",
+ "Speaker 8: 5322\n",
+ "Speaker 9: 2238\n",
+ "Speaker 10: 1401\n",
+ "Speaker 11: 4427\n",
+ "Speaker 12: 1705\n",
+ "Speaker 13: 561\n",
+ "Speaker 14: 2992\n",
+ "Speaker 15: 8776\n",
+ "Speaker 16: 54\n",
+ "Speaker 17: 806\n",
+ "Speaker 18: 1970\n",
+ "Speaker 19: 302\n",
+ "Speaker 20: 6272\n",
+ "Speaker 21: 1289\n",
+ "Speaker 22: 3807\n",
+ "Speaker 23: 6075\n",
+ "Speaker 24: 329\n",
+ "Speaker 25: 3483\n",
+ "Speaker 26: 1914\n",
+ "Speaker 27: 6499\n",
+ "Speaker 28: 7117\n",
+ "Speaker 29: 5703\n",
+ "Speaker 30: 3032\n",
+ "Speaker 31: 3001\n",
+ "Speaker 32: 5304\n",
+ "Speaker 33: 5012\n",
+ "Speaker 34: 8786\n",
+ "Speaker 35: 3187\n",
+ "Speaker 36: 5935\n",
+ "Speaker 37: 1088\n",
+ "Speaker 38: 783\n",
+ "Speaker 39: 5186\n",
+ "Speaker 40: 7994\n",
+ "Speaker 41: 6078\n",
+ "Speaker 42: 3168\n",
+ "Speaker 43: 6550\n",
+ "Speaker 44: 6701\n",
+ "Speaker 45: 4926\n",
+ "Speaker 46: 1355\n",
+ "Speaker 47: 1337\n",
+ "Speaker 48: 2582\n",
+ "Speaker 49: 8119\n",
+ "Speaker 50: 5767\n",
+ "Speaker 51: 1112\n",
+ "Speaker 52: 6054\n",
+ "Speaker 53: 5583\n",
+ "Speaker 54: 6120\n",
+ "Speaker 55: 4290\n",
+ "Speaker 56: 3440\n",
+ "Speaker 57: 2230\n",
+ "Speaker 58: 5802\n",
+ "Speaker 59: 3448\n",
+ "Speaker 60: 730\n",
+ "Speaker 61: 7011\n",
+ "Speaker 62: 40\n",
+ "Speaker 63: 1845\n",
+ "Speaker 64: 7816\n",
+ "Speaker 65: 4010\n",
+ "Speaker 66: 2823\n",
+ "Speaker 67: 511\n",
+ "Speaker 68: 229\n",
+ "Speaker 69: 5319\n",
+ "Speaker 70: 4830\n",
+ "Speaker 71: 8494\n",
+ "Speaker 72: 1509\n",
+ "Speaker 73: 7285\n",
+ "Speaker 74: 1226\n",
+ "Speaker 75: 2638\n",
+ "Speaker 76: 2920\n",
+ "Speaker 77: 7367\n",
+ "Speaker 78: 2598\n",
+ "Speaker 79: 3686\n",
+ "Speaker 80: 412\n",
+ "Speaker 81: 5538\n",
+ "Speaker 82: 663\n",
+ "Speaker 83: 6683\n",
+ "Speaker 84: 1271\n",
+ "Speaker 85: 5514\n",
+ "Speaker 86: 8699\n",
+ "Speaker 87: 7264\n",
+ "Speaker 88: 816\n",
+ "Speaker 89: 5092\n",
+ "Speaker 90: 7752\n",
+ "Speaker 91: 3008\n",
+ "Speaker 92: 5688\n",
+ "Speaker 93: 3513\n",
+ "Speaker 94: 1224\n",
+ "Speaker 95: 8312\n",
+ "Speaker 96: 8791\n",
+ "Speaker 97: 5104\n",
+ "Speaker 98: 5266\n",
+ "Speaker 99: 1392\n",
+ "Speaker 100: 3549\n",
+ "Speaker 101: 7945\n",
+ "Speaker 102: 272\n",
+ "Speaker 103: 5940\n",
+ "Speaker 104: 6437\n",
+ "Speaker 105: 2531\n",
+ "Speaker 106: 6509\n",
+ "Speaker 107: 4064\n",
+ "Speaker 108: 2167\n",
+ "Speaker 109: 3630\n",
+ "Speaker 110: 4018\n",
+ "Speaker 111: 8770\n",
+ "Speaker 112: 8163\n",
+ "Speaker 113: 5809\n",
+ "Speaker 114: 510\n",
+ "Speaker 115: 5007\n",
+ "Speaker 116: 4967\n",
+ "Speaker 117: 8396\n",
+ "Speaker 118: 359\n",
+ "Speaker 119: 5622\n",
+ "Speaker 120: 3521\n",
+ "Speaker 121: 3923\n",
+ "Speaker 122: 1382\n",
+ "Speaker 123: 1012\n",
+ "Speaker 124: 7939\n",
+ "Speaker 125: 4839\n",
+ "Speaker 126: 1175\n",
+ "Speaker 127: 2836\n",
+ "Speaker 128: 4853\n",
+ "Speaker 129: 639\n",
+ "Speaker 130: 4236\n",
+ "Speaker 131: 2654\n",
+ "Speaker 132: 3866\n",
+ "Speaker 133: 335\n",
+ "Speaker 134: 3551\n",
+ "Speaker 135: 1046\n",
+ "Speaker 136: 6147\n",
+ "Speaker 137: 157\n",
+ "Speaker 138: 3094\n",
+ "Speaker 139: 2427\n",
+ "Speaker 140: 8195\n",
+ "Speaker 141: 4238\n",
+ "Speaker 142: 4854\n",
+ "Speaker 143: 7832\n",
+ "Speaker 144: 1748\n",
+ "Speaker 145: 4586\n",
+ "Speaker 146: 7484\n",
+ "Speaker 147: 1825\n",
+ "Speaker 148: 669\n",
+ "Speaker 149: 512\n",
+ "Speaker 150: 4433\n",
+ "Speaker 151: 3374\n",
+ "Speaker 152: 6064\n",
+ "Speaker 153: 2201\n",
+ "Speaker 154: 6519\n",
+ "Speaker 155: 323\n",
+ "Speaker 156: 7515\n",
+ "Speaker 157: 1316\n",
+ "Speaker 158: 3717\n",
+ "Speaker 159: 4362\n",
+ "Speaker 160: 89\n",
+ "Speaker 161: 5810\n",
+ "Speaker 162: 8050\n",
+ "Speaker 163: 1025\n",
+ "Speaker 164: 7991\n",
+ "Speaker 165: 4495\n",
+ "Speaker 166: 3003\n",
+ "Speaker 167: 1001\n",
+ "Speaker 168: 4243\n",
+ "Speaker 169: 7069\n",
+ "Speaker 170: 593\n",
+ "Speaker 171: 1913\n",
+ "Speaker 172: 1058\n",
+ "Speaker 173: 4363\n",
+ "Speaker 174: 2056\n",
+ "Speaker 175: 4535\n",
+ "Speaker 176: 4138\n",
+ "Speaker 177: 2751\n",
+ "Speaker 178: 6367\n",
+ "Speaker 179: 6904\n",
+ "Speaker 180: 8677\n",
+ "Speaker 181: 5123\n",
+ "Speaker 182: 7520\n",
+ "Speaker 183: 6019\n",
+ "Speaker 184: 6294\n",
+ "Speaker 185: 1811\n",
+ "Speaker 186: 4226\n",
+ "Speaker 187: 6206\n",
+ "Speaker 188: 5062\n",
+ "Speaker 189: 16\n",
+ "Speaker 190: 6877\n",
+ "Speaker 191: 163\n",
+ "Speaker 192: 3114\n",
+ "Speaker 193: 7956\n",
+ "Speaker 194: 5002\n",
+ "Speaker 195: 957\n",
+ "Speaker 196: 8635\n",
+ "Speaker 197: 3977\n",
+ "Speaker 198: 3389\n",
+ "Speaker 199: 1639\n",
+ "Speaker 200: 1552\n",
+ "Speaker 201: 925\n",
+ "Speaker 202: 6115\n",
+ "Speaker 203: 2162\n",
+ "Speaker 204: 8051\n",
+ "Speaker 205: 8098\n",
+ "Speaker 206: 2581\n",
+ "Speaker 207: 4381\n",
+ "Speaker 208: 125\n",
+ "Speaker 209: 3046\n",
+ "Speaker 210: 6544\n",
+ "Speaker 211: 7594\n",
+ "Speaker 212: 2136\n",
+ "Speaker 213: 250\n",
+ "Speaker 214: 9023\n",
+ "Speaker 215: 4438\n",
+ "Speaker 216: 954\n",
+ "Speaker 217: 6189\n",
+ "Speaker 218: 7569\n",
+ "Speaker 219: 5389\n",
+ "Speaker 220: 6924\n",
+ "Speaker 221: 226\n",
+ "Speaker 222: 118\n",
+ "Speaker 223: 476\n",
+ "Speaker 224: 1556\n",
+ "Speaker 225: 4267\n",
+ "Speaker 226: 7316\n",
+ "Speaker 227: 409\n",
+ "Speaker 228: 7398\n",
+ "Speaker 229: 2182\n",
+ "Speaker 230: 8193\n",
+ "Speaker 231: 6782\n",
+ "Speaker 232: 3979\n",
+ "Speaker 233: 8879\n",
+ "Speaker 234: 90\n",
+ "Speaker 235: 850\n",
+ "Speaker 236: 4945\n",
+ "Speaker 237: 7833\n",
+ "Speaker 238: 583\n",
+ "Speaker 239: 6000\n",
+ "Speaker 240: 7030\n",
+ "Speaker 241: 1961\n",
+ "Speaker 242: 288\n",
+ "Speaker 243: 8190\n",
+ "Speaker 244: 7090\n",
+ "Speaker 245: 1867\n",
+ "Speaker 246: 2137\n",
+ "Speaker 247: 2110\n",
+ "Speaker 248: 233\n",
+ "Speaker 249: 5063\n",
+ "Speaker 250: 1776\n",
+ "Speaker 251: 7312\n",
+ "Speaker 252: 724\n",
+ "Speaker 253: 6673\n",
+ "Speaker 254: 1851\n",
+ "Speaker 255: 231\n",
+ "Speaker 256: 3082\n",
+ "Speaker 257: 8772\n",
+ "Speaker 258: 8629\n",
+ "Speaker 259: 32\n",
+ "Speaker 260: 4116\n",
+ "Speaker 261: 487\n",
+ "Speaker 262: 1649\n",
+ "Speaker 263: 1731\n",
+ "Speaker 264: 7909\n",
+ "Speaker 265: 3879\n",
+ "Speaker 266: 1903\n",
+ "Speaker 267: 8123\n",
+ "Speaker 268: 3699\n",
+ "Speaker 269: 1348\n",
+ "Speaker 270: 8118\n",
+ "Speaker 271: 2882\n",
+ "Speaker 272: 3083\n",
+ "Speaker 273: 7085\n",
+ "Speaker 274: 5154\n",
+ "Speaker 275: 1724\n",
+ "Speaker 276: 1446\n",
+ "Speaker 277: 1740\n",
+ "Speaker 278: 6080\n",
+ "Speaker 279: 6880\n",
+ "Speaker 280: 7384\n",
+ "Speaker 281: 1334\n",
+ "Speaker 282: 7540\n",
+ "Speaker 283: 5731\n",
+ "Speaker 284: 7517\n",
+ "Speaker 285: 4145\n",
+ "Speaker 286: 3852\n",
+ "Speaker 287: 4846\n",
+ "Speaker 288: 2812\n",
+ "Speaker 289: 1624\n",
+ "Speaker 290: 764\n",
+ "Speaker 291: 3482\n",
+ "Speaker 292: 2688\n",
+ "Speaker 293: 1422\n",
+ "Speaker 294: 1874\n",
+ "Speaker 295: 3072\n",
+ "Speaker 296: 7982\n",
+ "Speaker 297: 345\n",
+ "Speaker 298: 5115\n",
+ "Speaker 299: 7278\n",
+ "Speaker 300: 7730\n",
+ "Speaker 301: 5975\n",
+ "Speaker 302: 5985\n",
+ "Speaker 303: 1283\n",
+ "Speaker 304: 6492\n",
+ "Speaker 305: 5192\n",
+ "Speaker 306: 5463\n",
+ "Speaker 307: 8063\n",
+ "Speaker 308: 3025\n",
+ "Speaker 309: 5604\n",
+ "Speaker 310: 984\n",
+ "Speaker 311: 1241\n",
+ "Speaker 312: 1417\n",
+ "Speaker 313: 4731\n",
+ "Speaker 314: 8080\n",
+ "Speaker 315: 5672\n",
+ "Speaker 316: 3330\n",
+ "Speaker 317: 7286\n",
+ "Speaker 318: 1052\n",
+ "Speaker 319: 1502\n",
+ "Speaker 320: 4098\n",
+ "Speaker 321: 6925\n",
+ "Speaker 322: 4592\n",
+ "Speaker 323: 596\n",
+ "Speaker 324: 1448\n",
+ "Speaker 325: 6188\n",
+ "Speaker 326: 1841\n",
+ "Speaker 327: 4490\n",
+ "Speaker 328: 5157\n",
+ "Speaker 329: 3967\n",
+ "Speaker 330: 3994\n",
+ "Speaker 331: 1222\n",
+ "Speaker 332: 3228\n",
+ "Speaker 333: 254\n",
+ "Speaker 334: 60\n",
+ "Speaker 335: 3361\n",
+ "Speaker 336: 454\n",
+ "Speaker 337: 7837\n",
+ "Speaker 338: 8465\n",
+ "Speaker 339: 1093\n",
+ "Speaker 340: 2787\n",
+ "Speaker 341: 8097\n",
+ "Speaker 342: 7297\n",
+ "Speaker 343: 711\n",
+ "Speaker 344: 598\n",
+ "Speaker 345: 6330\n",
+ "Speaker 346: 8011\n",
+ "Speaker 347: 1322\n",
+ "Speaker 348: 6494\n",
+ "Speaker 349: 4425\n",
+ "Speaker 350: 374\n",
+ "Speaker 351: 1737\n",
+ "Speaker 352: 216\n",
+ "Speaker 353: 7061\n",
+ "Speaker 354: 2391\n",
+ "Speaker 355: 4051\n",
+ "Speaker 356: 2272\n",
+ "Speaker 357: 6497\n",
+ "Speaker 358: 2499\n",
+ "Speaker 359: 8108\n",
+ "Speaker 360: 7794\n",
+ "Speaker 361: 1349\n",
+ "Speaker 362: 6446\n",
+ "Speaker 363: 3118\n",
+ "Speaker 364: 8468\n",
+ "Speaker 365: 7051\n",
+ "Speaker 366: 2741\n",
+ "Speaker 367: 8705\n",
+ "Speaker 368: 2039\n",
+ "Speaker 369: 7802\n",
+ "Speaker 370: 5750\n",
+ "Speaker 371: 2285\n",
+ "Speaker 372: 4898\n",
+ "Speaker 373: 5606\n",
+ "Speaker 374: 5570\n",
+ "Speaker 375: 1898\n",
+ "Speaker 376: 2827\n",
+ "Speaker 377: 4434\n",
+ "Speaker 378: 3294\n",
+ "Speaker 379: 7825\n",
+ "Speaker 380: 2113\n",
+ "Speaker 381: 8266\n",
+ "Speaker 382: 322\n",
+ "Speaker 383: 6927\n",
+ "Speaker 384: 5652\n",
+ "Speaker 385: 6269\n",
+ "Speaker 386: 887\n",
+ "Speaker 387: 5133\n",
+ "Speaker 388: 4807\n",
+ "Speaker 389: 4733\n",
+ "Speaker 390: 7067\n",
+ "Speaker 391: 6531\n",
+ "Speaker 392: 8008\n",
+ "Speaker 393: 2512\n",
+ "Speaker 394: 1987\n",
+ "Speaker 395: 1195\n",
+ "Speaker 396: 7335\n",
+ "Speaker 397: 8758\n",
+ "Speaker 398: 2691\n",
+ "Speaker 399: 1061\n",
+ "Speaker 400: 5519\n",
+ "Speaker 401: 4680\n",
+ "Speaker 402: 3546\n",
+ "Speaker 403: 8419\n",
+ "Speaker 404: 1958\n",
+ "Speaker 405: 3289\n",
+ "Speaker 406: 5918\n",
+ "Speaker 407: 8887\n",
+ "Speaker 408: 8575\n",
+ "Speaker 409: 210\n",
+ "Speaker 410: 6965\n",
+ "Speaker 411: 258\n",
+ "Speaker 412: 2010\n",
+ "Speaker 413: 1066\n",
+ "Speaker 414: 7926\n",
+ "Speaker 415: 2401\n",
+ "Speaker 416: 3258\n",
+ "Speaker 417: 8742\n",
+ "Speaker 418: 1050\n",
+ "Speaker 419: 548\n",
+ "Speaker 420: 1027\n",
+ "Speaker 421: 8643\n",
+ "Speaker 422: 5513\n",
+ "Speaker 423: 3914\n",
+ "Speaker 424: 1053\n",
+ "Speaker 425: 7511\n",
+ "Speaker 426: 5290\n",
+ "Speaker 427: 4013\n",
+ "Speaker 428: 543\n",
+ "Speaker 429: 369\n",
+ "Speaker 430: 2404\n",
+ "Speaker 431: 4629\n",
+ "Speaker 432: 481\n",
+ "Speaker 433: 625\n",
+ "Speaker 434: 1472\n",
+ "Speaker 435: 7145\n",
+ "Speaker 436: 426\n",
+ "Speaker 437: 7647\n",
+ "Speaker 438: 2397\n",
+ "Speaker 439: 8464\n",
+ "Speaker 440: 6686\n",
+ "Speaker 441: 8222\n",
+ "Speaker 442: 2002\n",
+ "Speaker 443: 7437\n",
+ "Speaker 444: 1547\n",
+ "Speaker 445: 7733\n",
+ "Speaker 446: 8527\n",
+ "Speaker 447: 8194\n",
+ "Speaker 448: 911\n",
+ "Speaker 449: 3357\n",
+ "Speaker 450: 227\n",
+ "Speaker 451: 781\n",
+ "Speaker 452: 122\n",
+ "Speaker 453: 5635\n",
+ "Speaker 454: 362\n",
+ "Speaker 455: 353\n",
+ "Speaker 456: 7962\n",
+ "Speaker 457: 2240\n",
+ "Speaker 458: 8605\n",
+ "Speaker 459: 4044\n",
+ "Speaker 460: 7959\n",
+ "Speaker 461: 3703\n",
+ "Speaker 462: 311\n",
+ "Speaker 463: 2289\n",
+ "Speaker 464: 7704\n",
+ "Speaker 465: 4800\n",
+ "Speaker 466: 2411\n",
+ "Speaker 467: 5724\n",
+ "Speaker 468: 2517\n",
+ "Speaker 469: 606\n",
+ "Speaker 470: 1789\n",
+ "Speaker 471: 5206\n",
+ "Speaker 472: 6139\n",
+ "Speaker 473: 7783\n",
+ "Speaker 474: 1806\n",
+ "Speaker 475: 1028\n",
+ "Speaker 476: 4860\n",
+ "Speaker 477: 4859\n",
+ "Speaker 478: 6006\n",
+ "Speaker 479: 6157\n",
+ "Speaker 480: 2960\n",
+ "Speaker 481: 2092\n",
+ "Speaker 482: 8687\n",
+ "Speaker 483: 968\n",
+ "Speaker 484: 6060\n",
+ "Speaker 485: 8142\n",
+ "Speaker 486: 1943\n",
+ "Speaker 487: 6352\n",
+ "Speaker 488: 79\n",
+ "Speaker 489: 459\n",
+ "Speaker 490: 2364\n",
+ "Speaker 491: 3259\n",
+ "Speaker 492: 6288\n",
+ "Speaker 493: 3235\n",
+ "Speaker 494: 2269\n",
+ "Speaker 495: 1513\n",
+ "Speaker 496: 6341\n",
+ "Speaker 497: 1769\n",
+ "Speaker 498: 8474\n",
+ "Speaker 499: 534\n",
+ "Speaker 500: 1849\n",
+ "Speaker 501: 7294\n",
+ "Speaker 502: 5401\n",
+ "Speaker 503: 6865\n",
+ "Speaker 504: 8591\n",
+ "Speaker 505: 1859\n",
+ "Speaker 506: 56\n",
+ "Speaker 507: 101\n",
+ "Speaker 508: 4813\n",
+ "Speaker 509: 458\n",
+ "Speaker 510: 1498\n",
+ "Speaker 511: 1121\n",
+ "Speaker 512: 8404\n",
+ "Speaker 513: 1323\n",
+ "Speaker 514: 4734\n",
+ "Speaker 515: 6385\n",
+ "Speaker 516: 7874\n",
+ "Speaker 517: 1734\n",
+ "Speaker 518: 7981\n",
+ "Speaker 519: 5239\n",
+ "Speaker 520: 6160\n",
+ "Speaker 521: 201\n",
+ "Speaker 522: 7754\n",
+ "Speaker 523: 2204\n",
+ "Speaker 524: 986\n",
+ "Speaker 525: 2785\n",
+ "Speaker 526: 1018\n",
+ "Speaker 527: 4054\n",
+ "Speaker 528: 3307\n",
+ "Speaker 529: 5339\n",
+ "Speaker 530: 298\n",
+ "Speaker 531: 1296\n",
+ "Speaker 532: 4195\n",
+ "Speaker 533: 1054\n",
+ "Speaker 534: 2074\n",
+ "Speaker 535: 815\n",
+ "Speaker 536: 7665\n",
+ "Speaker 537: 594\n",
+ "Speaker 538: 119\n",
+ "Speaker 539: 8545\n",
+ "Speaker 540: 5261\n",
+ "Speaker 541: 2696\n",
+ "Speaker 542: 1944\n",
+ "Speaker 543: 7949\n",
+ "Speaker 544: 8152\n",
+ "Speaker 545: 820\n",
+ "Speaker 546: 7095\n",
+ "Speaker 547: 2159\n",
+ "Speaker 548: 4289\n",
+ "Speaker 549: 6014\n",
+ "Speaker 550: 7460\n",
+ "Speaker 551: 150\n",
+ "Speaker 552: 2893\n",
+ "Speaker 553: 2769\n",
+ "Speaker 554: 8479\n",
+ "Speaker 555: 8747\n",
+ "Speaker 556: 5789\n",
+ "Speaker 557: 6082\n",
+ "Speaker 558: 1641\n",
+ "Speaker 559: 3825\n",
+ "Speaker 560: 6119\n",
+ "Speaker 561: 7867\n",
+ "Speaker 562: 318\n",
+ "Speaker 563: 39\n",
+ "Speaker 564: 4837\n",
+ "Speaker 565: 7868\n",
+ "Speaker 566: 7498\n",
+ "Speaker 567: 2053\n",
+ "Speaker 568: 14\n",
+ "Speaker 569: 8820\n",
+ "Speaker 570: 7313\n",
+ "Speaker 571: 7383\n",
+ "Speaker 572: 1553\n",
+ "Speaker 573: 5727\n",
+ "Speaker 574: 2494\n",
+ "Speaker 575: 1079\n",
+ "Speaker 576: 2592\n",
+ "Speaker 577: 1678\n",
+ "Speaker 578: 2589\n",
+ "Speaker 579: 696\n",
+ "Speaker 580: 3224\n",
+ "Speaker 581: 4719\n",
+ "Speaker 582: 3972\n",
+ "Speaker 583: 731\n",
+ "Speaker 584: 1290\n",
+ "Speaker 585: 2319\n",
+ "Speaker 586: 1933\n",
+ "Speaker 587: 7126\n",
+ "Speaker 588: 803\n",
+ "Speaker 589: 4057\n",
+ "Speaker 590: 2628\n",
+ "Speaker 591: 2436\n",
+ "Speaker 592: 6788\n",
+ "Speaker 593: 3379\n",
+ "Speaker 594: 948\n",
+ "Speaker 595: 5712\n",
+ "Speaker 596: 5242\n",
+ "Speaker 597: 7140\n",
+ "Speaker 598: 7078\n",
+ "Speaker 599: 5655\n",
+ "Speaker 600: 4957\n",
+ "Speaker 601: 1607\n",
+ "Speaker 602: 78\n",
+ "Speaker 603: 2652\n",
+ "Speaker 604: 5984\n",
+ "Speaker 605: 953\n",
+ "Speaker 606: 7169\n",
+ "Speaker 607: 500\n",
+ "Speaker 608: 5093\n",
+ "Speaker 609: 8300\n",
+ "Speaker 610: 3171\n",
+ "Speaker 611: 8498\n",
+ "Speaker 612: 5561\n",
+ "Speaker 613: 7538\n",
+ "Speaker 614: 7789\n",
+ "Speaker 615: 451\n",
+ "Speaker 616: 4111\n",
+ "Speaker 617: 3180\n",
+ "Speaker 618: 6378\n",
+ "Speaker 619: 2989\n",
+ "Speaker 620: 6574\n",
+ "Speaker 621: 4406\n",
+ "Speaker 622: 208\n",
+ "Speaker 623: 248\n",
+ "Speaker 624: 274\n",
+ "Speaker 625: 3157\n",
+ "Speaker 626: 6836\n",
+ "Speaker 627: 5333\n",
+ "Speaker 628: 7445\n",
+ "Speaker 629: 5456\n",
+ "Speaker 630: 3654\n",
+ "Speaker 631: 5246\n",
+ "Speaker 632: 6099\n",
+ "Speaker 633: 4297\n",
+ "Speaker 634: 7688\n",
+ "Speaker 635: 839\n",
+ "Speaker 636: 1571\n",
+ "Speaker 637: 1390\n",
+ "Speaker 638: 2764\n",
+ "Speaker 639: 7000\n",
+ "Speaker 640: 2294\n",
+ "Speaker 641: 289\n",
+ "Speaker 642: 3922\n",
+ "Speaker 643: 8797\n",
+ "Speaker 644: 6518\n",
+ "Speaker 645: 8630\n",
+ "Speaker 646: 2929\n",
+ "Speaker 647: 7967\n",
+ "Speaker 648: 6286\n",
+ "Speaker 649: 6620\n",
+ "Speaker 650: 1379\n",
+ "Speaker 651: 405\n",
+ "Speaker 652: 8329\n",
+ "Speaker 653: 8838\n",
+ "Speaker 654: 3493\n",
+ "Speaker 655: 6167\n",
+ "Speaker 656: 7241\n",
+ "Speaker 657: 6308\n",
+ "Speaker 658: 4039\n",
+ "Speaker 659: 3982\n",
+ "Speaker 660: 770\n",
+ "Speaker 661: 8324\n",
+ "Speaker 662: 7777\n",
+ "Speaker 663: 2299\n",
+ "Speaker 664: 3446\n",
+ "Speaker 665: 2004\n",
+ "Speaker 666: 6359\n",
+ "Speaker 667: 6818\n",
+ "Speaker 668: 203\n",
+ "Speaker 669: 8609\n",
+ "Speaker 670: 1777\n",
+ "Speaker 671: 2007\n",
+ "Speaker 672: 3185\n",
+ "Speaker 673: 9022\n",
+ "Speaker 674: 3328\n",
+ "Speaker 675: 6763\n",
+ "Speaker 676: 6918\n",
+ "Speaker 677: 637\n",
+ "Speaker 678: 5054\n",
+ "Speaker 679: 4358\n",
+ "Speaker 680: 2758\n",
+ "Speaker 681: 3105\n",
+ "Speaker 682: 5400\n",
+ "Speaker 683: 5337\n",
+ "Speaker 684: 2254\n",
+ "Speaker 685: 3242\n",
+ "Speaker 686: 8176\n",
+ "Speaker 687: 19\n",
+ "Speaker 688: 4481\n",
+ "Speaker 689: 4899\n",
+ "Speaker 690: 1460\n",
+ "Speaker 691: 7657\n",
+ "Speaker 692: 4640\n",
+ "Speaker 693: 6696\n",
+ "Speaker 694: 7402\n",
+ "Speaker 695: 920\n",
+ "Speaker 696: 1311\n",
+ "Speaker 697: 83\n",
+ "Speaker 698: 225\n",
+ "Speaker 699: 224\n",
+ "Speaker 700: 6426\n",
+ "Speaker 701: 7910\n",
+ "Speaker 702: 8848\n",
+ "Speaker 703: 7247\n",
+ "Speaker 704: 408\n",
+ "Speaker 705: 8825\n",
+ "Speaker 706: 2481\n",
+ "Speaker 707: 7938\n",
+ "Speaker 708: 307\n",
+ "Speaker 709: 475\n",
+ "Speaker 710: 4598\n",
+ "Speaker 711: 3119\n",
+ "Speaker 712: 1081\n",
+ "Speaker 713: 6690\n",
+ "Speaker 714: 2012\n",
+ "Speaker 715: 1594\n",
+ "Speaker 716: 192\n",
+ "Speaker 717: 115\n",
+ "Speaker 718: 8088\n",
+ "Speaker 719: 81\n",
+ "Speaker 720: 8095\n",
+ "Speaker 721: 1779\n",
+ "Speaker 722: 5684\n",
+ "Speaker 723: 6510\n",
+ "Speaker 724: 6032\n",
+ "Speaker 725: 3728\n",
+ "Speaker 726: 1116\n",
+ "Speaker 727: 8138\n",
+ "Speaker 728: 5808\n",
+ "Speaker 729: 4397\n",
+ "Speaker 730: 2368\n",
+ "Speaker 731: 5968\n",
+ "Speaker 732: 5588\n",
+ "Speaker 733: 7148\n",
+ "Speaker 734: 6981\n",
+ "Speaker 735: 6395\n",
+ "Speaker 736: 1313\n",
+ "Speaker 737: 200\n",
+ "Speaker 738: 7525\n",
+ "Speaker 739: 8075\n",
+ "Speaker 740: 1212\n",
+ "Speaker 741: 3835\n",
+ "Speaker 742: 1100\n",
+ "Speaker 743: 7314\n",
+ "Speaker 744: 8459\n",
+ "Speaker 745: 7342\n",
+ "Speaker 746: 5147\n",
+ "Speaker 747: 6181\n",
+ "Speaker 748: 6037\n",
+ "Speaker 749: 2473\n",
+ "Speaker 750: 2971\n",
+ "Speaker 751: 2156\n",
+ "Speaker 752: 7705\n",
+ "Speaker 753: 6937\n",
+ "Speaker 754: 22\n",
+ "Speaker 755: 666\n",
+ "Speaker 756: 3945\n",
+ "Speaker 757: 1335\n",
+ "Speaker 758: 5618\n",
+ "Speaker 759: 559\n",
+ "Speaker 760: 340\n",
+ "Speaker 761: 7828\n",
+ "Speaker 762: 7229\n",
+ "Speaker 763: 1165\n",
+ "Speaker 764: 8225\n",
+ "Speaker 765: 2843\n",
+ "Speaker 766: 1536\n",
+ "Speaker 767: 8573\n",
+ "Speaker 768: 3927\n",
+ "Speaker 769: 8875\n",
+ "Speaker 770: 636\n",
+ "Speaker 771: 597\n",
+ "Speaker 772: 4356\n",
+ "Speaker 773: 176\n",
+ "Speaker 774: 434\n",
+ "Speaker 775: 1343\n",
+ "Speaker 776: 332\n",
+ "Speaker 777: 4260\n",
+ "Speaker 778: 2775\n",
+ "Speaker 779: 17\n",
+ "Speaker 780: 3905\n",
+ "Speaker 781: 7717\n",
+ "Speaker 782: 198\n",
+ "Speaker 783: 6529\n",
+ "Speaker 784: 8580\n",
+ "Speaker 785: 1885\n",
+ "Speaker 786: 7932\n",
+ "Speaker 787: 5778\n",
+ "Speaker 788: 7518\n",
+ "Speaker 789: 4519\n",
+ "Speaker 790: 3792\n",
+ "Speaker 791: 5029\n",
+ "Speaker 792: 3857\n",
+ "Speaker 793: 949\n",
+ "Speaker 794: 8421\n",
+ "Speaker 795: 1455\n",
+ "Speaker 796: 5717\n",
+ "Speaker 797: 3781\n",
+ "Speaker 798: 7134\n",
+ "Speaker 799: 7732\n",
+ "Speaker 800: 576\n",
+ "Speaker 801: 8226\n",
+ "Speaker 802: 7226\n",
+ "Speaker 803: 1098\n",
+ "Speaker 804: 7780\n",
+ "Speaker 805: 5723\n",
+ "Speaker 806: 7553\n",
+ "Speaker 807: 5740\n",
+ "Speaker 808: 1992\n",
+ "Speaker 809: 4441\n",
+ "Speaker 810: 28\n",
+ "Speaker 811: 7302\n",
+ "Speaker 812: 4214\n",
+ "Speaker 813: 7957\n",
+ "Speaker 814: 1363\n",
+ "Speaker 815: 6098\n",
+ "Speaker 816: 4848\n",
+ "Speaker 817: 1365\n",
+ "Speaker 818: 2093\n",
+ "Speaker 819: 1265\n",
+ "Speaker 820: 1578\n",
+ "Speaker 821: 4278\n",
+ "Speaker 822: 3816\n",
+ "Speaker 823: 1752\n",
+ "Speaker 824: 7495\n",
+ "Speaker 825: 1183\n",
+ "Speaker 826: 1645\n",
+ "Speaker 827: 698\n",
+ "Speaker 828: 2060\n",
+ "Speaker 829: 7318\n",
+ "Speaker 830: 112\n",
+ "Speaker 831: 4088\n",
+ "Speaker 832: 7859\n",
+ "Speaker 833: 7447\n",
+ "Speaker 834: 3009\n",
+ "Speaker 835: 246\n",
+ "Speaker 836: 1754\n",
+ "Speaker 837: 480\n",
+ "Speaker 838: 403\n",
+ "Speaker 839: 3368\n",
+ "Speaker 840: 446\n",
+ "Speaker 841: 2774\n",
+ "Speaker 842: 2498\n",
+ "Speaker 843: 3540\n",
+ "Speaker 844: 6300\n",
+ "Speaker 845: 6458\n",
+ "Speaker 846: 8401\n",
+ "Speaker 847: 2577\n",
+ "Speaker 848: 7635\n",
+ "Speaker 849: 1040\n",
+ "Speaker 850: 6215\n",
+ "Speaker 851: 6727\n",
+ "Speaker 852: 6406\n",
+ "Speaker 853: 2853\n",
+ "Speaker 854: 6317\n",
+ "Speaker 855: 7113\n",
+ "Speaker 856: 8425\n",
+ "Speaker 857: 6694\n",
+ "Speaker 858: 882\n",
+ "Speaker 859: 100\n",
+ "Speaker 860: 6454\n",
+ "Speaker 861: 3584\n",
+ "Speaker 862: 2817\n",
+ "Speaker 863: 667\n",
+ "Speaker 864: 2229\n",
+ "Speaker 865: 3851\n",
+ "Speaker 866: 4133\n",
+ "Speaker 867: 5656\n",
+ "Speaker 868: 278\n",
+ "Speaker 869: 1535\n",
+ "Speaker 870: 1259\n",
+ "Speaker 871: 7128\n",
+ "Speaker 872: 296\n",
+ "Speaker 873: 2061\n",
+ "Speaker 874: 5393\n",
+ "Speaker 875: 3221\n",
+ "Speaker 876: 30\n",
+ "Speaker 877: 188\n",
+ "Speaker 878: 1445\n",
+ "Speaker 879: 2393\n",
+ "Speaker 880: 6956\n",
+ "Speaker 881: 5163\n",
+ "Speaker 882: 549\n",
+ "Speaker 883: 2518\n",
+ "Speaker 884: 829\n",
+ "Speaker 885: 6567\n",
+ "Speaker 886: 8592\n",
+ "Speaker 887: 303\n",
+ "Speaker 888: 240\n",
+ "Speaker 889: 3380\n",
+ "Speaker 890: 7481\n",
+ "Speaker 891: 5883\n",
+ "Speaker 892: 3214\n",
+ "Speaker 893: 8855\n",
+ "Speaker 894: 3947\n",
+ "Speaker 895: 398\n",
+ "Speaker 896: 55\n",
+ "Speaker 897: 8722\n",
+ "Speaker 898: 8713\n",
+ "Speaker 899: 5868\n",
+ "Speaker 900: 979\n",
+ "Speaker 901: 209\n",
+ "Speaker 902: 2673\n",
+ "Speaker 903: 3340\n",
+ "Speaker 904: 126\n",
+ "Speaker 905: 612\n",
+ "Speaker 906: 580\n",
+ "Speaker 907: 1182\n",
+ "Speaker 908: 664\n",
+ "Speaker 909: 1246\n",
+ "Speaker 910: 5678\n",
+ "Speaker 911: 1487\n",
+ "Speaker 912: 9026\n",
+ "Speaker 913: 2196\n",
+ "Speaker 914: 6235\n",
+ "Speaker 915: 2952\n",
+ "Speaker 916: 3733\n",
+ "Speaker 917: 2790\n",
+ "Speaker 918: 6339\n",
+ "Speaker 919: 5489\n",
+ "Speaker 920: 7505\n",
+ "Speaker 921: 7190\n",
+ "Speaker 922: 7059\n",
+ "Speaker 923: 175\n",
+ "Speaker 924: 5390\n",
+ "Speaker 925: 6373\n",
+ "Speaker 926: 6895\n",
+ "Speaker 927: 4148\n",
+ "Speaker 928: 93\n",
+ "Speaker 929: 339\n",
+ "Speaker 930: 8113\n",
+ "Speaker 931: 7478\n",
+ "Speaker 932: 439\n",
+ "Speaker 933: 6209\n",
+ "Speaker 934: 6553\n",
+ "Speaker 935: 5660\n",
+ "Speaker 936: 716\n",
+ "Speaker 937: 6643\n",
+ "Speaker 938: 4788\n",
+ "Speaker 939: 114\n",
+ "Speaker 940: 492\n",
+ "Speaker 941: 5909\n",
+ "Speaker 942: 1482\n",
+ "Speaker 943: 38\n",
+ "Speaker 944: 5448\n",
+ "Speaker 945: 98\n",
+ "Speaker 946: 159\n",
+ "Speaker 947: 718\n",
+ "Speaker 948: 922\n",
+ "Speaker 949: 7258\n",
+ "Speaker 950: 6388\n",
+ "Speaker 951: 7178\n",
+ "Speaker 952: 7558\n",
+ "Speaker 953: 899\n",
+ "Speaker 954: 373\n",
+ "Speaker 955: 87\n",
+ "Speaker 956: 3526\n",
+ "Speaker 957: 3864\n",
+ "Speaker 958: 3370\n",
+ "Speaker 959: 1826\n",
+ "Speaker 960: 7739\n",
+ "Speaker 961: 6575\n",
+ "Speaker 962: 501\n",
+ "Speaker 963: 909\n",
+ "Speaker 964: 3112\n",
+ "Speaker 965: 7240\n",
+ "Speaker 966: 699\n",
+ "Speaker 967: 4595\n",
+ "Speaker 968: 5746\n",
+ "Speaker 969: 4856\n",
+ "Speaker 970: 1629\n",
+ "Speaker 971: 707\n",
+ "Speaker 972: 589\n",
+ "Speaker 973: 1638\n",
+ "Speaker 974: 830\n",
+ "Speaker 975: 3989\n",
+ "Speaker 976: 8066\n",
+ "Speaker 977: 7416\n",
+ "Speaker 978: 70\n",
+ "Speaker 979: 6993\n",
+ "Speaker 980: 3790\n",
+ "Speaker 981: 3490\n",
+ "Speaker 982: 8684\n",
+ "Speaker 983: 166\n",
+ "Speaker 984: 6505\n",
+ "Speaker 985: 2911\n",
+ "Speaker 986: 2127\n",
+ "Speaker 987: 2146\n",
+ "Speaker 988: 3664\n",
+ "Speaker 989: 7995\n",
+ "Speaker 990: 8725\n",
+ "Speaker 991: 4340\n",
+ "Speaker 992: 8006\n",
+ "Speaker 993: 4973\n",
+ "Speaker 994: 2910\n",
+ "Speaker 995: 497\n",
+ "Speaker 996: 5876\n",
+ "Speaker 997: 6233\n",
+ "Speaker 998: 3537\n",
+ "Speaker 999: 1413\n",
+ "Speaker 1000: 5189\n",
+ "Speaker 1001: 204\n",
+ "Speaker 1002: 836\n",
+ "Speaker 1003: 2618\n",
+ "Speaker 1004: 7276\n",
+ "Speaker 1005: 1264\n",
+ "Speaker 1006: 2045\n",
+ "Speaker 1007: 3215\n",
+ "Speaker 1008: 6555\n",
+ "Speaker 1009: 196\n",
+ "Speaker 1010: 6848\n",
+ "Speaker 1011: 1160\n",
+ "Speaker 1012: 8771\n",
+ "Speaker 1013: 4744\n",
+ "Speaker 1014: 6637\n",
+ "Speaker 1015: 1463\n",
+ "Speaker 1016: 3615\n",
+ "Speaker 1017: 5776\n",
+ "Speaker 1018: 26\n",
+ "Speaker 1019: 7339\n",
+ "Speaker 1020: 249\n",
+ "Speaker 1021: 1034\n",
+ "Speaker 1022: 1743\n",
+ "Speaker 1023: 207\n",
+ "Speaker 1024: 831\n",
+ "Speaker 1025: 4335\n",
+ "Speaker 1026: 7720\n",
+ "Speaker 1027: 2194\n",
+ "Speaker 1028: 688\n",
+ "Speaker 1029: 8619\n",
+ "Speaker 1030: 8718\n",
+ "Speaker 1031: 581\n",
+ "Speaker 1032: 835\n",
+ "Speaker 1033: 7881\n",
+ "Speaker 1034: 3607\n",
+ "Speaker 1035: 7933\n",
+ "Speaker 1036: 708\n",
+ "Speaker 1037: 7188\n",
+ "Speaker 1038: 4246\n",
+ "Speaker 1039: 1926\n",
+ "Speaker 1040: 7766\n",
+ "Speaker 1041: 6538\n",
+ "Speaker 1042: 2149\n",
+ "Speaker 1043: 7434\n",
+ "Speaker 1044: 3230\n",
+ "Speaker 1045: 3983\n",
+ "Speaker 1046: 4152\n",
+ "Speaker 1047: 1336\n",
+ "Speaker 1048: 2388\n",
+ "Speaker 1049: 5139\n",
+ "Speaker 1050: 1473\n",
+ "Speaker 1051: 868\n",
+ "Speaker 1052: 2709\n",
+ "Speaker 1053: 2674\n",
+ "Speaker 1054: 2570\n",
+ "Speaker 1055: 211\n",
+ "Speaker 1056: 4137\n",
+ "Speaker 1057: 472\n",
+ "Speaker 1058: 5022\n",
+ "Speaker 1059: 1263\n",
+ "Speaker 1060: 1801\n",
+ "Speaker 1061: 1963\n",
+ "Speaker 1062: 5386\n",
+ "Speaker 1063: 3274\n",
+ "Speaker 1064: 3070\n",
+ "Speaker 1065: 3436\n",
+ "Speaker 1066: 8347\n",
+ "Speaker 1067: 7245\n",
+ "Speaker 1068: 3240\n",
+ "Speaker 1069: 7555\n",
+ "Speaker 1070: 6081\n",
+ "Speaker 1071: 5914\n",
+ "Speaker 1072: 1827\n",
+ "Speaker 1073: 8238\n",
+ "Speaker 1074: 2256\n",
+ "Speaker 1075: 7139\n",
+ "Speaker 1076: 1668\n",
+ "Speaker 1077: 4108\n",
+ "Speaker 1078: 7809\n",
+ "Speaker 1079: 2384\n",
+ "Speaker 1080: 4806\n",
+ "Speaker 1081: 3830\n",
+ "Speaker 1082: 3889\n",
+ "Speaker 1083: 217\n",
+ "Speaker 1084: 3645\n",
+ "Speaker 1085: 205\n",
+ "Speaker 1086: 6476\n",
+ "Speaker 1087: 4590\n",
+ "Speaker 1088: 6563\n",
+ "Speaker 1089: 2416\n",
+ "Speaker 1090: 8183\n",
+ "Speaker 1091: 8975\n",
+ "Speaker 1092: 4257\n",
+ "Speaker 1093: 1425\n",
+ "Speaker 1094: 8014\n",
+ "Speaker 1095: 5190\n",
+ "Speaker 1096: 4160\n",
+ "Speaker 1097: 1069\n",
+ "Speaker 1098: 1923\n",
+ "Speaker 1099: 4110\n",
+ "Speaker 1100: 1235\n",
+ "Speaker 1101: 5049\n",
+ "Speaker 1102: 479\n",
+ "Speaker 1103: 2573\n",
+ "Speaker 1104: 4331\n",
+ "Speaker 1105: 6828\n",
+ "Speaker 1106: 380\n",
+ "Speaker 1107: 7395\n",
+ "Speaker 1108: 3638\n",
+ "Speaker 1109: 6104\n",
+ "Speaker 1110: 4681\n",
+ "Speaker 1111: 8824\n",
+ "Speaker 1112: 2085\n",
+ "Speaker 1113: 6038\n",
+ "Speaker 1114: 7475\n",
+ "Speaker 1115: 8490\n",
+ "Speaker 1116: 3486\n",
+ "Speaker 1117: 6258\n",
+ "Speaker 1118: 2999\n",
+ "Speaker 1119: 8228\n",
+ "Speaker 1120: 1387\n",
+ "Speaker 1121: 8028\n",
+ "Speaker 1122: 1060\n",
+ "Speaker 1123: 3869\n",
+ "Speaker 1124: 8410\n",
+ "Speaker 1125: 2532\n",
+ "Speaker 1126: 2562\n",
+ "Speaker 1127: 1383\n",
+ "Speaker 1128: 7120\n",
+ "Speaker 1129: 1974\n",
+ "Speaker 1130: 1603\n",
+ "Speaker 1131: 8388\n",
+ "Speaker 1132: 8506\n",
+ "Speaker 1133: 337\n",
+ "Speaker 1134: 4222\n",
+ "Speaker 1135: 6371\n",
+ "Speaker 1136: 5039\n",
+ "Speaker 1137: 5867\n",
+ "Speaker 1138: 3876\n",
+ "Speaker 1139: 5293\n",
+ "Speaker 1140: 103\n",
+ "Speaker 1141: 64\n",
+ "Speaker 1142: 7800\n",
+ "Speaker 1143: 1447\n",
+ "Speaker 1144: 154\n",
+ "Speaker 1145: 460\n",
+ "Speaker 1146: 2514\n",
+ "Speaker 1147: 2816\n",
+ "Speaker 1148: 27\n",
+ "Speaker 1149: 242\n",
+ "Speaker 1150: 3738\n"
+ ]
+ }
+ ],
+ "source": [
+ "for s, speaker in enumerate(speakers):\n",
+ " print('Speaker %d: %s' % (s + 1, speaker))\n",
+ " os.mkdir(os.path.join(data_dir, 'mels_mode', speaker))\n",
+ " textgrids = os.listdir(os.path.join(data_dir, 'textgrids', speaker))\n",
+ " for textgrid in textgrids:\n",
+ " t = tgt.io.read_textgrid(os.path.join(data_dir, 'textgrids', speaker, textgrid))\n",
+ " m = np.load(os.path.join(data_dir, 'mels', speaker, textgrid.replace('.TextGrid', '_mel.npy')))\n",
+ " m_mode = np.copy(m)\n",
+ " t = t.get_tier_by_name('phones')\n",
+ " for i in range(len(t)):\n",
+ " phoneme = t[i].text\n",
+ " start_frame = int(t[i].start_time * 22050.0) // 256\n",
+ " end_frame = int(t[i].end_time * 22050.0) // 256 + 1\n",
+ " m_mode[:, start_frame:end_frame] = np.repeat(np.expand_dims(mels_mode[phoneme], 1), end_frame - start_frame, 1)\n",
+ " np.save(os.path.join(data_dir, 'mels_mode', speaker, textgrid.replace('.TextGrid', '_avgmel.npy')), m_mode)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "collapsed": true
+ },
+ "outputs": [],
+ "source": []
+ }
+ ],
+ "metadata": {
+ "anaconda-cloud": {},
+ "kernelspec": {
+ "display_name": "Python [default]",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.5.2"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
diff --git a/talkingface/data/dataprocess/inference.ipynb b/talkingface/data/dataprocess/inference.ipynb
new file mode 100644
index 00000000..284de0c9
--- /dev/null
+++ b/talkingface/data/dataprocess/inference.ipynb
@@ -0,0 +1,356 @@
+{
+ "cells": [
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import argparse\n",
+ "import json\n",
+ "import os\n",
+ "import numpy as np\n",
+ "import IPython.display as ipd\n",
+ "from tqdm import tqdm\n",
+ "from scipy.io.wavfile import write\n",
+ "\n",
+ "import torch\n",
+ "use_gpu = torch.cuda.is_available()\n",
+ "\n",
+ "import librosa\n",
+ "from librosa.core import load\n",
+ "from librosa.filters import mel as librosa_mel_fn\n",
+ "#mel_basis = librosa_mel_fn(22050, 1024, 80, 0, 8000)\n",
+ "mel_basis = librosa_mel_fn(sr=22050, n_fft=1024, n_mels=80, fmin=0, fmax=8000)\n",
+ "\n",
+ "import params\n",
+ "from model import DiffVC\n",
+ "\n",
+ "import sys\n",
+ "sys.path.append('hifi-gan/')\n",
+ "from env import AttrDict\n",
+ "from models import Generator as HiFiGAN\n",
+ "\n",
+ "sys.path.append('speaker_encoder/')\n",
+ "from encoder import inference as spk_encoder\n",
+ "from pathlib import Path"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def get_mel(wav_path):\n",
+ " wav, _ = load(wav_path, sr=22050)\n",
+ " wav = wav[:(wav.shape[0] // 256)*256]\n",
+ " wav = np.pad(wav, 384, mode='reflect')\n",
+ " stft = librosa.core.stft(wav, n_fft=1024, hop_length=256, win_length=1024, window='hann', center=False)\n",
+ " stftm = np.sqrt(np.real(stft) ** 2 + np.imag(stft) ** 2 + (1e-9))\n",
+ " mel_spectrogram = np.matmul(mel_basis, stftm)\n",
+ " log_mel_spectrogram = np.log(np.clip(mel_spectrogram, a_min=1e-5, a_max=None))\n",
+ " return log_mel_spectrogram\n",
+ "\n",
+ "def get_embed(wav_path):\n",
+ " wav_preprocessed = spk_encoder.preprocess_wav(wav_path)\n",
+ " embed = spk_encoder.embed_utterance(wav_preprocessed)\n",
+ " return embed\n",
+ "\n",
+ "def noise_median_smoothing(x, w=5):\n",
+ " y = np.copy(x)\n",
+ " x = np.pad(x, w, \"edge\")\n",
+ " for i in range(y.shape[0]):\n",
+ " med = np.median(x[i:i+2*w+1])\n",
+ " y[i] = min(x[i+w+1], med)\n",
+ " return y\n",
+ "\n",
+ "def mel_spectral_subtraction(mel_synth, mel_source, spectral_floor=0.02, silence_window=5, smoothing_window=5):\n",
+ " mel_len = mel_source.shape[-1]\n",
+ " energy_min = 100000.0\n",
+ " i_min = 0\n",
+ " for i in range(mel_len - silence_window):\n",
+ " energy_cur = np.sum(np.exp(2.0 * mel_source[:, i:i+silence_window]))\n",
+ " if energy_cur < energy_min:\n",
+ " i_min = i\n",
+ " energy_min = energy_cur\n",
+ " estimated_noise_energy = np.min(np.exp(2.0 * mel_synth[:, i_min:i_min+silence_window]), axis=-1)\n",
+ " if smoothing_window is not None:\n",
+ " estimated_noise_energy = noise_median_smoothing(estimated_noise_energy, smoothing_window)\n",
+ " mel_denoised = np.copy(mel_synth)\n",
+ " for i in range(mel_len):\n",
+ " signal_subtract_noise = np.exp(2.0 * mel_synth[:, i]) - estimated_noise_energy\n",
+ " estimated_signal_energy = np.maximum(signal_subtract_noise, spectral_floor * estimated_noise_energy)\n",
+ " mel_denoised[:, i] = np.log(np.sqrt(estimated_signal_energy))\n",
+ " return mel_denoised"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Number of parameters: 126259128\n"
+ ]
+ }
+ ],
+ "source": [
+ "# loading voice conversion model\n",
+ "vc_path = 'checkpts/vc/vc_libritts_wodyn.pt' # path to voice conversion model\n",
+ "\n",
+ "generator = DiffVC(params.n_mels, params.channels, params.filters, params.heads, \n",
+ " params.layers, params.kernel, params.dropout, params.window_size, \n",
+ " params.enc_dim, params.spk_dim, params.use_ref_t, params.dec_dim, \n",
+ " params.beta_min, params.beta_max)\n",
+ "if use_gpu:\n",
+ " generator = generator.cuda()\n",
+ " generator.load_state_dict(torch.load(vc_path))\n",
+ "else:\n",
+ " generator.load_state_dict(torch.load(vc_path, map_location='cpu'))\n",
+ "generator.eval()\n",
+ "\n",
+ "print(f'Number of parameters: {generator.nparams}')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "c:\\Users\\liberty\\AppData\\Local\\Programs\\Python\\Python310\\lib\\site-packages\\torch\\nn\\utils\\weight_norm.py:30: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.\n",
+ " warnings.warn(\"torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.\")\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Removing weight norm...\n"
+ ]
+ }
+ ],
+ "source": [
+ "# loading HiFi-GAN vocoder\n",
+ "hfg_path = 'checkpts/vocoder/' # HiFi-GAN path\n",
+ "\n",
+ "with open(hfg_path + 'config.json') as f:\n",
+ " h = AttrDict(json.load(f))\n",
+ "\n",
+ "if use_gpu:\n",
+ " hifigan_universal = HiFiGAN(h).cuda()\n",
+ " hifigan_universal.load_state_dict(torch.load(hfg_path + 'generator')['generator'])\n",
+ "else:\n",
+ " hifigan_universal = HiFiGAN(h)\n",
+ " hifigan_universal.load_state_dict(torch.load(hfg_path + 'generator', map_location='cpu')['generator'])\n",
+ "\n",
+ "_ = hifigan_universal.eval()\n",
+ "hifigan_universal.remove_weight_norm()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Loaded encoder \"pretrained.pt\" trained to step 1564501\n"
+ ]
+ }
+ ],
+ "source": [
+ "# loading speaker encoder\n",
+ "enc_model_fpath = Path('checkpts/spk_encoder/pretrained.pt') # speaker encoder path\n",
+ "if use_gpu:\n",
+ " spk_encoder.load_model(enc_model_fpath, device=\"cuda\")\n",
+ "else:\n",
+ " spk_encoder.load_model(enc_model_fpath, device=\"cpu\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 15,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "c:\\Users\\liberty\\Desktop\\divff\\Speech-Backbones\\DiffVC\\speaker_encoder\\encoder\\audio.py:41: FutureWarning: Pass orig_sr=24000, target_sr=16000 as keyword args. From version 0.10 passing these as positional arguments will result in an error\n",
+ " wav = librosa.resample(wav, source_sr, sampling_rate)\n",
+ "c:\\Users\\liberty\\Desktop\\divff\\Speech-Backbones\\DiffVC\\speaker_encoder\\encoder\\audio.py:75: FutureWarning: Pass y=[-5.1470578e-04 -4.9517461e-04 7.9890393e-05 ... 7.1139593e-04\n",
+ " 4.4408118e-04 4.9962930e-04] as keyword args. From version 0.10 passing these as positional arguments will result in an error\n",
+ " frames = librosa.feature.melspectrogram(\n"
+ ]
+ }
+ ],
+ "source": [
+ "# loading source and reference wavs, calculating mel-spectrograms and speaker embeddings\n",
+ "src_path = 'example/8534_216567_000015_000010.wav' # path to source utterance\n",
+ "tgt_path = 'example/6415_111615_000012_000005.wav' # path to reference utterance\n",
+ "\n",
+ "mel_source = torch.from_numpy(get_mel(src_path)).float().unsqueeze(0)\n",
+ "if use_gpu:\n",
+ " mel_source = mel_source.cuda()\n",
+ "mel_source_lengths = torch.LongTensor([mel_source.shape[-1]])\n",
+ "if use_gpu:\n",
+ " mel_source_lengths = mel_source_lengths.cuda()\n",
+ "\n",
+ "mel_target = torch.from_numpy(get_mel(tgt_path)).float().unsqueeze(0)\n",
+ "if use_gpu:\n",
+ " mel_target = mel_target.cuda()\n",
+ "mel_target_lengths = torch.LongTensor([mel_target.shape[-1]])\n",
+ "if use_gpu:\n",
+ " mel_target_lengths = mel_target_lengths.cuda()\n",
+ "\n",
+ "embed_target = torch.from_numpy(get_embed(tgt_path)).float().unsqueeze(0)\n",
+ "if use_gpu:\n",
+ " embed_target = embed_target.cuda()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 16,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# performing voice conversion\n",
+ "mel_encoded, mel_ = generator.forward(mel_source, mel_source_lengths, mel_target, mel_target_lengths, embed_target, \n",
+ " n_timesteps=30, mode='ml')\n",
+ "mel_synth_np = mel_.cpu().detach().squeeze().numpy()\n",
+ "mel_source_np = mel_.cpu().detach().squeeze().numpy()\n",
+ "mel = torch.from_numpy(mel_spectral_subtraction(mel_synth_np, mel_source_np, smoothing_window=1)).float().unsqueeze(0)\n",
+ "if use_gpu:\n",
+ " mel = mel.cuda()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 17,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ " \n",
+ " "
+ ],
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "# source utterance (vocoded)\n",
+ "with torch.no_grad():\n",
+ " audio = hifigan_universal.forward(mel_source).cpu().squeeze().clamp(-1, 1)\n",
+ "ipd.display(ipd.Audio(audio, rate=22050))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 18,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ " \n",
+ " "
+ ],
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "# reference utterance (vocoded)\n",
+ "with torch.no_grad():\n",
+ " audio = hifigan_universal.forward(mel_target).cpu().squeeze().clamp(-1, 1)\n",
+ "ipd.display(ipd.Audio(audio, rate=22050))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 19,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ " \n",
+ " "
+ ],
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "# converted speech\n",
+ "with torch.no_grad():\n",
+ " audio = hifigan_universal.forward(mel).cpu().squeeze().clamp(-1, 1)\n",
+ "ipd.display(ipd.Audio(audio, rate=22050))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "venv",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.10.6"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/talkingface/data/dataset/__init__.py b/talkingface/data/dataset/__init__.py
index 3fd37538..b884aeb6 100644
--- a/talkingface/data/dataset/__init__.py
+++ b/talkingface/data/dataset/__init__.py
@@ -1,2 +1,4 @@
from talkingface.data.dataset.wav2lip_dataset import Wav2LipDataset
-from talkingface.data.dataset.dataset import Dataset
\ No newline at end of file
+from talkingface.data.dataset.dataset import Dataset
+from talkingface.data.dataset.diffvc_dataset import diffvcDataset
+from talkingface.data.dataset.diffvc_dataset import VCDecBatchCollate
\ No newline at end of file
diff --git a/talkingface/data/dataset/data_objects/__init__.py b/talkingface/data/dataset/data_objects/__init__.py
new file mode 100644
index 00000000..12ae8452
--- /dev/null
+++ b/talkingface/data/dataset/data_objects/__init__.py
@@ -0,0 +1,4 @@
+""" from https://github.com/CorentinJ/Real-Time-Voice-Cloning """
+
+from talkingface.data.dataset.data_objects.speaker_verification_dataset import SpeakerVerificationDataset
+from talkingface.data.dataset.data_objects.speaker_verification_dataset import SpeakerVerificationDataLoader
diff --git a/talkingface/data/dataset/data_objects/random_cycler.py b/talkingface/data/dataset/data_objects/random_cycler.py
new file mode 100644
index 00000000..6fd5bb00
--- /dev/null
+++ b/talkingface/data/dataset/data_objects/random_cycler.py
@@ -0,0 +1,39 @@
+""" from https://github.com/CorentinJ/Real-Time-Voice-Cloning """
+
+import random
+
+class RandomCycler:
+ """
+ Creates an internal copy of a sequence and allows access to its items in a constrained random
+ order. For a source sequence of n items and one or several consecutive queries of a total
+ of m items, the following guarantees hold (one implies the other):
+ - Each item will be returned between m // n and ((m - 1) // n) + 1 times.
+ - Between two appearances of the same item, there may be at most 2 * (n - 1) other items.
+ """
+
+ def __init__(self, source):
+ if len(source) == 0:
+ raise Exception("Can't create RandomCycler from an empty collection")
+ self.all_items = list(source)
+ self.next_items = []
+
+ def sample(self, count: int):
+ shuffle = lambda l: random.sample(l, len(l))
+
+ out = []
+ while count > 0:
+ if count >= len(self.all_items):
+ out.extend(shuffle(list(self.all_items)))
+ count -= len(self.all_items)
+ continue
+ n = min(count, len(self.next_items))
+ out.extend(self.next_items[:n])
+ count -= n
+ self.next_items = self.next_items[n:]
+ if len(self.next_items) == 0:
+ self.next_items = shuffle(list(self.all_items))
+ return out
+
+ def __next__(self):
+ return self.sample(1)[0]
+
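+# Illustrative note (added comment, not part of the upstream file): with
+# source = ['a', 'b', 'c'], RandomCycler(source).sample(5) returns each item
+# between 5 // 3 and ((5 - 1) // 3) + 1 times (i.e. once or twice), and at
+# most 2 * (3 - 1) other items ever separate two appearances of the same item.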
diff --git a/talkingface/data/dataset/data_objects/speaker.py b/talkingface/data/dataset/data_objects/speaker.py
new file mode 100644
index 00000000..1f7dada6
--- /dev/null
+++ b/talkingface/data/dataset/data_objects/speaker.py
@@ -0,0 +1,42 @@
+""" from https://github.com/CorentinJ/Real-Time-Voice-Cloning """
+
+from talkingface.data.dataset.data_objects.random_cycler import RandomCycler
+from talkingface.data.dataset.data_objects.utterance import Utterance
+from pathlib import Path
+
+# Contains the set of utterances of a single speaker
+class Speaker:
+ def __init__(self, root: Path):
+ self.root = root
+ self.name = root.name
+ self.utterances = None
+ self.utterance_cycler = None
+
+ def _load_utterances(self):
+ with self.root.joinpath("_sources.txt").open("r") as sources_file:
+ sources = [l.split(",") for l in sources_file]
+ sources = {frames_fname: wave_fpath for frames_fname, wave_fpath in sources}
+ self.utterances = [Utterance(self.root.joinpath(f), w) for f, w in sources.items()]
+ self.utterance_cycler = RandomCycler(self.utterances)
+
+ def random_partial(self, count, n_frames):
+ """
+ Samples a batch of unique partial utterances from the disk in a way that all
+ utterances come up at least once every two cycles and in a random order every time.
+
+ :param count: The number of partial utterances to sample from the set of utterances from
+ that speaker. Utterances are guaranteed not to be repeated if count is not larger than
+ the number of utterances available.
+ :param n_frames: The number of frames in the partial utterance.
+ :return: A list of tuples (utterance, frames, range) where utterance is an Utterance,
+ frames are the frames of the partial utterances and range is the range of the partial
+ utterance with regard to the complete utterance.
+ """
+ if self.utterances is None:
+ self._load_utterances()
+
+ utterances = self.utterance_cycler.sample(count)
+
+ a = [(u,) + u.random_partial(n_frames) for u in utterances]
+
+ return a
diff --git a/talkingface/data/dataset/data_objects/speaker_batch.py b/talkingface/data/dataset/data_objects/speaker_batch.py
new file mode 100644
index 00000000..afad39b5
--- /dev/null
+++ b/talkingface/data/dataset/data_objects/speaker_batch.py
@@ -0,0 +1,14 @@
+""" from https://github.com/CorentinJ/Real-Time-Voice-Cloning """
+
+import numpy as np
+from typing import List
+from talkingface.data.dataset.data_objects.speaker import Speaker
+
+class SpeakerBatch:
+ def __init__(self, speakers: List[Speaker], utterances_per_speaker: int, n_frames: int):
+ self.speakers = speakers
+ self.partials = {s: s.random_partial(utterances_per_speaker, n_frames) for s in speakers}
+
+ # Array of shape (n_speakers * n_utterances, n_frames, mel_n), e.g. for 3 speakers with
+ # 4 utterances each of 160 frames of 40 mel coefficients: (12, 160, 40)
+ self.data = np.array([frames for s in speakers for _, frames, _ in self.partials[s]])
diff --git a/talkingface/data/dataset/data_objects/speaker_verification_dataset.py b/talkingface/data/dataset/data_objects/speaker_verification_dataset.py
new file mode 100644
index 00000000..a52ebbc2
--- /dev/null
+++ b/talkingface/data/dataset/data_objects/speaker_verification_dataset.py
@@ -0,0 +1,58 @@
+""" from https://github.com/CorentinJ/Real-Time-Voice-Cloning """
+
+from talkingface.data.dataset.data_objects.random_cycler import RandomCycler
+from talkingface.data.dataset.data_objects.speaker_batch import SpeakerBatch
+from talkingface.data.dataset.data_objects.speaker import Speaker
+from talkingface.properties.dataset.params_data import partials_n_frames
+from torch.utils.data import Dataset, DataLoader
+from pathlib import Path
+
+# TODO: improve with a pool of speakers for data efficiency
+
+class SpeakerVerificationDataset(Dataset):
+ def __init__(self, datasets_root: Path):
+ self.root = datasets_root
+ speaker_dirs = [f for f in self.root.glob("*") if f.is_dir()]
+ if len(speaker_dirs) == 0:
+ raise Exception("No speakers found. Make sure you are pointing to the directory "
+ "containing all preprocessed speaker directories.")
+ self.speakers = [Speaker(speaker_dir) for speaker_dir in speaker_dirs]
+ self.speaker_cycler = RandomCycler(self.speakers)
+
+ def __len__(self):
+ return int(1e10)
+
+ def __getitem__(self, index):
+ return next(self.speaker_cycler)
+
+ def get_logs(self):
+ log_string = ""
+ for log_fpath in self.root.glob("*.txt"):
+ with log_fpath.open("r") as log_file:
+ log_string += "".join(log_file.readlines())
+ return log_string
+
+
+class SpeakerVerificationDataLoader(DataLoader):
+ def __init__(self, dataset, speakers_per_batch, utterances_per_speaker, sampler=None,
+ batch_sampler=None, num_workers=0, pin_memory=False, timeout=0,
+ worker_init_fn=None):
+ self.utterances_per_speaker = utterances_per_speaker
+
+ super().__init__(
+ dataset=dataset,
+ batch_size=speakers_per_batch,
+ shuffle=False,
+ sampler=sampler,
+ batch_sampler=batch_sampler,
+ num_workers=num_workers,
+ collate_fn=self.collate,
+ pin_memory=pin_memory,
+ drop_last=False,
+ timeout=timeout,
+ worker_init_fn=worker_init_fn
+ )
+
+ def collate(self, speakers):
+ return SpeakerBatch(speakers, self.utterances_per_speaker, partials_n_frames)
+
\ No newline at end of file
diff --git a/talkingface/data/dataset/data_objects/utterance.py b/talkingface/data/dataset/data_objects/utterance.py
new file mode 100644
index 00000000..2b878c58
--- /dev/null
+++ b/talkingface/data/dataset/data_objects/utterance.py
@@ -0,0 +1,28 @@
+""" from https://github.com/CorentinJ/Real-Time-Voice-Cloning """
+
+import numpy as np
+
+
+class Utterance:
+ def __init__(self, frames_fpath, wave_fpath):
+ self.frames_fpath = frames_fpath
+ self.wave_fpath = wave_fpath
+
+ def get_frames(self):
+ return np.load(self.frames_fpath)
+
+ def random_partial(self, n_frames):
+ """
+ Crops the frames into a partial utterance of n_frames
+
+ :param n_frames: The number of frames of the partial utterance
+ :return: the partial utterance frames and a tuple indicating the start and end of the
+ partial utterance in the complete utterance.
+ """
+ frames = self.get_frames()
+ if frames.shape[0] == n_frames:
+ start = 0
+ else:
+ start = np.random.randint(0, frames.shape[0] - n_frames)
+ end = start + n_frames
+ return frames[start:end], (start, end)
\ No newline at end of file
diff --git a/talkingface/data/dataset/diffvc_dataset.py b/talkingface/data/dataset/diffvc_dataset.py
new file mode 100644
index 00000000..31a65c30
--- /dev/null
+++ b/talkingface/data/dataset/diffvc_dataset.py
@@ -0,0 +1,379 @@
+import os
+import random
+import numpy as np
+import torch
+import tgt
+
+from talkingface.data.dataset.dataset import Dataset
+
+random_seed = 37
+n_mels = 80
+train_frames = 128
+
+# Returns the list of held-out test speakers.
+def get_test_speakers():
+ test_speakers = ['1401', '2238', '3723', '4014', '5126',
+ '5322', '587', '6415', '8057', '8534']
+ return test_speakers
+
+# Returns the list of unseen speakers in the VCTK dataset.
+def get_vctk_unseen_speakers():
+ unseen_speakers = ['p252', 'p261', 'p241', 'p238', 'p243',
+ 'p294', 'p334', 'p343', 'p360', 'p362']
+ return unseen_speakers
+
+# Returns the list of unseen sentences in the VCTK dataset.
+def get_vctk_unseen_sentences():
+ unseen_sentences = ['001', '002', '003', '004', '005']
+ return unseen_sentences
+
+# Excludes utterances containing words that MFA could not recognize (marked as 'spn').
+def exclude_spn(data_dir, spk, mel_ids):
+ res = []
+ for mel_id in mel_ids:
+ textgrid = mel_id + '.TextGrid'
+ t = tgt.io.read_textgrid(os.path.join(data_dir, 'textgrids', spk, textgrid))
+ t = t.get_tier_by_name('phones')
+ spn_found = False
+ for i in range(len(t)):
+ if t[i].text == 'spn':
+ spn_found = True
+ break
+ if not spn_found:
+ res.append(mel_id)
+ return res
+
+# LibriTTS dataset for training the "average voice" encoder.
+class VCEncDataset(Dataset):
+ def __init__(self, config, datasplit, data_dir, exc_file, avg_type):
+ super().__init__(config, datasplit)
+ self.mel_x_dir = os.path.join(data_dir, 'mels')
+ self.mel_y_dir = os.path.join(data_dir, 'mels_%s' % avg_type)
+
+ self.test_speakers = get_test_speakers()
+ self.speakers = [spk for spk in os.listdir(self.mel_x_dir)
+ if spk not in self.test_speakers]
+ with open(exc_file) as f:
+ exceptions = f.readlines()
+ self.exceptions = [e.strip() + '_mel.npy' for e in exceptions]
+ self.test_info = []
+ self.train_info = []
+ for spk in self.speakers:
+ mel_ids = os.listdir(os.path.join(self.mel_x_dir, spk))
+ mel_ids = [m[:-8] for m in mel_ids if m not in self.exceptions]
+ mel_ids = exclude_spn(data_dir, spk, mel_ids)
+ self.train_info += [(m, spk) for m in mel_ids]
+ for spk in self.test_speakers:
+ mel_ids = os.listdir(os.path.join(self.mel_x_dir, spk))
+ mel_ids = [m[:-8] for m in mel_ids]
+ self.test_info += [(m, spk) for m in mel_ids]
+ print("Total number of test wavs is %d." % len(self.test_info))
+ print("Total number of training wavs is %d." % len(self.train_info))
+ random.seed(random_seed)
+ random.shuffle(self.train_info)
+
+ def get_vc_data(self, mel_id, spk):
+ mel_x_path = os.path.join(self.mel_x_dir, spk, mel_id + '_mel.npy')
+ mel_y_path = os.path.join(self.mel_y_dir, spk, mel_id + '_avgmel.npy')
+ mel_x = np.load(mel_x_path)
+ mel_y = np.load(mel_y_path)
+ mel_x = torch.from_numpy(mel_x).float()
+ mel_y = torch.from_numpy(mel_y).float()
+ return (mel_x, mel_y)
+
+ def __getitem__(self, index):
+ mel_id, spk = self.train_info[index]
+ mel_x, mel_y = self.get_vc_data(mel_id, spk)
+ return {"x": mel_x, "y": mel_y}
+
+ def __len__(self):
+ return len(self.train_info)
+
+ def get_test_dataset(self):
+ pairs = []
+ for i in range(len(self.test_info)):
+ mel_id, spk = self.test_info[i]
+ mel_x, mel_y = self.get_vc_data(mel_id, spk)
+ pairs.append((mel_x, mel_y))
+ return [{"x": pair[0], "y": pair[1]} for pair in pairs]
+
+class VCDecDataset(torch.utils.data.Dataset):
+ def __init__(self, data_dir, val_file, exc_file):
+ self.mel_dir = os.path.join(data_dir, 'mels')
+ #self.mel_dir = self.mel_dir.replace('\\','/')
+ #self.mel_dir="./data/mels"
+ self.emb_dir = os.path.join(data_dir, 'embeds')
+ self.test_speakers = get_test_speakers()
+ self.speakers = [spk for spk in os.listdir(self.mel_dir)
+ if spk not in self.test_speakers]
+ self.speakers = [spk for spk in self.speakers
+ if len(os.listdir(os.path.join(self.mel_dir, spk))) >= 10]
+ random.seed(random_seed)
+ random.shuffle(self.speakers)
+ with open(exc_file) as f:
+ exceptions = f.readlines()
+ self.exceptions = [e.strip() + '_mel.npy' for e in exceptions]
+ with open(val_file) as f:
+ valid_ids = f.readlines()
+ self.valid_ids = set([v.strip() + '_mel.npy' for v in valid_ids])
+ self.exceptions += self.valid_ids
+
+ self.valid_info = [(v[:-8], v.split('_')[0]) for v in self.valid_ids]
+ self.train_info = []
+ for spk in self.speakers:
+ mel_ids = os.listdir(os.path.join(self.mel_dir, spk))
+ mel_ids = [m for m in mel_ids if m not in self.exceptions]
+ self.train_info += [(i[:-8], spk) for i in mel_ids]
+ print("Total number of validation wavs is %d." % len(self.valid_info))
+ print("Total number of training wavs is %d." % len(self.train_info))
+ print("Total number of training speakers is %d." % len(self.speakers))
+ random.seed(random_seed)
+ random.shuffle(self.train_info)
+ def get_vc_data(self, audio_info):
+ audio_id, spk = audio_info
+ mels = self.get_mels(audio_id, spk)
+ embed = self.get_embed(audio_id, spk)
+ return (mels, embed)
+
+ def get_mels(self, audio_id, spk):
+ mel_path = os.path.join(self.mel_dir, spk, audio_id + '_mel.npy')
+ mels = np.load(mel_path)
+ mels = torch.from_numpy(mels).float()
+ return mels
+
+ def get_embed(self, audio_id, spk):
+ embed_path = os.path.join(self.emb_dir, spk, audio_id + '_embed.npy')
+ embed = np.load(embed_path)
+ embed = torch.from_numpy(embed).float()
+ return embed
+
+ def __getitem__(self, index):
+ mels, embed = self.get_vc_data(self.train_info[index])
+ item = {'mel': mels, 'c': embed}
+ return item
+
+ def __len__(self):
+ return len(self.train_info)
+
+ def get_valid_dataset(self):
+ pairs = []
+ for i in range(len(self.valid_info)):
+ mels, embed = self.get_vc_data(self.valid_info[i])
+ pairs.append((mels, embed))
+ return pairs
+
+# VCTK dataset for training the "average voice" encoder.
+class VCTKEncDataset(Dataset):
+ def __init__(self, config, datasplit):
+ super().__init__(config, datasplit)
+ data_dir=config['data_dir']
+ exc_file=config['exc_file']
+ avg_type=config['avg_type']
+ self.mel_x_dir = os.path.join(data_dir, 'mels')
+ self.mel_y_dir = os.path.join(data_dir, 'mels_%s' % avg_type)
+
+ self.unseen_speakers = get_vctk_unseen_speakers()
+ self.unseen_sentences = get_vctk_unseen_sentences()
+ self.speakers = [spk for spk in os.listdir(self.mel_x_dir)
+ if spk not in self.unseen_speakers]
+ with open(exc_file) as f:
+ exceptions = f.readlines()
+ self.exceptions = [e.strip() + '_mel.npy' for e in exceptions]
+ self.test_info = []
+ self.train_info = []
+ for spk in self.speakers:
+ mel_ids = os.listdir(os.path.join(self.mel_x_dir, spk))
+ mel_ids = [m for m in mel_ids if m.split('_')[1] not in self.unseen_sentences]
+ mel_ids = [m[:-8] for m in mel_ids if m not in self.exceptions]
+ mel_ids = exclude_spn(data_dir, spk, mel_ids)
+ self.train_info += [(m, spk) for m in mel_ids]
+ for spk in self.unseen_speakers:
+ mel_ids = os.listdir(os.path.join(self.mel_x_dir, spk))
+ mel_ids = [m for m in mel_ids if m.split('_')[1] not in self.unseen_sentences]
+ mel_ids = [m[:-8] for m in mel_ids if m not in self.exceptions]
+ self.test_info += [(m, spk) for m in mel_ids]
+ print("Total number of test wavs is %d." % len(self.test_info))
+ print("Total number of training wavs is %d." % len(self.train_info))
+ random.seed(random_seed)
+ random.shuffle(self.train_info)
+
+ def get_vc_data(self, mel_id, spk):
+ mel_x_path = os.path.join(self.mel_x_dir, spk, mel_id + '_mel.npy')
+ mel_y_path = os.path.join(self.mel_y_dir, spk, mel_id + '_avgmel.npy')
+ mel_x = np.load(mel_x_path)
+ mel_y = np.load(mel_y_path)
+ mel_x = torch.from_numpy(mel_x).float()
+ mel_y = torch.from_numpy(mel_y).float()
+ return (mel_x, mel_y)
+
+ def __getitem__(self, index):
+ mel_id, spk = self.train_info[index]
+ mel_x, mel_y = self.get_vc_data(mel_id, spk)
+ return {"x": mel_x, "y": mel_y}
+
+ def __len__(self):
+ return len(self.train_info)
+
+ def get_test_dataset(self):
+ pairs = []
+ for i in range(len(self.test_info)):
+ mel_id, spk = self.test_info[i]
+ mel_x, mel_y = self.get_vc_data(mel_id, spk)
+ pairs.append((mel_x, mel_y))
+ return [{"x": pair[0], "y": pair[1]} for pair in pairs]
+
+# LibriTTS dataset for training the speaker-conditional diffusion-based decoder (train).
+class diffvcDataset(Dataset):
+ def __init__(self, config, datasplit):
+ super().__init__(config, datasplit)
+ data_dir=config['data_dir']
+ exc_file=config['exc_file']
+ val_file=config['val_file']
+ self.mel_dir = os.path.join(data_dir, 'mels')
+ self.emb_dir = os.path.join(data_dir, 'embeds')
+ self.test_speakers = get_test_speakers()
+ self.speakers = [spk for spk in os.listdir(self.mel_dir)
+ if spk not in self.test_speakers]
+ self.speakers = [spk for spk in self.speakers
+ if len(os.listdir(os.path.join(self.mel_dir, spk))) >= 10]
+ random.seed(random_seed)
+ random.shuffle(self.speakers)
+ with open(exc_file) as f:
+ exceptions = f.readlines()
+ self.exceptions = [e.strip() + '_mel.npy' for e in exceptions]
+ with open(val_file) as f:
+ valid_ids = f.readlines()
+ self.valid_ids = set([v.strip() + '_mel.npy' for v in valid_ids])
+ self.exceptions += self.valid_ids
+
+ self.valid_info = [(v[:-8], v.split('_')[0]) for v in self.valid_ids]
+ self.train_info = []
+ for spk in self.speakers:
+ mel_ids = os.listdir(os.path.join(self.mel_dir, spk))
+ mel_ids = [m for m in mel_ids if m not in self.exceptions]
+ self.train_info += [(i[:-8], spk) for i in mel_ids]
+ print("Total number of validation wavs is %d." % len(self.valid_info))
+ print("Total number of training wavs is %d." % len(self.train_info))
+ print("Total number of training speakers is %d." % len(self.speakers))
+ random.seed(random_seed)
+ random.shuffle(self.train_info)
+
+ def get_vc_data(self, audio_info):
+ audio_id, spk = audio_info
+ mels = self.get_mels(audio_id, spk)
+ embed = self.get_embed(audio_id, spk)
+ return (mels, embed)
+
+ def get_mels(self, audio_id, spk):
+ mel_path = os.path.join(self.mel_dir, spk, audio_id + '_mel.npy')
+ mels = np.load(mel_path)
+ mels = torch.from_numpy(mels).float()
+ return mels
+
+ def get_embed(self, audio_id, spk):
+ embed_path = os.path.join(self.emb_dir, spk, audio_id + '_embed.npy')
+ embed = np.load(embed_path)
+ embed = torch.from_numpy(embed).float()
+ return embed
+
+ def __getitem__(self, index):
+ mels, embed = self.get_vc_data(self.train_info[index])
+ return {'mel': mels, 'c': embed}
+
+ def __len__(self):
+ return len(self.train_info)
+
+ def get_valid_dataset(self):
+ pairs = []
+ for i in range(len(self.valid_info)):
+ mels, embed = self.get_vc_data(self.valid_info[i])
+ pairs.append((mels, embed))
+ return [{"mel": pair[0], "c": pair[1]} for pair in pairs]
+
+
+
+
+# VCTK dataset for training the speaker-conditional diffusion-based decoder.
+class VCTKDecDataset(Dataset):
+ def __init__(self,config, datasplit, data_dir):
+ super().__init__(config, datasplit)
+ self.mel_dir = os.path.join(data_dir, 'mels')
+ self.emb_dir = os.path.join(data_dir, 'embeds')
+ self.unseen_speakers = get_vctk_unseen_speakers()
+ self.unseen_sentences = get_vctk_unseen_sentences()
+ self.speakers = [spk for spk in os.listdir(self.mel_dir)
+ if spk not in self.unseen_speakers]
+ random.seed(random_seed)
+ random.shuffle(self.speakers)
+ self.train_info = []
+ for spk in self.speakers:
+ mel_ids = os.listdir(os.path.join(self.mel_dir, spk))
+ mel_ids = [m for m in mel_ids if m.split('_')[1] not in self.unseen_sentences]
+ self.train_info += [(i[:-8], spk) for i in mel_ids]
+ self.valid_info = []
+ for spk in self.unseen_speakers:
+ mel_ids = os.listdir(os.path.join(self.mel_dir, spk))
+ mel_ids = [m for m in mel_ids if m.split('_')[1] not in self.unseen_sentences]
+ self.valid_info += [(i[:-8], spk) for i in mel_ids]
+ print("Total number of validation wavs is %d." % len(self.valid_info))
+ print("Total number of training wavs is %d." % len(self.train_info))
+ print("Total number of training speakers is %d." % len(self.speakers))
+ random.seed(random_seed)
+ random.shuffle(self.train_info)
+
+ def get_vc_data(self, audio_info):
+ audio_id, spk = audio_info
+ mels = self.get_mels(audio_id, spk)
+ embed = self.get_embed(audio_id, spk)
+ return (mels, embed)
+
+ def get_mels(self, audio_id, spk):
+ mel_path = os.path.join(self.mel_dir, spk, audio_id + '_mel.npy')
+ mels = np.load(mel_path)
+ mels = torch.from_numpy(mels).float()
+ return mels
+
+ def get_embed(self, audio_id, spk):
+ embed_path = os.path.join(self.emb_dir, spk, audio_id + '_embed.npy')
+ embed = np.load(embed_path)
+ embed = torch.from_numpy(embed).float()
+ return embed
+
+ def __getitem__(self, index):
+ mels, embed = self.get_vc_data(self.train_info[index])
+ return {'mel': mels, 'c': embed}
+
+ def __len__(self):
+ return len(self.train_info)
+
+ def get_valid_dataset(self):
+ pairs = []
+ for i in range(len(self.valid_info)):
+ mels, embed = self.get_vc_data(self.valid_info[i])
+ pairs.append((mels, embed))
+ return [{"mel": pair[0], "c": pair[1]} for pair in pairs]
+
+class VCDecBatchCollate(object):
+ def __call__(self, batch):
+ B = len(batch)
+ mels1 = torch.zeros((B, n_mels, train_frames), dtype=torch.float32)
+ mels2 = torch.zeros((B, n_mels, train_frames), dtype=torch.float32)
+ max_starts = [max(item['mel'].shape[-1] - train_frames, 0)
+ for item in batch]
+ starts1 = [random.choice(range(m)) if m > 0 else 0 for m in max_starts]
+ starts2 = [random.choice(range(m)) if m > 0 else 0 for m in max_starts]
+ mel_lengths = []
+ for i, item in enumerate(batch):
+ mel = item['mel']
+ if mel.shape[-1] < train_frames:
+ mel_length = mel.shape[-1]
+ else:
+ mel_length = train_frames
+ mels1[i, :, :mel_length] = mel[:, starts1[i]:starts1[i] + mel_length]
+ mels2[i, :, :mel_length] = mel[:, starts2[i]:starts2[i] + mel_length]
+ mel_lengths.append(mel_length)
+ mel_lengths = torch.LongTensor(mel_lengths)
+ embed = torch.stack([item['c'] for item in batch], 0)
+ return {'mel1': mels1, 'mel2': mels2, 'mel_lengths': mel_lengths, 'c': embed}
+
+
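+# Minimal usage sketch (added for documentation; the config keys and paths
+# below are assumptions mirroring diffvc_dataset.yaml, and a plain dict stands
+# in for the toolkit's Config object here):
+#
+#   from torch.utils.data import DataLoader
+#
+#   config = {'data_dir': 'data',
+#             'exc_file': 'data/filelist/exceptions_libritts.txt',
+#             'val_file': 'data/filelist/valid.txt'}
+#   train_set = diffvcDataset(config, datasplit='train')
+#   loader = DataLoader(train_set, batch_size=32, shuffle=True,
+#                       collate_fn=VCDecBatchCollate(), drop_last=True)
+#   batch = next(iter(loader))
+#   # batch['mel1'] and batch['mel2'] are (32, n_mels, train_frames) random
+#   # crops of the same utterances, batch['mel_lengths'] is (32,), and
+#   # batch['c'] stacks one speaker embedding per utterance.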
diff --git a/talkingface/model/.DS_Store b/talkingface/model/.DS_Store
new file mode 100644
index 00000000..76fc83f6
Binary files /dev/null and b/talkingface/model/.DS_Store differ
diff --git a/talkingface/model/voice_conversion/__init__.py b/talkingface/model/voice_conversion/__init__.py
new file mode 100644
index 00000000..0e9dd9f6
--- /dev/null
+++ b/talkingface/model/voice_conversion/__init__.py
@@ -0,0 +1 @@
+from talkingface.model.voice_conversion.diffvc import diffvc
\ No newline at end of file
diff --git a/talkingface/model/voice_conversion/diffvc.py b/talkingface/model/voice_conversion/diffvc.py
new file mode 100644
index 00000000..fbe0c590
--- /dev/null
+++ b/talkingface/model/voice_conversion/diffvc.py
@@ -0,0 +1,1151 @@
+from talkingface.model.abstract_talkingface import AbstractTalkingFace
+
+import math
+
+import numpy as np
+import torch
+import torchaudio
+from torch import nn
+from torch.nn import functional as F
+from torch.nn.utils import clip_grad_norm_
+from einops import rearrange
+from librosa.filters import mel as librosa_mel_fn
+from scipy.interpolate import interp1d
+from scipy.optimize import brentq
+from sklearn.metrics import roc_curve
+
+from talkingface.utils.utils import sequence_mask, fix_len_compatibility, mse_loss, convert_pad_shape
+from talkingface.utils.voice_conversion_talkingface.params_model import *
+from talkingface.utils.voice_conversion_talkingface.params_data import *
+
+
+class SpeakerEncoder(nn.Module):
+ def __init__(self, device, loss_device):
+ super().__init__()
+ self.loss_device = loss_device
+
+ # Network definition
+ self.lstm = nn.LSTM(input_size=mel_n_channels,
+ hidden_size=model_hidden_size,
+ num_layers=model_num_layers,
+ batch_first=True).to(device)
+ self.linear = nn.Linear(in_features=model_hidden_size,
+ out_features=model_embedding_size).to(device)
+ self.relu = torch.nn.ReLU().to(device)
+
+ # Cosine similarity scaling (with fixed initial parameter values)
+ self.similarity_weight = nn.Parameter(torch.tensor([10.])).to(loss_device)
+ self.similarity_bias = nn.Parameter(torch.tensor([-5.])).to(loss_device)
+
+ # Loss
+ self.loss_fn = nn.CrossEntropyLoss().to(loss_device)
+
+ def do_gradient_ops(self):
+ # Gradient scale
+ self.similarity_weight.grad *= 0.01
+ self.similarity_bias.grad *= 0.01
+
+ # Gradient clipping
+ clip_grad_norm_(self.parameters(), 3, norm_type=2)
+
+ def forward(self, utterances, hidden_init=None):
+ """
+ Computes the embeddings of a batch of utterance spectrograms.
+
+ :param utterances: batch of mel-scale filterbanks of same duration as a tensor of shape
+ (batch_size, n_frames, n_channels)
+ :param hidden_init: initial hidden state of the LSTM as a tensor of shape (num_layers,
+ batch_size, hidden_size). Will default to a tensor of zeros if None.
+ :return: the embeddings as a tensor of shape (batch_size, embedding_size)
+ """
+ # Pass the input through the LSTM layers and retrieve all outputs, the final hidden state
+ # and the final cell state.
+ out, (hidden, cell) = self.lstm(utterances, hidden_init)
+
+ # We take only the hidden state of the last layer
+ embeds_raw = self.relu(self.linear(hidden[-1]))
+
+ # L2-normalize it
+ embeds = embeds_raw / torch.norm(embeds_raw, dim=1, keepdim=True)
+
+ return embeds
+
+ def similarity_matrix(self, embeds):
+ """
+ Computes the similarity matrix according to section 2.1 of GE2E.
+
+ :param embeds: the embeddings as a tensor of shape (speakers_per_batch,
+ utterances_per_speaker, embedding_size)
+ :return: the similarity matrix as a tensor of shape (speakers_per_batch,
+ utterances_per_speaker, speakers_per_batch)
+ """
+ speakers_per_batch, utterances_per_speaker = embeds.shape[:2]
+
+ # Inclusive centroids (1 per speaker). Cloning is needed for reverse differentiation
+ centroids_incl = torch.mean(embeds, dim=1, keepdim=True)
+ centroids_incl = centroids_incl.clone() / torch.norm(centroids_incl, dim=2, keepdim=True)
+
+ # Exclusive centroids (1 per utterance)
+ centroids_excl = (torch.sum(embeds, dim=1, keepdim=True) - embeds)
+ centroids_excl /= (utterances_per_speaker - 1)
+ centroids_excl = centroids_excl.clone() / torch.norm(centroids_excl, dim=2, keepdim=True)
+
+ # Similarity matrix. The cosine similarity of already 2-normed vectors is simply the dot
+ # product of these vectors (which is just an element-wise multiplication reduced by a sum).
+ # We vectorize the computation for efficiency.
+ sim_matrix = torch.zeros(speakers_per_batch, utterances_per_speaker,
+ speakers_per_batch).to(self.loss_device)
+ mask_matrix = 1 - np.eye(speakers_per_batch, dtype=int)
+ for j in range(speakers_per_batch):
+ mask = np.where(mask_matrix[j])[0]
+ sim_matrix[mask, :, j] = (embeds[mask] * centroids_incl[j]).sum(dim=2)
+ sim_matrix[j, :, j] = (embeds[j] * centroids_excl[j]).sum(dim=1)
+
+ ## Even more vectorized version (slower maybe because of transpose)
+ # sim_matrix2 = torch.zeros(speakers_per_batch, speakers_per_batch, utterances_per_speaker
+ # ).to(self.loss_device)
+ # eye = np.eye(speakers_per_batch, dtype=np.int)
+ # mask = np.where(1 - eye)
+ # sim_matrix2[mask] = (embeds[mask[0]] * centroids_incl[mask[1]]).sum(dim=2)
+ # mask = np.where(eye)
+ # sim_matrix2[mask] = (embeds * centroids_excl).sum(dim=2)
+ # sim_matrix2 = sim_matrix2.transpose(1, 2)
+
+ sim_matrix = sim_matrix * self.similarity_weight + self.similarity_bias
+ return sim_matrix
+
+ def loss(self, embeds):
+ """
+ Computes the softmax loss according to section 2.1 of GE2E.
+
+ :param embeds: the embeddings as a tensor of shape (speakers_per_batch,
+ utterances_per_speaker, embedding_size)
+ :return: the loss and the EER for this batch of embeddings.
+ """
+ speakers_per_batch, utterances_per_speaker = embeds.shape[:2]
+
+ # Loss
+ sim_matrix = self.similarity_matrix(embeds)
+ sim_matrix = sim_matrix.reshape((speakers_per_batch * utterances_per_speaker,
+ speakers_per_batch))
+ ground_truth = np.repeat(np.arange(speakers_per_batch), utterances_per_speaker)
+ target = torch.from_numpy(ground_truth).long().to(self.loss_device)
+ loss = self.loss_fn(sim_matrix, target)
+
+ # EER (not backpropagated)
+ with torch.no_grad():
+ inv_argmax = lambda i: np.eye(1, speakers_per_batch, i, dtype=int)[0]
+ labels = np.array([inv_argmax(i) for i in ground_truth])
+ preds = sim_matrix.detach().cpu().numpy()
+
+ # Snippet from https://yangcha.github.io/EER-ROC/
+ fpr, tpr, thresholds = roc_curve(labels.flatten(), preds.flatten())
+ eer = brentq(lambda x: 1. - x - interp1d(fpr, tpr)(x), 0., 1.)
+
+ return loss, eer
+
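+# Clarifying note (added comment, not in the upstream Real-Time-Voice-Cloning
+# code): forward() maps a (speakers_per_batch * utterances_per_speaker,
+# n_frames, mel_n_channels) batch to L2-normalized embeddings of size
+# model_embedding_size; similarity_matrix() and loss() then expect those
+# embeddings regrouped per speaker, e.g.
+# embeds.view(speakers_per_batch, utterances_per_speaker, -1).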
+#base
+class BaseModule(torch.nn.Module):
+ def __init__(self):
+ super(BaseModule, self).__init__()
+
+ @property
+ def nparams(self):
+ num_params = 0
+ for name, param in self.named_parameters():
+ if param.requires_grad:
+ num_params += np.prod(param.detach().cpu().numpy().shape)
+ return num_params
+
+
+ def relocate_input(self, x: list):
+ device = next(self.parameters()).device
+ for i in range(len(x)):
+ if isinstance(x[i], torch.Tensor) and x[i].device != device:
+ x[i] = x[i].to(device)
+ return x
+#modules
+class Mish(BaseModule):
+ def forward(self, x):
+ return x * torch.tanh(torch.nn.functional.softplus(x))
+
+
+class Upsample(BaseModule):
+ def __init__(self, dim):
+ super(Upsample, self).__init__()
+ self.conv = torch.nn.ConvTranspose2d(dim, dim, 4, 2, 1)
+
+ def forward(self, x):
+ return self.conv(x)
+
+
+class Downsample(BaseModule):
+ def __init__(self, dim):
+ super(Downsample, self).__init__()
+ self.conv = torch.nn.Conv2d(dim, dim, 3, 2, 1)
+
+ def forward(self, x):
+ return self.conv(x)
+
+
+class Rezero(BaseModule):
+ def __init__(self, fn):
+ super(Rezero, self).__init__()
+ self.fn = fn
+ self.g = torch.nn.Parameter(torch.zeros(1))
+
+ def forward(self, x):
+ return self.fn(x) * self.g
+
+
+class Block(BaseModule):
+ def __init__(self, dim, dim_out, groups=8):
+ super(Block, self).__init__()
+ self.block = torch.nn.Sequential(torch.nn.Conv2d(dim, dim_out, 3,
+ padding=1), torch.nn.GroupNorm(
+ groups, dim_out), Mish())
+
+ def forward(self, x, mask):
+ output = self.block(x * mask)
+ return output * mask
+
+
+class ResnetBlock(BaseModule):
+ def __init__(self, dim, dim_out, time_emb_dim, groups=8):
+ super(ResnetBlock, self).__init__()
+ self.mlp = torch.nn.Sequential(Mish(), torch.nn.Linear(time_emb_dim,
+ dim_out))
+
+ self.block1 = Block(dim, dim_out)
+ self.block2 = Block(dim_out, dim_out)
+ if dim != dim_out:
+ self.res_conv = torch.nn.Conv2d(dim, dim_out, 1)
+ else:
+ self.res_conv = torch.nn.Identity()
+
+ def forward(self, x, mask, time_emb):
+ h = self.block1(x, mask)
+ h += self.mlp(time_emb).unsqueeze(-1).unsqueeze(-1)
+ h = self.block2(h, mask)
+ output = h + self.res_conv(x * mask)
+ return output
+
+
+class LinearAttention(BaseModule):
+ def __init__(self, dim, heads=4, dim_head=32):
+ super(LinearAttention, self).__init__()
+ self.heads = heads
+ hidden_dim = dim_head * heads
+ self.to_qkv = torch.nn.Conv2d(dim, hidden_dim * 3, 1, bias=False)
+ self.to_out = torch.nn.Conv2d(hidden_dim, dim, 1)
+
+ def forward(self, x):
+ b, c, h, w = x.shape
+ qkv = self.to_qkv(x)
+ q, k, v = rearrange(qkv, 'b (qkv heads c) h w -> qkv b heads c (h w)',
+ heads = self.heads, qkv=3)
+ k = k.softmax(dim=-1)
+ context = torch.einsum('bhdn,bhen->bhde', k, v)
+ out = torch.einsum('bhde,bhdn->bhen', context, q)
+ out = rearrange(out, 'b heads c (h w) -> b (heads c) h w',
+ heads=self.heads, h=h, w=w)
+ return self.to_out(out)
+
+
+class Residual(BaseModule):
+ def __init__(self, fn):
+ super(Residual, self).__init__()
+ self.fn = fn
+
+ def forward(self, x, *args, **kwargs):
+ output = self.fn(x, *args, **kwargs) + x
+ return output
+
+
+class SinusoidalPosEmb(BaseModule):
+ def __init__(self, dim):
+ super(SinusoidalPosEmb, self).__init__()
+ self.dim = dim
+
+ def forward(self, x):
+ device = x.device
+ half_dim = self.dim // 2
+ emb = math.log(10000) / (half_dim - 1)
+ emb = torch.exp(torch.arange(half_dim, device=device).float() * -emb)
+ emb = 1000.0 * x.unsqueeze(1) * emb.unsqueeze(0)
+ emb = torch.cat((emb.sin(), emb.cos()), dim=-1)
+ return emb
+
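+# Added note: this is the standard transformer sinusoidal encoding evaluated
+# at 1000 * t, i.e. emb = [sin(1000 * t * w_0), ..., sin(1000 * t * w_{d/2-1}),
+# cos(1000 * t * w_0), ..., cos(1000 * t * w_{d/2-1})] with
+# w_j = exp(-j * ln(10000) / (d / 2 - 1)), where d = dim and t is the
+# continuous diffusion time passed to the estimator.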
+
+class RefBlock(BaseModule):
+ def __init__(self, out_dim, time_emb_dim):
+ super(RefBlock, self).__init__()
+ base_dim = out_dim // 4
+ self.mlp1 = torch.nn.Sequential(Mish(), torch.nn.Linear(time_emb_dim,
+ base_dim))
+ self.mlp2 = torch.nn.Sequential(Mish(), torch.nn.Linear(time_emb_dim,
+ 2 * base_dim))
+ self.block11 = torch.nn.Sequential(torch.nn.Conv2d(1, 2 * base_dim,
+ 3, 1, 1), torch.nn.InstanceNorm2d(2 * base_dim, affine=True),
+ torch.nn.GLU(dim=1))
+ self.block12 = torch.nn.Sequential(torch.nn.Conv2d(base_dim, 2 * base_dim,
+ 3, 1, 1), torch.nn.InstanceNorm2d(2 * base_dim, affine=True),
+ torch.nn.GLU(dim=1))
+ self.block21 = torch.nn.Sequential(torch.nn.Conv2d(base_dim, 4 * base_dim,
+ 3, 1, 1), torch.nn.InstanceNorm2d(4 * base_dim, affine=True),
+ torch.nn.GLU(dim=1))
+ self.block22 = torch.nn.Sequential(torch.nn.Conv2d(2 * base_dim, 4 * base_dim,
+ 3, 1, 1), torch.nn.InstanceNorm2d(4 * base_dim, affine=True),
+ torch.nn.GLU(dim=1))
+ self.block31 = torch.nn.Sequential(torch.nn.Conv2d(2 * base_dim, 8 * base_dim,
+ 3, 1, 1), torch.nn.InstanceNorm2d(8 * base_dim, affine=True),
+ torch.nn.GLU(dim=1))
+ self.block32 = torch.nn.Sequential(torch.nn.Conv2d(4 * base_dim, 8 * base_dim,
+ 3, 1, 1), torch.nn.InstanceNorm2d(8 * base_dim, affine=True),
+ torch.nn.GLU(dim=1))
+ self.final_conv = torch.nn.Conv2d(4 * base_dim, out_dim, 1)
+
+ def forward(self, x, mask, time_emb):
+ y = self.block11(x * mask)
+ y = self.block12(y * mask)
+ y += self.mlp1(time_emb).unsqueeze(-1).unsqueeze(-1)
+ y = self.block21(y * mask)
+ y = self.block22(y * mask)
+ y += self.mlp2(time_emb).unsqueeze(-1).unsqueeze(-1)
+ y = self.block31(y * mask)
+ y = self.block32(y * mask)
+ y = self.final_conv(y * mask)
+ return (y * mask).sum((2, 3)) / (mask.sum((2, 3)) * x.shape[2])
+
+#diffusion
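+# U-Net style score estimator: the noisy mel, the "average voice" mean and the
+# broadcast conditioning maps are stacked channel-wise, then passed through
+# downsampling / upsampling ResNet blocks with linear attention.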
+class GradLogPEstimator(BaseModule):
+ def __init__(self, dim_base, dim_cond, use_ref_t, dim_mults=(1, 2, 4)):
+ super(GradLogPEstimator, self).__init__()
+ self.use_ref_t = use_ref_t
+ dims = [2 + dim_cond, *map(lambda m: dim_base * m, dim_mults)]
+ in_out = list(zip(dims[:-1], dims[1:]))
+
+ self.time_pos_emb = SinusoidalPosEmb(dim_base)
+ self.mlp = torch.nn.Sequential(torch.nn.Linear(dim_base, dim_base * 4),
+ Mish(), torch.nn.Linear(dim_base * 4, dim_base))
+
+ cond_total = dim_base + 256
+ if use_ref_t:
+ self.ref_block = RefBlock(out_dim=dim_cond, time_emb_dim=dim_base)
+ cond_total += dim_cond
+ self.cond_block = torch.nn.Sequential(torch.nn.Linear(cond_total, 4 * dim_cond),
+ Mish(), torch.nn.Linear(4 * dim_cond, dim_cond))
+
+ self.downs = torch.nn.ModuleList([])
+ self.ups = torch.nn.ModuleList([])
+ num_resolutions = len(in_out)
+
+ for ind, (dim_in, dim_out) in enumerate(in_out):
+ is_last = ind >= (num_resolutions - 1)
+ self.downs.append(torch.nn.ModuleList([
+ ResnetBlock(dim_in, dim_out,time_emb_dim=dim_base),
+ ResnetBlock(dim_out, dim_out, time_emb_dim=dim_base),
+ Residual(Rezero(LinearAttention(dim_out))),
+ Downsample(dim_out) if not is_last else torch.nn.Identity()]))
+
+ mid_dim = dims[-1]
+ self.mid_block1 = ResnetBlock(mid_dim, mid_dim, time_emb_dim=dim_base)
+ self.mid_attn = Residual(Rezero(LinearAttention(mid_dim)))
+ self.mid_block2 = ResnetBlock(mid_dim, mid_dim, time_emb_dim=dim_base)
+
+ for ind, (dim_in, dim_out) in enumerate(reversed(in_out[1:])):
+ self.ups.append(torch.nn.ModuleList([
+ ResnetBlock(dim_out * 2, dim_in, time_emb_dim=dim_base),
+ ResnetBlock(dim_in, dim_in, time_emb_dim=dim_base),
+ Residual(Rezero(LinearAttention(dim_in))),
+ Upsample(dim_in)]))
+ self.final_block = Block(dim_base, dim_base)
+ self.final_conv = torch.nn.Conv2d(dim_base, 1, 1)
+
+ def forward(self, x, x_mask, mean, ref, ref_mask, c, t):
+ condition = self.time_pos_emb(t)
+ t = self.mlp(condition)
+
+ x = torch.stack([mean, x], 1)
+ x_mask = x_mask.unsqueeze(1)
+ ref_mask = ref_mask.unsqueeze(1)
+
+ if self.use_ref_t:
+ condition = torch.cat([condition, self.ref_block(ref, ref_mask, t)], 1)
+ condition = torch.cat([condition, c], 1)
+
+ condition = self.cond_block(condition).unsqueeze(-1).unsqueeze(-1)
+ condition = torch.cat(x.shape[2]*[condition], 2)
+ condition = torch.cat(x.shape[3]*[condition], 3)
+ x = torch.cat([x, condition], 1)
+
+ hiddens = []
+ masks = [x_mask]
+ for resnet1, resnet2, attn, downsample in self.downs:
+ mask_down = masks[-1]
+ x = resnet1(x, mask_down, t)
+ x = resnet2(x, mask_down, t)
+ x = attn(x)
+ hiddens.append(x)
+ x = downsample(x * mask_down)
+ masks.append(mask_down[:, :, :, ::2])
+
+ masks = masks[:-1]
+ mask_mid = masks[-1]
+ x = self.mid_block1(x, mask_mid, t)
+ x = self.mid_attn(x)
+ x = self.mid_block2(x, mask_mid, t)
+
+ for resnet1, resnet2, attn, upsample in self.ups:
+ mask_up = masks.pop()
+ x = torch.cat((x, hiddens.pop()), dim=1)
+ x = resnet1(x, mask_up, t)
+ x = resnet2(x, mask_up, t)
+ x = attn(x)
+ x = upsample(x * mask_up)
+
+ x = self.final_block(x, x_mask)
+ output = self.final_conv(x * x_mask)
+
+ return (output * x_mask).squeeze(1)
+
+
+class Diffusion(BaseModule):
+ def __init__(self, n_feats, dim_unet, dim_spk, use_ref_t, beta_min, beta_max):
+ super(Diffusion, self).__init__()
+ self.estimator = GradLogPEstimator(dim_unet, dim_spk, use_ref_t)
+ self.n_feats = n_feats
+ self.dim_unet = dim_unet
+ self.dim_spk = dim_spk
+ self.use_ref_t = use_ref_t
+ self.beta_min = beta_min
+ self.beta_max = beta_max
+
+ def get_beta(self, t):
+ beta = self.beta_min + (self.beta_max - self.beta_min) * t
+ return beta
+
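+ # gamma(s, t) = exp(-0.5 * p * \int_s^t beta(u) du); for the linear schedule
+ # beta(u) = beta_min + (beta_max - beta_min) * u the integral has the closed
+ # form computed below.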
+ def get_gamma(self, s, t, p=1.0, use_torch=False):
+ beta_integral = self.beta_min + 0.5*(self.beta_max - self.beta_min)*(t + s)
+ beta_integral *= (t - s)
+ if use_torch:
+ gamma = torch.exp(-0.5*p*beta_integral).unsqueeze(-1).unsqueeze(-1)
+ else:
+ gamma = math.exp(-0.5*p*beta_integral)
+ return gamma
+
+ def get_mu(self, s, t):
+ a = self.get_gamma(s, t)
+ b = 1.0 - self.get_gamma(0, s, p=2.0)
+ c = 1.0 - self.get_gamma(0, t, p=2.0)
+ return a * b / c
+
+ def get_nu(self, s, t):
+ a = self.get_gamma(0, s)
+ b = 1.0 - self.get_gamma(s, t, p=2.0)
+ c = 1.0 - self.get_gamma(0, t, p=2.0)
+ return a * b / c
+
+ def get_sigma(self, s, t):
+ a = 1.0 - self.get_gamma(0, s, p=2.0)
+ b = 1.0 - self.get_gamma(s, t, p=2.0)
+ c = 1.0 - self.get_gamma(0, t, p=2.0)
+ return math.sqrt(a * b / c)
+
+ def compute_diffused_mean(self, x0, mask, mean, t, use_torch=False):
+ x0_weight = self.get_gamma(0, t, use_torch=use_torch)
+ mean_weight = 1.0 - x0_weight
+ xt_mean = x0 * x0_weight + mean * mean_weight
+ return xt_mean * mask
+
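+ # Forward diffusion: x_t ~ N(gamma(0,t) * x0 + (1 - gamma(0,t)) * mean, (1 - gamma(0,t)^2) * I),
+ # i.e. the mel drifts towards the "average voice" mean while Gaussian noise is added.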
+ def forward_diffusion(self, x0, mask, mean, t):
+ xt_mean = self.compute_diffused_mean(x0, mask, mean, t, use_torch=True)
+ variance = 1.0 - self.get_gamma(0, t, p=2.0, use_torch=True)
+ z = torch.randn(x0.shape, dtype=x0.dtype, device=x0.device, requires_grad=False)
+ xt = xt_mean + z * torch.sqrt(variance)
+ return xt * mask, z * mask
+
+ @torch.no_grad()
+ def reverse_diffusion(self, z, mask, mean, ref, ref_mask, mean_ref, c,
+ n_timesteps, mode):
+ h = 1.0 / n_timesteps
+ xt = z * mask
+ for i in range(n_timesteps):
+ t = 1.0 - i*h
+ time = t * torch.ones(z.shape[0], dtype=z.dtype, device=z.device)
+ beta_t = self.get_beta(t)
+ xt_ref = [self.compute_diffused_mean(ref, ref_mask, mean_ref, t)]
+# for j in range(15):
+# xt_ref += [self.compute_diffused_mean(ref, ref_mask, mean_ref, (j+0.5)/15.0)]
+ xt_ref = torch.stack(xt_ref, 1)
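+ # 'pf' uses a probability-flow ODE step; 'ml' uses the maximum-likelihood SDE
+ # solver coefficients (kappa, omega, sigma); 'em' falls back to plain Euler-Maruyama.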
+ if mode == 'pf':
+ dxt = 0.5 * (mean - xt - self.estimator(xt, mask, mean, xt_ref, ref_mask, c, time)) * (beta_t * h)
+ else:
+ if mode == 'ml':
+ kappa = self.get_gamma(0, t - h) * (1.0 - self.get_gamma(t - h, t, p=2.0))
+ kappa /= (self.get_gamma(0, t) * beta_t * h)
+ kappa -= 1.0
+ omega = self.get_nu(t - h, t) / self.get_gamma(0, t)
+ omega += self.get_mu(t - h, t)
+ omega -= (0.5 * beta_t * h + 1.0)
+ sigma = self.get_sigma(t - h, t)
+ else:
+ kappa = 0.0
+ omega = 0.0
+ sigma = math.sqrt(beta_t * h)
+ dxt = (mean - xt) * (0.5 * beta_t * h + omega)
+ dxt -= self.estimator(xt, mask, mean, xt_ref, ref_mask, c, time) * (1.0 + kappa) * (beta_t * h)
+ dxt += torch.randn_like(z, device=z.device) * sigma
+ xt = (xt - dxt) * mask
+ return xt
+
+ @torch.no_grad()
+ def forward(self, z, mask, mean, ref, ref_mask, mean_ref, c,
+ n_timesteps, mode):
+ if mode not in ['pf', 'em', 'ml']:
+ print('Inference mode must be one of [pf, em, ml]!')
+ return z
+ return self.reverse_diffusion(z, mask, mean, ref, ref_mask, mean_ref, c,
+ n_timesteps, mode)
+
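+ # Score-matching loss: the estimator output, scaled by the diffusion standard
+ # deviation, should cancel the injected noise z; t is sampled uniformly in (0, 1).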
+ def loss_t(self, x0, mask, mean, x_ref, mean_ref, c, t):
+ xt, z = self.forward_diffusion(x0, mask, mean, t)
+ xt_ref = [self.compute_diffused_mean(x_ref, mask, mean_ref, t, use_torch=True)]
+# for j in range(15):
+# xt_ref += [self.compute_diffused_mean(x_ref, mask, mean_ref, (j+0.5)/15.0)]
+ xt_ref = torch.stack(xt_ref, 1)
+ z_estimation = self.estimator(xt, mask, mean, xt_ref, mask, c, t)
+ z_estimation *= torch.sqrt(1.0 - self.get_gamma(0, t, p=2.0, use_torch=True))
+ loss = torch.sum((z_estimation + z)**2) / (torch.sum(mask)*self.n_feats)
+ return loss
+
+ def compute_loss(self, x0, mask, mean, x_ref, mean_ref, c, offset=1e-5):
+ b = x0.shape[0]
+ t = torch.rand(b, dtype=x0.dtype, device=x0.device, requires_grad=False)
+ t = torch.clamp(t, offset, 1.0 - offset)
+ return self.loss_t(x0, mask, mean, x_ref, mean_ref, c, t)
+#encoder
+class LayerNorm(BaseModule):
+ def __init__(self, channels, eps=1e-4):
+ super(LayerNorm, self).__init__()
+ self.channels = channels
+ self.eps = eps
+
+ self.gamma = torch.nn.Parameter(torch.ones(channels))
+ self.beta = torch.nn.Parameter(torch.zeros(channels))
+
+ def forward(self, x):
+ n_dims = len(x.shape)
+ mean = torch.mean(x, 1, keepdim=True)
+ variance = torch.mean((x - mean)**2, 1, keepdim=True)
+
+ x = (x - mean) * torch.rsqrt(variance + self.eps)
+
+ shape = [1, -1] + [1] * (n_dims - 2)
+ x = x * self.gamma.view(*shape) + self.beta.view(*shape)
+ return x
+
+
+class ConvReluNorm(BaseModule):
+ def __init__(self, in_channels, hidden_channels, out_channels, kernel_size,
+ n_layers, p_dropout):
+ super(ConvReluNorm, self).__init__()
+ self.in_channels = in_channels
+ self.hidden_channels = hidden_channels
+ self.out_channels = out_channels
+ self.kernel_size = kernel_size
+ self.n_layers = n_layers
+ self.p_dropout = p_dropout
+
+ self.conv_layers = torch.nn.ModuleList()
+ self.norm_layers = torch.nn.ModuleList()
+ self.conv_layers.append(torch.nn.Conv1d(in_channels, hidden_channels,
+ kernel_size, padding=kernel_size//2))
+ self.norm_layers.append(LayerNorm(hidden_channels))
+ self.relu_drop = torch.nn.Sequential(torch.nn.ReLU(), torch.nn.Dropout(p_dropout))
+ for _ in range(n_layers - 1):
+ self.conv_layers.append(torch.nn.Conv1d(hidden_channels, hidden_channels,
+ kernel_size, padding=kernel_size//2))
+ self.norm_layers.append(LayerNorm(hidden_channels))
+ self.proj = torch.nn.Conv1d(hidden_channels, out_channels, 1)
+ self.proj.weight.data.zero_()
+ self.proj.bias.data.zero_()
+
+ def forward(self, x, x_mask):
+ x_org = x
+ for i in range(self.n_layers):
+ x = self.conv_layers[i](x * x_mask)
+ x = self.norm_layers[i](x)
+ x = self.relu_drop(x)
+ x = x_org + self.proj(x)
+ return x * x_mask
+
+
+class MultiHeadAttention(BaseModule):
+ def __init__(self, channels, out_channels, n_heads, window_size=None,
+ heads_share=True, p_dropout=0.0, proximal_bias=False,
+ proximal_init=False):
+ super(MultiHeadAttention, self).__init__()
+ assert channels % n_heads == 0
+
+ self.channels = channels
+ self.out_channels = out_channels
+ self.n_heads = n_heads
+ self.window_size = window_size
+ self.heads_share = heads_share
+ self.proximal_bias = proximal_bias
+ self.p_dropout = p_dropout
+ self.attn = None
+
+ self.k_channels = channels // n_heads
+ self.conv_q = torch.nn.Conv1d(channels, channels, 1)
+ self.conv_k = torch.nn.Conv1d(channels, channels, 1)
+ self.conv_v = torch.nn.Conv1d(channels, channels, 1)
+ if window_size is not None:
+ n_heads_rel = 1 if heads_share else n_heads
+ rel_stddev = self.k_channels**-0.5
+ self.emb_rel_k = torch.nn.Parameter(torch.randn(n_heads_rel,
+ window_size * 2 + 1, self.k_channels) * rel_stddev)
+ self.emb_rel_v = torch.nn.Parameter(torch.randn(n_heads_rel,
+ window_size * 2 + 1, self.k_channels) * rel_stddev)
+ self.conv_o = torch.nn.Conv1d(channels, out_channels, 1)
+ self.drop = torch.nn.Dropout(p_dropout)
+
+ torch.nn.init.xavier_uniform_(self.conv_q.weight)
+ torch.nn.init.xavier_uniform_(self.conv_k.weight)
+ if proximal_init:
+ self.conv_k.weight.data.copy_(self.conv_q.weight.data)
+ self.conv_k.bias.data.copy_(self.conv_q.bias.data)
+ torch.nn.init.xavier_uniform_(self.conv_v.weight)
+
+ def forward(self, x, c, attn_mask=None):
+ q = self.conv_q(x)
+ k = self.conv_k(c)
+ v = self.conv_v(c)
+
+ x, self.attn = self.attention(q, k, v, mask=attn_mask)
+
+ x = self.conv_o(x)
+ return x
+
+ def attention(self, query, key, value, mask=None):
+ b, d, t_s, t_t = (*key.size(), query.size(2))
+ query = query.view(b, self.n_heads, self.k_channels, t_t).transpose(2, 3)
+ key = key.view(b, self.n_heads, self.k_channels, t_s).transpose(2, 3)
+ value = value.view(b, self.n_heads, self.k_channels, t_s).transpose(2, 3)
+
+ scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(self.k_channels)
+ if self.window_size is not None:
+ assert t_s == t_t, "Relative attention is only available for self-attention."
+ key_relative_embeddings = self._get_relative_embeddings(self.emb_rel_k, t_s)
+ rel_logits = self._matmul_with_relative_keys(query, key_relative_embeddings)
+ rel_logits = self._relative_position_to_absolute_position(rel_logits)
+ scores_local = rel_logits / math.sqrt(self.k_channels)
+ scores = scores + scores_local
+ if self.proximal_bias:
+ assert t_s == t_t, "Proximal bias is only available for self-attention."
+ scores = scores + self._attention_bias_proximal(t_s).to(device=scores.device,
+ dtype=scores.dtype)
+ if mask is not None:
+ scores = scores.masked_fill(mask == 0, -1e4)
+ p_attn = torch.nn.functional.softmax(scores, dim=-1)
+ p_attn = self.drop(p_attn)
+ output = torch.matmul(p_attn, value)
+ if self.window_size is not None:
+ relative_weights = self._absolute_position_to_relative_position(p_attn)
+ value_relative_embeddings = self._get_relative_embeddings(self.emb_rel_v, t_s)
+ output = output + self._matmul_with_relative_values(relative_weights,
+ value_relative_embeddings)
+ output = output.transpose(2, 3).contiguous().view(b, d, t_t)
+ return output, p_attn
+
+ def _matmul_with_relative_values(self, x, y):
+ ret = torch.matmul(x, y.unsqueeze(0))
+ return ret
+
+ def _matmul_with_relative_keys(self, x, y):
+ ret = torch.matmul(x, y.unsqueeze(0).transpose(-2, -1))
+ return ret
+
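+ # The helpers below pad, slice and reshape the windowed relative positional
+ # embeddings that are added to the attention scores when window_size is set.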
+ def _get_relative_embeddings(self, relative_embeddings, length):
+ pad_length = max(length - (self.window_size + 1), 0)
+ slice_start_position = max((self.window_size + 1) - length, 0)
+ slice_end_position = slice_start_position + 2 * length - 1
+ if pad_length > 0:
+ padded_relative_embeddings = torch.nn.functional.pad(
+ relative_embeddings, convert_pad_shape([[0, 0],
+ [pad_length, pad_length], [0, 0]]))
+ else:
+ padded_relative_embeddings = relative_embeddings
+ used_relative_embeddings = padded_relative_embeddings[:,
+ slice_start_position:slice_end_position]
+ return used_relative_embeddings
+
+ def _relative_position_to_absolute_position(self, x):
+ batch, heads, length, _ = x.size()
+ x = torch.nn.functional.pad(x, convert_pad_shape([[0,0],[0,0],[0,0],[0,1]]))
+ x_flat = x.view([batch, heads, length * 2 * length])
+ x_flat = torch.nn.functional.pad(x_flat, convert_pad_shape([[0,0],[0,0],[0,length-1]]))
+ x_final = x_flat.view([batch, heads, length+1, 2*length-1])[:, :, :length, length-1:]
+ return x_final
+
+ def _absolute_position_to_relative_position(self, x):
+ batch, heads, length, _ = x.size()
+ x = torch.nn.functional.pad(x, convert_pad_shape([[0, 0], [0, 0], [0, 0], [0, length-1]]))
+ x_flat = x.view([batch, heads, length**2 + length*(length - 1)])
+ x_flat = torch.nn.functional.pad(x_flat, convert_pad_shape([[0, 0], [0, 0], [length, 0]]))
+ x_final = x_flat.view([batch, heads, length, 2*length])[:,:,:,1:]
+ return x_final
+
+ def _attention_bias_proximal(self, length):
+ r = torch.arange(length, dtype=torch.float32)
+ diff = torch.unsqueeze(r, 0) - torch.unsqueeze(r, 1)
+ return torch.unsqueeze(torch.unsqueeze(-torch.log1p(torch.abs(diff)), 0), 0)
+
+
+class FFN(BaseModule):
+ def __init__(self, in_channels, out_channels, filter_channels, kernel_size,
+ p_dropout=0.0):
+ super(FFN, self).__init__()
+ self.in_channels = in_channels
+ self.out_channels = out_channels
+ self.filter_channels = filter_channels
+ self.kernel_size = kernel_size
+ self.p_dropout = p_dropout
+
+ self.conv_1 = torch.nn.Conv1d(in_channels, filter_channels, kernel_size,
+ padding=kernel_size//2)
+ self.conv_2 = torch.nn.Conv1d(filter_channels, out_channels, kernel_size,
+ padding=kernel_size//2)
+ self.drop = torch.nn.Dropout(p_dropout)
+
+ def forward(self, x, x_mask):
+ x = self.conv_1(x * x_mask)
+ x = torch.relu(x)
+ x = self.drop(x)
+ x = self.conv_2(x * x_mask)
+ return x * x_mask
+
+
+class Encoder(BaseModule):
+ def __init__(self, hidden_channels, filter_channels, n_heads, n_layers,
+ kernel_size=1, p_dropout=0.0, window_size=None, **kwargs):
+ super(Encoder, self).__init__()
+ self.hidden_channels = hidden_channels
+ self.filter_channels = filter_channels
+ self.n_heads = n_heads
+ self.n_layers = n_layers
+ self.kernel_size = kernel_size
+ self.p_dropout = p_dropout
+ self.window_size = window_size
+
+ self.drop = torch.nn.Dropout(p_dropout)
+ self.attn_layers = torch.nn.ModuleList()
+ self.norm_layers_1 = torch.nn.ModuleList()
+ self.ffn_layers = torch.nn.ModuleList()
+ self.norm_layers_2 = torch.nn.ModuleList()
+ for _ in range(self.n_layers):
+ self.attn_layers.append(MultiHeadAttention(hidden_channels, hidden_channels,
+ n_heads, window_size=window_size, p_dropout=p_dropout))
+ self.norm_layers_1.append(LayerNorm(hidden_channels))
+ self.ffn_layers.append(FFN(hidden_channels, hidden_channels,
+ filter_channels, kernel_size, p_dropout=p_dropout))
+ self.norm_layers_2.append(LayerNorm(hidden_channels))
+
+ def forward(self, x, x_mask):
+ attn_mask = x_mask.unsqueeze(2) * x_mask.unsqueeze(-1)
+ for i in range(self.n_layers):
+ x = x * x_mask
+ y = self.attn_layers[i](x, x, attn_mask)
+ y = self.drop(y)
+ x = self.norm_layers_1[i](x + y)
+ y = self.ffn_layers[i](x, x_mask)
+ y = self.drop(y)
+ x = self.norm_layers_2[i](x + y)
+ x = x * x_mask
+ return x
+
+
+class MelEncoder(BaseModule):
+ def __init__(self, n_feats, channels, filters, heads, layers, kernel,
+ dropout, window_size=None):
+ super(MelEncoder, self).__init__()
+ self.n_feats = n_feats
+ self.channels = channels
+ self.filters = filters
+ self.heads = heads
+ self.layers = layers
+ self.kernel = kernel
+ self.dropout = dropout
+ self.window_size = window_size
+ self.init_proj = torch.nn.Conv1d(n_feats, channels, 1)
+ self.prenet = ConvReluNorm(channels, channels, channels,
+ kernel_size=5, n_layers=3, p_dropout=0.5)
+
+ self.encoder = Encoder(channels, filters, heads, layers, kernel,
+ dropout, window_size=window_size)
+
+ self.term_proj = torch.nn.Conv1d(channels, n_feats, 1)
+
+ def forward(self, x, x_mask):
+ x = self.init_proj(x * x_mask)
+ x = self.prenet(x, x_mask)
+ x = self.encoder(x, x_mask)
+ x = self.term_proj(x * x_mask)
+ return x
+
+#layer
+class Conv2d(nn.Module):
+ def __init__(self, cin, cout, kernel_size, stride, padding, residual=False, *args, **kwargs):
+ super().__init__(*args, **kwargs)
+ self.conv_block = nn.Sequential(
+ nn.Conv2d(cin, cout, kernel_size, stride, padding),
+ nn.BatchNorm2d(cout)
+ )
+ self.act = nn.ReLU()
+ self.residual = residual
+
+ def forward(self, x):
+ out = self.conv_block(x)
+ if self.residual:
+ out += x
+ return self.act(out)
+
+class nonorm_Conv2d(nn.Module):
+ def __init__(self, cin, cout, kernel_size, stride, padding, residual=False, *args, **kwargs):
+ super().__init__(*args, **kwargs)
+ self.conv_block = nn.Sequential(
+ nn.Conv2d(cin, cout, kernel_size, stride, padding),
+ )
+ self.act = nn.LeakyReLU(0.01, inplace=True)
+
+ def forward(self, x):
+ out = self.conv_block(x)
+ return self.act(out)
+
+class Conv2dTranspose(nn.Module):
+ def __init__(self, cin, cout, kernel_size, stride, padding, output_padding=0, *args, **kwargs):
+ super().__init__(*args, **kwargs)
+ self.conv_block = nn.Sequential(
+ nn.ConvTranspose2d(cin, cout, kernel_size, stride, padding, output_padding),
+ nn.BatchNorm2d(cout)
+ )
+ self.act = nn.ReLU()
+
+ def forward(self, x):
+ out = self.conv_block(x)
+ return self.act(out)
+
+#postnet
+class Block1(BaseModule):
+ def __init__(self, dim, groups=8):
+ super(Block1, self).__init__()
+ self.block = torch.nn.Sequential(torch.nn.Conv2d(dim, dim, 7,
+ padding=3), torch.nn.GroupNorm(groups, dim), Mish())
+
+ def forward(self, x, mask):
+ output = self.block(x * mask)
+ return output * mask
+
+
+class ResnetBlock1(BaseModule):
+ def __init__(self, dim, groups=8):
+ super(ResnetBlock1, self).__init__()
+ self.block1 = Block1(dim)
+ self.block2 = Block1(dim)
+ self.res = torch.nn.Conv2d(dim, dim, 1)
+
+ def forward(self, x, mask):
+ h = self.block1(x, mask)
+ h = self.block2(h, mask)
+ output = self.res(x * mask) + h
+ return output
+
+
+class PostNet(BaseModule):
+ def __init__(self, dim, groups=8):
+ super(PostNet, self).__init__()
+ self.init_conv = torch.nn.Conv2d(1, dim, 1)
+ self.res_block = ResnetBlock1(dim, groups=groups)
+ self.final_conv = torch.nn.Conv2d(dim, 1, 1)
+
+ def forward(self, x, mask):
+ x = x.unsqueeze(1)
+ mask = mask.unsqueeze(1)
+ x = self.init_conv(x * mask)
+ x = self.res_block(x, mask)
+ output = self.final_conv(x * mask)
+ return output.squeeze(1)
+
+#utils
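+# PseudoInversion maps a log-mel spectrogram back to an approximate STFT magnitude
+# by multiplying with the pseudo-inverse of the mel filterbank matrix.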
+class PseudoInversion(BaseModule):
+ def __init__(self, n_mels, sampling_rate, n_fft):
+ super(PseudoInversion, self).__init__()
+ self.n_mels = n_mels
+ self.sampling_rate = sampling_rate
+ self.n_fft = n_fft
+ mel_basis = librosa_mel_fn(sampling_rate, n_fft, n_mels, 0, 8000)
+ mel_basis_inverse = np.linalg.pinv(mel_basis)
+ mel_basis_inverse = torch.from_numpy(mel_basis_inverse).float()
+ self.register_buffer("mel_basis_inverse", mel_basis_inverse)
+
+ def forward(self, log_mel_spectrogram):
+ mel_spectrogram = torch.exp(log_mel_spectrogram)
+ stftm = torch.matmul(self.mel_basis_inverse, mel_spectrogram)
+ return stftm
+
+
+class InitialReconstruction(BaseModule):
+ def __init__(self, n_fft, hop_size):
+ super(InitialReconstruction, self).__init__()
+ self.n_fft = n_fft
+ self.hop_size = hop_size
+ window = torch.hann_window(n_fft).float()
+ self.register_buffer("window", window)
+
+ def forward(self, stftm):
+ # start from an all-zero phase: treat the magnitude as a purely real spectrum
+ stft_complex = torch.complex(stftm, torch.zeros_like(stftm))
+ istft = torch.istft(stft_complex, n_fft=self.n_fft, hop_length=self.hop_size,
+ win_length=self.n_fft, window=self.window, center=True)
+ return istft.unsqueeze(1)
+
+
+# Fast Griffin-Lim algorithm as a PyTorch module
+class FastGL(BaseModule):
+ def __init__(self, n_mels, sampling_rate, n_fft, hop_size, momentum=0.99):
+ super(FastGL, self).__init__()
+ self.n_mels = n_mels
+ self.sampling_rate = sampling_rate
+ self.n_fft = n_fft
+ self.hop_size = hop_size
+ self.momentum = momentum
+ self.pi = PseudoInversion(n_mels, sampling_rate, n_fft)
+ self.ir = InitialReconstruction(n_fft, hop_size)
+ window = torch.hann_window(n_fft).float()
+ self.register_buffer("window", window)
+
+ @torch.no_grad()
+ def forward(self, s, n_iters=32):
+ c = self.pi(s)
+ x = self.ir(c)
+ x = x.squeeze(1)
+ c = c.unsqueeze(-1)
+ prev_angles = torch.zeros_like(c, device=c.device)
+ for _ in range(n_iters):
+ s = torch.stft(x, n_fft=self.n_fft, hop_length=self.hop_size,
+ win_length=self.n_fft, window=self.window,
+ center=True, return_complex=True)
+ stftm = torch.sqrt(torch.clamp(s.real**2 + s.imag**2, min=1e-8))
+ # keep only the phase of the current waveform estimate
+ angles = (s / torch.complex(stftm, torch.zeros_like(stftm))).unsqueeze(-1)
+ # momentum-accelerated Griffin-Lim update towards the target magnitude c
+ s = c * (angles + self.momentum * (angles - prev_angles))
+ x = torch.istft(s.squeeze(-1), n_fft=self.n_fft, hop_length=self.hop_size,
+ win_length=self.n_fft, window=self.window, center=True)
+ prev_angles = angles
+ return x.unsqueeze(1)
+
+#vc
+class FwdDiffusion(BaseModule):
+ def __init__(self, n_feats, channels, filters, heads, layers, kernel,
+ dropout, window_size, dim):
+ super(FwdDiffusion, self).__init__()
+ self.n_feats = n_feats
+ self.channels = channels
+ self.filters = filters
+ self.heads = heads
+ self.layers = layers
+ self.kernel = kernel
+ self.dropout = dropout
+ self.window_size = window_size
+ self.dim = dim
+ self.encoder = MelEncoder(n_feats, channels, filters, heads, layers,
+ kernel, dropout, window_size)
+ self.postnet = PostNet(dim)
+
+ def nparams(self):
+ num_params = 0
+ for name, param in self.named_parameters():
+ if param.requires_grad:
+ num_params += np.prod(param.detach().cpu().numpy().shape)
+ return num_params
+
+
+ def relocate_input(self, x: list):
+ device = next(self.parameters()).device
+ for i in range(len(x)):
+ if isinstance(x[i], torch.Tensor) and x[i].device != device:
+ x[i] = x[i].to(device)
+ return x
+
+ @torch.no_grad()
+ def forward(self, x, mask):
+ x, mask = self.relocate_input([x, mask])
+ z = self.encoder(x, mask)
+ z_output = self.postnet(z, mask)
+ return z_output
+
+ def compute_loss(self, x, y, mask):
+ x, y, mask = self.relocate_input([x, y, mask])
+ z = self.encoder(x, mask)
+ z_output = self.postnet(z, mask)
+ loss = mse_loss(z_output, y, mask, self.n_feats)
+ return loss
+
+
+class diffvc(AbstractTalkingFace):
+ def __init__(self, config):
+ super(diffvc, self).__init__()
+
+ self.n_feats=config["n_feats"]
+ self.channels= config["channels"]
+ self.filters=config["filters"]
+ self.heads=config["heads"]
+ self.layers=config["layers"]
+ self.kernel=config["kernel"]
+ self.dropout=config["dropout"]
+ self.window_size=config["window_size"]
+ self.enc_dim=config["enc_dim"]
+ self.spk_dim=config["spk_dim"]
+ self.use_ref_t=config["use_ref_t"]
+ self.dec_dim=config["dec_dim"]
+ self.beta_min=config["beta_min"]
+ self.beta_max=config["beta_max"]
+ self.encoder = FwdDiffusion(config["n_feats"], config["channels"], config["filters"],
+ config["heads"],config["layers"],config["kernel"],
+ config["dropout"],config["window_size"],config["enc_dim"])
+ self.decoder = Diffusion(config["n_feats"], config["dec_dim"],config["spk_dim"],config["use_ref_t"],
+ config["beta_min"],config["beta_max"])
+
+ def nparams(self):
+ num_params = 0
+ for name, param in self.named_parameters():
+ if param.requires_grad:
+ num_params += np.prod(param.detach().cpu().numpy().shape)
+ return num_params
+
+
+ def relocate_input(self, x: list):
+ device = next(self.parameters()).device
+ for i in range(len(x)):
+ if isinstance(x[i], torch.Tensor) and x[i].device != device:
+ x[i] = x[i].to(device)
+ return x
+
+ def load_encoder(self, enc_path):
+ enc_dict = torch.load(enc_path, map_location=lambda loc, storage: loc)
+ self.encoder.load_state_dict(enc_dict, strict=False)
+
+ @torch.no_grad()
+ def forward(self, x, x_lengths, x_ref, x_ref_lengths, c, n_timesteps,
+ mode='ml'):
+ """
+ Generates mel-spectrogram from source mel-spectrogram conditioned on
+ target speaker embedding. Returns:
+ 1. 'average voice' encoder outputs
+ 2. decoder outputs
+
+ Args:
+ x (torch.Tensor): batch of source mel-spectrograms.
+ x_lengths (torch.Tensor): numbers of frames in source mel-spectrograms.
+ x_ref (torch.Tensor): batch of reference mel-spectrograms.
+ x_ref_lengths (torch.Tensor): numbers of frames in reference mel-spectrograms.
+ c (torch.Tensor): batch of reference speaker embeddings
+ n_timesteps (int): number of steps to use for reverse diffusion in decoder.
+ mode (string, optional): sampling method. Can be one of:
+ 'pf' - probability flow sampling (Euler scheme for ODE)
+ 'em' - Euler-Maruyama SDE solver
+ 'ml' - Maximum Likelihood SDE solver
+ """
+ x, x_lengths = self.relocate_input([x, x_lengths])
+ x_ref, x_ref_lengths, c = self.relocate_input([x_ref, x_ref_lengths, c])
+ x_mask = sequence_mask(x_lengths).unsqueeze(1).to(x.dtype)
+ x_ref_mask = sequence_mask(x_ref_lengths).unsqueeze(1).to(x_ref.dtype)
+ mean = self.encoder(x, x_mask)
+ mean_x = self.decoder.compute_diffused_mean(x, x_mask, mean, 1.0)
+ mean_ref = self.encoder(x_ref, x_ref_mask)
+
+ b = x.shape[0]
+ max_length = int(x_lengths.max())
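+ # fix_len_compatibility (from the original DiffVC code) presumably rounds the length
+ # up so the repeated /2 downsampling inside the U-Net decoder still works; the extra
+ # frames are left zero-padded.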
+ max_length_new = fix_len_compatibility(max_length)
+ x_mask_new = sequence_mask(x_lengths, max_length_new).unsqueeze(1).to(x.dtype)
+ mean_new = torch.zeros((b, self.n_feats, max_length_new), dtype=x.dtype,
+ device=x.device)
+ mean_x_new = torch.zeros((b, self.n_feats, max_length_new), dtype=x.dtype,
+ device=x.device)
+ for i in range(b):
+ mean_new[i, :, :x_lengths[i]] = mean[i, :, :x_lengths[i]]
+ mean_x_new[i, :, :x_lengths[i]] = mean_x[i, :, :x_lengths[i]]
+
+ z = mean_x_new
+ z += torch.randn_like(mean_x_new, device=mean_x_new.device)
+
+ y = self.decoder(z, x_mask_new, mean_new, x_ref, x_ref_mask, mean_ref, c,
+ n_timesteps, mode)
+ return mean_x, y[:, :, :max_length]
+
+ def compute_loss(self, x, x_lengths, x_ref, c):
+ """
+ Computes diffusion (score matching) loss.
+
+ Args:
+ x (torch.Tensor): batch of source mel-spectrograms.
+ x_lengths (torch.Tensor): numbers of frames in source mel-spectrograms.
+ x_ref (torch.Tensor): batch of reference mel-spectrograms.
+ c (torch.Tensor): batch of reference speaker embeddings
+ """
+ x, x_lengths, x_ref, c = self.relocate_input([x, x_lengths, x_ref, c])
+ x_mask = sequence_mask(x_lengths).unsqueeze(1).to(x.dtype)
+ mean = self.encoder(x, x_mask).detach()
+ mean_ref = self.encoder(x_ref, x_mask).detach()
+ diff_loss = self.decoder.compute_loss(x, x_mask, mean, x_ref, mean_ref, c)
+ return diff_loss
+
+
+ def calculate_loss(self, interaction):
+ x, x_lengths, x_ref, c = interaction
+ return {"loss": self.compute_loss(x, x_lengths, x_ref, c)}
+
+ def predict(self, interaction):
+ x, x_lengths, x_ref, x_ref_lengths, c, n_timesteps, mode = interaction
+ return self.forward(x, x_lengths, x_ref, x_ref_lengths, c, n_timesteps, mode)
+
+ def generate_batch(self):
+ # Implement this method based on your requirements
+ pass
+
+
+
+
diff --git a/talkingface/model/voice_conversion/hifi-gan.py b/talkingface/model/voice_conversion/hifi-gan.py
new file mode 100644
index 00000000..06fa6efa
--- /dev/null
+++ b/talkingface/model/voice_conversion/hifi-gan.py
@@ -0,0 +1,340 @@
+""" from https://github.com/jik876/hifi-gan """
+from abstract_talkingface import AbstractTalkingFace
+import torch
+import torch.nn.functional as F
+import torch.nn as nn
+from torch.nn import Conv1d, ConvTranspose1d, AvgPool1d, Conv2d
+from torch.nn.utils import weight_norm, remove_weight_norm, spectral_norm
+from xutils import init_weights, get_padding
+import glob
+import os
+import matplotlib
+matplotlib.use("Agg")
+import matplotlib.pylab as plt
+
+LRELU_SLOPE = 0.1
+
+class HiFiGANUtils:
+ @staticmethod
+ def plot_spectrogram(spectrogram):
+ fig, ax = plt.subplots(figsize=(10, 2))
+ im = ax.imshow(spectrogram, aspect="auto", origin="lower",
+ interpolation='none')
+ plt.colorbar(im, ax=ax)
+
+ fig.canvas.draw()
+ plt.close()
+
+ return fig
+
+ @staticmethod
+ def init_weights(m, mean=0.0, std=0.01):
+ classname = m.__class__.__name__
+ if classname.find("Conv") != -1:
+ m.weight.data.normal_(mean, std)
+
+ @staticmethod
+ def apply_weight_norm(m):
+ classname = m.__class__.__name__
+ if classname.find("Conv") != -1:
+ weight_norm(m)
+
+ @staticmethod
+ def get_padding(kernel_size, dilation=1):
+ return int((kernel_size * dilation - dilation) / 2)
+
+ @staticmethod
+ def load_checkpoint(filepath, device):
+ assert os.path.isfile(filepath)
+ print("Loading '{}'".format(filepath))
+ checkpoint_dict = torch.load(filepath, map_location=device)
+ print("Complete.")
+ return checkpoint_dict
+
+ @staticmethod
+ def save_checkpoint(filepath, obj):
+ print("Saving checkpoint to {}".format(filepath))
+ torch.save(obj, filepath)
+ print("Complete.")
+
+ @staticmethod
+ def scan_checkpoint(cp_dir, prefix):
+ pattern = os.path.join(cp_dir, prefix + '????????')
+ cp_list = glob.glob(pattern)
+ if len(cp_list) == 0:
+ return None
+ return sorted(cp_list)[-1]
+
+class ResBlock1(torch.nn.Module):
+ def __init__(self, h, channels, kernel_size=3, dilation=(1, 3, 5)):
+ super(ResBlock1, self).__init__()
+ self.h = h
+ self.convs1 = nn.ModuleList([
+ weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[0],
+ padding=HiFiGANUtils.get_padding(kernel_size, dilation[0]))),
+ weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[1],
+ padding=HiFiGANUtils.get_padding(kernel_size, dilation[1]))),
+ weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[2],
+ padding=HiFiGANUtils.get_padding(kernel_size, dilation[2])))
+ ])
+ self.convs1.apply(HiFiGANUtils.init_weights)
+
+ self.convs2 = nn.ModuleList([
+ weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=1,
+ padding=HiFiGANUtils.get_padding(kernel_size, 1))),
+ weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=1,
+ padding=HiFiGANUtils.get_padding(kernel_size, 1))),
+ weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=1,
+ padding=HiFiGANUtils.get_padding(kernel_size, 1)))
+ ])
+ self.convs2.apply(HiFiGANUtils.init_weights)
+
+ def forward(self, x):
+ for c1, c2 in zip(self.convs1, self.convs2):
+ xt = F.leaky_relu(x, LRELU_SLOPE)
+ xt = c1(xt)
+ xt = F.leaky_relu(xt, LRELU_SLOPE)
+ xt = c2(xt)
+ x = xt + x
+ return x
+
+ def remove_weight_norm(self):
+ for l in self.convs1:
+ remove_weight_norm(l)
+ for l in self.convs2:
+ remove_weight_norm(l)
+
+
+class ResBlock2(torch.nn.Module):
+ def __init__(self, h, channels, kernel_size=3, dilation=(1, 3)):
+ super(ResBlock2, self).__init__()
+ self.h = h
+ self.convs = nn.ModuleList([
+ weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[0],
+ padding=HiFiGANUtils.get_padding(kernel_size, dilation[0]))),
+ weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[1],
+ padding=HiFiGANUtils.get_padding(kernel_size, dilation[1])))
+ ])
+ self.convs.apply(HiFiGANUtils.init_weights)
+
+ def forward(self, x):
+ for c in self.convs:
+ xt = F.leaky_relu(x, LRELU_SLOPE)
+ xt = c(xt)
+ x = xt + x
+ return x
+
+ def remove_weight_norm(self):
+ for l in self.convs:
+ remove_weight_norm(l)
+
+
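+# HiFi-GAN generator: a stack of transposed convolutions upsamples the mel-spectrogram
+# to waveform rate, with multi-receptive-field ResBlocks averaged after every upsample.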
+class Generator(torch.nn.Module):
+ def __init__(self, h):
+ super(Generator, self).__init__()
+ self.h = h
+ self.num_kernels = len(h.resblock_kernel_sizes)
+ self.num_upsamples = len(h.upsample_rates)
+ self.conv_pre = weight_norm(Conv1d(80, h.upsample_initial_channel, 7, 1, padding=3))
+ resblock = ResBlock1 if h.resblock == '1' else ResBlock2
+
+ self.ups = nn.ModuleList()
+ for i, (u, k) in enumerate(zip(h.upsample_rates, h.upsample_kernel_sizes)):
+ self.ups.append(weight_norm(
+ ConvTranspose1d(h.upsample_initial_channel//(2**i), h.upsample_initial_channel//(2**(i+1)),
+ k, u, padding=(k-u)//2)))
+
+ self.resblocks = nn.ModuleList()
+ for i in range(len(self.ups)):
+ ch = h.upsample_initial_channel//(2**(i+1))
+ for j, (k, d) in enumerate(zip(h.resblock_kernel_sizes, h.resblock_dilation_sizes)):
+ self.resblocks.append(resblock(h, ch, k, d))
+
+ self.conv_post = weight_norm(Conv1d(ch, 1, 7, 1, padding=3))
+ self.ups.apply(HiFiGANUtils.init_weights)
+ self.conv_post.apply(HiFiGANUtils.init_weights)
+
+ def forward(self, x):
+ x = self.conv_pre(x)
+ for i in range(self.num_upsamples):
+ x = F.leaky_relu(x, LRELU_SLOPE)
+ x = self.ups[i](x)
+ xs = None
+ for j in range(self.num_kernels):
+ if xs is None:
+ xs = self.resblocks[i*self.num_kernels+j](x)
+ else:
+ xs += self.resblocks[i*self.num_kernels+j](x)
+ x = xs / self.num_kernels
+ x = F.leaky_relu(x)
+ x = self.conv_post(x)
+ x = torch.tanh(x)
+
+ return x
+
+ def remove_weight_norm(self):
+ print('Removing weight norm...')
+ for l in self.ups:
+ remove_weight_norm(l)
+ for l in self.resblocks:
+ l.remove_weight_norm()
+ remove_weight_norm(self.conv_pre)
+ remove_weight_norm(self.conv_post)
+
+
+class DiscriminatorP(torch.nn.Module):
+ def __init__(self, period, kernel_size=5, stride=3, use_spectral_norm=False):
+ super(DiscriminatorP, self).__init__()
+ self.period = period
+ norm_f = weight_norm if use_spectral_norm == False else spectral_norm
+ self.convs = nn.ModuleList([
+ norm_f(Conv2d(1, 32, (kernel_size, 1), (stride, 1), padding=(HiFiGANUtils.get_padding(5, 1), 0))),
+ norm_f(Conv2d(32, 128, (kernel_size, 1), (stride, 1), padding=(HiFiGANUtils.get_padding(5, 1), 0))),
+ norm_f(Conv2d(128, 512, (kernel_size, 1), (stride, 1), padding=(HiFiGANUtils.get_padding(5, 1), 0))),
+ norm_f(Conv2d(512, 1024, (kernel_size, 1), (stride, 1), padding=(HiFiGANUtils.get_padding(5, 1), 0))),
+ norm_f(Conv2d(1024, 1024, (kernel_size, 1), 1, padding=(2, 0))),
+ ])
+ self.conv_post = norm_f(Conv2d(1024, 1, (3, 1), 1, padding=(1, 0)))
+
+ def forward(self, x):
+ fmap = []
+
+ # 1d to 2d
+ b, c, t = x.shape
+ if t % self.period != 0: # pad first
+ n_pad = self.period - (t % self.period)
+ x = F.pad(x, (0, n_pad), "reflect")
+ t = t + n_pad
+ x = x.view(b, c, t // self.period, self.period)
+
+ for l in self.convs:
+ x = l(x)
+ x = F.leaky_relu(x, LRELU_SLOPE)
+ fmap.append(x)
+ x = self.conv_post(x)
+ fmap.append(x)
+ x = torch.flatten(x, 1, -1)
+
+ return x, fmap
+
+
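+# The multi-period discriminator reshapes the waveform into 2D views by period
+# (2, 3, 5, 7, 11) and applies a separate convolutional discriminator to each view.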
+class MultiPeriodDiscriminator(torch.nn.Module):
+ def __init__(self):
+ super(MultiPeriodDiscriminator, self).__init__()
+ self.discriminators = nn.ModuleList([
+ DiscriminatorP(2),
+ DiscriminatorP(3),
+ DiscriminatorP(5),
+ DiscriminatorP(7),
+ DiscriminatorP(11),
+ ])
+
+ def forward(self, y, y_hat):
+ y_d_rs = []
+ y_d_gs = []
+ fmap_rs = []
+ fmap_gs = []
+ for i, d in enumerate(self.discriminators):
+ y_d_r, fmap_r = d(y)
+ y_d_g, fmap_g = d(y_hat)
+ y_d_rs.append(y_d_r)
+ fmap_rs.append(fmap_r)
+ y_d_gs.append(y_d_g)
+ fmap_gs.append(fmap_g)
+
+ return y_d_rs, y_d_gs, fmap_rs, fmap_gs
+
+
+class DiscriminatorS(torch.nn.Module):
+ def __init__(self, use_spectral_norm=False):
+ super(DiscriminatorS, self).__init__()
+ norm_f = weight_norm if use_spectral_norm == False else spectral_norm
+ self.convs = nn.ModuleList([
+ norm_f(Conv1d(1, 128, 15, 1, padding=7)),
+ norm_f(Conv1d(128, 128, 41, 2, groups=4, padding=20)),
+ norm_f(Conv1d(128, 256, 41, 2, groups=16, padding=20)),
+ norm_f(Conv1d(256, 512, 41, 4, groups=16, padding=20)),
+ norm_f(Conv1d(512, 1024, 41, 4, groups=16, padding=20)),
+ norm_f(Conv1d(1024, 1024, 41, 1, groups=16, padding=20)),
+ norm_f(Conv1d(1024, 1024, 5, 1, padding=2)),
+ ])
+ self.conv_post = norm_f(Conv1d(1024, 1, 3, 1, padding=1))
+
+ def forward(self, x):
+ fmap = []
+ for l in self.convs:
+ x = l(x)
+ x = F.leaky_relu(x, LRELU_SLOPE)
+ fmap.append(x)
+ x = self.conv_post(x)
+ fmap.append(x)
+ x = torch.flatten(x, 1, -1)
+
+ return x, fmap
+
+
+class MultiScaleDiscriminator(torch.nn.Module):
+ def __init__(self):
+ super(MultiScaleDiscriminator, self).__init__()
+ self.discriminators = nn.ModuleList([
+ DiscriminatorS(use_spectral_norm=True),
+ DiscriminatorS(),
+ DiscriminatorS(),
+ ])
+ self.meanpools = nn.ModuleList([
+ AvgPool1d(4, 2, padding=2),
+ AvgPool1d(4, 2, padding=2)
+ ])
+
+ def forward(self, y, y_hat):
+ y_d_rs = []
+ y_d_gs = []
+ fmap_rs = []
+ fmap_gs = []
+ for i, d in enumerate(self.discriminators):
+ if i != 0:
+ y = self.meanpools[i-1](y)
+ y_hat = self.meanpools[i-1](y_hat)
+ y_d_r, fmap_r = d(y)
+ y_d_g, fmap_g = d(y_hat)
+ y_d_rs.append(y_d_r)
+ fmap_rs.append(fmap_r)
+ y_d_gs.append(y_d_g)
+ fmap_gs.append(fmap_g)
+
+ return y_d_rs, y_d_gs, fmap_rs, fmap_gs
+
+
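+# Standard HiFi-GAN training losses: feature matching (L1 between discriminator
+# feature maps) and least-squares GAN losses for discriminator and generator.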
+def feature_loss(fmap_r, fmap_g):
+ loss = 0
+ for dr, dg in zip(fmap_r, fmap_g):
+ for rl, gl in zip(dr, dg):
+ loss += torch.mean(torch.abs(rl - gl))
+
+ return loss*2
+
+
+def discriminator_loss(disc_real_outputs, disc_generated_outputs):
+ loss = 0
+ r_losses = []
+ g_losses = []
+ for dr, dg in zip(disc_real_outputs, disc_generated_outputs):
+ r_loss = torch.mean((1-dr)**2)
+ g_loss = torch.mean(dg**2)
+ loss += (r_loss + g_loss)
+ r_losses.append(r_loss.item())
+ g_losses.append(g_loss.item())
+
+ return loss, r_losses, g_losses
+
+
+def generator_loss(disc_outputs):
+ loss = 0
+ gen_losses = []
+ for dg in disc_outputs:
+ l = torch.mean((1-dg)**2)
+ gen_losses.append(l)
+ loss += l
+
+ return loss, gen_losses
+
diff --git a/talkingface/properties/dataset/diffvc_dataset.yaml b/talkingface/properties/dataset/diffvc_dataset.yaml
new file mode 100644
index 00000000..8dc20796
--- /dev/null
+++ b/talkingface/properties/dataset/diffvc_dataset.yaml
@@ -0,0 +1,19 @@
+---
+data_dir: 'dataset/diffvc_data/'
+val_file: "dataset/diffvc_data/filelist/valid.txt" # 注意:修复了注释位置dataset/diffvc_data/filelist/valid.txt
+exc_file: "dataset/diffvc_data/filelist/exceptions_libritts.txt" # 注意:修复了注释位置
+log_dir: 'logs_dec'
+enc_dir: 'logs_enc'
+epochs: 10
+batch_size: 2
+learning_rate: 1e-4
+save_every: 1
+
+# Train
+checkpoint_sub_dir: "/diffvc" # joined with checkpoint_dir from overall.yaml to form the final directory
+
+temp_sub_dir: "/diffvc" # joined with temp_dir from overall.yaml to form the final directory
+
+train_filelist: 'dataset/diff_data/filelist/valid.txt' # data split file of the current dataset (train)
+test_filelist: 'dataset/diff_data/filelist/valid.txt' # data split file of the current dataset (test)
+val_filelist: 'dataset/diff_data/filelist/valid.txt' # data split file of the current dataset (val)
\ No newline at end of file
diff --git a/talkingface/properties/dataset/diffvc_encoder_dataset.yaml b/talkingface/properties/dataset/diffvc_encoder_dataset.yaml
new file mode 100644
index 00000000..25a430ae
--- /dev/null
+++ b/talkingface/properties/dataset/diffvc_encoder_dataset.yaml
@@ -0,0 +1,32 @@
+""" from https://github.com/CorentinJ/Real-Time-Voice-Cloning """
+
+## Mel-filterbank
+mel_window_length = 25 # In milliseconds
+mel_window_step = 10 # In milliseconds
+mel_n_channels = 40
+
+
+## Audio
+sampling_rate = 16000
+# Number of spectrogram frames in a partial utterance
+partials_n_frames = 160 # 1600 ms
+# Number of spectrogram frames at inference
+inference_n_frames = 80 # 800 ms
+
+
+## Voice Activation Detection
+# Window size of the VAD. Must be either 10, 20 or 30 milliseconds.
+# This sets the granularity of the VAD. Should not need to be changed.
+vad_window_length = 30 # In milliseconds
+# Number of frames to average together when performing the moving average smoothing.
+# The larger this value, the larger the VAD variations must be to not get smoothed out.
+vad_moving_average_width = 8
+# Maximum number of consecutive silent frames a segment can have.
+vad_max_silence_length = 6
+
+
+## Audio volume normalization
+audio_norm_target_dBFS = -30
+
+
+
diff --git a/talkingface/properties/dataset/params_data.py b/talkingface/properties/dataset/params_data.py
new file mode 100644
index 00000000..62d04121
--- /dev/null
+++ b/talkingface/properties/dataset/params_data.py
@@ -0,0 +1,30 @@
+""" from https://github.com/CorentinJ/Real-Time-Voice-Cloning """
+
+## Mel-filterbank
+mel_window_length = 25 # In milliseconds
+mel_window_step = 10 # In milliseconds
+mel_n_channels = 40
+
+
+## Audio
+sampling_rate = 16000
+# Number of spectrogram frames in a partial utterance
+partials_n_frames = 160 # 1600 ms
+# Number of spectrogram frames at inference
+inference_n_frames = 80 # 800 ms
+
+
+## Voice Activation Detection
+# Window size of the VAD. Must be either 10, 20 or 30 milliseconds.
+# This sets the granularity of the VAD. Should not need to be changed.
+vad_window_length = 30 # In milliseconds
+# Number of frames to average together when performing the moving average smoothing.
+# The larger this value, the larger the VAD variations must be to not get smoothed out.
+vad_moving_average_width = 8
+# Maximum number of consecutive silent frames a segment can have.
+vad_max_silence_length = 6
+
+
+## Audio volume normalization
+audio_norm_target_dBFS = -30
+
diff --git a/talkingface/properties/model/diffvc.yaml b/talkingface/properties/model/diffvc.yaml
new file mode 100644
index 00000000..bf473874
--- /dev/null
+++ b/talkingface/properties/model/diffvc.yaml
@@ -0,0 +1,39 @@
+---
+n_feats: 80
+sampling_rate: 22050
+n_fft: 1024
+hop_size: 256
+channels: 192
+filters: 768
+layers: 6
+kernel: 3
+dropout: 0.1
+heads: 2
+window_size: 4
+enc_dim: 128
+dec_dim: 256
+spk_dim: 128
+use_ref_t: True
+beta_min: 0.05
+beta_max: 20.0
+random_seed: 37
+test_size: 1
+train_frame: 128
+data_dir: 'dataset/diffvc_data/'
+val_file: "dataset/diffvc_data/filelist/valid.txt" # 注意:修复了注释位置dataset/diffvc_data/filelist/valid.txt
+exc_file: "dataset/diffvc_data/filelist/exceptions_libritts.txt" # 注意:修复了注释位置
+log_dir: 'logs_dec'
+enc_dir: 'logs_enc'
+epochs: 10
+batch_size: 32
+learning_rate: 1e-4
+save_every: 1
+
+# Train
+checkpoint_sub_dir: "/diffvc" # joined with checkpoint_dir from overall.yaml to form the final directory
+
+temp_sub_dir: "/diffvc" # joined with temp_dir from overall.yaml to form the final directory
+
+train_filelist: 'dataset/diff_data/filelist/valid.txt' # data split file of the current dataset (train)
+test_filelist: '' # data split file of the current dataset (test)
+val_filelist: '' # data split file of the current dataset (val)
\ No newline at end of file
diff --git a/talkingface/properties/model/diffvc_encoder.yaml b/talkingface/properties/model/diffvc_encoder.yaml
new file mode 100644
index 00000000..be1c069a
--- /dev/null
+++ b/talkingface/properties/model/diffvc_encoder.yaml
@@ -0,0 +1,14 @@
+""" from https://github.com/CorentinJ/Real-Time-Voice-Cloning """
+
+## Model parameters
+model_hidden_size = 256
+model_embedding_size = 256
+model_num_layers = 3
+
+
+## Training parameters
+learning_rate_init = 1e-4
+speakers_per_batch = 64
+utterances_per_speaker = 10
+
+
diff --git a/talkingface/properties/model/params.py b/talkingface/properties/model/params.py
new file mode 100644
index 00000000..ea6ed635
--- /dev/null
+++ b/talkingface/properties/model/params.py
@@ -0,0 +1,33 @@
+n_mels = 80
+sampling_rate = 22050
+n_fft = 1024
+hop_size = 256
+
+# "average voice" encoder parameters
+channels = 192
+filters = 768
+layers = 6
+kernel = 3
+dropout = 0.1
+heads = 2
+window_size = 4
+enc_dim = 128
+
+# diffusion-based decoder parameters
+dec_dim = 256
+spk_dim = 128
+use_ref_t = True
+beta_min = 0.05
+beta_max = 20.0
+
+# training parameters
+seed = 37
+test_size = 1
+train_frames = 128
+
+data_dir = 'dataset/diffvc_data/'
+#val_file: "dataset/diffvc_data/filelist/valid.txt" # 注意:修复了注释位置dataset/diffvc_data/filelist/valid.txt
+#exc_file: "dataset/diffvc_data/filelist/exceptions_libritts.txt" # 注意:修复了注释位置
+val_file = "dataset/diffvc_data/filelist/valid.txt"
+ #val_file = r'C:\Users\liberty\Desktop\diffvc-yuanma\\filelists\\valid.txt'
+exc_file = "dataset/diffvc_data/filelist/exceptions_libritts.txt"
diff --git a/talkingface/properties/overall.yaml b/talkingface/properties/overall.yaml
index 81ac51ae..48c8d7ed 100644
--- a/talkingface/properties/overall.yaml
+++ b/talkingface/properties/overall.yaml
@@ -1,7 +1,7 @@
# Enviroment Settings
gpu_id: '3, 4, 5' # (str) The id of GPU device(s).
worker: 0 # (int) The number of workers processing the data.
-use_gpu: True # (bool) Whether or not to use GPU.
+use_gpu: True # (bool) Whether or not to use GPU.
seed: 2023 # (int) Random seed.
checkpoint_dir: 'saved' # (str) The path to save checkpoint file.
show_progress: True # (bool) Whether or not to show the progress bar of every epoch.
diff --git a/talkingface/quick_start/quick_start.py b/talkingface/quick_start/quick_start.py
index 3ff2e889..0587213f 100644
--- a/talkingface/quick_start/quick_start.py
+++ b/talkingface/quick_start/quick_start.py
@@ -77,13 +77,13 @@ def run_talkingface(
train_dataset, val_dataset = create_dataset(config)
train_data_loader = data_utils.DataLoader(
- train_dataset, batch_size=config["batch_size"], shuffle=True
+ train_dataset, batch_size=config["batch_size"], shuffle=True
)
val_data_loader = data_utils.DataLoader(
val_dataset, batch_size=config["batch_size"], shuffle=False
)
- # load model
+ #load model
model = get_model(config["model"])(config).to(config["device"])
logger.info(model)
diff --git a/talkingface/trainer/trainer.py b/talkingface/trainer/trainer.py
index 2c34717b..afcf2bdd 100644
--- a/talkingface/trainer/trainer.py
+++ b/talkingface/trainer/trainer.py
@@ -13,6 +13,33 @@
import torch.cuda.amp as amp
from torch import nn
from pathlib import Path
+#train talkingface.data.dataset.data_objects talkingface.utils.voice_conversion_talkingface
+from talkingface.data.dataset.data_objects import SpeakerVerificationDataLoader, SpeakerVerificationDataset
+from talkingface.utils.voice_conversion_talkingface.params_model import *
+from talkingface.model.voice_conversion.diffvc import SpeakerEncoder
+from talkingface.utils.voice_conversion_talkingface.DiffVC_speaker_encoder_utils import Profiler as Profiler
+from pathlib import Path
+
+from torch.utils.data import DataLoader
+
+from talkingface.data.dataset.diffvc_dataset import diffvcDataset, VCDecBatchCollate,VCDecDataset
+from talkingface.model.voice_conversion.diffvc import diffvc
+from talkingface.model.voice_conversion.diffvc import FastGL
+from talkingface.utils.utils import save_plot, save_audio
+
+#visualization
+from talkingface.data.dataset.data_objects.speaker_verification_dataset import SpeakerVerificationDataset
+from datetime import datetime
+from time import perf_counter as timer
+import matplotlib.pyplot as plt
+import numpy as np
+import talkingface.properties.model.params as params
+
+from talkingface.model.voice_conversion.diffvc import diffvc
+import webbrowser
+import visdom
+import umap
+
from talkingface.utils import(
ensure_dir,
@@ -29,6 +56,8 @@
from talkingface.evaluator import Evaluator
+
+
class AbstractTrainer(object):
r"""Trainer Class is used to manage the training and evaluation processes of recommender system models.
AbstractTrainer is an abstract class in which the fit() and evaluate() method should be implemented according
@@ -446,6 +475,410 @@ def evaluate(self, load_best_model=True, model_file=None):
eval_result = self.evaluator.evaluate(datadict)
self.logger.info(eval_result)
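+# Trainer for the DiffVC decoder: the pretrained "average voice" encoder is loaded
+# and kept fixed (its outputs are detached), while only the diffusion decoder
+# parameters are optimized.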
+class diffvcTrainer(Trainer):
+ def __init__(self, config, model):
+ super(diffvcTrainer, self).__init__(config, model)
+ self.optimizer = config["optimizer"]
+ self.train_loader = config["train_loader"]
+ self.epochs = config["epochs"]
+ self.batch_size = config["batch_size"]
+ self.learning_rate = config["learning_rate"]
+ self.save_every = config["save_every"]
+ self.log_dir = config["log_dir"]
+ torch.manual_seed(config["random_seed"])
+ np.random.seed(config["random_seed"])
+ n_mels = params.n_mels
+ sampling_rate = params.sampling_rate
+ n_fft = params.n_fft
+ hop_size = params.hop_size
+
+ random_seed = params.seed
+ test_size = params.test_size
+
+ data_dir=params.data_dir
+ val_file=params.val_file
+ exc_file=params.exc_file
+
+ log_dir = 'logs_dec'
+ enc_dir = 'logs_enc'
+ epochs = 10
+ batch_size = 32
+ learning_rate = 1e-4
+ save_every = 1
+
+
+
+
+ torch.manual_seed(random_seed)
+ np.random.seed(random_seed)
+
+ os.makedirs(log_dir, exist_ok=True)
+
+ print('Initializing data loaders...')
+ train_set = VCDecDataset(data_dir, val_file, exc_file)
+ collate_fn = VCDecBatchCollate()
+ train_loader = DataLoader(train_set, batch_size=batch_size,
+ collate_fn=collate_fn, num_workers=4, drop_last=True)
+
+ print('Initializing and loading models...')
+ #fgl = FastGL(n_mels, sampling_rate, n_fft, hop_size).cuda()
+ fgl = FastGL(n_mels, sampling_rate, n_fft, hop_size)
+
+ #.cuda()
+ model.load_encoder(os.path.join(enc_dir, 'enc.pt'))
+
+ print('Encoder:')
+ print(model.encoder)
+ # print('Number of parameters = %.2fm\n' % (model.encoder.nparams/1e6))
+ print('Decoder:')
+ print(model.decoder)
+ # print('Number of parameters = %.2fm\n' % (model.decoder.nparams/1e6))
+
+ print('Initializing optimizers...')
+ optimizer = torch.optim.Adam(params=model.decoder.parameters(), lr=learning_rate)
+
+ print('Start training.')
+ torch.backends.cudnn.benchmark = True
+ iteration = 0
+ for epoch in range(1, epochs + 1):
+ print(f'Epoch: {epoch} [iteration: {iteration}]')
+ model.train()
+ losses = []
+ for batch in tqdm(train_loader, total=len(train_set)//batch_size):
+ mel, mel_ref = batch['mel1'].cuda(), batch['mel2'].cuda()
+ c, mel_lengths = batch['c'].cuda(), batch['mel_lengths'].cuda()
+ model.zero_grad()
+ loss = model.compute_loss(mel, mel_lengths, mel_ref, c)
+ loss.backward()
+ torch.nn.utils.clip_grad_norm_(model.decoder.parameters(), max_norm=1)
+ optimizer.step()
+ losses.append(loss.item())
+ iteration += 1
+
+ losses = np.asarray(losses)
+ msg = 'Epoch %d: loss = %.4f\n' % (epoch, np.mean(losses))
+ print(msg)
+ with open(f'{log_dir}/train_dec.log', 'a') as f:
+ f.write(msg)
+ losses = []
+
+ if epoch % save_every > 0:
+ continue
+
+ model.eval()
+ print('Inference...\n')
+ with torch.no_grad():
+ mels = train_set.get_valid_dataset()
+ for i, (mel, c) in enumerate(mels):
+ if i >= test_size:
+ break
+ mel = mel.unsqueeze(0).float().cuda()
+ c = c.unsqueeze(0).float().cuda()
+ mel_lengths = torch.LongTensor([mel.shape[-1]]).cuda()
+ mel_avg, mel_rec = model(mel, mel_lengths, mel, mel_lengths, c,
+ n_timesteps=100)
+ if epoch == save_every:
+ save_plot(mel.squeeze().cpu(), f'{log_dir}/original_{i}.png')
+ audio = fgl(mel)
+ save_audio(f'{log_dir}/original_{i}.wav', sampling_rate, audio)
+ save_plot(mel_avg.squeeze().cpu(), f'{log_dir}/average_{i}.png')
+ audio = fgl(mel_avg)
+ save_audio(f'{log_dir}/average_{i}.wav', sampling_rate, audio)
+ save_plot(mel_rec.squeeze().cpu(), f'{log_dir}/reconstructed_{i}.png')
+ audio = fgl(mel_rec)
+ save_audio(f'{log_dir}/reconstructed_{i}.wav', sampling_rate, audio)
+
+ print('Saving model...\n')
+ ckpt = model.state_dict()
+ torch.save(ckpt, f=f"{log_dir}/vc_{epoch}.pt")
+
+
+class diffVC_encoder_train:
+ def sync(device: torch.device):
+ # FIXME
+ return
+ # For correct profiling (cuda operations are async)
+ if device.type == "cuda":
+ torch.cuda.synchronize(device)
+
+ def train(run_id: str, clean_data_root: Path, models_dir: Path, umap_every: int, save_every: int,
+ backup_every: int, vis_every: int, force_restart: bool, visdom_server: str,
+ no_visdom: bool):
+ # Create a dataset and a dataloader
+ dataset = SpeakerVerificationDataset(clean_data_root)
+ loader = SpeakerVerificationDataLoader(
+ dataset,
+ speakers_per_batch, # from the params_model wildcard import above
+ utterances_per_speaker,
+ num_workers=8,
+ )
+
+ # Setup the device on which to run the forward pass and the loss. These can be different,
+ # because the forward pass is faster on the GPU whereas the loss is often (depending on your
+ # hyperparameters) faster on the CPU.
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+ # FIXME: currently, the gradient is None if loss_device is cuda
+ loss_device = torch.device("cpu")
+
+ # Create the model and the optimizer
+ model = SpeakerEncoder(device, loss_device)
+ optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate_init)
+ init_step = 1
+
+ # Configure file path for the model
+ state_fpath = models_dir.joinpath(run_id + ".pt")
+ backup_dir = models_dir.joinpath(run_id + "_backups")
+
+ # Load any existing model
+ if not force_restart:
+ if state_fpath.exists():
+ print("Found existing model \"%s\", loading it and resuming training." % run_id)
+ checkpoint = torch.load(state_fpath)
+ init_step = checkpoint["step"]
+ model.load_state_dict(checkpoint["model_state"])
+ optimizer.load_state_dict(checkpoint["optimizer_state"])
+ optimizer.param_groups[0]["lr"] = learning_rate_init
+ else:
+ print("No model \"%s\" found, starting training from scratch." % run_id)
+ else:
+ print("Starting the training from scratch.")
+ model.train()
+
+ # Initialize the visualization environment
+ vis = Visualizations(run_id, vis_every, server=visdom_server, disabled=no_visdom)
+ vis.log_dataset(dataset)
+ vis.log_params()
+ device_name = str(torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU")
+ vis.log_implementation({"Device": device_name})
+
+ # Training loop
+ profiler = Profiler(summarize_every=10, disabled=False)
+ for step, speaker_batch in enumerate(loader, init_step):
+ profiler.tick("Blocking, waiting for batch (threaded)")
+
+ # Forward pass
+ inputs = torch.from_numpy(speaker_batch.data).to(device)
+ diffVC_encoder_train.sync(device)
+ profiler.tick("Data to %s" % device)
+ embeds = model(inputs)
+ diffVC_encoder_train.sync(device)
+ profiler.tick("Forward pass")
+ embeds_loss = embeds.view((speakers_per_batch, utterances_per_speaker, -1)).to(loss_device)
+ loss, eer = model.loss(embeds_loss)
+ diffVC_encoder_train.sync(loss_device)
+ profiler.tick("Loss")
+
+ # Backward pass
+ model.zero_grad()
+ loss.backward()
+ profiler.tick("Backward pass")
+ model.do_gradient_ops()
+ optimizer.step()
+ profiler.tick("Parameter update")
+
+ # Update visualizations
+ # learning_rate = optimizer.param_groups[0]["lr"]
+ vis.update(loss.item(), eer, step)
+
+ # Draw projections and save them to the backup folder
+ if umap_every != 0 and step % umap_every == 0:
+ print("Drawing and saving projections (step %d)" % step)
+ backup_dir.mkdir(exist_ok=True)
+ projection_fpath = backup_dir.joinpath("%s_umap_%06d.png" % (run_id, step))
+ embeds = embeds.detach().cpu().numpy()
+ vis.draw_projections(embeds, utterances_per_speaker, step, projection_fpath)
+ vis.save()
+
+ # Overwrite the latest version of the model
+ if save_every != 0 and step % save_every == 0:
+ print("Saving the model (step %d)" % step)
+ torch.save({
+ "step": step + 1,
+ "model_state": model.state_dict(),
+ "optimizer_state": optimizer.state_dict(),
+ }, state_fpath)
+
+ # Make a backup
+ if backup_every != 0 and step % backup_every == 0:
+ print("Making a backup (step %d)" % step)
+ backup_dir.mkdir(exist_ok=True)
+ backup_fpath = backup_dir.joinpath("%s_bak_%06d.pt" % (run_id, step))
+ torch.save({
+ "step": step + 1,
+ "model_state": model.state_dict(),
+ "optimizer_state": optimizer.state_dict(),
+ }, backup_fpath)
+
+ profiler.tick("Extras (visualizations, saving)")
+
+
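+# A minimal invocation sketch of train() (not part of the original code): the run id,
+# paths and step intervals below are hypothetical and would normally come from the
+# toolkit's config or command line.
+#
+# from pathlib import Path
+# train(run_id="diffvc_enc", clean_data_root=Path("dataset/encoder_train"),
+#       models_dir=Path("logs_enc"), umap_every=500, save_every=1000,
+#       backup_every=5000, vis_every=10, force_restart=False,
+#       visdom_server="http://localhost", no_visdom=True)
+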
+class Visualizations:
+ colormap = np.array([
+ [76, 255, 0],
+ [0, 127, 70],
+ [255, 0, 0],
+ [255, 217, 38],
+ [0, 135, 255],
+ [165, 0, 165],
+ [255, 167, 255],
+ [0, 255, 255],
+ [255, 96, 38],
+ [142, 76, 0],
+ [33, 0, 127],
+ [0, 0, 0],
+ [183, 183, 183],
+ ], dtype=float) / 255
+
+ def __init__(self, env_name=None, update_every=10, server="http://localhost", disabled=False):
+ # Tracking data
+ self.last_update_timestamp = timer()
+ self.update_every = update_every
+ self.step_times = []
+ self.losses = []
+ self.eers = []
+ print("Updating the visualizations every %d steps." % update_every)
+
+ # If visdom is disabled TODO: use a better paradigm for that
+ self.disabled = disabled
+ if self.disabled:
+ return
+
+ # Set the environment name
+ now = str(datetime.now().strftime("%d-%m %Hh%M"))
+ if env_name is None:
+ self.env_name = now
+ else:
+ self.env_name = "%s (%s)" % (env_name, now)
+
+ # Connect to visdom and open the corresponding window in the browser
+ try:
+ self.vis = visdom.Visdom(server, env=self.env_name, raise_exceptions=True)
+ except ConnectionError:
+ raise Exception("No visdom server detected. Run the command \"visdom\" in your CLI to "
+ "start it.")
+ # webbrowser.open("http://localhost:8097/env/" + self.env_name)
+
+ # Create the windows
+ self.loss_win = None
+ self.eer_win = None
+ # self.lr_win = None
+ self.implementation_win = None
+ self.projection_win = None
+ self.implementation_string = ""
+
+ def log_params(self):
+ if self.disabled:
+ return
+ from talkingface.utils.voice_conversion_talkingface import params_data
+ from talkingface.utils.voice_conversion_talkingface import params_model
+ param_string = "Model parameters:
"
+ for param_name in (p for p in dir(params_model) if not p.startswith("__")):
+ value = getattr(params_model, param_name)
+ param_string += "\t%s: %s
" % (param_name, value)
+ param_string += "Data parameters:
"
+ for param_name in (p for p in dir(params_data) if not p.startswith("__")):
+ value = getattr(params_data, param_name)
+ param_string += "\t%s: %s
" % (param_name, value)
+ self.vis.text(param_string, opts={"title": "Parameters"})
+
+ def log_dataset(self, dataset: SpeakerVerificationDataset):
+ if self.disabled:
+ return
+ dataset_string = ""
+ dataset_string += "Speakers: %s\n" % len(dataset.speakers)
+ dataset_string += "\n" + dataset.get_logs()
+ dataset_string = dataset_string.replace("\n", "<br>")
+ self.vis.text(dataset_string, opts={"title": "Dataset"})
+
+ def log_implementation(self, params):
+ if self.disabled:
+ return
+ implementation_string = ""
+ for param, value in params.items():
+ implementation_string += "%s: %s\n" % (param, value)
+ implementation_string = implementation_string.replace("\n", "<br>")
+ self.implementation_string = implementation_string
+ self.implementation_win = self.vis.text(
+ implementation_string,
+ opts={"title": "Training implementation"}
+ )
+
+ def update(self, loss, eer, step):
+ # Update the tracking data
+ now = timer()
+ self.step_times.append(1000 * (now - self.last_update_timestamp))
+ self.last_update_timestamp = now
+ self.losses.append(loss)
+ self.eers.append(eer)
+ print(".", end="")
+
+ # Update the plots every update_every steps
+ if step % self.update_every != 0:
+ return
+ time_string = "Step time: mean: %5dms std: %5dms" % \
+ (int(np.mean(self.step_times)), int(np.std(self.step_times)))
+ print("\nStep %6d Loss: %.4f EER: %.4f %s" %
+ (step, np.mean(self.losses), np.mean(self.eers), time_string))
+ if not self.disabled:
+ self.loss_win = self.vis.line(
+ [np.mean(self.losses)],
+ [step],
+ win=self.loss_win,
+ update="append" if self.loss_win else None,
+ opts=dict(
+ legend=["Avg. loss"],
+ xlabel="Step",
+ ylabel="Loss",
+ title="Loss",
+ )
+ )
+ self.eer_win = self.vis.line(
+ [np.mean(self.eers)],
+ [step],
+ win=self.eer_win,
+ update="append" if self.eer_win else None,
+ opts=dict(
+ legend=["Avg. EER"],
+ xlabel="Step",
+ ylabel="EER",
+ title="Equal error rate"
+ )
+ )
+ if self.implementation_win is not None:
+ self.vis.text(
+ self.implementation_string + ("<b>%s</b>" % time_string),
+ win=self.implementation_win,
+ opts={"title": "Training implementation"},
+ )
+
+ # Reset the tracking
+ self.losses.clear()
+ self.eers.clear()
+ self.step_times.clear()
+
+ def draw_projections(self, embeds, utterances_per_speaker, step, out_fpath=None,
+ max_speakers=10):
+ max_speakers = min(max_speakers, len(Visualizations.colormap))
+ embeds = embeds[:max_speakers * utterances_per_speaker]
+
+ n_speakers = len(embeds) // utterances_per_speaker
+ ground_truth = np.repeat(np.arange(n_speakers), utterances_per_speaker)
+ colors = [Visualizations.colormap[i] for i in ground_truth]
+
+ reducer = umap.UMAP()
+ projected = reducer.fit_transform(embeds)
+ plt.scatter(projected[:, 0], projected[:, 1], c=colors)
+ plt.gca().set_aspect("equal", "datalim")
+ plt.title("UMAP projection (step %d)" % step)
+ if not self.disabled:
+ self.projection_win = self.vis.matplot(plt, win=self.projection_win)
+ if out_fpath is not None:
+ plt.savefig(out_fpath)
+ plt.clf()
+
+ def save(self):
+ if not self.disabled:
+ self.vis.save([self.env_name])
class Wav2LipTrainer(Trainer):
diff --git a/talkingface/utils/.DS_Store b/talkingface/utils/.DS_Store
new file mode 100644
index 00000000..8a9ddfed
Binary files /dev/null and b/talkingface/utils/.DS_Store differ
diff --git a/talkingface/utils/face_detection/.DS_Store b/talkingface/utils/face_detection/.DS_Store
new file mode 100644
index 00000000..b6c6d639
Binary files /dev/null and b/talkingface/utils/face_detection/.DS_Store differ
diff --git a/talkingface/utils/face_detection/detection/.DS_Store b/talkingface/utils/face_detection/detection/.DS_Store
new file mode 100644
index 00000000..01c190d6
Binary files /dev/null and b/talkingface/utils/face_detection/detection/.DS_Store differ
diff --git a/talkingface/utils/utils.py b/talkingface/utils/utils.py
index a5019491..eb4aca47 100644
--- a/talkingface/utils/utils.py
+++ b/talkingface/utils/utils.py
@@ -4,6 +4,9 @@
import random
import pandas as pd
+import torchaudio
+
+from librosa.filters import mel as librosa_mel_fn
import numpy as np
import torch
import torch.nn as nn
@@ -11,6 +14,10 @@
from torch.utils.tensorboard import SummaryWriter
from texttable import Texttable
+
+import matplotlib.pyplot as plt
+from scipy.io import wavfile
+
def get_local_time():
r"""Get current time
@@ -43,12 +50,12 @@ def get_model(model_name):
Recommender: model class
"""
model_submodule = [
- "audio_driven_talkingface",
+ "voice_conversion",
+ #"audio_driven_talkingface",
"image_driven_talkingface",
"nerf_based_talkingface",
- "text_to_speech",
- "voice_conversion"
-
+ "text_to_speech"
+
]
model_file_name = model_name.lower()
@@ -433,9 +440,11 @@ def get_preprocess(dataset_name):
def create_dataset(config):
r"""Automatically select dataset class based on dataset name
"""
+ dataset_module = None
model_name = config['model']
dataset_file_name = model_name.lower()+'_dataset'
module_path = ".".join(["talkingface.data.dataset", dataset_file_name])
+ print(module_path +"=module_path")
if importlib.util.find_spec(module_path, __name__):
dataset_module = importlib.import_module(module_path, __name__)
if dataset_module is None:
@@ -447,9 +456,44 @@ def create_dataset(config):
return dataset_class(config, config['train_filelist']), dataset_class(config, config['val_filelist'])
+#diffVC
+def mse_loss(x, y, mask, n_feats):
+ loss = torch.sum(((x - y)**2) * mask)
+ return loss / (torch.sum(mask) * n_feats)
+
+
+def sequence_mask(length, max_length=None):
+ if max_length is None:
+ max_length = length.max()
+ x = torch.arange(int(max_length), dtype=length.dtype, device=length.device)
+ return x.unsqueeze(0) < length.unsqueeze(1)
+
+
+def convert_pad_shape(pad_shape):
+ l = pad_shape[::-1]
+ pad_shape = [item for sublist in l for item in sublist]
+ return pad_shape
+
+def fix_len_compatibility(length, num_downsamplings_in_unet=2):
+ while True:
+ if length % (2**num_downsamplings_in_unet) == 0:
+ return length
+ length += 1
+def save_plot(tensor, savepath):
+ plt.style.use('default')
+ fig, ax = plt.subplots(figsize=(12, 3))
+ im = ax.imshow(tensor, aspect="auto", origin="lower", interpolation='none')
+ plt.colorbar(im, ax=ax)
+ plt.tight_layout()
+ fig.canvas.draw()
+ plt.savefig(savepath)
+ plt.close()
+def save_audio(file_path, sampling_rate, audio):
+ audio = np.clip(audio.detach().cpu().squeeze().numpy(), -0.999, 0.999)
+ wavfile.write(file_path, sampling_rate, (audio * 32767).astype("int16"))
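+
+# Hedged usage sketch (illustrative only, not part of the original patch): sequence_mask
+# builds a boolean mask from per-utterance lengths, and fix_len_compatibility rounds a
+# frame count up so it stays divisible after the UNet's two downsamplings.
+#
+# lengths = torch.tensor([3, 5])
+# mask = sequence_mask(lengths)       # shape (2, 5), True where frame index < length
+# padded = fix_len_compatibility(5)   # -> 8 (next multiple of 2**2)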
diff --git a/talkingface/utils/voice_conversion_talkingface/.DS_Store b/talkingface/utils/voice_conversion_talkingface/.DS_Store
new file mode 100644
index 00000000..7e5eafca
Binary files /dev/null and b/talkingface/utils/voice_conversion_talkingface/.DS_Store differ
diff --git a/talkingface/utils/voice_conversion_talkingface/DiffVC_hifi-gan_xutils.py b/talkingface/utils/voice_conversion_talkingface/DiffVC_hifi-gan_xutils.py
new file mode 100644
index 00000000..e2d88d5c
--- /dev/null
+++ b/talkingface/utils/voice_conversion_talkingface/DiffVC_hifi-gan_xutils.py
@@ -0,0 +1,60 @@
+""" from https://github.com/jik876/hifi-gan """
+
+import glob
+import os
+import matplotlib
+import torch
+from torch.nn.utils import weight_norm
+matplotlib.use("Agg")
+import matplotlib.pylab as plt
+
+
+def plot_spectrogram(spectrogram):
+ fig, ax = plt.subplots(figsize=(10, 2))
+ im = ax.imshow(spectrogram, aspect="auto", origin="lower",
+ interpolation='none')
+ plt.colorbar(im, ax=ax)
+
+ fig.canvas.draw()
+ plt.close()
+
+ return fig
+
+
+def init_weights(m, mean=0.0, std=0.01):
+ classname = m.__class__.__name__
+ if classname.find("Conv") != -1:
+ m.weight.data.normal_(mean, std)
+
+
+def apply_weight_norm(m):
+ classname = m.__class__.__name__
+ if classname.find("Conv") != -1:
+ weight_norm(m)
+
+
+def get_padding(kernel_size, dilation=1):
+ return int((kernel_size*dilation - dilation)/2)
+
+
+def load_checkpoint(filepath, device):
+ assert os.path.isfile(filepath)
+ print("Loading '{}'".format(filepath))
+ checkpoint_dict = torch.load(filepath, map_location=device)
+ print("Complete.")
+ return checkpoint_dict
+
+
+def save_checkpoint(filepath, obj):
+ print("Saving checkpoint to {}".format(filepath))
+ torch.save(obj, filepath)
+ print("Complete.")
+
+
+def scan_checkpoint(cp_dir, prefix):
+ pattern = os.path.join(cp_dir, prefix + '????????')
+ cp_list = glob.glob(pattern)
+ if len(cp_list) == 0:
+ return None
+ return sorted(cp_list)[-1]
+
diff --git a/talkingface/utils/voice_conversion_talkingface/DiffVC_model_utils.py b/talkingface/utils/voice_conversion_talkingface/DiffVC_model_utils.py
new file mode 100644
index 00000000..79be82b5
--- /dev/null
+++ b/talkingface/utils/voice_conversion_talkingface/DiffVC_model_utils.py
@@ -0,0 +1,110 @@
+# Copyright (C) 2022. Huawei Technologies Co., Ltd. All rights reserved.
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the MIT License.
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# MIT License for more details.
+
+import torch
+import torchaudio
+import numpy as np
+from librosa.filters import mel as librosa_mel_fn
+
+from model.base import BaseModule
+
+
+def mse_loss(x, y, mask, n_feats):
+ loss = torch.sum(((x - y)**2) * mask)
+ return loss / (torch.sum(mask) * n_feats)
+
+
+def sequence_mask(length, max_length=None):
+ if max_length is None:
+ max_length = length.max()
+ x = torch.arange(int(max_length), dtype=length.dtype, device=length.device)
+ return x.unsqueeze(0) < length.unsqueeze(1)
+
+
+def convert_pad_shape(pad_shape):
+ l = pad_shape[::-1]
+ pad_shape = [item for sublist in l for item in sublist]
+ return pad_shape
+
+
+def fix_len_compatibility(length, num_downsamplings_in_unet=2):
+ while True:
+ if length % (2**num_downsamplings_in_unet) == 0:
+ return length
+ length += 1
+
+
+class PseudoInversion(BaseModule):
+ def __init__(self, n_mels, sampling_rate, n_fft):
+ super(PseudoInversion, self).__init__()
+ self.n_mels = n_mels
+ self.sampling_rate = sampling_rate
+ self.n_fft = n_fft
+ mel_basis = librosa_mel_fn(sampling_rate, n_fft, n_mels, 0, 8000)
+ mel_basis_inverse = np.linalg.pinv(mel_basis)
+ mel_basis_inverse = torch.from_numpy(mel_basis_inverse).float()
+ self.register_buffer("mel_basis_inverse", mel_basis_inverse)
+
+ def forward(self, log_mel_spectrogram):
+ mel_spectrogram = torch.exp(log_mel_spectrogram)
+ stftm = torch.matmul(self.mel_basis_inverse, mel_spectrogram)
+ return stftm
+
+
+class InitialReconstruction(BaseModule):
+ def __init__(self, n_fft, hop_size):
+ super(InitialReconstruction, self).__init__()
+ self.n_fft = n_fft
+ self.hop_size = hop_size
+ window = torch.hann_window(n_fft).float()
+ self.register_buffer("window", window)
+
+ def forward(self, stftm):
+ real_part = torch.ones_like(stftm, device=stftm.device)
+ imag_part = torch.zeros_like(stftm, device=stftm.device)
+ stft = torch.stack([real_part, imag_part], -1)*stftm.unsqueeze(-1)
+ istft = torchaudio.functional.istft(stft, n_fft=self.n_fft,
+ hop_length=self.hop_size, win_length=self.n_fft,
+ window=self.window, center=True)
+ return istft.unsqueeze(1)
+
+
+# Fast Griffin-Lim algorithm as a PyTorch module
+class FastGL(BaseModule):
+ def __init__(self, n_mels, sampling_rate, n_fft, hop_size, momentum=0.99):
+ super(FastGL, self).__init__()
+ self.n_mels = n_mels
+ self.sampling_rate = sampling_rate
+ self.n_fft = n_fft
+ self.hop_size = hop_size
+ self.momentum = momentum
+ self.pi = PseudoInversion(n_mels, sampling_rate, n_fft)
+ self.ir = InitialReconstruction(n_fft, hop_size)
+ window = torch.hann_window(n_fft).float()
+ self.register_buffer("window", window)
+
+ @torch.no_grad()
+ def forward(self, s, n_iters=32):
+ c = self.pi(s)
+ x = self.ir(c)
+ x = x.squeeze(1)
+ c = c.unsqueeze(-1)
+ prev_angles = torch.zeros_like(c, device=c.device)
+ for _ in range(n_iters):
+ s = torch.stft(x, n_fft=self.n_fft, hop_length=self.hop_size,
+ win_length=self.n_fft, window=self.window,
+ center=True)
+ real_part, imag_part = s.unbind(-1)
+ stftm = torch.sqrt(torch.clamp(real_part**2 + imag_part**2, min=1e-8))
+ angles = s / stftm.unsqueeze(-1)
+ s = c * (angles + self.momentum * (angles - prev_angles))
+ x = torchaudio.functional.istft(s, n_fft=self.n_fft, hop_length=self.hop_size,
+ win_length=self.n_fft, window=self.window,
+ center=True)
+ prev_angles = angles
+ return x.unsqueeze(1)
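+
+
+# Hedged usage sketch (not from the original patch; the mel settings below are
+# assumptions matching common DiffVC configurations): FastGL converts a log-mel
+# spectrogram back into an approximate waveform with a fast Griffin-Lim loop.
+#
+# gl = FastGL(n_mels=80, sampling_rate=22050, n_fft=1024, hop_size=256)
+# wav = gl(log_mel, n_iters=32)   # log_mel: (batch, 80, frames) -> wav: (batch, 1, samples)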
diff --git a/talkingface/utils/voice_conversion_talkingface/DiffVC_speaker_encoder_utils.py b/talkingface/utils/voice_conversion_talkingface/DiffVC_speaker_encoder_utils.py
new file mode 100644
index 00000000..9eacd8e2
--- /dev/null
+++ b/talkingface/utils/voice_conversion_talkingface/DiffVC_speaker_encoder_utils.py
@@ -0,0 +1,303 @@
+from pathlib import Path
+import numpy as np
+import argparse
+
+import math
+from scipy.special import expn
+from collections import namedtuple
+
+from time import perf_counter as timer
+from collections import OrderedDict
+
+_type_priorities = [ # In decreasing order
+ Path,
+ str,
+ int,
+ float,
+ bool,
+]
+
+class argutils:
+
+ def _priority(o):
+ p = next((i for i, t in enumerate(_type_priorities) if type(o) is t), None)
+ if p is not None:
+ return p
+ p = next((i for i, t in enumerate(_type_priorities) if isinstance(o, t)), None)
+ if p is not None:
+ return p
+ return len(_type_priorities)
+
+ def print_args(args: argparse.Namespace, parser=None):
+ args = vars(args)
+ if parser is None:
+ priorities = list(map(argutils._priority, args.values()))
+ else:
+ all_params = [a.dest for g in parser._action_groups for a in g._group_actions]
+ priority = lambda p: all_params.index(p) if p in all_params else len(all_params)
+ priorities = list(map(priority, args.keys()))
+
+ pad = max(map(len, args.keys())) + 3
+ indices = np.lexsort((list(args.keys()), priorities))
+ items = list(args.items())
+
+ print("Arguments:")
+ for i in indices:
+ param, value = items[i]
+ print(" {0}:{1}{2}".format(param, ' ' * (pad - len(param)), value))
+ print("")
+
+class logmmse:
+ NoiseProfile = namedtuple("NoiseProfile", "sampling_rate window_size len1 len2 win n_fft noise_mu2")
+ def profile_noise(noise, sampling_rate, window_size=0):
+ """
+ Creates a profile of the noise in a given waveform.
+
+ :param noise: a waveform containing noise ONLY, as a numpy array of floats or ints.
+ :param sampling_rate: the sampling rate of the audio
+ :param window_size: the size of the window the logmmse algorithm operates on. A default value
+ will be picked if left as 0.
+ :return: a NoiseProfile object
+ """
+ noise, dtype = logmmse.to_float(noise)
+ noise += np.finfo(np.float64).eps
+
+ if window_size == 0:
+ window_size = int(math.floor(0.02 * sampling_rate))
+
+ if window_size % 2 == 1:
+ window_size = window_size + 1
+
+ perc = 50
+ len1 = int(math.floor(window_size * perc / 100))
+ len2 = int(window_size - len1)
+
+ win = np.hanning(window_size)
+ win = win * len2 / np.sum(win)
+ n_fft = 2 * window_size
+
+ noise_mean = np.zeros(n_fft)
+ n_frames = len(noise) // window_size
+ for j in range(0, window_size * n_frames, window_size):
+ noise_mean += np.absolute(np.fft.fft(win * noise[j:j + window_size], n_fft, axis=0))
+ noise_mu2 = (noise_mean / n_frames) ** 2
+
+ return logmmse.NoiseProfile(sampling_rate, window_size, len1, len2, win, n_fft, noise_mu2)
+
+ def denoise(wav, noise_profile: NoiseProfile, eta=0.15):
+ """
+ Cleans the noise from a speech waveform given a noise profile. The waveform must have the
+ same sampling rate as the one used to create the noise profile.
+
+ :param wav: a speech waveform as a numpy array of floats or ints.
+ :param noise_profile: a NoiseProfile object that was created from a similar (or a segment of
+ the same) waveform.
+ :param eta: voice threshold for noise update. While the voice activation detection value is
+ below this threshold, the noise profile will be continuously updated throughout the audio.
+ Set to 0 to disable updating the noise profile.
+ :return: the clean wav as a numpy array of floats or ints of the same length.
+ """
+ wav, dtype = logmmse.to_float(wav)
+ wav += np.finfo(np.float64).eps
+ p = noise_profile
+
+ nframes = int(math.floor(len(wav) / p.len2) - math.floor(p.window_size / p.len2))
+ x_final = np.zeros(nframes * p.len2)
+
+ aa = 0.98
+ mu = 0.98
+ ksi_min = 10 ** (-25 / 10)
+
+ x_old = np.zeros(p.len1)
+ xk_prev = np.zeros(p.len1)
+ noise_mu2 = p.noise_mu2
+ for k in range(0, nframes * p.len2, p.len2):
+ insign = p.win * wav[k:k + p.window_size]
+
+ spec = np.fft.fft(insign, p.n_fft, axis=0)
+ sig = np.absolute(spec)
+ sig2 = sig ** 2
+
+ gammak = np.minimum(sig2 / noise_mu2, 40)
+
+ if xk_prev.all() == 0:
+ ksi = aa + (1 - aa) * np.maximum(gammak - 1, 0)
+ else:
+ ksi = aa * xk_prev / noise_mu2 + (1 - aa) * np.maximum(gammak - 1, 0)
+ ksi = np.maximum(ksi_min, ksi)
+
+ log_sigma_k = gammak * ksi / (1 + ksi) - np.log(1 + ksi)
+ vad_decision = np.sum(log_sigma_k) / p.window_size
+ if vad_decision < eta:
+ noise_mu2 = mu * noise_mu2 + (1 - mu) * sig2
+
+ a = ksi / (1 + ksi)
+ vk = a * gammak
+ ei_vk = 0.5 * expn(1, np.maximum(vk, 1e-8))
+ hw = a * np.exp(ei_vk)
+ sig = sig * hw
+ xk_prev = sig ** 2
+ xi_w = np.fft.ifft(hw * spec, p.n_fft, axis=0)
+ xi_w = np.real(xi_w)
+
+ x_final[k:k + p.len2] = x_old + xi_w[0:p.len1]
+ x_old = xi_w[p.len1:p.window_size]
+
+ output = logmmse.from_float(x_final, dtype)
+ output = np.pad(output, (0, len(wav) - len(output)), mode="constant")
+ return output
+
+ ## Alternative VAD algorithm to webrtcvad. It has the advantage of not requiring you to install that
+ ## darn package and it also works for any sampling rate. Maybe I'll eventually use it instead of
+ ## webrtcvad
+ # def vad(wav, sampling_rate, eta=0.15, window_size=0):
+ # """
+ # TODO: fix doc
+ # Creates a profile of the noise in a given waveform.
+ #
+ # :param wav: a waveform containing noise ONLY, as a numpy array of floats or ints.
+ # :param sampling_rate: the sampling rate of the audio
+ # :param window_size: the size of the window the logmmse algorithm operates on. A default value
+ # will be picked if left as 0.
+ # :param eta: voice threshold for noise update. While the voice activation detection value is
+ # below this threshold, the noise profile will be continuously updated throughout the audio.
+ # Set to 0 to disable updating the noise profile.
+ # """
+ # wav, dtype = to_float(wav)
+ # wav += np.finfo(np.float64).eps
+ #
+ # if window_size == 0:
+ # window_size = int(math.floor(0.02 * sampling_rate))
+ #
+ # if window_size % 2 == 1:
+ # window_size = window_size + 1
+ #
+ # perc = 50
+ # len1 = int(math.floor(window_size * perc / 100))
+ # len2 = int(window_size - len1)
+ #
+ # win = np.hanning(window_size)
+ # win = win * len2 / np.sum(win)
+ # n_fft = 2 * window_size
+ #
+ # wav_mean = np.zeros(n_fft)
+ # n_frames = len(wav) // window_size
+ # for j in range(0, window_size * n_frames, window_size):
+ # wav_mean += np.absolute(np.fft.fft(win * wav[j:j + window_size], n_fft, axis=0))
+ # noise_mu2 = (wav_mean / n_frames) ** 2
+ #
+ # wav, dtype = to_float(wav)
+ # wav += np.finfo(np.float64).eps
+ #
+ # nframes = int(math.floor(len(wav) / len2) - math.floor(window_size / len2))
+ # vad = np.zeros(nframes * len2, dtype=np.bool)
+ #
+ # aa = 0.98
+ # mu = 0.98
+ # ksi_min = 10 ** (-25 / 10)
+ #
+ # xk_prev = np.zeros(len1)
+ # noise_mu2 = noise_mu2
+ # for k in range(0, nframes * len2, len2):
+ # insign = win * wav[k:k + window_size]
+ #
+ # spec = np.fft.fft(insign, n_fft, axis=0)
+ # sig = np.absolute(spec)
+ # sig2 = sig ** 2
+ #
+ # gammak = np.minimum(sig2 / noise_mu2, 40)
+ #
+ # if xk_prev.all() == 0:
+ # ksi = aa + (1 - aa) * np.maximum(gammak - 1, 0)
+ # else:
+ # ksi = aa * xk_prev / noise_mu2 + (1 - aa) * np.maximum(gammak - 1, 0)
+ # ksi = np.maximum(ksi_min, ksi)
+ #
+ # log_sigma_k = gammak * ksi / (1 + ksi) - np.log(1 + ksi)
+ # vad_decision = np.sum(log_sigma_k) / window_size
+ # if vad_decision < eta:
+ # noise_mu2 = mu * noise_mu2 + (1 - mu) * sig2
+ # print(vad_decision)
+ #
+ # a = ksi / (1 + ksi)
+ # vk = a * gammak
+ # ei_vk = 0.5 * expn(1, np.maximum(vk, 1e-8))
+ # hw = a * np.exp(ei_vk)
+ # sig = sig * hw
+ # xk_prev = sig ** 2
+ #
+ # vad[k:k + len2] = vad_decision >= eta
+ #
+ # vad = np.pad(vad, (0, len(wav) - len(vad)), mode="constant")
+ # return vad
+
+ def to_float(_input):
+ if _input.dtype == np.float64:
+ return _input, _input.dtype
+ elif _input.dtype == np.float32:
+ return _input.astype(np.float64), _input.dtype
+ elif _input.dtype == np.uint8:
+ return (_input - 128) / 128., _input.dtype
+ elif _input.dtype == np.int16:
+ return _input / 32768., _input.dtype
+ elif _input.dtype == np.int32:
+ return _input / 2147483648., _input.dtype
+ raise ValueError('Unsupported wave file format')
+
+ def from_float(_input, dtype):
+ if dtype == np.float64:
+ return _input, np.float64
+ elif dtype == np.float32:
+ return _input.astype(np.float32)
+ elif dtype == np.uint8:
+ return ((_input * 128) + 128).astype(np.uint8)
+ elif dtype == np.int16:
+ return (_input * 32768).astype(np.int16)
+ elif dtype == np.int32:
+ print(_input)
+ return (_input * 2147483648).astype(np.int32)
+ raise ValueError('Unsupported wave file format')
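+
+ # Hedged usage sketch (not part of the original file): profile the noise on a
+ # silent stretch of a recording, then denoise the full waveform. The variables
+ # "wav" and "sampling_rate" and the 0.5-second noise-only prefix are assumptions
+ # for illustration.
+ #
+ # profile = logmmse.profile_noise(wav[:sampling_rate // 2], sampling_rate)
+ # clean = logmmse.denoise(wav, profile)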
+
+
+class Profiler:
+ def __init__(self, summarize_every=5, disabled=False):
+ self.last_tick = timer()
+ self.logs = OrderedDict()
+ self.summarize_every = summarize_every
+ self.disabled = disabled
+
+ def tick(self, name):
+ if self.disabled:
+ return
+
+ # Log the time needed to execute that function
+ if name not in self.logs:
+ self.logs[name] = []
+ if len(self.logs[name]) >= self.summarize_every:
+ self.summarize()
+ self.purge_logs()
+ self.logs[name].append(timer() - self.last_tick)
+
+ self.reset_timer()
+
+ def purge_logs(self):
+ for name in self.logs:
+ self.logs[name].clear()
+
+ def reset_timer(self):
+ self.last_tick = timer()
+
+ def summarize(self):
+ n = max(map(len, self.logs.values()))
+ assert n == self.summarize_every
+ print("\nAverage execution time over %d steps:" % n)
+
+ name_msgs = ["%s (%d/%d):" % (name, len(deltas), n) for name, deltas in self.logs.items()]
+ pad = max(map(len, name_msgs))
+ for name_msg, deltas in zip(name_msgs, self.logs.values()):
+ print(" %s mean: %4.0fms std: %4.0fms" %
+ (name_msg.ljust(pad), np.mean(deltas) * 1000, np.std(deltas) * 1000))
+ print("", flush=True)
+
diff --git a/talkingface/utils/voice_conversion_talkingface/DiffVC_utils.py b/talkingface/utils/voice_conversion_talkingface/DiffVC_utils.py
new file mode 100644
index 00000000..45e34eeb
--- /dev/null
+++ b/talkingface/utils/voice_conversion_talkingface/DiffVC_utils.py
@@ -0,0 +1,27 @@
+# Copyright (C) 2022. Huawei Technologies Co., Ltd. All rights reserved.
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the MIT License.
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# MIT License for more details.
+
+import numpy as np
+import matplotlib.pyplot as plt
+from scipy.io import wavfile
+
+
+def save_plot(tensor, savepath):
+ plt.style.use('default')
+ fig, ax = plt.subplots(figsize=(12, 3))
+ im = ax.imshow(tensor, aspect="auto", origin="lower", interpolation='none')
+ plt.colorbar(im, ax=ax)
+ plt.tight_layout()
+ fig.canvas.draw()
+ plt.savefig(savepath)
+ plt.close()
+
+
+def save_audio(file_path, sampling_rate, audio):
+ audio = np.clip(audio.detach().cpu().squeeze().numpy(), -0.999, 0.999)
+ wavfile.write(file_path, sampling_rate, (audio * 32767).astype("int16"))
diff --git a/talkingface/utils/voice_conversion_talkingface/audio.py b/talkingface/utils/voice_conversion_talkingface/audio.py
new file mode 100644
index 00000000..b8b55311
--- /dev/null
+++ b/talkingface/utils/voice_conversion_talkingface/audio.py
@@ -0,0 +1,157 @@
+""" from https://github.com/CorentinJ/Real-Time-Voice-Cloning """
+
+from scipy.ndimage.morphology import binary_dilation
+from encoder.params_data import *
+from pathlib import Path
+from typing import Optional, Union
+import numpy as np
+import webrtcvad
+import librosa
+import struct
+
+import torch
+from torchaudio.transforms import Resample
+from librosa.filters import mel as librosa_mel_fn
+
+
+int16_max = (2 ** 15) - 1
+
+
+def preprocess_wav(fpath_or_wav: Union[str, Path, np.ndarray],
+ source_sr: Optional[int] = None):
+ """
+ Applies the preprocessing operations used in training the Speaker Encoder to a waveform
+ either on disk or in memory. The waveform will be resampled to match the data hyperparameters.
+
+ :param fpath_or_wav: either a filepath to an audio file (many extensions are supported, not
+ just .wav), either the waveform as a numpy array of floats.
+ :param source_sr: if passing an audio waveform, the sampling rate of the waveform before
+ preprocessing. After preprocessing, the waveform's sampling rate will match the data
+ hyperparameters. If passing a filepath, the sampling rate will be automatically detected and
+ this argument will be ignored.
+ """
+ # Load the wav from disk if needed
+ if isinstance(fpath_or_wav, str) or isinstance(fpath_or_wav, Path):
+ wav, source_sr = librosa.load(fpath_or_wav, sr=None)
+ else:
+ wav = fpath_or_wav
+
+ # Resample the wav if needed
+ if source_sr is not None and source_sr != sampling_rate:
+ wav = librosa.resample(wav, source_sr, sampling_rate)
+
+ # Apply the preprocessing: normalize volume and shorten long silences
+ wav = normalize_volume(wav, audio_norm_target_dBFS, increase_only=True)
+ wav = trim_long_silences(wav)
+
+ return wav
+
+
+def preprocess_wav_batch(wavs, source_sr=22050):
+ # This torch version is designed to cope with a batch of same lengths wavs
+ if sampling_rate != source_sr:
+ resample = Resample(source_sr, sampling_rate)
+ wavs = resample(wavs)
+ wavs_preprocessed = normalize_volume_batch(wavs, audio_norm_target_dBFS,
+ increase_only=True)
+ # Trimming silence is not implemented in this version yet!
+ return wavs_preprocessed
+
+
+def wav_to_mel_spectrogram(wav):
+ """
+ Derives a mel spectrogram ready to be used by the encoder from a preprocessed audio waveform.
+ Note: this is not a log-mel spectrogram.
+ """
+ frames = librosa.feature.melspectrogram(
+ wav,
+ sampling_rate,
+ n_fft=int(sampling_rate * mel_window_length / 1000),
+ hop_length=int(sampling_rate * mel_window_step / 1000),
+ n_mels=mel_n_channels
+ )
+ return frames.astype(np.float32).T
+
+
+def wav_to_mel_spectrogram_batch(wavs):
+ # This torch version is designed to cope with a batch of same lengths wavs
+ n_fft = int(sampling_rate * mel_window_length / 1000)
+ hop_length = int(sampling_rate * mel_window_step / 1000)
+ win_length = int(sampling_rate * mel_window_length / 1000)
+ window = torch.hann_window(n_fft).to(wavs)
+ mel_basis = torch.from_numpy(librosa_mel_fn(sampling_rate, n_fft,
+ mel_n_channels)).to(wavs)
+ s = torch.stft(wavs, n_fft=n_fft, hop_length=hop_length,
+ win_length=win_length, window=window, center=True)
+ real_part, imag_part = s.unbind(-1)
+ stftm = real_part**2 + imag_part**2
+ mels = torch.matmul(mel_basis, stftm)
+ return torch.transpose(mels, 1, 2)
+
+
+def normalize_volume(wav, target_dBFS, increase_only=False, decrease_only=False):
+ if increase_only and decrease_only:
+ raise ValueError("Both increase only and decrease only are set")
+ dBFS_change = target_dBFS - 10 * np.log10(np.mean(wav ** 2))
+ if (dBFS_change < 0 and increase_only) or (dBFS_change > 0 and decrease_only):
+ return wav
+ return wav * (10 ** (dBFS_change / 20))
+
+
+def normalize_volume_batch(wavs, target_dBFS, increase_only=False, decrease_only=False):
+ # This torch version is designed to cope with a batch of same lengths wavs
+ if increase_only and decrease_only:
+ raise ValueError("Both increase only and decrease only are set")
+ dBFS_change = target_dBFS - 10 * torch.log10(torch.mean(wavs ** 2, axis=-1))
+ scales = torch.ones(wavs.shape[0], device=wavs.device, dtype=wavs.dtype)
+ if increase_only:
+ mask = (dBFS_change > 0).to(scales)
+ elif decrease_only:
+ mask = (dBFS_change < 0).to(scales)
+ else:
+ mask = torch.zeros_like(scales)
+ scales = scales + mask * (10 ** (dBFS_change / 20) - 1.0)
+ return wavs * scales.unsqueeze(-1)
+
+
+def trim_long_silences(wav):
+ """
+ Ensures that segments without voice in the waveform remain no longer than a
+ threshold determined by the VAD parameters in params.py.
+
+ :param wav: the raw waveform as a numpy array of floats
+ :return: the same waveform with silences trimmed away (length <= original wav length)
+ """
+ # Compute the voice detection window size
+ samples_per_window = (vad_window_length * sampling_rate) // 1000
+
+ # Trim the end of the audio to have a multiple of the window size
+ wav = wav[:len(wav) - (len(wav) % samples_per_window)]
+
+ # Convert the float waveform to 16-bit mono PCM
+ pcm_wave = struct.pack("%dh" % len(wav), *(np.round(wav * int16_max)).astype(np.int16))
+
+ # Perform voice activation detection
+ voice_flags = []
+ vad = webrtcvad.Vad(mode=3)
+ for window_start in range(0, len(wav), samples_per_window):
+ window_end = window_start + samples_per_window
+ voice_flags.append(vad.is_speech(pcm_wave[window_start * 2:window_end * 2],
+ sample_rate=sampling_rate))
+ voice_flags = np.array(voice_flags)
+
+ # Smooth the voice detection with a moving average
+ def moving_average(array, width):
+ array_padded = np.concatenate((np.zeros((width - 1) // 2), array, np.zeros(width // 2)))
+ ret = np.cumsum(array_padded, dtype=float)
+ ret[width:] = ret[width:] - ret[:-width]
+ return ret[width - 1:] / width
+
+ audio_mask = moving_average(voice_flags, vad_moving_average_width)
+ audio_mask = np.round(audio_mask).astype(bool)
+
+ # Dilate the voiced regions
+ audio_mask = binary_dilation(audio_mask, np.ones(vad_max_silence_length + 1))
+ audio_mask = np.repeat(audio_mask, samples_per_window)
+
+ return wav[audio_mask == True]
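+
+
+# Hedged usage sketch (the path is hypothetical, not part of the original file):
+# normalize volume and trim silences from a wav, then derive the 40-channel mel
+# frames the speaker encoder consumes.
+#
+# wav = preprocess_wav("dataset/wavs/example.wav")
+# frames = wav_to_mel_spectrogram(wav)   # shape (n_frames, mel_n_channels)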
diff --git a/talkingface/utils/voice_conversion_talkingface/inference.py b/talkingface/utils/voice_conversion_talkingface/inference.py
new file mode 100644
index 00000000..0fd2c8f1
--- /dev/null
+++ b/talkingface/utils/voice_conversion_talkingface/inference.py
@@ -0,0 +1,209 @@
+""" from https://github.com/CorentinJ/Real-Time-Voice-Cloning """
+
+from encoder.params_data import *
+from encoder.model import SpeakerEncoder
+from encoder.audio import preprocess_wav, preprocess_wav_batch
+from matplotlib import cm
+from encoder import audio
+from pathlib import Path
+import matplotlib.pyplot as plt
+import numpy as np
+import torch
+
+_model = None # type: SpeakerEncoder
+_device = None # type: torch.device
+
+
+def load_model(weights_fpath: Path, device="cpu"):
+ """
+ Loads the model in memory. If this function is not explicitly called, it will be run on the
+ first call to embed_frames() with the default weights file.
+
+ :param weights_fpath: the path to saved model weights.
+ :param device: either a torch device or the name of a torch device (e.g. "cpu", "cuda"). The
+ model will be loaded and will run on this device. Outputs will however always be on the cpu.
+ If None, will default to your GPU if it's available, otherwise your CPU.
+ """
+ # TODO: I think the slow loading of the encoder might have something to do with the device it
+ # was saved on. Worth investigating.
+ global _model, _device
+ if device is None:
+ _device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+ elif isinstance(device, str):
+ _device = torch.device(device)
+ _model = SpeakerEncoder(_device, torch.device("cpu"))
+ checkpoint = torch.load(weights_fpath, map_location="cpu")
+ _model.load_state_dict(checkpoint["model_state"])
+ _model.eval()
+ print("Loaded encoder \"%s\" trained to step %d" % (weights_fpath.name, checkpoint["step"]))
+
+
+def is_loaded():
+ return _model is not None
+
+
+def embed_frames_batch(frames, use_torch=False):
+ if _model is None:
+ raise Exception("Model was not loaded. Call load_model() before inference.")
+
+ if not use_torch:
+ frames = torch.from_numpy(frames)
+ frames = frames.to(_device)
+
+ embeds = _model.forward(frames)
+ if not use_torch:
+ embeds = embeds.detach().cpu().numpy()
+ return embeds
+
+
+def compute_partial_slices(n_samples, partial_utterance_n_frames=partials_n_frames,
+ min_pad_coverage=0.75, overlap=0.5):
+ """
+ Computes where to split an utterance waveform and its corresponding mel spectrogram to obtain
+ partial utterances of <partial_utterance_n_frames> each. Both the waveform and the mel
+ spectrogram slices are returned, so as to make each partial utterance waveform correspond to
+ its spectrogram. This function assumes that the mel spectrogram parameters used are those
+ defined in params_data.py.
+
+ The returned ranges may be indexing further than the length of the waveform. It is
+ recommended that you pad the waveform with zeros up to wave_slices[-1].stop.
+
+ :param n_samples: the number of samples in the waveform
+ :param partial_utterance_n_frames: the number of mel spectrogram frames in each partial
+ utterance
+ :param min_pad_coverage: when reaching the last partial utterance, it may or may not have
+ enough frames. If at least <min_pad_coverage> of <partial_utterance_n_frames> are present,
+ then the last partial utterance will be considered, as if we padded the audio. Otherwise,
+ it will be discarded, as if we trimmed the audio. If there aren't enough frames for 1 partial
+ utterance, this parameter is ignored so that the function always returns at least 1 slice.
+ :param overlap: by how much the partial utterance should overlap. If set to 0, the partial
+ utterances are entirely disjoint.
+ :return: the waveform slices and mel spectrogram slices as lists of array slices. Index
+ respectively the waveform and the mel spectrogram with these slices to obtain the partial
+ utterances.
+ """
+ assert 0 <= overlap < 1
+ assert 0 < min_pad_coverage <= 1
+
+ samples_per_frame = int((sampling_rate * mel_window_step / 1000))
+ n_frames = int(np.ceil((n_samples + 1) / samples_per_frame))
+ frame_step = max(int(np.round(partial_utterance_n_frames * (1 - overlap))), 1)
+
+ # Compute the slices
+ wav_slices, mel_slices = [], []
+ steps = max(1, n_frames - partial_utterance_n_frames + frame_step + 1)
+ for i in range(0, steps, frame_step):
+ mel_range = np.array([i, i + partial_utterance_n_frames])
+ wav_range = mel_range * samples_per_frame
+ mel_slices.append(slice(*mel_range))
+ wav_slices.append(slice(*wav_range))
+
+ # Evaluate whether extra padding is warranted or not
+ last_wav_range = wav_slices[-1]
+ coverage = (n_samples - last_wav_range.start) / (last_wav_range.stop - last_wav_range.start)
+ if coverage < min_pad_coverage and len(mel_slices) > 1:
+ mel_slices = mel_slices[:-1]
+ wav_slices = wav_slices[:-1]
+
+ return wav_slices, mel_slices
+
+
+def embed_utterance(wav, using_partials=True, return_partials=False, **kwargs):
+ """
+ Computes an embedding for a single utterance.
+
+ # TODO: handle multiple wavs to benefit from batching on GPU
+ :param wav: a preprocessed (see audio.py) utterance waveform as a numpy array of float32
+ :param using_partials: if True, then the utterance is split in partial utterances of
+ <partial_utterance_n_frames> frames and the utterance embedding is computed from their
+ normalized average. If False, the utterance is instead computed from feeding the entire
+ spectogram to the network.
+ :param return_partials: if True, the partial embeddings will also be returned along with the
+ wav slices that correspond to the partial embeddings.
+ :param kwargs: additional arguments to compute_partial_splits()
+ :return: the embedding as a numpy array of float32 of shape (model_embedding_size,). If
+ <return_partials> is True, the partial utterances as a numpy array of float32 of shape
+ (n_partials, model_embedding_size) and the wav partials as a list of slices will also be
+ returned. If <using_partials> is simultaneously set to False, both these values will be None
+ instead.
+ """
+ # Process the entire utterance if not using partials
+ if not using_partials:
+ frames = audio.wav_to_mel_spectrogram(wav)
+ embed = embed_frames_batch(frames[None, ...])[0]
+ if return_partials:
+ return embed, None, None
+ return embed
+
+ # Compute where to split the utterance into partials and pad if necessary
+ wave_slices, mel_slices = compute_partial_slices(len(wav), **kwargs)
+ max_wave_length = wave_slices[-1].stop
+ if max_wave_length >= len(wav):
+ wav = np.pad(wav, (0, max_wave_length - len(wav)), "constant")
+
+ # Split the utterance into partials
+ frames = audio.wav_to_mel_spectrogram(wav)
+ frames_batch = np.array([frames[s] for s in mel_slices])
+ partial_embeds = embed_frames_batch(frames_batch)
+
+ # Compute the utterance embedding from the partial embeddings
+ raw_embed = np.mean(partial_embeds, axis=0)
+ embed = raw_embed / np.linalg.norm(raw_embed, 2)
+
+ if return_partials:
+ return embed, partial_embeds, wave_slices
+ return embed
+
+
+def embed_utterance_batch(wavs, using_partials=True, return_partials=False, **kwargs):
+ # This torch version is designed to cope with a batch of same lengths wavs
+ if not using_partials:
+ print(wavs.shape)
+ frames = audio.wav_to_mel_spectrogram_batch(wavs)
+ embeds = embed_frames_batch(frames)
+ if return_partials:
+ return embeds, None, None
+ return embeds
+
+ wave_slices, mel_slices = compute_partial_slices(wavs.shape[-1], **kwargs)
+ max_wave_length = wave_slices[-1].stop
+ if max_wave_length >= wavs.shape[-1]:
+ wavs = torch.cat([wavs, torch.ones((wavs.shape[0], max_wave_length - wavs.shape[-1]),
+ dtype=wavs.dtype, device=wavs.device)], 1)
+
+ frames = audio.wav_to_mel_spectrogram_batch(wavs)
+ frames_batch = []
+ for i in range(len(frames)):
+ frames_batch += [frames[i][s] for s in mel_slices]
+ frames_batch = torch.stack(frames_batch, 0)
+ partial_embeds = embed_frames_batch(frames_batch, use_torch=True)
+ partial_embeds = partial_embeds.view(wavs.shape[0], len(mel_slices), -1)
+
+ raw_embeds = torch.mean(partial_embeds, axis=1, keepdims=False)
+ embeds = raw_embeds / torch.linalg.norm(raw_embeds, axis=-1, keepdims=True)
+
+ if return_partials:
+ return embeds, partial_embeds, wave_slices
+ return embeds
+
+
+def embed_speaker(wavs, **kwargs):
+ raise NotImplementedError()
+
+
+def plot_embedding_as_heatmap(embed, ax=None, title="", shape=None, color_range=(0, 0.30)):
+ if ax is None:
+ ax = plt.gca()
+
+ if shape is None:
+ height = int(np.sqrt(len(embed)))
+ shape = (height, -1)
+ embed = embed.reshape(shape)
+
+ cmap = cm.get_cmap()
+ mappable = ax.imshow(embed, cmap=cmap)
+ cbar = plt.colorbar(mappable, ax=ax, fraction=0.046, pad=0.04)
+ cbar.set_clim(*color_range)
+
+ ax.set_xticks([]), ax.set_yticks([])
+ ax.set_title(title)
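+
+
+# Hedged end-to-end sketch (the checkpoint and wav paths are hypothetical): load the
+# pretrained speaker encoder, preprocess an utterance and compute its fixed-size embedding.
+#
+# from pathlib import Path
+# load_model(Path("checkpts/spk_encoder/pretrained.pt"), device="cpu")
+# wav = preprocess_wav("dataset/wavs/example.wav")
+# embed = embed_utterance(wav)   # numpy array of shape (model_embedding_size,)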
diff --git a/talkingface/utils/voice_conversion_talkingface/params_data.py b/talkingface/utils/voice_conversion_talkingface/params_data.py
new file mode 100644
index 00000000..62d04121
--- /dev/null
+++ b/talkingface/utils/voice_conversion_talkingface/params_data.py
@@ -0,0 +1,30 @@
+""" from https://github.com/CorentinJ/Real-Time-Voice-Cloning """
+
+## Mel-filterbank
+mel_window_length = 25 # In milliseconds
+mel_window_step = 10 # In milliseconds
+mel_n_channels = 40
+
+
+## Audio
+sampling_rate = 16000
+# Number of spectrogram frames in a partial utterance
+partials_n_frames = 160 # 1600 ms
+# Number of spectrogram frames at inference
+inference_n_frames = 80 # 800 ms
+
+
+## Voice Activation Detection
+# Window size of the VAD. Must be either 10, 20 or 30 milliseconds.
+# This sets the granularity of the VAD. Should not need to be changed.
+vad_window_length = 30 # In milliseconds
+# Number of frames to average together when performing the moving average smoothing.
+# The larger this value, the larger the VAD variations must be to not get smoothed out.
+vad_moving_average_width = 8
+# Maximum number of consecutive silent frames a segment can have.
+vad_max_silence_length = 6
+
+
+## Audio volume normalization
+audio_norm_target_dBFS = -30
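+
+# With these settings each mel window spans 25 ms * 16 kHz = 400 samples with a
+# 10 ms (160-sample) hop, so partials_n_frames = 160 covers roughly 1.6 s of audio
+# and inference_n_frames = 80 covers roughly 0.8 s.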
+
diff --git a/talkingface/utils/voice_conversion_talkingface/params_model.py b/talkingface/utils/voice_conversion_talkingface/params_model.py
new file mode 100644
index 00000000..9c535205
--- /dev/null
+++ b/talkingface/utils/voice_conversion_talkingface/params_model.py
@@ -0,0 +1,12 @@
+""" from https://github.com/CorentinJ/Real-Time-Voice-Cloning """
+
+## Model parameters
+model_hidden_size = 256
+model_embedding_size = 256
+model_num_layers = 3
+
+
+## Training parameters
+learning_rate_init = 1e-4
+speakers_per_batch = 64
+utterances_per_speaker = 10