29 changes: 29 additions & 0 deletions DINet/Dockerfile
@@ -0,0 +1,29 @@
# Use Python 3.6.13-slim as the base image
FROM python:3.6.13-slim

# Update package lists and install the required system libraries (including ffmpeg and other dependencies)
RUN apt-get update --fix-missing && apt-get install -y \
ffmpeg \
libsm6 \
libxext6 \
libx264-dev \
&& rm -rf /var/lib/apt/lists/*

# Set the working directory
WORKDIR /app

# Copy everything in the current directory into /app in the container
COPY . /app

RUN pip install --upgrade pip
# Install PyTorch, TorchVision, and Torchaudio
#RUN pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0 --extra-index-url https://download.pytorch.org/whl/cu113
RUN pip install torch==1.10.1+cu111 torchvision==0.11.2+cu111 torchaudio==0.10.1 -f https://download.pytorch.org/whl/cu111/torch_stable.html
# Install the project dependencies
RUN pip install --no-cache-dir -r requirements.txt

# Set the ENTRYPOINT to python
ENTRYPOINT ["python"]

# Set the default command to run_evaluate.py
CMD ["run_evaluate.py"]
105 changes: 105 additions & 0 deletions DINet/README.md
@@ -0,0 +1,105 @@
# DINet: Deformation Inpainting Network for Realistic Face Visually Dubbing on High Resolution Video (AAAI2023)
![DINet overview](https://img-blog.csdnimg.cn/178c6b3ec0074af7a2dcc9ef26450e75.png)
[Paper](https://fuxivirtualhuman.github.io/pdf/AAAI2023_FaceDubbing.pdf)         [demo video](https://www.youtube.com/watch?v=UU344T-9h7M&t=6s)      Supplementary materials


This is the repository for the 2024 speech recognition course final project: a reproduction of [DINet](https://github.com/MRzzm/DINet).
# Reproduction Notes
This README mostly restates the original project's instructions; for the concrete steps we followed, refer to the 配置文档.txt file.

## Data Download
##### Download the resources (asserts.zip) from [Google drive](https://drive.google.com/drive/folders/1rPtOo9Uuhc59YfFVv4gBmkh0_oG0nCQb?usp=share_link). Unzip the archive and put the directory under ./
+ Inference with the example videos. Run
```
python inference.py --mouth_region_size=256 --source_video_path=./asserts/examples/testxxx.mp4 --source_openface_landmark_path=./asserts/examples/testxxx.csv --driving_audio_path=./asserts/examples/driving_audio_xxx.wav --pretrained_clip_DINet_path=./asserts/clip_training_DINet_256mouth.pth
```
The results are saved in ./asserts/inference_result.

+ Inference with your own videos.
**Note:** The released pretrained model was trained on the HDTF dataset. (The video names are listed in ./asserts/training_video_name.txt.)

Use [openface](https://github.com/TadasBaltrusaitis/OpenFace) to detect smoothed facial landmarks in your custom video.


The detected facial landmarks are saved in "xxxx.csv". Run
```
python inference.py --mouth_region_size=256 --source_video_path= custom video path --source_openface_landmark_path= detected landmark path --driving_audio_path= driving audio path --pretrained_clip_DINet_path=./asserts/clip_training_DINet_256mouth.pth
```
to perform face visual dubbing on your custom video.
## Training
### Data Processing

1. Download the videos from [HDTF](https://github.com/MRzzm/HDTF). Split the videos according to xx_annotion_time.txt, without cropping or resizing them.
2. Resample all split videos to 25 fps and put them in "./asserts/split_video_25fps". You can see two example videos there. We used this [software](http://www.pcfreetime.com/formatfactory/cn/index.html) to resample the videos. The name list of the training videos used in our experiments is provided in "./asserts/training_video_name.txt".
3. Use [openface](https://github.com/TadasBaltrusaitis/OpenFace) to detect smoothed facial landmarks for all videos, and put all ".csv" results in "./asserts/split_video_25fps_landmark_openface". You can see two example csv files there. (A sketch for reading these csv files follows this list.)



4. Extract the frames of all videos and save them in "./asserts/split_video_25fps_frame". Run
```
python data_processing.py --extract_video_frame
```
5. Extract the audio of all videos and save it in "./asserts/split_video_25fps_audio". Run
```
python data_processing.py --extract_audio
```
6. Extract the deepspeech features of all audio files and save them in "./asserts/split_video_25fps_deepspeech". Run
```
python data_processing.py --extract_deep_speech
```
7. Crop the faces of all videos and save the images in "./asserts/split_video_25fps_crop_face". Run
```
python data_processing.py --crop_face
```
8. Generate the training json file "./asserts/training_json.json". Run
```
python data_processing.py --generate_training_json
```
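
As mentioned in step 3, here is a minimal sketch of how the openface landmark csv files can be inspected. It assumes the OpenFace 2.x column layout (x_0..x_67 and y_0..y_67, sometimes with leading spaces in the header); the file name is only an example:
```python
import pandas as pd

# Load one per-video landmark file produced by openface
df = pd.read_csv("./asserts/split_video_25fps_landmark_openface/example.csv")
# OpenFace headers often contain leading spaces; normalize them
df.columns = [c.strip() for c in df.columns]

# Collect the 68 2D landmarks of the first frame as (x, y) pairs
landmarks = [(df.loc[0, f"x_{i}"], df.loc[0, f"y_{i}"]) for i in range(68)]
print(f"{len(df)} frames, first landmark of frame 0: {landmarks[0]}")
```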

### Training the Model
The training is divided into a frame training stage and a clip training stage. In the frame training stage, we use a coarse-to-fine strategy, so the model can be trained at arbitrary resolutions (see the resolution sketch after the steps below).

#### Frame training stage
In the frame training stage, we only use the perception loss and the GAN loss.

1. First, train DINet at 104x80 resolution (mouth region: 64x64). Run
```
python train_DINet_frame.py --augment_num=32 --mouth_region_size=64 --batch_size=24 --result_path=./asserts/training_model_weight/frame_training_64
```


2. Load the pretrained model (face: 104x80 & mouth: 64x64) and train DINet at a higher resolution (face: 208x160 & mouth: 128x128). Run
```
python train_DINet_frame.py --augment_num=100 --mouth_region_size=128 --batch_size=80 --coarse2fine --coarse_model_path=./asserts/training_model_weight/frame_training_64/xxxxxx.pth --result_path=./asserts/training_model_weight/frame_training_128
```


3. Load the pretrained model (face: 208x160 & mouth: 128x128) and train DINet at a higher resolution (face: 416x320 & mouth: 256x256). Run
```
python train_DINet_frame.py --augment_num=20 --mouth_region_size=256 --batch_size=12 --coarse2fine --coarse_model_path=./asserts/training_model_weight/frame_training_128/xxxxxx.pth --result_path=./asserts/training_model_weight/frame_training_256
```
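
The three resolutions above follow a fixed ratio: the two sides of the face window are 1.625x and 1.25x the mouth region size (104x80 for 64, 208x160 for 128, 416x320 for 256). A small sanity-check sketch of this relation (the ratio is read off the numbers above, not taken from the code):
```python
# Face window size implied by the three frame-training resolutions above
for mouth in (64, 128, 256):
    print(f"mouth {mouth}x{mouth} -> face {int(mouth * 1.625)}x{int(mouth * 1.25)}")
# mouth 64x64 -> face 104x80
# mouth 128x128 -> face 208x160
# mouth 256x256 -> face 416x320
```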


#### Clip training stage
In the clip training stage, we use the perception loss, the frame/clip GAN loss, and the sync loss. Load the pretrained frame model (face: 416x320 & mouth: 256x256) and the pretrained syncnet model (mouth: 256x256), then train DINet in the clip setting. Run
```
python train_DINet_clip.py --augment_num=3 --mouth_region_size=256 --batch_size=3 --pretrained_syncnet_path=./asserts/syncnet_256mouth.pth --pretrained_frame_DINet_path=./asserts/training_model_weight/frame_training_256/xxxxx.pth --result_path=./asserts/training_model_weight/clip_training_256
```


# Improved Inference and Evaluation
1. The inference procedure above is too cumbersome for running inference and evaluation over a test set, so we modified this part to go through a single unified script. Put the test set under the test_data folder and run:
```
docker run --rm -v path/to/your/DINet:/app --gpus all -it --shm-size=8G -e CUDA_LAUNCH_BLOCKING=1 kevia/dinet:latest run_inference.py --process_num x
```
--process_num sets how many test items to run; the results are saved in ./asserts/inference_result.
By default, inference uses the released pretrained model; to use a model you trained yourself, add --model_path ./asserts/training_model_weight/clip_training_256/xxxx.pth.
The driving audio can be chosen with --audio_path /path/to/your_audio.

2. The original project provides no script for evaluating the model, so we added a dedicated evaluation script that computes the NIQE, PSNR, FID, SSIM, LSE-C, and LSE-D metrics. Run
```
docker run --rm -v path/to/your/DINet:/app --gpus all -it --shm-size=8G -e CUDA_LAUNCH_BLOCKING=1 kevia/dinet:latest run_evaluate.py
```
The evaluation results are saved in ./asserts/evaluate_result.
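
For reference, PSNR (one of the metrics above) is 10 * log10(255^2 / MSE) between a generated frame and the corresponding ground-truth frame. A minimal numpy/OpenCV sketch (this is not the actual run_evaluate.py implementation, and the frame paths are hypothetical):
```python
import cv2
import numpy as np

def psnr(img1: np.ndarray, img2: np.ndarray) -> float:
    # Peak signal-to-noise ratio between two 8-bit images of equal shape
    mse = np.mean((img1.astype(np.float64) - img2.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else float(10 * np.log10(255.0 ** 2 / mse))

generated = cv2.imread("./asserts/inference_result/frame_0001.png")
reference = cv2.imread("./test_data/frame_0001.png")
print(f"PSNR: {psnr(generated, reference):.2f} dB")
```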

## Acknowledgement
The overall implementation follows [https://github.com/MRzzm/DINet](https://github.com/MRzzm/DINet).
Binary file added DINet/config/__pycache__/config.cpython-36.pyc
Binary file not shown.
110 changes: 110 additions & 0 deletions DINet/config/config.py
@@ -0,0 +1,110 @@
import argparse

class DataProcessingOptions():
def __init__(self):
self.parser = argparse.ArgumentParser()

def parse_args(self):
self.parser.add_argument('--extract_video_frame', action='store_true', help='extract video frame')
self.parser.add_argument('--extract_audio', action='store_true', help='extract audio files from videos')
self.parser.add_argument('--extract_deep_speech', action='store_true', help='extract deep speech features')
self.parser.add_argument('--crop_face', action='store_true', help='crop face')
self.parser.add_argument('--generate_training_json', action='store_true', help='generate training json file')

self.parser.add_argument('--source_video_dir', type=str, default="./asserts/training_data/split_video_25fps",
help='path of source video in 25 fps')
self.parser.add_argument('--openface_landmark_dir', type=str, default="./asserts/training_data/split_video_25fps_landmark_openface",
help='path of openface landmark dir')
self.parser.add_argument('--video_frame_dir', type=str, default="./asserts/training_data/split_video_25fps_frame",
help='path of video frames')
self.parser.add_argument('--audio_dir', type=str, default="./asserts/training_data/split_video_25fps_audio",
help='path of audios')
self.parser.add_argument('--deep_speech_dir', type=str, default="./asserts/training_data/split_video_25fps_deepspeech",
help='path of deep speech')
self.parser.add_argument('--crop_face_dir', type=str, default="./asserts/training_data/split_video_25fps_crop_face",
help='path of crop face dir')
self.parser.add_argument('--json_path', type=str, default="./asserts/training_data/training_json.json",
help='path of training json')
self.parser.add_argument('--clip_length', type=int, default=9, help='clip length')
self.parser.add_argument('--deep_speech_model', type=str, default="./asserts/output_graph.pb",
help='path of pretrained deepspeech model')
return self.parser.parse_args()

class DINetTrainingOptions():
def __init__(self):
self.parser = argparse.ArgumentParser()

def parse_args(self):
self.parser.add_argument('--seed', type=int, default=456, help='random seed to use.')
self.parser.add_argument('--source_channel', type=int, default=3, help='input source image channels')
self.parser.add_argument('--ref_channel', type=int, default=15, help='input reference image channels')
self.parser.add_argument('--audio_channel', type=int, default=29, help='input audio channels')
self.parser.add_argument('--augment_num', type=int, default=32, help='augment training data')
        self.parser.add_argument('--mouth_region_size', type=int, default=64, help='size of the mouth region')
self.parser.add_argument('--train_data', type=str, default=r"./asserts/training_data/training_json.json",
help='path of training json')
self.parser.add_argument('--batch_size', type=int, default=24, help='training batch size')
self.parser.add_argument('--lamb_perception', type=int, default=10, help='weight of perception loss')
        self.parser.add_argument('--lamb_syncnet_perception', type=float, default=0.1, help='weight of syncnet perception loss')
self.parser.add_argument('--lr_g', type=float, default=0.0001, help='initial learning rate for adam')
self.parser.add_argument('--lr_dI', type=float, default=0.0001, help='initial learning rate for adam')
self.parser.add_argument('--start_epoch', default=1, type=int, help='start epoch in training stage')
        self.parser.add_argument('--non_decay', default=200, type=int, help='num of epochs with a fixed learning rate')
        self.parser.add_argument('--decay', default=200, type=int, help='num of epochs with a linearly decaying learning rate')
self.parser.add_argument('--checkpoint', type=int, default=2, help='num of checkpoints in training stage')
self.parser.add_argument('--result_path', type=str, default=r"./asserts/training_model_weight/frame_training_64",
help='result path to save model')
        self.parser.add_argument('--coarse2fine', action='store_true', help='if set, initialize from the coarse model for coarse-to-fine training')
        self.parser.add_argument('--coarse_model_path',
                                 default='',
                                 type=str,
                                 help='path of the (.pth) checkpoint from the previous (coarser) training stage')
        self.parser.add_argument('--pretrained_syncnet_path',
                                 default='',
                                 type=str,
                                 help='path of the pretrained syncnet (.pth) checkpoint')
        self.parser.add_argument('--pretrained_frame_DINet_path',
                                 default='',
                                 type=str,
                                 help='path of the frame-trained DINet (.pth) checkpoint')
# ========================= Discriminator ==========================
self.parser.add_argument('--D_num_blocks', type=int, default=4, help='num of down blocks in discriminator')
self.parser.add_argument('--D_block_expansion', type=int, default=64, help='block expansion in discriminator')
self.parser.add_argument('--D_max_features', type=int, default=256, help='max channels in discriminator')
return self.parser.parse_args()


class DINetInferenceOptions():
def __init__(self):
self.parser = argparse.ArgumentParser()

def parse_args(self):
self.parser.add_argument('--source_channel', type=int, default=3, help='channels of source image')
self.parser.add_argument('--ref_channel', type=int, default=15, help='channels of reference image')
self.parser.add_argument('--audio_channel', type=int, default=29, help='channels of audio feature')
        self.parser.add_argument('--mouth_region_size', type=int, default=256, help='size of the mouth region (determines the face window size)')
self.parser.add_argument('--source_video_path',
default='./asserts/examples/test4.mp4',
type=str,
help='path of source video')
self.parser.add_argument('--source_openface_landmark_path',
default='./asserts/examples/test4.csv',
type=str,
help='path of detected openface landmark')
self.parser.add_argument('--driving_audio_path',
default='./asserts/examples/driving_audio_1.wav',
type=str,
help='path of driving audio')
self.parser.add_argument('--pretrained_clip_DINet_path',
default='./asserts/clip_training_DINet_256mouth.pth',
type=str,
help='pretrained model of DINet(clip trained)')
self.parser.add_argument('--deepspeech_model_path',
default='./asserts/output_graph.pb',
type=str,
help='path of deepspeech model')
self.parser.add_argument('--res_video_dir',
default='./asserts/inference_result',
type=str,
help='path of generated videos')
return self.parser.parse_args()
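
These option classes all follow the same argparse pattern: each script instantiates the class matching its task and calls parse_args(). A minimal sketch of how a training script might consume DINetTrainingOptions (the import path and the seeding details are illustrative, not taken from this repo's training scripts):
```python
import random

import torch

from config.config import DINetTrainingOptions

# Parse command-line flags such as --mouth_region_size and --batch_size
opt = DINetTrainingOptions().parse_args()

# Every option is exposed as an attribute of the parsed namespace
random.seed(opt.seed)
torch.manual_seed(opt.seed)
print(f"mouth size {opt.mouth_region_size}, batch size {opt.batch_size}, lr_g {opt.lr_g}")
```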