Changes from all commits (31 commits)
- cbea235: emodisc_dataprocess_update (zhangyuanyuan02, Jan 29, 2024)
- 0bbd9b7: Create conv.py (Jiangzheng123, Jan 30, 2024)
- 4617743: Create syncnet.py (Jiangzheng123, Jan 30, 2024)
- 292f31c: Create wav2lip.py (Jiangzheng123, Jan 30, 2024)
- 810ca8c: Update __init__.py (Jiangzheng123, Jan 30, 2024)
- 34d453b: Update requirements.txt (Jiangzheng123, Jan 30, 2024)
- cda9e39: Create color_syncnet_train.py (Jiangzheng123, Jan 30, 2024)
- 60dc714: Create emotion_disc_train.py (Jiangzheng123, Jan 30, 2024)
- 3bfba5b: Update README.md (zhangyuanyuan02, Jan 30, 2024)
- ed591f5: Update utils.py (Jiangzheng123, Jan 30, 2024)
- 47b2b38: Update requirements.txt (Jiangzheng123, Jan 30, 2024)
- f4bbbb0: Merge pull request #1 from Jiangzheng123/main (zhangyuanyuan02, Jan 30, 2024)
- b6d655d: Update README.md (zhangyuanyuan02, Jan 30, 2024)
- cbf1a2b: update models path (zhangyuanyuan02, Jan 30, 2024)
- a868c27: from path update (zhangyuanyuan02, Jan 30, 2024)
- ef7bfdd: Update README.md (zhangyuanyuan02, Jan 30, 2024)
- 52d0930: image_driven (randombibi, Jan 30, 2024)
- 92300a9: Merge pull request #2 from randombibi/main (zhangyuanyuan02, Jan 30, 2024)
- a94c431: fix (zhangyuanyuan02, Jan 30, 2024)
- f0dea90: Create evaluater.py (zhangyuanyuan02, Jan 30, 2024)
- 440fcd2: Update evaluater.py (zhangyuanyuan02, Jan 30, 2024)
- ffc2414: Update README.md (zhangyuanyuan02, Jan 30, 2024)
- a28b2fb: Update README.md (zhangyuanyuan02, Jan 30, 2024)
- 06e358f: Update README.md (zhangyuanyuan02, Jan 30, 2024)
- e932d1e: Create README_EMOGEN.md (zhangyuanyuan02, Jan 30, 2024)
- 43ad1b4: Update README_EMOGEN.md (zhangyuanyuan02, Jan 30, 2024)
- f900990: Update README.md (zhangyuanyuan02, Jan 30, 2024)
- 414445e: saved (zhangyuanyuan02, Jan 30, 2024)
- f95ac1e: commitInfo (zhangyuanyuan02, Jan 30, 2024)
- 4526c21: Merge branch 'main' of https://github.com/zhangyuanyuan02/talkingface… (zhangyuanyuan02, Jan 30, 2024)
- 48e852e: Update and rename README.md to README-EMOGEN.md (zhangyuanyuan02, Jan 30, 2024)
2 changes: 2 additions & 0 deletions .gitattributes
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
".pth" filter=lfs diff=lfs merge=lfs -text
.pth filter=lfs diff=lfs merge=lfs -text
8 changes: 8 additions & 0 deletions README.md
@@ -208,3 +208,11 @@ python run_talkingface.py --model=xxxx --dataset=xxxx (--other_parameters=xxxxxx











62 changes: 62 additions & 0 deletions README_EMOGEN.md
@@ -0,0 +1,62 @@
# Team README
## Project Introduction: Emotionally Enhanced Talking Face Generation

The paper "Emotionally Enhanced Talking Face Generation" focuses on creating more realistic and convincing talking-face videos by incorporating a broad range of emotions. It addresses a limitation of earlier work, which rarely attends to the subject's expressions and emotions and therefore often fails to produce lifelike videos. The framework implemented in this project generates lip-synced talking-face videos with appropriate expressions and emotions, making them more convincing.

## Features
Talking face generation: the framework builds on a base skeleton architecture and uses a 2D-CNN encoder-decoder network to generate individual frames. It consists of a face encoder, an audio encoder, and a decoder, with an emphasis on visual quality and accurate lip synchronization.

Emotion capture in talking face generation: this is the key component, as it injects emotion information into the generated video. The method decouples the emotion conveyed by the speech audio from an independent emotion label used during generation, giving finer control over the subject's emotion.

Data preprocessing and augmentation: the framework uses fully masked frames together with reference frames when incorporating emotion, because emotion is expressed by more of the face than just the mouth region.

Emotion encoder: encodes the categorical emotion into the video-generation process.
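As a rough illustration of what such an encoder does (a hypothetical sketch, not the repository's code; the function name, label count, and dimensions are invented), a categorical emotion label can be mapped to a dense conditioning vector via an embedding table:

```python
import numpy as np

# Hypothetical sketch of an emotion encoder: a fixed lookup table stands in
# for a learned embedding (e.g. torch.nn.Embedding) that maps a categorical
# emotion label to a dense vector fed to the generator.
def emotion_embedding(labels, num_emotions=6, emb_dim=128, seed=0):
    rng = np.random.default_rng(seed)
    table = rng.standard_normal((num_emotions, emb_dim))  # (num_emotions, emb_dim)
    return table[np.asarray(labels)]                      # (batch, emb_dim)

vecs = emotion_embedding([0, 3, 5])  # three samples with different emotion labels
print(vecs.shape)                    # (3, 128)
```

In the real model the table would be trained jointly with the generator; this sketch only shows the label-to-vector lookup.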

## Requirements (Dependencies)
See requirements.txt for the full list of dependencies. ffmpeg must be installed, as well as the albumentations library.
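Before running anything it can help to sanity-check the environment. The helper below is illustrative (not part of the toolkit); it verifies that ffmpeg is on PATH and that a few of the key Python packages import:

```python
import shutil

# Illustrative environment check (not part of the toolkit): report which of
# the prerequisites mentioned above are missing.
def check_environment():
    missing = []
    if shutil.which("ffmpeg") is None:  # ffmpeg must be on PATH
        missing.append("ffmpeg")
    for mod in ("albumentations", "cv2", "librosa"):
        try:
            __import__(mod)
        except ImportError:
            missing.append(mod)
    return missing

print(check_environment())  # an empty list means the basics are in place
```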

## Implementation and Demos

### Zhang Zhuoyuan

Implemented the model's data preprocessing and data-loading code, implemented the emo_disc emotion-discriminator model, and wrote emogen.yaml.

Steps to run:

Open a console in the talkingface/data/dataprocess/ folder and run:

python emogen_process.py --input_folder <folder_of_dataset> --preprocessed_root <output_folder_for_preprocessed_dataset/>

Preprocessing first converts the videos to 25 fps; this step requires ffmpeg to be installed and added to the PATH environment variable.

The converted videos are stored in the ./modified_videos folder.
<img width="1280" alt="Preprocessing step 1: FPS conversion demo" src="https://github.com/zhangyuanyuan02/talkingface-toolkit/assets/103866519/11d41884-5fc6-4af1-8664-2bb58e54db30">
<img width="1276" alt="FPS conversion result" src="https://github.com/zhangyuanyuan02/talkingface-toolkit/assets/103866519/1b751036-dbea-47b6-b3a3-fa82a8c711a7">
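The FPS-conversion step shown above boils down to one ffmpeg call per video. A minimal sketch of the equivalent logic (mirroring what modify_frame_rate in emogen_process.py does; ffmpeg must be on PATH):

```python
import os
import subprocess

def output_path(src, out_dir="modified_videos"):
    # destination keeps the original filename inside the output folder
    return os.path.join(out_dir, os.path.basename(src))

def to_25fps(src, out_dir="modified_videos", fps=25):
    # Re-encode one video at the target frame rate with ffmpeg.
    os.makedirs(out_dir, exist_ok=True)
    dst = output_path(src, out_dir)
    subprocess.run(["ffmpeg", "-i", src, "-r", str(fps), "-y", dst], check=True)
    return dst
```

Passing the arguments as a list (instead of one shell string) avoids problems with spaces in file paths.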
The program then automatically runs the second preprocessing step:
<img width="1280" alt="Preprocessing step 2 demo" src="https://github.com/zhangyuanyuan02/talkingface-toolkit/assets/103866519/a2eba2fb-ca15-419e-896d-fe7665474ca9">
Because the local machine ran out of memory, preprocessing was completed on another cloud server instead.
<img width="1280" alt="Out-of-memory error during preprocessing" src="https://github.com/zhangyuanyuan02/talkingface-toolkit/assets/103866519/d393514c-8107-495a-a4f9-d7da15acc063">
<img width="578" alt="Preprocessing completed" src="https://github.com/zhangyuanyuan02/talkingface-toolkit/assets/103866519/16264f1c-188c-463a-a2f0-db03aa9faf03">

Next, train the emotion discriminator:

<img width="572" alt="Emotion discriminator training result" src="https://github.com/zhangyuanyuan02/talkingface-toolkit/assets/103866519/3c7e3c19-0939-4c41-a4b8-7f4489026fdc">

Since training takes a long time, you can press Ctrl+C to stop it partway through.
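A common pattern for making Ctrl+C safe in a long run (an illustrative sketch, not the trainer's exact code) is to catch KeyboardInterrupt and save state before exiting:

```python
# Illustrative pattern for stopping a long training run cleanly with Ctrl+C
# (not the repository's exact trainer code).
def train(num_steps, save_checkpoint):
    completed = 0
    try:
        for step in range(num_steps):
            # one optimisation step of the discriminator would go here
            completed += 1
    except KeyboardInterrupt:
        print("Interrupted; saving checkpoint before exit.")
    save_checkpoint(completed)  # persist the latest state either way
    return completed
```

With this shape, interrupting the run still leaves a usable checkpoint on disk.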


### Jiang Zheng
Wrote the training code and implemented the expert lip-sync discriminator model.

Modified requirements.txt.

Modified utils.py in the utils folder.

Added wav2lip.py to the model folder and modified __init__.py.

Added conv.py and emogen_syncnet.py under model/audio_driven_talkingface.

Added color_syncnet_train.py and emotion_disc_train.py to the trainer folder.

### Zhou Yang
Filled in the model files needed for image_driven, including conv.py, emo_disc.py, emo_syncnet.py, and wav2lip.py.
12 changes: 11 additions & 1 deletion requirements.txt
@@ -40,7 +40,6 @@ joblib==1.3.2
jsonschema==4.19.2
jsonschema-specifications==2023.7.1
kiwisolver==1.4.5
kornia==0.5.5
lazy_loader==0.3
librosa==0.10.1
llvmlite==0.37.0
@@ -112,3 +111,14 @@ wandb==0.15.12
Werkzeug==3.0.1
yapf==0.40.2
zipp==3.17.0
numba
numpy
opencv-python==4.1.1.26
torch>=1.1.0
torchvision>=0.3.0
tqdm>=4.45.0
dlib
scikit-image
matplotlib
h5py
4 changes: 4 additions & 0 deletions saved/README-EMOGEN.md
@@ -0,0 +1,4 @@
Stores the trained weight files.
Since GitHub does not allow files larger than 100 MB, please download the weights from Baidu Netdisk instead:
Link: https://pan.baidu.com/s/1Rdtpv7P38HYzPckk3pqCEA?pwd=emog
Extraction code: emog
136 changes: 136 additions & 0 deletions talkingface/data/dataprocess/audio.py
@@ -0,0 +1,136 @@
import librosa
import librosa.filters
import numpy as np
# import tensorflow as tf
from scipy import signal
from scipy.io import wavfile
from hparams import hparams as hp

def load_wav(path, sr):
return librosa.core.load(path, sr=sr)[0]

def save_wav(wav, path, sr):
wav *= 32767 / max(0.01, np.max(np.abs(wav)))
#proposed by @dsmiller
wavfile.write(path, sr, wav.astype(np.int16))

def save_wavenet_wav(wav, path, sr):
    # librosa.output was removed in librosa >= 0.8; use the soundfile package
    import soundfile as sf
    sf.write(path, wav, sr)

def preemphasis(wav, k, preemphasize=True):
if preemphasize:
return signal.lfilter([1, -k], [1], wav)
return wav

def inv_preemphasis(wav, k, inv_preemphasize=True):
if inv_preemphasize:
return signal.lfilter([1], [1, -k], wav)
return wav

def get_hop_size():
hop_size = hp.hop_size
if hop_size is None:
assert hp.frame_shift_ms is not None
hop_size = int(hp.frame_shift_ms / 1000 * hp.sample_rate)
return hop_size

def linearspectrogram(wav):
D = _stft(preemphasis(wav, hp.preemphasis, hp.preemphasize))
S = _amp_to_db(np.abs(D)) - hp.ref_level_db

if hp.signal_normalization:
return _normalize(S)
return S

def melspectrogram(wav):
D = _stft(preemphasis(wav, hp.preemphasis, hp.preemphasize))
S = _amp_to_db(_linear_to_mel(np.abs(D))) - hp.ref_level_db

if hp.signal_normalization:
return _normalize(S)
return S

def _lws_processor():
import lws
return lws.lws(hp.n_fft, get_hop_size(), fftsize=hp.win_size, mode="speech")

def _stft(y):
    if hp.use_lws:
        return _lws_processor().stft(y).T  # _lws_processor takes no arguments
    else:
        return librosa.stft(y=y, n_fft=hp.n_fft, hop_length=get_hop_size(), win_length=hp.win_size)

##########################################################
#Those are only correct when using lws!!! (This was messing with Wavenet quality for a long time!)
def num_frames(length, fsize, fshift):
"""Compute number of time frames of spectrogram
"""
pad = (fsize - fshift)
if length % fshift == 0:
M = (length + pad * 2 - fsize) // fshift + 1
else:
M = (length + pad * 2 - fsize) // fshift + 2
return M


def pad_lr(x, fsize, fshift):
"""Compute left and right padding
"""
M = num_frames(len(x), fsize, fshift)
pad = (fsize - fshift)
T = len(x) + 2 * pad
r = (M - 1) * fshift + fsize - T
return pad, pad + r
##########################################################
#Librosa correct padding
def librosa_pad_lr(x, fsize, fshift):
return 0, (x.shape[0] // fshift + 1) * fshift - x.shape[0]

# Conversions
_mel_basis = None

def _linear_to_mel(spectrogram):
    global _mel_basis
    if _mel_basis is None:
        _mel_basis = _build_mel_basis()
    return np.dot(_mel_basis, spectrogram)

def _build_mel_basis():
    assert hp.fmax <= hp.sample_rate // 2
    # librosa >= 0.10 (as pinned in requirements.txt) requires keyword arguments here
    return librosa.filters.mel(sr=hp.sample_rate, n_fft=hp.n_fft, n_mels=hp.num_mels,
                               fmin=hp.fmin, fmax=hp.fmax)

def _amp_to_db(x):
min_level = np.exp(hp.min_level_db / 20 * np.log(10))
return 20 * np.log10(np.maximum(min_level, x))

def _db_to_amp(x):
return np.power(10.0, (x) * 0.05)

def _normalize(S):
if hp.allow_clipping_in_normalization:
if hp.symmetric_mels:
return np.clip((2 * hp.max_abs_value) * ((S - hp.min_level_db) / (-hp.min_level_db)) - hp.max_abs_value,
-hp.max_abs_value, hp.max_abs_value)
else:
return np.clip(hp.max_abs_value * ((S - hp.min_level_db) / (-hp.min_level_db)), 0, hp.max_abs_value)

assert S.max() <= 0 and S.min() - hp.min_level_db >= 0
if hp.symmetric_mels:
return (2 * hp.max_abs_value) * ((S - hp.min_level_db) / (-hp.min_level_db)) - hp.max_abs_value
else:
return hp.max_abs_value * ((S - hp.min_level_db) / (-hp.min_level_db))

def _denormalize(D):
if hp.allow_clipping_in_normalization:
if hp.symmetric_mels:
return (((np.clip(D, -hp.max_abs_value,
hp.max_abs_value) + hp.max_abs_value) * -hp.min_level_db / (2 * hp.max_abs_value))
+ hp.min_level_db)
else:
return ((np.clip(D, 0, hp.max_abs_value) * -hp.min_level_db / hp.max_abs_value) + hp.min_level_db)

if hp.symmetric_mels:
return (((D + hp.max_abs_value) * -hp.min_level_db / (2 * hp.max_abs_value)) + hp.min_level_db)
else:
return ((D * -hp.min_level_db / hp.max_abs_value) + hp.min_level_db)
104 changes: 104 additions & 0 deletions talkingface/data/dataprocess/emogen_process.py
@@ -0,0 +1,104 @@
import argparse
import os
import subprocess
from glob import glob
from tqdm import tqdm
from concurrent.futures import ThreadPoolExecutor, as_completed
import cv2
import numpy as np
import traceback
import audio
from hparams import hparams as hp

import face_detection

def modify_frame_rate(input_folder, output_folder, fps=25.0):
    # Modify the frame rate of the videos
os.makedirs(output_folder, exist_ok=True)
fileList = []
for root, dirnames, filenames in os.walk(input_folder):
for filename in filenames:
if filename.lower().endswith(('.mp4', '.mpg', '.mov', '.flv')):
fileList.append(os.path.join(root, filename))

    for file in fileList:
        # quote the paths so filenames containing spaces survive the shell
        subprocess.run('ffmpeg -i "{}" -r {} -y "{}"'.format(
            file, fps, os.path.join(output_folder, os.path.basename(file))), shell=True)

def process_video_file(vfile, args, gpu_id, fa):
video_stream = cv2.VideoCapture(vfile)
frames = []
while True:
still_reading, frame = video_stream.read()
if not still_reading:
video_stream.release()
break
frames.append(frame)

vidname = os.path.basename(vfile).split('.')[0]

fulldir = os.path.join(args.preprocessed_root, vidname)
os.makedirs(fulldir, exist_ok=True)

batches = [frames[i:i + args.batch_size] for i in range(0, len(frames), args.batch_size)]

i = -1
for fb in batches:
preds = fa[gpu_id].get_detections_for_batch(np.asarray(fb))

for j, f in enumerate(preds):
i += 1
if f is None:
continue

x1, y1, x2, y2 = f
cv2.imwrite(os.path.join(fulldir, '{}.jpg'.format(i)), fb[j][y1:y2, x1:x2])

def process_audio_file(vfile, args):
vidname = os.path.basename(vfile).split('.')[0]
fulldir = os.path.join(args.preprocessed_root, vidname)
os.makedirs(fulldir, exist_ok=True)

wavpath = os.path.join(fulldir, 'audio.wav')

    command = f'ffmpeg -loglevel panic -y -i "{vfile}" -strict -2 "{wavpath}"'
subprocess.call(command, shell=True)

def main():
parser = argparse.ArgumentParser()
parser.add_argument("--input_folder", type=str, help='Path to folder that contains original video files')
parser.add_argument("--output_folder", type=str, help='Path to folder for storing modified videos', default='modified_videos/')
parser.add_argument("--fps", type=float, help='Target FPS', default=25.0)
parser.add_argument("--ngpu", type=int, help='Number of GPUs across which to run in parallel', default=1)
parser.add_argument("--batch_size", type=int, help='Single GPU Face detection batch size', default=32)
parser.add_argument("--preprocessed_root", help="Root folder of the preprocessed dataset", required=True)

args = parser.parse_args()

    # Step 1: change the video frame rate
modify_frame_rate(args.input_folder, args.output_folder, args.fps)

    # Set up face detection (one detector per GPU)
fa = [face_detection.FaceAlignment(face_detection.LandmarksType._2D, flip_input=False,
device=f'cuda:{id}') for id in range(args.ngpu)]

    # Step 2: preprocess the video and audio data
filelist = glob(os.path.join(args.output_folder, '*.mp4'))

print('Started processing videos')
jobs = [(vfile, args, i % args.ngpu, fa[i % args.ngpu]) for i, vfile in enumerate(filelist)]
with ThreadPoolExecutor(args.ngpu) as p:
futures = [p.submit(process_video_file, *job) for job in jobs]
_ = [r.result() for r in tqdm(as_completed(futures), total=len(futures))]

print('Dumping audios...')
for vfile in tqdm(filelist):
try:
process_audio_file(vfile, args)
except KeyboardInterrupt:
exit(0)
        except Exception:
            # log the error but keep processing the remaining files
            traceback.print_exc()

if __name__ == '__main__':
main()
1 change: 1 addition & 0 deletions talkingface/data/dataprocess/face_detection/README.md
@@ -0,0 +1 @@
The code for Face Detection in this folder has been taken from the wonderful [face_alignment](https://github.com/1adrianb/face-alignment) repository. This has been modified to take batches of faces at a time.
7 changes: 7 additions & 0 deletions talkingface/data/dataprocess/face_detection/__init__.py
@@ -0,0 +1,7 @@
# -*- coding: utf-8 -*-

__author__ = """Adrian Bulat"""
__email__ = 'adrian.bulat@nottingham.ac.uk'
__version__ = '1.0.1'

from .api import FaceAlignment, LandmarksType, NetworkSize