106 changes: 106 additions & 0 deletions LiveSpeechPortraits/Paper_README.md
# Live Speech Portraits: Real-Time Photorealistic Talking-Head Animation

This repository contains the implementation of the following paper:

> **Live Speech Portraits: Real-Time Photorealistic Talking-Head Animation**
>
> Yuanxun Lu, [Jinxiang Chai](https://scholar.google.com/citations?user=OcN1_gwAAAAJ&hl=zh-CN&oi=ao), [Xun Cao](https://cite.nju.edu.cn/People/Faculty/20190621/i5054.html) *(SIGGRAPH Asia 2021)*
>
> **Abstract**: To the best of our knowledge, we present the first live system that generates personalized photorealistic talking-head animation driven only by audio signals at over 30 fps. Our system contains three stages. The first stage is a deep neural network that extracts deep audio features, along with a manifold projection that projects the features into the target person's speech space. In the second stage, we learn facial dynamics and motions from the projected audio features. The predicted motions include head poses and upper-body motions: the former is generated by an autoregressive probabilistic model that captures the head-pose distribution of the target person, and upper-body motions are deduced from head poses. In the final stage, we generate conditional feature maps from the previous predictions and send them, together with a candidate image set, to an image-to-image translation network to synthesize photorealistic renderings. Our method generalizes well to wild audio and successfully synthesizes high-fidelity personalized facial details, e.g., wrinkles and teeth. Our method also allows explicit control of head poses. Extensive qualitative and quantitative evaluations, along with user studies, demonstrate the superiority of our method over state-of-the-art techniques.
>
> [[Project Page]](https://yuanxunlu.github.io/projects/LiveSpeechPortraits/) [[Paper]](https://yuanxunlu.github.io/projects/LiveSpeechPortraits/resources/SIGGRAPH_Asia_2021__Live_Speech_Portraits__Real_Time_Photorealistic_Talking_Head_Animation.pdf) [[Arxiv]](https://arxiv.org/abs/2109.10595) [[Web Demo]](https://replicate.ai/yuanxunlu/livespeechportraits)

![Teaser](./doc/Teaser.jpg)

Figure 1. Given an arbitrary input audio stream, our system generates personalized and photorealistic talking-head animation in real time. Right: May and Obama are driven by the same utterance but present different speaking characteristics.

<a href="https://replicate.ai/yuanxunlu/livespeechportraits"><img src="https://img.shields.io/static/v1?label=Replicate&message=Demo and Docker Image&color=blue"></a>


## Requirements

- This project is successfully trained and tested on Windows 10 with PyTorch 1.7 (Python 3.6). Linux and lower PyTorch versions should also work (not tested). We recommend creating a new environment:

```
conda create -n LSP python=3.6
conda activate LSP
```

- Clone the repository:

```
git clone https://github.com/YuanxunLu/LiveSpeechPortraits.git
cd LiveSpeechPortraits
```

- FFmpeg is required to combine the audio and the generated silent videos. Please check [FFmpeg](http://ffmpeg.org/download.html) for installation. Linux users can also run:

```
sudo apt-get install ffmpeg
```
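As a hedged illustration of what this muxing step looks like, here is a small Python sketch that builds the ffmpeg command for pairing a silent rendering with its driving audio. The helper and the file names are hypothetical, not part of this repo:

```python
def build_mux_cmd(silent_video, audio, output):
    """Return an ffmpeg argv that copies the video stream and adds the audio track."""
    return [
        "ffmpeg", "-y",
        "-i", silent_video,  # silent rendering produced by the model
        "-i", audio,         # driving audio track
        "-c:v", "copy",      # keep the video stream as-is, no re-encode
        "-c:a", "aac",       # encode audio to AAC for an MP4 container
        "-shortest",         # stop at the shorter of the two streams
        output,
    ]

cmd = build_mux_cmd("silent.avi", "speech.wav", "talking_head.mp4")
```

The argv list can be passed to `subprocess.run(cmd, check=True)` once FFmpeg is on the `PATH`.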

- Install the dependencies:

```
pip install -r requirements.txt
```



## Demo

- Download the pre-trained models and data from [Google Drive](https://drive.google.com/drive/folders/1sHc2xEEGwnb0h2rkUhG9sPmOxvRvPVpJ?usp=sharing) to the `data` folder. Data for five subjects are released (May, Obama1, Obama2, Nadella, and McStay).

- Run the demo:

```
python demo.py --id May --driving_audio ./data/Input/00083.wav --device cuda
```

Results can be found under the `results` folder.

- **(New!) Docker and Web Demo**

We are really grateful to [Andreas](https://github.com/andreasjansson) from [Replicate](https://replicate.ai/home) for his amazing work building the web demo! You can now run the [Demo](https://replicate.ai/yuanxunlu/livespeechportraits) in the browser.

- **For the original links to these videos, please check issue [#7](https://github.com/YuanxunLu/LiveSpeechPortraits/issues/7).**




## Citation

If you find this project useful for your research, please consider citing:

```
@article{lu2021live,
  author   = {Lu, Yuanxun and Chai, Jinxiang and Cao, Xun},
  title    = {{Live Speech Portraits}: Real-Time Photorealistic Talking-Head Animation},
  journal  = {ACM Transactions on Graphics},
  numpages = {17},
  volume   = {40},
  number   = {6},
  month    = dec,
  year     = {2021},
  doi      = {10.1145/3478513.3480484}
}
```



## Acknowledgment

- This repo was built on the framework of [pix2pix-pytorch](https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix).
- Thanks to the authors of [MakeItTalk](https://github.com/adobe-research/MakeItTalk), [ATVG](https://github.com/lelechen63/ATVGnet), [RhythmicHead](https://github.com/lelechen63/Talking-head-Generation-with-Rhythmic-Head-Motion), and [Speech-Driven Animation](https://github.com/DinoMan/speech-driven-animation) for making their excellent work and code publicly available.
- Thanks to [Andreas](https://github.com/andreasjansson) for his efforts on the web demo.











143 changes: 143 additions & 0 deletions LiveSpeechPortraits/README.md
# Speech Recognition Course Project: Live Speech Portraits

## Team Members

Sun Letian, Yu Boxuan, Sun Zhongtian, Li Zeming

## File Overview

`./source`: all source code

`./data`: **part of** the test audio/video files and the pre-trained models

`README.md`: documentation for the Docker image

`Paper_README.md`: documentation for the paper's code

## Using the LiveSpeechPortraits Docker Image

The model in this project is packaged with `Docker 27.3`; before use, please make sure `Docker Engine 27.3` is installed correctly. You can download and install Docker from the [Docker website](https://www.docker.com/).

The paper's companion code uses the `pytorch-1.7.10+cu110` environment, and our reproduction is based on this version. The corresponding Docker image can be downloaded here: [lsp_demo_1.3.tar](https://pan.baidu.com/s/1NIJDdSzwFL3lPSb-tYuaZQ?pwd=fnbg).

Because this `cuda` version is old, the latest RTX 40-series GPUs cannot train with it, so we repackaged a Docker image based on `cuda 11.8` for newer hardware. Since the original code has compatibility issues with this setup, we replaced the incompatible modules and rewrote parts of the code. This image can be downloaded here: [lsp_quickrun_cu118.tar](https://pan.baidu.com/s/1a1C2oy5DBqbVjWnXOOr9rw?pwd=imjb).

### Preparation

#### 1. Load the Docker image from the .tar file

First, download the Docker image file compatible with your GPU from the links above.

Load the Docker image with:

```bash
docker load -i lsp_demo_XXXXX.tar
```

#### 2. Prepare the data

Before running the Docker image, create the following three folders for the pre-trained models, the input audio, and the output results:

`models`: pre-trained models

`input`: input audio files

`results`: generated outputs

You can [download](https://drive.google.com/drive/folders/1sHc2xEEGwnb0h2rkUhG9sPmOxvRvPVpJ?usp=sharing) the pre-trained models from Google Drive and save them to the `models` folder. Make sure its contents look like this:

```
.
|-- APC_epoch_160.model
|-- May
|-- McStay
|-- Nadella
|-- Obama1
`-- Obama2
```
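As a small sanity check before mounting the folder, you can verify that the layout above is complete. This is a hedged sketch (the helper name is ours, not part of the image); `models_dir` is whatever local path you created:

```python
from pathlib import Path

# Entries expected in the models folder, as listed in the tree above.
EXPECTED = ["APC_epoch_160.model", "May", "McStay", "Nadella", "Obama1", "Obama2"]

def missing_entries(models_dir):
    """Return the expected entries that are absent from models_dir."""
    root = Path(models_dir)
    return [name for name in EXPECTED if not (root / name).exists()]
```

If the returned list is non-empty, re-check the Google Drive download before running the container.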

Save the data files you want to feed to the model in the `input` folder.

### Image Modes and Parameters

The `LSP_QuickRun` image supports two modes: `--lspmodel` for **generating videos** and `--eval` for **evaluating the model**.

#### 1. `--lspmodel` mode

In this mode, specify the following arguments:

`--id`: name of the pre-trained model, e.g. `May`, `Obama1`, `Obama2`;

`--device`: device to use, e.g. `cuda` or `cpu`;

`--driving_audio`: path to the input audio file (a path inside the Docker container).

The generated video is saved to `/workspace/results` inside the container.

#### 2. `--eval` mode

In this mode, specify the following arguments:

`--gt_video`: path to the reference video (inside the Docker container);

`--gen_video`: path to the video generated by the model (inside the Docker container).

The evaluation results are printed to the command line.
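The README does not specify which metrics the `--eval` mode reports. As an illustrative sketch only (the metric choice is our assumption, not documented behavior of the image), a common frame-wise metric for comparing a reference frame against a generated one is PSNR:

```python
import numpy as np

def psnr(reference, generated, max_val=255.0):
    """Peak signal-to-noise ratio between two same-shaped uint8 frames."""
    ref = reference.astype(np.float64)
    gen = generated.astype(np.float64)
    mse = np.mean((ref - gen) ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return 10.0 * np.log10(max_val ** 2 / mse)
```

In practice the frames of `--gt_video` and `--gen_video` would be decoded, aligned frame by frame, and the per-frame scores averaged.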

### Example Docker Commands

#### 1. `--lspmodel` mode

Command template:

```bash
docker run -it --gpus all --rm --shm-size=8g \
    -v <local models folder>:/workspace/data \
    -v <local input folder>:/workspace/input \
    -v <local results folder>:/workspace/results \
    <image name> \
    --lspmodel \
    --id <pre-trained model ID> \
    --device <device name> \
    --driving_audio <audio path inside the container>
```

For example:

```bash
docker run -it --gpus all --rm --shm-size=8g \
-v E:\Code\Docker\models:/workspace/data \
-v E:\Code\Docker\input:/workspace/input \
-v E:\Code\Docker\results:/workspace/results \
lsp_quickrun:1.3 \
--lspmodel \
--id Obama1 \
--device cuda \
--driving_audio /workspace/input/00083.wav
```

#### 2. `--eval` mode

Command template:

```bash
docker run -it --gpus all --rm --shm-size=8g \
    -v <local input folder>:/workspace/input \
    -v <local results folder>:/workspace/results \
    <image name> \
    --eval \
    --gt_video <reference video path inside the container> \
    --gen_video <generated video path inside the container>
```

For example:

```bash
docker run -it --gpus all --rm --shm-size=8g \
-v E:\Code\Docker\input:/workspace/input \
-v E:\Code\Docker\results:/workspace/results \
lsp_quickrun:1.3 \
--eval \
--gt_video /workspace/input/May_short.mp4 \
--gen_video /workspace/results/May/May_short/May_short.avi
```
Binary file added LiveSpeechPortraits/data/input/00083.wav
Binary file added LiveSpeechPortraits/data/input/May_short.mp4
21 changes: 21 additions & 0 deletions LiveSpeechPortraits/source_code/LICENSE
MIT License

Copyright (c) 2021 OldSix

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
30 changes: 30 additions & 0 deletions LiveSpeechPortraits/source_code/cog.yaml
build:
gpu: true
python_version: "3.8"
system_packages:
- "libgl1-mesa-glx"
- "libglib2.0-0"
- "libsox-fmt-mp3"
python_packages:
- "torch==1.7.1"
- "torchvision==0.8.2"
- "numpy==1.18.1"
- "ipython==7.21.0"
- "Pillow==8.3.1"
- "scikit-image==0.18.3"
- "librosa==0.7.2"
- "tqdm==4.62.3"
- "scipy==1.7.1"
- "dominate==2.6.0"
- "albumentations==0.5.2"
- "beautifulsoup4==4.10.0"
- "sox==1.4.1"
- "h5py==3.4.0"
- "numba==0.48"
- "moviepy==1.0.3"
run:
- apt update -y && apt-get install ffmpeg -y
- apt-get install sox libsox-fmt-mp3 -y
- pip install opencv-python==4.1.2.30

predict: "predict.py:Predictor"
32 changes: 32 additions & 0 deletions LiveSpeechPortraits/source_code/config/May.yaml
model_params:
APC:
ckp_path: './data/APC_epoch_160.model'
mel_dim: 80
hidden_size: 512
num_layers: 3
residual: false
use_LLE: 1
Knear: 10
LLE_percent: 1
Audio2Mouth:
ckp_path: './data/May/checkpoints/Audio2Feature.pkl'
smooth: 1.5
AMP: ['XYZ', 2, 2, 2] # method, x, y, z
Headpose:
ckp_path: './data/May/checkpoints/Audio2Headpose.pkl'
sigma: 0.3
smooth: [5, 10] # rot, trans
AMP: [1, 0.5] # rot, trans
shoulder_AMP: 0.5
Image2Image:
ckp_path: './data/May/checkpoints/Feature2Face.pkl'
size: 'large'
save_input: 1


dataset_params:
root: './data/May/'
fit_data_path: './data/May/3d_fit_data.npz'
pts3d_path: './data/May/tracked3D_normalized_pts_fix_contour.npy'
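Configs like the one above are plain YAML. A minimal sketch of reading one in Python, assuming PyYAML is available (the repo may load configs differently); the inline text mirrors a few keys from the file above:

```python
import yaml  # PyYAML — an assumption about the loader, not confirmed by this repo

# Inline copy of a few keys from the May config above, for illustration only.
CONFIG_TEXT = """
model_params:
  Headpose:
    sigma: 0.3
    smooth: [5, 10]  # rot, trans
    AMP: [1, 0.5]    # rot, trans
dataset_params:
  root: './data/May/'
"""

cfg = yaml.safe_load(CONFIG_TEXT)
headpose = cfg["model_params"]["Headpose"]
```

`yaml.safe_load` returns nested dicts and lists, so `headpose["smooth"]` is the `[5, 10]` rotation/translation smoothing pair.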


32 changes: 32 additions & 0 deletions LiveSpeechPortraits/source_code/config/McStay.yaml
model_params:
APC:
ckp_path: './data/APC_epoch_160.model'
mel_dim: 80
hidden_size: 512
num_layers: 3
residual: false
use_LLE: 1
Knear: 10
LLE_percent: 1
Audio2Mouth:
ckp_path: './data/McStay/checkpoints/Audio2Feature.pkl'
smooth: 2
AMP: ['XYZ', 1.5, 1.5, 1.5] # method, x, y, z
Headpose:
ckp_path: './data/McStay/checkpoints/Audio2Headpose.pkl'
sigma: 0.3
smooth: [5, 10] # rot, trans
AMP: [1, 1] # rot, trans
shoulder_AMP: 0.5
Image2Image:
ckp_path: './data/McStay/checkpoints/Feature2Face.pkl'
size: 'normal'
save_input: 1


dataset_params:
root: './data/McStay/'
fit_data_path: './data/McStay/3d_fit_data.npz'
pts3d_path: './data/McStay/tracked3D_normalized_pts_fix_contour.npy'

