YODA: Yet Another One-step Diffusion-based Video Compressor


πŸ“’ Introduction

YODA is a neural video codec designed to deliver high perceptual reconstruction quality at practical inference speed.

While one-step diffusion models have excelled in image compression, applying them to video remains a challenge. YODA overcomes the limitations of traditional methods and existing deep learning baselines by introducing a One-step Diffusion Transformer and a Temporal-Awareness mechanism.

Core Highlights:

  • Perceptual Quality: YODA consistently outperforms H.266/VVC and state-of-the-art neural codecs (such as DCVC-RT and PLVC) on perceptual metrics including LPIPS, DISTS, FID, and KID.
  • One-Step Denoising: Utilizing a lightweight linear DiT model, YODA performs denoising in a single step, significantly reducing the inference latency associated with diffusion models.
  • Temporal-Aware Design: Unlike prior efforts that rely on frozen 2D autoencoders, YODA employs a trainable Temporal-Aware AutoEncoder (TA-AE) to fully exploit inter-frame correlations.
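The one-step denoising idea in the highlights above can be illustrated with a small numerical sketch. This is a generic epsilon-prediction formulation, not YODA's actual implementation; the noise schedule values, latent shapes, and function names are all assumptions for illustration.

```python
import numpy as np

def one_step_denoise(z_t, eps_hat, alpha, sigma):
    """Recover a clean-latent estimate in a single step.

    Given a noisy latent z_t = alpha * z0 + sigma * eps and the
    network's noise prediction eps_hat, the clean latent is estimated
    directly -- no iterative sampling loop, unlike multi-step diffusion.
    """
    return (z_t - sigma * eps_hat) / alpha

# Toy demonstration with a "perfect" noise predictor.
rng = np.random.default_rng(0)
z0 = rng.standard_normal((4, 8, 8))   # clean latent (hypothetical shape)
eps = rng.standard_normal(z0.shape)   # Gaussian noise
alpha, sigma = 0.8, 0.6               # hypothetical schedule values
z_t = alpha * z0 + sigma * eps        # forward diffusion (one noise level)

z0_hat = one_step_denoise(z_t, eps, alpha, sigma)
assert np.allclose(z0_hat, z0)        # exact recovery with ideal eps_hat
```

In practice the predictor only approximates the noise, so the single step trades a small amount of fidelity for a large reduction in latency compared with running tens of sampling steps.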

πŸš€ Framework

Figure: Overview of the YODA framework.

YODA proposes an end-to-end unified design consisting of three key components:

  • Temporal-Aware AutoEncoder (TA-AE): Extracts multiscale features from reference frames to generate a compact latent representation.
  • Conditional Latent Coder (CLC): Implicitly models motion within the feature space to perform efficient entropy coding.
  • Linear DiT Model: Adopts a linear DiT for efficient one-step denoising.
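How the three components above fit together can be sketched with simple numeric stand-ins. Every function here is a placeholder for a neural module, and all names, shapes, and the quantization scheme are assumptions, not YODA's real interfaces.

```python
import numpy as np

def ta_ae_encode(frame, reference_feats):
    # Temporal-Aware AutoEncoder (encode): fuse the current frame with
    # reference-frame features into a compact latent (toy: residual).
    return frame - reference_feats

def clc_code(latent):
    # Conditional Latent Coder: quantize the latent for entropy coding
    # (a real codec would arithmetic-code these symbols).
    return np.round(latent)

def linear_dit_denoise(latent_hat):
    # Linear DiT: one-step denoising of the decoded latent
    # (identity placeholder here).
    return latent_hat

def ta_ae_decode(latent, reference_feats):
    # TA-AE (decode): reconstruct the frame from latent + references.
    return latent + reference_feats

rng = np.random.default_rng(1)
ref = rng.standard_normal((8, 8))            # reference features (toy)
frame = ref + 0.1 * rng.standard_normal((8, 8))

latent = ta_ae_encode(frame, ref)
symbols = clc_code(latent * 16) / 16         # step-1/16 quantization
recon = ta_ae_decode(linear_dit_denoise(symbols), ref)
assert np.max(np.abs(recon - frame)) <= 1 / 32  # within half a step
```

The point of the sketch is only the dataflow: encode against references, entropy-code the latent, denoise in one step, then decode back through the temporal-aware decoder.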

πŸ† Performance

YODA demonstrates superior performance across multiple datasets (UVG, HEVC-B, MCL-JCV), surpassing both traditional standards (VTM) and recent neural video codecs (DCVC-RT, DiffVC, PLVC).

Figure: Perceptual quality RD curves (LPIPS, DISTS, FID, KID) on the UVG, HEVC-B, and MCL-JCV datasets. Lower is better.



πŸ‘οΈ Visual Comparison

We provide an interactive video comparison (with sliding view) on our project page to demonstrate the visual reconstruction quality of YODA against the Ground Truth.


πŸ“‚ Data Preparation

We utilized the Vimeo-90K dataset for training and evaluated our model on the UVG, MCL-JCV, and HEVC Class B datasets.


🀝 Acknowledgment

We thank the authors of the following projects for their pioneering contributions and open-source efforts:

  • DCVC-RT: Towards Practical Real-time Neural Video Compression.
  • SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers.
  • DC-AE: Deep Compression Autoencoder.
