
MMGen: Unified Multi-modal Image Generation and Understanding in One Go

In this paper, we introduce MMGen, a unified framework that integrates multiple generative tasks into a single diffusion model and, more importantly, into a single diffusion process.

This includes:

(1) Multi-modal category-conditioned generation: given category information, multiple modalities are generated simultaneously through a single inference process.

(2) Multi-modal visual understanding: depth, surface normals, and segmentation maps are accurately predicted from RGB images.

(3) Multi-modal conditioned generation: corresponding RGB images are produced from a specific modality condition and other aligned modalities.

Code is coming soon.
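Until the release, the following is only a minimal, hypothetical sketch of the central idea behind task (1): stacking all modalities into one shared tensor and denoising them together in a single reverse-diffusion loop under a category condition. Every name, shape, and schedule here (ToyJointDenoiser, CHANNELS, the plain DDPM update) is an illustrative assumption, not MMGen's actual architecture or API.

```python
# Hypothetical sketch only -- the official MMGen code is not yet released.
# Illustrates one diffusion process producing all modalities at once.
import torch
import torch.nn as nn

# Assumed per-modality channel counts, stacked along the channel axis.
CHANNELS = {"rgb": 3, "depth": 1, "normal": 3, "seg": 1}
C_ALL = sum(CHANNELS.values())

class ToyJointDenoiser(nn.Module):
    """Placeholder backbone: predicts noise for the stacked multi-modal
    tensor, conditioned on the diffusion timestep and a category id."""
    def __init__(self, num_classes: int = 10, hidden: int = 32):
        super().__init__()
        self.cls_emb = nn.Embedding(num_classes, hidden)
        self.t_emb = nn.Embedding(1000, hidden)
        self.inp = nn.Conv2d(C_ALL, hidden, 3, padding=1)
        self.out = nn.Conv2d(hidden, C_ALL, 3, padding=1)

    def forward(self, x, t, y):
        h = self.inp(x)
        # Inject timestep + category conditioning as per-channel biases.
        cond = (self.t_emb(t) + self.cls_emb(y))[:, :, None, None]
        return self.out(torch.relu(h + cond))

@torch.no_grad()
def sample(model, category: int, steps: int = 50, size: int = 32):
    """One DDPM-style reverse loop: a single process yields RGB, depth,
    normals, and segmentation simultaneously."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)

    x = torch.randn(1, C_ALL, size, size)  # joint noise for all modalities
    y = torch.tensor([category])
    for i in reversed(range(steps)):
        t = torch.tensor([i])
        eps = model(x, t, y)                # one forward pass, all modalities
        # Standard DDPM posterior-mean update.
        coef = betas[i] / torch.sqrt(1.0 - alpha_bar[i])
        x = (x - coef * eps) / torch.sqrt(alphas[i])
        if i > 0:
            x = x + torch.sqrt(betas[i]) * torch.randn_like(x)

    # Split the shared tensor back into per-modality outputs.
    outs, c0 = {}, 0
    for name, c in CHANNELS.items():
        outs[name] = x[:, c0:c0 + c]
        c0 += c
    return outs

outs = sample(ToyJointDenoiser(), category=3)
print({k: tuple(v.shape) for k, v in outs.items()})
```

The point of the sketch is the single shared tensor: each denoising step makes one forward pass whose noise prediction covers every modality jointly, rather than running a separate diffusion process per modality.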

If you have any questions about this project or would like to discuss it, feel free to drop me an email.

Citation

Please cite as below if you find this repository helpful to your project:

@article{wang2025mmgen,
  title={MMGen: Unified Multi-modal Image Generation and Understanding in One Go},
  author={Wang, Jiepeng and Wang, Zhaoqing and Pan, Hao and Liu, Yuan and Yu, Dongdong and Wang, Changhu and Wang, Wenping},
  journal={arXiv preprint arXiv:2503.20644},
  year={2025}
}
