Diffusion model-empowered generative visual semantic communication
Hailong QIN, Jincheng DAI, Sixian WANG, Shengshi YAO, Kai NIU, Wenjun XU
Journal of Tsinghua University (Science and Technology), 2025, Vol. 65, Issue 11: 2080-2094.
Significance: End-to-end semantic communication uses deep learning models to extract semantic features from data, enabling intent-driven communication that significantly improves transmission efficiency. However, existing semantic communication paradigms based on discriminative models rely on symbol-level rate-distortion optimization and perform maximum likelihood estimation from the received signal alone, so they fail to satisfy users' perceptual requirements. To ensure the visual quality of transmitted data, a generative visual semantic communication paradigm has emerged. It adopts a rate-distortion-perception optimization framework and aligns data transmission with human perception through maximum a posteriori estimation. Diffusion models offer strong control over visual generation and have therefore become essential tools for this generative paradigm. Nevertheless, current research lacks a systematic organization of the technical roadmaps for empowering semantic communication with diffusion models.

Progress: This study addresses the gap by modeling the communication process as a mathematical inverse problem and explaining the general methodology by which diffusion models solve data compression and transmission problems through posterior sampling. The fundamental concepts, mathematical formulations, and sampling strategies of diffusion models are introduced systematically. The general methods and key technologies of diffusion model-enabled generative compression and transmission are then reviewed from an inverse problem-solving perspective. The performance metrics commonly used for objective assessment of the visual quality of transmitted data are also summarized to provide a comprehensive evaluation framework. The core observation is that generalized communication processes can be modeled as inverse problems: the source data distribution is inferred by maximum a posteriori estimation from channel measurements and from forward operators composed of various signal processing operations. Through diffusion posterior sampling, diffusion models solve these communication inverse problems in three steps. First, diffusion models pre-trained on large-scale datasets supply the diffusion prior. Second, joint source-channel codecs mitigate channel distortion in visual data transmission and are used to construct proximal regularization terms. Third, measurement regularization terms are constructed from the channel measurements. By integrating these regularization terms for posterior estimation and distribution sampling, diffusion models implicitly reconstruct the source data through gradient descent, overcoming transmission challenges caused by strong channel noise, nonlinear operators, and time-varying channel conditions.

Conclusions and Prospects: The analysis shows that, compared with visual semantic communication approaches based on discriminative deep learning models, the generative visual semantic communication paradigm based on diffusion models can significantly improve transmission efficiency and resilience while preserving the perceptual quality and semantic consistency of visual information. This represents a fundamental shift toward communication systems that prioritize human perceptual requirements alongside traditional distortion metrics. Open issues, including image realism modeling and the acceleration of diffusion model sampling, are discussed. This review highlights the effectiveness of conditional diffusion models in enabling existing semantic communication architectures to recover the source at the receiver from a minimal number of tokens and highly degraded measurements, offering an intelligent and concise design philosophy for future generative visual semantic communication systems.
generative visual semantic communication / diffusion models / inverse problems / maximum a posteriori estimation
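The three-step procedure described in the abstract can be illustrated with a toy example. The sketch below is not the authors' algorithm. It assumes a scalar Gaussian source, so that the score of the noised prior is available in closed form and can stand in for a pre-trained diffusion model, and it assumes a scalar linear channel y = h·x + noise. The update is an annealed Langevin-type iteration with a measurement term evaluated on the denoised estimate, in the spirit of diffusion posterior sampling. The function names, the fixed step size, the noise schedule, and the use of a coarse decoder output `x_dec` as a stand-in for the codec-based proximal term are all hypothetical choices made for illustration.

```python
import math
import random


def prior_score(x_t, sigma, mu0=0.0, s0=1.0):
    """Score of the noised prior N(mu0, s0^2 + sigma^2).

    Assumption: this closed form replaces a pre-trained score network.
    """
    return -(x_t - mu0) / (s0 ** 2 + sigma ** 2)


def map_estimate(y, h=1.0, noise_std=0.5, mu0=0.0, s0=1.0):
    """Closed-form MAP estimate for the Gaussian toy problem (reference value)."""
    precision = h ** 2 / noise_std ** 2 + 1.0 / s0 ** 2
    return (h * y / noise_std ** 2 + mu0 / s0 ** 2) / precision


def posterior_sample(y, h=1.0, noise_std=0.5, x_dec=None, prox_weight=0.0,
                     n_levels=30, steps_per_level=20, step=0.1,
                     temperature=0.0, seed=0, mu0=0.0, s0=1.0):
    """Annealed Langevin-type iteration guided by measurement and proximal terms.

    temperature=0.0 drops the injected noise, which reduces the update to
    gradient ascent on the approximate log-posterior; temperature=1.0 gives
    a stochastic sampler.
    """
    rng = random.Random(seed)
    sig_max, sig_min = 3.0, 0.01  # hypothetical noise schedule
    sigmas = [sig_max * (sig_min / sig_max) ** (i / (n_levels - 1))
              for i in range(n_levels)]
    x = mu0 + temperature * sig_max * rng.gauss(0.0, 1.0)
    for sigma in sigmas:
        # Derivative of the denoised estimate w.r.t. x_t. It is analytic here;
        # with a neural network it would come from automatic differentiation.
        jac = s0 ** 2 / (s0 ** 2 + sigma ** 2)
        for _ in range(steps_per_level):
            score = prior_score(x, sigma, mu0, s0)       # step 1: diffusion prior
            x0_hat = x + sigma ** 2 * score              # Tweedie denoised estimate
            grad = score
            if x_dec is not None:                        # step 2: proximal term
                grad += prox_weight * jac * (x_dec - x0_hat)
            grad += jac * h * (y - h * x0_hat) / noise_std ** 2  # step 3: measurement
            x += step * grad
            x += temperature * math.sqrt(2.0 * step) * rng.gauss(0.0, 1.0)
    return x


if __name__ == "__main__":
    y_received = 2.0
    print("guided estimate:", posterior_sample(y_received))
    print("closed-form MAP:", map_estimate(y_received))
```

With `temperature=0.0` the iteration is deterministic and, for this toy problem, should land close to the closed-form MAP value returned by `map_estimate`. Setting `prox_weight` above zero pulls the result toward `x_dec`, which mimics how an auxiliary decoder output could regularize the reconstruction. A practical system would replace `prior_score` with a trained network and the scalar channel with the actual forward operator.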