Diffusion model-empowered generative visual semantic communication

Hailong QIN, Jincheng DAI, Sixian WANG, Shengshi YAO, Kai NIU, Wenjun XU

Journal of Tsinghua University (Science and Technology), 2025, Vol. 65, Issue 11: 2080-2094.

DOI: 10.16511/j.cnki.qhdxxb.2025.27.046
Frontiers in New-Quality Communication Technology


Abstract

Significance: End-to-end semantic communication uses deep learning models to extract semantic features from data, enabling intent-driven communication that significantly improves transmission efficiency. However, existing semantic communication paradigms based on discriminative models rely on symbol-level rate-distortion optimization and perform maximum likelihood estimation based solely on the received signals, and therefore fail to satisfy users' perceptual requirements. To ensure the visual quality of transmitted data, a generative visual semantic communication paradigm has emerged. It adopts a rate-distortion-perception optimization framework and aligns data transmission with human perception through maximum a posteriori estimation. Diffusion models offer strong control over visual generation and have thus become essential tools for this generative paradigm. Nevertheless, current research lacks a systematic organization of the technical roadmaps for empowering semantic communication with diffusion models.

Progress: This study addresses this gap by modeling the communication process as a mathematical inverse problem and elucidating the general methodology by which diffusion models solve data compression and transmission problems through posterior sampling. The fundamental concepts, mathematical formulations, and sampling strategies underpinning diffusion models are systematically introduced. In addition, the general methods and key technologies of diffusion model-enabled generative compression and transmission are reviewed from an inverse problem-solving perspective. The performance metrics commonly used for objective assessment of the visual quality of transmitted data are also summarized to provide an evaluation framework. The core methodology shows that generalized communication processes can be effectively modeled as inverse problems.
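The inverse-problem view can be written compactly. The block below is only a sketch in assumed notation: the symbols $x$, $y$, $\mathcal{A}$, $n$, and $x_t$, as well as the additive-noise form of the channel, are illustrative choices and are not taken from the paper.

```latex
% Assumed notation (not from the paper): x is the source, \mathcal{A} the forward
% operator (coding, modulation, channel), n the channel noise, y the channel
% measurement, and x_t the noised source at diffusion time t.
\begin{align}
  y &= \mathcal{A}(x) + n, \\
  \hat{x}_{\mathrm{MAP}} &= \arg\max_{x} \, p(x \mid y)
      = \arg\max_{x} \, \bigl[ \log p(y \mid x) + \log p(x) \bigr], \\
  \nabla_{x_t} \log p_t(x_t \mid y) &= \nabla_{x_t} \log p_t(x_t)
      + \nabla_{x_t} \log p_t(y \mid x_t).
\end{align}
```

Under these assumptions, the first term of the last line is the score of the diffusion prior and the second term is driven by the channel measurement, so posterior sampling combines a pre-trained prior with a measurement-consistency term.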
The approach infers the source data distribution by maximum a posteriori estimation, based on channel measurements and on forward operators composed of various signal processing operations. Through diffusion posterior sampling, diffusion models solve these communication inverse problems in three steps: first, diffusion models are pre-trained on large-scale datasets to obtain diffusion priors; second, joint source-channel codecs are used to mitigate channel distortions in visual data transmission and to construct proximal regularization terms; finally, measurement regularization terms are constructed from the channel measurements. By integrating these regularization terms for posterior estimation and distribution sampling, diffusion models can implicitly reconstruct the source data through gradient descent, effectively overcoming the transmission challenges caused by strong channel noise, nonlinear operators, and time-varying channel conditions.

Conclusions and Prospects: The analysis reveals that, compared with visual semantic communication approaches based on discriminative deep learning models, the generative visual semantic communication paradigm based on diffusion models can significantly improve transmission efficiency and resilience while ensuring the perceptual quality and semantic consistency of visual information. This advancement represents a fundamental shift toward communication systems that prioritize human perceptual requirements alongside traditional distortion metrics. Open issues, including image realism modeling and the acceleration of diffusion model sampling, are discussed. The study highlights the effectiveness of conditional diffusion models in enabling existing semantic communication architectures to recover sources at the receiver from minimal tokens and highly degraded measurements, offering an intelligent and concise design philosophy for future generative visual semantic communication systems.
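To make the combination of a diffusion prior and a measurement term concrete, the following is a minimal toy sketch, not the method of the paper. It assumes a linear forward operator `A`, additive Gaussian channel noise, and a Gaussian prior whose noised score is known in closed form, so no trained network is needed. The proximal term built from a joint source-channel codec is omitted, and the stochastic noise injection of posterior sampling is dropped so that the result can be checked against the exact MAP estimate. All names and constants are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (all values are illustrative assumptions, not from the paper).
D, M = 8, 4                      # source and measurement dimensions
MU = np.zeros(D)                 # mean of the Gaussian stand-in prior
S2 = 1.0                         # prior variance
SIGMA_Y = 0.5                    # channel noise standard deviation
A = rng.standard_normal((M, D)) / np.sqrt(D)   # known linear forward operator

x_true = MU + np.sqrt(S2) * rng.standard_normal(D)
y = A @ x_true + SIGMA_Y * rng.standard_normal(M)


def prior_score(x_t, sigma_t):
    """Score of the noised prior N(MU, (S2 + sigma_t^2) I); stands in for a trained network."""
    return -(x_t - MU) / (S2 + sigma_t**2)


def denoise(x_t, sigma_t):
    """Tweedie estimate of the clean source, E[x_0 | x_t]."""
    return x_t + sigma_t**2 * prior_score(x_t, sigma_t)


def measurement_grad(x_t, sigma_t):
    """Gradient w.r.t. x_t of log p(y | denoise(x_t)), the measurement term."""
    jac = S2 / (S2 + sigma_t**2)             # d denoise / d x_t (a scalar for this prior)
    residual = y - A @ denoise(x_t, sigma_t)
    return jac * (A.T @ residual) / SIGMA_Y**2


def reconstruct(n_levels=50, steps_per_level=400):
    """Annealed gradient ascent on the approximate posterior score (noise-free variant)."""
    sigmas = np.append(np.geomspace(10.0, 0.01, n_levels), 0.0)
    # Step size from a curvature bound that holds at every noise level.
    step = 1.0 / (1.0 / S2 + np.linalg.norm(A, 2) ** 2 / SIGMA_Y**2)
    x = sigmas[0] * rng.standard_normal(D)   # start from pure noise
    for sigma_t in sigmas:
        for _ in range(steps_per_level):
            x = x + step * (prior_score(x, sigma_t) + measurement_grad(x, sigma_t))
    return x


def closed_form_map():
    """Exact MAP estimate for the linear-Gaussian model, used as a reference."""
    precision = A.T @ A / SIGMA_Y**2 + np.eye(D) / S2
    return np.linalg.solve(precision, A.T @ y / SIGMA_Y**2 + MU / S2)


x_hat = reconstruct()
print(np.max(np.abs(x_hat - closed_form_map())))
```

In a realistic system the analytic `prior_score` would be replaced by a pre-trained diffusion model and `A` by the actual, possibly nonlinear, chain of coding and channel operations; the Gaussian case is chosen here only because its MAP solution is available in closed form for comparison.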

Key words

generative visual semantic communication / diffusion models / inverse problems / maximum a posteriori estimation

Cite this article

Hailong QIN, Jincheng DAI, Sixian WANG, et al. Diffusion model-empowered generative visual semantic communication[J]. Journal of Tsinghua University (Science and Technology), 2025, 65(11): 2080-2094. https://doi.org/10.16511/j.cnki.qhdxxb.2025.27.046


RIGHTS & PERMISSIONS

All rights reserved. Unauthorized reproduction is prohibited.