[Objective] Vision-language models (VLMs), such as contrastive language-image pretraining (CLIP), achieve cross-modal image-text alignment through large-scale contrastive pretraining and enable prompt-driven zero-shot classification, thereby overcoming the closed category spaces of traditional supervised classification models. However, their generalization performance is constrained by distribution shifts between pretraining and downstream tasks, as well as by inherent knowledge boundaries, particularly in specialized domains with scarce labeled data (e.g., pathology). Numerous CLIP variants and extensions have emerged; owing to their distinct model architectures and pretraining datasets, these variants exhibit complementary but inconsistent downstream performance. Although various parameter-efficient fine-tuning techniques have been proposed for adapting individual CLIP models to downstream tasks, existing studies have mainly focused on optimizing single pretrained models and thus fail to effectively exploit the complementary advantages of heterogeneous models. This study aims to exploit the complementary strengths of heterogeneous pretrained CLIP models for pathology image classification. Specifically, we conduct a systematic comparison of ensemble strategies at both the model-output and feature levels, and we propose a novel feature-level ensemble framework termed Mix-of-CLIP-Experts (MoCE).

[Methods] First, we evaluated multiple pretrained CLIP models on pathology image classification tasks under the zero-shot setting to demonstrate their complementary strengths and weaknesses across datasets. Next, we designed and evaluated various ensemble strategies. At the output level, we investigated simple averaging as well as weighted combinations of predictions based on model confidence scores or learned gating networks. At the feature level, we applied the proposed MoCE method to fuse image features obtained from heterogeneous CLIP models. The main challenge in feature-level CLIP ensembling is the misalignment of embeddings across the incompatible cross-modal spaces of different CLIP models. To address this challenge, MoCE combines adapter-based fine-tuning with the mixture-of-experts (MoE) framework: the pretrained models are simultaneously adapted to the downstream pathology task, and their image (and aligned text) features are projected onto a unified embedding space. A learned router dynamically weights and aggregates the aligned image features into a fused representation, which is then compared against text prompts encoded by a single text encoder to perform the final classification. This design reduces computational redundancy by eliminating the need for multiple text encoders, and it improves downstream performance by fully exploiting model complementarity through adapter-based fine-tuning and MoE routing, as sketched below.
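To make the fusion step concrete, the following is a minimal PyTorch-style sketch of the feature-level mechanism described above. The module name (MoCEFusion), the adapter and router shapes, and the shared embedding dimension are illustrative assumptions rather than the authors' exact implementation; the frozen CLIP backbones are assumed to supply precomputed image embeddings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoCEFusion(nn.Module):
    """Sketch of adapter-based alignment plus MoE routing over CLIP experts."""
    def __init__(self, expert_dims, shared_dim=512):
        super().__init__()
        # One lightweight adapter per frozen CLIP image encoder, projecting
        # its embedding into a shared space; only the adapters and the router
        # are trained, keeping the approach parameter-efficient.
        self.adapters = nn.ModuleList([
            nn.Sequential(nn.Linear(d, shared_dim), nn.ReLU(),
                          nn.Linear(shared_dim, shared_dim))
            for d in expert_dims])
        # Router that assigns one weight per expert from the aligned features.
        self.router = nn.Linear(shared_dim * len(expert_dims), len(expert_dims))

    def forward(self, expert_feats, text_feats, logit_scale=100.0):
        # expert_feats: list of (B, d_i) image embeddings, one per CLIP expert
        # text_feats:   (C, shared_dim) class-prompt embeddings from a single
        #               text encoder, aligned to the shared space
        aligned = [F.normalize(a(f), dim=-1)
                   for a, f in zip(self.adapters, expert_feats)]
        weights = self.router(torch.cat(aligned, dim=-1)).softmax(dim=-1)  # (B, E)
        fused = torch.einsum("be,ebd->bd", weights, torch.stack(aligned))  # (B, D)
        fused = F.normalize(fused, dim=-1)
        # Cosine-similarity logits against the class prompts.
        return logit_scale * fused @ F.normalize(text_feats, dim=-1).t()
```

In this sketch the router performs dense (soft) weighting over all experts; a sparse top-k variant in the spirit of standard MoE layers would be a straightforward substitution.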
[Results] We comprehensively evaluated the proposed framework and baseline ensemble strategies on multiple public pathology datasets under various few-shot settings. MoCE consistently outperformed single-model fine-tuning baselines and output-level ensemble methods, demonstrating the advantage of feature-level model ensembling via adapter-based alignment and dynamic routing. Detailed ablation studies validated the effectiveness of the MoCE framework and its individual components.

[Conclusions] To the best of our knowledge, MoCE is the first feature-level ensembling framework for heterogeneous pretrained CLIP models. By combining adapter-based cross-model feature alignment with MoE routing, it achieves effective fusion of diverse CLIP backbones and substantially improves pathology image classification under limited-data regimes. The parameter-efficient cross-model alignment and dynamic expert fusion mechanisms are broadly applicable beyond pathology to other specialized domains. In future work, we will focus on scaling to larger model pools, applying model distillation for inference efficiency, and extending the framework to dense prediction tasks such as object detection and image segmentation.
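For contrast with MoCE's feature-level fusion, the output-level baselines described in Methods reduce to combining per-model class probabilities. A minimal sketch follows, in which per-model confidence is taken as the maximum softmax probability (one common proxy, assumed here); the learned-gating variant would replace these weights with the output of a trained gating network.

```python
import torch

def output_level_ensemble(logits_list, mode="average"):
    """Combine per-model class logits, each of shape (B, C).

    mode="average":    simple mean of the per-model probabilities.
    mode="confidence": weight each model by its maximum softmax
                       probability, a common confidence proxy.
    """
    probs = torch.stack([l.softmax(dim=-1) for l in logits_list])  # (E, B, C)
    if mode == "average":
        return probs.mean(dim=0)
    conf = probs.max(dim=-1).values                 # (E, B)
    weights = conf / conf.sum(dim=0, keepdim=True)  # normalize over experts
    return (weights.unsqueeze(-1) * probs).sum(dim=0)
```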
Key words: deep learning / vision-language models / model ensemble / mixture of experts / pathological image classification