
Smart Agriculture ›› 2026, Vol. 8 ›› Issue (1): 72-84.doi: 10.12133/j.smartag.SA202509032

• Special Topic: Intelligent Recognition and Diagnosis of Agricultural Diseases and Pests •

Self-Supervised Adaptive Multimodal Feature Fusion Recognition Method for Crop Diseases and Pests

YE Penglin1,2(), MIN Chao1,2,3(), GOU Liangjie2,3, WANG Pengcheng1,2, HUANG Xiaopeng1,2, LI Xin1,2, MENG Yuping4

  1. School of Science, Southwest Petroleum University, Chengdu 610500, Sichuan, China
    2. Institute of Artificial Intelligence, Southwest Petroleum University, Chengdu 610500, Sichuan, China
    3. National Key Laboratory of Oil and Gas Reservoir Geology and Development Engineering, Southwest Petroleum University, Chengdu 610500, Sichuan, China
    4. Sinopec Zhongyuan Oilfield Company, Puyang 457001, Henan, China
  • Received: 2025-09-20 Online: 2026-01-30
  • Foundation items:
    National Natural Science Foundation of China (52574048); Sichuan Science and Technology Program (2025NSFTD0016)
  • About author:

    YE Penglin, Master's degree candidate; research interests: multimodal pattern recognition. E-mail:

  • Corresponding author:
    MIN Chao, Ph.D., Professor; research interests include optimization methods, the mathematical theory of uncertainty, and interpretable machine learning. E-mail:

Self-Supervised Adaptive Multimodal Feature Fusion Recognition of Crop Diseases and Pests

YE Penglin1,2(), MIN Chao1,2,3(), GOU Liangjie2,3, WANG Pengcheng1,2, HUANG Xiaopeng1,2, LI Xin1,2, MENG Yuping4   

  1. School of Science, Southwest Petroleum University, Chengdu 610500, China
    2. Institute of Artificial Intelligence, Southwest Petroleum University, Chengdu 610500, China
    3. National Key Laboratory of Oil and Gas Reservoir Geology and Development Engineering, Southwest Petroleum University, Chengdu 610500, China
    4. Information Center, Sinopec Zhongyuan Oilfield Company, Puyang 457001, China
  • Received: 2025-09-20 Online: 2026-01-30
  • Foundation items:National Natural Science Foundation of China(52574048); Sichuan Science and Technology Program(2025NSFTD0016)
  • About author:

    YE Penglin, E-mail:

  • Corresponding author:
    MIN Chao, E-mail:

Abstract (Chinese version):

[Objective/Significance] Conventional crop disease and pest recognition generally relies on single-modal images, under-utilizing the available information and limiting recognition accuracy. To address this problem, this study proposes a multimodal recognition method that fuses images and text, aiming to substantially improve classification accuracy and model robustness and to provide a new data-driven path for precision prevention and control in agriculture. [Methods] A recognition model based on self-supervised adaptive feature fusion was constructed. First, prompt engineering with a large language model and authoritative agricultural guides converted category labels into fine-grained pathological semantic descriptions. Second, dual-stream image and text features were extracted with Contrastive Language-Image Pre-training (CLIP), and a cross-modal balanced alignment module was designed to resolve the sample-asymmetry problem. Third, an adaptive fusion mechanism dynamically allocated modality weights to achieve deep semantic interaction. Finally, a self-supervised feature reconstruction task was introduced to strengthen the robustness of the feature representation. [Results and Discussions] Experiments on the standard PlantVillage dataset showed that the model reached a classification accuracy of 99.67%, higher than ResNet50 (96.51%), Swin-Transformer (97.48%), and baseline CLIP (98.23%); precision, recall, and F1-score all exceeded 99.00%, verifying the effectiveness and stability of the method. [Conclusions] By fusing textual semantics with visual features, the method effectively overcomes the limitations of single-modal recognition and significantly improves accuracy and generalization in fine-grained classification tasks.
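The adaptive weight allocation mentioned in the Methods can be pictured as a scalar sigmoid gate blending the two modality streams. The following is a minimal sketch, assuming feature vectors are plain Python lists; the weight vector `gate_w` is a hypothetical stand-in for the trained gating layer described in the paper:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def gated_fuse(img_feat, txt_feat, gate_w, gate_b=0.0):
    """Blend image and text feature vectors with a scalar sigmoid gate.

    gate_w is a hypothetical learned weight vector over the concatenated
    [image; text] features; in the actual model this would be a trained layer.
    """
    # linear projection of the concatenated features to a scalar, then sigmoid
    z = sum(w * v for w, v in zip(gate_w, img_feat + txt_feat)) + gate_b
    g = sigmoid(z)  # gate value in (0, 1)
    # convex combination: g weighs the image stream, (1 - g) the text stream
    return [g * i + (1.0 - g) * t for i, t in zip(img_feat, txt_feat)]
```

With zero gate weights the gate value is 0.5, so the fused vector is simply the mean of the two modality vectors; a trained gate shifts this balance per sample.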

Keywords: diseases and pests recognition, multimodal fusion, adaptive feature fusion, self-supervised learning

Abstract:

[Objective] Crop diseases and pests are significant factors restricting global agricultural production. Traditional intelligent recognition technologies predominantly rely on single-modal image data processed by convolutional neural networks (CNNs) or Transformers. However, in complex natural environments, these methods often suffer from insufficient information utilization and limited robustness due to the lack of semantic guidance. Although emerging multimodal approaches like CLIP have introduced textual information, they typically rely on shallow feature alignment in the embedding space without achieving deep semantic interaction or effective feature fusion. Furthermore, the asymmetry between the quantity of image samples and text labels during training poses a challenge for effective cross-modal learning. In this study, a self-supervised adaptive multimodal feature fusion recognition (SAFusion-CLIP) method is proposed, aiming to significantly enhance classification accuracy and model generalization in fine-grained diseases and pests recognition tasks. [Methods] A comprehensive recognition framework was constructed, integrating four key components to achieve deep fusion of visual and textual features. First, prompt engineering was conducted by utilizing large language models (LLMs) combined with authoritative agricultural guides to transform simple category labels into fine-grained pathological semantic descriptions. These descriptions encapsulated morphological details, color gradients, and texture features, with quality verified by BERTScore and ROUGE-L metrics. Second, a cross-modal balanced alignment module was designed to resolve the problem of sample asymmetry between image batches and fixed text labels. This module employed a dot-product attention mechanism to calculate the correlation between image and text projections, applying Softmax normalization to dynamically align image features with their corresponding textual representations. 
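As a rough illustration of the balanced alignment step described above (dot-product attention between a batch of image projections and the fixed set of text-label projections, normalized with Softmax), the sketch below uses plain Python lists; the function and variable names are assumptions for illustration, not taken from the paper's code:

```python
import math

def softmax(scores):
    # numerically stable softmax over a list of scores
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def balanced_align(image_projs, text_projs):
    """For each image projection, compute dot-product attention over the
    fixed text-label projections and return the softmax-weighted
    combination of text embeddings aligned to that image."""
    dim = len(text_projs[0])
    aligned = []
    for img in image_projs:
        # dot-product correlation between this image and every text label
        scores = [sum(a * b for a, b in zip(img, txt)) for txt in text_projs]
        weights = softmax(scores)
        # weighted sum of text embeddings, aligned to this image
        aligned.append([sum(w * txt[d] for w, txt in zip(weights, text_projs))
                        for d in range(dim)])
    return aligned
```

Because the attention is computed per image against the same fixed text bank, a batch of any size yields one aligned text vector per image, which is how the image/text sample asymmetry is sidestepped.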
Third, an adaptive fusion mechanism was employed to achieve deep semantic interaction. A gating unit based on the Sigmoid function was designed to calculate a gate value, which dynamically allocated weights to image and text features, allowing the model to adaptively integrate complementary information from both modalities. Finally, a self-supervised feature reconstruction task was introduced to enhance the robustness of feature representation. A simple decoder was utilized to reconstruct the original image and text embeddings from the fused features, and the model was optimized using a composite objective function combining image-text contrastive loss, mean squared error reconstruction loss, and weighted cross-entropy classification loss. [Results and Discussions] Extensive experiments were conducted on the standard PlantVillage dataset, which includes 39 categories covering 14 crop species. The proposed SAFusion-CLIP model achieved a classification accuracy of 99.67%, with precision, recall, and F1-score all exceeding 99.00%. Comparative analysis demonstrated that the proposed method significantly outperformed mainstream single-modal and baseline multimodal models: ResNet50 (96.51%), Swin-Transformer (97.48%), and baseline CLIP (98.23%). Visualization analysis using Gradient-weighted Class Activation Mapping (Grad-CAM) indicated that, unlike single-modal models, which were susceptible to background noise or non-specific physical damage, the SAFusion-CLIP model focused more precisely on core lesion areas, effectively suppressing background interference. Furthermore, ablation studies confirmed the effectiveness of the proposed modules, showing that the combination of the self-supervised architecture and the adaptive fusion mechanism yielded an accuracy improvement of 2.46 percentage points over the baseline, validating the necessity of deep feature interaction and reconstruction tasks.
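The composite objective described above (image-text contrastive loss, plus MSE reconstruction loss, plus weighted cross-entropy classification loss) could be combined roughly as follows; the weighting coefficients `lam_rec` and `lam_cls` are assumed hyperparameters for illustration, not values reported by the paper:

```python
import math

def mse(pred, target):
    # mean squared error between two equal-length vectors
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def weighted_cross_entropy(logits, label, class_weights):
    # class-weighted cross-entropy via a numerically stable log-sum-exp
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return class_weights[label] * (log_sum_exp - logits[label])

def composite_loss(contrastive_loss, recon, original, logits, label,
                   class_weights, lam_rec=1.0, lam_cls=1.0):
    """Composite objective: contrastive + lam_rec * reconstruction MSE
    + lam_cls * weighted cross-entropy (lam_* are assumed hyperparameters)."""
    return (contrastive_loss
            + lam_rec * mse(recon, original)
            + lam_cls * weighted_cross_entropy(logits, label, class_weights))
```

Summing the three terms lets the reconstruction task regularize the fused representation while the contrastive and classification terms drive alignment and discrimination, which is the trade-off the ablation study isolates.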
[Conclusions] By fusing textual semantics with visual features, the SAFusion-CLIP method effectively overcame the limitations of single-modal recognition. The adaptive fusion mechanism ensured deep interaction between modalities, while the self-supervised reconstruction task significantly enhanced the robustness of feature representation. The experimental results verified that this data-driven approach significantly improves accuracy and generalization capabilities in fine-grained crop disease classification tasks, providing a new and effective solution for precision agricultural prevention and control.

Key words: diseases and pests recognition, multimodal fusion, adaptive feature fusion, self-supervised learning

CLC number: