
Smart Agriculture ›› 2026, Vol. 8 ›› Issue (1): 72-84.doi: 10.12133/j.smartag.SA202509032

• Special Topic: Intelligent Recognition and Diagnosis of Agricultural Diseases and Pests •

Self-Supervised Adaptive Multimodal Feature Fusion Recognition Method for Crop Diseases and Pests

YE Penglin1,2(), MIN Chao1,2,3(), GOU Liangjie2,3, WANG Pengcheng1,2, HUANG Xiaopeng1,2, LI Xin1,2, MENG Yuping4

  1. School of Science, Southwest Petroleum University, Chengdu 610500, Sichuan, China
    2. Institute of Artificial Intelligence, Southwest Petroleum University, Chengdu 610500, Sichuan, China
    3. National Key Laboratory of Oil and Gas Reservoir Geology and Development Engineering, Southwest Petroleum University, Chengdu 610500, Sichuan, China
    4. Sinopec Zhongyuan Oilfield Company, Puyang 457001, Henan, China
  • Received: 2025-09-20 Online: 2026-01-30
  • Foundation items:
    National Natural Science Foundation of China (52574048); Sichuan Science and Technology Program (2025NSFTD0016)
  • About author:

    YE Penglin, Master's degree candidate; research interests: multimodal pattern recognition. E-mail:

  • Corresponding author:
    MIN Chao, Ph.D., Professor; research interests include optimization methods, the mathematical theory of uncertainty, and interpretable machine learning. E-mail:

Self-Supervised Adaptive Multimodal Feature Fusion Recognition of Crop Diseases and Pests

YE Penglin1,2(), MIN Chao1,2,3(), GOU Liangjie2,3, WANG Pengcheng1,2, HUANG Xiaopeng1,2, LI Xin1,2, MENG Yuping4   

  1. School of Science, Southwest Petroleum University, Chengdu 610500, China
    2. Institute of Artificial Intelligence, Southwest Petroleum University, Chengdu 610500, China
    3. National Key Laboratory of Oil and Gas Reservoir Geology and Development Engineering, Southwest Petroleum University, Chengdu 610500, China
    4. Information Center, Sinopec Zhongyuan Oilfield Company, Puyang 457001, China
  • Received: 2025-09-20 Online: 2026-01-30
  • Foundation items:National Natural Science Foundation of China(52574048); Sichuan Science and Technology Program(2025NSFTD0016)
  • About author:

    YE Penglin, E-mail:

  • Corresponding author:
    MIN Chao, E-mail:

Abstract (Chinese version):

[Objective/Significance] Conventional crop disease and pest recognition generally relies on single-modal images, under-utilizing the available information and limiting recognition accuracy. To address this problem, this study proposes a multimodal recognition method that fuses images and text, aiming to substantially improve classification accuracy and model robustness and to provide a new data-driven path for precision prevention and control in agriculture. [Methods] A recognition model based on self-supervised adaptive feature fusion was constructed. First, prompt engineering with a large language model and authoritative agricultural guides converted category labels into fine-grained pathological semantic descriptions. Second, dual-stream image and text features were extracted with Contrastive Language-Image Pre-training (CLIP), and a cross-modal balanced alignment module was designed to resolve the sample-asymmetry problem. Third, an adaptive fusion mechanism dynamically allocated modality weights to achieve deep semantic interaction. Finally, a self-supervised feature reconstruction task was introduced to strengthen the robustness of the feature representation. [Results and Discussions] Experiments on the standard PlantVillage dataset showed that the model reached a classification accuracy of 99.67%, higher than ResNet50 (96.51%), Swin-Transformer (97.48%), and baseline CLIP (98.23%); precision, recall, and F1-score all exceeded 99.00%, verifying the effectiveness and stability of the method. [Conclusions] By fusing textual semantics with visual features, the method effectively overcomes the limitations of single-modal recognition and significantly improves accuracy and generalization in fine-grained classification tasks.
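The adaptive weight allocation mentioned in the Methods can be pictured as a scalar sigmoid gate blending the two modality streams. The following is a minimal sketch, assuming feature vectors are plain Python lists; the weight vector `gate_w` is a hypothetical stand-in for the trained gating layer described in the paper:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def gated_fuse(img_feat, txt_feat, gate_w, gate_b=0.0):
    """Blend image and text feature vectors with a scalar sigmoid gate.

    gate_w is a hypothetical learned weight vector over the concatenated
    [image; text] features; in the actual model this would be a trained layer.
    """
    # linear projection of the concatenated features to a scalar, then sigmoid
    z = sum(w * v for w, v in zip(gate_w, img_feat + txt_feat)) + gate_b
    g = sigmoid(z)  # gate value in (0, 1)
    # convex combination: g weighs the image stream, (1 - g) the text stream
    return [g * i + (1.0 - g) * t for i, t in zip(img_feat, txt_feat)]
```

With zero gate weights the gate value is 0.5, so the fused vector is simply the mean of the two modality vectors; a trained gate shifts this balance per sample.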

Keywords: diseases and pests recognition, multimodal fusion, adaptive feature fusion, self-supervised learning

Abstract:

[Objective] Crop diseases and pests are significant factors restricting global agricultural production. Traditional intelligent recognition technologies predominantly rely on single-modal image data processed by convolutional neural networks (CNNs) or Transformers. However, in complex natural environments, these methods often suffer from insufficient information utilization and limited robustness due to the lack of semantic guidance. Although emerging multimodal approaches like CLIP have introduced textual information, they typically rely on shallow feature alignment in the embedding space without achieving deep semantic interaction or effective feature fusion. Furthermore, the asymmetry between the quantity of image samples and text labels during training poses a challenge for effective cross-modal learning. In this study, a self-supervised adaptive multimodal feature fusion recognition (SAFusion-CLIP) method is proposed, aiming to significantly enhance classification accuracy and model generalization in fine-grained diseases and pests recognition tasks. [Methods] A comprehensive recognition framework was constructed, integrating four key components to achieve deep fusion of visual and textual features. First, prompt engineering was conducted by utilizing large language models (LLMs) combined with authoritative agricultural guides to transform simple category labels into fine-grained pathological semantic descriptions. These descriptions encapsulated morphological details, color gradients, and texture features, with quality verified by BERTScore and ROUGE-L metrics. Second, a cross-modal balanced alignment module was designed to resolve the problem of sample asymmetry between image batches and fixed text labels. This module employed a dot-product attention mechanism to calculate the correlation between image and text projections, applying Softmax normalization to dynamically align image features with their corresponding textual representations. 
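As a rough illustration of the balanced alignment step described above (dot-product attention between a batch of image projections and the fixed set of text-label projections, normalized with Softmax), the sketch below uses plain Python lists; the function and variable names are assumptions for illustration, not taken from the paper's code:

```python
import math

def softmax(scores):
    # numerically stable softmax over a list of scores
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def balanced_align(image_projs, text_projs):
    """For each image projection, compute dot-product attention over the
    fixed text-label projections and return the softmax-weighted
    combination of text embeddings aligned to that image."""
    dim = len(text_projs[0])
    aligned = []
    for img in image_projs:
        # dot-product correlation between this image and every text label
        scores = [sum(a * b for a, b in zip(img, txt)) for txt in text_projs]
        weights = softmax(scores)
        # weighted sum of text embeddings, aligned to this image
        aligned.append([sum(w * txt[d] for w, txt in zip(weights, text_projs))
                        for d in range(dim)])
    return aligned
```

Because the attention is computed per image against the same fixed text bank, a batch of any size yields one aligned text vector per image, which is how the image/text sample asymmetry is sidestepped.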
Third, an adaptive fusion mechanism was employed to achieve deep semantic interaction. A gating unit based on the Sigmoid function was designed to calculate a gate value, which dynamically allocated weights to image and text features, allowing the model to adaptively integrate complementary information from both modalities. Finally, a self-supervised feature reconstruction task was introduced to enhance the robustness of feature representation. A simple decoder was utilized to reconstruct the original image and text embeddings from the fused features, and the model was optimized using a composite objective function combining image-text contrastive loss, mean squared error reconstruction loss, and weighted cross-entropy classification loss. [Results and Discussions] Extensive experiments were conducted on the standard PlantVillage dataset, which includes 39 categories covering 14 crop species. The proposed SAFusion-CLIP model achieved a classification accuracy of 99.67%, with precision, recall, and F1-score all exceeding 99.00%. Comparative analysis demonstrated that the proposed method significantly outperformed mainstream single-modal and baseline multimodal models: ResNet50 (96.51%), Swin-Transformer (97.48%), and baseline CLIP (98.23%). Visualization analysis using Gradient-weighted Class Activation Mapping (Grad-CAM) indicated that, unlike single-modal models, which were susceptible to background noise or non-specific physical damage, the SAFusion-CLIP model focused more precisely on core lesion areas, effectively suppressing background interference. Furthermore, ablation studies confirmed the effectiveness of the proposed modules, showing that the combination of the self-supervised architecture and the adaptive fusion mechanism yielded an accuracy improvement of 2.46 percentage points over the baseline, validating the necessity of deep feature interaction and reconstruction tasks.
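The composite objective described above (image-text contrastive loss, plus MSE reconstruction loss, plus weighted cross-entropy classification loss) could be combined roughly as follows; the weighting coefficients `lam_rec` and `lam_cls` are assumed hyperparameters for illustration, not values reported by the paper:

```python
import math

def mse(pred, target):
    # mean squared error between two equal-length vectors
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def weighted_cross_entropy(logits, label, class_weights):
    # class-weighted cross-entropy via a numerically stable log-sum-exp
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return class_weights[label] * (log_sum_exp - logits[label])

def composite_loss(contrastive_loss, recon, original, logits, label,
                   class_weights, lam_rec=1.0, lam_cls=1.0):
    """Composite objective: contrastive + lam_rec * reconstruction MSE
    + lam_cls * weighted cross-entropy (lam_* are assumed hyperparameters)."""
    return (contrastive_loss
            + lam_rec * mse(recon, original)
            + lam_cls * weighted_cross_entropy(logits, label, class_weights))
```

Summing the three terms lets the reconstruction task regularize the fused representation while the contrastive and classification terms drive alignment and discrimination, which is the trade-off the ablation study isolates.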
[Conclusions] By fusing textual semantics with visual features, the SAFusion-CLIP method effectively overcame the limitations of single-modal recognition. The adaptive fusion mechanism ensured deep interaction between modalities, while the self-supervised reconstruction task significantly enhanced the robustness of feature representation. The experimental results verified that this data-driven approach significantly improves accuracy and generalization capabilities in fine-grained crop disease classification tasks, providing a new and effective solution for precision agricultural prevention and control.

Key words: diseases and pests recognition, multimodal fusion, adaptive feature fusion, self-supervised learning

CLC number: