
Smart Agriculture ›› 2026, Vol. 8 ›› Issue (1): 72-84. DOI: 10.12133/j.smartag.SA202509032

• Topic--Intelligent Identification and Diagnosis of Agricultural Diseases and Pests •

Self-Supervised Adaptive Multimodal Feature Fusion Recognition of Crop Diseases and Pests

YE Penglin1,2, MIN Chao1,2,3, GOU Liangjie2,3, WANG Pengcheng1,2, HUANG Xiaopeng1,2, LI Xin1,2, MENG Yuping4

  1. School of Science, Southwest Petroleum University, Chengdu 610500, China
    2. Institute of Artificial Intelligence, Southwest Petroleum University, Chengdu 610500, China
    3. National Key Laboratory of Oil and Gas Reservoir Geology and Development Engineering, Southwest Petroleum University, Chengdu 610500, China
    4. Information Center, Sinopec Zhongyuan Oilfield Company, Puyang 457001, China
  • Received: 2025-09-20 Online: 2026-01-30
  • Foundation items: National Natural Science Foundation of China (52574048); Sichuan Science and Technology Program (2025NSFTD0016)
  • About author: YE Penglin, E-mail:
  • Corresponding author: MIN Chao, E-mail:

Abstract:

[Objective] Crop diseases and pests are significant factors restricting global agricultural production. Traditional intelligent recognition technologies predominantly rely on single-modal image data processed by convolutional neural networks (CNNs) or Transformers. However, in complex natural environments, these methods often suffer from insufficient information utilization and limited robustness due to the lack of semantic guidance. Although emerging multimodal approaches such as CLIP have introduced textual information, they typically rely on shallow feature alignment in the embedding space without achieving deep semantic interaction or effective feature fusion. Furthermore, the asymmetry between the number of image samples and text labels during training poses a challenge for effective cross-modal learning. In this study, a self-supervised adaptive multimodal feature fusion recognition (SAFusion-CLIP) method is proposed, aiming to significantly enhance classification accuracy and model generalization in fine-grained disease and pest recognition tasks.

[Methods] A comprehensive recognition framework was constructed, integrating four key components to achieve deep fusion of visual and textual features. First, prompt engineering was conducted by combining large language models (LLMs) with authoritative agricultural guides to transform simple category labels into fine-grained pathological semantic descriptions. These descriptions encapsulated morphological details, color gradients, and texture features, with quality verified by BERTScore and ROUGE-L metrics. Second, a cross-modal balanced alignment module was designed to resolve the sample asymmetry between image batches and the fixed set of text labels. This module employed a dot-product attention mechanism to calculate the correlation between image and text projections, applying Softmax normalization to dynamically align image features with their corresponding textual representations. Third, an adaptive fusion mechanism was employed to achieve deep semantic interaction: a gating unit based on the Sigmoid function computed a gate value that dynamically allocated weights to image and text features, allowing the model to adaptively integrate complementary information from both modalities. Finally, a self-supervised feature reconstruction task was introduced to enhance the robustness of the feature representation. A simple decoder reconstructed the original image and text embeddings from the fused features, and the model was optimized with a composite objective combining an image-text contrastive loss, a mean squared error reconstruction loss, and a weighted cross-entropy classification loss. A minimal sketch of how these components could fit together is given below.
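The following PyTorch sketch illustrates the balanced alignment, gated fusion, reconstruction, and composite loss described above. All module names, dimensions, and loss weights are illustrative assumptions, not the authors' implementation; the encoders producing the image features and the text bank (e.g. frozen CLIP towers) are omitted.

```python
# Minimal sketch of a SAFusion-CLIP-style fusion head (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class BalancedAlignment(nn.Module):
    """Dot-product attention aligning a batch of B image features with a
    fixed bank of C text features, so B and C need not match."""
    def forward(self, img, txt_bank):
        # img: (B, d); txt_bank: (C, d) -> attention weights: (B, C)
        attn = F.softmax(img @ txt_bank.t() / img.shape[-1] ** 0.5, dim=-1)
        return attn @ txt_bank  # (B, d): one text representation per image

class GatedFusion(nn.Module):
    """Sigmoid gate that adaptively weights image vs. text features."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)
    def forward(self, img, txt):
        g = torch.sigmoid(self.gate(torch.cat([img, txt], dim=-1)))
        return g * img + (1.0 - g) * txt  # per-dimension convex mixture

class SAFusionHead(nn.Module):
    """Alignment + gated fusion + reconstruction decoder + classifier."""
    def __init__(self, dim, num_classes):
        super().__init__()
        self.align = BalancedAlignment()
        self.fuse = GatedFusion(dim)
        # decoder reconstructs the concatenated image||text embeddings (2d)
        self.decoder = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                     nn.Linear(dim, 2 * dim))
        self.classifier = nn.Linear(dim, num_classes)
    def forward(self, img, txt_bank):
        txt = self.align(img, txt_bank)
        fused = self.fuse(img, txt)
        recon_target = torch.cat([img, txt], dim=-1)
        return self.classifier(fused), self.decoder(fused), recon_target

def composite_loss(logits, labels, recon, recon_target, sim, class_weights,
                   w_ce=1.0, w_mse=0.5, w_con=0.5):
    """Weighted CE + MSE reconstruction + image-text contrastive loss.
    The 0.5 weights are placeholders, not the paper's tuned values."""
    ce = F.cross_entropy(logits, labels, weight=class_weights)
    mse = F.mse_loss(recon, recon_target)
    con = F.cross_entropy(sim, labels)  # image-to-text contrastive term
    return w_ce * ce + w_mse * mse + w_con * con

# Smoke test with random features: 8 images, 39 classes (as in PlantVillage).
B, C, d = 8, 39, 512
head = SAFusionHead(d, C)
img = F.normalize(torch.randn(B, d), dim=-1)
txt_bank = F.normalize(torch.randn(C, d), dim=-1)
labels = torch.randint(0, C, (B,))
logits, recon, recon_target = head(img, txt_bank)
sim = img @ txt_bank.t() / 0.07  # CLIP-style temperature
loss = composite_loss(logits, labels, recon, recon_target.detach(), sim,
                      class_weights=torch.ones(C))
loss.backward()
```

Detaching the reconstruction target prevents the decoder from trivially shaping its own target; whether the original method does this is an open detail of this sketch.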
[Results and Discussions] Extensive experiments were conducted on the standard PlantVillage dataset, which includes 39 categories covering 14 crop species. The proposed SAFusion-CLIP model achieved a classification accuracy of 99.67%, with precision, recall, and F1-Score all exceeding 99.00%. Comparative analysis demonstrated that the proposed method significantly outperformed mainstream single-modal and multimodal baselines: ResNet50 (96.51%), Swin Transformer (97.48%), and baseline CLIP (98.23%). Visualization analysis using Gradient-weighted Class Activation Mapping (Grad-CAM) indicated that, unlike single-modal models, which were susceptible to background noise or non-specific physical damage, the SAFusion-CLIP model focused more precisely on core lesion areas, effectively suppressing background interference. Furthermore, ablation studies confirmed the effectiveness of the proposed modules, showing that combining the self-supervised architecture with the adaptive fusion mechanism improved accuracy by 2.46 percentage points over the baseline, validating the necessity of deep feature interaction and reconstruction tasks.

[Conclusions] By fusing textual semantics with visual features, the SAFusion-CLIP method effectively overcomes the limitations of single-modal recognition. The adaptive fusion mechanism ensures deep interaction between the modalities, while the self-supervised reconstruction task significantly enhances the robustness of the feature representation. The experimental results verify that this data-driven approach significantly improves accuracy and generalization in fine-grained crop disease classification, providing a new and effective solution for precision agricultural prevention and control.
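For reference, the kind of Grad-CAM inspection mentioned in the results above can be reproduced in a few lines of plain PyTorch. The ResNet50 backbone and random input below are stand-ins, not the paper's model or data.

```python
# Generic Grad-CAM sketch (illustrative; not the authors' visualization code).
import torch
import torch.nn.functional as F
from torchvision.models import resnet50

model = resnet50(weights=None).eval()   # stand-in backbone, randomly initialized
acts, grads = {}, {}
target_layer = model.layer4             # last conv stage, a common Grad-CAM choice
target_layer.register_forward_hook(lambda m, i, o: acts.update(v=o))
target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(v=go[0]))

x = torch.randn(1, 3, 224, 224)         # placeholder for a preprocessed leaf image
score = model(x)[0].max()               # logit of the top-scoring class
model.zero_grad()
score.backward()

weights = grads["v"].mean(dim=(2, 3), keepdim=True)   # GAP of gradients per channel
cam = F.relu((weights * acts["v"]).sum(dim=1))        # weighted channel sum + ReLU
cam = F.interpolate(cam.unsqueeze(1), size=x.shape[-2:],
                    mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # heatmap in [0, 1]
```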

Key words: disease and pest recognition, multimodal fusion, adaptive feature fusion, self-supervised learning

CLC Number: