Welcome to Smart Agriculture

Content of Topic: Intelligent Identification and Diagnosis of Agricultural Diseases and Pests in our journal

    Lightweight Detection Method for Pepper Leaf Diseases and Pests Based on Improved YOLOv12s
    YAO Xiaotong, QU Shaoye
    Smart Agriculture    2026, 8 (1): 1-14.   DOI: 10.12133/j.smartag.SA202506005

    [Objective] Pepper cultivation frequently faces challenges from diseases and pests, and early detection is critical for reducing yield losses. However, existing detection models often suffer from limitations such as insufficient feature extraction for subtle lesions, loss of edge information due to complex backgrounds, and high missed detection rates for small lesions. To address these issues, a lightweight detection algorithm, YOLO-MDFR (You Only Look Once), was proposed based on an enhanced YOLOv12s, specifically designed for accurate identification of pepper leaf diseases and pests in complex natural environments. [Methods] The dataset was established in the primary pepper cultivation zone of Gangu County, Tianshui City, Gansu Province. The cultivated variety was the locally dominant Capsicum annuum L. var. conoides (Mill.). Data collection was conducted from March 15 to May 20, 2024. The collected samples included four categories of pepper leaves: healthy leaves, leaves damaged by thrips, leaves infected with tobacco mosaic virus exhibiting yellowing symptoms, and leaves affected by bacterial leaf spot. First, the original YOLOv12s backbone was replaced with an improved MobileNetV4 architecture to enhance lightweight performance while preserving feature extraction capability. Specifically, the original 5×5 standard convolutions in the bottleneck layers of MobileNetV4 were substituted with two sequential 3×3 depthwise separable convolutions. This design was based on the principle that two stacked 3×3 convolutions achieve a receptive field equivalent to the 5×5 coverage while reducing the parameter count; depthwise separable convolutions further decompose spatial and channel convolution, minimizing redundant computation. Second, a novel dimensional frequency reciprocal attention mixing transformer (D-F-Ramit) module was introduced to enhance sensitivity to lesion boundaries and fine-grained textures.
The module first converted feature maps from the spatial domain to the frequency domain using discrete cosine transform (DCT), capturing high-frequency components often lost in spatial-only attention. It then integrated three parallel branches: channel attention, spatial attention, and frequency-domain attention. Finally, a residual aggregation gate-controlled convolution (RAGConv) module was developed for the neck network. This module included a residual aggregation path to collect multi-layer feature information and a gate control unit that dynamically weighted feature components based on their relevance. The residual structure provided a direct gradient propagation path, alleviating gradient vanishing during backpropagation and ensuring efficient information transfer during feature fusion. A systematic experimental framework was established to comprehensively evaluate model performance: (1) Ablation studies were conducted using a controlled variable approach to verify the individual contributions of the improved MobileNetV4, D-F-Ramit, and RAGConv modules; (2) Lesion scale sensitivity analysis assessed detection performance across different lesion sizes, with emphasis on small-spot recognition; (3) Resolution impact analysis evaluated five common input resolutions (320×320–736×736) to explore the trade-offs among accuracy, speed, and computational efficiency; and (4) Embedded deployment validation involved model quantization and implementation on the Rockchip RK3588 platform to measure inference speed and power consumption on edge devices. [Results and Discussions] The proposed YOLO-MDFR achieved an mAP@0.5 of 95.6% on this dataset. Compared to YOLOv12s, it improved accuracy by 2.0%, reduced parameters by 61.5%, and lowered computational complexity by 68.5%. Real-time testing showed 43.4 f/s on an NVIDIA RTX 4060 GPU (CUDA 12.2) and 22.8 f/s on a Rockchip RK3588 embedded platform with only 3.5 W power consumption—suitable for battery-powered field devices. 
Lesion-scale analysis revealed an accuracy of 33.5% for lesions smaller than 16×16 pixels, which are critical for early detection. Confusion matrix evaluation showed reduced misclassification: the confusion rate between bacterial leaf spot and thrips damage fell from 5.8% to 2.1%, and that between tobacco mosaic virus and healthy leaves from 3.2% to 1.5%, yielding an overall misclassification rate of 2.3%. Experiments across varying input resolutions revealed a clear performance–resolution trade-off. As resolution increased from 320×320 to 736×736, mAP rose from 89.5% to 96.2%, showing diminishing returns beyond 512×512. Concurrently, computational cost grew roughly quadratically, reducing inference speed from 65.2 f/s to 35.1 f/s. [Conclusions] This study presents YOLO-MDFR, a lightweight detection model for identifying pepper leaf diseases and pests under complex natural conditions. By integrating an improved MobileNetV4 backbone, a multi-dimensional frequency reciprocal attention mixing transformer (D-F-Ramit), and a residual aggregation gate-controlled convolution (RAGConv) module, YOLO-MDFR outperforms mainstream detection models in both accuracy and efficiency. Systematic deployment experiments yielded optimized configurations for different application scenarios. Despite its strong performance, the model shows limitations in robustness under extreme lighting, generalization to emerging diseases, and detection of small targets under occlusion. Future work will address these issues through ambient light data fusion, domain adaptation with semi-supervised learning, and binocular vision integration.
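The backbone substitution described above trades each 5×5 standard convolution for two stacked 3×3 depthwise separable convolutions, which cover the same 5×5 receptive field with far fewer weights. A quick parameter count illustrates the saving; this is a minimal sketch with a hypothetical channel width, not the paper's exact layer configuration:

```python
def std_conv_params(k, c_in, c_out):
    """Parameters of a standard k x k convolution (bias omitted)."""
    return k * k * c_in * c_out

def dw_separable_params(k, c_in, c_out):
    """Depthwise k x k convolution (one filter per input channel)
    followed by a 1 x 1 pointwise convolution."""
    return k * k * c_in + c_in * c_out

c = 128  # hypothetical channel width, for illustration only
one_5x5 = std_conv_params(5, c, c)
two_dw_3x3 = 2 * dw_separable_params(3, c, c)  # two stacked 3x3 = 5x5 receptive field

print(one_5x5)     # 409600
print(two_dw_3x3)  # 35072, roughly a 12x reduction at this width
```

The ratio grows with channel width, which is why the substitution pays off most in the deeper bottleneck layers.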

    Tea Leaf Disease Diagnosis Based on Improved Lightweight U-Net3+
    HU Yumeng, GUAN Feifan, XIE Dongchen, MA Ping, YU Youben, ZHOU Jie, NIE Yanming, HUANG Lüwen
    Smart Agriculture    2026, 8 (1): 15-27.   DOI: 10.12133/j.smartag.SA202507010

    [Objective] Leaf diseases significantly affect both the yield and quality of tea throughout the year. To address the inadequate segmentation finesse of current tea spot segmentation models, a novel model for diagnosing the severity of tea leaf spots, designated MDC-U-Net3+, was proposed in this research to enhance segmentation accuracy on the base framework of U-Net3+. [Methods] A multi-scale feature fusion module (MSFFM) was incorporated into the backbone network of U-Net3+ to obtain feature information across multiple receptive fields of diseased spots, thereby reducing the loss of features within the encoder. Dual multi-scale attention (DMSA) was incorporated into the skip connections to mitigate segmentation boundary ambiguity; this integration facilitates the comprehensive fusion of fine-grained and coarse-grained semantic information at full scale. Furthermore, the segmented mask image was post-processed with conditional random fields (CRF) to further optimize the segmentation results. [Results and Discussions] The improved MDC-U-Net3+ model achieved a mean pixel accuracy (mPA) of 94.92% and a mean intersection over union (mIoU) of 90.9%. Compared to U-Net3+, the MDC-U-Net3+ model showed improvements of 1.85 and 2.12 percentage points in mPA and mIoU, respectively. These results illustrated more effective segmentation performance than that achieved by other classical semantic segmentation models. [Conclusions] The methodology presented herein could provide data support for automated disease detection and precise medication, consequently reducing the losses associated with tea diseases.
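The two reported metrics, mPA and mIoU, both derive from a per-class pixel confusion matrix. As a hedged aside, they can be computed as below; the toy 2-class matrix is illustrative and not taken from the paper:

```python
import numpy as np

def mpa_miou(cm):
    """Mean pixel accuracy and mean IoU from a square confusion matrix
    (rows = ground-truth class, columns = predicted class)."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    per_class_acc = tp / cm.sum(axis=1)                    # recall per class
    iou = tp / (cm.sum(axis=1) + cm.sum(axis=0) - tp)      # TP / (TP + FP + FN)
    return per_class_acc.mean(), iou.mean()

# toy example: background vs lesion pixel counts
cm = [[90, 10],
      [5, 95]]
mpa, miou = mpa_miou(cm)
```

For segmentation tasks mIoU is the stricter of the two, since false positives lower IoU but not pixel accuracy; that is why both are usually reported together, as in this abstract.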

    Rice Disease Identification Method Based on Improved MobileViT Model and System Development
    LIU Xiaojun, WU Qian, SUN Chuanliang, QI Chao, ZHANG Gufeng, LEI Tianjie, LIANG Wanjie
    Smart Agriculture    2026, 8 (1): 28-39.   DOI: 10.12133/j.smartag.SA202507043

    [Objective] Under abiotic stress conditions such as flooding and high temperatures, rice plants become fragile and susceptible to disease infection, making accurate diagnosis and scientific prevention and control strategies crucial. However, under natural field conditions the identification of rice diseases is challenging: complex backgrounds, illumination changes, and occlusion make it extremely difficult to comprehensively obtain disease information, significantly increasing the difficulty of identification. This study aims to develop an efficient rice disease recognition model by integrating the efficient channel attention (ECA) mechanism with the MobileViT model, enhancing the accuracy of rice disease identification in the field. Additionally, the rice disease knowledge graph was combined to achieve precise diagnosis and generate scientifically grounded control prescriptions for effective disease management. [Methods] A total of 1 304 raw images of rice diseases were collected with mobile phones and cameras at different rice disease investigation and long-term monitoring points in Jiangsu Province, at different periods of time. A further 167 disease images from the rice leaf disease image samples dataset were used to supplement the dataset. The raw images were accurately classified and preprocessed under the guidance of plant protection experts. A dataset containing 1 471 original images was constructed, covering seven types of rice diseases: bacterial leaf blight, false smut, leaf blast, bakanae disease, heart rot, grain discoloration, and panicle blast. The dataset was partitioned into training, validation, and test sets following a 7:1.5:1.5 ratio.
Data augmentation techniques were applied exclusively to the training and validation sets to enhance sample diversity, while the test set remained unaugmented to preserve its independence for unbiased model evaluation. Post-augmentation, the total image count increased to 7 735. A novel rice disease recognition model was established by integrating the efficient channel attention (ECA) module into the MobileViT model. The recognition model architecture was optimized by improving convolutional structures, reconstructing Transformer encoding blocks, and replacing the activation function with SiLU. To verify the performance of the model, cross-validation and ablation experiments were conducted. After a highly accurate recognition model was established, it was combined with the rice disease knowledge graph to achieve accurate diagnosis of rice diseases and generate scientific prevention and control strategies. Finally, an intelligent rice disease diagnostic system was developed using the Flask framework and cloud computing technologies. [Results and Discussions] The results of the ablation study revealed that the model combining convolutional layer optimization, Transformer block reconstruction, and the integration of the ECA module achieved outstanding performance. The overall precision, F1-Score, and recall rate reached 97.27%, 97.32%, and 97.46%, respectively. In terms of accuracy, the improved model reached 97.25%, an improvement of 2.3 percentage points over the original model (94.95%). To further verify the effectiveness of the improved model, mainstream models such as Swin Transformer, TinyVit, and ConvNeXt were compared with the proposed model. The experimental results showed that the improved model outperformed the suboptimal model (TinyVit) by 0.92, 1.43, 0.95, and 1.32 percentage points in overall accuracy, precision, F1-Score, and recall rate, respectively.
Moreover, the improved model showed significant advantages in floating-point operations, model size, and parameter count, with a parameter count of only 6.02 MB, making it more suitable for deployment on hardware-constrained devices. Analysis of the confusion matrix and heatmap visualizations revealed that the enhanced model achieved recognition accuracy improvements of 0.6, 0.3, 0.3, and 0.6 percentage points for bacterial leaf blight, heart rot, grain discoloration, and panicle blast, respectively. The integrated system, combining this model with the knowledge graph, demonstrated significantly enhanced accuracy in disease identification and diagnosis. Meanwhile, disease prevention and control strategies were generated to guide rice disease prevention and control. During field deployment, the rice disease diagnosis system achieved an accuracy rate as high as 98%, with an average response time of 181 ms, demonstrating reliable real-time performance and stability. [Conclusions] By integrating the ECA module and reconstructing Transformer encoding blocks, the MobileViT model achieved noticeable improvements in precision, recall, and F1-Score, while effectively reducing computational costs, leading to more efficient recognition of rice diseases in complex field environments. Application of the intelligent rice disease diagnosis system showed that it could deliver accurate diagnosis results and generate prevention and control strategies to guide rice disease management. This method could effectively improve the efficiency of rice disease prevention and control, providing technical support for improving the quality, efficiency, digitization, and intelligence of rice production.
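For readers unfamiliar with the ECA module integrated here, a minimal NumPy sketch of its channel attention follows. The adaptive kernel-size rule matches the original ECA formulation, but the learned 1D convolution kernel is replaced by a uniform averaging kernel purely for illustration; this is not the paper's implementation:

```python
import math
import numpy as np

def eca_kernel_size(channels, gamma=2, b=1):
    """Adaptive 1D kernel size used by ECA: nearest odd to log2(C)/gamma + b/gamma."""
    t = int(abs(math.log2(channels) / gamma + b / gamma))
    return t if t % 2 else t + 1

def eca_attention(x, k=None):
    """x: feature map of shape (C, H, W); returns the channel-reweighted map."""
    c = x.shape[0]
    k = k or eca_kernel_size(c)
    pooled = x.mean(axis=(1, 2))                 # global average pooling -> (C,)
    padded = np.pad(pooled, k // 2, mode="edge")
    w = np.ones(k) / k                           # stand-in for the learned 1D conv kernel
    conv = np.convolve(padded, w, mode="valid")  # local cross-channel interaction
    att = 1.0 / (1.0 + np.exp(-conv))            # sigmoid gate per channel
    return x * att[:, None, None]

out = eca_attention(np.ones((8, 4, 4)))
```

Unlike squeeze-and-excitation blocks, ECA avoids channel dimensionality reduction, which is why it adds almost no parameters to MobileViT.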

    Low-rank Adaptation Method for Fine-tuning Plant Disease Recognition Models
    HUANG Jinqing, YE Jin, HU Huilin, YANG Jihui, LAN Wei, ZHANG Yanqing
    Smart Agriculture    2026, 8 (1): 40-51.   DOI: 10.12133/j.smartag.SA202504003

    [Objective] When deep learning is applied to plant disease recognition tasks, model fine-tuning faces significant challenges, including limited computational resources and high parameter update overhead. Although traditional low-rank adaptation (LoRA) methods effectively reduce parameter overhead, their strategy of assigning a uniform, fixed rank to all layers often overlooks the varying importance of different layers. This approach may still lead to constrained optimization in critical layers or wasted resources in less significant ones. To address this limitation, a dynamic rank allocation (DRA) algorithm is proposed in this research. The DRA algorithm is designed to evaluate and adjust the parameter resources required by each layer during training, enhancing the accuracy of plant disease classification models while balancing computational resources more efficiently. [Methods] Two public datasets, the Wheat Plant Diseases Dataset and the Plants Disease Dataset, were utilized in the experiments. The Wheat Plant Diseases Dataset comprised 13 104 images covering 15 types of wheat diseases such as black rust and fusarium head blight, while the Plants Disease Dataset included 37 505 images of 26 types of plant diseases such as algal leaf spot, corn rust, and bacterial spot of tomato. These datasets were captured under varied lighting, different backgrounds, diverse angles, and at various stages of plant growth. A cross-layer feature similarity metric based on centered kernel alignment (CKA) was introduced to quantify the representational correlation between different layers. Concurrently, a correction factor was constructed based on gradient information and activation intensity to measure the direct impact of each layer on the loss function. These two metrics were then fused using a weighted harmonic mean to generate a comprehensive importance score, which was subsequently used for the initial rank allocation.
Furthermore, considering the effect of feature representation changes during training, a stability-triggered adaptive rank update strategy, termed rank re-allocation (RRA), was proposed. This strategy monitored the average parameter change of the low-rank adapters during training to determine the convergence state. When this change fell below a specific threshold, the low-rank matrices were merged into the original weights, and the rank allocation table was then re-calculated and updated. This process ensured that more resources were allocated to critical layers, thereby achieving an optimized allocation of parameter resources across layers. [Results and Discussions] Tests on four models (AlexNet, MobileNetV2, RegNetY, and ConvNeXt) indicated that, compared to full-parameter fine-tuning, the proposed method reduced resource consumption to 0.42%, 2.46%, 3.56%, and 1.25%, respectively, while maintaining comparable average accuracy. The RRA strategy demonstrated continuous parameter optimization throughout training: on the ConvNeXt model, the trainable parameters on the Plants Disease Dataset were progressively reduced from 18.34 M to 9.26 M, a reduction of nearly 50%. In comparison with the standard LoRA method (R=16), the method reduced accuracy by 0.38, 0.40, and 0.05 percentage points on the Wheat Plant Diseases Dataset for AlexNet, MobileNetV2, and RegNetY, respectively, while resource consumption was reduced by 59.3%, 87.4%, and 50.5%. Robustness was tested by applying perturbations to the test set, including Gaussian noise, random cropping, color jitter, and random rotation. The results showed that the model was most affected by color jitter and random rotation on the Plants Disease Dataset, with accuracy decreasing by 6.02 and 5.11 percentage points, respectively.
On the wheat plant diseases dataset, the model was more sensitive to random cropping and random rotation, with accuracy decreasing by 4.33 and 4.40 percentage points, respectively; the overall performance degradation remained within an acceptable range. When compared to other advanced low-rank methods such as AdaLoRA and DyLoRA under the same parameter budget, the DRA method exhibited higher accuracy. On the RegNetY model, the DRA method achieved an accuracy of 90.96% on the Plants Disease Dataset, which was 0.55 percentage points higher than AdaLoRA and 0.94 percentage points higher than DyLoRA. In terms of training efficiency on the Plants Disease Dataset, the DRA method required 43.5 minutes to reach its peak validation accuracy of 89.84%, whereas AdaLoRA required 52.3 minutes, representing a training time increase of approximately 20.23%. Regarding inference flexibility, the DyLoRA method was designed to generate a universal model capable of adapting to multiple rank configurations after a single training run, allowing for dynamic rank switching during inference based on hardware or latency requirements. The DRA method, however, did not possess this inference-time flexibility. It was focused on converging to a single, high-performance rank configuration for a specific task during the training phase. [Conclusions] The low-rank adaptive fine-tuning method proposed in this research significantly reduced the number of model training parameters while ensuring plant disease recognition accuracy. Compared to traditional fixed-rank LoRA and other advanced low-rank optimization methods, it demonstrated distinct advantages, providing an effective pathway for efficient model deployment on resource-constrained devices.
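DRA builds on the standard LoRA update, in which a frozen weight W is augmented by a trainable low-rank product AB scaled by alpha/r, and RRA periodically merges the adapter back into W. A minimal NumPy sketch of that forward pass and merge step; the dimensions, rank, and scaling here are illustrative assumptions, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 64, 64, 8, 16    # hypothetical layer size, rank, and scale

W = rng.normal(size=(d_in, d_out))       # frozen pretrained weight
A = rng.normal(size=(d_in, r)) * 0.01    # trainable down-projection
B = np.zeros((r, d_out))                 # zero-initialized, so training starts at W

x = rng.normal(size=(1, d_in))
y = x @ W + (x @ A @ B) * (alpha / r)    # LoRA forward: frozen path + low-rank path

# At a re-allocation step (as in RRA), the adapter can be merged back into W:
W_merged = W + (A @ B) * (alpha / r)

full_params = d_in * d_out               # weights trained under full fine-tuning
lora_params = r * (d_in + d_out)         # weights trained under LoRA at rank r
```

Because the adapter cost is r*(d_in + d_out) rather than d_in*d_out, lowering the rank of an unimportant layer frees budget that DRA can hand to a critical one.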

    Intelligent Q&A Method for Crop Diseases and Pests Using LLM Augmented by Adaptive Hybrid Retrieval
    YANG Jun, YANG Wanxia, YANG Sen, HE Liang, ZHANG Di
    Smart Agriculture    2026, 8 (1): 52-61.   DOI: 10.12133/j.smartag.SA202506026

    [Objective] Extracting valuable knowledge from vast amounts of dispersed, heterogeneous, and unstructured agricultural big data, correlating and structuring it, and using it to augment large models into intelligent question-answering systems enables the effective delivery of services across agriculture. This approach can rapidly advance the scientific, precision-based development of agricultural production. Existing agricultural Q&A systems lack sufficient semantic understanding of complex symptoms, while general-purpose large language models (LLMs) produce factual hallucinations due to incomplete training data coverage. This research aims to address the insufficient scale and low quality of knowledge bases in the agricultural domain. [Methods] First, disease and pest data were collected for five typical crops: wheat, rice, corn, potatoes, and cotton. Using manual verification, outliers were precisely identified and removed, ultimately yielding 87 901 unstructured data entries. Then, a few-shot learning model was employed to extract the entities defined in the pattern layer, and these entities were aligned using BERT semantic vectors and LLM prompt engineering, ultimately yielding a triplet knowledge base of 916 239 entries for knowledge retrieval. A knowledge retrieval-augmented LLM approach for intelligent Q&A on crop diseases and pests, the adaptive hybrid retrieval-augmented generation (AHR-RAG) approach, was proposed. First, an overlapping mechanism was introduced during fixed-length segmentation to mitigate semantic fragmentation, while vector semantic similarity was used to match highly related text blocks with the topic for optimization and storage. Then, single-hop and multi-hop retrieval were designed according to the complexity of the question.
Single-hop retrieval used the BM25 algorithm to match information extracted from the query with document content in the Elasticsearch index, feeding the results into the LLM to enhance answer generation. Multi-hop retrieval first converted user queries into structured conditions and semantic vector representations; results retrieved from the different knowledge bases were then fused using reciprocal rank fusion (RRF) and fed into the LLM. [Results and Discussions] The proposed method was experimentally compared with multiple baseline approaches across different query types and query complexities. The results demonstrated that the proposed method achieved accuracy and F1 improvements of 0.193 and 0.170, respectively, on the Qwen1.5-7B-Chat model. Compared to the improved methods Self-RAG and Adaptive-RAG, AHR-RAG maintained low response times while achieving F1 improvements of 0.05 and 0.021, respectively, with an accuracy as high as 0.896. For multi-type question-answering tasks, compared to the Naive-RAG method that relied solely on prior knowledge, the AHR-RAG approach achieved accuracy improvements of 0.231, 0.123, and 0.157 for comparison, judgment, and selection query types, respectively. For parsing complex semantics, AHR-RAG also demonstrated significant advantages. In single-hop queries, its accuracy reached 0.921, a 0.029 improvement over Adaptive-RAG. In multi-hop query scenarios, its accuracy reached 0.748, gains of 0.082 and 0.059 over Self-RAG and Adaptive-RAG, respectively. In retrieval-augmented generation, AHR-RAG achieved a 0.013 increase in accuracy and a 0.009 improvement in F1 by optimizing prompt strategies, compared to directly feeding retrieval results to the model for output. [Conclusions] The proposed method demonstrates strong adaptability to diverse query types and excels at reasoning over complex queries such as multi-hop searches.
It delivers significant advantages in answer generation accuracy, relevance, and comprehensiveness, producing responses with enhanced logical coherence and richer content. Future work will explore the integration of multimodal knowledge bases.
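The reciprocal rank fusion step in the multi-hop path combines rankings from the structured and vector retrievers by scoring each document as the sum of 1/(k + rank) over the lists that contain it. A small self-contained sketch; the document IDs and the common k = 60 default are illustrative, not taken from the paper:

```python
def rrf(rankings, k=60):
    """Reciprocal rank fusion: score(d) = sum over lists of 1 / (k + rank(d)).
    rankings: ranked doc-id lists, best first. Returns a fused ranking."""
    scores = {}
    for ranked in rankings:
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25 = ["d3", "d1", "d2"]   # lexical (BM25 / Elasticsearch) ranking
vec = ["d1", "d4", "d3"]    # semantic vector ranking
fused = rrf([bm25, vec])    # d1 wins: ranked high in both lists
```

RRF needs no score normalization across retrievers, only ranks, which is why it is a convenient fusion choice when BM25 scores and vector similarities live on incompatible scales.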

    Multi-Scale Tea Leaf Disease Detection Method Based on Improved YOLOv11n
    XIAO Ruihong, TAN Lixin, WANG Rifeng, SONG Min, HU Chengxi
    Smart Agriculture    2026, 8 (1): 62-71.   DOI: 10.12133/j.smartag.SA202509014

    [Objective] Preventing and containing leaf diseases is a critical component of tea production, and accurate identification and localization of symptoms are essential for modern, automated plantation management. Field inspection in tea gardens poses distinctive challenges for vision-based detection algorithms: targets appear at widely varying scales and morphologies against complex backgrounds and at unfixed acquisition distances, which easily mislead detectors. Models trained on standardized datasets with uniform distance and background often underperform, leading to false alarms and missed detections. To support method development under realistic constraints, YOLO-SADMFA (You Only Look Once-Switchable Atrous Dynamic Multi-scale Frequency-aware Adaptive), a detector based on the YOLOv11n backbone, was proposed. The architecture aims to preserve fine details during repeated re-sampling (down- and up-sampling), strengthen the modeling of lesions at varying scales, and refine multi-scale feature fusion. [Methods] The proposed architecture incorporated additional convolutional, feature extraction, upsampling, and detection head stages to better handle multi-scale representations, and introduced a DMF-Upsample (Dynamic Multi-scale Frequency-aware Upsample) module that performed upsampling through multi-scale feature analysis and dynamic frequency adjustment fusion. This module enabled efficient multi-scale feature integration while effectively mitigating information loss during up- and down-sampling. Concretely, DMF-Upsample analyzed multi-frequency responses from adjacent pyramid levels and fused them with dynamically learned frequency-selective weights, which preserved high-frequency lesion boundaries and textures while retaining low-frequency contextual structure such as leaf contours and global shading.
A lightweight gating mechanism estimated per-location and per-channel coefficients to regulate the contribution of different frequency bands, and a residual bypass preserved identity information to further reduce the aliasing and oversmoothing introduced by repeated resampling. Furthermore, the baseline C3k2 block was replaced with a switchable atrous convolution (SAConv) module, which enhanced multi-scale feature capture by combining outputs from different dilation rates and incorporated a weight-locking mechanism to improve model stability and performance. In practice, SAConv aggregated parallel atrous branches at multiple dilation factors through learned coefficients under weight locking, which expanded the effective receptive field without sacrificing spatial resolution and suppressed gridding artifacts, while incurring only modest parameter overhead. Lastly, an adaptive spatial feature fusion (ASFF) mechanism was integrated into the detection head, forming an ASFF-Head that learned spatially varying fusion weights across feature scales, effectively filtered conflicting information, and strengthened the model's robustness and overall detection accuracy. Together, these components formed a deeper yet efficient multi-scale pathway suited to complex field scenes. [Results and Discussions] Compared with the original YOLOv11n model, YOLO-SADMFA improved precision, recall, and mAP by 4.4, 8.4, and 3.7 percentage points, respectively, indicating more reliable identification and localization across diverse field scenes. The detector was particularly effective for multi-scale targets where the lesion area occupied approximately 10%-65% of the image, reflecting the variability introduced by unfixed acquisition distance during tea garden patrols.
Under low illumination and in complex backgrounds with occlusions and clutter, it maintained stable performance, reduced both missed detections and false alarms, and effectively distinguished disease categories with similar morphology and color. On edge computing devices, it sustained about 161 f/s, which met real-time requirements for mobile inspection robots and portable systems. These outcomes demonstrated strengthened robustness to background interference and improved sensitivity at extreme scales, which was consistent with practical demands where the acquisition distance was not fixed. From an ablation perspective, DMF-Upsample preserved high-frequency lesion boundaries while retaining low-frequency structural context after resampling, SAConv expanded receptive fields through multi-dilation aggregation under a weight-locking mechanism, and the ASFF-Head mitigated conflicts among feature pyramids. Their combination yielded cumulative gains in stability and accuracy. Qualitative analyses further supported the quantitative results: Boundary localization improved for small, speckled lesions, large blotches were captured with fewer spurious edges, and distractors such as veins, shadows, and soil textures were less frequently misclassified, confirming the benefits of dynamic multi-scale frequency-aware fusion and adaptive spatial weighting in real field conditions. [Conclusions] The proposed YOLO-SADMFA effectively addressed the multi-scale disease detection challenge in complex tea garden environments, where acquisition distance was not fixed, lesion morphology and color were diverse, and cluttered backgrounds easily caused misjudgments and omissions. It significantly improved detection accuracy and robustness relative to the original YOLOv11n model across a wide range of target scales, and it maintained stable performance under low illumination and complex backgrounds typical of field inspections. 
It provided reliable technical support for automated tea leaf disease inspection systems by enabling accurate localization and identification of lesions in real operating conditions and by sustaining real-time inference on edge devices suitable for patrol-style deployment.
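The ASFF-Head described above learns per-location fusion weights that are normalized across pyramid levels, so a level carrying conflicting information can be down-weighted pixel by pixel. A minimal NumPy sketch of that fusion step, following the standard ASFF formulation; the random arrays merely stand in for real feature maps and learned logits:

```python
import numpy as np

def softmax(z, axis=0):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def asff_fuse(levels, logits):
    """levels: list of L feature maps (C, H, W), already resized to one scale.
    logits: array (L, H, W) of learned per-location fusion scores.
    Softmax across levels makes the weights sum to 1 at every pixel."""
    w = softmax(logits, axis=0)          # (L, H, W)
    stacked = np.stack(levels)           # (L, C, H, W)
    return (stacked * w[:, None]).sum(axis=0)

rng = np.random.default_rng(1)
levels = [rng.normal(size=(4, 8, 8)) for _ in range(3)]
logits = rng.normal(size=(3, 8, 8))
fused = asff_fuse(levels, logits)
```

With all logits equal, the fusion degenerates to a plain average of the levels; training moves the weights away from that uniform baseline wherever one scale is more informative.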

    Self-Supervised Adaptive Multimodal Feature Fusion Recognition of Crop Diseases and Pests
    YE Penglin, MIN Chao, GOU Liangjie, WANG Pengcheng, HUANG Xiaopeng, LI Xin, MENG Yuping
    Smart Agriculture    2026, 8 (1): 72-84.   DOI: 10.12133/j.smartag.SA202509032

    [Objective] Crop diseases and pests are significant factors restricting global agricultural production. Traditional intelligent recognition technologies predominantly rely on single-modal image data processed by convolutional neural networks (CNNs) or Transformers. However, in complex natural environments, these methods often suffer from insufficient information utilization and limited robustness due to the lack of semantic guidance. Although emerging multimodal approaches like CLIP have introduced textual information, they typically rely on shallow feature alignment in the embedding space without achieving deep semantic interaction or effective feature fusion. Furthermore, the asymmetry between the quantity of image samples and text labels during training poses a challenge for effective cross-modal learning. In this study, a self-supervised adaptive multimodal feature fusion recognition (SAFusion-CLIP) method is proposed, aiming to significantly enhance classification accuracy and model generalization in fine-grained diseases and pests recognition tasks. [Methods] A comprehensive recognition framework was constructed, integrating four key components to achieve deep fusion of visual and textual features. First, prompt engineering was conducted by utilizing large language models (LLMs) combined with authoritative agricultural guides to transform simple category labels into fine-grained pathological semantic descriptions. These descriptions encapsulated morphological details, color gradients, and texture features, with quality verified by BERTScore and ROUGE-L metrics. Second, a cross-modal balanced alignment module was designed to resolve the problem of sample asymmetry between image batches and fixed text labels. This module employed a dot-product attention mechanism to calculate the correlation between image and text projections, applying Softmax normalization to dynamically align image features with their corresponding textual representations. 
Third, an adaptive fusion mechanism was employed to achieve deep semantic interaction. A gating unit based on the Sigmoid function was designed to calculate a gate value that dynamically allocated weights to image and text features, allowing the model to adaptively integrate complementary information from both modalities. Finally, a self-supervised feature reconstruction task was introduced to enhance the robustness of the feature representation. A simple decoder was utilized to reconstruct the original image and text embeddings from the fused features, and the model was optimized using a composite objective function combining image-text contrastive loss, mean squared error reconstruction loss, and weighted cross-entropy classification loss. [Results and Discussions] Extensive experiments were conducted on the standard PlantVillage dataset, which includes 39 categories covering 14 crop species. The proposed SAFusion-CLIP model achieved a classification accuracy of 99.67%, with precision, recall, and F1-Score all exceeding 99.00%. Comparative analysis demonstrated that the proposed method significantly outperformed mainstream single-modal and baseline multimodal models, including ResNet50 (96.51%), Swin-Transformer (97.48%), and baseline CLIP (98.23%). Visualization analysis using gradient-weighted class activation mapping (Grad-CAM) indicated that, unlike single-modal models that were susceptible to background noise or non-specific physical damage, the SAFusion-CLIP model focused more precisely on core lesion areas, effectively suppressing background interference. Furthermore, ablation studies confirmed the effectiveness of the proposed modules, showing that the combination of the self-supervised architecture and the adaptive fusion mechanism yielded a 2.46 percentage point accuracy improvement over the baseline, validating the necessity of deep feature interaction and reconstruction tasks.
[Conclusions] By fusing textual semantics with visual features, the SAFusion-CLIP method effectively overcame the limitations of single-modal recognition. The adaptive fusion mechanism ensured deep interaction between modalities, while the self-supervised reconstruction task significantly enhanced the robustness of feature representation. The experimental results verified that this data-driven approach significantly improves accuracy and generalization capabilities in fine-grained crop disease classification tasks, providing a new and effective solution for precision agricultural prevention and control.
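The sigmoid gating unit at the heart of the adaptive fusion can be sketched as follows. The gate projection Wg, bias bg, and embedding size are illustrative placeholders rather than the paper's parameters; the structure shown is the generic gated fusion the abstract describes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_fuse(img, txt, Wg, bg):
    """Gate g in (0, 1) decides, per dimension, how much of each modality to keep:
    fused = g * img + (1 - g) * txt, with g computed from both embeddings."""
    g = sigmoid(np.concatenate([img, txt]) @ Wg + bg)
    return g * img + (1.0 - g) * txt

d = 16
rng = np.random.default_rng(2)
img = rng.normal(size=d)                 # image embedding (e.g. from a vision tower)
txt = rng.normal(size=d)                 # text embedding of a pathological description
Wg = rng.normal(size=(2 * d, d)) * 0.1   # hypothetical learned gate projection
bg = np.zeros(d)
fused = gated_fuse(img, txt, Wg, bg)
```

With a zero gate projection the gate sits at 0.5 and the fusion is a plain average of the two modalities; training pushes it toward whichever modality is more reliable for a given feature dimension.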
