
Smart Agriculture ›› 2026, Vol. 8 ›› Issue (1): 1-14. DOI: 10.12133/j.smartag.SA202506005

• Special Topic: Intelligent Recognition and Diagnosis of Agricultural Diseases and Pests •

  • Corresponding author:
    YAO Xiaotong, PhD, Associate Professor; research interests include the Internet of Things and intelligent detection, robotics and vision control, and big data and artificial intelligence. E-mail:

Lightweight Detection Method for Pepper Leaf Diseases and Pests Based on Improved YOLOv12s

YAO Xiaotong, QU Shaoye

  1. School of Electronics and Information Engineering, Lanzhou Jiaotong University, Lanzhou 730070, Gansu, China
  • Received: 2025-06-03  Online: 2026-01-30
  • Foundation items: National Natural Science Foundation of China (51567014); Gansu Provincial Science and Technology Program (22JR5RA797)


Abstract:

[Objective] Pepper cultivation frequently faces challenges from diseases and pests, and early detection is critical for reducing yield losses. However, existing detection models often suffer from insufficient feature extraction for subtle lesions, loss of edge information against complex backgrounds, and high missed-detection rates for small lesions. To address these issues, YOLO-MDFR, a lightweight detection algorithm based on an enhanced YOLOv12s, was proposed for accurate identification of pepper leaf diseases and pests in complex natural environments. [Methods] The dataset was collected in the primary pepper cultivation zone of Gangu County, Tianshui City, Gansu Province, from March 15 to May 20, 2024. The cultivated variety was the locally dominant Capsicum annuum L. var. conoides (Mill.). The samples covered four categories of pepper leaves: healthy leaves, leaves damaged by thrips, leaves infected with tobacco mosaic virus exhibiting yellowing symptoms, and leaves affected by bacterial leaf spot. First, the original YOLOv12s backbone was replaced with an improved MobileNetV4 architecture to reduce model size while preserving feature extraction capability. Specifically, the original 5×5 convolutions in the bottleneck layers of MobileNetV4 were substituted with two sequential 3×3 depthwise separable convolutions. Two stacked 3×3 convolutions cover the same receptive field as a single 5×5 convolution with fewer weights, and the depthwise separable form further decomposes spatial and channel filtering, eliminating redundant computation. Second, a novel dimensional frequency reciprocal attention mixing transformer (D-F-Ramit) module was introduced to enhance sensitivity to lesion boundaries and fine-grained textures.
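The arithmetic behind this substitution can be checked directly. The sketch below is illustrative only, not from the paper: the channel width of 256 and the bias-free parameter counts are assumptions made for the example.

```python
def std_conv_params(k: int, c_in: int, c_out: int) -> int:
    """Weight count of a standard k×k convolution (bias ignored)."""
    return k * k * c_in * c_out

def dwsep_conv_params(k: int, c: int) -> int:
    """Depthwise separable conv: depthwise k×k (k*k*c) plus pointwise 1×1 (c*c)."""
    return k * k * c + c * c

def receptive_field(kernel_sizes) -> int:
    """Receptive field of stacked stride-1 convolutions."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf

C = 256  # assumed channel width, for illustration only
one_5x5_standard = std_conv_params(5, C, C)
two_3x3_dwsep = 2 * dwsep_conv_params(3, C)

print(receptive_field([5]), receptive_field([3, 3]))  # 5 5 -- same coverage
print(one_5x5_standard, two_3x3_dwsep)                # 1638400 135680
```

At this assumed width, the two stacked 3×3 depthwise separable convolutions match the 5×5 receptive field with roughly a twelfth of the weights, which is the trade-off the backbone redesign exploits.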
The module first converted feature maps from the spatial domain to the frequency domain using the discrete cosine transform (DCT), capturing high-frequency components that spatial-only attention tends to lose, and then integrated three parallel branches: channel attention, spatial attention, and frequency-domain attention. Finally, a residual aggregation gate-controlled convolution (RAGConv) module was developed for the neck network. It combined a residual aggregation path that collects multi-layer feature information with a gate-control unit that dynamically weights feature components by relevance; the residual structure provided a direct gradient propagation path, alleviating gradient vanishing during backpropagation and ensuring efficient information transfer during feature fusion. A systematic experimental framework was established to evaluate the model comprehensively: (1) ablation studies using a controlled-variable approach verified the individual contributions of the improved MobileNetV4, D-F-Ramit, and RAGConv modules; (2) lesion-scale sensitivity analysis assessed detection performance across lesion sizes, with emphasis on small-spot recognition; (3) resolution impact analysis evaluated five common input resolutions (320×320 to 736×736) to explore the trade-offs among accuracy, speed, and computational efficiency; and (4) embedded deployment validation involved model quantization and implementation on the Rockchip RK3588 platform to measure inference speed and power consumption on edge devices. [Results and Discussions] The proposed YOLO-MDFR achieved an mAP@0.5 of 95.6% on this dataset. Compared with YOLOv12s, it improved accuracy by 2.0 percentage points, reduced parameters by 61.5%, and lowered computational complexity by 68.5%. Real-time testing reached 43.4 f/s on an NVIDIA RTX 4060 GPU (CUDA 12.2) and 22.8 f/s on the Rockchip RK3588 embedded platform at only 3.5 W power consumption, making it suitable for battery-powered field devices.
Lesion-scale analysis showed 33.5% accuracy on lesions smaller than 16×16 pixels, the size range most critical for early detection. Confusion matrix analysis showed reduced misclassification: the bacterial leaf spot/thrips damage confusion rate fell from 5.8% to 2.1%, and the tobacco mosaic virus/healthy leaf rate from 3.2% to 1.5%, for an overall misclassification rate of 2.3%. Experiments across input resolutions revealed a clear performance–resolution trade-off: as resolution increased from 320×320 to 736×736, mAP rose from 89.5% to 96.2%, with diminishing returns beyond 512×512, while computational cost grew roughly quadratically, reducing inference speed from 65.2 f/s to 35.1 f/s. [Conclusions] This study presents YOLO-MDFR, a lightweight detection model for identifying pepper leaf diseases and pests under complex natural conditions. By integrating an improved MobileNetV4 backbone, the D-F-Ramit attention module, and the RAGConv feature-fusion module, YOLO-MDFR outperforms mainstream detection models in both accuracy and efficiency. Systematic deployment experiments yielded optimized configurations for different application scenarios. Despite its strong performance, the model shows limitations in robustness under extreme lighting, generalization to emerging diseases, and detection of small targets under occlusion. Future work will address these issues through ambient-light data fusion, domain adaptation with semi-supervised learning, and binocular vision integration.
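The roughly quadratic cost growth with resolution follows from convolutional FLOPs scaling with the number of spatial positions. A quick sanity check, under the simplifying assumptions of square inputs and an unchanged fully convolutional architecture:

```python
def relative_conv_cost(res_new: int, res_base: int) -> float:
    """Relative FLOPs of a fully convolutional network when the square
    input side changes: cost scales with the pixel count, i.e. side**2."""
    return (res_new / res_base) ** 2

# Moving from 320x320 to 736x736 multiplies per-image cost by (736/320)^2.
print(round(relative_conv_cost(736, 320), 2))  # 5.29
```

This roughly five-fold cost increase against a 6.7-point mAP gain (89.5% to 96.2%) is why mid-range resolutions around 512×512 offer the better accuracy-per-FLOP operating point reported above.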

Key words: YOLO, leaf disease and pest detection, MobileNetV4, lightweight deep learning model, attention mechanism
