
Smart Agriculture


A Lightweight Detection Method for Pepper Leaf Diseases Based on Improved YOLOv12m

YAO Xiaotong, QU Shaoye

  1. School of Electronics and Information Engineering, Lanzhou Jiaotong University, Lanzhou 730070, China
  • Received: 2025-06-03 Online: 2025-11-03
  • Foundation items: National Natural Science Foundation of China Project (51567014); Gansu Provincial Science and Technology Program (22JR5RA797)
  • Corresponding author:
    YAO Xiaotong, E-mail:

Abstract:

[Objective] YOLO-MDFR (You Only Look Once Version 12-MDFR), a lightweight detection algorithm based on an enhanced YOLOv12m, was proposed for the accurate identification of pepper leaf diseases and pests in complex natural environments. Pepper cultivation frequently faces challenges from diseases and pests, and early detection is critical for reducing yield losses. However, existing detection models often suffer from limitations such as insufficient feature extraction for subtle lesions, loss of edge information due to complex backgrounds, and high missed-detection rates for small lesions. To address these issues, the model was systematically improved in three key aspects: backbone lightweighting, attention-mechanism enhancement, and optimized multi-scale feature fusion, aiming to balance detection accuracy, model compactness, and real-time performance for field applications. [Methods] The dataset was established in the primary pepper cultivation zone of Gangu County, Tianshui City, Gansu Province. The cultivated variety was the locally dominant Capsicum annuum L. var. conoides (Mill.). Data collection was conducted from March 15 to May 20, 2024. The collected samples included four categories of pepper leaves: healthy leaves, leaves damaged by thrips, leaves infected with tobacco mosaic virus exhibiting yellowing symptoms, and leaves affected by bacterial leaf spot. First, the original YOLOv12m backbone was replaced with an improved MobileNetV4 architecture to enhance lightweight performance while preserving feature-extraction capability. Specifically, the original 5×5 standard convolutions in the bottleneck layers of MobileNetV4 were substituted with two sequential 3×3 depthwise separable convolutions.
This design was based on the principle that two stacked 3×3 convolutions achieve an equivalent receptive field (matching the 5×5 coverage) while reducing the parameter count; depthwise separable convolutions further decompose the spatial and channel-wise convolution, minimizing redundant computation. Second, a novel Dimensional Frequency Reciprocal Attention Mixing Transformer (D-F-Ramit) module was introduced to enhance sensitivity to lesion boundaries and fine-grained textures. The module first converted feature maps from the spatial domain to the frequency domain using the discrete cosine transform (DCT), capturing high-frequency components often lost in spatial-only attention. It then integrated three parallel branches: channel attention, spatial attention, and frequency-domain attention. Finally, a Residual Aggregation Gate-controlled Convolution (RAGConv) module was developed for the neck network. This module included a residual aggregation path to collect multi-layer feature information and a gate-control unit that dynamically weighted feature components based on their relevance. The residual structure provided a direct gradient-propagation path, alleviating gradient vanishing during backpropagation and ensuring efficient information transfer during feature fusion.
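The parameter saving from the 5×5-to-dual-3×3 substitution can be checked with a back-of-envelope calculation. The sketch below uses generic convolution arithmetic; the channel width of 64 is an assumed example for illustration, not a figure from the paper:

```python
# Compare one 5x5 standard convolution against two stacked 3x3 depthwise
# separable convolutions, for C input and C output channels (biases ignored).

def std_conv_params(k: int, c_in: int, c_out: int) -> int:
    """Weights of a standard k x k convolution."""
    return k * k * c_in * c_out

def dw_sep_params(k: int, c: int) -> int:
    """Depthwise k x k (k*k*c weights) followed by a 1x1 pointwise (c*c)."""
    return k * k * c + c * c

def receptive_field(kernels: list[int]) -> int:
    """Receptive field of a stack of stride-1 convolutions."""
    rf = 1
    for k in kernels:
        rf += k - 1
    return rf

C = 64  # assumed channel width, for illustration only
baseline = std_conv_params(5, C, C)   # 25 * C^2 = 102_400
proposed = 2 * dw_sep_params(3, C)    # 2 * (9C + C^2) = 9_344

print(receptive_field([3, 3]))        # 5 -> same coverage as one 5x5 kernel
print(baseline, proposed)             # 102400 9344
print(f"{1 - proposed / baseline:.1%} fewer weights")
```

The same arithmetic holds for any channel width: the depthwise separable pair scales as 2C² + 18C versus 25C² for the standard kernel, so the relative saving approaches 92% as C grows.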
A systematic experimental framework was established to comprehensively evaluate model performance: (1) Ablation studies were conducted using a controlled-variable approach to verify the individual contributions of the improved MobileNetV4, D-F-Ramit, and RAGConv modules; (2) Lesion-scale sensitivity analysis assessed detection performance across different lesion sizes, with emphasis on small-spot recognition; (3) Resolution impact analysis evaluated five common input resolutions (320×320–736×736) to explore the trade-offs among accuracy, speed, and computational efficiency; and (4) Embedded deployment validation involved model quantization and implementation on the Rockchip RK3588 platform to measure inference speed and power consumption on edge devices. [Results and Discussion] YOLO-MDFR achieved a mAP@0.5 of 95.6% on this dataset. Compared to YOLOv12m, it improved accuracy by 2.0%, reduced parameters by 61.5%, and lowered computational complexity by 68.5%. Real-time testing showed 43.4 FPS on an NVIDIA RTX 4060 GPU (CUDA 12.2) and 22.8 FPS on a Rockchip RK3588 embedded platform with only 3.5 W power consumption, making it suitable for battery-powered field devices. Lesion-scale analysis revealed 33.5% accuracy for lesions smaller than 16×16 pixels, which are critical for early detection. Confusion matrix evaluation showed reduced misclassification: the rate of confusing bacterial leaf spot with thrips damage fell from 5.8% to 2.1%, the rate of confusing tobacco mosaic virus with healthy leaves fell from 3.2% to 1.5%, and the overall misclassification rate was 2.3%. Experiments across varying input resolutions revealed a clear performance–resolution trade-off. As resolution increased from 320×320 to 736×736, mAP rose from 89.5% to 96.2%, with diminishing returns beyond 512×512. Concurrently, computational cost grew roughly quadratically, reducing inference speed from 65.2 FPS to 35.1 FPS. [Conclusion] This study presents YOLO-MDFR, a lightweight detection model for identifying pepper leaf diseases and pests under complex natural conditions.
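The roughly quadratic growth in cost with input resolution follows from convolutional FLOPs scaling with the number of spatial positions, H×W, for a fixed network. The sketch below illustrates this with a generic approximation (the intermediate resolutions are assumed examples; only the 320×320 and 736×736 endpoints come from the text above):

```python
# Relative convolutional cost as the input resolution grows: for a fixed
# network, FLOPs scale approximately with H * W, the spatial position count.

def relative_cost(side: int, base_side: int = 320) -> float:
    """Cost of a side x side input relative to the base resolution."""
    return (side * side) / (base_side * base_side)

for side in (320, 416, 512, 640, 736):
    print(f"{side}x{side}: {relative_cost(side):.2f}x")
# 736x736 costs 5.29x as much as 320x320, which is why throughput drops
# sharply even as mAP gains flatten beyond 512x512.
```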
By integrating an improved MobileNetV4 backbone, a multi-dimensional frequency reciprocal attention mixing transformer (D-F-Ramit), and a residual aggregation gate-controlled convolution (RAGConv) module, YOLO-MDFR outperforms mainstream detection models in both accuracy and efficiency. Systematic deployment experiments yielded optimized configurations for different application scenarios. Despite its strong performance, the model shows limitations in robustness under extreme lighting, generalization to emerging diseases, and detection of small targets under occlusion. Future work will address these issues through ambient-light data fusion, domain adaptation with semi-supervised learning, and binocular vision integration.
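The gate-controlled residual fusion idea behind RAGConv can be sketched minimally as follows. The abstract does not give the exact formulation, so this is a generic gating sketch in plain scalar Python (a real implementation would operate on tensors); all names here are illustrative:

```python
import math

# Minimal sketch of gate-controlled residual fusion: a sigmoid gate decides,
# per element, how much of the aggregated multi-layer feature to mix into
# the residual path.

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def gated_residual_fusion(residual, aggregated, gate_logits):
    """out = residual + g * aggregated, with g = sigmoid(gate_logit).

    The additive residual term gives gradients a direct path back to earlier
    layers, which is the vanishing-gradient argument made for the module.
    """
    return [
        r + sigmoid(z) * a
        for r, a, z in zip(residual, aggregated, gate_logits)
    ]

out = gated_residual_fusion(
    residual=[1.0, 2.0, 3.0],
    aggregated=[0.5, -0.5, 1.0],
    gate_logits=[10.0, -10.0, 0.0],  # ~open, ~closed, and half-open gates
)
print(out)  # approximately [1.5, 2.0, 3.5]
```

With the gate saturated closed, the module degenerates to a pure identity shortcut, which is what keeps the gradient path intact during feature fusion.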

Key words: pepper leaf, leaf disease and pest detection, MobileNetV4, lightweight deep learning model, attention mechanism

CLC Number: