Smart Agriculture, 2022, Vol. 4, Issue 4: 84-104. doi: 10.12133/j.smartag.SA202210004

• Topic: Smart Farming of Field Crops

Multi-Class on-Tree Peach Detection Using Improved YOLOv5s and Multi-Modal Images

LUO Qing1,2,3, RAO Yuan1,2,3, JIN Xiu1,2,3, JIANG Zhaohui1,2,3, WANG Tan1,2,3, WANG Fengyi1,2,3, ZHANG Wu1,2,3

  1. College of Information and Computer Science, Anhui Agricultural University, Hefei 230036, China
  2. Key Laboratory of Agricultural Sensors, Ministry of Agriculture and Rural Affairs, Hefei 230036, China
  3. Anhui Provincial Key Laboratory of Smart Agricultural Technology and Equipment, Hefei 230036, China
  • Received: 2022-10-30; Online: 2022-12-30; Published: 2023-01-27
  • Corresponding author: RAO Yuan. E-mail: tsing.omg@gmail.com; raoyuan@ahau.edu.cn
  • About author: LUO Qing (1997-), male, graduate student, research interest: smart agriculture. E-mail: tsing.omg@gmail.com
  • Supported by:
    The Anhui Provincial Key Laboratory of Smart Agricultural Technology and Equipment (APKLSATE2021X004); The International Cooperation Project of Ministry of Agriculture and Rural Affairs (125A0607); The Key Research and Development Plan of Anhui Province (201904a06020056); The Natural Science Major Project for Anhui Provincial University (2022AH040125); The Natural Science Foundation of Anhui Province, China (2008085MF203)


Accurate peach detection is a prerequisite for automated agronomic management, e.g., mechanical peach harvesting. However, uneven illumination and ubiquitous occlusion make peaches difficult to detect, especially when they are bagged in orchards. To this end, this paper proposes an accurate multi-class peach detection method for mechanical harvesting, based on an improved YOLOv5s and multi-modal visual data. An RGB-D dataset with multi-class annotations of naked and bagged peaches was constructed, comprising 4127 multi-modal images, each consisting of pixel-aligned color, depth, and infrared images acquired with a consumer-level RGB-D camera. Subsequently, an improved lightweight YOLOv5s (small depth) model was put forward by introducing a direction-aware and position-sensitive attention mechanism, which captures long-range dependencies along one spatial direction while preserving precise positional information along the other, helping the network detect peach targets accurately. Meanwhile, depthwise separable convolution was employed to reduce model computation by decomposing the convolution operation into a convolution in the depth direction and a convolution in the width and height directions, which sped up training and inference while maintaining accuracy. Comparative experiments showed that the improved YOLOv5s using multi-modal visual data achieved detection mAP of 98.6% and 88.9% on naked and bagged peaches, respectively, with 5.05 M model parameters under complex illumination and severe occlusion. These values were 5.3% and 16.5% higher than those obtained using RGB images alone, and 2.8% and 6.2% higher than those of YOLOv5s. Compared with other networks in detecting bagged peaches, the improved YOLOv5s performed best in terms of mAP, which was 16.3%, 8.1%, and 4.5% higher than that of YOLOX-Nano, PP-YOLO-Tiny, and EfficientDet-D0, respectively.
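To illustrate why depthwise separable convolution reduces model size, the following minimal sketch counts the weights of a standard convolution and of its depthwise separable counterpart. The layer sizes (3×3 kernel, 64 input channels, 128 output channels) are hypothetical values chosen for illustration and are not taken from the paper; biases are ignored.

```python
def standard_conv_params(kernel_size, in_channels, out_channels):
    """Weights in a standard convolution: every output channel
    has a k x k filter over every input channel."""
    return kernel_size * kernel_size * in_channels * out_channels


def depthwise_separable_params(kernel_size, in_channels, out_channels):
    """Weights in a depthwise separable convolution (biases ignored)."""
    # Depthwise step: one k x k spatial filter per input channel.
    depthwise = kernel_size * kernel_size * in_channels
    # Pointwise step: a 1 x 1 convolution that mixes channels.
    pointwise = in_channels * out_channels
    return depthwise + pointwise


# Hypothetical layer sizes, for illustration only.
k, c_in, c_out = 3, 64, 128

standard = standard_conv_params(k, c_in, c_out)
separable = depthwise_separable_params(k, c_in, c_out)
ratio = separable / standard  # equals 1/c_out + 1/k**2

print(f"standard: {standard}, separable: {separable}, ratio: {ratio:.3f}")
```

For a 3×3 kernel the ratio approaches 1/9 as the number of output channels grows, which is the usual source of the savings in parameters and computation.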
In addition, the improved YOLOv5s model outperformed other methods to varying degrees in detecting Fuji apples and Hayward kiwifruit, verifying its effectiveness on different fruit detection tasks. Further investigation revealed the contribution of each imaging modality, as well as of the proposed improvements to YOLOv5s, to the favorable detection results for both naked and bagged peaches in natural orchards. Additionally, on a popular mobile hardware platform, the improved YOLOv5s model performed 19 detections per second on the five-channel multi-modal images, offering real-time peach detection. These promising results demonstrate the potential of the improved YOLOv5s and multi-modal visual data with multi-class annotations to provide visual intelligence for automated fruit harvesting systems.
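As a rough illustration of what a five-channel multi-modal input can look like, the sketch below stacks a pixel-aligned color image (3 channels), a depth map (1 channel), and an infrared image (1 channel) into a single array. The function name `stack_modalities`, the channel order, and the array sizes are assumptions made for this example; the paper's actual preprocessing is not described in the abstract.

```python
import numpy as np


def stack_modalities(rgb, depth, infrared):
    """Stack pixel-aligned RGB (H, W, 3), depth (H, W), and
    infrared (H, W) arrays into one (H, W, 5) float32 array.

    The channel order (R, G, B, depth, infrared) is an assumption.
    """
    if rgb.ndim != 3 or rgb.shape[2] != 3:
        raise ValueError("rgb must have shape (H, W, 3)")
    if depth.shape != rgb.shape[:2] or infrared.shape != rgb.shape[:2]:
        raise ValueError("depth and infrared must match rgb height and width")
    # np.dstack promotes each 2-D array to (H, W, 1) before concatenating
    # along the channel axis.
    return np.dstack([rgb, depth, infrared]).astype(np.float32)


# Tiny synthetic example (hypothetical 4 x 6 images).
rgb = np.zeros((4, 6, 3), dtype=np.uint8)
depth = np.ones((4, 6), dtype=np.uint16)
infrared = np.full((4, 6), 2, dtype=np.uint8)

fused = stack_modalities(rgb, depth, infrared)
print(fused.shape)
```

A detector consuming such input would need its first convolution layer to accept five input channels instead of three.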

Key words: multi-class detection, YOLOv5s, multi-modal visual data, mechanical harvesting, deep learning
