
Smart Agriculture ›› 2024, Vol. 6 ›› Issue (6): 121-131. DOI: 10.12133/j.smartag.SA202407008

• Topic: Agricultural Knowledge Intelligent Service and Smart Unmanned Farms (Part 1) •


Grape Recognition and Localization Method Based on 3C-YOLOv8n and Depth Camera

LIU Chang, SUN Yu, YANG Jing, WANG Fengchao, CHEN Jin

  1. College of Sciences, Shanghai Institute of Technology, Shanghai 201418, China
  • Received: 2024-07-09 Online: 2024-11-30
  • Foundation items: Shanghai Sailing Program, China (20YF1447600); Research Start-up Project of Shanghai Institute of Technology (YJ2021-60); Collaborative Innovation Project of Shanghai Institute of Technology (XTCX2023-22); Science and Technology Talent Development Fund for Young and Middle-aged Teachers at Shanghai Institute of Technology (ZQ2022-6)
  • About author:
    LIU Chang, research interests: machine vision. E-mail:
  • Corresponding authors:
    CHEN Jin, Ph.D., Associate Professor, research interests: machine vision and embedded system development. E-mail:
    WANG Fengchao, Ph.D., Associate Professor, research interests: optoelectronic application system development. E-mail:


Abstract:

[Objective] Grape picking is a key link in increasing grape production, but it still demands large amounts of manpower and material resources, which makes the picking process complex and slow. To improve harvesting efficiency and enable automated grape harvesting, an improved YOLOv8n object detection model named 3C-YOLOv8n was proposed and combined with the RealSense D415 depth camera for grape recognition and localization. [Methods] The proposed 3C-YOLOv8n inserted a convolutional block attention module (CBAM) between the first C2f module and the third Conv module of the backbone network, and embedded a coordinate attention (CA) module at the end of the backbone, yielding a new 2C-C2f backbone architecture. This design enabled the model to sequentially infer attention maps along two independent dimensions (channel and spatial) and to refine features using both inter-channel relationships and positional information, while keeping the network flexible and lightweight. Furthermore, the content-aware reassembly of features (CARAFE) upsampling operator, which reassembles neighboring pixels with instance-specific, content-aware kernels, replaced the nearest-neighbor interpolation operator in the YOLOv8n neck network. This enlarged the receptive field and guided the reconstruction process from the input features while keeping the parameter count and computational complexity low, forming the 3C-YOLOv8n model. For localization, the pyrealsense2 library was used to obtain pixel position information for the target area from the Intel RealSense D415 camera. The depth camera captured the images, the detection algorithm located the grapes, and the camera's depth sensor provided the three-dimensional point cloud of the grapes, from which the distance from a pixel to the camera was calculated and the three-dimensional coordinates of the center of the target's bounding box in the camera coordinate system were determined, thereby achieving grape recognition and localization.
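The localization step above maps a detected pixel to camera-frame coordinates. Below is a minimal sketch of that step using the pyrealsense2 API; the stream settings and the helper name locate_grape are illustrative assumptions rather than the authors' exact configuration, and the bounding-box centre (u, v) is assumed to come from the 3C-YOLOv8n detector.

```python
import pyrealsense2 as rs

# Illustrative sketch: resolution and frame rate are assumptions, not the paper's settings.
pipeline = rs.pipeline()
config = rs.config()
config.enable_stream(rs.stream.depth, 640, 480, rs.format.z16, 30)
config.enable_stream(rs.stream.color, 640, 480, rs.format.bgr8, 30)
pipeline.start(config)
align = rs.align(rs.stream.color)  # align depth pixels to the color image seen by the detector

def locate_grape(u, v):
    """Hypothetical helper: deproject the bbox centre (u, v) to camera-frame XYZ in metres."""
    frames = align.process(pipeline.wait_for_frames())
    depth_frame = frames.get_depth_frame()
    depth_m = depth_frame.get_distance(u, v)  # distance from the camera to the grape pixel
    intrin = depth_frame.profile.as_video_stream_profile().intrinsics
    # Combine the 2-D pixel coordinates with depth to obtain 3-D camera coordinates
    return rs.rs2_deproject_pixel_to_point(intrin, [u, v], depth_m)
```

In practice the detector's bounding-box centre would be passed in after each inference, and the returned (X, Y, Z) point would be transformed into the robot or world frame for the picking arm.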
[Results and Discussions] In comparative and ablation experiments, the 3C-YOLOv8n model achieved a mean average precision (mAP) of 94.3% at an intersection-over-union threshold of 0.5 (IoU=0.5), 1% higher than the YOLOv8n model. Precision (P) and recall (R) were 91.6% and 86.4%, increases of 0.1% and 0.7%, respectively, and the F1-Score improved by 0.4%, showing that the improved network met the experimental precision and recall requirements. In terms of loss, the 3C-YOLOv8n algorithm performed best: its loss values dropped rapidly with minimal fluctuation and converged to the lowest value, indicating that the improved algorithm reached convergence quickly and improved both accuracy and convergence speed. The ablation experiments showed that the original YOLOv8n model yielded a mAP of 93.3%; integrating the CBAM or CA attention mechanism into the backbone each raised the mAP to 93.5%, and adding the CARAFE upsampling operator to the neck raised it by 0.5% to 93.8%. Combinations of the three improvement strategies increased the mAP by 0.3%, 0.7%, and 0.8%, respectively, over the YOLOv8n model. Overall, the 3C-YOLOv8n model demonstrated the best detection performance, achieving the highest mAP of 94.3%, and the ablation results confirmed the positive impact of the proposed improvement strategies. Compared with other mainstream YOLO-series algorithms, all evaluation metrics improved, and 3C-YOLOv8n had the lowest missed-detection and false-detection rates among all tested algorithms, underscoring its practical advantage in detection tasks. [Conclusions] By addressing the inefficiency of manual labor, the 3C-YOLOv8n network model not only improves the precision of grape recognition and localization but also markedly improves overall harvesting efficiency. Its superior precision, recall, mAP, and F1-Score, together with the lowest recorded loss values among the YOLO-series algorithms, indicate a clear advance in model convergence and operational effectiveness. Furthermore, the model's high accuracy in grape target recognition lays the groundwork for automated harvesting systems and enables complementary intelligent operations.
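As a quick consistency check on the metrics cited above, the F1-Score follows directly from the reported precision and recall under the standard definition (the harmonic mean of the two); the values below are taken from the abstract:

```python
# F1 is the harmonic mean of precision (P) and recall (R); values from the abstract.
P, R = 0.916, 0.864
f1 = 2 * P * R / (P + R)
print(f"F1-Score = {f1:.3f}")  # prints F1-Score = 0.889
```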

Key words: machine vision, YOLOv8n, object detection, grape, CBAM, depth camera

CLC number: