
Smart Agriculture



A Transfer Learning-Based Multimodal Model for Grape Detection and Counting

XU Wenwen1,2, YU Kejian3, DAI Zexu1,2, WU Yunzhi1,2

  1. School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei 230036, China
  2. Anhui Beidou Precision Agriculture Information Engineering Research Center, Hefei 230036, China
  3. School of Computer Science and Technology, Donghua University, Shanghai 201620, China
  • Received: 2025-04-06 Online: 2025-06-16
  • Foundation items: 2024 Anhui Provincial Science and Technology Innovation Plan Project (202423k09020031)
  • About author: XU Wenwen, Master's degree candidate, research interests: computer vision. E-mail: wenwenxu@stu.ahau.edu.cn
  • Corresponding author: WU Yunzhi, Ph.D., Associate Professor, research interests: computer vision. E-mail: wuyzh@ahau.edu.cn


Abstract:

[Objective] Grapes are among the cash crops with the largest combined production value in the world, so grape yield estimation is of great importance to agricultural and economic development. At present, however, grape yield prediction is difficult and costly: existing methods struggle with green grape varieties, whose berries are similar in color to the leaves, and perform poorly on grape bunches composed of small berries. To solve these problems, a multimodal detection framework based on transfer learning is proposed, which aims to realize the detection and counting of different grape varieties and thereby provide reliable technical support for grape yield prediction and intelligent orchard management. [Methods] A multimodal grape detection framework based on transfer learning was proposed. Transfer learning exploited the feature representation capabilities of pretrained models, requiring only a small number of grape images for fine-tuning to adapt to the task; this reduced labeling costs while enhancing the ability to capture grape features. The framework adopted a dual-encoder-single-decoder structure consisting of three core modules: an image and text feature extraction and enhancement module, a language-guided query selection module, and a cross-modality decoder module. In the feature extraction stage, the framework employed models pretrained on public datasets, which significantly reduced training time and cost on the target task while improving the capture of grape features. A feature enhancement module achieved cross-modality fusion between grape images and text, and attention mechanisms were applied to enhance both the image and text features, facilitating cross-modality feature learning between the two.
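The attention-based cross-modality feature enhancement described above can be sketched as a pair of cross-attention blocks in which image tokens attend to text tokens and vice versa. This is a minimal illustration under assumed dimensions and layer choices, not the authors' implementation:

```python
import torch
import torch.nn as nn

class CrossModalEnhancer(nn.Module):
    """Minimal feature-enhancement sketch: image tokens attend to text
    tokens and vice versa, each stream keeping a residual connection.
    Hidden size and head count are illustrative assumptions."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.img2txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt2img = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, img_feats, txt_feats):
        # image tokens query the text tokens (cross-attention)
        img_enh, _ = self.img2txt(img_feats, txt_feats, txt_feats)
        # text tokens query the image tokens
        txt_enh, _ = self.txt2img(txt_feats, img_feats, img_feats)
        return img_feats + img_enh, txt_feats + txt_enh
```

In practice such a block would be stacked several times inside the dual encoder; the residual additions keep the original unimodal features available alongside the fused ones.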
During the cross-modality query selection phase, the framework used a language-guided query selection strategy to filter queries from the grape images. This strategy made more effective use of the input text to guide object detection, selecting the features most relevant to the input text as queries for the decoder. The cross-modality decoder combined features from the grape image and text modalities to achieve more accurate modality alignment, enabling a more effective fusion of grape image and text information and ultimately producing the corresponding grape predictions. Finally, to comprehensively evaluate model performance, mean average precision (mAP) and average recall (AR) were adopted as evaluation metrics for the detection task, while the counting task was quantified using mean absolute error (MAE) and root mean square error (RMSE). [Results and Discussions] This method exhibited optimal performance in both detection and counting when compared with nine baseline models. Specifically, on the WGISD public dataset the method achieved an mAP50 of 80.3% in the detection task, a 2.7 percentage point improvement over the second-best model. It also reached 53.2% mAP and 58.2% mAP75, surpassing the second-best models by 13.4 and 22 percentage points, respectively, and achieved an mAR of 76.5%, a 9.8 percentage point increase over the next-best model. In the counting task, the method achieved an MAE of 1.65 and an RMSE of 2.48, outperforming all baseline models. Furthermore, experiments on a total of nine grape varieties drawn from the WGISD dataset and field-collected data yielded an mAP50 of 82.5%, 58.5% mAP, 64.4% mAP75, 77.1% mAR, an MAE of 1.44, and an RMSE of 2.19.
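The language-guided query selection step can be illustrated as choosing the image tokens most similar to any text token as decoder queries. A hypothetical sketch: dot-product similarity and the `num_queries` value are assumptions for illustration, not details from the paper:

```python
import torch

def language_guided_select(img_feats, txt_feats, num_queries=100):
    """Select the image tokens best matched by the text as decoder queries.

    img_feats: (B, N_img, D) image tokens; txt_feats: (B, N_txt, D) text tokens.
    Returns (B, num_queries, D) selected query features."""
    sim = img_feats @ txt_feats.transpose(1, 2)      # (B, N_img, N_txt) similarities
    scores = sim.max(dim=-1).values                  # best text match per image token
    top = scores.topk(num_queries, dim=1).indices    # indices of top-scoring tokens
    idx = top.unsqueeze(-1).expand(-1, -1, img_feats.size(-1))
    return torch.gather(img_feats, 1, idx)
```

The selected features then serve as the decoder's initial queries, so that regions described by the text phrase (e.g. "grape cluster") dominate the detection heads.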
These results demonstrated the model's strong adaptability and effectiveness across diverse grape varieties. Notably, the method not only performed well on large grape clusters but also showed superior performance on smaller ones, achieving an mAP_s of 74.2% in the detection task, a 9.5 percentage point improvement over the second-best model. Additionally, to provide a more intuitive assessment of model performance, grape images from the test set were selected for visual comparison; the model's detection and counting results for grape clusters closely matched the original annotations in the label dataset. Overall, the method demonstrated strong generalization and higher accuracy under various environmental conditions and across grape varieties. This technology has the potential to be applied to estimating total orchard yield and reducing pre-harvest measurement errors, thereby effectively enhancing the precision management of vineyards. [Conclusions] The proposed method achieved higher accuracy and better adaptability in detecting five grape varieties than the other baseline models, and further demonstrated substantial practicality and robustness across nine different grape varieties. These findings suggest that the method has significant application potential in grape detection and counting tasks, providing strong technical support for the intelligent development of precision agriculture and the grape cultivation industry.
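The MAE and RMSE reported for the counting task follow their standard definitions over per-image cluster counts, e.g.:

```python
import math

def counting_errors(pred_counts, true_counts):
    """Mean absolute error and root mean square error between predicted
    and ground-truth grape-cluster counts (one value per image)."""
    diffs = [p - t for p, t in zip(pred_counts, true_counts)]
    mae = sum(abs(d) for d in diffs) / len(diffs)
    rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    return mae, rmse
```

RMSE penalizes large per-image miscounts more heavily than MAE, which is why both are reported together.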

Key words: transfer learning, counting, multimodal, detection, grape
