
Smart Agriculture ›› 2026, Vol. 8 ›› Issue (2): 188-199. doi: 10.12133/j.smartag.SA202507030

• Information Processing and Decision Making •

Segmentation and Counting Method for Densely Distributed Rice Seeds in Seedling Trays Based on Caption Grounding and Generation

OUYANG Meng1, ZOU Rong1(), CHEN Jin1, LI Yaoming2, CHEN Yuhang1, YAN Hao1

  1. School of Mechanical Engineering, Jiangsu University, Zhenjiang 212013, Jiangsu, China
    2. School of Agricultural Engineering, Jiangsu University, Zhenjiang 212013, Jiangsu, China
  • Received: 2025-07-21 Online: 2026-03-30
  • Foundation items:
    The National Natural Science Foundation of China (31871528)
  • About author:

    OUYANG Meng, master's student; research interest: object instance segmentation. E-mail:

  • Corresponding author:
    ZOU Rong, Ph.D., associate professor; research interests: deep learning, artificial intelligence, and machine vision. E-mail:

CGG-Based Segmentation and Counting of Densely Distributed Rice Seeds in Seedling Trays

OUYANG Meng1, ZOU Rong1(), CHEN Jin1, LI Yaoming2, CHEN Yuhang1, YAN Hao1   

  1. School of Mechanical Engineering, Jiangsu University, Zhenjiang 212013, China
    2. School of Agricultural Engineering, Jiangsu University, Zhenjiang 212013, China
  • Received: 2025-07-21 Online: 2026-03-30
  • Foundation items: The National Natural Science Foundation of China (31871528)
  • About author:

    OUYANG Meng, master's student; research interest: object instance segmentation. E-mail:

  • Corresponding author:
    ZOU Rong, Ph.D., associate professor; research interests: deep learning, artificial intelligence, and machine vision. E-mail:

Abstract:

[Objective/Significance] With the rapid development of factory-based seedling raising, intelligentization of the seedling-raising stage has become key to improving the efficiency and quality of rice production. Accurate identification of the number of rice seeds in each cavity of a seedling tray directly affects the operating efficiency and parameter optimization of air-vibration precision seeding devices. In complex environments, however, seed detection accuracy within cavities is low, and precise single-seed segmentation is difficult to achieve. [Methods] A rice seed instance segmentation method was proposed that integrates a Caption Grounding and Generation (CGG) model with pretrained models. Through joint alignment of image and text features, the method achieves collaborative learning of object localization and semantic understanding, significantly improving seed detection and segmentation accuracy. [Results and Discussion] Ablation experiments showed that both core improvements, pseudo-labels generated by the bootstrapping language-image pre-training (BLIP) model and word embeddings from bidirectional encoder representations from Transformers (BERT), improved model performance. Combined, they exhibited a synergistic effect, raising segmentation accuracy by more than 3 percentage points over the baseline. At an intersection-over-union threshold of 0.5, the proposed CGG model achieved a bounding-box detection average precision of 90.7% and an instance segmentation average precision of 91.4%, significantly outperforming mainstream models such as Mask R-CNN and Mask2Former. In further validation under seeding scenarios, the CGG model achieved the best per-cavity seed-count detection accuracy of 88%. Its single-seed error metrics, including a root mean square error of 16.8 seeds, a mean absolute error of 13.7 seeds, and a mean absolute percentage error of 2.46%, were all significantly lower than those of the comparison models. [Conclusions] The model can detect seed counts online in real time, providing a quantifiable operational basis for the precise reseeding of cavities that require replanting. It shows good prospects for practical application and can strongly support the development of smart agriculture and intelligent seeding technology.

Key words: rice seeding, BERT, instance segmentation, pre-trained model, visual-language model, caption grounding and generation, large model

Abstract:

[Objective] The precise quantification of rice seeds within individual cavities of seedling trays constitutes a critical operational parameter for optimizing seeding efficiency and fine-tuning the performance of air-vibration precision seeders. Achieving high accuracy in this task directly impacts resource utilization, seedling uniformity, and ultimately crop yield. However, the operational environment presents significant challenges, including complex backgrounds, seed overlap, variations in lighting and seed orientation, and the inherent difficulty of distinguishing individual seeds within dense clusters. These factors often lead to suboptimal performance in existing automated detection systems, manifesting as low detection accuracy and an inability to achieve robust, precise instance segmentation of individual rice seeds. To address these persistent limitations and advance the state-of-the-art in precision seeding monitoring, an integrated framework for rice seed instance segmentation was proposed. The core innovation lies in the synergistic combination of a caption grounding and generation (CGG) network with a pretrained model, which is designed to leverage complementary information from visual and textual domains. [Methods] The proposed methodology fundamentally aimed to bridge the gap between visual perception and semantic understanding within the specific context of rice seed detection. The CGG-pretrained model framework achieved this through deep joint alignment of visual features extracted from seedling tray images and textual features derived from contextual knowledge. This cross-modal grounding enabled collaborative learning, where the visual processing stream (handling object localization and pixel-level segmentation) was continuously informed and refined by the semantic understanding stream (interpreting context and relationships).
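The joint image-text alignment described above can be illustrated with a minimal sketch: region-level visual features are compared against text-descriptor embeddings by cosine similarity, and each region is grounded to its best-matching descriptor. The function name, feature dimensions, and random features below are hypothetical illustrations, not the paper's implementation.

```python
import numpy as np

def grounding_scores(region_feats: np.ndarray, text_feats: np.ndarray) -> np.ndarray:
    """Cosine similarity between image-region features (R x D) and
    text-descriptor embeddings (T x D); higher means stronger grounding."""
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    return r @ t.T  # (R, T) similarity matrix

rng = np.random.default_rng(0)
regions = rng.normal(size=(4, 8))  # 4 candidate seed regions (toy features)
texts = rng.normal(size=(2, 8))    # 2 textual descriptors, e.g. "rice seed"
scores = grounding_scores(regions, texts)
best = scores.argmax(axis=1)       # best-grounded descriptor per region
```

In a real cross-modal model the similarity matrix would feed an attention or contrastive loss; the sketch only shows the alignment score itself.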
Specifically, the visual backbone network processed input imagery to generate feature maps, while the pretrained language model component, which utilized contextual embeddings, generated semantically rich textual representations. The CGG module acted as the fusion engine, establishing explicit correspondences between specific regions in the image (potential seeds or clusters) and relevant semantic concepts or descriptors provided by the pretrained model. This bidirectional interaction significantly enhanced the model's ability to disambiguate overlapping seeds, resolve occlusions, and accurately delineate individual seed boundaries under challenging conditions. Key technical innovations validated through rigorous ablation studies include: (1) the strategic use of the bootstrapping language-image pre-training (BLIP) model for generating high-quality pseudo-labels from unlabeled or weakly labeled image data, facilitating more effective semi-supervised learning and reducing annotation burden, and (2) the application of bidirectional encoder representations from transformers (BERT)-based word embeddings to capture deep semantic relationships and contextual nuances within textual descriptors related to seeds and seeding environments. [Results and Discussions] The ablation experiments demonstrated a pronounced synergistic effect when the core improvements were combined, resulting in a segmentation accuracy improvement exceeding 3 percentage points compared to a baseline model lacking this integration. Comprehensive experimental evaluation demonstrated the superior performance of the proposed CGG model against established benchmarks. Under the standard intersection over union (IoU) threshold of 0.5, the model achieved a mean average precision (mAP) of 90.7% for bounding box detection (denoted as mAP50bb for detection) and an outstanding 91.4% mAP for instance segmentation (denoted as mAP50seg for segmentation).
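The mAP50 figures quoted above rest on the standard IoU matching rule: a predicted box counts as a true positive only when its IoU with a ground-truth box reaches at least 0.5. A minimal sketch of that criterion, with boxes given as corner coordinates (not the authors' code):

```python
def iou(a, b):
    """Intersection over union for boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

# Half-overlapping unit boxes: IoU = 50 / 150 ≈ 0.333, below the 0.5 cutoff,
# so this prediction would not count as a true positive at mAP50.
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))
```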
These results represented a statistically significant improvement over leading contemporary models, including Mask R-CNN (mask region-based convolutional neural network) and Mask2Former, which highlighted the efficacy of the cross-modal grounding approach in accurately localizing and segmenting individual rice seeds. Further validation within realistic seeding trial scenarios, which involved direct comparison with meticulous manual annotations, confirmed the model's practical robustness. The CGG model attained the highest accuracy in two critical operational metrics: (1) precision in segmenting individual seed instances (single-seed segmentation accuracy), and (2) accuracy in determining the exact seed count per cavity, achieving an average accuracy of 88% for per-cavity quantification. Moreover, the model exhibited superior performance in minimizing estimation errors for cavity seed counts, as evidenced by its significantly lower error metrics: a root mean square error (RMSE) of 16.8 seeds, a mean absolute error (MAE) of 13.7 seeds, and a mean absolute percentage error (MAPE) of 2.46%. These error values were markedly lower than those recorded by the comparison models, which underscored the CGG model's enhanced reliability in practical counting tasks. The discussion contextualized these results and attributed the performance gains to the model's ability to leverage semantic context to resolve ambiguities inherent in visual-only approaches, particularly in dense and overlapping seed scenarios common in precision seeding trays. [Conclusions] The developed CGG-pretrained model integration presents a significant advancement in automated monitoring for precision rice seeding. The model successfully addresses the core challenges of low detection accuracy and imprecise instance segmentation for seeds in complex environments.
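The three count-error metrics reported above (RMSE, MAE, MAPE) follow their standard definitions; a small self-contained sketch, using made-up counts purely for illustration:

```python
import math

def count_errors(pred, true):
    """RMSE, MAE, and MAPE (%) between predicted and true seed counts."""
    n = len(pred)
    diffs = [p - t for p, t in zip(pred, true)]
    rmse = math.sqrt(sum(d * d for d in diffs) / n)
    mae = sum(abs(d) for d in diffs) / n
    mape = 100.0 * sum(abs(d) / t for d, t in zip(diffs, true)) / n
    return rmse, mae, mape

# Hypothetical per-tray counts, not data from the paper.
rmse, mae, mape = count_errors([98, 105, 110], [100, 100, 112])
```

Note that MAPE divides each absolute error by the true count, so it is only meaningful when every true count is nonzero.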
Its high accuracy in both individual seed segmentation and per-cavity seed count quantification, coupled with low error rates, demonstrates strong potential for practical deployment. Importantly, the model enables real-time detection of rice seeds during the image analysis stage; this functionality provides a quantifiable, data-driven basis for making immediate operational decisions, most notably enabling the targeted precision reseeding of empty or under-seeded cavities identified during the seeding process. By ensuring optimal seed placement and density from the outset, the technology contributes directly to improved resource efficiency (reducing seed waste), enhanced seedling uniformity, and potentially higher crop yields. Future work will focus on further optimizing inference speed for higher-throughput seeding lines and exploring generalization to other crop types and seeding mechanisms.
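As a toy illustration of the reseeding decision described in the conclusions, per-cavity counts that fall below a target can be flagged for targeted reseeding. The `target` threshold and function name here are hypothetical parameters for the sketch, not values from the paper:

```python
def cavities_to_reseed(counts, target=2):
    """Return indices of cavities whose detected seed count is below the
    target, i.e. candidates for targeted precision reseeding."""
    return [i for i, c in enumerate(counts) if c < target]

# Cavity 1 is empty and cavity 2 is under-seeded, so both are flagged.
print(cavities_to_reseed([2, 0, 1, 3], target=2))  # [1, 2]
```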

Key words: rice seeding, BERT, instance segmentation, pre-trained model, visual-language model, caption grounding and generation, large model

CLC number: