
Smart Agriculture ›› 2026, Vol. 8 ›› Issue (2): 188-199. doi: 10.12133/j.smartag.SA202507030

• Information Processing and Decision Making •

Segmentation and Counting Method for Densely Distributed Rice Seeds in Seedling Trays Based on Caption Grounding and Generation

OUYANG Meng1, ZOU Rong1(), CHEN Jin1, LI Yaoming2, CHEN Yuhang1, YAN Hao1

  1. School of Mechanical Engineering, Jiangsu University, Zhenjiang 212013, Jiangsu, China
    2. School of Agricultural Engineering, Jiangsu University, Zhenjiang 212013, Jiangsu, China
  • Received: 2025-07-21 Online: 2026-03-30
  • Foundation items:
    The National Natural Science Foundation of China (31871528)
  • About author:

    OUYANG Meng, master's student; research interest: object instance segmentation. E-mail:

  • Corresponding author:
    ZOU Rong, Ph.D., associate professor; research interests: deep learning, artificial intelligence, and machine vision. E-mail:

CGG-Based Segmentation and Counting of Densely Distributed Rice Seeds in Seedling Trays

OUYANG Meng1, ZOU Rong1(), CHEN Jin1, LI Yaoming2, CHEN Yuhang1, YAN Hao1   

  1. School of Mechanical Engineering, Jiangsu University, Zhenjiang 212013, China
    2. School of Agricultural Engineering, Jiangsu University, Zhenjiang 212013, China
  • Received: 2025-07-21 Online: 2026-03-30
  • Foundation items: The National Natural Science Foundation of China (31871528)
  • About author:

    OUYANG Meng, master's student; research interest: object instance segmentation. E-mail:

  • Corresponding author:
    ZOU Rong, Ph.D., associate professor; research interests: deep learning, artificial intelligence, and machine vision. E-mail:

Abstract:

[Objective/Significance] With the rapid development of factory-based seedling raising, intelligentization of the seedling-raising stage has become key to improving the efficiency and quality of rice production. Accurate identification of the number of rice seeds in each cavity of a seedling tray directly affects the operating efficiency and parameter optimization of air-vibration precision seeding devices. In complex environments, however, seed detection accuracy within cavities is low, and precise single-seed segmentation is difficult to achieve. [Methods] A rice seed instance segmentation method was proposed that integrates a Caption Grounding and Generation (CGG) model with pretrained models. Through joint alignment of image and text features, the method achieves collaborative learning of object localization and semantic understanding, significantly improving seed detection and segmentation accuracy. [Results and Discussion] Ablation experiments showed that both core improvements, pseudo-labels generated by the bootstrapping language-image pre-training (BLIP) model and word embeddings from bidirectional encoder representations from Transformers (BERT), improved model performance. Combined, they exhibited a synergistic effect, raising segmentation accuracy by more than 3 percentage points over the baseline. At an intersection-over-union threshold of 0.5, the proposed CGG model achieved a bounding-box detection average precision of 90.7% and an instance segmentation average precision of 91.4%, significantly outperforming mainstream models such as Mask R-CNN and Mask2Former. In further validation under seeding scenarios, the CGG model achieved the best per-cavity seed-count detection accuracy of 88%. Its single-seed error metrics, including a root mean square error of 16.8 seeds, a mean absolute error of 13.7 seeds, and a mean absolute percentage error of 2.46%, were all significantly lower than those of the comparison models. [Conclusions] The model can detect seed counts online in real time, providing a quantifiable operational basis for the precise reseeding of cavities that require replanting. It shows good prospects for practical application and can strongly support the development of smart agriculture and intelligent seeding technology.

Key words: rice seeding, BERT, instance segmentation, pre-trained model, visual-language model, caption grounding and generation, large model

Abstract:

[Objective] The precise quantification of rice seeds within individual cavities of seedling trays constitutes a critical operational parameter for optimizing seeding efficiency and fine-tuning the performance of air-vibration precision seeders. Achieving high accuracy in this task directly impacts resource utilization, seedling uniformity, and ultimately crop yield. However, the operational environment presents significant challenges, including complex backgrounds, seed overlap, variations in lighting and seed orientation, and the inherent difficulty of distinguishing individual seeds within dense clusters. These factors often lead to suboptimal performance in existing automated detection systems, manifesting as low detection accuracy and an inability to achieve robust, precise instance segmentation of individual rice seeds. To address these persistent limitations and advance the state-of-the-art in precision seeding monitoring, an integrated framework for rice seed instance segmentation was proposed. The core innovation lies in the synergistic combination of a caption grounding and generation (CGG) network with a pretrained model, which is designed to leverage complementary information from visual and textual domains. [Methods] The proposed methodology fundamentally aimed to bridge the gap between visual perception and semantic understanding within the specific context of rice seed detection. The CGG-pretrained model framework achieved this through deep joint alignment of visual features extracted from seedling tray images and textual features derived from contextual knowledge. This cross-modal grounding enabled collaborative learning, where the visual processing stream (handling object localization and pixel-level segmentation) was continuously informed and refined by the semantic understanding stream (interpreting context and relationships).
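The joint image-text alignment described above can be illustrated with a minimal sketch: region-level visual features are compared against text-descriptor embeddings by cosine similarity, and each region is grounded to its best-matching descriptor. The function name, feature dimensions, and random features below are hypothetical illustrations, not the paper's implementation.

```python
import numpy as np

def grounding_scores(region_feats: np.ndarray, text_feats: np.ndarray) -> np.ndarray:
    """Cosine similarity between image-region features (R x D) and
    text-descriptor embeddings (T x D); higher means stronger grounding."""
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    return r @ t.T  # (R, T) similarity matrix

rng = np.random.default_rng(0)
regions = rng.normal(size=(4, 8))  # 4 candidate seed regions (toy features)
texts = rng.normal(size=(2, 8))    # 2 textual descriptors, e.g. "rice seed"
scores = grounding_scores(regions, texts)
best = scores.argmax(axis=1)       # best-grounded descriptor per region
```

In a real cross-modal model the similarity matrix would feed an attention or contrastive loss; the sketch only shows the alignment score itself.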
Specifically, the visual backbone network processed input imagery to generate feature maps, while the pretrained language model component, which utilized contextual embeddings, generated semantically rich textual representations. The CGG module acted as the fusion engine, establishing explicit correspondences between specific regions in the image (potential seeds or clusters) and relevant semantic concepts or descriptors provided by the pretrained model. This bidirectional interaction significantly enhanced the model's ability to disambiguate overlapping seeds, resolve occlusions, and accurately delineate individual seed boundaries under challenging conditions. Key technical innovations validated through rigorous ablation studies include: (1) the strategic use of the bootstrapping language-image pre-training (BLIP) model for generating high-quality pseudo-labels from unlabeled or weakly labeled image data, facilitating more effective semi-supervised learning and reducing annotation burden, and (2) the application of bidirectional encoder representations from transformers (BERT)-based word embeddings to capture deep semantic relationships and contextual nuances within textual descriptors related to seeds and seeding environments. [Results and Discussions] The ablation experiments demonstrated a pronounced synergistic effect when the core improvements were combined, resulting in a segmentation accuracy improvement exceeding 3 percentage points compared to a baseline model lacking this integration. Comprehensive experimental evaluation demonstrated the superior performance of the proposed CGG model against established benchmarks. Under the standard intersection over union (IoU) threshold of 0.5, the model achieved a mean average precision (mAP) of 90.7% for bounding box detection (denoted as mAP50bb for detection) and an outstanding 91.4% mAP for instance segmentation (denoted as mAP50seg for segmentation).
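The mAP50 figures quoted above rest on the standard IoU matching rule: a predicted box counts as a true positive only when its IoU with a ground-truth box reaches at least 0.5. A minimal sketch of that criterion, with boxes given as corner coordinates (not the authors' code):

```python
def iou(a, b):
    """Intersection over union for boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

# Half-overlapping unit boxes: IoU = 50 / 150 ≈ 0.333, below the 0.5 cutoff,
# so this prediction would not count as a true positive at mAP50.
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))
```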
These results represented a statistically significant improvement over leading contemporary models, including Mask R-CNN (mask region-based convolutional neural network) and Mask2Former, which highlighted the efficacy of the cross-modal grounding approach in accurately localizing and segmenting individual rice seeds. Further validation within realistic seeding trial scenarios, which involved direct comparison with meticulous manual annotations, confirmed the model's practical robustness. The CGG model attained the highest accuracy in two critical operational metrics: (1) precision in segmenting individual seed instances (single-seed segmentation accuracy), and (2) accuracy in determining the exact seed count per cavity, achieving an average accuracy of 88% for per-cavity quantification. Moreover, the model exhibited superior performance in minimizing estimation errors for cavity seed counts, as evidenced by its significantly lower error metrics: a root mean square error (RMSE) of 16.8 seeds, a mean absolute error (MAE) of 13.7 seeds, and a mean absolute percentage error (MAPE) of 2.46%. These error values were markedly lower than those recorded by the comparison models, which underscored the CGG model's enhanced reliability in practical counting tasks. The discussion contextualized these results and attributed the performance gains to the model's ability to leverage semantic context to resolve ambiguities inherent in visual-only approaches, particularly in dense and overlapping seed scenarios common in precision seeding trays. [Conclusions] The developed CGG-pretrained model integration presents a significant advancement in automated monitoring for precision rice seeding. The model successfully addresses the core challenges of low detection accuracy and imprecise instance segmentation for seeds in complex environments.
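The three count-error metrics reported above (RMSE, MAE, MAPE) follow their standard definitions; a small self-contained sketch, using made-up counts purely for illustration:

```python
import math

def count_errors(pred, true):
    """RMSE, MAE, and MAPE (%) between predicted and true seed counts."""
    n = len(pred)
    diffs = [p - t for p, t in zip(pred, true)]
    rmse = math.sqrt(sum(d * d for d in diffs) / n)
    mae = sum(abs(d) for d in diffs) / n
    mape = 100.0 * sum(abs(d) / t for d, t in zip(diffs, true)) / n
    return rmse, mae, mape

# Hypothetical per-tray counts, not data from the paper.
rmse, mae, mape = count_errors([98, 105, 110], [100, 100, 112])
```

Note that MAPE divides each absolute error by the true count, so it is only meaningful when every true count is nonzero.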
Its high accuracy in both individual seed segmentation and per-cavity seed count quantification, coupled with low error rates, demonstrates strong potential for practical deployment. Importantly, the model enables real-time detection of rice seeds during the image analysis stage; this functionality provides a quantifiable, data-driven basis for making immediate operational decisions, most notably enabling the targeted precision reseeding of empty or under-seeded cavities identified during the seeding process. By ensuring optimal seed placement and density from the outset, the technology contributes directly to improved resource efficiency (reducing seed waste), enhanced seedling uniformity, and potentially higher crop yields. Future work will focus on further optimizing inference speed for higher-throughput seeding lines and exploring generalization to other crop types and seeding mechanisms.
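As a toy illustration of the reseeding decision described in the conclusions, per-cavity counts that fall below a target can be flagged for targeted reseeding. The `target` threshold and function name here are hypothetical parameters for the sketch, not values from the paper:

```python
def cavities_to_reseed(counts, target=2):
    """Return indices of cavities whose detected seed count is below the
    target, i.e. candidates for targeted precision reseeding."""
    return [i for i, c in enumerate(counts) if c < target]

# Cavity 1 is empty and cavity 2 is under-seeded, so both are flagged.
print(cavities_to_reseed([2, 0, 1, 3], target=2))  # [1, 2]
```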

Key words: rice seeding, BERT, instance segmentation, pre-trained model, visual-language model, caption grounding and generation, large model

CLC number: