
Smart Agriculture

CGG-Based Segmentation and Counting of Densely Distributed Rice Seeds in Seedling Trays

OUYANG Meng1, ZOU Rong1, CHEN Jin1, LI Yaoming2, CHEN Yuhang1, YAN Hao1

  1. School of Mechanical Engineering, Jiangsu University, Zhenjiang 212013, China
  2. School of Agricultural Engineering, Jiangsu University, Zhenjiang 212013, China
  • Received: 2025-07-21  Online: 2025-10-09
  • Foundation items: the National Natural Science Foundation of China (31871528)
  • About author:
    OUYANG Meng, E-mail:
  • Corresponding author:
    ZOU Rong, E-mail:

Abstract:

[Objective] The precise quantification of rice seeds within individual cavities of seedling trays is a critical operational parameter for optimizing seeding efficiency and fine-tuning the performance of air-vibration precision seeders. Accuracy in this task directly impacts resource utilization, seedling uniformity, and ultimately crop yield. However, the operational environment presents significant challenges, including complex backgrounds, seed overlap, variations in lighting and seed orientation, and the inherent difficulty of distinguishing individual seeds within dense clusters. These factors often lead to suboptimal performance in existing automated detection systems, manifesting as low detection accuracy and an inability to achieve robust, precise instance segmentation of individual rice seeds. To address these persistent limitations and advance the state of the art in precision seeding monitoring, a novel, integrated framework for rice seed instance segmentation is proposed. The core innovation lies in the synergistic combination of a cross-modal grounding generation network (CGG) with a pretrained model, designed to leverage complementary information from the visual and textual domains.

[Methods] The proposed methodology aimed to bridge the gap between visual perception and semantic understanding in the specific context of rice seed detection. The CGG-pretrained model framework achieved this through deep joint alignment of visual features extracted from seedling tray images and textual features derived from contextual knowledge. This cross-modal grounding enabled collaborative learning, in which the visual processing stream (responsible for object localization and pixel-level segmentation) was continuously informed and refined by the semantic understanding stream (responsible for interpreting context and relationships). Specifically, the visual backbone network processed input imagery to generate feature maps, while the pretrained language model component used contextual embeddings to generate semantically rich textual representations. The CGG module acted as the fusion engine, establishing explicit correspondences between specific regions in the image (potential seeds or clusters) and relevant semantic concepts or descriptors provided by the pretrained model. This bidirectional interaction significantly enhanced the model's ability to disambiguate overlapping seeds, resolve occlusions, and accurately delineate individual seed boundaries under challenging conditions. Key technical innovations validated through rigorous ablation studies included: (1) the strategic use of the BLIP model for generating high-quality pseudo-labels from unlabeled or weakly labeled image data, facilitating more effective semi-supervised learning and reducing the annotation burden; and (2) the application of BERT-based word embeddings to capture deep semantic relationships and contextual nuances within textual descriptors related to seeds and seeding environments. Crucially, the ablation experiments demonstrated a pronounced synergistic effect when these two core improvements were combined, yielding a segmentation accuracy improvement exceeding 3 percentage points over a baseline model lacking these integrations.
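The abstract provides no code; the following minimal PyTorch sketch illustrates only the general fusion pattern described above, in which visual region features are grounded in BERT-derived text embeddings through cross-attention. The module name CrossModalGrounding, all dimensions, the prompt text, and the choice of cross-attention as the fusion mechanism are assumptions of this sketch, not details taken from the paper.

    # Minimal illustrative sketch (not the authors' implementation): visual region
    # features are grounded in BERT text embeddings via cross-attention.
    # Assumes PyTorch and Hugging Face Transformers are installed.
    import torch
    import torch.nn as nn
    from transformers import BertModel, BertTokenizer

    class CrossModalGrounding(nn.Module):
        """Fuse visual region features with text embeddings (hypothetical module)."""

        def __init__(self, visual_dim=256, text_dim=768, hidden=256):
            super().__init__()
            self.visual_proj = nn.Linear(visual_dim, hidden)   # project backbone features
            self.text_proj = nn.Linear(text_dim, hidden)       # project BERT features
            # Image tokens act as queries; text tokens supply keys and values.
            self.cross_attn = nn.MultiheadAttention(hidden, num_heads=8, batch_first=True)
            self.norm = nn.LayerNorm(hidden)

        def forward(self, visual_feats, text_feats):
            q = self.visual_proj(visual_feats)                 # (B, n_regions, hidden)
            kv = self.text_proj(text_feats)                    # (B, n_tokens, hidden)
            grounded, _ = self.cross_attn(q, kv, kv)           # ground regions in text
            return self.norm(q + grounded)                     # residual fusion

    # Toy usage: encode a seed-related prompt with BERT, fuse with dummy region features.
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    bert = BertModel.from_pretrained("bert-base-uncased")
    tokens = tokenizer("densely overlapping rice seeds in a tray cavity", return_tensors="pt")
    with torch.no_grad():
        text_feats = bert(**tokens).last_hidden_state          # (1, n_tokens, 768)
    visual_feats = torch.randn(1, 100, 256)                    # stand-in for backbone output
    fused = CrossModalGrounding()(visual_feats, text_feats)
    print(fused.shape)                                         # torch.Size([1, 100, 256])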
[Results and Discussions] Comprehensive experimental evaluation demonstrated the superior performance of the proposed CGG model against established benchmarks. Under the standard intersection over union (IoU) threshold of 0.5, the model achieved a mean average precision (mAP) of 90.7% for bounding box detection (denoted mAP50^bb) and an outstanding 91.4% mAP for instance segmentation (denoted mAP50^seg). These results represented a statistically significant improvement over leading contemporary models, including Mask R-CNN and Mask2Former, highlighting the efficacy of the cross-modal grounding approach in accurately localizing and segmenting individual rice seeds. Further validation in realistic seeding trial scenarios, involving direct comparison with meticulous manual annotations, confirmed the model's practical robustness. The CGG model attained the highest accuracy in two critical operational metrics: (1) precision in segmenting individual seed instances (single-seed segmentation accuracy), and (2) accuracy in determining the exact seed count per cavity, averaging 88% for per-cavity quantification. Moreover, the model exhibited superior performance in minimizing estimation errors for cavity seed counts, evidenced by its significantly lower error metrics: a root mean square error (RMSE) of 16.8 seeds, a mean absolute error (MAE) of 13.7 seeds, and a mean absolute percentage error (MAPE) of 2.46%. These error values were markedly lower than those recorded by the comparison models, underscoring the CGG model's enhanced reliability in practical counting tasks. The discussion attributed these performance gains to the model's ability to leverage semantic context to resolve ambiguities inherent in visual-only approaches, particularly in the dense, overlapping seed scenarios common in precision seeding trays.

[Conclusions] The developed CGG-pretrained model integration presents a significant advancement in automated monitoring for precision rice seeding. The model successfully addresses the core challenges of low detection accuracy and imprecise instance segmentation in complex environments. Its high accuracy in both individual seed segmentation and per-cavity seed count quantification, coupled with low error rates, demonstrates strong potential for practical deployment. Importantly, the model enables real-time detection of rice seeds during the image analysis stage. This capability provides a quantifiable, data-driven basis for immediate operational decisions, most notably the targeted precision reseeding of empty or under-seeded cavities identified during the seeding process. By ensuring optimal seed placement and density from the outset, the technology contributes directly to improved resource efficiency (reduced seed waste), enhanced seedling uniformity, and potentially higher crop yields. Consequently, this research offers valuable and robust technical support for the ongoing advancement of smart agriculture systems and the development of next-generation intelligent seeding machinery, paving the way for more automated, efficient, and sustainable rice production. Future work will focus on further optimizing inference speed for higher-throughput seeding lines and on exploring generalization to other crop types and seeding mechanisms.
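As a small worked complement to the error metrics quoted above, the following NumPy sketch shows how RMSE, MAE, and MAPE are computed for per-cavity seed counts. The counts in the usage example are invented for demonstration; only the metric definitions correspond to the abstract.

    # Worked illustration of the counting-error metrics reported above (RMSE, MAE,
    # MAPE). The per-cavity counts below are invented for demonstration; only the
    # metric definitions correspond to the abstract.
    import numpy as np

    def count_errors(predicted, actual):
        """Return RMSE, MAE, and MAPE for predicted vs. manually counted seeds."""
        err = predicted - actual
        return {
            "RMSE": float(np.sqrt(np.mean(err ** 2))),             # root mean square error
            "MAE": float(np.mean(np.abs(err))),                    # mean absolute error
            "MAPE": float(np.mean(np.abs(err) / actual) * 100.0),  # mean absolute % error
        }

    # Hypothetical counts for five cavities (not data from the paper).
    actual = np.array([520.0, 498.0, 540.0, 515.0, 505.0])
    predicted = np.array([531.0, 486.0, 552.0, 508.0, 497.0])
    print(count_errors(predicted, actual))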

Key words: rice seeding, smart agriculture, instance segmentation, pretrained model, visual-language model
