Chinese Kiwifruit Text Named Entity Recognition Method Based on Dual-Dimensional Information and Pruning

doi:10.12133/j.smartag.SA202410022

Abstract

[Objective] Chinese kiwifruit texts exhibit unique dual-dimensional characteristics. Cross-paragraph dependencies give the texts a complex semantic structure, which makes it challenging to capture the full contextual relationships of entities within a single paragraph and calls for models capable of robust cross-paragraph semantic extraction to comprehend entity linkages at a global level. However, most existing models rely heavily on local contextual information and struggle to process long-distance dependencies, which reduces recognition accuracy. Furthermore, Chinese kiwifruit texts often contain highly nested entities. This nesting and combination increase the complexity of grammatical and semantic relationships, making entity recognition more difficult. To address these challenges, a novel named entity recognition (NER) method, KIWI-Coord-Prune (kiwifruit-CoordKIWINER-PruneBi-LSTM), was proposed in this research, which incorporated dual-dimensional information processing and pruning techniques to improve recognition accuracy. [Methods] The proposed KIWI-Coord-Prune model consisted of a character embedding layer, a CoordKIWINER layer, a PruneBi-LSTM layer, a self-attention mechanism, and a CRF decoding layer, enabling effective entity recognition after processing input character vectors. The CoordKIWINER and PruneBi-LSTM modules were specifically designed to handle the dual-dimensional features in Chinese kiwifruit texts. The CoordKIWINER module applied adaptive average pooling in two directions on the input feature maps and utilized convolution operations to separate the extracted features into vertical and horizontal branches. The horizontal and vertical features were then independently extracted using the Criss-Cross Attention (CCNet) mechanism and Coordinate Attention (CoordAtt) mechanism, respectively. 
This module significantly enhanced the model's ability to capture cross-paragraph relationships and nested entity structures, thereby generating enriched character vectors containing more contextual information, which improved the overall representation capability and robustness of the model. The PruneBi-LSTM module was built upon the enhanced dual-dimensional vector representations and introduced a pruning strategy into Bi-LSTM to effectively reduce redundant parameters associated with background descriptions and irrelevant terms. This pruning mechanism enhanced computational efficiency and inference speed while maintaining the dynamic sequence modeling capability of Bi-LSTM. Additionally, a dynamic feature extraction strategy was employed to reduce the computational complexity of vector sequences and further strengthen the learning capacity for key features, leading to improved recognition of complex entities in kiwifruit texts. Furthermore, the pruned weight matrices became sparser, significantly reducing memory consumption. This made the model more efficient in handling large-scale agricultural text-processing tasks, minimizing redundant information while achieving higher inference and training efficiency with fewer computational resources. [Results and Discussions] Experiments were conducted on the self-built KIWIPRO dataset and four public datasets: People's Daily, ClueNER, Boson, and ResumeNER. The proposed model was compared with five advanced NER models: LSTM, Bi-LSTM, LR-CNN, Softlexicon-LSTM, and KIWINER. The experimental results showed that KIWI-Coord-Prune achieved F₁-scores of 89.55%, 91.02%, 83.50%, 83.49%, and 95.81% on the five datasets, respectively, outperforming all baseline models. Furthermore, controlled variable experiments were conducted to compare and ablate the CoordKIWINER and PruneBi-LSTM modules across the five datasets, confirming their effectiveness and necessity. 
Additionally, the impact of different design choices was explored for the CoordKIWINER module, including direct fusion, optimized attention mechanism fusion, and network structure adjustment with residual optimization. The experimental results demonstrated that the optimized attention mechanism fusion method yielded the best performance, and it was adopted in the final model. These findings highlight the significance of properly designing attention mechanisms to extract dual-dimensional features for NER tasks. Compared to existing methods, the KIWI-Coord-Prune model effectively addressed the issue of underutilized dual-dimensional information in Chinese kiwifruit texts. It significantly improved entity recognition performance for both overall text structures and individual entity categories. Furthermore, the model exhibited a degree of generalization capability, making it applicable to downstream tasks such as knowledge graph construction and question-answering systems. [Conclusions] This study presents a novel NER approach for Chinese kiwifruit texts, which integrates dual-dimensional information extraction and pruning techniques to overcome challenges related to cross-paragraph dependencies and nested entity structures. The findings offer valuable insights for researchers working on domain-specific NER and contribute to the advancement of agriculture-focused natural language processing applications. However, two key limitations remain: 1) the balance between domain-specific optimization and cross-domain generalization requires further investigation, as the model's adaptability to non-agricultural texts has yet to be empirically validated; 2) the multilingual applicability of the model is currently limited, necessitating further expansion to accommodate multilingual scenarios. 
Future research should focus on two key directions: 1) enhancing domain robustness and cross-lingual adaptability by incorporating diverse textual datasets and leveraging pre-trained multilingual models to improve generalization; and 2) validating the model's performance in multilingual environments through transfer learning while refining linguistic adaptation strategies to further optimize recognition accuracy.
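
The Methods part of the abstract states that the CoordKIWINER module applies average pooling in two directions on the input feature maps before the convolution and attention steps. The following is a minimal sketch of that pooling step only, written in plain Python on a toy 2-D grid. The function name `directional_average_pool` and the toy values are hypothetical, and the sketch does not reproduce the convolution, CCNet, or CoordAtt stages of the actual model.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def directional_average_pool(feature_map: Matrix) -> Tuple[List[float], List[float]]:
    """Average a 2-D feature map along each axis separately.

    Returns (row_means, col_means):
      row_means[i] summarises row i (pooling across the width),
      col_means[j] summarises column j (pooling across the height).
    """
    if not feature_map or not feature_map[0]:
        raise ValueError("feature_map must be a non-empty 2-D list")
    height, width = len(feature_map), len(feature_map[0])
    if any(len(row) != width for row in feature_map):
        raise ValueError("all rows must have the same length")

    row_means = [sum(row) / width for row in feature_map]
    col_means = [
        sum(feature_map[i][j] for i in range(height)) / height
        for j in range(width)
    ]
    return row_means, col_means


# Toy feature map (illustrative values, not data from the paper).
toy = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
]
rows, cols = directional_average_pool(toy)
print(rows)  # [2.0, 5.0]
print(cols)  # [2.5, 3.5, 4.5]
```

Each output vector keeps positional information along one axis while aggregating over the other, which is what allows two separate branches to be processed afterwards.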

Key words: Chinese named entity recognition, kiwifruit texts, custom-built dataset, multi-dimensional attention mechanism, pruning, deep learning, text feature enhancement

QI Zijun, NIU Dangdang, WU Huarui, ZHANG Lilin, WANG Lunfeng, ZHANG Hongming. Chinese Kiwifruit Text Named Entity Recognition Method Based on Dual-Dimensional Information and Pruning[J]. Smart Agriculture, 2025, 7(1): 44-56.

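Category label (meaning)	Category	Definition	Example	Number of entities
KIWI (Variety)	Variety	Names of different kiwifruit varieties	中华猕猴桃 (Chinese kiwifruit)	3 821
DIS (Disease)	Disease	Diseases that commonly affect kiwifruit	软腐病 (Soft rot)	1 402
PEST (Insect Pest)	Insect pest	Insect pests that commonly affect kiwifruit	蝙蝠蛾 (Hawk moth)	1 320
PART (Part)	Part	Parts of the kiwifruit plant susceptible to disease	果实 (Fruit)	5 489
MED (Pesticide)	Pesticide	Agents used to treat kiwifruit diseases	多菌灵 (Carbendazim)	1 394
LOC (Location)	Location	Producing areas of different kiwifruit varieties	陕西 (Shaanxi)	4 094
COL (Color)	Color	Flesh color of kiwifruit	红色 (Red), 绿色 (Green)	1 268
TAS (Taste)	Taste	Taste of kiwifruit flesh	酸 (Sour), 甜 (Sweet)	892
SHA (Shape)	Shape	Shape of the kiwifruit fruit	椭圆形 (Elliptical)	168
NUT (Nutritional)	Nutrient	Nutrients contained in the kiwifruit fruit	维生素C (Vitamin C)	325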

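Parameter	Value	Parameter	Value
Character embedding dimension	50	Lexicon embedding dimension	50
Batch size	16	Number of epochs	70
Learning rate	0.008	Dropout rate	0.5
Learning rate decay	0.05	Prune LSTM hidden layer size	200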
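
The abstract describes the pruning step as making the Bi-LSTM weight matrices sparser, but it does not state the pruning criterion. As an illustration only, the sketch below assumes magnitude-based pruning, in which the entries with the smallest absolute values are set to zero. The names `magnitude_prune` and `sparsity_of`, the 50% ratio, and the toy weights are all hypothetical and are not taken from the paper.

```python
from typing import List

Matrix = List[List[float]]


def magnitude_prune(weights: Matrix, sparsity: float) -> Matrix:
    """Return a copy of `weights` with the smallest-magnitude entries zeroed.

    `sparsity` is the fraction of entries to zero out, in [0, 1].
    Magnitude-based selection is an assumption made for this sketch.
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0, 1]")

    # Rank every entry by absolute value, remembering its position.
    ranked = sorted(
        (abs(w), i, j)
        for i, row in enumerate(weights)
        for j, w in enumerate(row)
    )
    n_prune = int(len(ranked) * sparsity)

    pruned = [row[:] for row in weights]  # leave the input untouched
    for _, i, j in ranked[:n_prune]:
        pruned[i][j] = 0.0
    return pruned


def sparsity_of(weights: Matrix) -> float:
    """Fraction of entries that are exactly zero."""
    total = sum(len(row) for row in weights)
    if total == 0:
        return 0.0
    zeros = sum(1 for row in weights for w in row if w == 0.0)
    return zeros / total


# Toy weight matrix (illustrative values only).
weights = [
    [0.9, -0.05, 0.3],
    [0.01, -0.7, 0.2],
]
pruned = magnitude_prune(weights, sparsity=0.5)
print(pruned)               # [[0.9, 0.0, 0.3], [0.0, -0.7, 0.0]]
print(sparsity_of(pruned))  # 0.5
```

Zeroed entries need no storage in a sparse format, which is the mechanism behind the memory reduction the abstract attributes to pruning.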

Experimental environment	Configuration
Operating system	Windows 11 (x64)
CPU	Intel Core i9-13900H
GPU	NVIDIA GeForce RTX 4060 (8 GB)
Memory	64 GB
Hard disk	2 TB
Python version	3.7.16
PyTorch version	1.8.1

Model	P/%	R/%	F₁/%
LSTM	74.85	79.86	77.27
Bi-LSTM	81.43	89.59	85.31
LR-CNN	86.36	90.86	88.55
Softlexicon-LSTM	85.84	90.26	87.99
KIWINER	87.09	90.47	88.75
KIWI-Coord-Prune	87.27	91.95	89.55

Model	P/%	R/%	F₁/%
LSTM	80.53	75.36	77.86
Bi-LSTM	85.75	80.67	83.13
LR-CNN	90.25	89.42	89.83
Softlexicon-LSTM	89.78	87.50	88.63
KIWINER	91.13	90.74	90.93
KIWI-Coord-Prune	91.96	90.09	91.02
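
The F₁ column in the two result tables above can be cross-checked against the precision (P) and recall (R) columns using the standard definition F₁ = 2PR / (P + R). The short sketch below recomputes it for every row. The helper name `f1_score` and the variable names are hypothetical, and the numbers are copied from the tables.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (model, P/%, R/%, reported F1/%) copied from the two result tables.
first_table = [
    ("LSTM", 74.85, 79.86, 77.27),
    ("Bi-LSTM", 81.43, 89.59, 85.31),
    ("LR-CNN", 86.36, 90.86, 88.55),
    ("Softlexicon-LSTM", 85.84, 90.26, 87.99),
    ("KIWINER", 87.09, 90.47, 88.75),
    ("KIWI-Coord-Prune", 87.27, 91.95, 89.55),
]
second_table = [
    ("LSTM", 80.53, 75.36, 77.86),
    ("Bi-LSTM", 85.75, 80.67, 83.13),
    ("LR-CNN", 90.25, 89.42, 89.83),
    ("Softlexicon-LSTM", 89.78, 87.50, 88.63),
    ("KIWINER", 91.13, 90.74, 90.93),
    ("KIWI-Coord-Prune", 91.96, 90.09, 91.02),
]

for table in (first_table, second_table):
    for model, p, r, reported in table:
        recomputed = f1_score(p, r)
        print(f"{model:18s} reported={reported:.2f} recomputed={recomputed:.2f}")
```

For these rows the recomputed values agree with the reported F₁ to within 0.01 percentage points. Small gaps are expected because P and R are themselves rounded to two decimals.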