Chinese Tea Pest and Disease Named Entity Recognition Method Based on Improved Boundary Offset Prediction Network

doi:10.12133/j.smartag.SA202505007

Abstract

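Abstract (translated from Chinese):

[Objective/Significance] Chinese texts on tea pests and diseases contain many nested and long entities, which lowers the accuracy of named entity recognition (NER) in this domain. The Boundary Offset Prediction Network (BOPN) predicts the offset between a candidate entity span and its nearest entity span, which extends the prediction range of each text span and effectively avoids nested boundary conflicts between entities of different types. To address long-entity recognition in Chinese text, an NER method combining boundary prediction and label enhancement is proposed. [Methods] First, a boundary prediction module locates the start position of an entity and uses an attention mechanism to compute the probability that the subsequent tokens belong to that entity. Second, label enhancement is designed on top of the boundary recognition results: a biaffine classifier jointly models long entities and category labels, and the result is combined with the conditionally normalized output of the model's hidden layer to strengthen label classification. In addition, noting the symmetry of the convolution matrix in the model, the adaptive channel weighting of the squeeze-and-excitation attention mechanism is combined with low-rank decomposition to build a low-rank linear layer that replaces the linear layers of the original model, improving performance while reducing the computation of the linear layers. [Results and Discussion] To test the entity recognition performance of the boundary-enhanced offset network, comparative experiments were run against a range of NER methods, including BiLSTM (Bidirectional Long Short-Term Memory), SoftLexicon, and Boundary Smooth, covering both sequence-based and span-based approaches as well as the baseline model, on the self-built dataset and four public Chinese datasets: ResumeNER, WeiboNER, CLUENER (Chinese Language Understanding Evaluation NER), and Taobao. The method achieved good F₁ values on each of the five datasets. [Conclusions] Compared with existing methods, the proposed method recognizes entities in Chinese tea pest and disease texts more effectively, outperforms the other models, and shows good generalization.

Keywords: named entity recognition, Chinese tea pest and disease text, self-built dataset, boundary enhancement, boundary offset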

Abstract:

[Objective] Named entity recognition (NER) is vital for many natural language processing (NLP) applications, including information retrieval and knowledge graph construction. While Chinese NER has advanced with datasets such as ResumeNER, WeiboNER, and CLUENER (Chinese Language Understanding Evaluation NER), most of these datasets cover general domains such as news or social media, and annotated data remain scarce in specialized fields, particularly agriculture. For tea pests and diseases, this shortage hampers progress in intelligent agricultural information extraction. These domain-specific texts pose particular challenges for NER because they contain frequent nested and long-span entities, which traditional sequence labeling models struggle to handle. Boundary ambiguity further complicates entity recognition, leading to poor segmentation and labeling performance. Addressing these challenges requires targeted datasets and improved NER techniques tailored to the agricultural domain. [Methods] The proposed model comprised two core modules designed to improve the Boundary Offset Prediction Network (BOPN) on NER tasks, particularly in domains with complex and fine-grained entity structures, such as tea pest and disease recognition. The boundary prediction module identified entity spans within input text sequences. It employed an attention-based mechanism to dynamically estimate the probability that consecutive tokens belong to the same entity, thereby addressing boundary ambiguity. This mechanism enabled more accurate detection of entity boundaries, which was particularly critical for nested or overlapping entities. The label enhancement module further refined entity recognition with a biaffine classifier that jointly modeled entity spans and their corresponding category labels. 
This joint modeling approach captured the interactions between span representations and semantic label information, improving the identification of long or syntactically complex entities. The output of this module was integrated with conditionally normalized hidden representations, enhancing the model's capacity to assign context-aware and semantically precise labels. To reduce computational complexity while preserving model effectiveness, the architecture incorporated low-rank linear layers. These were constructed by combining the adaptive channel weighting mechanism of Squeeze-and-Excitation Networks with low-rank decomposition. The modified layers replaced the original linear transformations, improving both efficiency and representational capacity. In addition to model development, a domain-specific NER corpus was constructed through the systematic collection and annotation of entity information related to tea pests and diseases from scientific literature, agricultural technical reports, and online texts. The annotated entities in the corpus were categorized into ten classes, including tea plant diseases, tea pests, disease symptoms, and pest symptoms. Based on this labeled corpus, a Chinese NER dataset focused on tea pests and diseases was developed, referred to as the Chinese tea pest and disease dataset. [Results and Discussions] Extensive experiments were conducted on the constructed dataset, comparing the proposed method with several mainstream NER approaches, including traditional sequence labeling models (e.g., BiLSTM-CRF), lexicon-enhanced models (e.g., SoftLexicon), and boundary smoothing strategies (e.g., Boundary Smooth). These comparisons assessed the effectiveness of the proposed architecture in handling domain-specific and structurally complex entity types. 
Additionally, to evaluate the model's generalization capability beyond the tea pest and disease domain, the study ran evaluations on four publicly available Chinese NER benchmark datasets: ResumeNER, WeiboNER, CLUENER, and Taobao. The proposed model improved the F₁ score on every dataset: by 0.68 percentage points on the self-built dataset, 0.29 on ResumeNER, 0.96 on WeiboNER, 0.7 on CLUENER, and 0.5 on Taobao, with particularly notable improvements in the recognition of complex, nested, and long-span entities. These outcomes demonstrate the model's capacity for capturing intricate entity boundaries and semantics, and confirm its robustness and adaptability when compared to current state-of-the-art methods. [Conclusions] The study presents a high-performance NER approach tailored to the characteristics of Chinese texts on tea pests and diseases. By simultaneously optimizing entity boundary detection and label classification, the proposed method enhanced recognition accuracy in specialized domains. Experimental results demonstrated strong adaptability and robustness of the model across both newly constructed and publicly available datasets, indicating its broad applicability and promising prospects.

Key words: named entity recognition, Chinese tea pest and disease text, self-built dataset, boundary enhancement, boundary offset
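The Methods paragraph describes low-rank linear layers that combine squeeze-and-excitation (SE) channel weighting with low-rank decomposition, but it does not give their exact form. The sketch below is one possible reading rather than the authors' implementation: a weight matrix is factored into two thin matrices `u` and `v`, and an SE-style gate reweights the `rank` intermediate channels. The function names, the placement of the gate, and all sizes are assumptions made for illustration.

```python
import math


def matvec(vec, mat):
    """Multiply a row vector of length n by an n x m matrix (list of rows)."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]


def se_gate(features, w_down, w_up):
    """SE-style gate: average over positions, bottleneck MLP, sigmoid per channel."""
    n = len(features)
    squeezed = [sum(col) / n for col in zip(*features)]        # squeeze step
    hidden = [max(0.0, h) for h in matvec(squeezed, w_down)]   # ReLU bottleneck
    return [1.0 / (1.0 + math.exp(-z)) for z in matvec(hidden, w_up)]


def low_rank_linear(features, u, v, w_down, w_up):
    """Assumed layout: project to rank r with u, gate the r channels, expand with v."""
    low = [matvec(x, u) for x in features]      # shape (n, r)
    gate = se_gate(low, w_down, w_up)           # shape (r,)
    return [matvec([g * c for g, c in zip(gate, row)], v) for row in low]


def param_counts(d_in, d_out, rank):
    """Weights in a dense layer versus its rank-r factorization (biases ignored)."""
    return d_in * d_out, rank * (d_in + d_out)
```

With hypothetical sizes `d_in = d_out = 768` and `rank = 64`, `param_counts` returns 589824 dense weights against 98304 factored weights, which illustrates where the saving comes from when the rank is small.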


Citation: XIE Yuxin, WEI Jiangshu, ZHANG Yao, LI Fang. Chinese Tea Pest and Disease Named Entity Recognition Method Based on Improved Boundary Offset Prediction Network[J]. Smart Agriculture, 2025, 7(5): 88-100.


References

[1]	NIE X L, ZHANG L L, NIU D D, et al. Multi-feature fusion named entity recognition method for grape knowledge graph construction[J]. Transactions of the Chinese society of agricultural engineering, 2024, 40(3): 201-210.
[2]	WANG T, WANG C S, LI J X, et al. Agricultural disease named entity recognition with pointer network based on RoFormer pre-trained model[J]. Smart agriculture, 2024, 6(2): 85-94.
[3]	QI Z J, NIU D D, WU H R, et al. Chinese kiwifruit text named entity recognition method based on dual-dimensional information and pruning[J]. Smart agriculture, 2025, 7(1): 44-56.
[4]	HE Z K, YANG Y, YANG G F, et al. Research on named entity recognition of agricultural products information text and its application prospect based on BERT-BiLSTM-CRF[J]. Agricultural outlook, 2022, 18(5): 105-111.
[5]	CHEN Y, ZHANG X Q, CHEN A X, et al. Methods of food safety question answering system based on LSTM[J]. Transactions of the Chinese society for agricultural machinery, 2020, 51(S2): 442-448.
[6]	WEI T T, GE X Y, XIONG J T. Hierarchical multi-label classification of agricultural pest and disease interrogative questions[J]. Transactions of the Chinese society for agricultural machinery, 2024, 55(1): 263-269, 435.
[7]	ZHU Z L, RAO Y, WU Y, et al. Research progress of attention mechanism in deep learning[J]. Journal of Chinese information processing, 2019, 33(6): 1-11.
[8]	LI J P, ZHANG C, CHEN X J, et al. Survey on automatic text summarization[J]. Journal of computer research and development, 2021, 58(1): 1-21.
[9]	LAMPLE G, BALLESTEROS M, SUBRAMANIAN S, et al. Neural architectures for named entity recognition[EB/OL]. arXiv: 1603.01360, 2016.
[10]	TJONG KIM SANG E F, BUCHHOLZ S. Introduction to the CoNLL-2000 shared task: Chunking[C]// Proceedings of the 2nd Workshop on Learning Language in Logic and the 4th Conference on Computational Natural Language Learning. Morristown, NJ, USA: ACL, 2000: 127.
[11]	XUE N W. Chinese word segmentation as character tagging[J]. International Journal of Computational Linguistics & Chinese Language Processing (IJCLCLP), 2003, 8(1): 29-48.
[12]	SOHRAB M G, MIWA M. Deep exhaustive model for nested named entity recognition[C]// Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Stroudsburg, Pennsylvania, USA: ACL, 2018: 2843-2849.
[13]	LEE K, HE L, LEWIS M, et al. End-to-end neural coreference resolution[C]// Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP). Stroudsburg, Pennsylvania, USA: ACL, 2017: 188-197.
[14]	EBERTS M, ULGES A. Span-based joint entity and relation extraction with transformer pre-training[M]// ECAI 2020. Santiago de Compostela, Spain: IOS Press, 2020.
[15]	JOSHI M, CHEN D Q, LIU Y H, et al. SpanBERT: Improving pre-training by representing and predicting spans[J]. Transactions of the association for computational linguistics, 2020, 8: 64-77.
[16]	TAN C Q, QIU W, CHEN M S, et al. Boundary enhanced neural span classification for nested named entity recognition[J]. Proceedings of the AAAI conference on artificial intelligence, 2020, 34(5): 9016-9023.
[17]	YU J T, BOHNET B, POESIO M. Named entity recognition as dependency parsing[C]// Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, Pennsylvania, USA: ACL, 2020: 6470-6476.
[18]	SHEN Y L, MA X Y, TAN Z Q, et al. Locate and label: A two-stage identifier for nested named entity recognition[C]// Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing. Stroudsburg, Pennsylvania, USA: ACL, 2021: 2782-2794.
[19]	YUAN Z, TAN C Q, HUANG S F, et al. Fusing heterogeneous factors with triaffine mechanism for nested named entity recognition[C]// Findings of the Association for Computational Linguistics: ACL 2022. Stroudsburg, Pennsylvania, USA: ACL, 2022: 3174-3186.
[20]	LI J Y, FEI H, LIU J, et al. Unified named entity recognition as word-word relation classification[C]// Proceedings of the AAAI Conference on Artificial Intelligence. Washington, D.C., USA: AAAI, 2022: 10965-10973.
[21]	CAI Y X, LIU Q, GAN Y L, et al. DiFiNet: Boundary-aware semantic differentiation and filtration network for nested named entity recognition[C]// Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Stroudsburg, Pennsylvania, USA: ACL, 2024: 6455-6471.
[22]	TANG M H, HE Y Q, XU Y X, et al. A boundary offset prediction network for named entity recognition[C]// Findings of the Association for Computational Linguistics: EMNLP 2023. Singapore. Stroudsburg, Pennsylvania, USA: ACL, 2023: 14834-14846.
[23]	LI F, WANG Z, HUI S C, et al. Modularized interaction network for named entity recognition[C]// Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Stroudsburg, Pennsylvania, USA: ACL, 2021: 200-209.
[24]	LI J, SUN A X, MA Y K. Neural named entity boundary detection[J]. IEEE transactions on knowledge and data engineering, 2021, 33(4): 1790-1795.
[25]	ULYANOV D, VEDALDI A, LEMPITSKY V. Instance normalization: The missing ingredient for fast stylization[EB/OL]. arXiv: 1607.08022, 2017.
[26]	DOZAT T, MANNING C D. Deep biaffine attention for neural dependency parsing[EB/OL]. arXiv: 1611.01734, 2016.
[27]	ZHANG Y, YANG J. Chinese NER using lattice LSTM[C]// Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, Pennsylvania, USA: ACL, 2018: 1554-1564.
[28]	PENG N Y, DREDZE M. Named entity recognition for Chinese social media with jointly trained embeddings[C]// Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Stroudsburg, Pennsylvania, USA: ACL, 2015: 548-554.
[29]	XU L, DONG Q, LIAO Y, et al. CLUENER2020: Fine-grained named entity recognition dataset and benchmark for Chinese[EB/OL]. arXiv: 2001.04351, 2020.
[30]	JIE Z M, XIE P J, LU W, et al. Better modeling of incomplete annotations for named entity recognition[C]// Proceedings of the 2019 Conference of the North. Stroudsburg, Pennsylvania, USA: ACL, 2019: 729-734.
[31]	MIWA M, BANSAL M. End-to-end relation extraction using LSTMs on sequences and tree structures[C]// Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Stroudsburg, Pennsylvania, USA: ACL, 2016: 1105-1116.
[32]	LI X N, YAN H, QIU X P, et al. FLAT: Chinese NER using flat-lattice transformer[C]// Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, Pennsylvania, USA: ACL, 2020: 6836-6842.
[33]	MA R T, PENG M L, ZHANG Q, et al. Simplify the usage of lexicon in Chinese NER[C]// Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, Pennsylvania, USA: ACL, 2020: 5951-5960.
[34]	WU S, SONG X N, FENG Z H. MECT: Multi-metadata embedding based cross-transformer for Chinese named entity recognition[C]// Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Stroudsburg, Pennsylvania, USA: ACL, 2021: 1529-1539.
[35]	ZHU E W, LI J P. Boundary smoothing for named entity recognition[C]// Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Stroudsburg, Pennsylvania, USA: ACL, 2022: 7096-7108.

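Label	Category definition	Example	Number of entities
DISEASE	Disease	茶炭疽病 (tea anthracnose)	196
PEST	Pest	茶尺蠖 (tea geometrid)	622
PART	Affected part of the tea plant (Part)	成叶 (mature leaf), 嫩叶 (tender leaf)	836
LOC	Location	四川 (Sichuan), 安徽 (Anhui), 江苏 (Jiangsu)	1 066
COL	Color	黑褐色 (dark brown), 淡黄色 (pale yellow)	1 641
SHA	Shape	椭圆形 (elliptical), 灰白色尘末状 (grayish-white and dust-like)	1 272
OPE	Prevention and control operation (Operator)	选用抗病品种 (selecting disease-resistant varieties), 加强茶园管理 (strengthening tea garden management)	956
MED	Agent used to treat the damage (Medicine)	马拉硫磷乳油 (malathion emulsifiable concentrate), 晶体石硫合剂 (crystalline lime sulfur)	942
FEA	Pest or pathogen feature (Feature)	浅黄色蜡粉 (pale yellow wax powder), 暗褐色波状横纹 (dark brown wavy transverse lines)	872
SYM	Damage symptom (Symptom)	水渍状暗绿色病斑 (water-soaked dark green lesions), 树势衰弱 (weakened tree vigor)	617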

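Dataset	Split	Sentences	Entities	Categories
ResumeNER [27]	Training	3 821	13 343	8
ResumeNER [27]	Test	477	1 630	8
ResumeNER [27]	Validation	463	1 488	8
WeiboNER [28]	Training	1 350	1 885	4
WeiboNER [28]	Test	270	414	4
WeiboNER [28]	Validation	270	389	4
CLUENER (Chinese Language Understanding Evaluation NER) [29]	Training	10 748	23 971	10
CLUENER (Chinese Language Understanding Evaluation NER) [29]	Test	1 343	3 072	10
Taobao [30]	Training	6 000	29 397	4
Taobao [30]	Test	1 000	4 886	4
Taobao [30]	Validation	998	4 941	4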

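Model	Characteristics
BiLSTM-CRF (Bidirectional Long Short-Term Memory-Conditional Random Field) [31]	Based on a bidirectional LSTM and a conditional random field; performs efficient flat entity recognition through sequence labeling
Lattice [27]	Uses hybrid character-word encoding to keep Chinese word segmentation errors from affecting NER
Flat [32]	A flat-structure method that substantially improves the efficiency of Chinese NER by optimizing the Lattice architecture
SoftLexicon [33]	A character-level lexicon fusion method that simplifies the model structure while maintaining high accuracy
MECT (Multi-Metadata Embedding based Cross-Transformer) [34]	A method based on Chinese character structure and radical features that strengthens Chinese NER through multi-source semantic fusion
BOPN	Classifies spans by predicting the boundary offset between candidate spans and entity spans, effectively capturing entity boundary information
W2NER	A word-pair relation modeling method that builds a two-dimensional word-pair grid to capture relations between adjacent words, addressing the boundary recognition problem of traditional NER models
Boundary Smooth [35]	Uses probability reallocation, smoothing entity boundary probabilities to improve generalization
DiFiNet [21]	A boundary-aware nested NER model that uses biaffine span representations and an adaptive semantic differentiation module to address the weak boundary detection and insensitivity to small changes of existing methods, with a boundary filtration module to reduce noise interference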
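The BOPN row describes classification by the boundary offset between a candidate span and an entity span. The sketch below illustrates that idea with a deliberately simplified labeling scheme, which is an assumption and not necessarily the scheme used by BOPN or by this paper: every candidate span is labeled with the start and end shifts that would move it onto its nearest gold entity, provided both shifts stay within a hypothetical `max_offset`. The example sentence and its annotations are invented for illustration, using the PEST and PART labels from the category table.

```python
def nearest_entity_offsets(num_tokens, gold_spans, max_offset=1):
    """Label each candidate span (start, end) with offsets to its nearest gold entity.

    gold_spans holds (start, end, label) triples with inclusive token indices.
    Returns a dict mapping (start, end) to (start_offset, end_offset, label),
    or to None when no gold entity lies within max_offset on both boundaries.
    """
    table = {}
    for start in range(num_tokens):
        for end in range(start, num_tokens):
            best = None
            for g_start, g_end, label in gold_spans:
                d_start, d_end = g_start - start, g_end - end
                if max(abs(d_start), abs(d_end)) > max_offset:
                    continue  # too far away: treated as out of range
                cost = abs(d_start) + abs(d_end)
                if best is None or cost < best[0]:
                    best = (cost, d_start, d_end, label)
            table[(start, end)] = None if best is None else best[1:]
    return table


def decode(span, offsets):
    """Shift a candidate span by its predicted offsets to recover the entity span."""
    start, end = span
    d_start, d_end, label = offsets
    return start + d_start, end + d_end, label


# Hypothetical sentence "茶尺蠖危害嫩叶", one token per character (indices 0..6).
gold = [(0, 2, "PEST"), (5, 6, "PART")]
offsets = nearest_entity_offsets(7, gold)
```

Under this scheme the exact span `(0, 2)` carries offsets `(0, 0)`, while the neighboring span `(1, 2)` carries `(-1, 0)` and still points back to the same entity, so spans around an entity contribute supervision instead of being plain negatives.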

Model	F₁/%	P/%	R/%
BiLSTM	73.43	70.83	76.23
Lattice	79.03	79.33	78.73
Flat	80.40	78.65	82.22
SoftLexicon	77.80	78.01	77.51
MECT	80.67	80.48	80.87
Boundary Smooth	82.39	80.39	84.61
W2NER	81.85	79.89	83.90
DiFiNet	82.31	80.99	83.68
BOPN	82.08	79.18	85.20
Be-BOPN	82.76	80.36	85.31

Model	F₁/%	P/%	R/%
BiLSTM	91.87	92.32	91.42
Lattice	94.46	94.81	94.11
Flat	95.86	—	—
SoftLexicon	96.11	96.08	96.13
MECT	95.98	—	—
Boundary Smooth	95.59	95.41	95.77
W2NER	96.21	95.97	96.44
DiFiNet	96.41	96.50	96.32
BOPN	96.35	95.73	96.97
Be-BOPN	96.64	96.49	96.79
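The three metric columns in the table above are linked if F₁ follows the standard definition, the harmonic mean of precision P and recall R. The short check below assumes that definition (the page does not state it) and recomputes F₁ for three rows of the table; the function and variable names are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall, both given in percent."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R, reported F1) copied from the BiLSTM, BOPN, and Be-BOPN rows above.
rows = {
    "BiLSTM": (92.32, 91.42, 91.87),
    "BOPN": (95.73, 96.97, 96.35),
    "Be-BOPN": (96.49, 96.79, 96.64),
}

# Collect any row whose recomputed F1 differs from the reported value after rounding.
mismatches = [
    name for name, (p, r, reported) in rows.items()
    if round(f1_score(p, r), 2) != reported
]

# Absolute gain of Be-BOPN over BOPN, in percentage points.
gain = round(rows["Be-BOPN"][2] - rows["BOPN"][2], 2)
```

For these three rows the recomputed values agree with the reported F₁ to two decimals, and the gap between Be-BOPN and BOPN is 0.29 percentage points, which matches the ResumeNER gain quoted in the English abstract.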