Smart Agriculture

基于改进边界偏移预测网络的中文茶叶病虫害命名实体识别方法

谢宇鑫, 危疆树, 张尧, 李芳

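  1. College of Information Engineering, Sichuan Agricultural University, Ya'an 625000, Sichuan, China
  • Received: 2025-05-07 Published: 2025-08-12
  • Funding:
    Ministry of Education University-Industry Collaborative Education Program (22097077265201); Ya'an Digital Agriculture Engineering Center Construction Project
  • About the author:

    XIE Yuxin, master's student; research interests: natural language processing and named entity recognition. E-mail:

  • Corresponding author:
    WEI Jiangshu, associate professor; research interest: agricultural information engineering. E-mail: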

Chinese Tea Pest and Disease Named Entity Recognition Method Based on Improved Boundary Offset Prediction Network

XIE Yuxin, WEI Jiangshu, ZHANG Yao, LI Fang

  1. College of Information Engineering, Sichuan Agricultural University, Ya'an 625000, China
  • Received: 2025-05-07 Online: 2025-08-12
  • Foundation items: Ministry of Education University-Industry Collaborative Education Program (22097077265201); Ya'an Digital Agriculture Engineering Center Construction Project
  • About the author:

    XIE Yuxin, E-mail:

  • Corresponding author:
    WEI Jiangshu, E-mail:

Abstract:

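[Objective/Significance] Chinese texts on tea diseases and pests contain many nested and long entities, which lowers the accuracy of named entity recognition (NER) in this domain. Boundary Offset Prediction Networks (BOPN) predict the offset between a candidate span and its nearest entity span, which extends the prediction range of every text span and effectively avoids boundary conflicts between nested entities of different types. To address long-entity recognition in Chinese text, an NER method that combines boundary prediction with label enhancement is proposed.

[Methods] First, a boundary prediction module locates the start position of an entity and uses an attention mechanism to compute the probability that the following tokens belong to that entity. Second, label enhancement is designed on top of the boundary recognition results: a biaffine classifier jointly models long entities and category labels, and the result is combined with the conditionally normalized output of the model's hidden layer to strengthen label classification. In addition, noting the symmetry of the convolution matrix in the model, a low-rank linear layer is built by combining the adaptive channel weighting of the squeeze-and-excitation attention mechanism with low-rank decomposition. It replaces the linear layers of the original model, improving performance while reducing their computational cost.

[Results and Discussion] The self-built Chinese tea disease and pest NER dataset contains 122 281 annotated characters and 2 967 annotated sentences, with 9 020 entities in total. To test the entity recognition performance of the boundary-enhanced offset network, comparative experiments were run against several NER methods, both sequence-based and span-based, including BiLSTM (Bidirectional Long Short-Term Memory), SoftLexicon, and Boundary Smooth, as well as against the baseline model. The experiments used the self-built dataset and four public Chinese datasets: ResumeNER, WeiboNER, CLUENER (Chinese Language Understanding Evaluation NER), and Taobao. The method achieved good F1 scores on all five datasets.

[Conclusions] Compared with existing methods, the proposed method recognizes entities in Chinese tea disease and pest texts more effectively, outperforms the other models, and shows good generalization.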
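The offset idea behind the boundary offset network can be illustrated with a toy computation. This is a minimal sketch, not the paper's implementation: the function name `nearest_offset`, the inclusive character-index span convention, the use of total boundary shift to pick the nearest entity, and the example sentence with its gold spans are all assumptions made for illustration.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end), inclusive character indices (assumed convention)


def nearest_offset(candidate: Span, gold_spans: List[Span]) -> Tuple[int, int]:
    """Return (start_offset, end_offset) from a candidate span to its nearest gold span."""
    if not gold_spans:
        raise ValueError("at least one gold span is required")
    c_start, c_end = candidate
    # "Nearest" here means the smallest total boundary shift (an assumption);
    # ties go to the gold span that appears first in the list.
    g_start, g_end = min(
        gold_spans,
        key=lambda g: abs(g[0] - c_start) + abs(g[1] - c_end),
    )
    return g_start - c_start, g_end - c_end


# Hypothetical sentence: "茶饼病危害嫩叶" (tea blister blight damages young leaves).
# Assumed gold entities: "茶饼病" at (0, 2) and "嫩叶" at (5, 6).
gold = [(0, 2), (5, 6)]

print(nearest_offset((0, 2), gold))  # candidate "茶饼病" is already an entity
print(nearest_offset((0, 3), gold))  # candidate "茶饼病危" overshoots the end by one
print(nearest_offset((4, 6), gold))  # candidate "害嫩叶" starts one character early
```

A zero offset marks the candidate as an entity, while a non-zero offset tells the model how far each boundary must move to reach one, so spans that miss an entity still provide a training signal.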

Key words: named entity recognition, Chinese tea pest and disease text, self-built dataset, boundary enhancement, boundary offset

Abstract:

[Objective] Named Entity Recognition (NER) is vital for many natural language processing (NLP) applications, including information retrieval and knowledge graph construction. Chinese NER has advanced with datasets such as ResumeNER, WeiboNER, and CLUENER (Chinese Language Understanding Evaluation NER), but most of these datasets cover general domains such as news or social media. Annotated data remain scarce in specialized fields, particularly agriculture, and for tea plant diseases and pests this shortage hampers progress in intelligent agricultural information extraction. These domain-specific texts are difficult for NER because they contain frequent nested and long-span entities, which traditional sequence labeling models struggle to handle. Boundary ambiguity further complicates recognition and degrades segmentation and labeling performance. Addressing these challenges requires targeted datasets and NER techniques tailored to the agricultural domain.

[Methods] The proposed model extended the Boundary Offset Prediction Network (BOPN) with two core modules designed to improve NER in domains with complex, fine-grained entity structures, such as tea plant diseases and pests. The Boundary Prediction Module identified entity spans in the input text. It used an attention-based mechanism to estimate the probability that consecutive tokens belonged to the same entity, which reduced boundary ambiguity and made boundary detection more accurate, particularly for nested or overlapping entities. The Label Enhancement Module further refined entity recognition with a biaffine classifier that jointly modeled entity spans and their corresponding category labels.
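As an illustration of the biaffine classifier named above, the sketch below scores a span for each label from the representations of its first and last tokens. It uses the generic biaffine form (a bilinear term plus a linear term plus a bias), and it is an assumption that the paper's classifier follows this form. The function names, the two-dimensional vectors, and the two labels are hypothetical.

```python
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]
LabelParams = Tuple[Matrix, Vector, float]  # (U, w, b) for one label


def biaffine_score(h_start: Vector, h_end: Vector, U: Matrix, w: Vector, b: float) -> float:
    """Score one label for one span: h_start^T U h_end + w . [h_start; h_end] + b."""
    bilinear = sum(
        h_start[i] * U[i][j] * h_end[j]
        for i in range(len(h_start))
        for j in range(len(h_end))
    )
    # The linear term acts on the concatenation of the two boundary vectors.
    linear = sum(wk * xk for wk, xk in zip(w, h_start + h_end))
    return bilinear + linear + b


def classify_span(h_start: Vector, h_end: Vector, params: Dict[str, LabelParams]) -> str:
    """Return the label whose biaffine score is highest for this span."""
    scores = {label: biaffine_score(h_start, h_end, *p) for label, p in params.items()}
    return max(scores, key=scores.get)


# Toy parameters (hypothetical): "disease" rewards matching boundary features,
# "pest" rewards crossed boundary features.
toy_params = {
    "disease": ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0, 0.0, 0.0], 0.0),
    "pest": ([[0.0, 1.0], [1.0, 0.0]], [0.0, 0.0, 0.0, 0.0], 0.0),
}

print(classify_span([1.0, 0.0], [1.0, 0.0], toy_params))
print(classify_span([1.0, 0.0], [0.0, 1.0], toy_params))
```

Because the bilinear term multiplies start and end features together, the score depends on the pair of boundaries rather than on either boundary alone, which is what allows the span and its label to be modeled jointly.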
Jointly modeling spans and labels captured the interactions between span representations and semantic label information, improving the identification of long or syntactically complex entities. The output of this module was integrated with conditionally normalized hidden representations, which strengthened the model's ability to assign context-aware labels. To reduce computational cost while preserving effectiveness, the architecture incorporated low-rank linear layers, constructed by combining the adaptive channel weighting mechanism of Squeeze-and-Excitation Networks with low-rank decomposition. These layers replaced the original linear transformations, improving both efficiency and representational capacity. In addition to model development, a domain-specific NER corpus was constructed by systematically collecting and annotating entity information on tea plant diseases and pests from scientific literature, agricultural technical reports, and online texts. The annotated entities were categorized into ten classes, including tea plant diseases, tea pests, disease symptoms, and pest symptoms. From this labeled corpus, a Chinese NER dataset on tea plant diseases and pests was developed, referred to as the Chinese tea-pad dataset.

[Results and Discussions] The Chinese tea-pad dataset comprised 122 281 annotated characters and 2 967 sentences, containing 9 020 entities in total across professional entity categories that included disease names, pest names, affected plant parts, and symptom descriptions. The dataset had a standardized structure and a clear hierarchical organization, supporting the training and evaluation of NER models.
Extensive experiments were conducted on the constructed dataset, comparing the proposed method with several mainstream NER approaches, including traditional sequence labeling models (e.g., BiLSTM-CRF), lexicon-enhanced models (e.g., SoftLexicon), and boundary smoothing strategies (e.g., Boundary Smooth). These comparisons assessed how well the proposed architecture handled domain-specific and structurally complex entity types. To evaluate generalization beyond the tea disease and pest domain, the model was also evaluated on four publicly available Chinese NER benchmark datasets: ResumeNER, WeiboNER, CLUENER, and Taobao. The proposed model achieved higher F1 scores than the compared methods on all five datasets, with particularly notable improvements on complex, nested, and long-span entities. These outcomes indicated that the model captured intricate entity boundaries and semantics more reliably, and supported its robustness and adaptability relative to current state-of-the-art methods.

[Conclusions] The study presented an NER approach tailored to the characteristics of Chinese texts on tea plant diseases and pests. By jointly optimizing entity boundary detection and label classification, the proposed method improved recognition accuracy in this specialized domain. Experimental results showed that the model adapted well to both the newly constructed dataset and the public datasets, indicating broad applicability.
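The low-rank linear layer described in the Methods can be sketched as follows. The abstract does not specify how squeeze-and-excitation weighting and low-rank decomposition are combined, so the arrangement here is an assumption: the input is projected to `rank` channels, a gate computed from the token-averaged channels reweights them, and a second matrix projects to the output size. All function names, matrix shapes, and toy values are hypothetical, and the symmetry property mentioned in the Chinese abstract is not modeled.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def se_gate(z_rows: List[Vector], w1: Matrix, w2: Matrix) -> Vector:
    """Squeeze-and-excitation gate: average over tokens, then ReLU and sigmoid layers."""
    n = len(z_rows)
    squeezed = [sum(col) / n for col in zip(*z_rows)]      # squeeze: one value per channel
    hidden = [max(0.0, h) for h in matvec(w1, squeezed)]   # excitation, ReLU layer
    return [1.0 / (1.0 + math.exp(-g)) for g in matvec(w2, hidden)]  # sigmoid layer


def low_rank_se_linear(x_rows: List[Vector], a: Matrix, b: Matrix,
                       w1: Matrix, w2: Matrix) -> List[Vector]:
    """Compute B (gate * (A x)) for every token instead of a full-rank W x."""
    z_rows = [matvec(a, x) for x in x_rows]   # project d_in -> rank
    gate = se_gate(z_rows, w1, w2)            # one weight per rank channel
    return [matvec(b, [g * z for g, z in zip(gate, zr)]) for zr in z_rows]


def parameter_counts(d_in: int, d_out: int, rank: int) -> Tuple[int, int]:
    """Weights in a full linear layer vs. its rank-r factors (biases and gate ignored)."""
    return d_in * d_out, rank * (d_in + d_out)


# Toy shapes: d_in = 3, rank = 2, d_out = 3, two tokens.
A = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
B = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W1 = [[1.0, 1.0]]
W2 = [[0.0], [0.0]]  # zero weights make every gate value sigmoid(0) = 0.5
tokens = [[2.0, 4.0, 6.0], [0.0, 2.0, 8.0]]

print(low_rank_se_linear(tokens, A, B, W1, W2))
print(parameter_counts(768, 768, 64))
```

For a 768-to-768 layer with rank 64 (illustrative sizes, not taken from the paper), the factors hold 98 304 weights against 589 824 for the full matrix, which is where the computational saving comes from.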

Key words: named entity recognition, Chinese tea pest and disease text, self-built dataset, boundary enhancement, boundary offset

CLC number: