
Smart Agriculture ›› 2025, Vol. 7 ›› Issue (1): 33-43. DOI: 10.12133/j.smartag.SA202410026

• Topic: Agricultural Knowledge Intelligent Services and Smart Unmanned Farms (Part 2) •

Method for Calculating Semantic Similarity of Short Agricultural Texts Based on Transfer Learning

JIN Ning1, GUO Yufeng1,2, HAN Xiaodong1, MIAO Yisheng2,3, WU Huarui2,3()   

  1. School of Computer Science and Engineering, Shenyang Jianzhu University, Shenyang 110168, Liaoning, China
    2. National Engineering Research Center for Information Technology in Agriculture, Beijing 100097, China
    3. Key Laboratory of Agricultural Information Technology, Ministry of Agriculture and Rural Affairs, Beijing 100097, China
  • Received: 2024-10-25 Online: 2025-01-30
  • Foundation items:
    National Key Research and Development Program of China (2024YFD200803-3); Basic Research Project of the Education Department of Liaoning Province (LJKQZ20222458); Liaoning Province Science and Technology Plan Joint Plan (2024-MSLH-399)
  • About author:
    JIN Ning, Ph.D., Associate Professor; research interests: agricultural intelligent systems. E-mail:
  • Corresponding author:
    WU Huarui, Ph.D., Research Fellow; research interests: agricultural intelligent systems and the Internet of Things. E-mail:

Method for Calculating Semantic Similarity of Short Agricultural Texts Based on Transfer Learning

JIN Ning1, GUO Yufeng1,2, HAN Xiaodong1, MIAO Yisheng2,3, WU Huarui2,3()   

  1. School of Computer Science and Engineering, Shenyang Jianzhu University, Shenyang 110168, China
    2. National Engineering Research Center for Information Technology in Agriculture, Beijing 100097, China
    3. Key Laboratory of Agricultural Information Technology, Ministry of Agriculture and Rural Affairs, Beijing 100097, China
  • Received: 2024-10-25 Online: 2025-01-30
  • Foundation items: National Key Research and Development Program of China (2024YFD200803-3); Basic Research Project of the Education Department of Liaoning Province (LJKQZ20222458); Liaoning Province Science and Technology Plan Joint Plan (2024-MSLH-399)
  • About author:

    JIN Ning, E-mail:

  • Corresponding author:
    WU Huarui, E-mail:

Abstract:

[Objective/Significance] High-quality semantic similarity calculation in the agricultural domain is an important foundation for advancing the informatization and intelligentization of agricultural technology extension. To address the incomplete feature extraction of existing text semantic similarity models and the scarcity of high-quality annotated datasets, a semantic similarity calculation model for short agricultural texts, CWPT-SBERT (Chinese-based Wordpiece Tokenization and Transfer-learning by Sentence BERT), based on transfer learning and the BERT (Bidirectional Encoder Representations from Transformers) pre-trained model, is proposed. [Methods] Built on a Siamese network architecture, CWPT-SBERT uses a transfer learning strategy to pre-train on large-scale annotated datasets from the general domain, alleviating the scarcity of annotated agricultural text datasets and the high semantic sparsity of short texts. A Chinese-oriented sub-word tokenization method, CWPT, is proposed to decompose Chinese characters, strengthening the semantic feature representation of character vectors and further enriching the semantic expression of short texts. Following the fine-tuning mechanism of transfer learning, the SBERT (Sentence BERT) model is used to extract character vectors and mine the associations between Chinese characters and between glyph structures, improving the accuracy of semantic similarity calculation. [Results and Discussions] The CWPT-SBERT model achieved a semantic similarity accuracy of 97.18%, higher than 12 models including the convolutional-network-based TextCNN_Attention, the recurrent-network-based MaLSTM (Manhattan Long Short-Term Memory), and the BERT-based SBERT. [Conclusions] The CWPT-SBERT model achieves high semantic similarity accuracy on a small-scale short agricultural text dataset with clear performance advantages, providing an effective technical reference for intelligent semantic matching.

Keywords: transfer learning, short agricultural text, semantic similarity calculation, glyph features, intelligent knowledge service, large models

Abstract:

[Objective] Intelligent services of agricultural knowledge have emerged as a hot research domain, serving as significant support for the construction of smart agriculture. The platform "China Agricultural Technology Extension" provides users with efficient and convenient agricultural information consultation services via mobile terminals and has accumulated a vast amount of Q&A data. These data are characterized by a huge volume of information, rapid update and iteration, and a high degree of redundancy, so the platform encounters issues such as frequent repetitive questions, slow problem responses, and inaccurate information retrieval. A high-quality text semantic similarity calculation approach is urgently required to confront these challenges and effectively enhance the information service efficiency and intelligence level of the platform. In view of the incomplete feature extraction of existing text semantic similarity calculation models and the lack of annotated short agro-text datasets, a semantic similarity calculation model for short agro-texts, namely CWPT-SBERT, based on transfer learning and the BERT pre-trained model, was proposed. [Methods] CWPT-SBERT was based on a Siamese architecture with identical left and right branches and shared parameters, which had the advantages of low structural complexity and high training efficiency. This network architecture effectively reduced the consumption of computational resources by sharing parameters and ensured that input texts were compared in the same feature space. CWPT-SBERT consisted of four main parts: a semantic enhancement layer, an embedding layer, a pooling layer, and a similarity measurement layer. 
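The Siamese comparison flow described above can be sketched in a few lines. This is a minimal illustration, assuming a toy deterministic character encoder in place of SBERT; only the shared-parameter structure, average pooling, and cosine measurement mirror the model, and all names here are illustrative, not the paper's implementation.

```python
import math

DIM = 8  # toy embedding width; SBERT would use hundreds of dimensions

def encode(text):
    """Toy shared encoder standing in for SBERT: one deterministic
    sign vector per character, derived from its code point bits."""
    return [[1.0 if (ord(ch) >> i) & 1 else -1.0 for i in range(DIM)]
            for ch in text]

def mean_pool(vectors):
    """Average pooling: map token vectors to one sentence vector."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(DIM)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def similarity(text_a, text_b):
    # Both branches reuse the same encoder (shared parameters),
    # so the two texts are compared in the same feature space.
    return cosine(mean_pool(encode(text_a)), mean_pool(encode(text_b)))

print(round(similarity("小麦病害", "小麦病害"), 6))  # identical texts → 1.0
```

In training, the computed cosine score would be compared against a gold similarity label by the loss function; here only the forward comparison path is shown.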
In the semantic enhancement layer, the CWPT method based on word segmentation units was proposed to further divide Chinese characters into finer-grained sub-units, maximizing the semantic features in short Chinese texts and effectively enhancing the model's understanding of complex Chinese vocabulary and character structures. In the embedding layer, a transfer learning strategy was used to extract features from short agro-texts based on SBERT, which captured the semantic features of Chinese text in the general domain and then, after fine-tuning, generated a more suitable semantic feature vector representation. Training on large-scale general-domain annotated datasets via transfer learning solved the problems of limited annotated short agro-text datasets and high semantic sparsity. The pooling layer used an average pooling strategy to map the high-dimensional semantic vectors of Chinese short texts into a low-dimensional vector space. The similarity measurement layer used cosine similarity to measure the similarity between the semantic feature vectors of the two input short texts, and the computed similarity was fed into the loss function to guide model training, optimize model parameters, and improve the accuracy of similarity calculation. [Results and Discussions] For the task of calculating semantic similarity of short agro-texts, on a dataset containing 19 968 pairs of short agro-texts, the CWPT-SBERT model achieved an accuracy of 97.18%, a precision of 96.93%, a recall of 97.14%, and an F1-Score of 97.04%, all higher than those of 12 models including TextCNN_Attention, MaLSTM and SBERT. 
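The sub-character decomposition idea behind the semantic enhancement layer can be illustrated as follows. The three-entry component table is a tiny hand-made sample (an assumption for demonstration); the paper's actual CWPT decomposition rules and component inventory are not reproduced here.

```python
# Illustrative sub-character decomposition in the spirit of CWPT.
# COMPONENTS is a hand-made sample mapping, not the paper's data.
COMPONENTS = {
    "秧": ["禾", "央"],  # "seedling" = grain radical + centre
    "苗": ["艹", "田"],  # "sprout"   = grass radical + field
    "病": ["疒", "丙"],  # "disease"  = sickness radical + bing
}

def decompose(text):
    """Expand each character into known glyph components so that
    shared radicals (e.g. the grain radical common to crop terms)
    surface as common sub-units for the tokenizer; characters with
    no entry pass through unchanged."""
    units = []
    for ch in text:
        units.extend(COMPONENTS.get(ch, [ch]))
    return units

print(decompose("秧苗病"))  # → ['禾', '央', '艹', '田', '疒', '丙']
```

Exposing shared radicals this way is what lets the downstream embedding relate characters that differ as whole glyphs but overlap in structure.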
Analysis of the Pearson and Spearman coefficients of CWPT-SBERT, SBERT, SALBERT and SRoBERTa trained on short agro-text datasets showed that the initial training value of the CWPT-SBERT model was significantly higher than those of the comparison models and close to their highest values, and that it grew smoothly during training, indicating that CWPT-SBERT had strong correlation, robustness, and generalization ability from its initial state. During training, it could not only learn the features in the training data but also effectively apply these features to data from a new domain. Additionally, the ALBERT, RoBERTa and BERT models were fine-tuned on short agro-text datasets and optimized by utilizing Chinese character morphological structure features to enrich the semantic feature expression of texts. Ablation experiments showed that both optimization strategies could effectively enhance model performance. Analysis of the attention weight heatmap of Chinese character morphological structure highlighted the importance of Chinese character radicals in representing character attributes, enhancing the semantic representation of Chinese characters in the vector space; complex correlations also existed within the morphological structure of Chinese characters. [Conclusions] CWPT-SBERT uses transfer learning to address the limited annotated short agro-text datasets and high semantic sparsity. By leveraging the Chinese-oriented word segmentation method CWPT to decompose Chinese characters, the semantic representation of character vectors is enhanced and the semantic feature expression of short texts is enriched. The CWPT-SBERT model achieves high semantic similarity accuracy on small-scale short agro-text datasets with clear performance advantages, providing an effective technical reference for intelligent semantic matching.

Key words: transfer learning, short agro-text, semantic similarity calculation, glyph features, intelligent knowledge service, large models

CLC Number: