
Smart Agriculture ›› 2025, Vol. 7 ›› Issue (1): 33-43. DOI: 10.12133/j.smartag.SA202410026

• Topic: Agricultural Knowledge Intelligent Services and Smart Unmanned Farms (Part 2) •

Method for Calculating Semantic Similarity of Short Agricultural Texts Based on Transfer Learning

JIN Ning1, GUO Yufeng1,2, HAN Xiaodong1, MIAO Yisheng2,3, WU Huarui2,3()   

  1. School of Computer Science and Engineering, Shenyang Jianzhu University, Shenyang 110168, Liaoning, China
    2. National Engineering Research Center for Information Technology in Agriculture, Beijing 100097, China
    3. Key Laboratory of Agricultural Information Technology, Ministry of Agriculture and Rural Affairs, Beijing 100097, China
  • Received: 2024-10-25 Online: 2025-01-30
  • Foundation items:
    National Key Research and Development Program of China (2024YFD200803-3); Basic Research Project of the Education Department of Liaoning Province (LJKQZ20222458); Liaoning Province Science and Technology Plan Joint Plan (2024-MSLH-399)
  • About author:
    JIN Ning, Ph.D., Associate Professor; research interests: agricultural intelligent systems. E-mail:
  • Corresponding author:
    WU Huarui, Ph.D., Research Fellow; research interests: agricultural intelligent systems and the Internet of Things. E-mail:

Method for Calculating Semantic Similarity of Short Agricultural Texts Based on Transfer Learning

JIN Ning1, GUO Yufeng1,2, HAN Xiaodong1, MIAO Yisheng2,3, WU Huarui2,3()   

  1. School of Computer Science and Engineering, Shenyang Jianzhu University, Shenyang 110168, China
    2. National Engineering Research Center for Information Technology in Agriculture, Beijing 100097, China
    3. Key Laboratory of Agricultural Information Technology, Ministry of Agriculture and Rural Affairs, Beijing 100097, China
  • Received: 2024-10-25 Online: 2025-01-30
  • Foundation items: National Key Research and Development Program of China (2024YFD200803-3); Basic Research Project of the Education Department of Liaoning Province (LJKQZ20222458); Liaoning Province Science and Technology Plan Joint Plan (2024-MSLH-399)
  • About author:

    JIN Ning, E-mail:

  • Corresponding author:
    WU Huarui, E-mail:

Abstract:

[Objective/Significance] High-quality semantic similarity calculation in the agricultural domain is an important foundation for advancing the informatization and intelligentization of agricultural technology extension. To address the incomplete feature extraction of existing text semantic similarity models and the scarcity of high-quality annotated datasets, a semantic similarity calculation model for short agricultural texts, CWPT-SBERT (Chinese-based Wordpiece Tokenization and Transfer-learning by Sentence BERT), based on transfer learning and the BERT (Bidirectional Encoder Representations from Transformers) pre-trained model, is proposed. [Methods] Built on a Siamese network architecture, CWPT-SBERT uses a transfer learning strategy to pre-train on large-scale annotated datasets from the general domain, alleviating the scarcity of annotated agricultural text datasets and the high semantic sparsity of short texts. A Chinese-oriented sub-word tokenization method, CWPT, is proposed to decompose Chinese characters, strengthening the semantic feature representation of character vectors and further enriching the semantic expression of short texts. Following the fine-tuning mechanism of transfer learning, the SBERT (Sentence BERT) model is used to extract character vectors and mine the associations between Chinese characters and between glyph structures, improving the accuracy of semantic similarity calculation. [Results and Discussions] The CWPT-SBERT model achieved a semantic similarity accuracy of 97.18%, higher than 12 models including the convolutional-network-based TextCNN_Attention, the recurrent-network-based MaLSTM (Manhattan Long Short-Term Memory), and the BERT-based SBERT. [Conclusions] The CWPT-SBERT model achieves high semantic similarity accuracy on a small-scale short agricultural text dataset with clear performance advantages, providing an effective technical reference for intelligent semantic matching.

Keywords: transfer learning, short agricultural text, semantic similarity calculation, glyph features, intelligent knowledge service, large models

Abstract:

[Objective] Intelligent services of agricultural knowledge have emerged as a hot research domain, serving as significant support for the construction of smart agriculture. The platform "China Agricultural Technology Extension" provides users with efficient and convenient agricultural information consultation services via mobile terminals and has accumulated a vast amount of Q&A data. These data are characterized by a huge volume of information, rapid update and iteration, and a high degree of redundancy, so the platform encounters issues such as frequent repetitive questions, slow problem responses, and inaccurate information retrieval. A high-quality text semantic similarity calculation approach is urgently required to confront these challenges and effectively enhance the information service efficiency and intelligence level of the platform. In view of the incomplete feature extraction of existing text semantic similarity calculation models and the lack of annotated short agro-text datasets, a semantic similarity calculation model for short agro-texts, namely CWPT-SBERT, based on transfer learning and the BERT pre-trained model, was proposed. [Methods] CWPT-SBERT was based on a Siamese architecture with identical left and right branches and shared parameters, which had the advantages of low structural complexity and high training efficiency. This network architecture effectively reduced the consumption of computational resources by sharing parameters and ensured that input texts were compared in the same feature space. CWPT-SBERT consisted of four main parts: a semantic enhancement layer, an embedding layer, a pooling layer, and a similarity measurement layer. 
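The Siamese comparison flow described above can be sketched in a few lines. This is a minimal illustration, assuming a toy deterministic character encoder in place of SBERT; only the shared-parameter structure, average pooling, and cosine measurement mirror the model, and all names here are illustrative, not the paper's implementation.

```python
import math

DIM = 8  # toy embedding width; SBERT would use hundreds of dimensions

def encode(text):
    """Toy shared encoder standing in for SBERT: one deterministic
    sign vector per character, derived from its code point bits."""
    return [[1.0 if (ord(ch) >> i) & 1 else -1.0 for i in range(DIM)]
            for ch in text]

def mean_pool(vectors):
    """Average pooling: map token vectors to one sentence vector."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(DIM)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def similarity(text_a, text_b):
    # Both branches reuse the same encoder (shared parameters),
    # so the two texts are compared in the same feature space.
    return cosine(mean_pool(encode(text_a)), mean_pool(encode(text_b)))

print(round(similarity("小麦病害", "小麦病害"), 6))  # identical texts → 1.0
```

In training, the computed cosine score would be compared against a gold similarity label by the loss function; here only the forward comparison path is shown.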
In the semantic enhancement layer, the CWPT method based on word segmentation units was proposed to further divide Chinese characters into finer-grained sub-units, maximizing the semantic features in short Chinese texts and effectively enhancing the model's understanding of complex Chinese vocabulary and character structures. In the embedding layer, a transfer learning strategy was used to extract features from short agro-texts based on SBERT, which captured the semantic features of Chinese text in the general domain and then, after fine-tuning, generated a more suitable semantic feature vector representation. Training on large-scale general-domain annotated datasets via transfer learning solved the problems of limited annotated short agro-text datasets and high semantic sparsity. The pooling layer used an average pooling strategy to map the high-dimensional semantic vectors of Chinese short texts into a low-dimensional vector space. The similarity measurement layer used cosine similarity to measure the similarity between the semantic feature vectors of the two input short texts, and the computed similarity was fed into the loss function to guide model training, optimize model parameters, and improve the accuracy of similarity calculation. [Results and Discussions] For the task of calculating semantic similarity of short agro-texts, on a dataset containing 19 968 pairs of short agro-texts, the CWPT-SBERT model achieved an accuracy of 97.18%, a precision of 96.93%, a recall of 97.14%, and an F1-Score of 97.04%, all higher than those of 12 models including TextCNN_Attention, MaLSTM and SBERT. 
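The sub-character decomposition idea behind the semantic enhancement layer can be illustrated as follows. The three-entry component table is a tiny hand-made sample (an assumption for demonstration); the paper's actual CWPT decomposition rules and component inventory are not reproduced here.

```python
# Illustrative sub-character decomposition in the spirit of CWPT.
# COMPONENTS is a hand-made sample mapping, not the paper's data.
COMPONENTS = {
    "秧": ["禾", "央"],  # "seedling" = grain radical + centre
    "苗": ["艹", "田"],  # "sprout"   = grass radical + field
    "病": ["疒", "丙"],  # "disease"  = sickness radical + bing
}

def decompose(text):
    """Expand each character into known glyph components so that
    shared radicals (e.g. the grain radical common to crop terms)
    surface as common sub-units for the tokenizer; characters with
    no entry pass through unchanged."""
    units = []
    for ch in text:
        units.extend(COMPONENTS.get(ch, [ch]))
    return units

print(decompose("秧苗病"))  # → ['禾', '央', '艹', '田', '疒', '丙']
```

Exposing shared radicals this way is what lets the downstream embedding relate characters that differ as whole glyphs but overlap in structure.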
Analysis of the Pearson and Spearman coefficients of CWPT-SBERT, SBERT, SALBERT and SRoBERTa trained on short agro-text datasets showed that the initial training value of the CWPT-SBERT model was significantly higher than those of the comparison models and close to their highest values, and that it grew smoothly during training, indicating that CWPT-SBERT had strong correlation, robustness, and generalization ability from its initial state. During training, it could not only learn the features in the training data but also effectively apply these features to data from a new domain. Additionally, the ALBERT, RoBERTa and BERT models were fine-tuned on short agro-text datasets and optimized by utilizing Chinese character morphological structure features to enrich the semantic feature expression of texts. Ablation experiments showed that both optimization strategies could effectively enhance model performance. Analysis of the attention weight heatmap of Chinese character morphological structure highlighted the importance of Chinese character radicals in representing character attributes, enhancing the semantic representation of Chinese characters in the vector space; complex correlations also existed within the morphological structure of Chinese characters. [Conclusions] CWPT-SBERT uses transfer learning to address the limited annotated short agro-text datasets and high semantic sparsity. By leveraging the Chinese-oriented word segmentation method CWPT to decompose Chinese characters, the semantic representation of character vectors is enhanced and the semantic feature expression of short texts is enriched. The CWPT-SBERT model achieves high semantic similarity accuracy on small-scale short agro-text datasets with clear performance advantages, providing an effective technical reference for intelligent semantic matching.

Key words: transfer learning, short agro-text, semantic similarity calculation, glyph features, intelligent knowledge service, large models

CLC Number: