Method for Calculating Semantic Similarity of Short Agricultural Texts Based on Transfer Learning

doi:10.12133/j.smartag.SA202410026

Abstract

[Objective] Intelligent agricultural knowledge services are an active research area and an important support for the construction of smart agriculture. The "China Agricultural Technology Extension" platform provides users with efficient and convenient agricultural information consultation services via mobile terminals and has accumulated a vast amount of Q&A data. These data are large in volume, rapidly updated, and highly redundant, so the platform suffers from frequently repeated questions, slow responses, and inaccurate information retrieval. A high-quality method for calculating text semantic similarity is urgently needed to address these problems and to improve the platform's service efficiency and level of intelligence. To address the incomplete feature extraction of existing text semantic similarity models and the lack of annotated short agro-text datasets, a semantic similarity calculation model for short agro-text, CWPT-SBERT, was proposed based on transfer learning and the BERT pre-trained model. [Methods] CWPT-SBERT was based on a Siamese architecture whose left and right branches were identical and shared parameters, giving it low structural complexity and high training efficiency. Sharing parameters reduced the consumption of computational resources and ensured that the input texts were compared in the same feature space. CWPT-SBERT consisted of four main parts: a semantic enhancement layer, an embedding layer, a pooling layer, and a similarity measurement layer.
In the semantic enhancement layer, the CWPT method, based on the word segmentation unit, was proposed to further divide Chinese characters into finer-grained sub-units, maximizing the semantic features available in short Chinese text and enhancing the model's understanding of complex Chinese vocabulary and character structures. In the embedding layer, a transfer learning strategy was used to extract features from short agro-texts based on SBERT: the model first captured the semantic features of general-domain Chinese text and, after fine-tuning, generated semantic feature vector representations better suited to the target texts. Training the model on large-scale annotated general-domain datasets in this way addressed the limited availability of annotated short agro-text datasets and their high semantic sparsity. The pooling layer used an average pooling strategy to map the high-dimensional semantic vectors of Chinese short text to a low-dimensional vector space. The similarity measurement layer used cosine similarity to measure the similarity between the semantic feature vector representations of the two short texts, and the computed similarity was then fed into the loss function to guide model training, optimize the model parameters, and improve the accuracy of similarity calculation. [Results and Discussions] For the task of calculating the semantic similarity of short agro-texts, on a dataset containing 19 968 pairs of short agro-texts, the CWPT-SBERT model achieved an accuracy of 97.18%, a precision of 96.93%, a recall of 97.14%, and an F₁-score of 97.04%, which were higher than those of 12 comparison models such as TextCNN_Attention, MaLSTM and SBERT.
Analysis of the Pearson and Spearman coefficients of CWPT-SBERT, SBERT, SALBERT and SRoBERTa trained on the short agro-text dataset showed that the initial training value of CWPT-SBERT was significantly higher than that of the comparison models and was close to their highest value. Moreover, it grew smoothly during training, indicating that CWPT-SBERT had strong correlation, robustness, and generalization ability from the initial state. During training, the model could not only learn the features in the training data but also apply them effectively to new-domain data. Additionally, the ALBERT, RoBERTa and BERT models were fine-tuned on the short agro-text dataset and optimized by using morphological structure features to enrich the semantic feature expression of the text. Ablation experiments showed that both optimization strategies effectively improved model performance. Analysis of the attention weight heatmap of Chinese character morphological structure highlighted the importance of radicals in representing the attributes of Chinese characters, which enhanced the semantic representation of Chinese characters in vector space, and also revealed complex correlations within the morphological structure of Chinese characters. [Conclusions] CWPT-SBERT uses transfer learning to address the limited availability of annotated short agro-text datasets and their high semantic sparsity. By using the Chinese-oriented word segmentation method CWPT to break down Chinese characters, it enhances the semantic representation of word vectors and enriches the semantic feature expression of short texts. The CWPT-SBERT model calculates semantic similarity with high accuracy on small-scale short agro-text datasets and shows clear performance advantages, providing an effective technical reference for intelligent semantic matching.
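
The pooling and similarity measurement steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: `toy_encode` is a hypothetical stand-in for the shared SBERT encoder, and the 3-dimensional vectors it returns are arbitrary. The sketch only shows that both texts pass through the same encoder, are averaged over tokens, and are then compared by cosine similarity.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is skipped)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("at least one unmasked token is required")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def toy_encode(text):
    """Hypothetical stand-in for the shared encoder: one 3-d vector per character."""
    return [[ord(ch) % 7 + 1.0, ord(ch) % 11 + 1.0, ord(ch) % 13 + 1.0]
            for ch in text]


def siamese_similarity(text_a, text_b, encode=toy_encode):
    """Encode both texts with the same function, pool, then compare."""
    vecs_a, vecs_b = encode(text_a), encode(text_b)
    sent_a = mean_pool(vecs_a, [1] * len(vecs_a))
    sent_b = mean_pool(vecs_b, [1] * len(vecs_b))
    return cosine_similarity(sent_a, sent_b)
```

Because a single `encode` function serves both branches, the two sentence vectors lie in the same feature space, which is the property the Siamese design relies on. In a real system the stand-in encoder would be replaced by the fine-tuned model.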

Key words: transfer learning, short agro-text, semantic similarity calculation, glyph features, intelligent knowledge service, large model

CLC Number: TP183

JIN Ning, GUO Yufeng, HAN Xiaodong, MIAO Yisheng, WU Huarui. Method for Calculating Semantic Similarity of Short Agricultural Texts Based on Transfer Learning[J]. Smart Agriculture, 2025, 7(1): 33-43.

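No.	Question 1	Question 2	True label
1	土豆癌肿病的症状有哪些？ (What are the symptoms of potato wart disease?)	土豆癌肿病有什么症状？ (What symptoms does potato wart disease have?)	1
2	牡丹缺钾症防治方法有哪些？ (What are the control methods for potassium deficiency in peony?)	杜鹃缺钾症防治方法有哪些？ (What are the control methods for potassium deficiency in azalea?)	0
3	防治夏季玉米钻心虫危害，危害症状是什么？ (When controlling summer corn borer damage, what are the damage symptoms?)	如何防治夏季玉米钻心虫危害？ (How can summer corn borer damage be controlled?)	0
4	种植土豆应该选择什么样的土地种植？ (What kind of land should be chosen for planting potatoes?)	应该选择什么样的种植土豆土地种植？ (What kind of potato-growing land should be chosen for planting?)	1
5	柑橘幼树能不能追施尿素肥料吗？ (Can young citrus trees be top-dressed with urea fertilizer?)	柑橘幼树如何施尿素肥料呢？ (How should urea fertilizer be applied to young citrus trees?)	0
6	请问各位老师这是什么虫，它正在取食樱桃树叶，该如何防治？ (Dear experts, what insect is this? It is feeding on cherry leaves; how should it be controlled?)	各位老师，正在取食樱桃树叶的虫子是什么，该如何防治这种虫子？ (Dear experts, what is the insect feeding on cherry leaves, and how should this insect be controlled?)	1
7	客源市场对休闲观光农业有什么影响？ (What influence does the tourist source market have on leisure and sightseeing agriculture?)	休闲观光农业有哪些影响因素？ (What factors influence leisure and sightseeing agriculture?)	0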
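
The abstract describes the semantic enhancement layer as dividing Chinese characters into finer-grained sub-units. The sketch below illustrates that general idea only. The component table `COMPONENTS` is a small hypothetical mapping written for this example; it is not the CWPT vocabulary, and the source does not specify how CWPT chooses its sub-units.

```python
# Hypothetical component table for illustration only; NOT the authors' CWPT vocabulary.
COMPONENTS = {
    "病": ["疒", "丙"],
    "症": ["疒", "正"],
    "肿": ["月", "中"],
    "状": ["丬", "犬"],
}


def split_characters(text, components=COMPONENTS):
    """Replace each character by its listed components; unlisted characters stay whole."""
    units = []
    for ch in text:
        units.extend(components.get(ch, [ch]))
    return units


# Characters from question 1 above: 病 and 症 expose the shared radical 疒.
print(split_characters("病症"))
```

Under this toy mapping, 病 and 症 both yield the component 疒, so two different characters share a sub-unit that a model could exploit as a common feature.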

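Text 1	Text 2	True label
什么是议价？ (What is bargaining?)	什么是议价权？ (What is bargaining power?)	0
石家庄天气如何？ (How is the weather in Shijiazhuang?)	石家庄天气怎样？ (What is the weather like in Shijiazhuang?)	1
同乐是什么意思？ (What does "同乐" mean?)	与君同乐什么意思？ (What does "与君同乐" mean?)	0
羽绒服怎么干洗啊？ (How do you dry-clean a down jacket?)	怎样干洗羽绒服？ (How can a down jacket be dry-cleaned?)	1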
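
The abstract compares models by the Pearson and Spearman coefficients between predicted similarities and labels such as those above. A minimal sketch of both coefficients follows, assuming tied values receive the average of their ranks (the source does not state how ties are handled). The scores in the usage lines are invented for illustration and are not results from the paper.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant sequence
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman coefficient: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented cosine scores for the four pairs above, paired with their labels.
scores = [0.62, 0.97, 0.55, 0.93]
labels = [0, 1, 0, 1]
print(pearson(scores, labels), spearman(scores, labels))
```

Pearson measures linear agreement between scores and labels, while Spearman depends only on their ordering, so a model that ranks pairs correctly but is poorly calibrated can score high on the latter and lower on the former.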

Experimental environment	Configuration
Operating system	Windows 11 22H2
Memory	DDR4 32 GB 3200 MHz
CPU	AMD Ryzen 5 5600X 3.7 GHz
GPU	NVIDIA RTX 4060 8 GB
Python	3.9
PyTorch	2.2.0

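Model category	Model	Accuracy/%	Precision/%	Recall/%	F₁-score/%
Traditional neural network models	MaLSTM	85.79	88.31	80.83	84.40
Traditional neural network models	BiLSTM	87.85	91.60	81.99	86.53
Traditional neural network models	TextCNN	89.01	88.54	88.35	88.44
Attention-based models	TextCNN_Attention	91.99	91.79	91.42	91.56
Attention-based models	BiLSTM_Self-Attention	92.49	91.89	92.37	92.13
Pre-trained models	RoBERTa	71.42	69.14	72.14	70.61
Pre-trained models	ALBERT	84.78	83.51	84.75	84.12
Pre-trained models	BERT	88.16	87.67	87.39	87.53
Fine-tuned models	SRoBERTa	78.73	77.65	77.65	77.65
Fine-tuned models	SBERT_OFT	94.35	94.07	94.07	94.07
Fine-tuned models	SALBERT	95.16	95.11	94.70	94.90
Fine-tuned models	SBERT	96.42	96.29	96.19	96.24
Fine-tuned models	CWPT-TSBERT	97.18	96.93	97.14	97.04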
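
The four metrics reported above can be computed from binary predictions as in the following sketch. It assumes the standard positive-class definitions (label 1 means "similar"). The source does not state whether the reported values use these definitions or some form of averaging, so the sketch should not be read as the authors' evaluation code.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels, positive class = 1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Seven illustrative labels with hypothetical predictions (one false positive,
# one false negative).
y_true = [1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0, 0, 0]
print(binary_metrics(y_true, y_pred))
```

Reporting precision and recall alongside accuracy matters here because a model can reach high accuracy on an imbalanced set while missing many truly similar pairs; F₁ summarizes the balance between the two.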

Model	Accuracy/%	Precision/%	Recall/%	F₁-score/%
ALBERT	84.78	83.51	84.75	84.12
SALBERT-AGRI	95.16	95.11	94.70	94.90
TSALBERT-AGRI	95.82	95.46	95.76	95.61
CWPT-TSALBERT	95.87	95.46	95.87	95.67
RoBERTa	71.42	69.14	72.14	70.61
SRoBERTa-AGRI	78.73	77.65	77.65	77.65
TSRoBERTa-AGRI	94.15	93.67	94.07	93.87
CWPT-TSRoBERTa	95.92	95.76	95.66	95.71
BERT	88.16	87.67	87.39	87.53
SBERT-AGRI	96.42	96.29	96.19	96.24
TSBERT-AGRI	97.08	96.93	96.93	96.93
CWPT-TSBERT	97.18	96.93	97.14	97.04