Intelligent Q&A Method for Crop Pests and Diseases Using LLM Augmented by Adaptive Hybrid Retrieval

doi:10.12133/j.smartag.SA202506026

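Abstract

Abstract (Chinese version):

[Objective] To realize the potential value of the dispersed, heterogeneous, and unlinked agricultural knowledge hidden in agricultural big data, a knowledge base was constructed and combined with retrieval techniques so that a large language model (LLM) outputs professional agricultural knowledge, providing an effective means of bringing agricultural knowledge quickly into production practice. [Methods] A retrieval-augmented LLM method for intelligent question answering on crop pests and diseases was proposed. By building a dedicated knowledge base and jointly optimizing the chunking strategy, the adaptive retrieval mechanism, and structured prompt engineering, the method lets pest and disease domain knowledge effectively augment the LLM for accurate, professional question answering. Specifically, an adaptive hybrid retrieval-augmented generation (AHR-RAG) method was proposed. First, an overlap mechanism was introduced into fixed-length chunking to mitigate semantic fragmentation, and vector semantic similarity was used to select and store the text chunks highly relevant to the topic. Dynamically routed single-hop retrieval (BM25 algorithm) and multi-hop retrieval were then designed according to question complexity. Finally, the proposed method was compared experimentally with several baseline methods across different query types and query complexities. [Results and Discussions] The proposed method performed best on the Qwen1.5-7B-Chat model, reaching an accuracy of 89.6%. Its accuracy on single-hop and multi-hop queries reached 0.921 and 0.748, respectively; on multi-hop queries this was 0.082 and 0.059 higher than Self-RAG and Adaptive-RAG, respectively, indicating that the method reasons better over complex queries such as multi-hop ones. [Conclusions] The proposed method shows clear advantages in the accuracy, relevance, and comprehensiveness of generated answers. Future work will explore integrating multimodal knowledge bases.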

Keywords: adaptive hybrid retrieval; text chunking; pests and diseases; intelligent Q&A

Abstract:

[Objective] Extracting valuable knowledge from large volumes of dispersed, heterogeneous, and unstructured agricultural big data, linking and structuring it, and using it to augment large models in intelligent question-answering systems makes it possible to deliver knowledge services effectively across agriculture and can accelerate the scientific, precision-oriented development of agricultural production. Existing agricultural Q&A systems lack sufficient semantic understanding of complex symptoms, while general-purpose large language models (LLMs) produce factual hallucinations because their training data cover the domain incompletely. This study also aims to address the insufficient scale and low quality of knowledge bases in the agricultural field. [Methods] First, pest and disease data were collected for five typical crops: wheat, rice, corn, potatoes, and cotton. Outliers were identified and removed through manual verification, yielding 87 901 unstructured data entries. Then, a few-shot learning model was employed to extract the entities defined in the schema layer, and these entities were aligned using BERT semantic vectors and LLM prompt engineering, yielding a triplet knowledge base of 916 239 entries for knowledge retrieval. A knowledge retrieval-augmented LLM approach for intelligent Q&A on crop pests and diseases was proposed, specifically the adaptive hybrid retrieval-augmented generation (AHR-RAG) approach. First, an overlap mechanism was introduced into fixed-length chunking to mitigate semantic fragmentation, and vector semantic similarity was used to select the text chunks most relevant to the topic for storage. Then, single-hop and multi-hop retrieval were designed and selected according to the complexity of the query.
Single-hop retrieval used the BM25 algorithm to match information extracted from the query against document content in the Elasticsearch index and fed the results into the LLM to enhance answer generation. Multi-hop retrieval first converted user queries into structured conditions and semantic vector representations. Results retrieved from the different knowledge bases were then fused using reciprocal rank fusion (RRF) and fed into the LLM. [Results and Discussions] The proposed method was compared experimentally with multiple baseline approaches across different query types and query complexities. On the Qwen1.5-7B-Chat model, it improved accuracy and F₁ by 0.193 and 0.170, respectively, relative to the same model without RAG. Compared with the improved methods Self-RAG and Adaptive-RAG, AHR-RAG kept response time low while improving F₁ by 0.05 and 0.021, respectively, with an accuracy as high as 0.896. For multi-type question-answering tasks, compared with the Naive-RAG method that relied solely on prior knowledge, AHR-RAG improved accuracy by 0.231, 0.123, and 0.157 for comparison, judgment, and selection queries, respectively. AHR-RAG also showed significant advantages in parsing complex semantics. In single-hop queries, its accuracy reached 0.921, a 0.29 improvement over Adaptive-RAG. In multi-hop queries, its accuracy reached 0.748, a gain of 0.082 and 0.059 over Self-RAG and Adaptive-RAG, respectively. In retrieval-augmented generation, optimizing the prompt strategy gave AHR-RAG a 0.013 increase in accuracy and a 0.09 improvement in F₁ compared with feeding retrieval results directly to the model. [Conclusions] The proposed method adapts well to diverse query types and excels at reasoning over complex queries such as multi-hop questions.
It delivers significant advantages in answer generation accuracy, relevance, and comprehensiveness, producing responses with enhanced logical coherence and richer content. Future work will explore integrating multimodal knowledge bases.

Key words: adaptive hybrid retrieval, text blocking, pests and diseases, intelligent Q&A
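
The fixed-length chunking with overlap described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the character-based chunk size, and the overlap length are assumptions, since the abstract does not report the values actually used.

```python
def chunk_with_overlap(text, chunk_size=200, overlap=50):
    """Split text into fixed-length chunks that share `overlap` characters.

    chunk_size and overlap are hypothetical defaults chosen for illustration.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks
```

Because each window starts `chunk_size - overlap` characters after the previous one, a sentence cut at one chunk boundary still appears intact in the neighbouring chunk, which is the effect the overlap mechanism aims for.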

CLC number: TP391.1

YANG Jun, YANG Wanxia, YANG Sen, HE Liang, ZHANG Di. Intelligent Q&A Method for Crop Pests and Diseases Using LLM Augmented by Adaptive Hybrid Retrieval[J]. Smart Agriculture, doi: 10.12133/j.smartag.SA202506026.

Figures and tables (21)

Figure 1

Figure 2

Table 1

Figure 3

Table 2

Figure 4

Table 3

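Prompt template design for answer generation in crop pest and disease research

Stage	Step	Content
Context organization (provides background information for model reasoning)	P1	You are an expert in agricultural pest and disease control. Based on the content retrieved from the knowledge base and your interpretation of the user's question, generate a professional, trustworthy answer. Strictly follow the steps below. User query: {query}. Context organization: read, locate, and rank the relevant passages: - Document 1: {retrieved passage 1} - Document 2: {retrieved passage 2} - ... (keep at most the Top-K relevant passages). Re-rank the passages from most to least relevant to {query} and store the result in an array, e.g. [Document 2 ID, Document 1 ID, ...]
Context organization (provides background information for model reasoning)	P2	Key information extraction: - Extract from the documents above the information directly relevant to the question (e.g. symptom descriptions, pathogen characteristics, control measures)
Multi-step reasoning and answer generation	P3	Multi-step reasoning: analyze the question step by step with the chain template below so that the answer is logically rigorous. 1. Symptom matching: - User description: "{symptoms entered by the user}" - Matched knowledge-base symptoms: "{symptom description in the retrieved passages}" - Associated disease: "{disease name}" (cite document) - Associated features: {spots / breakage / blister-like lesions / epidermal rupture} (cite document) - Associated part: {leaf / stem / ear} (cite document) 2. Cause inference: - Pathogen type: {fungus / bacterium / virus / insect pest} (cite document) - Inducing factors: {environmental conditions / cultivation practices} (cite document) - Transmission route: {insects / fungi / wind and rain} (cite document) 3. Control recommendations: - Agricultural control: {crop rotation / soil treatment} (cite document) - Chemical control: {pesticide}; recommend 2-3 pesticides with dosages (cite document) - Biological control: {natural enemies / microbial agents} (cite document)
Multi-step reasoning and answer generation	P4	Answer generation: analyze {query} comprehensively in terms of pest or disease name, pathogen, pathogen characteristics, affected part, symptoms, conditions favoring the disease, transmission mode, and control methods. Keep the result to about 200 characters, with clear logic and concise, non-redundant content.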
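
To make the four-step template more concrete, the sketch below assembles a prompt from a query and a list of retrieved passages. It is only an illustration under stated assumptions: `build_answer_prompt` is a hypothetical helper, the English wording is a condensed paraphrase of steps P1 to P4 rather than the authors' exact prompt, and the default `top_k=5` is an assumed value.

```python
def build_answer_prompt(query, passages, top_k=5):
    """Assemble a P1-P4 style prompt from a query and retrieved passages.

    The wording condenses the template; it is not the original prompt text.
    """
    kept = passages[:top_k]  # P1: keep at most Top-K retrieved passages
    documents = "\n".join(
        f"- Document {i}: {text}" for i, text in enumerate(kept, start=1)
    )
    return "\n".join([
        "You are an expert in crop pest and disease control.",
        f"User query: {query}",
        "P1. Rank the passages below by relevance to the query:",
        documents,
        "P2. Extract the information directly relevant to the question.",
        "P3. Reason step by step: symptom matching, cause inference, "
        "control recommendations, citing the supporting document.",
        "P4. Give a concise answer of about 200 characters.",
    ])
```

Keeping the passage list inside the prompt, numbered in retrieval order, lets the model cite "Document i" in the reasoning step, as the template requests.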

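Table 4

Experimental environment and parameter configuration for intelligent Q&A on crop pests and diseases

Environment / parameter	Configuration
Operating system	Linux Ubuntu
Python	3.10
CUDA	13.4
GPU	4 × NVIDIA GeForce GTX 1080 Ti
Torch	1.21.3
GPU memory	48 GB
BM25 parameters	$k_1$ = 1.5; $b$ = 0.75
RRF parameter	$k$ = 60
Answer-generation hyperparameter	Temperature = 0.3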
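
The BM25 and RRF parameters in the table can be read against the standard formulas. The sketch below is a minimal, self-contained illustration and not the system's actual code (the abstract states that single-hop retrieval runs on an Elasticsearch index): `bm25_scores` and `rrf_fuse` are hypothetical helpers, the BM25 variant uses a common non-negative IDF form, and documents are assumed to be pre-tokenized lists of terms.

```python
import math
from collections import Counter

K1, B = 1.5, 0.75  # BM25 parameters taken from the table
RRF_K = 60         # RRF constant taken from the table


def bm25_scores(query_tokens, docs_tokens, k1=K1, b=B):
    """Return one BM25 score per tokenized document."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(doc) for doc in docs_tokens) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs_tokens for term in set(doc))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


def rrf_fuse(rankings, k=RRF_K):
    """Fuse ranked lists of document ids with reciprocal rank fusion."""
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)
```

With $k$ = 60, the contribution of each list, 1/(k + rank), changes slowly with rank, so a document that appears reasonably high in several result lists can outrank one that is first in only a single list.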

Table 5

Figure 5

Table 6

Figure 6

Table 7

Table 8

Figure 7

Table 9

Figure 8

Table 10

Figure 9

Figure 10

Figure 11

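Entity	Attribute	Attribute value
Rice blast	Host	Rice
Maize black bundle disease	Affected part	Leaves
Cotton leaf scorch	Parasitic mode	Mycelium
Rice blast	Disease name	Rice blast
Wheat rust	Affected part	Leaves, leaf sheaths
Wheat aphid	Affected part	Leaves, stems
Maize dwarf mosaic	Transmission route	Spread by aphids
Cotton Fusarium wilt	Symptoms	Chlorosis and yellowing of leaf veins
Rice blast	Conditions favoring disease	High temperature and high humidity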
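
Entity-attribute-value triples like those above lend themselves to lookup by structured conditions. The sketch below shows one possible in-memory representation; it is an assumption for illustration only. The names `Triple`, `build_index`, and `lookup` are hypothetical, the English strings are translations of three rows of the table (originals in the comments), and the storage backend actually used is not described in this excerpt.

```python
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    entity: str
    attribute: str
    value: str


# Three rows from the table, translated; original Chinese in comments.
TRIPLES = [
    Triple("Rice blast", "Host", "Rice"),  # 水稻稻瘟病 / 寄主 / 水稻
    Triple("Wheat rust", "Affected part", "Leaves, leaf sheaths"),  # 小麦锈病
    Triple("Maize dwarf mosaic", "Transmission route", "Spread by aphids"),  # 玉米矮花叶病
]


def build_index(triples):
    """Index attribute values by (entity, attribute) for exact-match lookup."""
    index = defaultdict(list)
    for t in triples:
        index[(t.entity, t.attribute)].append(t.value)
    return index


def lookup(index, entity, attribute):
    """Return all values stored for an entity-attribute pair, or an empty list."""
    return index.get((entity, attribute), [])
```

For example, `lookup(build_index(TRIPLES), "Rice blast", "Host")` returns the values recorded for that pair, while an unknown pair yields an empty list rather than raising an error.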

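Template type	Expression example	User input example
Symptom resolution	What should be done about {symptom} on {crop}?	What should be done about wheat leaf rust?
Pest/disease identification	What disease is {symptom}?	What disease causes white spots on rice leaves?
Treatment inquiry	What pesticide is used for {disease}?	What pesticide is used for wheat Fusarium head blight?
Transmission route	How is {pest/disease} transmitted?	How is rice blast transmitted?
Environmental influence	Which environmental factors is {pest/disease} related to?	Which environmental factors is rice blast related to?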
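
One simple way to map a user question onto these template types is cue-phrase matching. The sketch below is a hypothetical illustration, not the matching procedure used in the paper: the cue phrases are assumptions derived from the example questions (the original Chinese cue plus an English equivalent), and `classify_question` is an invented helper name.

```python
import re

# Cue phrases are assumptions taken from the example questions in the table.
TEMPLATE_PATTERNS = [
    ("symptom_resolution", r"怎么办|what should be done"),
    ("identification", r"是什么病|what disease"),
    ("treatment_inquiry", r"用什么药|what pesticide"),
    ("transmission_route", r"如何传播|how is .+ transmitted"),
    ("environmental_influence", r"环境因素|environmental factors"),
]


def classify_question(question):
    """Return the first template type whose cue phrase occurs in the question."""
    for name, pattern in TEMPLATE_PATTERNS:
        if re.search(pattern, question, flags=re.IGNORECASE):
            return name
    return None  # no template matched
```

A question that matches no cue phrase returns `None`, which a caller could treat as a signal to fall back to purely semantic retrieval.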

Top-k	Accuracy (AHR)	Accuracy (DVR)	Accuracy (SQL)	Accuracy (BM25)	Recall (AHR)	Recall (SQL)	Recall (DVR)	Recall (BM25)
1	0.607	0.554	0.452	0.407	0.523	0.487	0.412	0.376
2	0.705	0.654	0.559	0.501	0.634	0.598	0.521	0.463
3	0.756	0.703	0.608	0.554	0.692	0.649	0.584	0.531
4	0.782	0.758	0.659	0.626	0.745	0.708	0.642	0.604
5	0.822	0.774	0.706	0.679	0.791	0.743	0.726	0.657
6	0.785	0.743	0.653	0.627	0.762	0.721	0.668	0.613
7	0.764	0.698	0.621	0.582	0.733	0.683	0.637	0.561
8	0.735	0.662	0.578	0.547	0.705	0.647	0.602	0.535

RAG	Model	Recall	Precision	F₁	Time/s
No RAG	Qwen1.5-7B-Chat	0.725	0.703	0.714	1.02
Naive RAG	Qwen1.5-7B-Chat	0.751	0.735	0.743	1.13
Self-RAG	Qwen1.5-7B-Chat	0.823	0.845	0.834	3.42
Adaptive-RAG	Qwen1.5-7B-Chat	0.848	0.878	0.863	2.63
AHR-RAG	Qwen1.5-7B-Chat	0.872	0.896	0.884	2.43
AHR-RAG	GLM	0.865	0.868	0.866	2.71
AHR-RAG	Baichuan	0.867	0.857	0.862	2.68
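
The F₁ column can be cross-checked against the Recall and Precision columns, assuming the usual definition F₁ = 2PR / (P + R). The short script below recomputes it for each row; the dictionary simply restates the table, and the helper name `f1` is illustrative.

```python
# (recall, precision, reported F1) for each row of the table above.
ROWS = {
    "No RAG / Qwen1.5-7B-Chat": (0.725, 0.703, 0.714),
    "Naive RAG / Qwen1.5-7B-Chat": (0.751, 0.735, 0.743),
    "Self-RAG / Qwen1.5-7B-Chat": (0.823, 0.845, 0.834),
    "Adaptive-RAG / Qwen1.5-7B-Chat": (0.848, 0.878, 0.863),
    "AHR-RAG / Qwen1.5-7B-Chat": (0.872, 0.896, 0.884),
    "AHR-RAG / GLM": (0.865, 0.868, 0.866),
    "AHR-RAG / Baichuan": (0.867, 0.857, 0.862),
}


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


for name, (recall, precision, reported) in ROWS.items():
    computed = f1(precision, recall)
    print(f"{name}: computed {computed:.3f}, reported {reported:.3f}")
```

Under this definition the recomputed values agree with the reported F₁ to within rounding, and the differences between rows reproduce the F₁ gains quoted in the abstract (for example 0.884 versus 0.714 for the model without RAG).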

Method	Recall	Precision	F₁	Time/s
No RAG	0.352	0.437	0.389	0.56
Naive-RAG	0.448	0.451	0.449	1.36
Self-RAG	0.528	0.496	0.511	3.54
Adaptive-RAG	0.601	0.562	0.581	2.46
AHR-RAG	0.709	0.583	0.640	3.23