
Smart Agriculture ›› 2025, Vol. 7 ›› Issue (1): 1-10. doi: 10.12133/j.smartag.SA202411005

• Topic: Intelligent Agricultural Knowledge Services and Smart Unmanned Farms (Part 2) •

A Crop Knowledge Question-Answering System Based on Agri-QA Net, a Multimodal Fusion Large Model Architecture

WU Huarui, ZHAO Chunjiang(), LI Jingchen

  1. Information Technology Research Center, Beijing Academy of Agriculture and Forestry Sciences, Beijing 100079, China
  • Received: 2024-10-31 Online: 2025-01-30
  • Foundation items:
    National Key Research and Development Program of China (2021ZD0113604); Scientific and Technological Innovation 2030-Major Project (2022ZD0115705-05)
  • About the author:
    WU Huarui, Ph.D., Professor; research interests: large language models and agricultural knowledge services. E-mail:
  • Corresponding author:
    ZHAO Chunjiang, Ph.D., Professor, Academician of the Chinese Academy of Engineering; research interests: large language models and agricultural knowledge services. E-mail:

Agri-QA Net: Multimodal Fusion Large Language Model Architecture for Crop Knowledge Question-Answering System

WU Huarui, ZHAO Chunjiang(), LI Jingchen   

  1. Information Technology Research Center, Beijing Academy of Agriculture and Forestry Sciences, Beijing 100079, China
  • Received: 2024-10-31 Online: 2025-01-30
  • Foundation items: National Key Research and Development Program of China (2021ZD0113604); Scientific and Technological Innovation 2030-Major Project (2022ZD0115705-05)
  • About author:
    WU Huarui, E-mail:

  • Corresponding author:
    ZHAO Chunjiang, E-mail:

Abstract:

[Objective/Significance] With the rapid development of agricultural informatization and intelligence, multimodal human-computer interaction technology is becoming increasingly important in agriculture. This study proposes Agri-QA Net, a large model architecture based on multimodal fusion, designed as a specialized multimodal question-answering system for agricultural knowledge on cabbage crops. [Methods] The model integrates text, audio, and image data: a pre-trained BERT (Bidirectional Encoder Representations from Transformers) model extracts textual features, an acoustic model extracts audio features, and a convolutional neural network extracts image features; a Transformer-based fusion layer then integrates these features. In addition, a cross-modal attention mechanism and domain-adaptive techniques are introduced to strengthen the model's understanding and application of specialized agricultural knowledge. Multimodal data related to cabbage cultivation were collected and preprocessed to train and optimize the Agri-QA Net model. [Results and Discussions] Experimental evaluations show that the model performs well on cabbage knowledge question-answering tasks, with higher accuracy and better generalization than traditional single-modal or simple multimodal models. With multimodal input, it achieved an accuracy of 89.5%, a precision of 87.9%, a recall of 91.3%, and an F1-Score of 89.6%, all significantly higher than those of single-modality models. [Conclusions] Case studies demonstrate the application of Agri-QA Net in real agricultural scenarios and confirm its effectiveness in helping farmers solve practical problems. Future work will explore applications in more agricultural scenarios and further optimize model performance.

Keywords: multimodal fusion, human-computer interaction, agricultural knowledge question answering, cabbage crops, large language model

Abstract:

[Objective] As agriculture increasingly relies on technological innovations to boost productivity and ensure sustainability, farmers need efficient and accurate tools to aid their decision-making. A key challenge in this context is the retrieval of specialized agricultural knowledge, which can be complex and diverse in nature. Traditional agricultural knowledge retrieval systems have often been limited by the modalities they utilize (e.g., text or images alone), which restricts their effectiveness in addressing the wide range of queries farmers face. To address this challenge, a specialized multimodal question-answering system tailored for cabbage cultivation, named Agri-QA Net, was proposed. The system integrates multimodal data to enhance the accuracy and applicability of agricultural knowledge retrieval. By incorporating diverse data modalities, Agri-QA Net provides a holistic approach to agricultural knowledge retrieval, enabling farmers to interact with the system using multiple types of input, ranging from spoken queries to images of crop conditions. In doing so, it addresses the complexity of real-world agricultural environments and improves the accessibility of relevant information. [Methods] The architecture of Agri-QA Net was built upon the integration of multiple data modalities: textual, auditory, and visual. This multifaceted approach enabled the system to develop a comprehensive understanding of agricultural knowledge and to learn from a wide array of sources, enhancing its robustness and generalizability. The system incorporated state-of-the-art deep learning models, each designed to handle one specific type of data. For text, a pre-trained Bidirectional Encoder Representations from Transformers (BERT) model was used; its bidirectional attention mechanism allowed the model to understand the context of each word in a given sentence, significantly improving its ability to comprehend complex agricultural terminology and specialized concepts.
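The encoding stage described above gives each modality its own encoder (BERT for text, an acoustic model for audio, a CNN for images), each producing a fixed-size feature vector. A minimal NumPy sketch of that idea, with untrained random parameters standing in for the real trained encoders; the shared width D and all array shapes are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64  # shared embedding width; an illustrative choice, not from the paper

# Random parameter tables standing in for the trained encoders.
text_table = rng.normal(size=(1000, D))   # "BERT"-style token embedding table
audio_proj = rng.normal(size=(40, D))     # projection for 40 acoustic features
image_kernel = rng.normal(size=(3, 3))    # a single convolutional filter
image_proj = rng.normal(size=(62, D))     # maps pooled conv rows to D dims

def encode_text(token_ids):
    # Stand-in for BERT: mean-pool the query's token embeddings into one vector.
    return text_table[np.asarray(token_ids)].mean(axis=0)

def encode_audio(frames):
    # frames: (T, 40) MFCC-like acoustic features, one row per time step.
    return (np.asarray(frames) @ audio_proj).mean(axis=0)

def encode_image(img):
    # Stand-in for a CNN: one 3x3 convolution, ReLU, row-wise pooling,
    # then a linear projection to the shared width D (img: 64x64 grayscale).
    img = np.asarray(img, dtype=float)
    H, W = img.shape
    conv = np.array([[np.sum(img[i:i + 3, j:j + 3] * image_kernel)
                      for j in range(W - 2)] for i in range(H - 2)])
    pooled = np.maximum(conv, 0).mean(axis=1)   # (H - 2,)
    return pooled @ image_proj[:H - 2]          # (D,)

txt = encode_text([3, 17, 42])
aud = encode_audio(rng.normal(size=(20, 40)))
img = encode_image(rng.normal(size=(64, 64)))
# All three modalities now live in the same D-dimensional space,
# ready for a fusion layer to combine.
```

Projecting every modality into one shared space is what makes the later fusion step a simple operation over same-shaped vectors.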
The system also incorporated acoustic models for processing audio inputs. These models analyzed spoken queries from farmers, allowing the system to understand natural language inputs even in noisy, non-ideal environments, a common challenge in real-world agricultural settings. Additionally, convolutional neural networks (CNNs) were employed to process images from various stages of cabbage growth. CNNs are highly effective at capturing spatial hierarchies in images, making them well suited for tasks such as identifying pests, diseases, or growth abnormalities in cabbage crops. These features were subsequently fused in a Transformer-based fusion layer, which served as the core of the Agri-QA Net architecture. The fusion process ensured that each modality (text, audio, and image) contributed effectively to the model's understanding of a given query, allowing the system to provide more nuanced answers to complex agricultural questions, such as identifying specific crop diseases or determining optimal irrigation schedules for cabbage crops. In addition to the fusion layer, cross-modal attention mechanisms and domain-adaptive techniques were incorporated to refine the model's ability to understand and apply specialized agricultural knowledge. The cross-modal attention mechanism facilitated dynamic interactions between the text, audio, and image data, ensuring that the model attended to the most relevant features from each modality. Domain-adaptive techniques further enhanced the system's performance by tailoring it to specific agricultural contexts, such as cabbage farming, pest control, or irrigation management. [Results and Discussions] The experimental evaluations demonstrated that Agri-QA Net outperformed traditional single-modal or simple multimodal models in agricultural knowledge tasks.
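The cross-modal attention described above can be illustrated with a small NumPy sketch of scaled dot-product attention in which each modality's embedding attends over the stacked features of all modalities. This is a simplified single-layer stand-in for the paper's Transformer-based fusion layer; the width D and the random weight matrices are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(1)
D = 64  # shared feature width; illustrative only

Wq = rng.normal(size=(D, D)) / np.sqrt(D)  # query projection
Wk = rng.normal(size=(D, D)) / np.sqrt(D)  # key projection
Wv = rng.normal(size=(D, D)) / np.sqrt(D)  # value projection

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def cross_modal_attention(query_vec, modal_feats):
    # query_vec: (D,) one modality's embedding.
    # modal_feats: (M, D) stacked embeddings of all M modalities.
    q = query_vec @ Wq                     # (D,)
    k = modal_feats @ Wk                   # (M, D)
    v = modal_feats @ Wv                   # (M, D)
    weights = softmax(k @ q / np.sqrt(D))  # (M,) attention over modalities
    return weights @ v, weights            # fused vector and its weights

# Stacked [text, audio, image] embeddings; each row attends over all three,
# and the attended results are averaged into one fused representation.
feats = rng.normal(size=(3, D))
fused = np.mean([cross_modal_attention(f, feats)[0] for f in feats], axis=0)
```

Because the softmax weights are computed per query, a noisy audio embedding can be down-weighted when the text and image features answer the question on their own, which is the behavior the abstract attributes to the cross-modal attention mechanism.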
With the support of multimodal inputs, the system achieved an accuracy of 89.5%, a precision of 87.9%, a recall of 91.3%, and an F1-Score of 89.6%, all significantly higher than those of single-modality models. The integration of multimodal data significantly enhanced the system's capacity to understand complex agricultural queries, providing more precise and context-aware answers. The addition of cross-modal attention mechanisms enabled more nuanced and dynamic interaction between the text, audio, and image data, which in turn improved the model's understanding of ambiguous or context-dependent queries, such as disease diagnosis or crop management. Furthermore, the domain-adaptive technique enabled the system to focus on specific agricultural terminology and concepts, thereby enhancing its performance in specialized tasks such as cabbage cultivation and pest control. The case studies presented further validated the system's ability to assist farmers by providing actionable, domain-specific answers to their questions, demonstrating its practical application in real-world agricultural scenarios. [Conclusions] The proposed Agri-QA Net framework is an effective solution for addressing agricultural knowledge questions, especially in the domain of cabbage cultivation. By integrating multimodal data and leveraging advanced deep learning techniques, the system demonstrates a high level of accuracy and adaptability. This study not only highlights the potential of multimodal fusion in agriculture but also paves the way for future developments in intelligent systems designed to support precision farming. Further work will focus on enhancing the model's performance by expanding the dataset to include more diverse agricultural scenarios, refining the handling of dialectal variations in audio inputs, and improving the system's ability to detect rare crop diseases.
The ultimate goal is to contribute to the modernization of agricultural practices, offering farmers more reliable and effective tools to address the challenges of crop management.
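The reported scores follow the standard definitions over true/false positives and negatives, and they are internally consistent: F1 = 2PR/(P + R) computed from the reported precision and recall rounds to the reported 89.6%. A short Python sketch of the metric computation (the binary judgment lists are hypothetical; only the final consistency check uses the abstract's numbers):

```python
def qa_metrics(y_true, y_pred):
    # Binary relevance judgments: 1 = correct/relevant answer, 0 = not.
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# Consistency check against the reported figures: P = 87.9%, R = 91.3%
# give F1 = 2PR / (P + R) ≈ 89.6%, matching the abstract.
p, r = 0.879, 0.913
f1 = 2 * p * r / (p + r)
```

Reporting all four metrics matters here because Q&A datasets are rarely balanced; a high accuracy alone could hide poor recall on rare but important queries such as uncommon disease symptoms.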

Key words: multimodal fusion, human-computer interaction, agricultural knowledge Q&A, cabbage crops, large language model

CLC number: