
Smart Agriculture ›› 2025, Vol. 7 ›› Issue (1): 1-10. doi: 10.12133/j.smartag.SA202411005

• Topic: Intelligent Agricultural Knowledge Services and Smart Unmanned Farms (Part 2) •

A Crop Knowledge Question-Answering System Based on Agri-QA Net, a Multimodal Fusion Large Model Architecture

WU Huarui, ZHAO Chunjiang(), LI Jingchen

  1. Information Technology Research Center, Beijing Academy of Agriculture and Forestry Sciences, Beijing 100079, China
  • Received: 2024-10-31 Online: 2025-01-30
  • Foundation items:
    National Key Research and Development Program of China (2021ZD0113604); Scientific and Technological Innovation 2030-Major Project (2022ZD0115705-05)
  • About the author:
    WU Huarui, Ph.D., Professor; research interests: large language models and agricultural knowledge services. E-mail:
  • Corresponding author:
    ZHAO Chunjiang, Ph.D., Professor, Academician of the Chinese Academy of Engineering; research interests: large language models and agricultural knowledge services. E-mail:

Agri-QA Net: Multimodal Fusion Large Language Model Architecture for Crop Knowledge Question-Answering System

WU Huarui, ZHAO Chunjiang(), LI Jingchen   

  1. Information Technology Research Center, Beijing Academy of Agriculture and Forestry Sciences, Beijing 100079, China
  • Received: 2024-10-31 Online: 2025-01-30
  • Foundation items: National Key Research and Development Program of China (2021ZD0113604); Scientific and Technological Innovation 2030-Major Project (2022ZD0115705-05)
  • About author:
    WU Huarui, E-mail:

  • Corresponding author:
    ZHAO Chunjiang, E-mail:

Abstract:

[Objective/Significance] With the rapid development of agricultural informatization and intelligence, multimodal human-computer interaction technology is becoming increasingly important in agriculture. This study proposes Agri-QA Net, a large model architecture based on multimodal fusion, designed as a specialized multimodal question-answering system for agricultural knowledge on cabbage crops. [Methods] The model integrates text, audio, and image data: a pre-trained BERT (Bidirectional Encoder Representations from Transformers) model extracts textual features, an acoustic model extracts audio features, and a convolutional neural network extracts image features; a Transformer-based fusion layer then integrates these features. In addition, a cross-modal attention mechanism and domain-adaptive techniques are introduced to strengthen the model's understanding and application of specialized agricultural knowledge. Multimodal data related to cabbage cultivation were collected and preprocessed to train and optimize the Agri-QA Net model. [Results and Discussions] Experimental evaluations show that the model performs well on cabbage knowledge question-answering tasks, with higher accuracy and better generalization than traditional single-modal or simple multimodal models. With multimodal input, it achieved an accuracy of 89.5%, a precision of 87.9%, a recall of 91.3%, and an F1-Score of 89.6%, all significantly higher than those of single-modality models. [Conclusions] Case studies demonstrate the application of Agri-QA Net in real agricultural scenarios and confirm its effectiveness in helping farmers solve practical problems. Future work will explore applications in more agricultural scenarios and further optimize model performance.

Keywords: multimodal fusion, human-computer interaction, agricultural knowledge question answering, cabbage crops, large language model

Abstract:

[Objective] As agriculture increasingly relies on technological innovations to boost productivity and ensure sustainability, farmers need efficient and accurate tools to aid their decision-making. A key challenge in this context is the retrieval of specialized agricultural knowledge, which can be complex and diverse in nature. Traditional agricultural knowledge retrieval systems have often been limited by the modalities they utilize (e.g., text or images alone), which restricts their effectiveness in addressing the wide range of queries farmers face. To address this challenge, a specialized multimodal question-answering system tailored for cabbage cultivation, named Agri-QA Net, was proposed. The system integrates multimodal data to enhance the accuracy and applicability of agricultural knowledge retrieval. By incorporating diverse data modalities, Agri-QA Net provides a holistic approach to agricultural knowledge retrieval, enabling farmers to interact with the system using multiple types of input, ranging from spoken queries to images of crop conditions. In doing so, it addresses the complexity of real-world agricultural environments and improves the accessibility of relevant information. [Methods] The architecture of Agri-QA Net was built upon the integration of multiple data modalities: textual, auditory, and visual. This multifaceted approach enabled the system to develop a comprehensive understanding of agricultural knowledge and to learn from a wide array of sources, enhancing its robustness and generalizability. The system incorporated state-of-the-art deep learning models, each designed to handle one specific type of data. For text, a pre-trained Bidirectional Encoder Representations from Transformers (BERT) model was used; its bidirectional attention mechanism allowed the model to understand the context of each word in a given sentence, significantly improving its ability to comprehend complex agricultural terminology and specialized concepts.
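The encoding stage described above gives each modality its own encoder (BERT for text, an acoustic model for audio, a CNN for images), each producing a fixed-size feature vector. A minimal NumPy sketch of that idea, with untrained random parameters standing in for the real trained encoders; the shared width D and all array shapes are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64  # shared embedding width; an illustrative choice, not from the paper

# Random parameter tables standing in for the trained encoders.
text_table = rng.normal(size=(1000, D))   # "BERT"-style token embedding table
audio_proj = rng.normal(size=(40, D))     # projection for 40 acoustic features
image_kernel = rng.normal(size=(3, 3))    # a single convolutional filter
image_proj = rng.normal(size=(62, D))     # maps pooled conv rows to D dims

def encode_text(token_ids):
    # Stand-in for BERT: mean-pool the query's token embeddings into one vector.
    return text_table[np.asarray(token_ids)].mean(axis=0)

def encode_audio(frames):
    # frames: (T, 40) MFCC-like acoustic features, one row per time step.
    return (np.asarray(frames) @ audio_proj).mean(axis=0)

def encode_image(img):
    # Stand-in for a CNN: one 3x3 convolution, ReLU, row-wise pooling,
    # then a linear projection to the shared width D (img: 64x64 grayscale).
    img = np.asarray(img, dtype=float)
    H, W = img.shape
    conv = np.array([[np.sum(img[i:i + 3, j:j + 3] * image_kernel)
                      for j in range(W - 2)] for i in range(H - 2)])
    pooled = np.maximum(conv, 0).mean(axis=1)   # (H - 2,)
    return pooled @ image_proj[:H - 2]          # (D,)

txt = encode_text([3, 17, 42])
aud = encode_audio(rng.normal(size=(20, 40)))
img = encode_image(rng.normal(size=(64, 64)))
# All three modalities now live in the same D-dimensional space,
# ready for a fusion layer to combine.
```

Projecting every modality into one shared space is what makes the later fusion step a simple operation over same-shaped vectors.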
The system also incorporated acoustic models for processing audio inputs. These models analyzed spoken queries from farmers, allowing the system to understand natural language inputs even in noisy, non-ideal environments, a common challenge in real-world agricultural settings. Additionally, convolutional neural networks (CNNs) were employed to process images from various stages of cabbage growth. CNNs are highly effective at capturing spatial hierarchies in images, making them well suited for tasks such as identifying pests, diseases, or growth abnormalities in cabbage crops. These features were subsequently fused in a Transformer-based fusion layer, which served as the core of the Agri-QA Net architecture. The fusion process ensured that each modality (text, audio, and image) contributed effectively to the model's understanding of a given query, allowing the system to provide more nuanced answers to complex agricultural questions, such as identifying specific crop diseases or determining optimal irrigation schedules for cabbage crops. In addition to the fusion layer, cross-modal attention mechanisms and domain-adaptive techniques were incorporated to refine the model's ability to understand and apply specialized agricultural knowledge. The cross-modal attention mechanism facilitated dynamic interactions between the text, audio, and image data, ensuring that the model attended to the most relevant features from each modality. Domain-adaptive techniques further enhanced the system's performance by tailoring it to specific agricultural contexts, such as cabbage farming, pest control, or irrigation management. [Results and Discussions] The experimental evaluations demonstrated that Agri-QA Net outperformed traditional single-modal or simple multimodal models in agricultural knowledge tasks.
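The cross-modal attention described above can be illustrated with a small NumPy sketch of scaled dot-product attention in which each modality's embedding attends over the stacked features of all modalities. This is a simplified single-layer stand-in for the paper's Transformer-based fusion layer; the width D and the random weight matrices are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(1)
D = 64  # shared feature width; illustrative only

Wq = rng.normal(size=(D, D)) / np.sqrt(D)  # query projection
Wk = rng.normal(size=(D, D)) / np.sqrt(D)  # key projection
Wv = rng.normal(size=(D, D)) / np.sqrt(D)  # value projection

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def cross_modal_attention(query_vec, modal_feats):
    # query_vec: (D,) one modality's embedding.
    # modal_feats: (M, D) stacked embeddings of all M modalities.
    q = query_vec @ Wq                     # (D,)
    k = modal_feats @ Wk                   # (M, D)
    v = modal_feats @ Wv                   # (M, D)
    weights = softmax(k @ q / np.sqrt(D))  # (M,) attention over modalities
    return weights @ v, weights            # fused vector and its weights

# Stacked [text, audio, image] embeddings; each row attends over all three,
# and the attended results are averaged into one fused representation.
feats = rng.normal(size=(3, D))
fused = np.mean([cross_modal_attention(f, feats)[0] for f in feats], axis=0)
```

Because the softmax weights are computed per query, a noisy audio embedding can be down-weighted when the text and image features answer the question on their own, which is the behavior the abstract attributes to the cross-modal attention mechanism.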
With the support of multimodal inputs, the system achieved an accuracy of 89.5%, a precision of 87.9%, a recall of 91.3%, and an F1-Score of 89.6%, all significantly higher than those of single-modality models. The integration of multimodal data significantly enhanced the system's capacity to understand complex agricultural queries, providing more precise and context-aware answers. The addition of cross-modal attention mechanisms enabled more nuanced and dynamic interaction between the text, audio, and image data, which in turn improved the model's understanding of ambiguous or context-dependent queries, such as disease diagnosis or crop management. Furthermore, the domain-adaptive technique enabled the system to focus on specific agricultural terminology and concepts, thereby enhancing its performance in specialized tasks such as cabbage cultivation and pest control. The case studies presented further validated the system's ability to assist farmers by providing actionable, domain-specific answers to their questions, demonstrating its practical application in real-world agricultural scenarios. [Conclusions] The proposed Agri-QA Net framework is an effective solution for addressing agricultural knowledge questions, especially in the domain of cabbage cultivation. By integrating multimodal data and leveraging advanced deep learning techniques, the system demonstrates a high level of accuracy and adaptability. This study not only highlights the potential of multimodal fusion in agriculture but also paves the way for future developments in intelligent systems designed to support precision farming. Further work will focus on enhancing the model's performance by expanding the dataset to include more diverse agricultural scenarios, refining the handling of dialectal variations in audio inputs, and improving the system's ability to detect rare crop diseases.
The ultimate goal is to contribute to the modernization of agricultural practices, offering farmers more reliable and effective tools to address the challenges of crop management.
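The reported scores follow the standard definitions over true/false positives and negatives, and they are internally consistent: F1 = 2PR/(P + R) computed from the reported precision and recall rounds to the reported 89.6%. A short Python sketch of the metric computation (the binary judgment lists are hypothetical; only the final consistency check uses the abstract's numbers):

```python
def qa_metrics(y_true, y_pred):
    # Binary relevance judgments: 1 = correct/relevant answer, 0 = not.
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# Consistency check against the reported figures: P = 87.9%, R = 91.3%
# give F1 = 2PR / (P + R) ≈ 89.6%, matching the abstract.
p, r = 0.879, 0.913
f1 = 2 * p * r / (p + r)
```

Reporting all four metrics matters here because Q&A datasets are rarely balanced; a high accuracy alone could hide poor recall on rare but important queries such as uncommon disease symptoms.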

Key words: multimodal fusion, human-computer interaction, agricultural knowledge Q&A, cabbage crops, large language model

CLC number: