
Smart Agriculture ›› 2025, Vol. 7 ›› Issue (1): 1-10. DOI: 10.12133/j.smartag.SA202411005

• Topic--Intelligent Agricultural Knowledge Services and Smart Unmanned Farms (Part 2) •

Agri-QA Net: Multimodal Fusion Large Language Model Architecture for Crop Knowledge Question-Answering System

WU Huarui, ZHAO Chunjiang, LI Jingchen

  1. Information Technology Research Center, Beijing Academy of Agriculture and Forestry Sciences, Beijing 100079, China
  • Received: 2024-10-31  Online: 2025-01-30
  • Foundation items:
    National Key Research and Development Program of China (2021ZD0113604); Scientific and Technological Innovation 2030-Major Project (2022ZD0115705-05)
  • About author: WU Huarui, E-mail:
  • Corresponding author: ZHAO Chunjiang, E-mail:

Abstract:

[Objective] As agriculture increasingly relies on technological innovation to boost productivity and ensure sustainability, farmers need efficient and accurate tools to support their decision-making. A key challenge in this context is the retrieval of specialized agricultural knowledge, which is complex and diverse in nature. Traditional agricultural knowledge retrieval systems have often been limited to a single modality (e.g., text or images alone), which restricts their effectiveness in addressing the wide range of queries farmers face. To address this challenge, a specialized multimodal question-answering system tailored for cabbage cultivation was proposed. The system, named Agri-QA Net, integrates multimodal data to enhance the accuracy and applicability of agricultural knowledge retrieval. By incorporating diverse data modalities, Agri-QA Net provides a holistic approach to agricultural knowledge retrieval, enabling farmers to interact with the system through multiple types of input, from spoken queries to images of crop conditions. In doing so, it addresses the complexity of real-world agricultural environments and improves the accessibility of relevant information.

[Methods] The architecture of Agri-QA Net was built on the integration of multiple data modalities, including textual, auditory, and visual data. This multifaceted approach enabled the system to develop a comprehensive understanding of agricultural knowledge and to learn from a wide array of sources, enhancing its robustness and generalizability. The system incorporated state-of-the-art deep learning models, each designed to handle one specific type of data. The bidirectional attention mechanism of Bidirectional Encoder Representations from Transformers (BERT) allowed the model to understand the context of each word in a given sentence, significantly improving its comprehension of complex agricultural terminology and specialized concepts. The system also incorporated acoustic models for processing audio inputs. These models analyzed spoken queries from farmers, allowing the system to understand natural language inputs even in noisy, non-ideal environments, a common challenge in real-world agricultural settings. Additionally, convolutional neural networks (CNNs) were employed to process images from various stages of cabbage growth. CNNs are highly effective at capturing spatial hierarchies in images, making them well suited to tasks such as identifying pests, diseases, or growth abnormalities in cabbage crops. These features were subsequently fused in a Transformer-based fusion layer, which served as the core of the Agri-QA Net architecture. The fusion process ensured that each modality (text, audio, and image) contributed effectively to the model's understanding of a given query, allowing the system to provide more nuanced answers to complex agricultural questions, such as identifying specific crop diseases or determining optimal irrigation schedules for cabbage crops. In addition to the fusion layer, cross-modal attention mechanisms and domain-adaptive techniques were incorporated to refine the model's ability to understand and apply specialized agricultural knowledge. The cross-modal attention mechanism facilitated dynamic interactions between the text, audio, and image data, ensuring that the model attended to the most relevant features from each modality. Domain-adaptive techniques further enhanced the system's performance by tailoring it to specific agricultural contexts, such as cabbage farming, pest control, or irrigation management.
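The paper's implementation is not reproduced here; the following is a minimal PyTorch sketch of the Transformer-based fusion layer with cross-modal attention described above. All module names, embedding sizes, token counts, and the pooling/classification head are illustrative assumptions, not the authors' published code.

# Minimal sketch of the cross-modal fusion idea (illustrative assumptions
# throughout; not the authors' published implementation).
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, d_model=768, n_heads=8, n_layers=2, n_classes=256):
        super().__init__()
        # Project each modality's encoder output into a shared space.
        # Assumed feature sizes: BERT text tokens (768), acoustic frames
        # from a speech encoder (512), CNN spatial features (2048).
        self.text_proj = nn.Linear(768, d_model)
        self.audio_proj = nn.Linear(512, d_model)
        self.image_proj = nn.Linear(2048, d_model)
        # Cross-modal attention: text tokens query audio/image tokens.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Transformer fusion layer over the combined token sequence.
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.classifier = nn.Linear(d_model, n_classes)  # hypothetical answer head

    def forward(self, text_feats, audio_feats, image_feats):
        # Inputs: (batch, tokens_per_modality, modality_feature_dim).
        t = self.text_proj(text_feats)
        a = self.audio_proj(audio_feats)
        v = self.image_proj(image_feats)
        kv = torch.cat([a, v], dim=1)
        # Text attends over the other modalities' tokens.
        ctx, _ = self.cross_attn(query=t, key=kv, value=kv)
        fused = self.fusion(torch.cat([t + ctx, a, v], dim=1))
        # Mean-pool the fused tokens and classify (simple placeholder head).
        return self.classifier(fused.mean(dim=1))

# Random tensors stand in for the three encoders' outputs:
model = CrossModalFusion()
logits = model(torch.randn(2, 16, 768),   # BERT token embeddings
               torch.randn(2, 50, 512),   # acoustic frames
               torch.randn(2, 49, 2048))  # CNN feature-map patches
print(logits.shape)  # torch.Size([2, 256])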
[Results and Discussions] The experimental evaluations demonstrated that Agri-QA Net outperformed traditional single-modal and simple multimodal models on agricultural knowledge tasks. With the support of multimodal inputs, the system achieved an accuracy of 89.5%, a precision of 87.9%, a recall of 91.3%, and an F1-Score of 89.6%, all significantly higher than those of single-modality models. The integration of multimodal data significantly enhanced the system's capacity to understand complex agricultural queries, yielding more precise and context-aware answers. The addition of cross-modal attention mechanisms enabled more nuanced and dynamic interaction between the text, audio, and image data, which in turn improved the model's understanding of ambiguous or context-dependent queries, such as disease diagnosis or crop management. Furthermore, the domain-adaptive technique enabled the system to focus on specific agricultural terminology and concepts, enhancing its performance on specialized tasks such as cabbage cultivation and pest control. The case studies presented further validated the system's ability to assist farmers by providing actionable, domain-specific answers, demonstrating its practical application in real-world agricultural scenarios.

[Conclusions] The proposed Agri-QA Net framework is an effective solution for answering agricultural knowledge questions, especially in the domain of cabbage cultivation. By integrating multimodal data and leveraging advanced deep learning techniques, the system demonstrates a high level of accuracy and adaptability. This study not only highlights the potential of multimodal fusion in agriculture but also paves the way for future developments in intelligent systems designed to support precision farming. Further work will focus on enhancing the model's performance by expanding the dataset to cover more diverse agricultural scenarios, refining the handling of dialectal variation in audio inputs, and improving the system's ability to detect rare crop diseases. The ultimate goal is to contribute to the modernization of agricultural practices, offering farmers more reliable and effective tools for the challenges of crop management.
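As a quick arithmetic check (a routine verification, not additional results from the paper), the reported F1-Score follows from the stated precision and recall as their harmonic mean, F1 = 2PR/(P + R):

# Verify the reported F1-Score from the stated precision and recall.
precision, recall = 0.879, 0.913
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # 0.896, matching the reported 89.6%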

Key words: multimodal fusion, human-computer interaction, agricultural knowledge Q&A, cabbage crops, large language model

CLC Number: