
Smart Agriculture


Multi-Source Remote Sensing Crop Classification Based on a Cross-Modal Attention Mechanism

WU Chenxu, ZUO Haolong, LI Gang

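  1. School of Geomatics Engineering, Heilongjiang Institute of Technology, Harbin 150050, Heilongjiang, China
  • Received: 2025-10-11 Published: 2026-01-23
  • Funding:
    Heilongjiang Province "Double First-Class" Discipline Collaborative Innovation Achievement Project (LJGXCG2025-P18)
  • About the author:

    WU Chenxu, master's student; research interest: spatiotemporal big data. E-mail:

  • Corresponding author:
    LI Gang, Ph.D., lecturer; research interest: spatiotemporal big data. E-mail: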

Multi-Source Remote Sensing Crop Classification Via Cross-Modal Attention

WU Chenxu, ZUO Haolong, LI Gang

  1. School of Geomatics Engineering, Heilongjiang Institute of Technology, Harbin, Heilongjiang Province, 150050
  • Received: 2025-10-11 Online: 2026-01-23
  • Foundation items: Heilongjiang Province Double First-Class Discipline Coordinated Innovation Achievement Project (LJGXCG2025-P18)
  • About author:

    WU Chenxu, E-mail:

  • Corresponding author:
    LI Gang, E-mail:

Abstract (Chinese version):

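[Objective/Significance] This study aims to overcome the limitations of conventional optical remote sensing under cloudy weather and in terrain with fragmented farmland by building a deep network model with strong cross-modal fusion capability and generalization performance, thereby improving the accuracy of agricultural remote sensing classification. [Methods] A 3D convolutional neural network based on an attention mechanism (Attention-3DCNN) is proposed. It processes time-series Sentinel-2 multispectral imagery with a combined 3D-convolution and 2D-convolution structure, extracting rich features along the spatial and temporal dimensions. In parallel, it processes Synthetic Aperture Radar (SAR) data from Sentinel-1 with depthwise separable convolutions, efficiently extracting information that is available in all weather. The model further introduces a channel-temporal-spatial triple-attention mechanism and a residual connection strategy to dynamically weight and deeply fuse the features of the two modalities, so that when optical data are missing or heavily obscured, the SAR data can compensate effectively and maintain classification performance. [Results and Discussion] To evaluate the model comprehensively, comparative experiments were conducted on the French Panoptic Agricultural Satellite Time Series dataset and on a field-measured dataset from Yishui County, Shandong. The model reached an overall accuracy of 97.5% on the French data and 93% on the Yishui County data, both significantly better than the baseline models. Visualization of the attention distribution further shows that the key phenological periods the model focuses on agree closely with local agricultural field records, and that the highly weighted spectral bands are consistent with agronomic mechanisms, which reflects the interpretability of the model's discrimination mechanism. [Conclusions] In summary, the Attention-3DCNN model significantly improves crop classification accuracy in mountainous areas with fragmented farmland and severe cloud cover, and has good prospects for wider adoption and practical application.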

Keywords: crop classification, deep learning, convolutional neural network, attention mechanism, remote sensing

Abstract:

[Objective] Accurate and timely crop mapping is fundamental for agricultural management, yield forecasting, and food security assessment. However, in mountainous and hilly regions characterized by frequent cloud cover and highly fragmented farmland, crop classification methods relying solely on optical remote sensing data are severely constrained. Persistent cloud contamination introduces data gaps and temporal inconsistencies in optical image time series, significantly degrading classification accuracy and robustness. This challenge is particularly pronounced in many agricultural regions of China, where small and irregular field parcels further complicate crop discrimination. To address these limitations, a robust and adaptive deep learning framework capable of effectively integrating multi-modal remote sensing data was developed. The primary objective was to enhance crop classification accuracy and stability under complex conditions where optical observations are scarce or unreliable, thereby supporting reliable agricultural monitoring in cloudy and fragmented landscapes. [Methods] A novel deep neural network architecture named Attention-3DCNN was proposed to jointly exploit multi-temporal optical and Synthetic Aperture Radar (SAR) observations. The model integrated Sentinel-2 multispectral time-series imagery with weather-insensitive Sentinel-1 SAR data through a dedicated cross-modal fusion strategy driven by a triple-attention mechanism. The network adopted a dual-branch feature extraction architecture. For the Sentinel-2 data, a hybrid module combining three-dimensional and two-dimensional convolutional neural networks (3D-CNN and 2D-CNN) was employed to capture discriminative spatiotemporal features and crop phenological dynamics across the growing season. This design enabled effective modeling of the spectral–temporal interactions inherent in crop development. 
For the Sentinel-1 SAR data, depthwise separable convolutions were utilized to efficiently extract spatial and textural features related to crop structure and surface scattering characteristics while reducing computational complexity. Features extracted from both modalities were subsequently integrated using a custom-designed attention-based fusion module. This module consisted of three complementary attention mechanisms: channel attention, temporal attention, and spatial attention. Residual connections were incorporated throughout the network to facilitate stable training and effective gradient propagation. The proposed model was evaluated on two datasets to assess both its performance and generalizability. The first was the publicly available Panoptic Agricultural Satellite Time Series (PASTIS) benchmark dataset from France, which contained dense time-series observations and multiple crop classes. The second was a real-world dataset constructed for Yishui County, Shandong Province, China, which was characterized by high cloud frequency (approximately 33%), highly fragmented farmland (average parcel size < 0.5 ha), and a relatively simple crop rotation system. Comparative experiments were conducted against several state-of-the-art models, including 3D-ConvSTAR, UNet++, Self-Attention 3D, a CNN–LSTM dual-stream network, and TGF-Net. Ablation studies were also performed to quantify the contribution of each attention component. [Results and Discussions] Experimental results demonstrated that Attention-3DCNN consistently outperformed all baseline methods on both datasets. On the PASTIS benchmark, the model achieved an overall accuracy (OA) of 97.5%, confirming its strong classification capability under favorable observation conditions. On the more challenging Yishui County dataset, Attention-3DCNN attained an OA of 93%, outperforming the other comparison models. 
Ablation experiments confirmed the effectiveness of the proposed triple-attention mechanism, as removing any attention component resulted in a clear reduction in classification performance. Under heavy cloud coverage, Attention-3DCNN exhibited the smallest accuracy degradation, with an OA drop of only 3.6 percentage points, indicating its ability to adaptively rely on SAR information when optical data quality deteriorated. In regions with highly fragmented farmland, the proposed model also maintained the highest accuracy and the smallest performance decline (2.8 percentage points), benefiting from the spatial attention mechanism. Moreover, attention visualization provided meaningful interpretability. Temporal attention peaks aligned with key crop phenological stages, while channel attention highlighted spectrally and physically informative optical bands and SAR polarizations, which was consistent with established agronomic and remote sensing knowledge. [Conclusions] This study presents the Attention-3DCNN model for accurate and robust crop classification in regions affected by persistent cloud cover and fragmented agricultural landscapes. By fusing Sentinel-2 optical and Sentinel-1 SAR time-series data through a channel–temporal–spatial triple-attention mechanism, the proposed framework enables adaptive integration of complementary multi-modal information. The model achieves state-of-the-art performance on both benchmark and real-world datasets, demonstrates strong robustness under adverse conditions, and offers enhanced interpretability. Overall, the proposed approach provides a reliable and practical solution for crop mapping in complex agricultural environments.
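
The idea of temporal attention and of leaning on SAR when optical data degrade can be illustrated with a toy sketch. This is not the authors' implementation: the abstract gives no layer definitions, so the function names, the hand-set scores, and the single sigmoid gate below are all assumptions made for illustration. In a trained network the scores and the gate would come from learned layers.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def temporal_attention_pool(features, scores):
    """Collapse T per-date feature vectors into one weighted vector.

    features: list of T equal-length lists (one feature vector per date).
    scores:   list of T floats (hypothetical; hand-set here, learned in practice).
    """
    weights = softmax(scores)
    dim = len(features[0])
    pooled = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    return pooled, weights

def gated_fusion(optical_vec, sar_vec, gate_logit):
    """Blend two same-length feature vectors with a sigmoid gate in (0, 1)."""
    gate = 1.0 / (1.0 + math.exp(-gate_logit))
    return [gate * o + (1.0 - gate) * s for o, s in zip(optical_vec, sar_vec)]

# Toy data: 3 acquisition dates, 2 features per date (values are made up).
optical_series = [[0.2, 0.4], [0.9, 0.1], [0.5, 0.5]]
optical_scores = [0.0, 2.0, -1.0]  # the second date is scored as most informative
sar_vec = [0.3, 0.6]

optical_vec, weights = temporal_attention_pool(optical_series, optical_scores)
fused_clear = gated_fusion(optical_vec, sar_vec, gate_logit=3.0)    # gate favors optical
fused_cloudy = gated_fusion(optical_vec, sar_vec, gate_logit=-3.0)  # gate favors SAR
```

With a strongly negative gate logit the fused vector moves toward the SAR features, which is the qualitative behavior the abstract attributes to the model under heavy cloud cover.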

Key words: crop classification, deep learning, convolutional neural network, attention mechanism, remote sensing
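
As a rough illustration of why the depthwise separable convolutions mentioned for the SAR branch reduce computational cost, the sketch below compares weight counts for a standard 2D convolution and its depthwise separable counterpart (a per-channel k x k filter followed by a 1 x 1 pointwise convolution). The kernel size and channel counts are hypothetical; the abstract does not state the layer dimensions used in Attention-3DCNN.

```python
def standard_conv_params(kernel, c_in, c_out):
    """Weights in a standard 2D convolution (bias terms ignored)."""
    return kernel * kernel * c_in * c_out

def depthwise_separable_params(kernel, c_in, c_out):
    """Weights in a depthwise (k x k per input channel) plus pointwise (1 x 1) pair."""
    depthwise = kernel * kernel * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise

# Hypothetical layer: 3 x 3 kernel, 32 input channels, 64 output channels.
standard = standard_conv_params(3, 32, 64)
separable = depthwise_separable_params(3, 32, 64)
reduction = standard / separable

print(standard, separable, round(reduction, 2))
```

For these assumed sizes the separable layer needs 2,336 weights instead of 18,432, roughly an eightfold reduction; the saving grows with the number of output channels and the kernel size.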

CLC number: