Cross-Modal Attention for Multi-Source Remote Sensing Crop Classification under Cloud Occlusion and Complex Field Scenarios

doi:10.12133/j.smartag.SA202510010

Abstract

Summary:

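[Objective/Significance] This study aims to overcome the limitations of conventional optical remote sensing under cloudy weather and fragmented farmland terrain by building a deep network model with strong cross-modal fusion and generalization capability, thereby improving the accuracy of agricultural remote sensing classification. [Methods] A 3D convolutional neural network based on attention mechanism (Attention-3DCNN) is proposed. It processes Sentinel-2 multispectral time-series imagery with a 3D-convolution plus 2D-convolution structure to extract rich features along the spatial and temporal dimensions. In parallel, it processes Sentinel-1 synthetic aperture radar (SAR) data with depthwise separable convolutions, efficiently extracting information that is available in all weather conditions. The model further introduces a channel-temporal-spatial triple-attention mechanism and residual connections to dynamically weight and deeply fuse the features of the two modalities, so that SAR data can compensate effectively and maintain classification performance when optical data are missing or heavily occluded. [Results and Discussions] To evaluate the model comprehensively, comparative experiments were conducted on the French panoptic agricultural satellite time series (PASTIS) dataset and on a field-surveyed dataset from Yishui County, Shandong Province. The model reached an overall accuracy of 97.5% on the French data and 93% on the Yishui County data, both significantly better than the baseline models. Visualization of the attention distributions further shows that the key phenological periods the model focuses on agree closely with local agricultural field records, and that the highly weighted spectral bands are consistent with agronomic mechanisms, which reflects the interpretability of the model's discrimination mechanism. [Conclusions] In summary, the Attention-3DCNN model significantly improves crop classification accuracy in mountainous areas with fragmented farmland and severe cloud cover, and shows good prospects for wider application.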

Keywords: crop classification, deep learning, 3D convolutional neural network, attention mechanism, remote sensing, synthetic aperture radar

Abstract:

[Objective] Accurate and timely crop mapping is fundamental for agricultural management, yield forecasting, and food security assessment. However, in mountainous and hilly regions characterized by frequent cloud cover and highly fragmented farmland, crop classification methods relying solely on optical remote sensing data are severely constrained. Persistent cloud contamination introduces data gaps and temporal inconsistencies in optical image time series, significantly degrading classification accuracy and robustness. To address these limitations, a robust and adaptive deep learning framework capable of effectively integrating multi-modal remote sensing data was developed. The primary objective is to enhance crop classification accuracy and stability under complex conditions where optical observations are scarce or unreliable, thereby supporting reliable agricultural monitoring in cloudy and fragmented landscapes.

[Methods] A novel deep neural network architecture, the 3D convolutional neural network based on attention mechanism (Attention-3DCNN), was proposed to jointly exploit multi-temporal optical and synthetic aperture radar (SAR) observations. The model integrated Sentinel-2 multispectral time-series imagery with weather-insensitive Sentinel-1 SAR data through a dedicated cross-modal fusion strategy driven by a triple-attention mechanism. The network adopted a dual-branch feature extraction architecture. For the Sentinel-2 data, a hybrid module combining three-dimensional and two-dimensional convolutional neural networks (3D-CNN and 2D-CNN) was employed to capture discriminative spatiotemporal features and crop phenological dynamics across the growing season. This design enabled effective modeling of the spectral-temporal interactions inherent in crop development. For the Sentinel-1 SAR data, depthwise separable convolutions were utilized to efficiently extract spatial and textural features related to crop structure and surface scattering characteristics while reducing computational complexity. Features extracted from both modalities were subsequently integrated using a custom-designed attention-based fusion module. This module consisted of three complementary attention mechanisms: channel attention, temporal attention, and spatial attention. Residual connections were incorporated throughout the network to facilitate stable training and effective gradient propagation. The proposed model was evaluated on two datasets to assess both its performance and generalizability. The first was the publicly available panoptic agricultural satellite time series (PASTIS) benchmark dataset from France, which contained dense time-series observations and multiple crop classes. The second was a real-world dataset constructed for Yishui county, Shandong province, China, which was characterized by high cloud frequency (approximately 33%), highly fragmented farmland (average parcel size < 0.5 hm²), and a relatively simple crop rotation system. Comparative experiments were conducted against several state-of-the-art models, including 3D-ConvSTAR, UNet++, Self-Attention 3D, CNN-LSTM dual-stream network, and TGF-Net. Ablation studies were also performed to quantify the contribution of each attention component.

[Results and Discussions] Experimental results demonstrated that Attention-3DCNN consistently outperformed all baseline methods on both datasets. On the PASTIS benchmark, the model achieved an overall accuracy (OA) of 97.5%, confirming its strong classification capability under favorable observation conditions. On the more challenging Yishui county dataset, Attention-3DCNN attained an OA of 93%, outperforming the other comparison models. Ablation experiments confirmed the effectiveness of the proposed triple-attention mechanism, as removing any attention component resulted in a clear reduction in classification performance. Under heavy cloud coverage, Attention-3DCNN exhibited the smallest accuracy degradation, with an OA drop of only 3.6 percentage points, indicating its ability to adaptively rely on SAR information when optical data quality deteriorated. In regions with highly fragmented farmland, the proposed model also maintained the highest accuracy and the smallest performance decline (2.8 percentage points), benefiting from the spatial attention mechanism. Moreover, attention visualization provided meaningful interpretability. Temporal attention peaks aligned with key crop phenological stages, while channel attention highlighted spectrally and physically informative optical bands and SAR polarizations, consistent with established agronomic and remote sensing knowledge.

[Conclusions] This study presents the Attention-3DCNN model for accurate and robust crop classification in regions affected by persistent cloud cover and fragmented agricultural landscapes. By fusing Sentinel-2 optical and Sentinel-1 SAR time-series data through a channel-temporal-spatial triple-attention mechanism, the proposed framework enables adaptive integration of complementary multi-modal information. The model achieves outstanding performance on both benchmark and real-world datasets, demonstrates strong robustness under adverse conditions, and offers enhanced interpretability. Overall, the proposed approach provides a reliable and practical solution for crop mapping in complex agricultural environments.

Key words: crop classification, deep learning, 3D convolutional neural network, attention mechanism, remote sensing, SAR
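
The abstract describes attention as a dynamic weighting of features over time and across the two modalities. The following is a minimal sketch in plain Python of two of these ideas: temporal attention as a softmax-weighted pooling over acquisition dates, and a gated fusion of an optical and a SAR feature vector. It is an illustration under stated assumptions, not the authors' implementation: the function names, the feature values, and the hand-set scores are all hypothetical, and in a trained network the scores would be produced by learned layers.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention(features, scores):
    """Pool per-date feature vectors into one vector using softmax(scores)."""
    weights = softmax(scores)
    dim = len(features[0])
    pooled = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    return pooled, weights


def fuse_modalities(optical_feat, sar_feat, optical_score, sar_score):
    """Convex combination of two modality vectors with softmax gate weights."""
    w_opt, w_sar = softmax([optical_score, sar_score])
    fused = [w_opt * o + w_sar * s for o, s in zip(optical_feat, sar_feat)]
    return fused, (w_opt, w_sar)


# Hypothetical optical series: three dates, two features per date.
optical_series = [[0.2, 0.1], [0.8, 0.6], [0.4, 0.3]]
# Hand-set scores: the second date is treated as the most informative.
pooled, time_weights = temporal_attention(optical_series, scores=[0.0, 2.0, 0.5])

# A low optical score mimics a cloudy scene, so the gate leans on SAR.
fused, (w_opt, w_sar) = fuse_modalities(
    pooled, [0.5, 0.5], optical_score=-1.0, sar_score=1.0
)
```

With these hand-set scores the second date receives the largest temporal weight, and the SAR branch dominates the fused vector, which mirrors the compensation behaviour the abstract reports for cloudy conditions.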

CLC numbers:

S127
TP18

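WU Chenxu (巫晨旭), ZUO Haolong (左浩龙), LI Gang (李刚). Cross-modal attention for multi-source remote sensing crop classification under cloud occlusion and complex field scenarios[J]. 智慧农业(中英文), 2026, 8(2): 118-132.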

WU Chenxu, ZUO Haolong, LI Gang. Cross-Modal Attention for Multi-Source Remote Sensing Crop Classification under Cloud Occlusion and Complex Field Scenarios[J]. Smart Agriculture, 2026, 8(2): 118-132.

PASTIS band	Sentinel-2 band	Native resolution (m)	Resampled resolution (m)
1–4	B2–B4, B8	10	10
5–7	B5–B7	20	10
8	B8A	20	10
9–10	B11–B12	20	10
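
The table above lists six Sentinel-2 bands that are brought from 20 m to 10 m. The source does not state which interpolation method was used, so the sketch below assumes the simplest option, nearest-neighbour upsampling by a factor of two, on a band stored as a nested list. The function name `upsample_nearest` and the tiny 2 x 2 band are hypothetical; a production workflow would more likely rely on a raster library and possibly a different resampling kernel.

```python
def upsample_nearest(band, factor=2):
    """Repeat each pixel `factor` times along both rows and columns."""
    out = []
    for row in band:
        # Widen the row by repeating every value `factor` times.
        wide = [value for value in row for _ in range(factor)]
        # Repeat the widened row `factor` times (independent copies).
        out.extend(list(wide) for _ in range(factor))
    return out


# Hypothetical 2 x 2 band at 20 m resolution.
band_20m = [[1, 2],
            [3, 4]]

# 20 m -> 10 m means each pixel becomes a 2 x 2 block.
band_10m = upsample_nearest(band_20m, factor=20 // 10)
```

After this step every band shares the same 10 m grid, so the bands can be stacked into one multispectral tensor per acquisition date.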

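Parameter	PASTIS (France)	Yishui County (China)	Potential impact on transfer
Cloud cover (%)	28	33	Yishui County requires stronger robustness to cloud contamination
Number of acquisitions (April–September)	32	23	Sparser time series in Yishui County, which tests the temporal attention
Mean parcel area (hm²)	1.2	<0.5	Yishui County requires finer capture of spatial detail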

Model	OA, PASTIS (%)	Macro-F₁, PASTIS	Kappa, PASTIS	OA, Yishui County (%)	Macro-F₁, Yishui County	Kappa, Yishui County
Standard 3D-CNN	94.2	0.932	0.915	87.2	0.845	0.822
Simple optical-SAR fusion model	95.3	0.945	0.928	89.5	0.876	0.858
SAR-only 3D-CNN model	91.6	0.902	0.889	88.3	0.861	0.842
Attention dual-branch fusion model	95.1	0.938	0.935	89.9	0.882	0.870
Channel-attention dual-branch fusion model	96.5	0.953	0.942	90.0	0.898	0.890
Temporal-attention dual-branch fusion model	95.8	0.945	0.942	89.5	0.894	0.887
Spatial-attention dual-branch fusion model	96.9	0.959	0.956	91.0	0.906	0.900
Attention-3DCNN	97.5	0.970	0.965	93.0	0.920	0.910
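
The three metrics reported in the table above (overall accuracy, macro-averaged F₁, and Cohen's kappa) can all be derived from a single confusion matrix. The sketch below uses the standard textbook definitions; the function name `classification_metrics` and the 2 x 2 matrix are hypothetical and are not taken from the paper's experiments.

```python
def classification_metrics(cm):
    """Return (OA, macro-F1, Cohen's kappa) for a square confusion matrix.

    cm[i][j] is the number of samples of true class i predicted as class j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    diag = [cm[i][i] for i in range(n)]
    row_sums = [sum(cm[i]) for i in range(n)]                        # true counts
    col_sums = [sum(cm[i][j] for i in range(n)) for j in range(n)]   # predicted counts

    # Overall accuracy: fraction of correctly classified samples.
    oa = sum(diag) / total

    # Macro-F1: unweighted mean of the per-class F1 scores.
    f1_scores = []
    for k in range(n):
        precision = diag[k] / col_sums[k] if col_sums[k] else 0.0
        recall = diag[k] / row_sums[k] if row_sums[k] else 0.0
        denom = precision + recall
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)
    macro_f1 = sum(f1_scores) / n

    # Kappa: agreement corrected for the agreement expected by chance.
    expected = sum(r * c for r, c in zip(row_sums, col_sums)) / (total * total)
    kappa = (oa - expected) / (1 - expected) if expected != 1 else 0.0
    return oa, macro_f1, kappa


# Hypothetical two-class confusion matrix with 100 samples.
cm = [[50, 10],
      [5, 35]]
oa, macro_f1, kappa = classification_metrics(cm)
```

Because macro-F₁ averages classes without weighting by their frequency, it penalizes poor performance on rare crops more than OA does, which is one reason the two are usually reported together.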

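Model	Core method	OA, Yishui County (%)	Kappa, Yishui County	Parameters (M)	Computation (GFLOPs)	Inference time (ms/scene)
3D-ConvSTAR	3D-CNN multi-source fusion (fixed weights)	89.5	0.858	45.2	128.3	156
Self-Attention 3D	Self-attention mechanism + 3D-CNN	90.5	0.872	52.7	145.6	183
UNet++	Encoder-decoder structure with multi-scale feature fusion	88.0	0.835	68.9	212.4	245
CNN-LSTM-DS	Fusion of optical time series and texture features	86.5	0.815	43.0	98.7	132
TGF-Net	Transformer and convolutional (CNN) architecture	90.5	0.900	105.3	285.1	312
Attention-3DCNN	Cross-modal triple attention (channel-temporal-spatial)	93.5	0.910	41.3	97.2	133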
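
The abstract attributes part of the model's efficiency to the depthwise separable convolutions used in the SAR branch. The source does not give the layer dimensions, so the sketch below only illustrates the general parameter-count argument with hypothetical sizes (64 input channels, 128 output channels, 3 x 3 kernels); it does not reproduce the 41.3 M total in the table above.

```python
def conv2d_params(c_in, c_out, k, bias=False):
    """Number of weights in a standard k x k 2D convolution."""
    return k * k * c_in * c_out + (c_out if bias else 0)


def depthwise_separable_params(c_in, c_out, k, bias=False):
    """Depthwise k x k conv (one filter per input channel) plus a 1 x 1 pointwise conv."""
    depthwise = k * k * c_in + (c_in if bias else 0)
    pointwise = c_in * c_out + (c_out if bias else 0)
    return depthwise + pointwise


# Hypothetical layer sizes, chosen only for illustration.
standard = conv2d_params(c_in=64, c_out=128, k=3)                # 73728 weights
separable = depthwise_separable_params(c_in=64, c_out=128, k=3)  # 576 + 8192 = 8768 weights
ratio = separable / standard                                     # equals 1/c_out + 1/k**2
```

For this hypothetical layer the separable form needs roughly 12% of the weights of the standard convolution, and the ratio 1/c_out + 1/k² shows that the saving grows with both kernel size and output width.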

Model	Main information source	OA (%)	Macro-F₁	Kappa
3D-ConvSTAR	Mainly optical time series	83.6	0.802	0.775
Self-Attention 3D	Optical + temporal attention	85.1	0.821	0.796
UNet++	Optical spatial features	82.4	0.789	0.761
CNN-LSTM-DS	Optical time series + handcrafted features	81.9	0.781	0.754
TGF-Net	Transformer + CNN fusion	86.3	0.836	0.812
Attention-3DCNN	Cross-modal triple attention (S2+S1)	89.4	0.872	0.846
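
The abstract reports accuracy degradation in percentage points rather than in relative terms. The sketch below shows that arithmetic, assuming (the table carries no caption here) that the OA values above come from the heavy-cloud scenario on the Yishui County data. The full-dataset OA of 93.0% for Attention-3DCNN is taken from the ablation table and the value of 90.5% for TGF-Net from the model comparison table; the helper name `oa_drop` is hypothetical.

```python
def oa_drop(oa_reference, oa_degraded):
    """Accuracy loss in percentage points, rounded to one decimal place."""
    return round(oa_reference - oa_degraded, 1)


# Reference OA (full Yishui County data) versus OA in the table above.
drop_attention_3dcnn = oa_drop(93.0, 89.4)
drop_tgf_net = oa_drop(90.5, 86.3)
```

Under that assumption the result for Attention-3DCNN equals the 3.6 percentage points quoted in the abstract, and the corresponding figure for TGF-Net is larger.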