Special Issue: Remote Sensing + AI Empowering Agricultural and Rural Modernization


LightTassel-YOLO: A Real-Time Detection Method for Maize Tassels Based on UAV Remote Sensing

  • CAO Yuying 1, 2 ,
  • LIU Yinchuan 1, 2 ,
  • GAO Xinyue 1, 2 ,
  • JIA Yinjiang 1, 2 ,
  • DONG Shoutian 1, 2
  • 1. College of Electrical and Information, Northeast Agricultural University, Harbin 150030, China
  • 2. Key Laboratory of Northeast Smart Agricultural Technology, Ministry of Agriculture and Rural Affairs, Heilongjiang Province, Harbin 150030, China
JIA Yinjiang, Ph.D., Professor, research interests: smart agriculture. E-mail:
DONG Shoutian, M.S., Associate Professor, research interests: smart agriculture. E-mail:

CAO Yuying, M.S., Lecturer, research interests: agricultural visual perception. E-mail:

Received date: 2025-05-19

  Online published: 2025-08-01

Supported by

National Science and Technology Innovation 2030 "New Generation Artificial Intelligence" Major Project (2021ZD0110904)

Heilongjiang Province "Open Competition" Science and Technology Research Project (20212XJ05A0201)


Cite this article:

CAO Yuying, LIU Yinchuan, GAO Xinyue, JIA Yinjiang, DONG Shoutian. LightTassel-YOLO: A Real-Time Detection Method for Maize Tassels Based on UAV Remote Sensing[J]. Smart Agriculture, 2025, 7(6): 96-110. DOI: 10.12133/j.smartag.SA202505021

Abstract

[Objective] The accurate identification of maize tassels is critical for the production of hybrid seed. Existing object detection models in complex farmland scenarios face limitations such as restricted data diversity, insufficient feature extraction, high computational load, and low detection efficiency. To address these challenges, a real-time field maize tassel detection model, LightTassel-YOLO (You Only Look Once), based on an improved YOLOv11n is proposed. The model is designed to quickly and accurately identify maize tassels, enabling efficient operation of detasseling unmanned aerial vehicles (UAVs) and reducing the impact of manual intervention. [Methods] Data were continuously collected during the tasseling stage of maize from 2023 to 2024 using UAVs, establishing a large-scale, high-quality maize tassel dataset that covered different maize tasseling stages, multiple varieties, varying altitudes, and diverse meteorological conditions. First, EfficientViT (efficient vision transformer) was applied as the backbone network to enhance the ability to perceive information across multi-scale features. Second, the C2PSA-CPCA (convolutional block with parallel spatial attention and channel prior convolutional attention) module was designed to dynamically assign attention weights to the channel and spatial dimensions of feature maps, effectively enhancing the network's capability to extract target features while reducing computational complexity. Finally, the C3k2-SCConv module was constructed to facilitate representative feature learning and achieve low-cost spatial feature reconstruction, thereby improving the model's detection accuracy. [Results and Discussions] The results demonstrated that LightTassel-YOLO provided a reliable method for maize tassel detection. The final model achieved a precision of 92.6%, a recall of 89.1%, and an AP@0.5 of 94.7%, representing improvements of 2.5, 3.8, and 4.0 percentage points over the baseline model YOLOv11n, respectively. The model had only 3.23 M parameters and a computational cost of 6.7 GFLOPs. In addition, LightTassel-YOLO was compared with mainstream object detection algorithms such as Faster R-CNN, SSD, and multiple versions of the YOLO series. The results demonstrated that the proposed method outperformed these algorithms in overall performance and exhibited excellent adaptability in typical field scenarios. [Conclusions] The proposed method provides an effective theoretical framework for precise maize tassel monitoring and holds significant potential for advancing intelligent field management practices.

0 Introduction

Maize, as one of the three major global food crops, plays a pivotal role in maintaining stable food supplies, which is critical for global food security[1]. High-quality seeds are central to improving maize productivity and ensuring grain quality. As a typical monoecious crop, maize exhibits genetic phenomena such as self-incompatibility and heterosis, which are pivotal in plant genetics[2]. To preserve superior parental traits, hybrid breeding is widely adopted in maize seed production, where maternal detasseling serves as a key step. Traditional detasseling relied heavily on labor-intensive manual methods, which were subjective and environmentally sensitive. Current practices combine mechanical and manual detasseling, yet challenges persist due to varietal differences among parental lines and environmental factors such as temperature, light, and nutrient availability. These variables lead to variations in plant height and tasseling timing, while existing tassel detection methods suffer from insufficient feature extraction capability and suboptimal performance. Even when deployed on detasseling UAVs, these methods often fail to achieve desired outcomes, necessitating repeated manual interventions. Consequently, developing a rapid and accurate maize tassel detection method is essential for advancing precision and automation in UAV-based detasseling systems.
In recent years, with the continuous development of smart agriculture, UAV remote sensing technology and computer vision methods have been widely applied in crop growth monitoring[3], pest and disease identification[4], and yield prediction[5]. Domestic and international researchers have conducted extensive research on maize tassel detection. Traditional machine vision methods mainly rely on feature extraction algorithms based on color, texture, and geometric shape to extract maize tassel features. For example, Lu et al.[6] developed a color-based joint segmentation method that achieved joint segmentation of maize plants and tassels by exploiting the biological characteristic that tassel color changes over time, reaching an average accuracy of 74.3%; however, excessive reliance on color features easily leads to performance bottlenecks. Kurtulmuş and Kavdir[7] used support vector machine (SVM) classifiers and morphological operations to determine the final position of maize tassels, but due to background diversity, lighting differences, occlusion, shadowed areas, and color similarities, the model's correct detection rate was only 81.6%. Although the above research has made certain progress, traditional machine vision methods often require profound domain knowledge to manually design feature extraction schemes. Additionally, the relatively high planting density of maize, occlusion between tassels and between tassels and leaves, and complex field environments further increase the difficulty of maize tassel feature extraction. Therefore, many researchers have combined deep learning algorithms with computer vision techniques, achieving higher accuracy than traditional machine vision by automatically learning features from large amounts of data. Currently, this technology has been applied to maize tassel detection.
There are two main categories of deep learning methods for maize phenotype detection: segmentation[8] and detection[9]. Segmentation methods distinguish the target from the background at the pixel level. For example, Yu et al.[10] used a U-Net model to segment maize tassels from ground-level RGB images and UAV images, producing clearer segmentation boundaries and more complete preservation of tassel morphology, with an intersection over union (IoU) of 71%. Wan et al.[11] proposed an improved U-Net model that optimized encoder feature extraction through cascaded convolutional networks and constructed expansion paths through multi-scale dilated convolution fusion to preserve spatial details, ultimately achieving accurate recognition of maize field growth stages. Liu et al.[12] implemented efficient image segmentation based on the DeepLabv3+ model and adopted distance-transform skeletonization to extract stem and leaf morphological features, achieving a segmentation mIoU of 79.91%.
The above-mentioned maize phenotype detection method based on semantic segmentation has a more precise segmentation effect, but the model is more complex and requires more computing resources. Moreover, maize tasseling is a critical stage in maize growth and development, and detasseling is a task that demands high real-time performance, making this method unsuitable for deployment on detasseling UAVs. Unlike segmentation methods, YOLO (You Only Look Once), one of the representative object detection algorithms, is widely applied in real-time detection tasks due to its outstanding performance. For instance, Li et al. [13] proposed a real-time pattern recognition framework for ground-penetrating radar (GPR) images using YOLOv3 implemented with TensorFlow. Their approach employs the visual intersection over union (V-IoU) method to address electromagnetic signal vacillation, significantly enhancing bounding box localization accuracy. Therefore, YOLO is particularly suitable for maize tassel detection tasks. Pu et al.[14] proposed Tassel-YOLO, an improved maize tassel detection and counting model based on YOLOv7, enhancing the network's nonlinear expression capability through the vision-oriented variant grouped spatial convolution cross-stage partial (VoVGSCSP) module. Niu et al.[15] proposed a method based on RGB images and the YOLOv8 model to identify and count maize tassels, evaluating the model's accuracy under different height conditions and determining that the highest accuracy of 97.59% was achieved at a height of 5 m, providing valuable reference for our subsequent data collection and research direction. 
Jia et al.[16] designed a maize tassel detection model based on YOLOv5 embedded with a coordinate attention mechanism, effectively suppressing irrelevant features in field environments; this indicates that introducing attention mechanisms can dynamically focus on key information and reduce redundant data, providing new insights for visual detection of maize tassels in complex field environments. Liu et al.[17] used an improved Faster R-CNN (fast region-based convolutional neural network) model combined with a ResNet network to detect maize tassels with high precision from images collected by UAVs and mobile phones, and optimized small-target detection capability by adjusting anchor box sizes, achieving an accuracy of 95.95%. Falahat and Karami[18] proposed a lightweight maize tassel detection model based on an improved YOLOv5 and compared it with Faster R-CNN, SSD (Single Shot MultiBox Detector), RetinaNet, and TasselNetv2+; in terms of both speed and quality, the results showed that their model outperformed the other advanced detection methods in overall performance. The above research further validates the broad potential of object detection models in maize tassel detection applications.
Despite the progress made by previous researchers in maize tassel detection, some deficiencies remain in existing research. First, data collection windows are narrow and the samples are homogeneous, lacking multiple growth stages and diverse environmental conditions; this remains a bottleneck constraining the development of maize tassel detection technology. Second, maize tasseling is a continuous biological process, and tassel phenotypes vary significantly across varieties. Traditional architectures exhibit a limited capacity to capture subtle tassel features, resulting in poor detection performance and restricting their effectiveness in complex scenarios. Finally, existing models demand substantial computational resources, which hinders deployment on hardware devices such as UAVs and makes it challenging to meet real-time field detection needs. To solve the above problems, a field maize tassel identification method based on improved YOLOv11n, named LightTassel-YOLO, was proposed.

1 Materials and methods

The experimental design of this study is presented in Fig. 1. Detailed explanations of the maize planting situation, data acquisition scheme, model architecture design, experimental setup, and evaluation are provided in the following four sections.
Fig. 1 Flowchart of the LightTassel-YOLO

1.1 Dataset construction

1.1.1 Data collection

The research area was situated at the Xiangyang base (126°55′39″E, 45°45′48″N) and Acheng base (127°2′58″E, 45°31′18″N) of Northeast Agricultural University in Harbin, Heilongjiang province, China. The experimental bases cultivated four varieties: DN279, DN285, AB368, and QS370. This study employed a DJI Phantom 4 Pro unmanned aerial vehicle (UAV) equipped with a high-resolution DJI FC6310 digital camera (DJI Innovations, Shenzhen, Guangdong, China) with a resolution of 5 472 × 3 648 pixels. Maize tassel data were collected in July of both 2023 and 2024. To ensure dataset diversity and model robustness, data acquisition was conducted under various weather conditions, including clear skies and overcast days. During the early tasseling stage, the exposed part of the tassel was minimal and visually similar to the surrounding leaves, making it susceptible to occlusion. As such, a relatively low flight altitude of 5 m was selected to enhance image acquisition accuracy. During the mid to late tasseling stages, as the tassels became more prominent and distinguishable, the flight altitude was adjusted to 10 m to better capture their structural features.
Representative images of the captured maize tassels are shown in Fig. 2. The maize tassels at the early, partial, and full tasseling stages are illustrated in Fig. 2a~Fig. 2c, respectively. At the early tasseling stage, the tassels had just emerged and exhibited weak textural features while sharing a similar coloration with the leaf veins, which made them difficult to distinguish and increased the complexity of tassel detection. From partial to full tasseling, the tassels gradually unfolded and became visually prominent due to their golden-yellow pollen. The maize tassels captured under sunny and cloudy conditions are illustrated in Fig. 2d~Fig. 2e, respectively. As seen from the comparison, under strong sunlight, images tend to be overexposed, resulting in blurred boundaries between tassels and leaves, which complicates their differentiation. Conversely, under overcast conditions, although the overall image appears darker, the difference in reflectivity between tassels and leaves becomes more distinct, thereby facilitating their separation from the complex background. The images captured at 5 m and 10 m flight altitudes are shown in Fig. 2f and Fig. 2g. Because maize cultivars within the experimental field varied, tasseling times also varied slightly. Accordingly, the model was trained on a combination of data collected at different altitudes, tasseling stages, and weather conditions, and from different maize varieties, to further enhance its generalization capability. During the early to mid-tasseling stages, tassels are frequently occluded by surrounding leaves, as illustrated in Fig. 2h. In addition, high planting density often leads to mutual occlusion between adjacent tassels, as shown in Fig. 2i. These diverse tassel samples collectively constitute the dataset used in this study.
Fig. 2 Maize tassel dataset

1.1.2 Data pre-processing

A total of 684 UAV images of maize tassels were collected in 2023. Tassel boundaries were manually annotated using the LabelImg software tool. Given the original image resolution of 5 472×3 648 pixels, directly inputting these high-resolution images into a deep learning network would substantially increase memory consumption during training and inference, potentially causing memory overflow. Therefore, prior to being fed into the network, the images were cropped to 640×640 pixels. Through a series of data augmentation techniques, including rotation, vertical and horizontal flipping, brightness adjustment, contrast adjustment, hue adjustment, and sharpening, a total of 3 423 images were generated. The data collected in 2023 were used for training and validation, split in a 7:3 ratio. To reduce any imbalance between training and validation samples caused by differences in lighting, weather (such as wind), and other variables, the augmented images were randomly shuffled before splitting. To verify the robustness of the model, 300 images were selected from the 728 UAV maize tassel images acquired in 2024 as the test set; after cropping to 640×640 pixels, the test set contained a total of 944 images. Four groups of UAV image datasets were constructed according to maize tasseling stage, maize variety, collection height, and collection weather, as shown in Table 1. By tasseling stage, the test set was divided into early, partial, and full tasseling; by variety, into DN279, DN285, AB368, and QS370; by height, into 5 and 10 m; and by weather, into sunny and cloudy days.
Table 1 Description of the maize dataset

a. Composition of the training and validation sets

Year    Train    Validation    Total
2023    2 397    1 026         3 423

b. Composition of the test sets

Year    Dimension                Subset               Test images    Total
2024    Maize tasseling stage    Early tasseling      288            944
                                 Partial tasseling    303
                                 Full tasseling       353
        Variety                  DN279                223
                                 DN285                261
                                 AB368                212
                                 QS370                248
        Height                   5 m                  478
                                 10 m                 466
        Weather                  Sunny                465
                                 Cloudy               479
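The tiling of the 5 472 × 3 648 pixel UAV frames into 640 × 640 crops described in Section 1.1.2 can be sketched as follows. This is a minimal illustration only (not the authors' code), assuming non-overlapping tiles and that edge remainders smaller than one tile are discarded:

```python
def tile_boxes(width, height, tile=640):
    """Return (left, upper, right, lower) crop boxes that partition an
    image into non-overlapping tile x tile patches; edge remainders
    narrower than one tile are dropped."""
    boxes = []
    for top in range(0, height - tile + 1, tile):
        for left in range(0, width - tile + 1, tile):
            boxes.append((left, top, left + tile, top + tile))
    return boxes

# A 5 472 x 3 648 UAV frame yields an 8 x 5 grid of 640 x 640 tiles.
boxes = tile_boxes(5472, 3648)
```

Each box could then be passed to an image library (e.g. Pillow's `Image.crop`) to produce the network inputs.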

1.2 Construction of the LightTassel-YOLO model

The YOLO series of algorithms was first introduced by Redmon et al.[19] in 2016. Compared with two-stage object detection algorithms (such as the R-CNN series), YOLO performs single-stage, regression-based object detection, simultaneously predicting object bounding boxes and their classes through an end-to-end network. This approach improves both detection speed and accuracy. Through ongoing development and technological iteration, the YOLO series had evolved to its eleventh generation, YOLOv11[20], by October 2024. Considering the balance between accuracy and model complexity, YOLOv11n was selected as the baseline model for this study. On this basis, a field maize tassel detection method named LightTassel-YOLO was proposed, as illustrated in Fig. 3. The improvements introduced are as follows.
Fig. 3 Architecture of LightTassel-YOLO model
The Transformer-based EfficientViT module was applied to the backbone network to enhance the model's ability to perceive information from multi-scale features. While maintaining efficient real-time inference, the innovative architectural design enables improved detection accuracy even with a moderately reduced feature extraction depth. A channel-prioritized convolutional attention module, C2PSA-CPCA (Convolutional block with Parallel Spatial Attention and Channel prior convolutional attention), was designed to enhance the representation of small objects with subtle early-stage tassel features. By employing depthwise convolutions to capture spatial relationships among features, the module dynamically allocates attention weights across both channel and spatial dimensions. The C3k2-SCConv module was constructed based on SCConv to optimize feature representation by capturing channel relationships and semantic information across feature maps. This approach reduced the spatial and channel redundancy commonly present in standard convolutions, achieved improved performance while lowering computational load, and further enhanced the model's lightweight design.

1.2.1 EfficientViT feature extraction network

The traditional backbone network structure has limitations in processing cross-scale information and struggles to accurately capture the global features of maize tassels. To address these issues, Cai et al.[21] proposed the efficient vision transformer (EfficientViT) network, shown in Fig. 4a. Its core building block, shown in Fig. 4b, consists of the mobile inverted bottleneck convolution (MBConv) module and the lightweight multi-scale attention (MSA)[22] module, depicted in Fig. 4c.
Fig. 4 EfficientViT model architecture

a. EfficientViT network b. Core building block c. Lightweight MSA module

In this model, the feature maps progressively decreased in size while the number of channels increased. The MSA module was used to capture contextual information, while the MBConv module enhanced gradient propagation characteristics to better capture local features[23]. Additionally, the network was designed with six versions of varying sizes (M0~M5) to meet different efficiency constraints. Given that this study required real-time maize tassel detection and considering the potential future deployment on edge devices, the backbone of YOLOv11 was replaced with the smallest EfficientViT_M0.

1.2.2 Channel prioritized convolutional attention

In field maize tassel images, owing to uncontrollable lighting conditions during data collection, maize tassels may exhibit varying degrees of reflection in RGB images, making feature extraction relatively challenging. Although embedding attention mechanisms can help the model focus more on regions of interest, common attention mechanisms typically measure and extract effective information from feature maps along either the channel or the spatial dimension alone, often neglecting the positional information within the feature maps. The CPCA mechanism achieves a dynamic distribution of attention weights across both channel and spatial dimensions, allowing more selective focus on regions with luminance features, which improves the accuracy and robustness of tassel detection[24]. The structure of the CPCA module is shown in Fig. 5.
Fig. 5 Structure of CPCA
The CPCA module mainly consists of two parts: the channel attention (CA) module and the spatial attention (SA) module. Let $F \in \mathbb{R}^{C \times H \times W}$ denote the input intermediate feature map. The CA module is first applied to generate a one-dimensional channel attention map $M_c \in \mathbb{R}^{C \times 1 \times 1}$. $M_c$ is element-wise multiplied with the input feature map $F$, with the channel attention values broadcast along the spatial dimensions, to obtain the channel attention feature map $F_c \in \mathbb{R}^{C \times H \times W}$, as shown in Equation (1). The SA module then processes $F_c$ to generate a three-dimensional spatial attention map $M_s \in \mathbb{R}^{C \times H \times W}$. The final output feature map $\hat{F} \in \mathbb{R}^{C \times H \times W}$ is obtained by element-wise multiplying $M_s$ with $F_c$, as shown in Equation (2), where $\otimes$ denotes element-wise multiplication.

$$F_c = \mathrm{CA}(F) \otimes F \tag{1}$$

$$\hat{F} = \mathrm{SA}(F_c) \otimes F_c \tag{2}$$

CPCA adopts the channel attention scheme proposed in the convolutional block attention module (CBAM)[25], in which spatial information of the feature map is aggregated through average pooling and max pooling operations. The two independent spatial context descriptors generated by the pooling operations are fed into a shared multi-layer perceptron (MLP), and the two MLP outputs are summed element-wise and passed through a sigmoid to produce the channel attention map. The computation is summarized in Equation (3), where $\sigma$ denotes the sigmoid activation function.

$$\mathrm{CA}(F) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big) \tag{3}$$

The spatial attention map is generated by extracting spatial relationships among features, using a multi-scale structure to enhance the convolution operation's ability to discern spatial relationships. In particular, a $1 \times 1$ convolution is applied at the end of the spatial attention module to integrate the channels, thereby generating a more refined attention map. The computation of the CPCA spatial attention is shown in Equation (4), where $\mathrm{DwConv}$ denotes depthwise convolution and $\mathrm{Branch}_i$, $i \in \{0, 1, 2, 3\}$, denotes the $i$-th multi-scale branch, with $\mathrm{Branch}_0$ being the identity connection.

$$\mathrm{SA}(F) = \mathrm{Conv}_{1 \times 1}\left(\sum_{i=0}^{3} \mathrm{Branch}_i\big(\mathrm{DwConv}(F)\big)\right) \tag{4}$$
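Equation (3) can be illustrated with a minimal NumPy sketch of CBAM-style channel attention. The shapes, reduction ratio, and MLP weights below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def channel_attention(F, W1, W2):
    """CBAM-style channel attention, per Equation (3).
    F: feature map of shape (C, H, W); W1: (C, C//r), W2: (C//r, C),
    the weights of the shared two-layer MLP with reduction ratio r."""
    avg = F.mean(axis=(1, 2))   # AvgPool over spatial dims -> (C,)
    mx = F.max(axis=(1, 2))     # MaxPool over spatial dims -> (C,)

    def mlp(x):                 # shared MLP with ReLU hidden layer
        return np.maximum(x @ W1, 0) @ W2

    # sigmoid of summed descriptors -> channel attention map M_c
    Mc = 1.0 / (1.0 + np.exp(-(mlp(avg) + mlp(mx))))
    # broadcast along H and W and multiply, as in Equation (1)
    return Mc[:, None, None] * F

rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 4, 2
F = rng.standard_normal((C, H, W))
out = channel_attention(F, rng.standard_normal((C, C // r)),
                        rng.standard_normal((C // r, C)))
```

In the actual CPCA module the MLP is realized with 1×1 convolutions and followed by the multi-branch depthwise spatial attention of Equation (4); this sketch covers only the channel branch.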

1.2.3 SCConv convolutional module

The SCConv[26] structure is shown in Fig. 6. This convolution introduces two core components: the spatial reconstruction unit (SRU) and the channel reconstruction unit (CRU). The SRU uses a block-based heterogeneous convolution kernel strategy, decomposing the feature map into multiple spatial blocks and extracting local features in a differentiated manner, which significantly reduces spatial redundancy. The CRU, in turn, alleviates channel redundancy through channel splitting and lightweight convolution operations, combined with a feature reuse mechanism that selects high-response channels.
Fig. 6 Structure of SCConv
Inspired by SCConv's channel rearrangement and mixed feature design that reduces redundant features, an improved structure called C3k2-SCConv based on the SCConv module was proposed. The heterogeneous convolution kernels in SRU enhanced fine-grained texture perception, while the channel rearrangement in CRU suppressed background interference. This structure could accurately focus on target regions in complex farmland scenarios while balancing feature representation capability and computational efficiency.
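The separate-and-reconstruct idea behind the SRU can be sketched as follows. This is a simplified NumPy illustration of the general mechanism only, not the SCConv reference implementation: channels are softly gated by normalized scaling factors (standing in for group-normalization weights), separated into informative and less-informative parts, and cross-added to reconstruct the feature map:

```python
import numpy as np

def sru_sketch(X, gamma):
    """Simplified separate-and-reconstruct step.
    X: feature map (C, H, W) with even C; gamma: (C,) positive
    per-channel scale factors (proxy for group-norm weights)."""
    w = gamma / gamma.sum()                       # normalized channel weights
    gate = 1.0 / (1.0 + np.exp(-(w - w.mean())))  # soft informativeness gate
    X1 = gate[:, None, None] * X                  # informative part
    X2 = (1.0 - gate)[:, None, None] * X          # redundant part
    C = X.shape[0] // 2
    # cross-add the two halves, then concatenate (feature reuse)
    upper = X1[:C] + X2[C:]
    lower = X1[C:] + X2[:C]
    return np.concatenate([upper, lower], axis=0)

rng = np.random.default_rng(1)
X = rng.standard_normal((6, 3, 3))
gamma = np.abs(rng.standard_normal(6)) + 0.1
Y = sru_sketch(X, gamma)
```

Because the gates sum to one per channel, no information is discarded outright; the cross-addition redistributes it so that subsequent convolutions see less redundant structure.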

1.3 Evaluation indexes

To comprehensively evaluate the performance of the LightTassel-YOLO model, the following evaluation metrics were selected: precision (P), recall (R), AP@0.5, AP@0.5:0.95, model parameters (Params), giga floating-point operations (GFLOPs), and frames per second (FPS). The calculation formulas for P, R, and AP are given in Equations (5) to (7):
$$P = \frac{TP}{TP + FP} \tag{5}$$

$$R = \frac{TP}{TP + FN} \tag{6}$$

$$AP = \int_{0}^{1} P(R)\,\mathrm{d}R \tag{7}$$
where TP (true positive) is the number of actual maize tassels predicted as maize tassels by the model; FP (false positive) is the number of negative samples predicted as maize tassels, i.e., false detections; and FN (false negative) is the number of actual maize tassels predicted as negative samples, i.e., missed detections. Since maize tassels constitute the only class (there are no actual negative samples), the case where both the actual sample and the prediction are negative does not arise. P represents the proportion of actual maize tassels among those predicted by the model, and R represents the proportion of actual maize tassels that are predicted as such. A sample is considered correctly predicted when the intersection over union (IoU) between the predicted box and the ground truth label reaches a set threshold. AP@0.5 refers to the AP calculated at an IoU threshold of 0.5, and AP@0.5:0.95 refers to the average AP over IoU thresholds from 0.5 to 0.95 in steps of 0.05; higher AP values indicate better performance. Params represents the size of the model. GFLOPs indicates the number of floating-point operations required during execution, with lower values generally implying less computational complexity. FPS, the number of images the model can process per second, was used to evaluate processing speed. These metrics were collectively used to assess the overall performance of the model.
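The definitions in Equations (5) to (7) can be computed as in this small sketch. It is illustrative only; a real detection evaluation first matches predictions to ground-truth boxes via IoU to obtain the TP/FP/FN counts, and the example counts below are hypothetical:

```python
def precision(tp, fp):
    """Equation (5): fraction of predicted tassels that are real."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Equation (6): fraction of real tassels that were detected."""
    return tp / (tp + fn)

def average_precision(recalls, precisions):
    """Equation (7): area under the P(R) curve, approximated by
    trapezoidal integration. recalls must be sorted increasingly."""
    ap = 0.0
    for i in range(1, len(recalls)):
        ap += (recalls[i] - recalls[i - 1]) * \
              (precisions[i] + precisions[i - 1]) / 2.0
    return ap

# e.g. 89 tassels detected correctly, 7 false alarms, 11 missed
p = precision(89, 7)   # ~0.927
r = recall(89, 11)     # 0.89
```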

1.4 Experimental platform and configuration

The experimental platform of this study was the AutoDL online GPU processing platform equipped with an RTX A4000 GPU (16 GB of video memory), an Intel(R) Xeon(R) Silver 4310 processor, and 30 GB of memory. The operating system was Ubuntu 20.04, the deep learning framework was PyTorch 2.0.0, the Python version was 3.8, and CUDA 11.8 was used to accelerate training. When training the different models, the batch size was fixed at 16, the images were scaled to 640×640 before being input into the model, and 100 epochs of training were performed uniformly. The initial learning rate, optimizer, and other parameters are shown in Table 2.
Table 2 Key parameters settings of maize tassel detection research
Parameters Setup
Epochs 100
Batch size 16
Learning rate 0.01
Optimizer AdamW
Number of workers 8
Image size 640 × 640

2 Results and discussion

Compared with existing object detection architectures, the three modules integrated in this study exhibit notable structural advantages and practical application value. First, the EfficientViT module introduces a multi-scale lightweight attention mechanism that effectively improves the model's perception accuracy while significantly compressing the parameter scale, suiting agricultural scenarios with demanding requirements on real-time performance and model size. Second, the C2PSA-CPCA module integrates attention allocation strategies in the channel and spatial dimensions, significantly enhancing the model's ability to perceive tiny early-stage tassel targets while maintaining good stability and generalization under complex lighting and background interference. Finally, the C3k2-SCConv module suppresses feature redundancy through spatial reconstruction and channel rearrangement mechanisms, significantly reducing the model's computational load, which is one of the key designs enabling efficient deployment on edge devices. This section is organized into the following six parts, which systematically demonstrate how the above module designs balance model performance against resource efficiency, with detailed analysis of the experimental results.

2.1 Comparison with different feature extraction networks

This study compared the effects of improving YOLOv11n with different feature extraction networks, including StarNet[27], VanillaNet[28], MobileNetV4[29], ShuffleNetv2[30], RepViT[31], and EfficientViT. The comparison results are shown in Table 3. The experimental results demonstrated that the EfficientViT architecture exhibited a significant overall advantage, achieving P, R, AP@0.5, and AP@0.5:0.95 scores of 91.5%, 87.9%, 93.8%, and 56.2%, respectively, ranking first among the evaluated models in detection accuracy. Notably, although RepViT outperformed StarNet (54.3%), MobileNetV4 (53.2%), and ShuffleNetv2 (54.6%) in the AP@0.5:0.95 metric (55.5%), this improvement came with a substantial increase in parameter size and computational cost, resulting in a 19.9% decrease in FPS compared with EfficientViT. In contrast, the model integrated with EfficientViT maintained higher detection accuracy while reducing parameters by 42.2% and computational load by 61.0% compared with RepViT, along with a 24.9% increase in inference speed. Additionally, the lightweight ShuffleNetv2 network stood out in efficiency, reaching an FPS of 123.3, but its AP@0.5:0.95 was 1.6 percentage points lower than that of EfficientViT. Overall, the analysis indicated that EfficientViT, as the feature extraction backbone for YOLOv11n, achieved an optimal balance among detection accuracy, model complexity, and real-time performance, effectively reconciling detection performance with lightweight requirements.
Table 3 Comparison of experimental results for YOLOv11n models improved with different feature extraction networks

| Model | P/% | R/% | AP@0.5/% | AP@0.5:0.95/% | Params/M | FPS | GFLOPs |
| --- | --- | --- | --- | --- | --- | --- | --- |
| YOLOv11n+StarNet | 91.1 | 86.8 | 93.5 | 54.3 | 2.64 | 179.5 | 5.2 |
| YOLOv11n+VanillaNet | 90.9 | 85.7 | 93.1 | 53.3 | 3.69 | 101.2 | 6.2 |
| YOLOv11n+MobileNetV4 | 91.2 | 85.2 | 92.7 | 53.2 | 3.84 | 112.4 | 7.2 |
| YOLOv11n+ShuffleNetv2 | 91.3 | 85.8 | 93.4 | 54.6 | 2.48 | 123.3 | 5.9 |
| YOLOv11n+RepViT | 91.1 | 87.6 | 93.9 | 55.5 | 6.16 | 131.2 | 17.7 |
| YOLOv11n+EfficientViT | 91.5 | 87.9 | 93.8 | 56.2 | 3.56 | 163.9 | 6.9 |
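The relative reductions cited for EfficientViT versus RepViT can be checked directly from the table values; a minimal verification in Python, using the Params, GFLOPs, and FPS figures copied from Table 3:

```python
# Relative comparison of the EfficientViT and RepViT backbones,
# using the Params, GFLOPs, and FPS values reported in Table 3.
repvit = {"params_M": 6.16, "gflops": 17.7, "fps": 131.2}
effvit = {"params_M": 3.56, "gflops": 6.9, "fps": 163.9}

param_cut = (repvit["params_M"] - effvit["params_M"]) / repvit["params_M"] * 100
gflops_cut = (repvit["gflops"] - effvit["gflops"]) / repvit["gflops"] * 100
fps_gain = (effvit["fps"] - repvit["fps"]) / repvit["fps"] * 100

print(f"params -{param_cut:.1f}%, GFLOPs -{gflops_cut:.1f}%, FPS +{fps_gain:.1f}%")
# → params -42.2%, GFLOPs -61.0%, FPS +24.9%
```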

2.2 Comparison with different EfficientViT variants

In this experiment, the YOLOv11n backbone was replaced with six variants of EfficientViT (M0-M5) to systematically evaluate the model's performance in maize tassel detection. As shown in Table 4, as the model size increased from M0 to M5, both the number of parameters and the computational complexity surged, the FPS exhibited a clear downward trend, and the improvement in P showed diminishing returns. M4 achieved the highest precision at 92.3%, but its AP@0.5:0.95 was only 56.0%, slightly lower than M0's 56.2%. Although M5 achieved the best AP@0.5:0.95 at 56.5%, this was only a 0.3 percentage point improvement over M0 while incurring substantially higher parameters and GFLOPs. Overall, M0 demonstrated significant advantages, requiring only 3.56 M parameters and 6.9 GFLOPs while achieving the highest inference speed of 163.9 FPS. These results indicated that M0 struck the best balance among accuracy, efficiency, and computational cost. Its lightweight characteristics make it well suited to real-time detection on UAV platforms, validating the rationale for adopting EfficientViT_M0 as the backbone network.
Table 4 Comparison of YOLOv11n models with different EfficientViT variants

| Model | P/% | AP@0.5/% | AP@0.5:0.95/% | Params/M | FPS | GFLOPs |
| --- | --- | --- | --- | --- | --- | --- |
| YOLOv11n+EfficientViT_M0 | 91.5 | 93.8 | 56.2 | 3.56 | 163.9 | 6.9 |
| YOLOv11n+EfficientViT_M1 | 92.0 | 94.3 | 55.8 | 4.37 | 152.8 | 12.6 |
| YOLOv11n+EfficientViT_M2 | 91.8 | 94.1 | 55.5 | 5.58 | 143.2 | 14.7 |
| YOLOv11n+EfficientViT_M3 | 90.8 | 94.2 | 55.7 | 8.27 | 139.8 | 18.4 |
| YOLOv11n+EfficientViT_M4 | 92.3 | 94.2 | 56.0 | 10.17 | 137.6 | 20.5 |
| YOLOv11n+EfficientViT_M5 | 91.4 | 94.3 | 56.5 | 13.85 | 122.4 | 33.0 |
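The choice of M0 can be framed as a simple accuracy-speed trade-off rule over the Table 4 values. The sketch below keeps only variants whose AP@0.5:0.95 lies within a tolerance of the best and then picks the fastest; the 0.5-point tolerance is our illustrative assumption, not a criterion stated in the paper.

```python
# Trade-off check over the EfficientViT variants of Table 4:
# among variants within 0.5 AP@0.5:0.95 points of the best,
# prefer the one with the highest FPS (tolerance is illustrative).
variants = {
    "M0": {"ap": 56.2, "params_M": 3.56, "fps": 163.9},
    "M1": {"ap": 55.8, "params_M": 4.37, "fps": 152.8},
    "M2": {"ap": 55.5, "params_M": 5.58, "fps": 143.2},
    "M3": {"ap": 55.7, "params_M": 8.27, "fps": 139.8},
    "M4": {"ap": 56.0, "params_M": 10.17, "fps": 137.6},
    "M5": {"ap": 56.5, "params_M": 13.85, "fps": 122.4},
}
best_ap = max(v["ap"] for v in variants.values())
near_best = {k: v for k, v in variants.items() if best_ap - v["ap"] <= 0.5}
chosen = max(near_best, key=lambda k: near_best[k]["fps"])
print(chosen)  # → M0
```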

2.3 Ablation experiments

To validate the effectiveness of each improvement in LightTassel-YOLO, ablation experiments were conducted using YOLOv11n as the baseline model, with all training runs following the same parameter settings. The experiments were carried out on the self-constructed maize tassel dataset, and the performance of each module combination was evaluated on the test set. The experimental results are shown in Table 5.
Table 5 Ablation experiment results of the LightTassel-YOLO model (baseline: YOLOv11n)

| Model | EfficientViT | C2PSA-CPCA | C3k2-SCConv | P/% | R/% | AP@0.5/% | Params/M | GFLOPs | FPS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Model 1 (baseline) | × | × | × | 90.1 | 85.3 | 90.7 | 2.43 | 6.3 | 124.9 |
| Model 2 | √ | × | × | 91.5 | 87.9 | 93.8 | 3.56 | 6.9 | 163.9 |
| Model 3 | × | √ | × | 90.8 | 86.7 | 93.3 | 2.43 | 6.3 | 239.2 |
| Model 4 | × | × | √ | 90.7 | 86.8 | 93.2 | 2.11 | 5.4 | 226.3 |
| Model 5 | √ | √ | × | 92.2 | 87.9 | 94.1 | 3.56 | 6.9 | 238.5 |
| Model 6 | √ | × | √ | 92.1 | 88.4 | 94.4 | 3.15 | 6.7 | 190.8 |
| Model 7 | × | √ | √ | 91.8 | 87.6 | 93.9 | 2.26 | 5.4 | 285.6 |
| LightTassel-YOLO | √ | √ | √ | 92.6 | 89.1 | 94.7 | 3.23 | 6.7 | 226.9 |

Note: √ means the corresponding module is added to the model; × means the corresponding module is not added.

The experimental results showed that optimizing the backbone with EfficientViT alone increased the number of parameters and the computational cost over the baseline model by 1.13 M and 0.6 GFLOPs, respectively, while P, R, and AP@0.5 improved by 1.4, 2.6, and 3.1 percentage points. This is because the EfficientViT module, built on the Transformer architecture, extracts multi-scale global features by concatenating features from different scales along the head dimension and fusing them through a linear projection layer, which leads to a slight increase in parameters and computation. With EfficientViT applied to YOLOv11n, the FPS increased from the baseline's 124.9 to 163.9, demonstrating that EfficientViT as a feature extractor could maintain efficient real-time inference while improving detection accuracy. Adding only the CPCA attention mechanism to the baseline model, while keeping the number of parameters and the computational cost unchanged, increased P, R, and AP@0.5 by 0.7, 1.4, and 2.6 percentage points, respectively, validating the enhancement effect of the channel-prioritized strategy on the model's ability to perceive critical regions. Further analysis of the C3k2-SCConv module showed that this structure reduced parameters by 0.32 M and computational cost by 0.9 GFLOPs while improving P, R, and AP@0.5 by 0.6, 1.5, and 2.5 percentage points, respectively, with a marked increase in the number of images processed per second. This improvement was attributed to its cross-channel semantic interaction mechanism, which significantly enhanced feature representation while reducing computational load. Notably, the modules exhibited nonlinear synergistic gains: when EfficientViT and C2PSA-CPCA were applied together, AP@0.5 reached 94.1%, a 3.4 percentage point improvement over the baseline.
The fully integrated LightTassel-YOLO model, combining all three improvements, achieved 3.23 M parameters and 6.7 GFLOPs computation, with P, R, and AP@0.5 improved by 2.5, 3.8, and 4.0 percentage points over the baseline, while the FPS increased to 226.9. These results indicated that the deep coupling of EfficientViT's global feature extraction, C2PSA-CPCA's attention focus, and C3k2-SCConv's lightweight optimization achieved an optimal balance of detection accuracy and inference speed under limited computational resources. This provided a high-performance and lightweight solution for real-time maize tassel detection in complex field scenarios.
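The headline gains of the full model over the baseline follow directly from the Table 5 rows; a small check of the percentage-point arithmetic:

```python
# Percentage-point improvements of LightTassel-YOLO over the
# YOLOv11n baseline (Model 1 vs. the full model in Table 5).
baseline = {"P": 90.1, "R": 85.3, "AP@0.5": 90.7}  # YOLOv11n
full = {"P": 92.6, "R": 89.1, "AP@0.5": 94.7}      # LightTassel-YOLO

gains = {m: round(full[m] - baseline[m], 1) for m in baseline}
print(gains)  # → {'P': 2.5, 'R': 3.8, 'AP@0.5': 4.0}
```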

2.4 Analysis of the LightTassel-YOLO training process

During the training process of the LightTassel-YOLO model, dynamic response curves were plotted based on performance metrics on the validation set, including P, R, AP@0.5, and AP@0.5:0.95, as shown in Fig. 7. The experimental data indicated that all evaluation metrics exhibited a rapid upward trend during the initial training phase, with a smooth convergence process and no significant fluctuations. After approximately 40 epochs, all metrics had reached a stable convergence state. The final model achieved a P of 92.6%, R of 89.1%, AP@0.5 of 94.7%, and AP@0.5:0.95 of 56.7%.
Fig. 7 Model training performance of LightTassel-YOLO

2.5 Comparison with mainstream object detection models

To further evaluate the overall performance of LightTassel-YOLO, a comparative study was conducted against several mainstream models, including the two-stage detection framework Faster R-CNN, the single-stage detector SSD, and various versions of the YOLO series (YOLOv5, YOLOv7, YOLOv8, YOLOv10, and YOLOv11). The experimental results are presented in Table 6.
Table 6 Comparison between LightTassel-YOLO and mainstream object detection models

| Model | P/% | R/% | AP@0.5/% | Params/M | GFLOPs |
| --- | --- | --- | --- | --- | --- |
| Faster R-CNN+ResNet50 | 85.4 | 83.7 | 86.5 | 41.35 | 93.6 |
| SSD+ResNet50 | 79.1 | 75.3 | 82.3 | 27.39 | 30.6 |
| YOLOv5s | 90.0 | 86.3 | 90.5 | 7.03 | 15.8 |
| YOLOv7-tiny | 88.4 | 84.3 | 90.0 | 6.40 | 13.2 |
| YOLOv8n | 89.9 | 86.7 | 92.5 | 3.01 | 8.1 |
| YOLOv10n | 89.0 | 84.6 | 91.1 | 2.76 | 8.2 |
| YOLOv11n | 90.1 | 85.3 | 90.7 | 2.43 | 6.3 |
| LightTassel-YOLO | 92.6 | 89.1 | 94.7 | 3.23 | 6.7 |
The results showed that the proposed field maize tassel detection model, LightTassel-YOLO, performed excellently in terms of P, R, and AP@0.5, achieving 92.6%, 89.1%, and 94.7%, respectively. Compared with the traditional two-stage network Faster R-CNN and the one-stage network SSD, LightTassel-YOLO demonstrated significant improvements across all evaluation metrics. Specifically, compared with YOLOv5s, LightTassel-YOLO scored 2.6, 2.8, and 4.2 percentage points higher in P, R, and AP@0.5, respectively, while reducing the parameter count and computation by 3.8 M and 9.1 GFLOPs. Compared with other YOLO models, LightTassel-YOLO outperformed YOLOv7-tiny, YOLOv8n, YOLOv10n, and YOLOv11n in P by 4.2, 2.7, 3.6, and 2.5 percentage points, in R by 4.8, 2.4, 4.5, and 3.8 percentage points, and in AP@0.5 by 4.7, 2.2, 3.6, and 4.0 percentage points, respectively. Overall, these results highlighted the superior performance of LightTassel-YOLO in maize tassel detection tasks, especially under resource-constrained scenarios.
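For context, the P and R figures compared above follow the standard IoU-based evaluation protocol: each predicted box, taken in descending confidence order, is matched to at most one unmatched ground-truth box, and a match counts as a true positive when IoU ≥ 0.5. The sketch below is a generic single-image illustration of that protocol, not code from the paper.

```python
def iou(a, b):
    # Intersection-over-union of two boxes given as (x1, y1, x2, y2).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def precision_recall(preds, gts, thr=0.5):
    # preds: non-empty list of (confidence, box); gts: non-empty list of boxes.
    # Greedy matching in descending confidence order, one GT per prediction.
    matched, tp = set(), 0
    for _, box in sorted(preds, reverse=True):
        best, best_j = 0.0, -1
        for j, gt in enumerate(gts):
            if j not in matched and iou(box, gt) > best:
                best, best_j = iou(box, gt), j
        if best >= thr:
            matched.add(best_j)
            tp += 1
    fp = len(preds) - tp
    fn = len(gts) - len(matched)
    return tp / (tp + fp), tp / (tp + fn)
```

For example, three predictions against two ground-truth tassels, with one low-confidence false alarm, yield P = 2/3 and R = 1.0 under this matching.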
Fig. 8 illustrates the visualization of detection results from the various models, where red boxes indicate correctly detected maize tassels and yellow boxes represent missed detections. As shown in Fig. 8a, the Faster R-CNN model achieved relatively comprehensive detection and could identify most tassels; however, it still exhibited missed and duplicate detections, especially around the edges of images, indicating insufficient edge feature extraction capability. The SSD model showed a particularly severe missed-detection problem, with a notably higher number of yellow boxes, and especially struggled to detect tassels in edge regions, as seen in Fig. 8b, reflecting its poorer adaptability to complex background environments. In contrast, the YOLO series models (YOLOv5s, YOLOv7-tiny, YOLOv8n, YOLOv10n, YOLOv11n) demonstrated generally better detection performance, accurately recognizing most tassel targets: the red boxes were dense and mostly complete, though a few missed detections remained, particularly in areas with occlusion or strong light reflections. This decline in detection accuracy was likely due to environmental factors such as lighting reflections, occlusions, and resolution limitations in field conditions, which made feature extraction challenging and affected the precision of individual tassel bounding boxes. Among all models, LightTassel-YOLO performed the best. As shown in Fig. 8h, it detected the highest number of tassel targets, effectively reducing missed detections in complex environments and showcasing stronger feature extraction and detection capabilities.
Fig. 8 Performance of different target detection models in the detection of maize tassel study

2.6 Robustness analysis of the model

In this section, the robustness of the LightTassel-YOLO model was evaluated on the task of maize tassel detection. To further analyze the model's performance, Table 7 presents the test results of LightTassel-YOLO on datasets collected under varying tasseling stages, maize varieties, image acquisition heights, and weather conditions.
Table 7 LightTassel-YOLO test results on different maize tassel test sets

| Dataset dimension | Test set | P/% | R/% | AP@0.5/% | AP@0.5:0.95/% |
| --- | --- | --- | --- | --- | --- |
| Period | Early tasseling | 88.4 | 79.1 | 89.0 | 50.4 |
| Period | Partial tasseling | 90.8 | 84.0 | 91.0 | 52.7 |
| Period | Full tasseling | 91.9 | 88.9 | 93.5 | 57.2 |
| Height | 5 m | 91.4 | 87.7 | 93.6 | 53.3 |
| Height | 10 m | 90.4 | 86.0 | 91.6 | 52.7 |
| Weather | Sunny | 89.6 | 86.7 | 91.2 | 53.9 |
| Weather | Cloudy | 91.9 | 90.8 | 94.5 | 57.2 |
| Variety | DN279 | 91.9 | 88.0 | 93.6 | 56.5 |
| Variety | DN285 | 90.3 | 84.3 | 91.4 | 54.9 |
| Variety | AB368 | 84.8 | 82.1 | 86.6 | 52.7 |
| Variety | QS370 | 85.3 | 83.0 | 87.9 | 53.4 |
The results showed that detection performed best at the full tasseling stage (P = 91.9%, R = 88.9%, AP@0.5 = 93.5%, AP@0.5:0.95 = 57.2%), followed by the partial tasseling stage, while the early tasseling stage achieved the lowest accuracy (P = 88.4%, R = 79.1%, AP@0.5 = 89.0%, AP@0.5:0.95 = 50.4%). The phenotypic differences of maize tassels at various growth stages led to these detection disparities, as shown in Fig. 9a. During early tasseling, the tassels were relatively small, so after UAV image processing the model's feature extraction was limited, resulting in poorer detection performance. As the tassels developed and fully flowered, their color and phenotypic features became more prominent, which facilitated detection by the model. The detection performance on the 5 m altitude test set was better than that on the 10 m test set, with P, R, AP@0.5, and AP@0.5:0.95 higher by 1.0, 1.7, 2.0, and 0.6 percentage points, respectively. Fig. 9b suggests that this may be because image resolution decreased as the UAV collection height increased, making tassel features less distinct and thereby reducing detection accuracy. Detection under cloudy conditions outperformed that under sunny conditions. As shown in Fig. 9c, strong natural sunlight on sunny days caused severe reflections on maize leaves, resulting in overexposed images and blurred boundaries between tassels and leaves, making tassels difficult to distinguish. Under cloudy conditions, the overall image was darker and the tassels and leaves reflected light differently, making it easier for the model to separate tassels from the complex background; cloudy weather was thus more conducive to detection. Among the varieties, DN279 achieved the best detection results (P = 91.9%, R = 88.0%, AP@0.5 = 93.6%, AP@0.5:0.95 = 56.5%), followed by DN285, while AB368 performed relatively worse (P = 84.8%, R = 82.1%, AP@0.5 = 86.6%, AP@0.5:0.95 = 52.7%).
This likely stemmed from DN279 and DN285 being part of the training dataset of the LightTassel-YOLO model, giving it stronger feature extraction capabilities for these varieties. In contrast, AB368 and QS370 were not included in training and were used only for testing, thus showing larger phenotypic differences and lower detection performance. Additionally, Fig. 9d shows that DN279 tassels were more concentrated, morphologically clear, and deeper in color, contrasting distinctly with the surrounding leaves and allowing the model to identify tassel locations and boundaries more accurately; this variety demonstrated strong robustness and precision during detection. DN285's tassels had a more branched and slender morphology, and its denser leaves caused some occlusions between tassels and between tassels and leaves; however, the model still accurately detected most occluded tassels. AB368 showed relatively weaker detection results due to its slender tassel branches and low color contrast with the leaves, which reduced the model's discrimination capability; some tassels also had complex shapes that increased detection difficulty, resulting in generally lower accuracy. QS370's detection results fell between those of DN285 and AB368: its tassels were more scattered with weaker features, leading to lower confidence scores for some targets, and the thin branches were more easily occluded by the complex background, affecting detection completeness. Overall, these findings indicated that both the morphological traits of maize tassels and background complexity affect the detection performance of LightTassel-YOLO. Nonetheless, the model still achieved favorable results across diverse test conditions, demonstrating strong robustness and adaptability.
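Condition-wise breakdowns like Table 7 can be produced by tagging each per-image evaluation record with its acquisition metadata and averaging within each group. The sketch below uses illustrative field names and made-up numbers, not the paper's records:

```python
from collections import defaultdict

# Group per-image AP@0.5 records by a metadata tag (here "weather")
# and average within each group; the records below are illustrative only.
records = [
    {"weather": "sunny", "ap50": 91.0},
    {"weather": "sunny", "ap50": 91.4},
    {"weather": "cloudy", "ap50": 94.3},
    {"weather": "cloudy", "ap50": 94.7},
]
groups = defaultdict(list)
for r in records:
    groups[r["weather"]].append(r["ap50"])
means = {k: round(sum(v) / len(v), 1) for k, v in groups.items()}
print(means)  # → {'sunny': 91.2, 'cloudy': 94.5}
```

The same grouping applied to tags such as tasseling stage, flight altitude, or variety reproduces the other rows of a Table 7-style breakdown.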
Fig. 9 Detection effect of LightTassel-YOLO on different test sets

3 Conclusions

Among the growth stages of maize, the tasseling stage is crucial, and accurate tassel detection is essential for detasseling operations. However, existing methods lack sufficient feature extraction capability, resulting in unsatisfactory detection performance; even when deployed on detasseling UAVs, manual intervention is often required. To address this, LightTassel-YOLO, a maize tassel detection algorithm based on YOLOv11n, was proposed in this research. Using UAV-acquired aerial images from 2023 and 2024, a large, high-quality dataset was built, covering various tasseling stages, maize varieties, flight altitudes, and weather conditions. Key improvements include integrating EfficientViT into the backbone for enhanced multi-scale feature perception while maintaining real-time efficiency, designing the channel-priority convolutional attention module C2PSA-CPCA to capture spatial relations and reduce computation, and employing the C3k2-SCConv module with channel reordering to reduce redundant features and meet the requirements of low-cost embedded devices. Experiments show that LightTassel-YOLO achieves a P of 92.6%, R of 89.1%, and AP@0.5 of 94.7%, outperforming classic models such as the YOLO series, Faster R-CNN, and SSD. The model detects tassels effectively under various conditions and, while ensuring high-precision identification of maize tassels, substantially reduces the parameter count and computational load with strong robustness, making it suitable for deployment on edge devices.
Although this study achieved promising results and can help reconcile real-time detection with resource limitations in agricultural production, the model still has certain limitations. The training data are primarily RGB images and do not incorporate multimodal information, which limits the stability of feature extraction for tassels, particularly during early tasseling stages or under complex lighting conditions. In summary, LightTassel-YOLO is not only effective for maize tassel detection but also provides a perceptual foundation for precision detasseling operations, offering new insights and paradigms for the research and application of agricultural target detection models, with strong potential for broader application and further study. Future work includes expanding the dataset to improve generalization and integrating multispectral data to enhance early maize tassel detection accuracy.

All authors declare no competing interests.

[1]
WANG X W, LI X Y, LOU Y S, et al. Refined evaluation of climate suitability of maize at various growth stages in major maize-producing areas in the north of China[J]. Agronomy, 2024, 14(2): ID 344.

[2]
YANG J X, ZHANG R R, DING C C, et al. YOLO-detassel: Efficient object detection for Omitted Pre-Tassel in detasseling operation for maize seed production[J]. Computers and electronics in agriculture, 2025, 231: ID 109951.

[3]
THOMPSON A L, THORP K R, CONLEY M M, et al. Comparing nadir and multi-angle view sensor technologies for measuring in-field plant height of upland cotton[J]. Remote sensing, 2019, 11(6): ID 700.

[4]
CHIVASA W, MUTANGA O, BIRADAR C. UAV-based multispectral phenotyping for disease resistance to accelerate crop improvement under changing climate conditions[J]. Remote sensing, 2020, 12(15): ID 2445.

[5]
KUMAR C, MUBVUMBA P, HUANG Y B, et al. Multi-stage corn yield prediction using high-resolution UAV multispectral data and machine learning models[J]. Agronomy, 2023, 13(5): ID 1277.

[6]
LU H, CAO Z G, XIAO Y, et al. Region-based colour modelling for joint crop and maize tassel segmentation[J]. Biosystems engineering, 2016, 147: 139-150.

[7]
KURTULMUŞ F, KAVDIR İ. Detecting corn tassels using computer vision and support vector machines[J]. Expert systems with applications, 2014, 41(16): 7390-7397.

[8]
ZHANG W Q, WU S, WEN W L, et al. Three-dimensional branch segmentation and phenotype extraction of maize tassel based on deep learning[J]. Plant methods, 2023, 19(1): ID 76.

[9]
GAO R, JIN Y S, TIAN X, et al. YOLOv5-T: A precise real-time detection method for maize tassels based on UAV low altitude remote sensing images[J]. Computers and electronics in agriculture, 2024, 221: ID 108991.

[10]
YU X, YIN D M, NIE C W, et al. Maize tassel area dynamic monitoring based on near-ground and UAV RGB images by U-Net model[J]. Computers and electronics in agriculture, 2022, 203: ID 107477.

[11]
WAN T Y, RAO Y, JIN X, et al. Improved U-Net for growth stage recognition of in-field maize[J]. Agronomy, 2023, 13(6): ID 1523.

[12]
LIU L B, YU L J, WU D, et al. PocketMaize: An Android-smartphone application for maize plant phenotyping[J]. Frontiers in plant science, 2021, 12: ID 770217.

[13]
LI Y H, ZHAO Z X, LUO Y F, et al. Real-time pattern-recognition of GPR images with YOLOv3 implemented by tensorflow[J]. Sensors, 2020, 20(22): ID 6476.

[14]
PU H L, CHEN X, YANG Y Y, et al. Tassel-YOLO: A new high-precision and real-time method for maize tassel detection and counting based on UAV aerial images[J]. Drones, 2023, 7(8): ID 492.

[15]
NIU S W, NIE Z G, LI G, et al. Multi-altitude corn tassel detection and counting based on UAV RGB imagery and deep learning[J]. Drones, 2024, 8(5): ID 198.

[16]
JIA Y J, FU K, LAN H, et al. Maize tassel detection with CA-YOLO for UAV images in complex field environments[J]. Computers and electronics in agriculture, 2024, 217: ID 108562.

[17]
LIU Y L, CEN C J, CHE Y P, et al. Detection of maize tassels from UAV RGB imagery with faster R-CNN[J]. Remote sensing, 2020, 12(2): ID 338.

[18]
FALAHAT S, KARAMI A. Maize tassel detection and counting using a YOLOv5-based model[J]. Multimedia tools and applications, 2023, 82(13): 19521-19538.

[19]
REDMON J, DIVVALA S, GIRSHICK R, et al. You only look once: Unified, real-time object detection[C]// 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway, New Jersey, USA: IEEE, 2016: 779-788.

[20]
KHANAM R, HUSSAIN M. YOLOv11: An overview of the key architectural enhancements[EB/OL]. arXiv: 2410.17725, 2024.

[21]
CAI H, LI J, HU M, et al. EfficientViT: Multi-scale linear attention for high-resolution dense prediction[EB/OL]. arXiv: 2205.14756, 2022.

[22]
VOITA E, TALBOT D, MOISEEV F, et al. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned[EB/OL]. arXiv:1905.09418, 2019.

[23]
NASCIMENTO M G D, PRISACARIU V, FAWCETT R. DSConv: Efficient convolution operator[C]// 2019 IEEE/CVF International Conference on Computer Vision (ICCV). Piscataway, New Jersey, USA: IEEE, 2019: 5147-5156.

[24]
HUANG H, CHEN Z, ZOU Y, et al. Channel prior convolutional attention for medical image segmentation[J]. Computers in biology and medicine, 2024, 178: ID 108784.

[25]
WOO S, PARK J, LEE J Y, et al. CBAM: Convolutional block attention module[C]// Computer Vision-ECCV 2018. Cham, Switzerland: Springer, 2018: 3-19.

[26]
LI J F, WEN Y, HE L H. SCConv: Spatial and channel reconstruction convolution for feature redundancy[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway, New Jersey, USA: IEEE, 2023: 6153-6162.

[27]
ZHANG X L, WANG Z S, WANG X S, et al. STARNet: An efficient spatiotemporal feature sharing reconstructing network for automatic modulation classification[J]. IEEE transactions on wireless communications, 2024, 23(10): 13300-13312.

[28]
CHEN H, WANG Y, GUO J, et al. VanillaNet: The power of minimalism in deep learning[J]. Advances in neural information processing systems, 2023, 36: 7050-7064.

[29]
QIN D F, LEICHNER C, DELAKIS M, et al. MobileNetV4: Universal models for the mobile ecosystem[C]// Computer Vision-ECCV 2024. Cham, Switzerland: Springer Nature Switzerland, 2024: 78-96.

[30]
MA N N, ZHANG X Y, ZHENG H T, et al. ShuffleNet V2: Practical guidelines for efficient CNN architecture design[C]// Computer Vision-ECCV 2018. Cham, Switzerland: Springer, 2018: 122-138.

[31]
WANG A, CHEN H, LIN Z J, et al. RepViT: Revisiting mobile CNN from ViT perspective[C]// 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway, New Jersey, USA: IEEE, 2024: 15909-15920.
