In recent years, with the rapid development of deep learning, object detection algorithms have evolved from traditional algorithms to deep-learning-based two-stage and single-stage methods, with significant improvements in detection speed and accuracy. Traditional algorithms rely on hand-designed features and classifiers: they classify and locate objects through candidate region generation, feature extraction, and an SVM (Support Vector Machine)
[3] or linear regression model. Two-stage algorithms based on deep learning divide the object detection task into two stages, candidate region generation followed by classification and regression; representative algorithms include R-CNN (Region-based Convolutional Neural Network)
[4], Fast R-CNN
[5] (Fast Region-based Convolutional Neural Network), and Faster R-CNN (Faster Region-based Convolutional Neural Network)
[6]. Although two-stage object detection algorithms generally achieve higher accuracy, they often involve complex models and run more slowly. Single-stage methods combine object localization and classification into a single process; representative algorithms include SSD (Single Shot MultiBox Detector)
[7] and the YOLO series
[8-11], which are fast but less accurate, so researchers have focused on improving single-stage algorithms to balance speed and accuracy. Ge et al.
[12] proposed YOLOX on the basis of YOLOv5; it treats the classification and regression tasks separately to improve detection accuracy and convergence speed. CBAM-YOLOv5
[13] incorporates CBAM (Convolutional Block Attention Module) into the backbone network, enhancing the model's focus on object regions through channel and spatial attention mechanisms. FE-YOLOv5
[14] combines a feature enhancement module (FEM) and a spatial awareness module (SAM) to capture finer semantic information and foreground features. Kim et al.
[15] proposed an ECAP-YOLO model that enhances the YOLO backbone network with an efficient pyramid channel attention mechanism. Luo et al.
[16] improved YOLOv5 by incorporating a feature extraction module containing three asymmetric convolutions to enhance the extraction of obscure features. The LCB-YOLOv5
[17] model integrates a lightweight stabilization module (LSM) and a cross-level partial network with three convolutional structure modules to extract multi-dimensional features of small objects. In addition, anchor-free object detection algorithms, such as FCOS (Fully Convolutional One-Stage)
[18] and FoveaBox
[19], localize objects by predicting whether each pixel in the image is the center of an object. Other approaches, such as CornerNet
[20], Grid R-CNN
[21], and RepPoints
[22], are also point-based, but model the spatial structure of the object differently (for example, through corner keypoints, grid points, or representative point sets) to improve localization accuracy. More recently, detection models such as DETR
[23] (DEtection TRansformer), Deformable DETR
[24], and Sparse R-CNN
[25] have introduced an end-to-end object detection paradigm that no longer relies on the traditional candidate box generation process, marking a methodological breakthrough.
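
To make the center-based localization of anchor-free detectors more concrete, the following minimal Python sketch computes, for a single pixel, the distances to the four sides of a ground-truth box and a center-ness score in the spirit of FCOS. The function name `fcos_targets` and the `(x_min, y_min, x_max, y_max)` box format are illustrative assumptions, not taken from any of the cited implementations.

```python
import math


def fcos_targets(px, py, box):
    """Per-pixel regression targets and center-ness (illustrative sketch).

    box is assumed to be (x_min, y_min, x_max, y_max) in pixel coordinates.
    Returns (l, t, r, b, centerness), or None if the pixel is not strictly
    inside the box.
    """
    x_min, y_min, x_max, y_max = box
    # Distances from the pixel to the left, top, right, and bottom sides.
    l, t = px - x_min, py - y_min
    r, b = x_max - px, y_max - py
    if min(l, t, r, b) <= 0:
        return None  # pixel lies outside the box or on its border
    # Center-ness is 1 at the box center and decays toward the sides.
    centerness = math.sqrt(
        (min(l, r) / max(l, r)) * (min(t, b) / max(t, b))
    )
    return l, t, r, b, centerness


center = fcos_targets(50, 50, (0, 0, 100, 100))
off_center = fcos_targets(25, 50, (0, 0, 100, 100))
outside = fcos_targets(150, 50, (0, 0, 100, 100))
```

Under these assumptions, a pixel at the exact center of the box receives a score of 1.0, pixels closer to a side receive lower scores, and pixels outside the box are not treated as positive samples.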