[Objective] This study addresses the accuracy and efficiency of detecting tree planting sites (tree pits) in the Inner Mongolia section of China's 'Three North Project'. Traditional manual field surveys of tree planting sites are both inefficient and error-prone, and low-altitude unmanned aerial vehicles (UAVs) have become the preferred way to overcome these problems. To this end, the research team proposed a model for accurate recognition and detection of tree planting sites based on YOLOv10-MHSA. [Methods] A long-endurance, multi-purpose, vertical take-off and landing fixed-wing UAV was used to collect images of tree planting sites. The UAV carried a 26-megapixel camera with high spatial resolution, suitable for high-precision field mapping. Aerial photography was carried out from 11:00 to 12:00 on August 1, 2024, in sunny weather with a wind force of 3. The flight height was set to 150 m (ground resolution of about 2.56 cm), the forward overlap was 75%, the side overlap was 65%, and the flight speed was 20 m/s. After image acquisition, the aerial images were stitched in Metashape (v2.1.0) to generate a digital orthophoto map (DOM) covering about 2 000 mu (880 m×1 470 m) of tree planting sites, which was cut with a 640-pixel sliding window into 3 102 high-definition RGB images of 640×640 pixels for subsequent detection and analysis. To prevent overfitting during network training, the research team augmented and partitioned the original data set. Increasing the amount of training data, introducing different attention mechanisms, and optimizing the loss function improved the quality and efficiency of model training.
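The 640-pixel sliding-window cut described above can be illustrated with a minimal sketch. The function name `tile_origins` and the edge handling (shifting the last window back so it stays inside the image) are assumptions made for illustration; the abstract does not state how border or empty windows were treated, so the number of tiles this produces need not match the 3 102 images reported.

```python
def tile_origins(width, height, tile=640, stride=640):
    """Return the (x, y) top-left corners of tile x tile windows over an image.

    With stride == tile the windows do not overlap, except possibly the last
    one on each axis, which is shifted back to stay inside the image
    (an assumed edge policy, not taken from the study).
    """
    def axis_starts(length):
        if length <= tile:
            return [0]
        starts = list(range(0, length - tile + 1, stride))
        if starts[-1] + tile < length:
            # Cover the remaining strip with one extra, overlapping window.
            starts.append(length - tile)
        return starts

    return [(x, y) for y in axis_starts(height) for x in axis_starts(width)]


# Hypothetical 1 000 x 1 280 pixel image: two columns by two rows of windows.
for x, y in tile_origins(1000, 1280):
    print(f"crop box: ({x}, {y}, {x + 640}, {y + 640})")
```

Each returned origin defines one crop box that could be passed to an image library to write the tile to disk.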
The more effective EIOU loss function was introduced. It consists of three parts, IOU loss, distance loss, and aspect (width-height) loss, and directly minimizes the width and height differences between the target box and the anchor box, resulting in faster convergence and better localization. In addition, the Focal-EIOU loss function was introduced to address sample imbalance in the bounding box regression task, which further improved the convergence speed and localization accuracy of the model. [Results and Discussions] After the multi-head self-attention (MHSA) mechanism was introduced, the model improved by 1.4% and 1.7% on AP@0.5 and AP@0.5:0.95, respectively, and precision and recall also improved. This showed that MHSA helped the model extract target feature information and improved detection accuracy against complex backgrounds. Although the processing speed of the model decreased slightly after the attention mechanism was added, the decrease was small, and the model still met the requirements of real-time detection. For the loss function, the experiment compared four candidates: CIOU, SIOU, EIOU, and Focal-EIOU. The results showed that Focal-EIOU improved detection accuracy, and precision and recall also improved significantly. This indicated that Focal-EIOU could accelerate convergence and improve localization accuracy when dealing with sample imbalance in small-target detection. Although the processing speed of the model was slightly reduced, it still met the requirements of real-time detection. Finally, an improved model, YOLOv10-MHSA, was proposed, which combines the MHSA attention mechanism, a small-target detection layer, and the Focal-EIOU loss function.
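As a rough illustration of the loss described in the Methods, the sketch below computes the three EIOU terms (IOU loss, centre-distance loss, and the width and height difference loss) for a single pair of boxes, then applies the focal weighting IOU^gamma. The function names, the `(x1, y1, x2, y2)` box format, and the default `gamma=0.5` are assumptions for illustration; the abstract does not give the hyperparameters or implementation used in the study.

```python
def eiou_loss(pred, target, eps=1e-9):
    """Return (EIOU loss, IOU) for two boxes given as (x1, y1, x2, y2)."""
    px1, py1, px2, py2 = pred
    tx1, ty1, tx2, ty2 = target
    pw, ph = px2 - px1, py2 - py1
    tw, th = tx2 - tx1, ty2 - ty1

    # Intersection over union.
    iw = max(0.0, min(px2, tx2) - max(px1, tx1))
    ih = max(0.0, min(py2, ty2) - max(py1, ty1))
    inter = iw * ih
    union = pw * ph + tw * th - inter
    iou = inter / (union + eps)

    # Width and height of the smallest box enclosing both boxes.
    cw = max(px2, tx2) - min(px1, tx1)
    ch = max(py2, ty2) - min(py1, ty1)

    # Distance term: squared centre offset over squared enclosing diagonal.
    dx = (px1 + px2) / 2 - (tx1 + tx2) / 2
    dy = (py1 + py2) / 2 - (ty1 + ty2) / 2
    distance = (dx * dx + dy * dy) / (cw * cw + ch * ch + eps)

    # Width/height term: squared differences over enclosing width/height.
    size = (pw - tw) ** 2 / (cw * cw + eps) + (ph - th) ** 2 / (ch * ch + eps)

    return 1.0 - iou + distance + size, iou


def focal_eiou_loss(pred, target, gamma=0.5):
    """Weight the EIOU loss by IOU**gamma so better-overlapping boxes count more."""
    loss, iou = eiou_loss(pred, target)
    return (iou ** gamma) * loss


# Hypothetical boxes: same size, shifted by half a width.
print(focal_eiou_loss((0, 0, 10, 10), (5, 0, 15, 10)))
```

One consequence of this weighting is that boxes with no overlap receive zero weight, so practical implementations often combine it with other safeguards; whether the study did so is not stated.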
The ablation experiments showed that adding only the small-target detection layer to YOLOv10n increased AP@0.5 and AP@0.5:0.95 by 2.1% and 0.9%, respectively, and significantly improved precision and recall. When MHSA and the Focal-EIOU loss function were further added, detection performance improved further. Compared with the baseline YOLOv10n, AP@0.5, AP@0.5:0.95, precision (P), and recall (R) improved by 6.6%, 9.8%, 4.4%, and 5.1%, respectively. Although the frame rate dropped to 109 FPS, the improved model clearly outperformed the original model in a variety of complex scenes, especially for small targets in densely distributed and occluded scenes. [Conclusions] This study improved the YOLOv10n model by introducing MHSA and an optimized loss function (Focal-EIOU), which significantly improved the accuracy and efficiency of tree planting site detection in the Inner Mongolia section of the 'Three North Project'. The experimental results show that MHSA enhances the model's ability to extract local and global target information against complex backgrounds and effectively reduces missed and false detections. The Focal-EIOU loss function accelerates convergence and improves localization accuracy by addressing sample imbalance in the bounding box regression task. Although the processing speed of the model declined, it still meets real-time detection requirements and provides strong technical support for scientific afforestation in the 'Three North Project'.
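To make concrete how MHSA lets every position draw on both local and global context, the following is a minimal pure-Python sketch of multi-head self-attention over a short sequence of feature vectors. The function names and the projection matrices `w_q`, `w_k`, `w_v`, `w_o` are hypothetical placeholders for learned weights; how the attention block is wired into YOLOv10n in the study is not described in the abstract, so this shows only the generic mechanism.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """x: n tokens x d features; w_*: d x d matrices (hypothetical learned weights)."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_model = len(q[0])
    if d_model % num_heads != 0:
        raise ValueError("feature size must be divisible by num_heads")
    d_head = d_model // num_heads

    concat = [[] for _ in x]
    for h in range(num_heads):
        lo, hi = h * d_head, (h + 1) * d_head
        qh = [row[lo:hi] for row in q]
        kh = [row[lo:hi] for row in k]
        vh = [row[lo:hi] for row in v]
        # Scaled dot-product scores between every pair of tokens.
        scores = [
            [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d_head) for kj in kh]
            for qi in qh
        ]
        weights = [softmax(r) for r in scores]
        head_out = matmul(weights, vh)
        # Concatenate this head's output onto each token's feature vector.
        for i, row in enumerate(head_out):
            concat[i].extend(row)

    return matmul(concat, w_o)


# Hypothetical input: 3 tokens with 4 features, identity projections, 2 heads.
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
tokens = [[1.0, 0.0, 2.0, 0.0], [0.0, 1.0, 0.0, 2.0], [1.0, 1.0, 1.0, 1.0]]
for row in multi_head_self_attention(tokens, identity, identity, identity, identity, 2):
    print(row)
```

Because each token's output is a weighted mix of all tokens' values, the cost grows with the square of the number of positions, which is consistent with the modest drop in processing speed reported above.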