Temporal Action Localization of Mounting Behavior in Dairy Goats Based on an Improved AdaTAD

doi:10.12133/j.smartag.SA202601012

Smart Agriculture

WANG Jiayuan¹, LI Qitong¹, LUO Yuantao¹, YANG Shuqin², WANG Zhenhua¹, NING Jifeng¹, WANG Meili¹

1. College of Information Engineering, Northwest A & F University, Yangling, Shaanxi 712100, China
2. College of Mechanical and Electronic Engineering, Northwest A & F University, Yangling, Shaanxi 712100, China

Received: 2026-01-09 Online: 2026-04-22
Foundation items:
National Key Research and Development Program of China (2022YFD1300200); Shaanxi Qinchuangyuan High-level Innovation and Entrepreneurship Talent Program (QCYRCXM-2022-359)
About author:
WANG Jiayuan, master's student; research interest: computer-vision-based temporal action localization of mounting behavior in dairy goats. E-mail: yolo_wjy@nwafu.edu.cn
Corresponding author:
NING Jifeng, Ph.D., professor; research interests: computer vision and machine learning. E-mail: njf@nwsuaf.edu.cn

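Summary:

[Objective/Significance] Temporal localization of mounting behaviour in dairy goats is an important basis for reproductive management. Existing methods mostly stop at behaviour discrimination, characterise the start and end boundaries of brief, sudden behaviours in untrimmed videos poorly, and are easily affected by occlusion, viewpoint variation, and background interference. To address these problems, an improved end-to-end temporal action localization method based on Adapter Tuning for Temporal Action Detection (AdaTAD) is proposed to recognise mounting behaviour accurately and to localize its start and end times precisely. [Methods] With the AdaTAD framework as the baseline, visual prompt tuning is introduced: a small number of learnable prompt tokens provide task guidance to the attention distribution of the backbone, enhancing feature responses at key frames and in boundary neighbourhoods. A multi-scale motion-aware adapter is designed that uses parallel multi-scale temporal depthwise separable convolution branches to model motion patterns at different temporal scales, and combines residual connections with nonlinear mappings to inject features into the backbone stably, improving the joint modelling of brief micro-actions and relatively complete action processes. [Results and Discussion] The proposed method reaches a mean average precision of 81.72%, 5.00 percentage points higher than the baseline AdaTAD. Under the stricter condition of a temporal intersection over union of 0.7 it reaches 68.85%, 4.06 percentage points higher than AdaTAD, indicating that the method retains its advantage when high boundary precision is required. The model infers at 65.78 frames per second with 27.941 M trainable parameters, keeping overhead low while improving accuracy. [Conclusions] The method improves the accuracy and stability of temporal localization of dairy goat mounting behaviour in complex farming scenes and provides key temporal information for reproductive behaviour monitoring and management decisions.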

Abstract:

[Objective] Accurate temporal localization of mounting behaviour in dairy goats is important for intelligent reproductive management, as event frequency, onset time, and duration provide useful evidence for heat monitoring and mating decisions. Unlike simple behaviour recognition, temporal localization in untrimmed videos enables fine-grained, time-resolved records for practical farm use. However, real-world mounting behaviour is usually brief and sporadic, with few informative frames in long video streams. Moreover, weak discrimination from similar non-target interactions, together with occlusion, viewpoint variation, and background motion, often degrades boundary-aware representation learning and leads to unstable start–end localization. To address these challenges, an improved end-to-end temporal action localization approach based on AdaTAD (Adapter Tuning for Temporal Action Detection) is proposed for mounting behaviour in dairy goats, aiming to enhance localization accuracy and stability while maintaining practical efficiency for deployment. [Methods] The proposed approach adopted AdaTAD as the baseline end-to-end temporal action localization framework and introduced two complementary improvements, explicit key-frame guidance and multi-scale motion modelling, while retaining the original detection head and post-processing pipeline for generating temporal action instances. First, visual prompt tuning (VPT) was incorporated to provide task-conditioned guidance to backbone feature extraction in a parameter-efficient manner. Specifically, a small number of learnable prompt tokens were inserted into the Transformer backbone with the backbone parameters frozen. Through multi-head attention interactions between prompt tokens and patch tokens, the prompts steered attention towards mounting-relevant temporal regions, strengthened feature responses at critical frames and in boundary neighbourhoods, and improved the separability between brief target segments and abundant background frames.
Second, a multi-scale motion adapter (MSMA) was introduced to model motion patterns at different temporal scales and improve robustness to diverse scene dynamics. MSMA employed parallel multi-scale temporal depthwise separable convolution branches to capture short-, mid-, and longer-range temporal variations, enhancing representations of subtle short-duration micro-actions as well as relatively complete action processes. Residual connections and nonlinear mappings further stabilised feature injection and gradient propagation, enabling multi-scale dynamics to be integrated into backbone features with limited additional optimisation burden. Overall, VPT focused on boundary-relevant attention guidance, whereas MSMA emphasised multi-scale temporal dynamics modelling; together, they formed a complementary design within the end-to-end localization pipeline. [Results and Discussions] Comparative experiments showed that the proposed method achieved an average mean Average Precision (mAP) of 81.72% over temporal Intersection over Union (tIoU) thresholds from 0.3 to 0.7 in steps of 0.1, improving upon the baseline AdaTAD by 5.00 percentage points, indicating that incorporating VPT and MSMA enhanced overall localization performance. At a tIoU threshold of 0.7, the proposed method attained an mAP of 68.85%, exceeding AdaTAD by 4.06 percentage points, demonstrating that the performance gain was preserved under stricter temporal boundary-consistency criteria. Further comparisons with representative approaches, including TadTR, VSGN, AFSD, ActionFormer, TriDet, DyFADet, and Re2TAL, showed average mAP improvements of 38.82, 33.83, 25.29, 4.09, 2.83, 1.20, and 6.06 percentage points, respectively, demonstrating stronger overall competitiveness. In terms of efficiency, the model ran at 65.78 frames per second (f/s) with 27.941 million trainable parameters, indicating that the accuracy gains were achieved while maintaining a relatively low parameter overhead and practical runtime efficiency.
Overall, task-guided prompting and multi-scale temporal modelling improved key temporal feature representations with limited parameter increments, thereby benefiting localization of short, sporadic behaviours. [Conclusions] This study presents an improved AdaTAD-based end-to-end temporal action localization method for mounting behaviour in dairy goats. By combining VPT for boundary-relevant attention guidance with an MSMA for multi-scale temporal dynamics modelling, the proposed approach improves localization accuracy and maintains stable advantages under stricter boundary-consistency requirements, while preserving practical inference efficiency. The method provides critical temporal information for reproductive behaviour monitoring and decision support, and offers a feasible basis for building individual-level, time-resolved management systems in real farming environments.

Key words: mounting behavior, temporal action localization, dairy goats, precision livestock farming, multi-scale motion modeling, end-to-end model
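
The abstract names event frequency, onset time, and duration as the quantities that matter for heat monitoring. As a minimal sketch, assuming a localizer returns (start, end, score) triples in seconds, such records could be derived as shown below. `Segment`, `summarise_events`, the 0.5 score cut-off, and the example detections are illustrative assumptions and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # onset time in seconds (hypothetical detector output)
    end: float    # offset time in seconds
    score: float  # detection confidence in [0, 1]


def summarise_events(segments, video_length_s, min_score=0.5):
    """Turn localized segments into frequency, onset, and duration records."""
    # Keep confident detections only and order them by onset time.
    kept = sorted((s for s in segments if s.score >= min_score),
                  key=lambda s: s.start)
    return {
        "count": len(kept),
        "onsets_s": [s.start for s in kept],
        "durations_s": [round(s.end - s.start, 2) for s in kept],
        "events_per_hour": len(kept) * 3600.0 / video_length_s,
    }


# Made-up detections for a 30-minute clip; the 0.31-score segment is dropped.
detections = [
    Segment(912.4, 917.9, 0.88),
    Segment(125.0, 129.5, 0.93),
    Segment(640.2, 641.0, 0.31),
]
report = summarise_events(detections, video_length_s=1800.0)
print(report)
```

The score threshold trades missed events against false alarms, so in practice it would need to be chosen on validation data rather than fixed in advance.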

CLC number: S255

WANG Jiayuan, LI Qitong, LUO Yuantao, YANG Shuqin, WANG Zhenhua, NING Jifeng, WANG Meili. Temporal Action Localization of Mounting Behavior in Dairy Goats Based on an Improved AdaTAD[J]. Smart Agriculture, doi: 10.12133/j.smartag.SA202601012.

Figures and tables (14): Figures 1 to 6 and Tables 1 to 8


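Occlusion condition	Dataset	Front view	Side view	Rear view	Total
No occlusion	Training set/Test set	5/6	6/10	4/4	15/20
Goat occlusion	Training set/Test set	6/6	9/10	5/6	20/22
Facility occlusion	Training set/Test set	5/6	16/18	9/11	30/35
Total	Training set/Test set	16/18	31/38	18/21	65/77 (142 in all)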

Model category	Model	avg mAP	tIoU=0.3	tIoU=0.4	tIoU=0.5	tIoU=0.6	tIoU=0.7
Non-end-to-end models	TadTR	42.90	56.09	52.65	42.42	36.05	27.32
	VSGN	47.89	68.28	58.97	49.18	40.32	22.68
	ActionFormer	77.63	86.42	85.00	79.60	73.39	63.74
	TriDet	78.89	88.25	87.81	79.33	74.32	64.74
	DyFADet	80.52	91.36	90.11	79.89	76.99	64.23
End-to-end models	AFSD	56.43	81.76	73.13	60.93	45.61	20.69
	AdaTAD	76.72	87.10	80.90	78.49	72.32	64.79
	Re2TAL	75.66	86.71	82.96	77.28	71.17	60.17
	Ours	81.72	90.46	86.70	83.47	79.11	68.85
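
To make the metrics in the comparison table concrete, the sketch below computes the temporal IoU of two segments and reproduces the "avg mAP" column as the plain mean of the five per-threshold mAP values. The function names and the two example segment pairs are illustrative assumptions; only the per-threshold numbers come from the table.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) segments given in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def average_map(map_by_tiou):
    """Mean of the mAP values over all evaluated tIoU thresholds."""
    return sum(map_by_tiou.values()) / len(map_by_tiou)


# A 6 s prediction shifted by 1 s still passes tIoU = 0.7 ...
close = temporal_iou((12.0, 18.0), (13.0, 19.0))   # 5/7
# ... but starting 2 s early on a 5 s event fails at 0.7 and passes at 0.5.
loose = temporal_iou((12.0, 18.0), (14.0, 19.0))   # 4/7

# Per-threshold mAP (%) copied from the comparison table.
ours = {0.3: 90.46, 0.4: 86.70, 0.5: 83.47, 0.6: 79.11, 0.7: 68.85}
adatad = {0.3: 87.10, 0.4: 80.90, 0.5: 78.49, 0.6: 72.32, 0.7: 64.79}

avg_ours = round(average_map(ours), 2)
avg_base = round(average_map(adatad), 2)
gain = round(avg_ours - avg_base, 2)
print(avg_ours, avg_base, gain)  # 81.72 76.72 5.0
```

Because mounting events last only a few seconds, a boundary error of one or two seconds already moves a detection across the 0.7 threshold, which is why the tIoU=0.7 column is the most demanding one.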

Model	FPS (f/s)	Trainable parameters	Total parameters	avg mAP	tIoU=0.3	tIoU=0.4	tIoU=0.5	tIoU=0.6	tIoU=0.7
AdaTAD	66.19	27.703 M	49.583 M	76.72	87.10	80.90	78.49	72.32	64.79
AdaTAD + MSMA	58.51	27.936 M	49.815 M	81.00	88.64	86.59	82.56	77.73	69.47
AdaTAD + VPT	66.80	27.708 M	49.587 M	79.64	91.56	85.21	79.73	75.18	66.53
Ours	65.78	27.941 M	49.820 M	81.72	90.46	86.70	83.47	79.11	68.85
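
The "limited parameter increments" claim can be checked directly from the ablation table above. The following sketch only does arithmetic on the reported values; the dictionary layout and the `overhead` helper are assumptions made for illustration.

```python
# Values copied from the ablation table (parameters in millions, mAP in %).
rows = {
    "AdaTAD":        {"fps": 66.19, "trainable_m": 27.703, "avg_map": 76.72},
    "AdaTAD + MSMA": {"fps": 58.51, "trainable_m": 27.936, "avg_map": 81.00},
    "AdaTAD + VPT":  {"fps": 66.80, "trainable_m": 27.708, "avg_map": 79.64},
    "Ours":          {"fps": 65.78, "trainable_m": 27.941, "avg_map": 81.72},
}


def overhead(name, base="AdaTAD"):
    """Extra trainable parameters, relative speed, and mAP gain versus the baseline."""
    row, ref = rows[name], rows[base]
    return {
        "extra_trainable_m": round(row["trainable_m"] - ref["trainable_m"], 3),
        "fps_ratio": row["fps"] / ref["fps"],
        "avg_map_gain": round(row["avg_map"] - ref["avg_map"], 2),
    }


for name in ("AdaTAD + MSMA", "AdaTAD + VPT", "Ours"):
    print(name, overhead(name))
```

On these numbers the two modules add about 0.233 M and 0.005 M trainable parameters respectively, and the combined model adds 0.238 M, which is their sum.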

Number of layers	avg mAP	tIoU=0.3	tIoU=0.4	tIoU=0.5	tIoU=0.6	tIoU=0.7
2	78.00	90.70	85.41	79.13	72.74	62.00
4	77.22	88.75	85.39	77.32	72.03	62.60
6	81.72	90.46	86.70	83.47	79.11	68.85
8	79.12	90.12	85.12	81.45	74.20	64.74
10	79.10	92.10	87.06	79.15	72.40	64.79
12	80.69	91.57	85.40	82.19	77.15	67.15

Number of prompt tokens	avg mAP	tIoU=0.3	tIoU=0.4	tIoU=0.5	tIoU=0.6	tIoU=0.7
1	80.15	89.49	87.44	80.54	75.92	67.34
2	81.72	90.46	86.70	83.47	79.11	68.85
3	80.43	91.30	87.72	80.16	74.79	68.19
4	80.35	89.73	85.46	81.84	78.31	66.42
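
The table above varies the number of prompt tokens used by visual prompt tuning. As a rough, dependency-free illustration of the mechanism described in the abstract (learnable prompts attend jointly with patch tokens while the backbone stays frozen), the sketch below prepends two prompt vectors to a toy token sequence. It assumes a single attention head with identity query/key/value projections and discards the prompt outputs after the layer; the paper's actual backbone, insertion depth, and initialisation are not specified here, and all names are hypothetical.

```python
import math
import random


def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def self_attention(tokens):
    """Single-head attention with identity Q/K/V projections (a simplification)."""
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        weights = softmax([dot(query, key) / math.sqrt(dim) for key in tokens])
        outputs.append([sum(w * tok[i] for w, tok in zip(weights, tokens))
                        for i in range(dim)])
    return outputs


def prompted_layer(patch_tokens, prompt_tokens):
    """Prepend prompts, attend over the joint sequence, keep only patch outputs."""
    joint = prompt_tokens + patch_tokens
    attended = self_attention(joint)
    return attended[len(prompt_tokens):]


random.seed(0)
dim, num_patches, num_prompts = 4, 6, 2
patches = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_patches)]
# In prompt tuning these vectors would be the only values updated by training.
prompts = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_prompts)]

out = prompted_layer(patches, prompts)
print(len(out), len(out[0]))  # 6 4
```

The output keeps the shape of the patch sequence, so downstream layers are unaffected structurally, while every patch output now depends on the prompt vectors through the attention weights.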


MSMA branch scale setting	avg mAP	tIoU=0.3	tIoU=0.4	tIoU=0.5	tIoU=0.6	tIoU=0.7
k={1, 3, 5} (ours)	81.72	90.46	86.70	83.47	79.11	68.85
k={3} (baseline)	79.64	91.56	85.21	79.73	75.18	66.53
k={1, 3}	79.36	89.15	85.77	80.54	75.97	65.20
k={3, 5}	78.02	90.29	84.86	79.87	73.29	61.79
k={1, 3, 5, 7}	80.80	91.40	86.54	81.66	76.21	68.17
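
The branch settings in the table above can be pictured with a small, dependency-free sketch of a multi-scale temporal adapter. It assumes that k denotes the temporal kernel size of each depthwise branch, that branch outputs are averaged, and that a ReLU plus a pointwise (channel-mixing) map precedes the residual addition. These fusion and ordering choices, the box-filter weights, and all function names are assumptions for illustration, not the paper's implementation.

```python
def depthwise_temporal_conv(x, kernels):
    """Convolve each channel independently along time (zero padding, same length).

    x: list of T frames, each a list of C channel values.
    kernels: list of C kernels, each a list of k weights with k odd.
    """
    num_frames, num_channels = len(x), len(x[0])
    half = len(kernels[0]) // 2
    out = []
    for t in range(num_frames):
        frame = []
        for c in range(num_channels):
            acc = 0.0
            for j, weight in enumerate(kernels[c]):
                s = t + j - half
                if 0 <= s < num_frames:
                    acc += weight * x[s][c]
            frame.append(acc)
        out.append(frame)
    return out


def pointwise(x, weight):
    """Per-frame linear map that mixes channels; weight is C_out x C_in."""
    return [[sum(w * v for w, v in zip(row, frame)) for row in weight]
            for frame in x]


def multi_scale_adapter(x, branch_kernels, mix, scale=1.0):
    """Average parallel depthwise branches, apply ReLU and a pointwise mix, add residual."""
    branches = [depthwise_temporal_conv(x, k) for k in branch_kernels]
    fused = [[sum(b[t][c] for b in branches) / len(branches)
              for c in range(len(x[0]))] for t in range(len(x))]
    activated = [[max(0.0, v) for v in frame] for frame in fused]
    update = pointwise(activated, mix)
    return [[xv + scale * uv for xv, uv in zip(xf, uf)]
            for xf, uf in zip(x, update)]


def box_kernels(k, channels):
    """Placeholder averaging kernels; real kernels would be learned."""
    return [[1.0 / k] * k for _ in range(channels)]


channels = 2
# Toy feature sequence (T = 8) with a two-frame burst of activity.
x = [[0.0, 0.0]] * 3 + [[1.0, 0.5], [1.0, 0.5]] + [[0.0, 0.0]] * 3
branches = [box_kernels(k, channels) for k in (1, 3, 5)]
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]

y = multi_scale_adapter(x, branches, identity)
unchanged = multi_scale_adapter(x, branches, zeros)
print(len(y), len(y[0]))  # 8 2
```

With an all-zero pointwise map the adapter reduces to the identity because of the residual path, which is one common way to let such a module start training without disturbing the backbone features.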

Fold/Statistic	Avg-mAP	mAP@0.3	mAP@0.4	mAP@0.5	mAP@0.6	mAP@0.7
Fold 1	77.43	88.44	84.78	79.60	73.11	61.21
Fold 2	79.00	88.26	85.07	79.81	75.76	66.10
Fold 3	80.81	89.52	86.04	80.87	77.91	69.72
Fold 4	82.19	94.40	87.96	84.77	77.58	66.23
Fold 5	80.74	90.10	85.08	81.44	76.56	70.54
Mean	80.03	90.14	85.79	81.30	76.18	66.76
Std.	1.84	2.50	1.31	2.08	1.92	3.69
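
The Mean and Std. rows of the five-fold table above can be reproduced from the per-fold values. The sketch below does this for two columns and suggests that Std. is the sample standard deviation (n − 1 in the denominator); the variable names are illustrative.

```python
from statistics import mean, stdev

# Per-fold values (%) copied from the five-fold table.
folds = {
    "Avg-mAP": [77.43, 79.00, 80.81, 82.19, 80.74],
    "mAP@0.7": [61.21, 66.10, 69.72, 66.23, 70.54],
}

# statistics.stdev uses the sample (n - 1) definition.
summary = {name: (round(mean(vals), 2), round(stdev(vals), 2))
           for name, vals in folds.items()}
print(summary)
# {'Avg-mAP': (80.03, 1.84), 'mAP@0.7': (66.76, 3.69)}
```

The spread is largest at tIoU=0.7, consistent with the strictest threshold being the most sensitive to how the folds are split.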

Statistic	Avg-mAP	mAP@0.3	mAP@0.4	mAP@0.5	mAP@0.6	mAP@0.7
Repeat1	80.83±2.63	90.79±1.80	86.56±1.85	82.23±2.32	76.85±3.34	67.70±3.91
Repeat2	80.92±2.08	90.74±1.47	86.56±1.44	82.25±1.71	76.93±2.73	68.13±3.17
Repeat3	80.91±2.35	90.73±1.64	86.53±1.73	82.27±2.04	76.97±3.00	68.07±3.40
Repeat4	80.76±2.76	90.62±1.96	86.39±1.99	82.17±2.35	76.81±3.37	67.79±4.19
Repeat5	80.92±2.28	90.71±1.53	86.63±1.71	82.23±1.95	76.89±2.89	68.13±3.42
Overall	80.87±2.22	90.72±1.54	86.53±1.60	82.23±1.91	76.89±2.81	67.97±3.33