Abstract: At present, most skeleton-based action recognition models built on graph convolutional networks share the same topology across all channels when extracting spatial features, which limits the expressive ability of spatial aggregation. When extracting temporal features, they merely stack multiple layers of one-dimensional local convolutions, so the correlation information between non-adjacent time frames is ignored. Therefore, a graph convolutional network model combining decoupled attention and temporal modeling is proposed. A decoupled attention graph convolution module and a channel attention module focus more attention on key channel information, improving the expressive ability of the graph convolutional network in spatial aggregation. A multi-scale temporal modeling module captures the temporal relationships between both adjacent and non-adjacent time steps, fully extracting the temporal dynamic features of skeleton sequences. Experiments were conducted on the publicly available large-scale datasets NTU RGB+D, NTU RGB+D 120, and Kinetics-Skeleton. Top-1 accuracies are 90.1% (CV) and 96.0% (CS) on NTU RGB+D, 86.0% (CSub) and 87.2% (CSet) on NTU RGB+D 120, and 37.3% on Kinetics-Skeleton. The results show that the recognition accuracy surpasses current mainstream methods, improving the accuracy of human skeleton action recognition.
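
To make the two architectural ideas in the abstract concrete, the following is a minimal PyTorch sketch, not the authors' implementation: (1) a decoupled graph convolution in which groups of channels aggregate over their own learnable adjacency, followed by an SE-style channel attention, and (2) a multi-scale temporal convolution that mixes several dilation rates so that non-adjacent frames are related directly. All module and parameter names (DecoupledGraphConv, MultiScaleTemporalConv, groups, dilations) are illustrative assumptions rather than names from the paper.

```python
import torch
import torch.nn as nn


class DecoupledGraphConv(nn.Module):
    """Each of `groups` channel groups aggregates over its own learnable V x V adjacency."""
    def __init__(self, in_ch, out_ch, num_joints, groups=8):
        super().__init__()
        assert out_ch % groups == 0
        self.groups = groups
        self.proj = nn.Conv2d(in_ch, out_ch, kernel_size=1)                  # channel mixing
        self.adj = nn.Parameter(torch.eye(num_joints).repeat(groups, 1, 1))  # (G, V, V)
        # SE-style channel attention: reweight channels by global context
        self.att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(out_ch, out_ch // 4, 1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch // 4, out_ch, 1), nn.Sigmoid(),
        )

    def forward(self, x):                       # x: (N, C, T, V)
        x = self.proj(x)                        # (N, C_out, T, V)
        n, c, t, v = x.shape
        x = x.view(n, self.groups, c // self.groups, t, v)
        # per-group spatial aggregation, each group using its own adjacency
        x = torch.einsum('ngctv,gvw->ngctw', x, self.adj).reshape(n, c, t, v)
        return x * self.att(x)                  # emphasize key channels


class MultiScaleTemporalConv(nn.Module):
    """Parallel temporal branches with different dilations, concatenated along channels."""
    def __init__(self, channels, kernel_size=5, dilations=(1, 2, 3, 4)):
        super().__init__()
        branch_ch = channels // len(dilations)
        self.branches = nn.ModuleList()
        for d in dilations:
            pad = (kernel_size - 1) // 2 * d    # keep temporal length unchanged
            self.branches.append(nn.Sequential(
                nn.Conv2d(channels, branch_ch, 1), nn.ReLU(inplace=True),
                nn.Conv2d(branch_ch, branch_ch, (kernel_size, 1),
                          padding=(pad, 0), dilation=(d, 1)),
            ))

    def forward(self, x):                       # x: (N, C, T, V)
        return torch.cat([b(x) for b in self.branches], dim=1)


if __name__ == "__main__":
    x = torch.randn(2, 3, 64, 25)               # batch, coords (x,y,z), frames, joints
    x = DecoupledGraphConv(3, 64, num_joints=25)(x)
    x = MultiScaleTemporalConv(64)(x)
    print(x.shape)                               # torch.Size([2, 64, 64, 25])
```

Larger dilations let a single temporal layer relate frames several steps apart, which is one plausible way to realize the "adjacent and non-adjacent time steps" modeling described above.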