Abstract: At present, most skeleton-based action recognition models built on graph convolutional networks share the same topology across all channels when extracting spatial features, which limits the expressive ability of spatial aggregation. When extracting temporal features, they merely stack multiple layers of one-dimensional local convolutions, so the correlation information between non-adjacent time frames is ignored. Therefore, a graph convolutional network model combining decoupled attention and temporal modeling is proposed. A decoupled attention graph convolution module and a channel attention module focus more attention on key channel information, improving the expressive ability of the graph convolutional network in spatial aggregation. A multi-scale temporal modeling module captures the temporal relationships between both adjacent and non-adjacent time steps, fully extracting the temporal dynamic features of skeleton sequences. Experiments were conducted on the publicly available large-scale datasets NTU RGB+D, NTU RGB+D 120, and Kinetics-Skeleton. Top-1 accuracies are 90.1% (CV) and 96.0% (CS) on NTU RGB+D, 86.0% (CSub) and 87.2% (CSet) on NTU RGB+D 120, and 37.3% on Kinetics-Skeleton. The results show that the recognition accuracy surpasses current mainstream methods, improving the accuracy of human skeleton action recognition.
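
To make the two architectural ideas in the abstract concrete, the following is a minimal PyTorch sketch, not the authors' implementation: (1) a decoupled graph convolution in which groups of channels aggregate over their own learnable adjacency, followed by an SE-style channel attention, and (2) a multi-scale temporal convolution that mixes several dilation rates so that non-adjacent frames are related directly. All module and parameter names (DecoupledGraphConv, MultiScaleTemporalConv, groups, dilations) are illustrative assumptions rather than names from the paper.

```python
import torch
import torch.nn as nn


class DecoupledGraphConv(nn.Module):
    """Each of `groups` channel groups aggregates over its own learnable V x V adjacency."""
    def __init__(self, in_ch, out_ch, num_joints, groups=8):
        super().__init__()
        assert out_ch % groups == 0
        self.groups = groups
        self.proj = nn.Conv2d(in_ch, out_ch, kernel_size=1)                  # channel mixing
        self.adj = nn.Parameter(torch.eye(num_joints).repeat(groups, 1, 1))  # (G, V, V)
        # SE-style channel attention: reweight channels by global context
        self.att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(out_ch, out_ch // 4, 1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch // 4, out_ch, 1), nn.Sigmoid(),
        )

    def forward(self, x):                       # x: (N, C, T, V)
        x = self.proj(x)                        # (N, C_out, T, V)
        n, c, t, v = x.shape
        x = x.view(n, self.groups, c // self.groups, t, v)
        # per-group spatial aggregation, each group using its own adjacency
        x = torch.einsum('ngctv,gvw->ngctw', x, self.adj).reshape(n, c, t, v)
        return x * self.att(x)                  # emphasize key channels


class MultiScaleTemporalConv(nn.Module):
    """Parallel temporal branches with different dilations, concatenated along channels."""
    def __init__(self, channels, kernel_size=5, dilations=(1, 2, 3, 4)):
        super().__init__()
        branch_ch = channels // len(dilations)
        self.branches = nn.ModuleList()
        for d in dilations:
            pad = (kernel_size - 1) // 2 * d    # keep temporal length unchanged
            self.branches.append(nn.Sequential(
                nn.Conv2d(channels, branch_ch, 1), nn.ReLU(inplace=True),
                nn.Conv2d(branch_ch, branch_ch, (kernel_size, 1),
                          padding=(pad, 0), dilation=(d, 1)),
            ))

    def forward(self, x):                       # x: (N, C, T, V)
        return torch.cat([b(x) for b in self.branches], dim=1)


if __name__ == "__main__":
    x = torch.randn(2, 3, 64, 25)               # batch, coords (x,y,z), frames, joints
    x = DecoupledGraphConv(3, 64, num_joints=25)(x)
    x = MultiScaleTemporalConv(64)(x)
    print(x.shape)                               # torch.Size([2, 64, 64, 25])
```

Larger dilations let a single temporal layer relate frames several steps apart, which is one plausible way to realize the "adjacent and non-adjacent time steps" modeling described above.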