Abstract: Metal defect detection is a critical component of industrial quality control and has a direct bearing on the progress of intelligent manufacturing. Existing feature fusion modules suffer from feature information loss, insufficient cross-scale interaction, and low recognition accuracy. To address these issues, a hierarchical multi-scale feature fusion (HMFF) classification model is proposed. The model combines the complementary strengths of the Swin Transformer and ConvNeXt architectures to build a feature learning network with hierarchical perception. Specifically, the Swin Transformer uses shifted-window mechanisms and multi-stage self-attention to capture global features, while ConvNeXt uses depthwise separable convolutions to extract local features precisely. To fuse global and local features efficiently, an adaptive hierarchical feature fusion layer is designed that combines channel attention, spatial attention, and multi-scale fusion strategies to integrate features across levels. In addition, a multi-layer inverted residual fusion module dynamically adjusts feature extraction so that the fused features remain precise and reliable. Experiments on the public NEU-DET and GC10-DET datasets yield accuracies of 99.6% and 96.9%, respectively. To verify generalization capability, the model is also evaluated on a self-constructed dataset, where it achieves an accuracy of 99.8%, outperforming the mainstream models ConvNeXt, Swin Transformer, VGG16, and ResNet34 by 3.4%, 2.3%, 4.3%, and 2.7%, respectively. The results indicate that the HMFF model offers improved classification accuracy and robustness in metal defect detection and provides a new methodological framework for high-precision industrial defect inspection.
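To make the idea of attention-gated global-local fusion concrete, the following is a minimal NumPy sketch, not the HMFF fusion layer itself. It assumes two feature maps of identical shape (channels, height, width), one standing in for a global branch and one for a local branch. The function names, the squeeze-and-excitation style channel gate, the parameter-free spatial gate, the reduction ratio `r`, and the random weights are all illustrative assumptions rather than details taken from the model.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat, w1, w2):
    """SE-style channel gate (assumed design): (C, H, W) -> (C, 1, 1)."""
    squeezed = feat.mean(axis=(1, 2))           # global average pool, shape (C,)
    hidden = np.maximum(w1 @ squeezed, 0.0)     # bottleneck with ReLU, shape (C // r,)
    return sigmoid(w2 @ hidden)[:, None, None]  # per-channel weights in (0, 1)

def spatial_attention(feat):
    """Parameter-free spatial gate (assumed design): (C, H, W) -> (1, H, W)."""
    pooled = feat.mean(axis=0) + feat.max(axis=0)  # channel-wise mean plus max
    return sigmoid(pooled)[None, :, :]             # per-pixel weights in (0, 1)

def fuse(global_feat, local_feat, w1, w2):
    """Gate the global branch per channel and the local branch per pixel, then sum."""
    gated_global = global_feat * channel_attention(global_feat, w1, w2)
    gated_local = local_feat * spatial_attention(local_feat)
    return gated_global + gated_local

# Hypothetical sizes and random stand-ins for learned weights and branch outputs.
rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 4, 2
global_feat = rng.standard_normal((C, H, W))  # placeholder for a global-branch feature map
local_feat = rng.standard_normal((C, H, W))   # placeholder for a local-branch feature map
w1 = rng.standard_normal((C // r, C)) * 0.1
w2 = rng.standard_normal((C, C // r)) * 0.1

fused = fuse(global_feat, local_feat, w1, w2)
print(fused.shape)  # (8, 4, 4)
```

In a trained network the weights would be learned and the two inputs would come from the respective backbone stages; the sketch only shows how channel-wise and spatial gates can reweight two branches before they are combined.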