Abstract: Existing multimodal 3D object detection methods often rely on stacking multi-scale spatial features, which entangles frequency information and limits detection accuracy. To address this issue, this paper proposes a Transformer-based method for feature frequency decoupling and fusion. First, the input images are decomposed into multiple frequency bands via the discrete wavelet transform, and separate high-frequency and low-frequency feature pyramids are constructed to capture detailed local textures and global structural semantics, respectively. An asymmetric frequency update encoder is then designed: high-frequency features, treated as the primary components, are updated through adaptive dynamic window encoding to enhance edge and texture representation, while sparse deformable attention replaces the standard attention mechanism for efficient low-frequency feature updating, enabling coordinated encoding across frequency bands. A high-frequency guided voxel fusion module is further proposed, in which multi-scale high-frequency features are projected into 3D voxel space via frustum-based mapping; combined with an adaptive radius sampling strategy, this module supplements the local structure of sparse point clouds and extracts critical voxel-level features. Finally, voxel and image features are unified in bird's-eye-view space, where a region-shift Transformer module strengthens cross-modal feature fusion through attention. The proposed method is evaluated on the nuScenes, KITTI, and Waymo datasets, achieving 73.2% mAP and 74.3% NDS on the nuScenes test set and performing strongly on small and distant objects. Moreover, real-vehicle experiments indicate that the method maintains high detection accuracy in complex, dynamic environments.
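For concreteness, the multi-frequency decomposition step described above can be sketched as follows. This is a minimal illustration only, assuming a single-level Haar wavelet per pyramid stage (the abstract does not specify the wavelet basis or pyramid depth); the function names `haar_dwt` and `build_frequency_pyramids` are hypothetical and not part of the paper.

```python
# Minimal sketch of DWT-based multi-frequency decomposition.
# Assumption: single-level Haar DWT per stage; 3 pyramid levels.
import torch
import torch.nn.functional as F

def haar_dwt(x: torch.Tensor):
    """One level of 2D Haar DWT. x: (B, C, H, W) with even H, W.
    Returns the low-frequency band (LL) and the stacked
    high-frequency bands (LH, HL, HH)."""
    b, c, h, w = x.shape
    # The four 2x2 Haar analysis filters.
    ll = torch.tensor([[0.5, 0.5], [0.5, 0.5]])
    lh = torch.tensor([[0.5, 0.5], [-0.5, -0.5]])
    hl = torch.tensor([[0.5, -0.5], [0.5, -0.5]])
    hh = torch.tensor([[0.5, -0.5], [-0.5, 0.5]])
    kernels = torch.stack([ll, lh, hl, hh]).unsqueeze(1)  # (4, 1, 2, 2)
    kernels = kernels.to(dtype=x.dtype, device=x.device)
    # Apply each filter channel-wise with stride 2.
    x = x.reshape(b * c, 1, h, w)
    bands = F.conv2d(x, kernels, stride=2)               # (B*C, 4, H/2, W/2)
    bands = bands.reshape(b, c, 4, h // 2, w // 2)
    low = bands[:, :, 0]                                  # LL band
    high = bands[:, :, 1:].reshape(b, 3 * c, h // 2, w // 2)  # LH, HL, HH
    return low, high

def build_frequency_pyramids(img: torch.Tensor, levels: int = 3):
    """Recursively decompose the LL band, yielding a low-frequency
    pyramid (global structure) and a high-frequency pyramid
    (edges and textures)."""
    low_pyr, high_pyr = [], []
    x = img
    for _ in range(levels):
        low, high = haar_dwt(x)
        low_pyr.append(low)
        high_pyr.append(high)
        x = low  # keep decomposing only the low-frequency band
    return low_pyr, high_pyr

if __name__ == "__main__":
    img = torch.randn(2, 3, 256, 256)
    low_pyr, high_pyr = build_frequency_pyramids(img)
    for lo, hi in zip(low_pyr, high_pyr):
        print(lo.shape, hi.shape)  # halved resolution at each level
```

In this sketch, the two returned pyramids would feed the separate high-frequency and low-frequency branches; the actual method presumably applies learned backbone features rather than raw pixels, which the abstract leaves unspecified.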