Feature frequency decoupling and fusion for 3D object detection with Transformer
Authors: 李明光, 陶重犇

Affiliations:

1. School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou 215009, China; 2. Suzhou Automotive Research Institute, Tsinghua University, Suzhou 215134, China


CLC number: TH741; TP391.4

Fund project: Supported by the National Natural Science Foundation of China (62472300)




    Abstract:

    Existing multimodal 3D object detection methods often rely on multi-scale spatial feature stacking, which entangles frequency information in the fused features and limits detection accuracy. To address this issue, this paper proposes a Transformer-based method for feature frequency decoupling and fusion. First, the input images are processed with a discrete wavelet transform for multi-frequency decomposition. Separate high-frequency and low-frequency feature pyramids are constructed to capture detailed local textures and global structural semantics, respectively. Then, an asymmetric frequency update encoder is designed, in which high-frequency features are treated as the primary components and updated through adaptive dynamic window encoding to enhance edge and texture representation. Meanwhile, sparse deformable attention replaces the standard attention mechanism for efficient low-frequency feature updating, enabling coordinated encoding across different frequency bands. A high-frequency guided voxel fusion module is further proposed, in which multi-scale high-frequency features are projected into 3D voxel space via frustum-based mapping. Combined with an adaptive radius sampling strategy, this module effectively supplements the local structure of sparse point clouds and extracts critical voxel-level features. Finally, voxel and image features are unified in the bird's-eye view space, and a region-shift Transformer module enhances cross-modal feature fusion using attention mechanisms. The proposed method is evaluated on the nuScenes, KITTI, and Waymo datasets. It achieves 73.2% mAP and 74.3% NDS on the nuScenes test set, with particularly strong performance on small and distant objects. Moreover, real-vehicle experiments indicate that the method maintains high detection accuracy in complex and dynamic environments.
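The abstract's first step, decoupling an image into low- and high-frequency components with a discrete wavelet transform, can be illustrated with a one-level 2D Haar DWT. This is a hedged sketch in NumPy, not the paper's implementation (which operates on learned multi-channel feature maps and feeds the sub-bands into separate feature pyramids); the function name `haar_dwt2` and the toy edge image are illustrative assumptions.

```python
import numpy as np

def haar_dwt2(x):
    """One-level 2D Haar DWT: splits a 2D map into a low-frequency
    approximation (LL) and three high-frequency detail sub-bands
    (LH, HL, HH), each at half the input resolution."""
    a = x[0::2, 0::2]  # top-left pixel of each 2x2 block
    b = x[0::2, 1::2]  # top-right
    c = x[1::2, 0::2]  # bottom-left
    d = x[1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 2.0  # low frequency: local average
    lh = (a + b - c - d) / 2.0  # high frequency: horizontal detail
    hl = (a - b + c - d) / 2.0  # high frequency: vertical detail
    hh = (a - b - c + d) / 2.0  # high frequency: diagonal detail
    return ll, lh, hl, hh

# A vertical step edge: only the high-frequency sub-band HL responds,
# and only at the discontinuity, while LL keeps the smooth structure.
img = np.zeros((8, 8))
img[:, 3:] = 1.0
ll, lh, hl, hh = haar_dwt2(img)
```

With this orthonormal scaling the transform preserves energy, so the sub-bands partition the signal cleanly; the paper's point is that edges and textures (here, the `hh`/`hl`/`lh` responses) can then be updated by a different encoder branch than the smooth `ll` content.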

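The high-frequency guided voxel fusion module projects image features into 3D voxel space via frustum-based mapping. The core geometric operation, projecting 3D points through a pinhole camera and sampling the image feature map at the resulting pixels, can be sketched as follows. The intrinsic matrix `K`, the function names, and nearest-neighbour sampling are assumptions for illustration; the paper's module additionally uses adaptive radius sampling around key points, which is not shown here.

```python
import numpy as np

def project_to_image(points_cam, K):
    """Project Nx3 points in the camera frame onto the image plane with
    pinhole intrinsics K (3x3). Returns Nx2 pixel coords and a mask of
    points that lie in front of the camera."""
    z = points_cam[:, 2]
    valid = z > 1e-6
    uvw = points_cam @ K.T                 # homogeneous image coordinates
    w = np.where(np.abs(uvw[:, 2:3]) > 1e-6, uvw[:, 2:3], 1.0)
    uv = uvw[:, :2] / w                    # perspective divide
    return uv, valid

def sample_features(feat_map, uv, valid):
    """Nearest-neighbour sampling of an HxWxC feature map at projected
    pixel locations; out-of-view or behind-camera points get zeros."""
    H, W, C = feat_map.shape
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)
    inside = valid & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    out = np.zeros((uv.shape[0], C))
    out[inside] = feat_map[v[inside], u[inside]]
    return out
```

Running each voxel center through `project_to_image` and gathering the high-frequency image features with `sample_features` attaches image detail to occupied voxels, which is how sparse point-cloud regions can be supplemented with texture and edge cues from the camera.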
Cite this article

李明光, 陶重犇. Feature frequency decoupling and fusion for 3D object detection with Transformer[J]. Chinese Journal of Scientific Instrument, 2025, 46(7): 345-357.


History
  • Online publication date: 2025-11-07