Abstract: To address the difficulty of aligning multimodal features, which arises because point clouds and images have different spatial dimensions, we propose a 3D object detection algorithm based on YOLOv8 and multimodal feature fusion. First, a YOLOv8-based data enhancement module maps the image into 3D space to generate a pseudo-cloud aligned with the point cloud, and both the point cloud and the pseudo-cloud are enhanced using YOLOv8 with frozen weights. Then, a dual-stream encoder is constructed to extract multimodal features in parallel. Finally, an attention-based RoI fusion module and an RoI gating fusion module are designed to aggregate the multimodal RoI features. On the KITTI validation set, the proposed algorithm achieves 3D average precision of 79.28%, 58.70%, and 76.04% for cars, pedestrians, and cyclists, respectively, at the hard difficulty level, improvements of 0.62%, 3.07%, and 7.54% over the existing algorithm. These results demonstrate the advantage of the proposed method, particularly for pedestrians and cyclists.
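
To make the idea of gated RoI fusion concrete, the following is a minimal sketch and not the module proposed here. It assumes a common gating form: a per-channel sigmoid gate computed from the concatenated point-cloud and pseudo-cloud RoI features, followed by a convex combination of the two. The function names and the parameters `weights` and `bias` are hypothetical, and the attention-based fusion module is not modeled.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def gated_fusion(f_point, f_pseudo, weights, bias):
    """Fuse two RoI feature vectors of equal length C with a per-channel gate.

    Assumed (hypothetical) parameters:
      weights: C rows of length 2C, applied to the concatenated features
      bias:    C values, one per output channel
    """
    if len(f_point) != len(f_pseudo):
        raise ValueError("feature vectors must have the same length")
    concat = list(f_point) + list(f_pseudo)
    fused = []
    for c, (row, b) in enumerate(zip(weights, bias)):
        # Gate in (0, 1): how much of the point-cloud feature to keep.
        gate = sigmoid(sum(w * x for w, x in zip(row, concat)) + b)
        # Convex combination of the two modalities for channel c.
        fused.append(gate * f_point[c] + (1.0 - gate) * f_pseudo[c])
    return fused


if __name__ == "__main__":
    point_feat = [1.0, 2.0]
    pseudo_feat = [3.0, 4.0]
    zero_weights = [[0.0] * 4, [0.0] * 4]
    # With zero weights and zero bias the gate is 0.5, so the result
    # is the element-wise mean of the two feature vectors.
    print(gated_fusion(point_feat, pseudo_feat, zero_weights, [0.0, 0.0]))
```

In this sketch a gate near 1 keeps the point-cloud feature and a gate near 0 keeps the pseudo-cloud feature, so the fusion can favor whichever modality is more reliable for a given RoI and channel. In a trained network the gate parameters would be learned rather than set by hand.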