Abstract: Point clouds and images have different spatial dimensions, which makes multimodal features difficult to align. To address this problem, we propose a 3D object detection algorithm that combines YOLOv8 with multimodal feature fusion. First, a YOLOv8-based data enhancement module maps the image into 3D space to generate a pseudo point cloud aligned with the LiDAR point cloud, and both the point cloud and the pseudo point cloud are enhanced using YOLOv8 with frozen weights. Then, a dual-stream encoder is constructed to extract the multimodal features in parallel. Finally, an attention-based RoI fusion module and an RoI gating fusion module are designed to aggregate the multimodal RoI features. On the KITTI validation set, the proposed algorithm achieves 3D average precision (AP) of 79.28%, 58.70%, and 76.04% for cars, pedestrians, and cyclists at the hard difficulty level, improvements of 0.62, 3.07, and 7.54 percentage points over the existing algorithm. These results demonstrate the advantage of the proposed method, particularly for pedestrians and cyclists.
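To make the idea of gated RoI fusion concrete, the following is a minimal sketch of one common form of gating, not the module used in this work. It assumes that each RoI yields one feature vector per stream (point cloud and pseudo point cloud) and that a sigmoid gate, computed from the concatenated features, weights the two streams channel by channel. The names `gated_fusion`, `gate_weights`, and `gate_bias` are hypothetical and stand in for learned parameters.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(point_feat, pseudo_feat, gate_weights, gate_bias):
    """Fuse two length-C RoI feature vectors with a per-channel gate.

    gate_weights is a C x 2C matrix and gate_bias a length-C vector;
    both are hypothetical stand-ins for learned parameters.
    """
    joint = point_feat + pseudo_feat  # list concatenation, length 2C
    fused = []
    for c in range(len(point_feat)):
        # Gate logit for channel c, computed from both streams.
        logit = sum(w * x for w, x in zip(gate_weights[c], joint)) + gate_bias[c]
        g = sigmoid(logit)
        # Convex combination: g weights the point stream, 1 - g the pseudo stream.
        fused.append(g * point_feat[c] + (1.0 - g) * pseudo_feat[c])
    return fused


# Toy RoI features (illustrative values only, C = 3).
point_feat = [0.2, -1.0, 0.5]
pseudo_feat = [1.0, 0.4, -0.3]

# With all-zero gate parameters every gate equals 0.5,
# so the fused vector is the channel-wise average of the two streams.
zero_weights = [[0.0] * 6 for _ in range(3)]
zero_bias = [0.0] * 3
fused = gated_fusion(point_feat, pseudo_feat, zero_weights, zero_bias)
```

Because the gate lies strictly between 0 and 1, each fused channel stays between the corresponding values of the two streams. In a trained network the gate parameters would let the model lean on whichever modality is more reliable for a given RoI.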