Abstract: Monocular visual-inertial SLAM (simultaneous localization and mapping) systems recover camera poses by tracking hand-crafted point features such as Shi-Tomasi and FAST. However, the robustness of hand-crafted features is limited in challenging scenes with severe illumination or viewpoint changes, which can degrade localization accuracy. Inspired by the excellent feature-extraction performance of the SuperPoint network, we propose a monocular VINS (CNN-VINS) that is built on this self-supervised network and works robustly in challenging scenes. Our main contributions are threefold. First, we propose an improved SuperPoint-based feature extraction network in which a dynamic detection-threshold adjustment algorithm detects and describes feature points uniformly across the image, establishing accurate feature correspondences. Second, the improved SuperPoint network is efficiently integrated into a complete monocular visual-inertial SLAM system, including the nonlinear optimization and loop detection modules. Third, to evaluate the effect of the feature extraction network's encoder layer on the localization accuracy of the VINS system, we learn and optimize the network's intermediate shared encoder layer and loss function. Experimental results on the public EuRoC benchmark show that the localization accuracy of our method is about 15% higher than that of VINS-Mono in challenging scenes. In scenes with simple illumination changes, the mean absolute trajectory error lies between 0.067 and 0.069 m.
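As an illustration of the dynamic detection-threshold adjustment mentioned above, the following Python sketch adapts a per-cell threshold over a SuperPoint-style keypoint heatmap so that detections spread uniformly across the image rather than clustering in high-texture regions. The function names, the 8x8 grid, and the multiplicative update rule are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def cell_keypoints(cell, target=4, t=0.015, gain=1.5, iters=8):
    """Adjust the detection threshold of one grid cell until it yields
    roughly `target` keypoints (bounded number of iterations).
    Hypothetical update rule for illustration only."""
    for _ in range(iters):
        n = int((cell >= t).sum())
        if n < target:
            t /= gain            # too few responses: lower the threshold
        elif n > 2 * target:
            t *= gain            # over-crowded cell: raise the threshold
        else:
            break
    return np.nonzero(cell >= t)

def detect_uniform(heatmap, grid=(8, 8), target_per_cell=4):
    """Split a SuperPoint-style keypoint heatmap into grid cells and run
    the per-cell threshold adjustment, so keypoints cover the image
    uniformly. Returns (x, y, score) tuples in image coordinates."""
    H, W = heatmap.shape
    ch, cw = H // grid[0], W // grid[1]
    keypoints = []
    for gy in range(grid[0]):
        for gx in range(grid[1]):
            cell = heatmap[gy * ch:(gy + 1) * ch, gx * cw:(gx + 1) * cw]
            ys, xs = cell_keypoints(cell, target_per_cell)
            keypoints += [(gx * cw + x, gy * ch + y, cell[y, x])
                          for y, x in zip(ys, xs)]
    return keypoints

# Sparse random scores stand in for the output of the SuperPoint detector head.
rng = np.random.default_rng(0)
scores = rng.random((480, 640))
heatmap = np.where(rng.random((480, 640)) < 0.002, scores, 0.0)
print(f"{len(detect_uniform(heatmap))} keypoints spread over an 8x8 grid")
```

Selecting keypoints per cell in this way trades a small amount of detector response for spatial coverage, which generally benefits pose estimation stability in feature-based SLAM.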