Abstract: In collaborative classification of hyperspectral images and LiDAR data, CNNs and Transformers have shown strong abilities to capture local features and global dependencies, respectively, yet their collaborative mechanisms remain underexplored and the potential of cross-modal feature complementarity has not been fully exploited. This article therefore proposes a multimodal collaborative land-cover classification method for remote sensing data that combines a CNN with a Transformer for hyperspectral images and LiDAR data. First, the model reduces the dimensionality of the hyperspectral image via principal component analysis to remove redundant spectral information. Next, CNN layers capture local texture features, while the Transformer self-attention mechanism constructs a global spectral-spatial representation. A bidirectional feature interaction mechanism then injects the global contextual information from the Transformer into the CNN feature channels and feeds the local details extracted by the CNN back into the Transformer branch; a feature coupling unit performs cross-scale feature alignment, strengthening the model's joint extraction of the global structure and local details of hyperspectral images. For LiDAR data, a dynamic convolution cascade module captures elevation information and contextual relationships. Finally, a cross-modal feature fusion module enables deep interaction and fusion of the dual-source features, exploiting the complementary semantics of the two modalities to improve classification accuracy for complex land covers. Experiments on three publicly available datasets (Houston 2013, Trento, and Augsburg) show that the proposed method achieves overall accuracies of 99.85%, 99.68%, and 97.34%, and average accuracies of 99.87%, 99.34%, and 90.60%, respectively. These gains over mainstream methods such as GLT and HCT demonstrate the advantages and effectiveness of the proposed method for multimodal collaborative classification.
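To make the described pipeline concrete, the sketch below shows how a bidirectional CNN-Transformer interaction and a cross-modal fusion step could be wired up in PyTorch. This is a minimal illustration under assumed design choices, not the authors' implementation: the names `FeatureCouplingUnit` and `HSILiDARClassifier`, all layer sizes, the use of cross-attention for fusion, and the plain convolutions standing in for the dynamic convolution cascade are illustrative assumptions.

```python
# Minimal sketch of bidirectional CNN-Transformer coupling and cross-modal fusion.
# All module names, dimensions, and the cross-attention fusion are assumptions.
import torch
import torch.nn as nn


class FeatureCouplingUnit(nn.Module):
    """Exchanges information between a CNN feature map and Transformer tokens."""

    def __init__(self, channels: int, embed_dim: int):
        super().__init__()
        self.cnn_to_tok = nn.Linear(channels, embed_dim)   # local details -> Transformer branch
        self.tok_to_cnn = nn.Linear(embed_dim, channels)   # global context -> CNN channels

    def forward(self, cnn_feat: torch.Tensor, tokens: torch.Tensor):
        # cnn_feat: (B, C, H, W); tokens: (B, N, D) with N == H * W (assumed)
        b, c, h, w = cnn_feat.shape
        flat = cnn_feat.flatten(2).transpose(1, 2)                    # (B, N, C)
        tokens = tokens + self.cnn_to_tok(flat)                       # feed local details to tokens
        ctx = self.tok_to_cnn(tokens).transpose(1, 2).reshape(b, c, h, w)
        cnn_feat = cnn_feat + ctx                                     # inject global context into CNN
        return cnn_feat, tokens


class HSILiDARClassifier(nn.Module):
    def __init__(self, hsi_bands=30, lidar_bands=1, embed_dim=64, num_classes=15):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(hsi_bands, embed_dim, 3, padding=1),
            nn.BatchNorm2d(embed_dim), nn.ReLU())
        self.transformer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=4, batch_first=True)
        self.couple = FeatureCouplingUnit(embed_dim, embed_dim)
        # Plain convolutions stand in here for the dynamic convolution cascade on LiDAR.
        self.lidar = nn.Sequential(
            nn.Conv2d(lidar_bands, embed_dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(embed_dim, embed_dim, 3, padding=1), nn.ReLU())
        # Cross-modal fusion via cross-attention (HSI tokens as queries, LiDAR as keys/values).
        self.fusion = nn.MultiheadAttention(embed_dim, num_heads=4, batch_first=True)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, hsi: torch.Tensor, lidar: torch.Tensor):
        # hsi: (B, bands_after_PCA, H, W); lidar: (B, 1, H, W)
        f = self.cnn(hsi)                                   # local texture features
        tok = f.flatten(2).transpose(1, 2)                  # (B, N, D) tokens
        tok = self.transformer(tok)                         # global spectral-spatial context
        f, tok = self.couple(f, tok)                        # bidirectional interaction
        lid = self.lidar(lidar).flatten(2).transpose(1, 2)  # (B, N, D) elevation features
        fused, _ = self.fusion(tok, lid, lid)               # cross-modal feature fusion
        return self.head(fused.mean(dim=1))                 # patch-level class logits


if __name__ == "__main__":
    model = HSILiDARClassifier()
    logits = model(torch.randn(2, 30, 11, 11), torch.randn(2, 1, 11, 11))
    print(logits.shape)  # torch.Size([2, 15])
```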