Abstract: Cross-modal emotion recognition aims to perceive human emotions from data in different modalities. Most current research still focuses on a single modality, neglecting the complementary information carried by others. This paper proposes a cross-modal emotion recognition method based on knowledge distillation, which significantly improves recognition accuracy by integrating information from the speech and text modalities. Specifically, the proposed method uses a pre-trained text model, RoBERTa, as the teacher and transfers its high-quality textual emotional representations to a lightweight speech student model through feature distillation. In addition, a bidirectional objective distillation is employed, enabling the teacher and student models to transfer knowledge to each other. Experimental results show that the proposed method achieves superior performance on the IEMOCAP and MELD datasets.
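To make the two distillation objectives named in the abstract concrete, the following is a minimal PyTorch sketch of what such losses commonly look like. It assumes an MSE feature-matching objective for the feature distillation and a symmetric KL divergence for the bidirectional objective distillation; the names (`distillation_loss`, `bidirectional_kd`, `lambda_fd`, `tau`) and the exact loss forms are illustrative assumptions, not the paper's verified formulation.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_feat, teacher_feat, student_logits, labels,
                      lambda_fd=1.0):
    # Task loss on the speech student's emotion predictions.
    ce = F.cross_entropy(student_logits, labels)
    # Feature distillation (assumed MSE form): pull speech features
    # toward the textual representations from the RoBERTa teacher.
    fd = F.mse_loss(student_feat, teacher_feat.detach())
    return ce + lambda_fd * fd

def bidirectional_kd(student_logits, teacher_logits, tau=2.0):
    # Assumed symmetric KL between temperature-softened predictions,
    # so teacher and student exchange knowledge in both directions.
    log_p_s = F.log_softmax(student_logits / tau, dim=-1)
    log_p_t = F.log_softmax(teacher_logits / tau, dim=-1)
    kl_st = F.kl_div(log_p_s, log_p_t.exp(), reduction="batchmean")
    kl_ts = F.kl_div(log_p_t, log_p_s.exp(), reduction="batchmean")
    return (tau ** 2) * (kl_st + kl_ts) / 2
```

In a typical setup the teacher's feature extractor would be frozen (hence the `detach()`), while the bidirectional term lets gradients update both branches; whether the paper freezes the teacher is not stated in the abstract.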