Compared to text-based approaches, hate speech detection (HSD) in videos is still poorly studied. Current multi-modal systems frequently miss important cues such as on-screen text and audio, and fail to capture inter-modal relationships. We present MM-HSD, a multi-modal model that uses Cross-Modal Attention (CMA) as an early feature extractor to combine speech transcripts, video frames, audio, and on-screen text. For the first time, we systematically examine query/key configurations in CMA for HSD and find that on-screen text works best as the query. Through effective modality fusion, experiments on the HateMM dataset show that MM-HSD outperforms previous approaches and reaches state-of-the-art performance with an M-F1 score of 0.874.

Paper: https://dl.acm.org/doi/10.1145/3746027.3754558
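To make the query/key idea concrete, below is a minimal sketch of cross-modal attention in which on-screen text embeddings serve as the query and the remaining modalities act as keys/values. All dimensions, module names, and the concatenation-based fusion are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch (not the paper's code): on-screen text features as the
# query of a cross-modal attention block; transcript, frame, and audio
# features are concatenated and used as keys/values.
import torch
import torch.nn as nn


class CrossModalAttentionFusion(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(
        self,
        ocr_feats: torch.Tensor,         # (B, T_ocr, dim) on-screen text -> query
        transcript_feats: torch.Tensor,  # (B, T_txt, dim) speech transcript
        frame_feats: torch.Tensor,       # (B, T_img, dim) video frames
        audio_feats: torch.Tensor,       # (B, T_aud, dim) audio
    ) -> torch.Tensor:
        # Concatenate the other modalities along the sequence axis and attend
        # over them with the on-screen-text queries.
        context = torch.cat([transcript_feats, frame_feats, audio_feats], dim=1)
        fused, _ = self.attn(query=ocr_feats, key=context, value=context)
        return fused  # (B, T_ocr, dim), passed to a downstream classifier


if __name__ == "__main__":
    B, dim = 2, 512
    fusion = CrossModalAttentionFusion(dim)
    out = fusion(
        torch.randn(B, 10, dim),
        torch.randn(B, 32, dim),
        torch.randn(B, 16, dim),
        torch.randn(B, 20, dim),
    )
    print(out.shape)  # torch.Size([2, 10, 512])
```

Swapping which modality supplies the query versus the keys/values is the kind of configuration choice the paper compares; the fused output would then feed the classification head.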