Compared to text-based approaches, hate speech detection (HSD) in videos remains under-explored. Current multi-modal systems frequently overlook important cues such as on-screen text and audio, and fail to capture inter-modal relationships. We present MM-HSD, a multi-modal model that uses Cross-Modal Attention (CMA) as an early feature extractor to combine speech transcripts, video frames, audio, and on-screen text. For the first time, we systematically examine query/key configurations in CMA for HSD and find that on-screen text works best as the query. Through efficient modality fusion, MM-HSD outperforms previous approaches on the HateMM dataset and reaches state-of-the-art performance with an M-F1 score of 0.874.

https://dl.acm.org/doi/10.1145/3746027.3754558
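To make the described query/key setup more concrete, here is a minimal PyTorch sketch of a cross-modal attention block in which on-screen text embeddings act as the query and the remaining modalities (speech transcript, audio, video frames) supply keys and values. The module name, embedding dimensions, and the concatenation-based context are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Sketch of cross-modal attention: one modality supplies the queries,
    the other modalities (concatenated) supply keys and values."""
    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_feats, context_feats):
        # query_feats:   (B, Tq, dim) -- e.g. on-screen text embeddings
        # context_feats: (B, Tk, dim) -- e.g. transcript + audio + frame embeddings
        attended, _ = self.attn(query_feats, context_feats, context_feats)
        return self.norm(query_feats + attended)  # residual + layer norm

# Toy usage (all shapes/features are hypothetical placeholders):
B, dim = 2, 256
on_screen_text = torch.randn(B, 10, dim)   # OCR-token embeddings (query)
transcript     = torch.randn(B, 20, dim)   # speech-transcript embeddings
audio          = torch.randn(B, 15, dim)   # audio embeddings
frames         = torch.randn(B, 8, dim)    # video-frame embeddings

context = torch.cat([transcript, audio, frames], dim=1)
fused = CrossModalAttention(dim)(on_screen_text, context)  # (B, 10, dim)
print(fused.shape)
```

The fused features would then feed a downstream fusion/classification head; the paper reports that this query choice (on-screen text) gave the best HSD results among the configurations tested.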
