Social media lets hateful content spread across several modalities, including text, audio, and visual, so effective detection is essential. It remains unclear how well current methods perform across different combinations of modalities. This paper offers a systematic examination of fusion-based techniques for multimodal hate detection, with an emphasis on their performance on image and video content. The evaluation identifies important modality-specific limitations: simple embedding fusion achieves state-of-the-art performance on video content (HateMM dataset), with a 9.9 percentage point improvement in F1-score, but struggles with the complex image-text relationships in memes (Hateful Memes dataset). Through detailed ablation studies and error analysis, the authors show that existing fusion techniques fail to capture nuanced cross-modal interactions. The results underline the need for modality-specific design and offer useful insights for building more robust hate detection systems.

https://arxiv.org/abs/2502.07138
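
For readers unfamiliar with the term, "simple embedding fusion" generally means combining one embedding per modality into a single vector before classification. The sketch below is an illustration only, not the paper's implementation: it assumes concatenation as the fusion step and a logistic-regression head as the classifier, and the function names, embedding sizes, and random weights are all hypothetical.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length so no modality dominates by magnitude."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse_embeddings(*modality_embeddings):
    """Concatenate the normalised per-modality embeddings into one vector."""
    fused = []
    for emb in modality_embeddings:
        fused.extend(l2_normalize(emb))
    return fused


def predict_probability(fused, weights, bias=0.0):
    """Hypothetical logistic-regression head over the fused vector."""
    logit = sum(w * x for w, x in zip(weights, fused)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


# Stand-in embeddings; a real system would take these from pretrained
# text, audio, and visual encoders (sizes here are arbitrary).
rng = random.Random(0)
text_emb = [rng.gauss(0, 1) for _ in range(8)]
audio_emb = [rng.gauss(0, 1) for _ in range(4)]
visual_emb = [rng.gauss(0, 1) for _ in range(6)]

fused = fuse_embeddings(text_emb, audio_emb, visual_emb)
weights = [rng.gauss(0, 1) for _ in range(len(fused))]  # untrained, illustrative
prob = predict_probability(fused, weights)
print(len(fused), prob)
```

Because each modality is encoded independently and only joined at the end, a scheme like this has no mechanism for modelling how an image changes the meaning of its caption, which is consistent with the limitation on memes described above.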