The increasing use of text, graphics, audio, and video in hate speech makes detection more difficult. To analyze such heterogeneous data, this study presents a Multi-modal Hate Speech Detection Framework (MHSDF) that combines Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). Long Short-Term Memory (LSTM) networks handle sequential dependencies in text and audio, while CNNs extract spatial features such as text patterns and visual cues. Semantic understanding is improved by rich embeddings such as Word2Vec and BERT. By focusing on cross-modal interactions, attention mechanisms make it possible to identify hate speech across a variety of media, including sarcastic videos and harmful memes. The framework enhances interpretability, traceability, and transparency while achieving high accuracy (98.53%) and robustness (97.64%).

Paper: https://www.nature.com/articles/s41598-025-94069-z
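To make the fusion idea concrete, here is a minimal sketch of a CNN-plus-LSTM model with cross-modal attention, in the spirit of the framework described above. It is not the paper's exact architecture: the layer sizes, vocabulary size, pooling choices, and classifier head are all assumptions for illustration.

```python
# Minimal sketch: an LSTM branch for token sequences, a CNN branch for images,
# and a cross-modal attention layer that lets text features attend to image
# regions. All hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn


class MultiModalHateSpeechClassifier(nn.Module):
    def __init__(self, vocab_size=30000, embed_dim=128, hidden_dim=128):
        super().__init__()
        # Text branch: learned embeddings followed by an LSTM over the sequence.
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Image branch: a small CNN producing a grid of spatial features.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, hidden_dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),  # 16 spatial positions per image
        )
        # Cross-modal attention: text states (queries) attend to image regions.
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads=4,
                                                batch_first=True)
        self.classifier = nn.Linear(2 * hidden_dim, 2)  # hate / not hate

    def forward(self, token_ids, images):
        # token_ids: (batch, seq_len) integer tokens; images: (batch, 3, H, W)
        text_states, _ = self.lstm(self.embedding(token_ids))   # (B, T, H)
        img_grid = self.cnn(images)                              # (B, H, 4, 4)
        img_tokens = img_grid.flatten(2).transpose(1, 2)         # (B, 16, H)
        attended, _ = self.cross_attn(text_states, img_tokens, img_tokens)
        # Pool each modality-aware representation and classify.
        fused = torch.cat([text_states.mean(dim=1),
                           attended.mean(dim=1)], dim=-1)
        return self.classifier(fused)


if __name__ == "__main__":
    model = MultiModalHateSpeechClassifier()
    tokens = torch.randint(0, 30000, (2, 20))   # toy batch of 2 captions
    images = torch.rand(2, 3, 64, 64)           # toy batch of 2 meme images
    print(model(tokens, images).shape)          # torch.Size([2, 2])
```

In a real system, the toy embedding layer would typically be replaced by pretrained Word2Vec or BERT representations, and an audio branch would be added alongside the text and image branches, as the paper's framework suggests.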