Because social media platforms are so widely used, automatically recognizing and flagging hate speech has become a necessity. Many hate speech detection techniques exist, but most of them are "black-box" methods that are not designed to be interpreted or explained. To address this lack of interpretability, the authors propose using state-of-the-art Large Language Models (LLMs) to extract features from the input text in the form of rationales, and then training a base hate speech classifier on those rationales. Because the classifier relies on the extracted rationales, the approach is faithfully interpretable by design. The methodology combines the textual understanding of LLMs with the discriminative power of state-of-the-art hate speech classifiers. The authors' evaluation on a range of hate speech datasets from social media platforms shows (1) how good the LLM-extracted rationales are, and (2) that detector performance is retained surprisingly well even after training for interpretability.

https://arxiv.org/abs/2403.12403
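The two-stage idea described above (extract rationales first, then train a classifier that only sees those rationales) can be illustrated with a small toy sketch. Everything in it is an assumption made for illustration and is not the authors' implementation: `extract_rationales` is a hypothetical keyword lookup standing in for an LLM call, `CUE_WORDS` and `TOY_DATA` are invented, and a simple perceptron stands in for the base hate speech classifier.

```python
from collections import Counter

# Hypothetical stand-in for the LLM: a real system would prompt an LLM
# to return the spans of a post that justify a hate / non-hate decision.
CUE_WORDS = {"vermin", "subhuman", "disgusting"}


def extract_rationales(text):
    """Return the tokens treated as rationales (toy keyword lookup)."""
    return [tok for tok in text.lower().split() if tok in CUE_WORDS]


def score(weights, bias, rationales):
    """Linear score computed from rationale tokens only."""
    counts = Counter(rationales)
    return bias + sum(weights.get(tok, 0) * n for tok, n in counts.items())


def train(examples, epochs=10):
    """Train a perceptron whose only inputs are the extracted rationales."""
    weights, bias = {}, 0
    for _ in range(epochs):
        for text, label in examples:
            rationales = extract_rationales(text)
            target = 1 if label == 1 else -1
            predicted = 1 if score(weights, bias, rationales) > 0 else -1
            if predicted != target:
                # Standard perceptron update on a mistake.
                for tok, n in Counter(rationales).items():
                    weights[tok] = weights.get(tok, 0) + target * n
                bias += target
    return weights, bias


def predict(model, text):
    """Return (label, rationales); the rationales are the explanation."""
    weights, bias = model
    rationales = extract_rationales(text)
    label = 1 if score(weights, bias, rationales) > 0 else 0
    return label, rationales


# Invented toy data: 1 = hate speech, 0 = not hate speech.
TOY_DATA = [
    ("those people are vermin", 1),
    ("the weather is lovely today", 0),
    ("they are subhuman and disgusting", 1),
    ("i love this park", 0),
]

model = train(TOY_DATA)
```

Calling `predict(model, "you are vermin")` returns the predicted label together with the rationale tokens that produced it. Since the classifier never sees anything other than the rationales, the explanation it reports is exactly the evidence it used, which is the sense in which such a design is interpretable by construction.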