Because social media platforms are so widely used, hate speech on them must be recognized and flagged automatically. Although a number of hate speech detection techniques exist, most are “black-box” methods that are not interpretable or explainable by design. To address this lack of interpretability, we propose using state-of-the-art Large Language Models (LLMs) to extract features from the input text in the form of rationales, and training a base hate speech classifier on these features, which enables faithful interpretability by design. Our methodology combines the textual understanding capabilities of LLMs with the discriminative power of state-of-the-art hate speech classifiers to make these classifiers faithfully interpretable. Our comprehensive evaluation on a range of hate speech datasets from social media platforms demonstrates (1) the quality of the LLM-extracted rationales, and (2) a surprising retention of detector performance even after training to ensure interpretability.

