Towards Interpretable Hate Speech Detection using Large Language Model-extracted Rationales (arXiv)

Mar 23, 2024 #Algorithms, #Policies

Because social media platforms are so widely used, automatically recognizing and flagging hate speech has become necessary. Many hate speech detection techniques exist, but most are “black-box” models that were not designed to be interpreted or explained. In this research, the authors address this lack of interpretability by training a base hate speech classifier on features that state-of-the-art Large Language Models (LLMs) extract from the input text in the form of rationales. This makes the classifier faithfully interpretable by design. The framework combines the textual understanding of LLMs with the discriminative power of state-of-the-art hate speech classifiers. The authors' evaluation on a range of hate speech datasets from social media platforms shows (1) how good the LLM-extracted rationales are, and (2) that detector performance is retained surprisingly well even after training to ensure interpretability.
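The two-stage idea described above (extract rationales, then train the classifier on them) can be illustrated with a minimal sketch. This is not the authors' implementation: `extract_rationale` is a hypothetical keyword lookup standing in for the LLM prompt, `PLACEHOLDER_CUES` and the toy training texts are invented placeholders, and a small Naive Bayes model is swapped in for the paper's base classifier.

```python
import math
from collections import Counter

# Hypothetical placeholder tokens; a real system would have no fixed cue list.
PLACEHOLDER_CUES = {"<slur>", "<threat>", "<dehumanizing-term>"}

def extract_rationale(text):
    """Assumption: stands in for an LLM prompt that returns the words
    justifying a label. Here it is a keyword lookup so the sketch runs offline."""
    return [tok for tok in text.lower().split() if tok in PLACEHOLDER_CUES]

class RationaleClassifier:
    """Naive Bayes that sees only rationale tokens, never the full text."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.token_counts = {0: Counter(), 1: Counter()}
        for text, label in zip(texts, labels):
            self.token_counts[label].update(extract_rationale(text))
        self.vocab = set(self.token_counts[0]) | set(self.token_counts[1])
        return self

    def predict(self, text):
        rationale = extract_rationale(text)
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label in (0, 1):
            # Log prior plus Laplace-smoothed log likelihood of each rationale token.
            score = math.log(self.class_counts[label] / total_docs)
            denom = sum(self.token_counts[label].values()) + len(self.vocab) + 1
            for tok in rationale:
                score += math.log((self.token_counts[label][tok] + 1) / denom)
            scores[label] = score
        # Ties resolve to label 0 because max() returns the first maximal key.
        label = max(scores, key=scores.get)
        # The rationale is returned with the label: it is exactly what the model used.
        return label, rationale

# Toy data with placeholder tokens; label 1 = hateful, 0 = not hateful.
train_texts = ["you are a <slur>", "nice weather today",
               "a <threat> against them", "see you at lunch"]
train_labels = [1, 0, 1, 0]
clf = RationaleClassifier().fit(train_texts, train_labels)
```

Because the classifier only ever sees the rationale, the explanation returned by `clf.predict(...)` is the model's entire input rather than a post-hoc approximation, which is the sense of "interpretable by design" in this sketch.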

https://arxiv.org/abs/2403.12403


