Towards Interpretable Hate Speech Detection using Large Language Model-extracted Rationales (arXiv)

Mar 23, 2024 #Algorithms, #Policies

Because social media platforms are so widely used, automatically recognizing and flagging hate speech has become a necessity. A number of hate speech detection techniques exist, but most are “black-box” methods that are not interpretable or explainable by design. In this research, the authors address that gap by using state-of-the-art Large Language Models (LLMs) to extract features from the input text in the form of rationales, which are then used to train a base hate speech classifier. This makes the resulting classifier faithfully interpretable by design. The framework combines the textual understanding of LLMs with the discriminative power of state-of-the-art hate speech classifiers. The authors' evaluation on a range of hate speech datasets from social media platforms shows (1) how good the LLM-extracted rationales are, and (2) that detector performance is retained surprisingly well even after training for interpretability.
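
The two-stage idea described above (extract rationales first, then train the classifier on them) can be illustrated with a toy sketch. This is not the authors' implementation. Everything named here is a hypothetical stand-in: `extract_rationale` replaces the LLM call with a small keyword list so the example runs offline, and `RationaleNaiveBayes` is a simple naive Bayes model chosen for brevity rather than the classifier used in the paper. The sketch also assumes the classifier sees only the rationale, which is one simple way to make the rationale a faithful account of what drove the prediction.

```python
import math
from collections import Counter, defaultdict

# Hypothetical stand-in for the LLM: a real system would prompt a model to
# quote the words that support its judgement. A keyword list plays that role.
CUE_WORDS = {"vermin", "subhuman", "scum"}
NO_RATIONALE = "<none>"  # marker used when nothing is flagged


def extract_rationale(text):
    """Return the flagged tokens of `text`, or a marker if there are none."""
    tokens = [t.strip(".,!?") for t in text.lower().split()]
    hits = [t for t in tokens if t in CUE_WORDS]
    return hits or [NO_RATIONALE]


class RationaleNaiveBayes:
    """Multinomial naive Bayes trained on rationale tokens only (assumption)."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.token_counts[label].update(extract_rationale(text))
        self.vocab = {t for c in self.token_counts.values() for t in c}
        return self

    def predict(self, text):
        """Return (label, rationale); the rationale is the model's only input."""
        rationale = extract_rationale(text)
        total = sum(self.class_counts.values())
        scores = {}
        for label, n in self.class_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + len(self.vocab)  # Laplace smoothing
            score = math.log(n / total)
            for tok in rationale:
                score += math.log((counts[tok] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get), rationale


# Tiny made-up training set, for illustration only.
TRAIN_TEXTS = [
    "those people are vermin",
    "they are subhuman scum",
    "lovely weather in the park today",
    "the match last night was great",
]
TRAIN_LABELS = ["hate", "hate", "neutral", "neutral"]

clf = RationaleNaiveBayes().fit(TRAIN_TEXTS, TRAIN_LABELS)
print(clf.predict("what vermin!"))
print(clf.predict("see you at lunch"))
```

Because the prediction is computed from the rationale alone, returning the rationale next to the label shows exactly which evidence the classifier used. A realistic system would replace the keyword list with an actual LLM prompt and the naive Bayes model with a stronger classifier.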

https://arxiv.org/abs/2403.12403

