SMARTER: A Data-efficient Framework to Improve Toxicity Detection with Explanation via Self-augmenting Large Language Models (arXiv)

Sep 23, 2025 #Algorithms, #Policies

The researchers propose SMARTER, a two-stage, data-efficient framework for explainable content moderation with Large Language Models (LLMs). In Stage 1, the LLM's own outputs are used to generate synthetic explanations for both correct and incorrect labels, which enables alignment through preference optimization with minimal human supervision. In Stage 2, cross-model training refines explanation quality, allowing weaker models to match stronger ones stylistically and semantically. Experiments on three benchmark tasks (HateXplain, Latent Hate, and Implicit Hate) show that SMARTER lets LLMs achieve up to a 13.5% macro-F1 improvement over standard few-shot baselines while using only a fraction of the full training set. By drawing on LLMs' ability to improve their own classification and explanation, the framework offers a scalable approach for low-resource settings.
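To make the Stage 1 idea concrete, the following is a minimal sketch of how self-generated explanations might be turned into preference pairs. It is not the authors' implementation: the prompt wording, the pairing scheme (one pair per incorrect label), the label names, and the names `PreferencePair`, `build_preference_pairs`, and `stub_explain` are all assumptions for illustration. In practice `explain` would call the LLM being trained; here it is a placeholder so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class PreferencePair:
    """One training record for preference optimization (hypothetical schema)."""
    prompt: str
    chosen: str    # explanation that argues for the gold label
    rejected: str  # explanation that argues for an incorrect label


def build_preference_pairs(
    examples: Sequence[Tuple[str, str]],   # (post text, gold label)
    labels: Sequence[str],                 # full label set for the task
    explain: Callable[[str, str], str],    # (text, label) -> explanation
) -> List[PreferencePair]:
    """Pair the explanation for the gold label with one for each wrong label."""
    pairs: List[PreferencePair] = []
    for text, gold in examples:
        # Assumed prompt format; the paper's actual prompt is not given here.
        prompt = f"Classify the post and explain your decision: {text}"
        chosen = explain(text, gold)
        for wrong in labels:
            if wrong == gold:
                continue
            pairs.append(PreferencePair(prompt, chosen, explain(text, wrong)))
    return pairs


def stub_explain(text: str, label: str) -> str:
    """Placeholder for an LLM call that writes a rationale for `label`."""
    return f"Label: {label}. Reason: placeholder rationale for '{text}'."


if __name__ == "__main__":
    # Illustrative label set and example; not taken from the paper's data.
    demo_labels = ["hate", "offensive", "normal"]
    demo = build_preference_pairs([("example post", "hate")], demo_labels, stub_explain)
    for pair in demo:
        print(pair.chosen, "|", pair.rejected)
```

The resulting records could then be passed to a preference-optimization trainer, so that the model learns to prefer explanations consistent with the correct label over those that rationalize an incorrect one.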

https://arxiv.org/abs/2509.15174
