The researchers present SMARTER, a data-efficient, two-stage framework for explainable content moderation with Large Language Models (LLMs). In Stage 1, the LLM's own outputs are used to generate synthetic explanations for both correct and incorrect labels, which makes alignment through preference optimization possible with minimal human supervision. In Stage 2, cross-model training refines explanation quality, allowing weaker models to match stronger ones stylistically and semantically. Experiments on three benchmark tasks (HateXplain, Latent Hate, and Implicit Hate) show that SMARTER lets LLMs achieve up to a 13.5% macro-F1 improvement over standard few-shot baselines while using only a fraction of the full training set. By exploiting the ability of LLMs to improve their own classification and explanation, the framework offers a scalable approach for low-resource settings.
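To make Stage 1 more concrete, here is a minimal sketch of how preference pairs might be assembled from synthetic explanations. It is not the authors' implementation: the prompt wording, the label set, the `PreferencePair` structure, and the `generate_explanation` stub (a stand-in for a real LLM call) are all assumptions made for illustration. The idea is simply that an explanation written for the correct label becomes the preferred response, and an explanation written for an incorrect label becomes the rejected one.

```python
from dataclasses import dataclass


@dataclass
class PreferencePair:
    prompt: str    # instruction plus the post to moderate
    chosen: str    # explanation generated for the correct label
    rejected: str  # explanation generated for an incorrect label


def generate_explanation(text: str, label: str) -> str:
    """Hypothetical stand-in for an LLM call that justifies `label` for `text`."""
    return f"Label: {label}. Explanation: the post reads as '{label}' because ..."


def build_preference_pairs(examples, labels, explain=generate_explanation):
    """Create one pair per (example, incorrect label) combination.

    `examples` is a list of (text, gold_label) tuples; `labels` is the full
    label set. Both are assumed inputs, not taken from the paper.
    """
    pairs = []
    for text, gold in examples:
        prompt = f"Classify the post and explain your decision.\nPost: {text}"
        chosen = explain(text, gold)
        for wrong in labels:
            if wrong == gold:
                continue  # only incorrect labels produce rejected responses
            pairs.append(PreferencePair(prompt, chosen, explain(text, wrong)))
    return pairs


if __name__ == "__main__":
    data = [("example post", "hate")]
    for pair in build_preference_pairs(data, ["hate", "offensive", "normal"]):
        print(pair.chosen, "|", pair.rejected)
```

Triples of prompt, chosen response, and rejected response are the usual input shape for preference-optimization methods such as DPO, so pairs like these could be passed to a trainer of that kind. The summary above does not say which objective or pairing scheme the paper uses, so treat the details here as placeholders.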

https://arxiv.org/abs/2509.15174

