Content moderation relies heavily on hate speech detection, but existing models often fail to generalize because of dataset biases and sentence-level labels that do not capture the structure of hate speech. Even when finer span-level annotations are available (e.g., labeling “artists” as a target and “are parasites” as dehumanizing), models struggle to separate the meaning of those labels from the surrounding context. As a result, novel combinations of expressions remain difficult to detect. The authors investigate whether generalization improves when models are trained on data in which expressions are uniformly distributed across contexts. To test this, they present U-PLEAD, a dataset of around 364,000 synthetic posts, together with a benchmark of approximately 8,000 hand-verified posts. Used alongside real data, U-PLEAD improves compositional generalization and yields state-of-the-art results on PLEAD.

https://arxiv.org/abs/2506.03916
