Synthetic data is now widely used to train models for a range of NLP applications, but prior research has produced conflicting findings about its efficacy on highly subjective tasks such as hate speech detection. Using 3,500 carefully annotated samples, this study provides a thorough qualitative review of the potential and the specific drawbacks of synthetic data for hate speech detection in English. From a computational perspective, it demonstrates that synthetic data produced by paraphrasing gold texts can improve out-of-distribution robustness across several models. However, the synthetic data also substantially reduces the representation of particular identity groups and of intersectional hate, produces radically different class distributions, and fails to replicate the characteristics of real-world data on a number of linguistic variables.

Paper: https://aclanthology.org/2024.emnlp-main.1099
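To make the paraphrase-based augmentation idea concrete, here is a minimal sketch of how gold-annotated texts might be expanded into synthetic training examples with a seq2seq paraphrase model. This is an illustration of the general technique, not the authors' pipeline; the model name, generation parameters, and helper function are assumptions for demonstration only.

```python
# Minimal sketch (not the paper's implementation): create synthetic training
# data by paraphrasing gold-annotated texts with a seq2seq model.
# The checkpoint name below is a placeholder; substitute any paraphrase-capable
# seq2seq model you have access to.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL_NAME = "your-org/paraphrase-model"  # hypothetical checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

def paraphrase(text: str, num_return_sequences: int = 3) -> list[str]:
    """Return several paraphrases of a gold text; each inherits its label."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    outputs = model.generate(
        **inputs,
        do_sample=True,            # sampling encourages lexical variety
        top_p=0.95,
        num_return_sequences=num_return_sequences,
        max_new_tokens=64,
    )
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

# Each gold (text, label) pair yields paraphrased copies carrying the same label.
gold = [("example annotated post", "hateful")]
synthetic = [(p, label) for text, label in gold for p in paraphrase(text)]
```

As the study's findings suggest, any data produced this way should be audited before use: paraphrasing can shift class distributions, erode the representation of specific identity groups and intersectional hate, and diverge from real-world data on linguistic variables.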