Synthetic data is now widely used to train models for a range of NLP tasks. Prior work, however, reports mixed results on its effectiveness for highly subjective tasks such as hate speech detection. Using 3,500 carefully annotated examples, this study presents an in-depth qualitative analysis of the potential and the specific pitfalls of synthetic data for hate speech detection in English. It shows that, across several models, synthetic data created by paraphrasing gold texts can improve out-of-distribution robustness from a computational standpoint. This comes at a cost: synthetic data fails to reliably reflect the characteristics of real-world data on a number of linguistic dimensions, produces drastically different class distributions, and heavily reduces the representation of both specific identity groups and intersectional hate.

https://aclanthology.org/2024.emnlp-main.1099

