The authors investigate whether generative language models can be used to augment existing data and reduce target imbalance, motivated by the strong ability of LLMs to generate high-quality data. They work with the Measuring Hate Speech corpus, an English dataset annotated with target identity information. Starting from 1,000 posts, they add approximately 30,000 synthetic examples, produced with both simple data augmentation (DA) techniques and several types of generative models, comparing autoregressive and sequence-to-sequence approaches. They find that traditional DA methods are often preferable to generative models, although combining the two tends to give the best results. For some hate categories, such as origin, religion, and disability, hate speech classification trained on augmented data improves by more than 10% F1 over the no-augmentation baseline.
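To make "simple data augmentation" concrete, here is a minimal sketch of two token-level perturbations, random swap and random deletion. This summary does not say which simple techniques the paper uses, so the function names (`random_swap`, `random_deletion`, `augment`) and all parameter values below are illustrative assumptions, not the authors' implementation.

```python
import random

def random_swap(tokens, n_swaps, rng):
    """Swap two randomly chosen positions, repeated n_swaps times."""
    out = list(tokens)
    if len(out) < 2:
        return out
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out

def random_deletion(tokens, p, rng):
    """Drop each token with probability p, always keeping at least one."""
    if not tokens:
        return []
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]

def augment(text, n_copies=3, p_delete=0.1, n_swaps=1, seed=0):
    """Return n_copies perturbed variants of text (hypothetical defaults)."""
    rng = random.Random(seed)
    tokens = text.split()
    copies = []
    for _ in range(n_copies):
        swapped = random_swap(tokens, n_swaps, rng)
        copies.append(" ".join(random_deletion(swapped, p_delete, rng)))
    return copies
```

Each synthetic copy would keep the label and target identity of the post it was derived from, which is how augmentation of this kind can add examples for under-represented targets. Generative approaches would instead produce new text with a language model.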

https://arxiv.org/abs/2410.08053

