Gathering labeled hate speech data is expensive, particularly for low-resource languages. Previous research indicates that data augmentation and cross-lingual transfer learning help in low-data settings. The authors propose a scalable method that uses nearest-neighbor retrieval to improve detection when little labeled data is available in the target language. A small labeled set in the target language serves as the query for retrieving relevant instances from a large multilingual pool. Tested on eight languages, the approach consistently beats models trained only on target-language data and frequently surpasses state-of-the-art results. It is data-efficient, often using only 200 samples, and scales to new languages and tasks. In some cases, using maximum marginal relevance to reduce redundancy among the retrieved instances improves performance further.
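
The retrieval step described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: it assumes every text has already been embedded by some multilingual sentence encoder, it uses cosine similarity, and the function names `retrieve_neighbors` and `mmr_select`, the parameter `k`, and the trade-off weight `lam` are all hypothetical choices made for this example.

```python
import numpy as np


def normalize(x):
    """Scale each row to unit length so dot products equal cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)


def retrieve_neighbors(target_emb, pool_emb, k):
    """Collect the k nearest pool items for every labeled target example.

    Returns the sorted union of retrieved pool indices and, for every pool
    item, its highest similarity to any target example (used as relevance).
    """
    sims = normalize(target_emb) @ normalize(pool_emb).T  # (n_target, n_pool)
    top_k = np.argsort(-sims, axis=1)[:, :k]
    candidates = np.unique(top_k)
    relevance = sims.max(axis=0)
    return candidates, relevance


def mmr_select(candidates, relevance, pool_emb, n, lam=0.7):
    """Greedy maximum marginal relevance over the retrieved candidates.

    Each step picks the candidate maximizing
    lam * relevance - (1 - lam) * (max similarity to already selected items).
    lam = 1.0 ignores redundancy and ranks by relevance alone.
    """
    emb = normalize(pool_emb[candidates])
    pairwise = emb @ emb.T
    rel = relevance[candidates]
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < n:
        if selected:
            redundancy = pairwise[np.ix_(remaining, selected)].max(axis=1)
        else:
            redundancy = np.zeros(len(remaining))
        scores = lam * rel[remaining] - (1 - lam) * redundancy
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return candidates[selected]
```

The selected pool indices would then be added to the small target-language set before training the classifier. For a pool with millions of entries, the brute-force similarity matrix here would likely be replaced by an approximate nearest-neighbor index.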

https://arxiv.org/abs/2505.14272

