Despite the growing threat that online hate speech poses to safety and social unity, Southeast Asian languages such as Malay remain underrepresented in NLP research. This dataset helps fill that gap by providing 26,985 multilingual Malay-English social media texts labelled for binary hate speech classification. Its entries are high-confidence and quality-controlled: they were curated from five public sources and filtered using human annotation and pseudo-labelling. Created for multilingual machine learning applications, the dataset supports cross-lingual benchmarking, transformer-based classifiers, and educational resources for Malay- and English-speaking communities.

https://www.sciencedirect.com/science/article/pii/S2352340925008741
