L3Cube-MahaHate: A Tweet-based Marathi Hate Speech Detection Dataset and BERT models (arXiv)

The dataset is curated from Twitter and annotated manually. Our dataset consists of over 25,000 distinct tweets labeled into four major classes: hate, offensive, profane, and not. We present the approaches used for collecting and annotating the data and the challenges faced during the process. Finally, we present baseline classification results using deep learning models based on CNN, LSTM, and Transformers. We explore monolingual and multilingual variants of BERT, namely MahaBERT, IndicBERT, mBERT, and xlm-RoBERTa, and show that monolingual models perform better than their multilingual counterparts. The MahaBERT model provides the best results on the L3Cube-MahaHate corpus.
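To make the four-class setup concrete, here is a minimal sketch of how predictions from any of the baseline models could be scored against gold labels. The class names come from the abstract. The integer ids, the helper names (`encode`, `accuracy`, `macro_f1`), and the choice of accuracy and macro-F1 as metrics are assumptions for illustration; the abstract does not say which metrics the paper reports.

```python
from collections import Counter

# Class names taken from the abstract; the integer ids are an arbitrary assumption.
LABELS = ["hate", "offensive", "profane", "not"]
LABEL_TO_ID = {name: i for i, name in enumerate(LABELS)}


def encode(labels):
    """Map class names to integer ids (hypothetical helper)."""
    return [LABEL_TO_ID[name] for name in labels]


def accuracy(gold, pred):
    """Fraction of examples where the predicted id equals the gold id."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and equally long")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred, num_classes=len(LABELS)):
    """Unweighted mean of per-class F1; a class with no support and no predictions scores 0."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and equally long")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class p wrongly
            fn[g] += 1  # missed the gold class g
    scores = []
    for c in range(num_classes):
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / num_classes


# Toy example with made-up labels, not data from the corpus.
gold = encode(["hate", "offensive", "profane", "not"])
pred = encode(["hate", "offensive", "not", "not"])
print(accuracy(gold, pred), macro_f1(gold, pred))
```

Macro-F1 weights all four classes equally, which can be more informative than accuracy when comparing monolingual and multilingual models on classes that are easy to confuse, such as offensive versus profane.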

https://arxiv.org/pdf/2203.13778.pdf

Related Post

Using Transfer-based Language Models to Detect Hateful and Offensive Language Online (Proceedings of the Fourth Workshop on Online Abuse and Harms)

“The results indicate that the attention-based models profoundly confuse hate speech with offensive and normal language. However, the pre-trained models outperform state-of-the-art results in terms of accurately predicting the hateful