The datasets used to train hate speech detection models affect the models' performance. Most existing datasets are built from a small number of instances and cover only a few hate domains, the categories that define hate topics. This limits large-scale analysis and transfer learning across hate domains. In this work, we construct large-scale Twitter datasets for hate speech detection, each containing 100k human-labeled tweets, for English and for Turkish, a low-resource language. Each dataset distributes an equal number of tweets across five hate domains. Experiments supported by statistical tests show that, for large-scale hate speech detection, Transformer-based language models outperform conventional bag-of-words and neural models by at least 5% in English and 10% in Turkish. Performance also scales well to smaller training sets: with only 20% of the training instances, models recover 98% of the English performance and 97% of the Turkish performance. We also investigate how well cross-domain transfer generalizes between hate domains. On average, models trained on other domains recover 96% of a target domain's performance for English and 92% for Turkish. Gender and religion generalize best to other domains, while sports generalizes worst.

https://paperswithcode.com/paper/large-scale-hate-speech-detection-with-cross-1/review
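To make the "recovered performance" figures concrete, here is a minimal sketch of how such a ratio could be computed. It assumes recovery means the score of a model trained on a source domain and evaluated on the target domain, divided by the score of a model trained and evaluated on the target domain itself; the abstract does not spell out the exact definition. The function names and the numbers in `toy_scores` are hypothetical placeholders, not results from the paper.

```python
from statistics import mean


def recovery_ratio(cross_domain_score: float, in_domain_score: float) -> float:
    """Fraction of in-domain performance retained by a cross-domain model.

    Assumed definition: cross-domain score divided by in-domain score.
    """
    if in_domain_score <= 0:
        raise ValueError("in_domain_score must be positive")
    return cross_domain_score / in_domain_score


def average_recovery(scores: dict[str, dict[str, float]], target: str) -> float:
    """Mean recovery for `target`, averaged over all other source domains.

    scores[source][target] holds the score of a model trained on `source`
    and evaluated on `target`.
    """
    in_domain = scores[target][target]
    ratios = [
        recovery_ratio(scores[source][target], in_domain)
        for source in scores
        if source != target
    ]
    return mean(ratios)


# Placeholder scores for illustration only; NOT taken from the paper.
toy_scores = {
    "religion": {"religion": 0.80, "sports": 0.60},
    "sports": {"religion": 0.72, "sports": 0.75},
}

for domain in toy_scores:
    print(domain, round(average_recovery(toy_scores, domain), 3))
```

With five domains, the same function would average over the four non-target source domains for each target.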