Annotation bias in the hate speech datasets used to train detection models is known to bias those models against text written in African American English (AAE). As a result, ordinary AAE writing is more likely than non-AAE text to be incorrectly labeled as abusive or offensive. Simple debiasing methods have been proposed to reduce this disparity; in this work, we implement and evaluate them on RoBERTa-based encoders. Our experiments show that these methods can reduce the disparity between dialect subgroups on the hate speech detection task when representation bias is properly accounted for, but that their effectiveness depends largely on how the training datasets are constructed.
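The disparity described above can be made concrete as a gap in false positive rates between dialect subgroups: the share of non-abusive texts in each group that a classifier wrongly flags. The sketch below is only an illustration under that assumption. The abstract does not say which metric the paper uses, and the function name, group labels, and toy data are all hypothetical.

```python
from collections import defaultdict


def false_positive_rates(labels, preds, groups):
    """Return the per-group false positive rate.

    labels and preds use 1 for "abusive" and 0 for "not abusive".
    Groups with no non-abusive examples are left out of the result.
    """
    false_positives = defaultdict(int)
    negatives = defaultdict(int)
    for label, pred, group in zip(labels, preds, groups):
        if label == 0:  # only truly non-abusive texts can be false positives
            negatives[group] += 1
            if pred == 1:
                false_positives[group] += 1
    return {g: false_positives[g] / negatives[g] for g in negatives}


# Hypothetical toy data, not taken from the paper.
labels = [0, 0, 0, 0, 1, 0, 0, 0, 0, 1]
preds = [1, 1, 0, 0, 1, 1, 0, 0, 0, 1]
groups = ["aae"] * 5 + ["non_aae"] * 5

rates = false_positive_rates(labels, preds, groups)
fpr_gap = rates["aae"] - rates["non_aae"]
print(rates, fpr_gap)
```

A debiasing method that works in this sense would move `fpr_gap` toward zero without simply lowering accuracy for both groups.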

https://arxiv.org/abs/2501.15430

