“We propose a transparent method that calculates these values by generating explicit perturbations of the input text, allowing the importance scores themselves to be explainable. We employ our method to explain the predictions of different hate speech detection models on the same set of curated examples from a test suite, and show that different values of necessity and sufficiency for identity terms correspond to different kinds of false positive errors, exposing sources of classifier bias against marginalized groups.”

Paper: https://www.svkir.com/papers/Balkir-et-al-SuffNecc-NAACL-2022.pdf
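
To make the perturbation-based idea in the abstract concrete, here is a minimal sketch of how necessity and sufficiency scores for a single token can be estimated by sampling masked variants of the input. This is an illustration of the general technique, not the authors' implementation: the function names, the `[MASK]`-replacement scheme, the 0.5 context-masking probability, and the `predict` interface are all assumptions; consult the paper for the exact estimators.

```python
import random
from typing import Callable, List

# Assumed interface: a classifier that maps a token list to a class label.
Predictor = Callable[[List[str]], int]


def necessity(tokens: List[str], idx: int, predict: Predictor,
              mask: str = "[MASK]", n: int = 100, seed: int = 0) -> float:
    """Fraction of perturbations in which masking the target token
    (along with a random subset of the context) flips the prediction."""
    rng = random.Random(seed)
    base = predict(tokens)
    flips = 0
    for _ in range(n):
        # Always mask the target; mask each context token with prob 0.5.
        sample = [mask if (i == idx or rng.random() < 0.5) else t
                  for i, t in enumerate(tokens)]
        flips += predict(sample) != base
    return flips / n


def sufficiency(tokens: List[str], idx: int, predict: Predictor,
                mask: str = "[MASK]", n: int = 100, seed: int = 0) -> float:
    """Fraction of perturbations in which the prediction is preserved
    when the target token is kept and random context tokens are masked."""
    rng = random.Random(seed)
    base = predict(tokens)
    kept = 0
    for _ in range(n):
        # Always keep the target; mask each context token with prob 0.5.
        sample = [t if (i == idx or rng.random() < 0.5) else mask
                  for i, t in enumerate(tokens)]
        kept += predict(sample) == base
    return kept / n


# Toy usage with a keyword stand-in for a hate speech classifier
# (purely illustrative; real models would be neural classifiers):
toy: Predictor = lambda toks: int("slur" in toks)
sent = "you people are a slur".split()
print(necessity(sent, sent.index("slur"), toy))   # 1.0: masking it always flips the label
print(sufficiency(sent, sent.index("slur"), toy)) # 1.0: keeping it always preserves the label
```

In this toy setup the identity-like keyword gets both high necessity and high sufficiency, which is the kind of pattern the paper associates with specific classes of false positive errors on identity terms.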