Necessity and Sufficiency for Explaining Text Classifiers: A Case Study in Hate Speech Detection (SVKIR)

May 13, 2022

“We propose a transparent method that calculates these values by generating explicit perturbations of the input text, allowing the importance scores themselves to be explainable. We employ our method to explain the predictions of different hate speech detection models on the same set of curated examples from a test suite, and show that different values of necessity and sufficiency for identity terms correspond to different kinds of false positive errors, exposing sources of classifier bias against marginalized groups.”

https://www.svkir.com/papers/Balkir-et-al-SuffNecc-NAACL-2022.pdf
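
To make the quoted idea concrete, here is a minimal sketch of perturbation-based necessity and sufficiency scores for a single token. It is an illustration under simplifying assumptions, not the paper's procedure: the `[MASK]` placeholder, the 0.5 perturbation probability, the function names, and both toy classifiers are hypothetical, and the paper generates explicit replacement text rather than inserting a placeholder. Here necessity is estimated as the fraction of perturbations that change the target token and flip a positive prediction, and sufficiency as the fraction of perturbations that keep the target token, change its context at random, and still yield a positive prediction.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder standing in for generated replacement text


def perturb(tokens, always, never, rng, p=0.5):
    """Replace positions in `always`, keep positions in `never`,
    and replace every other position with probability `p`."""
    out = []
    for i, tok in enumerate(tokens):
        if i in always or (i not in never and rng.random() < p):
            out.append(MASK)
        else:
            out.append(tok)
    return out


def necessity(classify, tokens, idx, n=200, seed=0):
    """Fraction of perturbations that change token `idx` and flip a positive prediction."""
    rng = random.Random(seed)
    flips = sum(not classify(perturb(tokens, {idx}, set(), rng)) for _ in range(n))
    return flips / n


def sufficiency(classify, tokens, idx, n=200, seed=0):
    """Fraction of perturbations that keep token `idx` and still get a positive prediction."""
    rng = random.Random(seed)
    kept = sum(classify(perturb(tokens, set(), {idx}, rng)) for _ in range(n))
    return kept / n


# Two toy classifiers (assumptions for illustration only). "GROUP" stands in
# for an identity term.
def flags_identity_term(tokens):
    # Over-sensitive: flags any text that mentions the identity term.
    return "GROUP" in tokens


def flags_only_with_context(tokens):
    # Flags the identity term only when it co-occurs with "hate".
    return "GROUP" in tokens and "hate" in tokens


if __name__ == "__main__":
    tokens = "i hate GROUP".split()
    for clf in (flags_identity_term, flags_only_with_context):
        print(clf.__name__,
              "necessity:", necessity(clf, tokens, 2),
              "sufficiency:", sufficiency(clf, tokens, 2))
```

In this toy setup the identity term is fully necessary for both classifiers, but it is fully sufficient only for the first one, which reacts to the term regardless of context. Comparing the two scores in this way is the kind of contrast the quoted passage describes, although the real method operates on trained hate speech detection models and curated test-suite examples.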