“We propose a transparent method that calculates these values by generating explicit perturbations of the input text, allowing the importance scores themselves to be explainable. We employ our method to explain the predictions of different hate speech detection models on the same set of curated examples from a test suite, and show that different values of necessity and sufficiency for identity terms correspond to different kinds of false positive errors, exposing sources of classifier bias against marginalized groups.”

https://www.svkir.com/papers/Balkir-et-al-SuffNecc-NAACL-2022.pdf
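To make the idea concrete, here is a minimal toy sketch of perturbation-based necessity and sufficiency scores. This is not the authors' implementation: `toy_predict` is a made-up keyword classifier standing in for a real hate speech model, and the scoring functions only illustrate the general recipe of deleting a token (necessity) or keeping it in isolation (sufficiency).

```python
# Hedged sketch of perturbation-based necessity/sufficiency scores.
# toy_predict is a hypothetical stand-in classifier, NOT the paper's models.

def toy_predict(tokens):
    """Toy classifier: probability the text is flagged, driven by keywords."""
    flagged = {"hate", "awful"}
    hits = sum(t in flagged for t in tokens)
    return min(1.0, 0.25 + 0.25 * hits)

def necessity(tokens, i):
    """Score drop when token i is deleted: high => the token is necessary."""
    full = toy_predict(tokens)
    without = toy_predict(tokens[:i] + tokens[i + 1:])
    return full - without

def sufficiency(tokens, i):
    """Score when only token i is kept: high => the token alone suffices."""
    return toy_predict([tokens[i]])

tokens = "this is awful hate speech".split()
i = tokens.index("hate")
print(necessity(tokens, i))    # 0.25
print(sufficiency(tokens, i))  # 0.5
```

In this toy setup, a term with high sufficiency but low necessity would flag the classifier as over-relying on that term in isolation, which mirrors the kind of false positive analysis described in the abstract.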
