The paper presents CoARL, a novel framework that models the pragmatic implications of the social biases underlying hateful statements in order to improve counterspeech generation. In its first two phases, CoARL learns to understand the intents, reactions, and harms of offensive comments through sequential multi-instruction tuning; it then learns task-specific low-rank adapter weights for generating intent-conditioned counterspeech. The final phase uses reinforcement learning to fine-tune outputs for effectiveness and non-toxicity. CoARL outperforms existing benchmarks in intent-conditioned counterspeech generation, with average improvements of roughly 3 points on intent-conformity and roughly 4 points on argument-quality metrics. Extensive human evaluation further supports CoARL's ability to produce superior, more context-appropriate responses than other systems, including widely used LLMs such as ChatGPT.
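The final reinforcement-learning phase optimizes for two competing signals, effectiveness and non-toxicity. A minimal sketch of such a trade-off is a scalar reward that subtracts a weighted toxicity penalty from an effectiveness score; the function, its weight, and the assumption that both scores come from external classifiers in [0, 1] are illustrative only, not the paper's actual reward design.

```python
def counterspeech_reward(effectiveness: float, toxicity: float,
                         toxicity_weight: float = 0.5) -> float:
    """Illustrative reward for RL fine-tuning of counterspeech.

    `effectiveness` and `toxicity` are assumed to be scores in [0, 1]
    produced by external scorers (hypothetical here); CoARL's actual
    reward formulation may differ.
    """
    # Reward effective replies, penalize toxic ones.
    return effectiveness - toxicity_weight * toxicity

# An equally effective but more toxic reply receives a lower reward.
print(counterspeech_reward(0.9, 0.1))  # 0.85
print(counterspeech_reward(0.9, 0.8))  # 0.5
```

In practice the scalar reward would be fed to a policy-optimization loop (e.g., PPO) over the adapter-tuned generator; the linear combination above is just the simplest way to express the two objectives jointly.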

https://aclanthology.org/2024.naacl-long.374

