The authors use text embeddings from computational linguistics to develop and evaluate an approach for quantifying the distortions that content moderation causes in online discourse. They test their measure on a representative dataset of 5 million US political Tweets and find that removing toxic Tweets distorts online content. This result holds across different samples, toxicity measures, and embedding models. Crucially, they show that the toxic language itself is not the source of these distortions. Rather, as a side effect, content moderation shifts the mean and variance of the embedding space, which distorts the topic composition of online content. Lastly, the researchers propose an alternative to removal: using generative large language models to rephrase toxic Tweets, which preserves their salvageable content instead of eliminating them entirely. They show that this rephrasing strategy lowers toxicity while minimizing distortions in online content.

https://arxiv.org/abs/2412.16114
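
To make the idea of a shift in the mean and variance of the embedding space concrete, here is a minimal sketch. It is not the authors' measure: the function names `moderation_distortion` and `make_toy_data`, the 0.5 toxicity threshold, and the synthetic two-topic data are all assumptions made for illustration. It compares the mean vector and the total variance of a set of embeddings before and after dropping the rows flagged as toxic.

```python
import numpy as np


def moderation_distortion(embeddings, toxicity, threshold=0.5):
    """Compare embedding statistics before and after dropping toxic rows.

    Illustrative only: a simple mean/variance comparison, not the paper's measure.
    """
    embeddings = np.asarray(embeddings, dtype=float)
    toxicity = np.asarray(toxicity, dtype=float)
    keep = toxicity < threshold  # rows that survive moderation
    if keep.sum() < 2:
        raise ValueError("need at least two non-toxic rows")
    kept = embeddings[keep]

    # Euclidean distance between the mean embedding before and after removal.
    mean_shift = np.linalg.norm(kept.mean(axis=0) - embeddings.mean(axis=0))

    # Total variance: sum of per-dimension sample variances.
    var_before = embeddings.var(axis=0, ddof=1).sum()
    var_after = kept.var(axis=0, ddof=1).sum()

    return {
        "mean_shift": float(mean_shift),
        "variance_change": float(var_after - var_before),
        "removed_share": float(1.0 - keep.mean()),
    }


def make_toy_data(seed=0, n_per_topic=1000, dim=8):
    """Synthetic stand-in for Tweet embeddings (hypothetical data).

    Two topics sit in different regions of the space, and toxicity scores
    are higher on average in topic B than in topic A.
    """
    rng = np.random.default_rng(seed)
    topic_a = rng.normal(loc=0.0, scale=1.0, size=(n_per_topic, dim))
    topic_b = rng.normal(loc=3.0, scale=1.0, size=(n_per_topic, dim))
    tox_a = rng.uniform(0.0, 0.6, size=n_per_topic)
    tox_b = rng.uniform(0.3, 1.0, size=n_per_topic)
    return np.vstack([topic_a, topic_b]), np.concatenate([tox_a, tox_b])


if __name__ == "__main__":
    emb, tox = make_toy_data()
    print(moderation_distortion(emb, tox))
```

In this toy setup, toxic rows are concentrated in one topic, so removing them pulls the mean toward the other topic and shrinks the spread of the space. That is the kind of topic-composition side effect the summary describes, although real Tweet embeddings and toxicity scores would of course behave less cleanly.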