This study tackles the challenge of multimodal hate speech detection in internet memes, which frequently combine text and images to convey cultural biases. The authors propose ViT-BERT CAMT, a cross-attention multitask model that uses a linear self-attentive fusion mechanism to combine Vision Transformer (ViT) visual features with BERT-based textual representations. Evaluated on the SemEval 2020 Memotion and MIMIC datasets, the model performs strongly at identifying sentiment, sarcasm, offensiveness, and discriminatory content. The results demonstrate how joint image-text modeling captures subtle semantic and spatial relationships, supporting culturally sensitive automated moderation of online discourse.

https://www.sciencedirect.com/science/article/abs/pii/S0893608025009694
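The paper's exact architecture isn't reproduced here, but the description above suggests the general shape of such a model. Below is a minimal PyTorch sketch, assuming bidirectional cross-attention between BERT token features and ViT patch features, a linear self-attentive pooling step, and one classification head per task. All class names, dimensions, and design details are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class CrossAttentionMultitaskFusion(nn.Module):
    """Hypothetical sketch of cross-attention multitask fusion.

    Combines precomputed BERT token features and ViT patch features;
    task names, sizes, and class counts are placeholders, not taken
    from the ViT-BERT CAMT paper.
    """

    def __init__(self, dim=768, n_heads=8,
                 tasks=("sentiment", "sarcasm", "offensive", "discriminatory"),
                 n_classes=2):
        super().__init__()
        # Text tokens attend to image patches, and vice versa.
        self.text_to_image = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.image_to_text = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        # Linear self-attentive pooling: score each fused token, softmax,
        # then take the weighted sum as a single joint representation.
        self.pool_score = nn.Linear(dim, 1)
        # One lightweight classification head per task (multitask setup).
        self.heads = nn.ModuleDict({t: nn.Linear(dim, n_classes) for t in tasks})

    def forward(self, text_feats, image_feats):
        # text_feats:  (B, T, dim) token embeddings from a BERT encoder
        # image_feats: (B, P, dim) patch embeddings from a ViT encoder
        t2i, _ = self.text_to_image(text_feats, image_feats, image_feats)
        i2t, _ = self.image_to_text(image_feats, text_feats, text_feats)
        fused = torch.cat([t2i, i2t], dim=1)                    # (B, T+P, dim)
        weights = torch.softmax(self.pool_score(fused), dim=1)  # (B, T+P, 1)
        pooled = (weights * fused).sum(dim=1)                   # (B, dim)
        return {task: head(pooled) for task, head in self.heads.items()}

# Toy forward pass with random tensors standing in for BERT/ViT outputs.
model = CrossAttentionMultitaskFusion()
text = torch.randn(4, 32, 768)    # batch of 4 memes, 32 text tokens each
image = torch.randn(4, 197, 768)  # 196 ViT patches + [CLS] token
logits = model(text, image)
print({k: v.shape for k, v in logits.items()})
```

In a sketch like this, each task head stays small while the cross-attention layers and pooled representation are shared, which is the usual rationale for multitask learning on related meme-analysis labels.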