This study tackles multimodal hate speech identification in internet memes, which frequently combine text and images to express cultural biases. The authors present ViT-BERT CAMT, a cross-attention multitask model that uses a linear self-attentive fusion mechanism to combine vision transformer (ViT) visual features with BERT-based textual representations. Evaluated on the SemEval 2020 Memotion and MIMIC datasets, the model performs strongly at identifying sentiment, sarcasm, offensiveness, and discriminatory content. The results demonstrate how joint image-text modeling captures subtle semantic and spatial links, supporting automatic moderation of online discourse that is sensitive to cultural differences.

https://www.sciencedirect.com/science/article/abs/pii/S0893608025009694
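The cross-attention fusion the summary describes can be sketched roughly as below. This is a minimal illustration under assumed dimensions, pooling choice, and task names, not the authors' implementation; the paper's exact architecture is behind the link, and real inputs would come from pretrained ViT and BERT encoders rather than random tensors.

```python
# Hedged sketch of cross-attention multitask fusion: text tokens attend to
# image patches, a simple self-attentive pooling summarizes the fused
# sequence, and per-task heads produce predictions. All sizes and task
# names here are illustrative assumptions.
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, dim=768, heads=8,
                 tasks=("sentiment", "sarcasm", "offensive")):
        super().__init__()
        # Queries come from text, keys/values from image patches.
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Linear self-attentive pooling over the fused token sequence.
        self.pool_score = nn.Linear(dim, 1)
        # One lightweight binary head per task (multitask learning).
        self.heads = nn.ModuleDict({t: nn.Linear(dim, 2) for t in tasks})

    def forward(self, text_emb, image_emb):
        # text_emb: (B, T, dim) BERT-style token features
        # image_emb: (B, P, dim) ViT-style patch features
        fused, _ = self.cross_attn(text_emb, image_emb, image_emb)
        weights = torch.softmax(self.pool_score(fused), dim=1)  # (B, T, 1)
        pooled = (weights * fused).sum(dim=1)                   # (B, dim)
        return {t: head(pooled) for t, head in self.heads.items()}

# Usage with random stand-in features:
model = CrossAttentionFusion()
text = torch.randn(4, 32, 768)    # batch of 4, 32 text tokens
image = torch.randn(4, 196, 768)  # 196 patches (14x14 grid)
out = model(text, image)
print({k: tuple(v.shape) for k, v in out.items()})
```

Each task head receives the same pooled multimodal representation, which is what lets the shared fusion layers benefit from supervision across sentiment, sarcasm, and offensiveness labels simultaneously.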