Implicit Modality Mining: An End-to-End Method for Multimodal Information Extraction


Named entity recognition
Relation extraction
Patch projection



Submitted: 2024-02-29
Accepted: 2024-03-15
Published: 2024-03-30


Multimodal named entity recognition (MNER) and multimodal relation extraction (MRE) are key tasks in social media analysis, but existing approaches face two challenges. (1) Heavy visual embedding: visual embedding is expensive in both time and computation because explicit visual cues must be extracted from the original image before it is fed into the multimodal model. Consequently, these approaches cannot perform efficient online inference. (2) Suboptimal interaction handling: interaction between modalities is typically managed by alternating self-attention and cross-attention, or by relying heavily on gating mechanisms. Such explicit modeling may fail to capture nuanced relations between image and text, which undermines the model's ability to extract information. To address these challenges, we introduce Implicit Modality Mining (IMM), a novel end-to-end framework that models fine-grained image-text correlation without heavy visual embedders. IMM uses an Implicit Semantic Alignment module, built on a Transformer, to mine cross-modal clues, and an Insert-Activation module to make effective use of these clues. Our approach achieves state-of-the-art performance on three datasets.
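To make the idea of avoiding a heavy visual embedder concrete, the following is a minimal sketch of patch projection in the style of Vision Transformers: the raw image is cut into fixed-size patches, each patch is flattened, and a single shared linear layer maps it to the text hidden size so that text and visual tokens can enter one Transformer together. The abstract does not specify IMM's internals, so this is an illustration under stated assumptions and not the authors' implementation; the function names (`split_into_patches`, `project`, `build_joint_sequence`), the sizes, and the random weights are all hypothetical.

```python
import random


def split_into_patches(image, patch):
    """Split an H x W x C image (nested lists) into flattened patch vectors."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # Flatten one patch row by row, pixel by pixel, channel by channel.
            flat = [
                channel
                for row in image[top:top + patch]
                for pixel in row[left:left + patch]
                for channel in pixel
            ]
            patches.append(flat)
    return patches


def project(vectors, weights, bias):
    """Apply one shared linear layer, out = v @ W + b, to every vector.

    `weights` has shape (input_dim, hidden_dim), stored as a list of rows.
    """
    return [
        [
            sum(v_i * w_row[j] for v_i, w_row in zip(vec, weights)) + bias[j]
            for j in range(len(bias))
        ]
        for vec in vectors
    ]


def build_joint_sequence(text_embeddings, image, patch, weights, bias):
    """Concatenate text token embeddings with projected image patches.

    Assumption: the joint sequence would then be fed to a single Transformer,
    with no object detector or CNN backbone run on the image beforehand.
    """
    visual_tokens = project(split_into_patches(image, patch), weights, bias)
    return text_embeddings + visual_tokens


if __name__ == "__main__":
    rng = random.Random(0)
    hidden, patch, channels = 8, 2, 3  # hypothetical sizes for illustration
    image = [[[rng.random() for _ in range(channels)] for _ in range(4)]
             for _ in range(4)]
    input_dim = patch * patch * channels
    weights = [[rng.gauss(0.0, 0.02) for _ in range(hidden)]
               for _ in range(input_dim)]
    bias = [0.0] * hidden
    text_embeddings = [[rng.gauss(0.0, 1.0) for _ in range(hidden)]
                       for _ in range(5)]
    sequence = build_joint_sequence(text_embeddings, image, patch, weights, bias)
    print(len(sequence), len(sequence[0]))  # 9 8: 5 text tokens + 4 patch tokens
```

The only per-image work here is one linear map per patch, which is the kind of cost profile that would permit online inference. How IMM then aligns and activates the cross-modal clues is left to the Implicit Semantic Alignment and Insert-Activation modules described in the paper.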

