With the rapid development of artificial intelligence and deep learning, attention mechanisms have become a key technology for improving model performance on complex tasks. This paper reviews the evolution of attention mechanisms, from soft and hard attention to recent innovations such as multi-head latent attention and cross-attention. It focuses on the latest research outcomes, including lightning attention, the PADRe polynomial attention drop-in replacement, the context anchor attention module, and improvements to attention mechanisms in large models. These advances improve model efficiency and accuracy and expand the application potential of attention mechanisms in fields such as computer vision, natural language processing, and remote sensing object detection. The review aims to provide readers with a comprehensive understanding of the field and to stimulate innovative thinking.
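As background for the variants surveyed above, the sketch below illustrates standard scaled dot-product (soft) attention as introduced by Vaswani et al. (2017), the core computation that multi-head, cross-, latent, and the efficiency-oriented variants (lightning attention, PADRe) build on or replace. This is a minimal NumPy illustration; the function name, shapes, and toy data are assumptions for demonstration, not code from any cited work.

```python
# Minimal sketch of scaled dot-product (soft) attention, per Vaswani et al. (2017).
# Variable names, shapes, and the toy data below are illustrative assumptions.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Similarity scores between every query and every key, scaled by sqrt(d_k).
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)        # shape (n_q, n_k)
    # Softmax over the key dimension gives the soft attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output is a weighted average of the value vectors.
    return weights @ V                                     # shape (n_q, d_v)

# Toy example: 4 queries attending over 6 key/value pairs of dimension 8.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
print(scaled_dot_product_attention(Q, K, V).shape)        # -> (4, 8)
```

Multi-head attention runs this operation in parallel over several learned projections of Q, K, and V; the efficiency-oriented variants reviewed here are motivated by the quadratic cost of the score matrix in sequence length.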
Vaswani A, Shazeer N, Parmar N, et al., 2017, Attention is All You Need. Advances in Neural Information Processing Systems, 30: 5998–6008.
Hu J, Shen L, Sun G, 2018, Squeeze-and-Excitation Networks, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 7132–7141.
Devlin J, Chang MW, Lee K, et al., 2019, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, 4171–4186.
Wang X, Girshick R, Gupta A, et al., 2018, Non-local Neural Networks, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 7794–7803.
Anderson P, He X, Buehler C, et al., 2018, Bottom-up and top-down Attention for Image Captioning and Visual Question Answering, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 6077–6086.
Bahdanau D, Cho K, Bengio Y, 2014, Neural Machine Translation by Jointly Learning to Align and Translate. https://arxiv.org/abs/1409.0473
Luong MT, Pham H, Manning CD, 2015, Effective Approaches to Attention-based Neural Machine Translation. https://arxiv.org/abs/1508.04025
Sukhbaatar S, Weston J, Fergus R, 2015, End-to-end Memory Networks. Advances in Neural Information Processing Systems, 28.
Yao L, Torabi A, Cho K, et al., 2015, Describing Videos by Exploiting Temporal Structure, Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 4507–4515.
Martins A, Astudillo R, 2016, From Softmax to Sparsemax: A Sparse Model of Attention and Multi-label Classification, International Conference on Machine Learning. PMLR, 1614–1623.
Yang Z, Yang D, Dyer C, et al., 2016, Hierarchical Attention Networks for Document Classification, Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, 1480–1489.
Lu J, Xiong C, Parikh D, et al., 2017, Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 375–383.
Gheini M, Ren X, May J, 2021, Cross-attention is All You Need: Adapting Pretrained Transformers for Machine Translation. https://arxiv.org/abs/2104.08771
Qin Z, Sun W, Li D, et al., 2024, Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models. https://arxiv.org/abs/2401.04658
Liu A, Feng B, Wang B, et al., 2024, DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model. https://arxiv.org/abs/2405.04434
Yuan J, Gao H, Dai D, et al., 2025, Native Sparse Attention: Hardware-aligned and Natively Trainable Sparse Attention. https://arxiv.org/abs/2502.11089
Cai X, Lai Q, Wang Y, et al., 2024, Poly Kernel Inception Network for Remote Sensing Detection, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 27706–27716.
Letourneau PD, Singh MK, Cheng HP, et al., 2024, PADRe: A Unifying Polynomial Attention Drop-in Replacement for Efficient Vision Transformer. https://arxiv.org/abs/2407.11306