MFA-Conformer models are widely used in English and Chinese speaker recognition. Although speaker recognition is theoretically language-independent, performance in practice depends on the training language, and Tibetan speaker recognition still relies on traditional models with poor performance. To address this, we adopt the MFA-Conformer as the base framework and propose three improvements: integrating 1D depth-wise separable convolution and channel attention into the Conformer feed-forward module, fusing features across multiple Conformer blocks, and adding an intra-class correlation regularizer to the GE2E loss. Experiments show that the improved model reduces the equal error rate (EER) compared with the Conformer baseline.
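The loss modification described above can be sketched as follows. This is a minimal NumPy illustration of the standard GE2E softmax loss (Wan et al., 2018) combined with a simple intra-class spread penalty; the penalty term, the weight `lam`, and the function name are hypothetical stand-ins, since the paper's exact intra-class correlation regularizer is not specified in the abstract.

```python
import numpy as np

def ge2e_loss_with_icc(embeddings, w=10.0, b=-5.0, lam=0.1):
    """GE2E softmax loss over a batch of shape (N speakers, M utterances, D),
    plus a hypothetical intra-class spread penalty weighted by lam."""
    N, M, D = embeddings.shape
    e = embeddings / np.linalg.norm(embeddings, axis=-1, keepdims=True)
    centroids = e.mean(axis=1)                               # (N, D)
    # Exclusive centroids: leave out utterance i when scoring its own speaker,
    # as in the original GE2E formulation.
    excl = (e.sum(axis=1, keepdims=True) - e) / (M - 1)      # (N, M, D)
    excl /= np.linalg.norm(excl, axis=-1, keepdims=True)
    c = centroids / np.linalg.norm(centroids, axis=-1, keepdims=True)
    # Scaled cosine similarity of every utterance vs. every speaker centroid.
    sim = w * np.einsum('nmd,kd->nmk', e, c) + b             # (N, M, N)
    own = w * np.einsum('nmd,nmd->nm', e, excl) + b          # (N, M)
    idx = np.arange(N)
    sim[idx, :, idx] = own        # replace same-speaker entries with exclusive scores
    # Softmax cross-entropy: each utterance should match its own speaker.
    log_probs = sim - np.log(np.exp(sim).sum(axis=-1, keepdims=True))
    ge2e = -log_probs[idx, :, idx].mean()
    # Hypothetical intra-class term: penalize spread around each centroid.
    icc = ((e - centroids[:, None, :]) ** 2).sum(axis=-1).mean()
    return ge2e + lam * icc
```

In practice such a regularizer pulls each speaker's utterance embeddings toward their centroid, complementing the GE2E term, which mainly pushes embeddings away from other speakers' centroids.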
Boles A, Rad P, 2017, Voice Biometrics: Deep Learning-based Voiceprint Authentication System, In: 2017 12th System of Systems Engineering Conference (SoSE), 1–6.
Chowdhury F, Wang Q, Moreno I, 2018, Attention-based Models for Text-Dependent Speaker Verification, In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5359–5363.
Peddinti V, Povey D, Khudanpur S, 2015, A Time Delay Neural Network Architecture for Efficient Modeling of Long Temporal Contexts, In: Interspeech 2015, 3214–3218.
Sang M, Zhao Y, Liu G, et al., 2023, Improving Transformer-based Networks with Locality for Automatic Speaker Verification, In: ICASSP 2023–2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1–5.
Cai D, Li M, 2024, Leveraging ASR Pretrained Conformers for Speaker Verification through Transfer Learning and Knowledge Distillation, IEEE/ACM Transactions on Audio, Speech, and Language Processing, https://doi.org/10.1109/TASLP.2024.3419426
Gulati A, Qin J, Chiu C, et al., 2020, Conformer: Convolution-Augmented Transformer for Speech Recognition, arXiv, https://doi.org/10.48550/arXiv.2005.08100
Zhang Y, Lv Z, Wu H, et al., 2022, MFA-Conformer: Multi-Scale Feature Aggregation Conformer for Automatic Speaker Verification, arXiv, https://doi.org/10.48550/arXiv.2203.15249
Li L, Wang D, Rozi A, et al., 2017, Cross-Lingual Speaker Verification with Deep Feature Learning, In: 2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 1040–1044.
Wu Y, Liao W, 2021, Toward Text-Independent Cross-Lingual Speaker Recognition using English-Mandarin-Taiwanese Dataset, In: 2020 25th International Conference on Pattern Recognition (ICPR), 8515–8522.
Thienpondt J, Desplanques B, Demuynck K, 2022, Tackling the Score Shift in Cross-Lingual Speaker Verification by Exploiting Language Information, In: ICASSP 2022–2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 7187–7191.
Schuessler A, 2024, Sino-Tibetan in Tibetan and Old Chinese, Language and Linguistics, 80–122.
Mokgonyane T, Sefara T, Manamela M, et al., 2019, The Effects of Data Size on Text-Independent Automatic Speaker Identification System, In: 2019 International Conference on Advances in Big Data, Computing and Data Communication Systems (icABCD), 1–6.
Wan L, Wang Q, Papir A, et al., 2018, Generalized End-to-End Loss for Speaker Verification, In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 4879–4883.
Lyu H, Sha N, Qin S, et al., 2019, In: Advances in Neural Information Processing Systems, 32.