MFA-Conformer models are widely used in English and Chinese speaker recognition. Although speaker recognition is theoretically language-independent, performance in practice depends on the training language, and Tibetan speaker recognition still relies on traditional models with poor performance. To address this, we adopt the MFA-Conformer as the base framework and propose three improvements: integrating 1D depth-wise separable convolution and channel attention into the Conformer feed-forward module, fusing features across multiple Conformer blocks, and adding an intra-class correlation regularizer to the GE2E loss. Experiments show that the improved model reduces the equal error rate (EER) compared with the Conformer baseline.
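The loss modification described above can be sketched as follows. This is a minimal NumPy illustration of the standard GE2E softmax loss (Wan et al., 2018) combined with a simple intra-class spread penalty; the penalty term, the weight `lam`, and the function name are hypothetical stand-ins, since the paper's exact intra-class correlation regularizer is not specified in the abstract.

```python
import numpy as np

def ge2e_loss_with_icc(embeddings, w=10.0, b=-5.0, lam=0.1):
    """GE2E softmax loss over a batch of shape (N speakers, M utterances, D),
    plus a hypothetical intra-class spread penalty weighted by lam."""
    N, M, D = embeddings.shape
    e = embeddings / np.linalg.norm(embeddings, axis=-1, keepdims=True)
    centroids = e.mean(axis=1)                               # (N, D)
    # Exclusive centroids: leave out utterance i when scoring its own speaker,
    # as in the original GE2E formulation.
    excl = (e.sum(axis=1, keepdims=True) - e) / (M - 1)      # (N, M, D)
    excl /= np.linalg.norm(excl, axis=-1, keepdims=True)
    c = centroids / np.linalg.norm(centroids, axis=-1, keepdims=True)
    # Scaled cosine similarity of every utterance vs. every speaker centroid.
    sim = w * np.einsum('nmd,kd->nmk', e, c) + b             # (N, M, N)
    own = w * np.einsum('nmd,nmd->nm', e, excl) + b          # (N, M)
    idx = np.arange(N)
    sim[idx, :, idx] = own        # replace same-speaker entries with exclusive scores
    # Softmax cross-entropy: each utterance should match its own speaker.
    log_probs = sim - np.log(np.exp(sim).sum(axis=-1, keepdims=True))
    ge2e = -log_probs[idx, :, idx].mean()
    # Hypothetical intra-class term: penalize spread around each centroid.
    icc = ((e - centroids[:, None, :]) ** 2).sum(axis=-1).mean()
    return ge2e + lam * icc
```

In practice such a regularizer pulls each speaker's utterance embeddings toward their centroid, complementing the GE2E term, which mainly pushes embeddings away from other speakers' centroids.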
Boles A, Rad P, 2017, Voice Biometrics: Deep Learning-based Voiceprint Authentication System, In: 2017 12th System of Systems Engineering Conference (SoSE), 1–6.
Chowdhury F, Wang Q, Moreno I, 2018, Attention-based Models for Text-Dependent Speaker Verification, In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5359–5363.
Peddinti V, Povey D, Khudanpur S, 2015, A Time Delay Neural Network Architecture for Efficient Modeling of Long Temporal Contexts, In: Interspeech 2015, 3214–3218.
Sang M, Zhao Y, Liu G, et al., 2023, Improving Transformer-based Networks with Locality for Automatic Speaker Verification, In: ICASSP 2023–2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1–5.
Cai D, Li M, 2024, Leveraging ASR Pretrained Conformers for Speaker Verification through Transfer Learning and Knowledge Distillation, IEEE/ACM Transactions on Audio, Speech, and Language Processing, https://doi.org/10.1109/TASLP.2024.3419426
Gulati A, Qin J, Chiu C, et al., 2020, Conformer: Convolution-Augmented Transformer for Speech Recognition, arXiv, https://doi.org/10.48550/arXiv.2005.08100
Zhang Y, Lv Z, Wu H, et al., 2022, MFA-Conformer: Multi-Scale Feature Aggregation Conformer for Automatic Speaker Verification, arXiv, https://doi.org/10.48550/arXiv.2203.15249
Li L, Wang D, Rozi A, et al., 2017, Cross-Lingual Speaker Verification with Deep Feature Learning, In: 2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 1040–1044.
Wu Y, Liao W, 2021, Toward Text-Independent Cross-Lingual Speaker Recognition using English-Mandarin-Taiwanese Dataset, In: 2020 25th International Conference on Pattern Recognition (ICPR), 8515–8522.
Thienpondt J, Desplanques B, Demuynck K, 2022, Tackling the Score Shift in Cross-Lingual Speaker Verification by Exploiting Language Information, In: ICASSP 2022–2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 7187–7191.
Schuessler A, 2024, Sino-Tibetan in Tibetan and Old Chinese, Language and Linguistics, 80–122.
Mokgonyane T, Sefara T, Manamela M, et al., 2019, The Effects of Data Size on Text-Independent Automatic Speaker Identification System, In: 2019 International Conference on Advances in Big Data, Computing and Data Communication Systems (icABCD), 1–6.
Wan L, Wang Q, Papir A, et al., 2018, Generalized End-to-End Loss for Speaker Verification, In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 4879–4883.
Lyu H, Sha N, Qin S, et al., 2019, In: Advances in Neural Information Processing Systems, 32.