Real-Time Sound Source Localization Method Based on Selective SRP-PHAT and Vision Fusion

Keywords

Sound source localization
SRP-PHAT
Audio-visual fusion
Real-time processing
Microphone array

DOI

10.26689/jera.v9i4.11459

Submitted: 2025-07-08
Accepted: 2025-07-23
Published: 2025-08-07

Abstract

The traditional SRP-PHAT sound source localization method performs a dense search over the full 360-degree space, which leads to high computational complexity and makes real-time operation difficult. To address this problem, a high-precision sound source localization method is proposed that combines a selective SRP-PHAT algorithm with real-time visual analysis. Its core innovations are: (1) visually guided selective scanning, in which face detection dynamically determines the scanning angle range; (2) a sound source classification mechanism that distinguishes face sound sources from background noise; and (3) intelligent background orientation selection, which ensures comprehensive monitoring of environmental noise. Experimental results show that, in complex real environments, the method achieves a localization accuracy of ±5 degrees and a processing speed of more than 10 FPS, significantly outperforming the traditional full-angle scanning method.
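
The idea of visually guided selective scanning can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the pinhole camera model, the far-field planar array geometry, the 5-degree margin, and the assumption that the camera and the microphone array share one azimuth reference and sign convention are all hypothetical choices made for illustration. A face bounding box is first mapped to an azimuth interval, and the SRP-PHAT power is then evaluated only at candidate angles inside that interval instead of over the full 360 degrees.

```python
import numpy as np


def face_to_azimuth_range(x_left, x_right, image_width, hfov_deg, margin_deg=5.0):
    """Map a face box (pixel columns) to an azimuth interval in degrees.

    Assumes an ideal pinhole camera whose optical axis is the 0-degree
    azimuth of the microphone array (hypothetical calibration).
    """
    focal_px = (image_width / 2.0) / np.tan(np.radians(hfov_deg) / 2.0)
    cx = image_width / 2.0
    az_left = np.degrees(np.arctan((x_left - cx) / focal_px))
    az_right = np.degrees(np.arctan((x_right - cx) / focal_px))
    return az_left - margin_deg, az_right + margin_deg


def srp_phat_power(frames, mic_xy, fs, azimuths_deg, c=343.0):
    """Far-field SRP-PHAT power for each candidate azimuth.

    frames: (n_mics, n_samples) time-domain frame, one row per microphone.
    mic_xy: (n_mics, 2) microphone positions in metres.
    """
    n_mics, n_samples = frames.shape
    spectra = np.fft.rfft(frames, axis=1)
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)

    az = np.radians(np.asarray(azimuths_deg, dtype=float))
    directions = np.stack([np.cos(az), np.sin(az)], axis=1)  # (n_az, 2)
    # Arrival delay of each microphone relative to the array origin.
    delays = -(directions @ mic_xy.T) / c  # (n_az, n_mics)

    power = np.zeros(len(az))
    for i in range(n_mics):
        for j in range(i + 1, n_mics):
            cross = spectra[i] * np.conj(spectra[j])
            cross = cross / (np.abs(cross) + 1e-12)  # PHAT weighting
            tdoa = delays[:, i] - delays[:, j]  # (n_az,)
            # Phase term that compensates the hypothesised TDOA.
            steer = np.exp(2j * np.pi * freqs[None, :] * tdoa[:, None])
            power += np.real(steer @ cross)
    return power


def selective_scan(frames, mic_xy, fs, az_range_deg, step_deg=1.0):
    """Scan only the azimuth interval suggested by the face detector."""
    lo, hi = az_range_deg
    candidates = np.arange(lo, hi + 0.5 * step_deg, step_deg)
    power = srp_phat_power(frames, mic_xy, fs, candidates)
    return candidates[int(np.argmax(power))], power
```

With a 1-degree step, scanning a 20-degree interval around a detected face evaluates 21 candidate directions rather than 360, which is where the computational saving of a selective scan comes from in this sketch.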
