Enhancing Indoor Object Detection with xLSTM Attention-Driven YOLOv9 for Improved 2D-Driven 3D Object Detection

Yu He; Chengpeng Jin; Xuesong Zhang

doi:10.26689/jera.v9i2.9698

Keywords

Deep learning
Object detection
Attention mechanism

Submitted: 2025-01-26

Accepted: 2025-02-10

Published: 2025-02-25

Abstract

Three-dimensional (3D) object detection is crucial for applications such as robotic control and autonomous driving. High-precision sensors such as LiDAR are expensive, whereas RGB-D sensors (e.g., Kinect) offer a cost-effective alternative, especially in indoor environments. However, RGB-D sensors remain limited in accuracy and depth perception. This paper proposes an enhanced method that integrates attention-driven YOLOv9 with xLSTM into the F-ConvNet framework. By improving the precision of the 2D bounding boxes from which 3D detections are generated, the method addresses the complex structures and occlusions typical of indoor environments. By combining RGB images with depth data, the proposed approach improves the accuracy and robustness of indoor 3D object detection.
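
To illustrate why 2D box precision matters in a 2D-driven pipeline, the following minimal sketch back-projects the depth pixels inside a 2D bounding box into a 3D frustum point set using a pinhole camera model. It is not the authors' implementation: the function name `frustum_points`, the half-open box convention, the toy depth map, and the intrinsics `fx`, `fy`, `cx`, `cy` are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def frustum_points(
    depth: Sequence[Sequence[float]],
    box: Tuple[int, int, int, int],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Point]:
    """Back-project depth pixels inside a 2D box to 3D camera coordinates.

    `box` is (u_min, v_min, u_max, v_max) in pixels, half-open on the
    max side (a hypothetical convention chosen for this sketch).
    Pixels with non-positive depth are treated as missing and skipped.
    """
    u_min, v_min, u_max, v_max = box
    points: List[Point] = []
    for v in range(max(v_min, 0), min(v_max, len(depth))):
        row = depth[v]
        for u in range(max(u_min, 0), min(u_max, len(row))):
            z = row[u]
            if z <= 0:
                continue  # no valid depth reading at this pixel
            # Pinhole model: pixel (u, v) at depth z maps to (x, y, z).
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# Toy 4x4 depth map in metres; 0.0 marks a missing reading (assumed data).
depth = [
    [0.0, 2.0, 2.0, 0.0],
    [2.0, 1.0, 1.0, 2.0],
    [2.0, 1.0, 0.0, 2.0],
    [0.0, 2.0, 2.0, 0.0],
]
pts = frustum_points(depth, box=(1, 1, 3, 3), fx=2.0, fy=2.0, cx=2.0, cy=2.0)
loose = frustum_points(depth, box=(0, 0, 4, 4), fx=2.0, fy=2.0, cx=2.0, cy=2.0)
```

In this toy case the tight box yields only the near-surface points, while the loose box also pulls in the farther background pixels. That is the kind of clutter a more precise 2D detector would keep out of the frustum passed to the 3D stage.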

References

Hu Y, Yang J, Chen L, et al., 2023, Planning-Oriented Autonomous Driving. Proceedings of the IEEE/CVF Conference on CVPR, 2023: 17853–17862.

Yang H, Zhang S, Huang D, et al., 2024, UniPAD: A Universal Pre-Training Paradigm for Autonomous Driving. Proceedings of the IEEE/CVF Conference on CVPR, 2024: 15238–15250.

Min C, Zhao D, Xiao L, et al., 2024, DriveWorld: 4D Pre-Trained Scene Understanding via World Models for Autonomous Driving. Proceedings of the IEEE/CVF Conference on CVPR, 2024: 15522–15533.

Wang Z, Jia K, 2019, Frustum ConvNet: Sliding Frustums to Aggregate Local Point-Wise Features for Amodal 3D Object Detection. 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2019: 1742–1749.

Wang CY, Yeh IH, Liao HYM, 2024, YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. Computer Vision – ECCV 2024, 15089: 1–21.

Beck M, Pöppel K, Spanring M, et al., 2024, xLSTM: Extended Long Short-Term Memory. arXiv:2405.04517v1, 2024: 1–56.

Song S, Lichtenberg SP, Xiao J, 2015, SUN RGB-D: A RGB-D Scene Understanding Benchmark Suite. Proceedings of the IEEE Conference on CVPR. Boston, Massachusetts, 567–576.

Song S, Xiao J, 2016, Deep Sliding Shapes for Amodal 3D Object Detection in RGB-D Images. Proceedings of the IEEE Conference on CVPR, Las Vegas, Nevada, 808–816.

Ren Z, Sudderth EB, 2016, Three-Dimensional Object Detection and Layout Prediction Using Clouds of Oriented Gradients. Proceedings of the IEEE Conference on CVPR, Las Vegas, Nevada, 1525–1533.

Lahoud J, Ghanem B, 2017, 2D-Driven 3D Object Detection in RGB-D Images. Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 4622–4630.

Xu D, Anguelov D, Jain A, 2018, PointFusion: Deep Sensor Fusion for 3D Bounding Box Estimation. Proceedings of the IEEE Conference on CVPR, Salt Lake City, Utah, 244–253.

Ren Z, Sudderth EB, 2018, 3D Object Detection with Latent Support Surfaces. Proceedings of the IEEE Conference on CVPR, Salt Lake City, Utah, 937–946.

Qi CR, Liu W, Wu C, et al., 2018, Frustum PointNets for 3D Object Detection from RGB-D Data. Proceedings of the IEEE Conference on CVPR, Salt Lake City, Utah, 918–927.