Multi-scale variation and occlusion make accurate pedestrian detection on traffic roads highly challenging. This paper proposes an improved pedestrian detection method, Multi Scales Attention-YOLOv5x (MSA-YOLOv5x), based on the YOLOv5x framework. First, the first convolutional operation of the backbone network is replaced with the Focus module, which expands the number of image input channels to enhance feature expressiveness. Second, a C3_CBAM module is constructed to replace the original C3 module for better feature fusion, so that the network can learn more multi-scale features and more features of occluded pedestrian targets through channel attention and spatial attention. In addition, a new feature pyramid detection layer and a new detection channel are embedded in the feature fusion part to improve multi-scale pedestrian detection accuracy. Experimental results on a public dataset show that the proposed method achieves the highest detection accuracy among the compared baseline methods for traffic road pedestrian detection.
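The channel expansion performed by the Focus module can be illustrated with a minimal sketch. It assumes the slicing order used in the public YOLOv5 implementation (every second pixel taken at four row/column offsets, with the four sub-images stacked along the channel axis) and omits the convolution that follows the slicing. The function name `focus_slice` and the nested-list image layout are illustrative choices, not taken from the paper.

```python
def focus_slice(image):
    """Space-to-depth slicing on a (C, H, W) image stored as nested lists.

    Returns a (4C, H/2, W/2) nested list. This covers only the slicing
    step of a Focus-style layer (assumed ordering); the convolution that
    normally follows is left out.
    """
    height, width = len(image[0]), len(image[0][0])
    if height % 2 or width % 2:
        raise ValueError("height and width must both be even")

    sliced = []
    # Offsets as (row_start, col_start): even/even, odd/even, even/odd, odd/odd.
    for row_start, col_start in ((0, 0), (1, 0), (0, 1), (1, 1)):
        for channel in image:
            # Keep every second row and every second column from the offset.
            sliced.append([row[col_start::2] for row in channel[row_start::2]])
    return sliced


# A single-channel 4x4 image with pixel values 0..15.
image = [[[0, 1, 2, 3],
          [4, 5, 6, 7],
          [8, 9, 10, 11],
          [12, 13, 14, 15]]]

sliced = focus_slice(image)
# 1 channel of 4x4 becomes 4 channels of 2x2; no pixel value is discarded.
print(len(sliced), len(sliced[0]), len(sliced[0][0]))  # prints "4 2 2"
```

Under these assumptions, the spatial resolution is halved while the channel count is quadrupled, so a 3-channel RGB input would yield 12 channels before the first convolution.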