Research on Self-Supervised Comparative Learning for Computer Vision

Keywords

Self-supervised learning
Comparative learning
Conceptual analysis framework
Computer vision field
Performance analysis

DOI

10.26689/jera.v5i3.2320

Submitted : 2021-07-18
Accepted : 2021-08-02
Published : 2021-08-17

Abstract

In recent years, self-supervised learning, which does not require large numbers of manual labels, has generated supervisory signals from the data itself to learn representations of samples. Self-supervised learning solves the problem of learning semantic features from unlabeled data and enables the pre-training of models on large datasets. Its significant advantages have been extensively studied by scholars in recent years. Self-supervised learning methods usually fall into three types: generative, contrastive, and generative-contrastive. The models used in contrastive (comparative) learning are relatively simple, and their performance on current downstream tasks is comparable to that of supervised learning methods. We therefore propose a conceptual analysis framework comprising five stages: the data augmentation pipeline, architectures, pretext tasks, comparison methods, and semi-supervised fine-tuning. Based on this framework, we qualitatively analyze existing contrastive self-supervised learning methods for computer vision, further analyze their performance at different stages, and finally summarize the research status of self-supervised contrastive learning methods in other fields.
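The contrastive mechanism the abstract describes, generating a supervisory signal from the data itself by pulling two augmented views of the same image together and pushing other images apart, can be sketched as a loss function. The following is a minimal NumPy sketch of an NT-Xent-style (normalized temperature-scaled cross-entropy) objective as used by SimCLR-like methods; the function name, batch setup, and temperature value are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def nt_xent_loss(z_a, z_b, temperature=0.5):
    """NT-Xent-style contrastive loss (illustrative sketch).

    z_a, z_b: (N, d) embeddings of two augmented views of the same N images.
    Positive pairs are (z_a[i], z_b[i]); the remaining 2N - 2 embeddings
    in the batch serve as negatives.
    """
    z = np.concatenate([z_a, z_b], axis=0)            # (2N, d)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # L2-normalize rows
    sim = z @ z.T / temperature                       # scaled cosine similarities
    np.fill_diagonal(sim, -np.inf)                    # exclude self-similarity
    n = z_a.shape[0]
    # index of each sample's positive partner: i <-> i + n
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    # cross-entropy over the similarity rows: -log softmax(sim)[i, pos[i]]
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    loss = -(sim[np.arange(2 * n), pos] - logsumexp)
    return loss.mean()
```

When the two views are embeddings of the same underlying images, positive similarities dominate and the loss is low; for unrelated views the loss approaches that of a uniform softmax, which is the signal that drives representation learning without manual labels.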
