Unsupervised Learning of Visual Representations using Videos

Wang, Xiaolong; Gupta, Abhinav

arXiv:1505.00687 (cs)

[Submitted on 4 May 2015 (v1), last revised 6 Oct 2015 (this version, v2)]

Abstract: Is strong supervision necessary for learning a good visual representation? Do we really need millions of semantically-labeled images to train a Convolutional Neural Network (CNN)? In this paper, we present a simple yet surprisingly powerful approach for unsupervised learning of CNN. Specifically, we use hundreds of thousands of unlabeled videos from the web to learn visual representations. Our key idea is that visual tracking provides the supervision. That is, two patches connected by a track should have similar visual representation in deep feature space since they probably belong to the same object or object part. We design a Siamese-triplet network with a ranking loss function to train this CNN representation. Without using a single image from ImageNet, just using 100K unlabeled videos and the VOC 2012 dataset, we train an ensemble of unsupervised networks that achieves 52% mAP (no bounding box regression). This performance comes tantalizingly close to its ImageNet-supervised counterpart, an ensemble which achieves a mAP of 54.4%. We also show that our unsupervised network can perform competitively in other tasks such as surface-normal estimation.
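
The ranking idea in the abstract (two patches linked by a track should lie closer in feature space than an unrelated patch) can be illustrated with a small sketch. The abstract does not state the distance measure or the margin, so the cosine distance, the hinge form, the `margin=0.5` default, and the function names below are all assumptions made for illustration, not the authors' implementation. Feature vectors are plain Python lists standing in for CNN outputs.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def triplet_ranking_loss(anchor, positive, negative, margin=0.5):
    """Hinge-style ranking loss over one triplet (assumed form).

    anchor, positive: features of two patches connected by a track.
    negative: features of an unrelated patch.
    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is.
    """
    d_pos = cosine_distance(anchor, positive)
    d_neg = cosine_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Hypothetical 2-D features: the tracked pair is aligned, the negative is orthogonal.
well_ranked = triplet_ranking_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
badly_ranked = triplet_ranking_loss([1.0, 0.0], [0.0, 1.0], [1.0, 0.0])
```

In this toy case a correctly ordered triplet incurs no penalty, while swapping the positive and negative produces a positive loss that training would push down. In a Siamese-triplet setup the three inputs would pass through networks that share weights, so the loss shapes a single representation.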

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:1505.00687 [cs.CV]
	(or arXiv:1505.00687v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.1505.00687

Submission history

From: Xiaolong Wang
[v1] Mon, 4 May 2015 15:50:53 UTC (3,121 KB)
[v2] Tue, 6 Oct 2015 17:05:49 UTC (3,132 KB)
