Unified Image and Video Saliency Modeling

Droste, Richard; Jiao, Jianbo; Noble, J. Alison

doi:10.1007/978-3-030-58558-7_25

arXiv:2003.05477 (cs)

[Submitted on 11 Mar 2020 (v1), last revised 7 Nov 2020 (this version, v3)]

Abstract: Visual saliency modeling for images and videos is treated as two independent tasks in recent computer vision literature. While image saliency modeling is a well-studied problem and progress on benchmarks like SALICON and MIT300 is slowing, video saliency models have shown rapid gains on the recent DHF1K benchmark. Here, we take a step back and ask: Can image and video saliency modeling be approached via a unified model, with mutual benefit? We identify different sources of domain shift between image and video saliency data and between different video saliency datasets as a key challenge for effective joint modeling. To address this, we propose four novel domain adaptation techniques - Domain-Adaptive Priors, Domain-Adaptive Fusion, Domain-Adaptive Smoothing and Bypass-RNN - in addition to an improved formulation of learned Gaussian priors. We integrate these techniques into a simple and lightweight encoder-RNN-decoder-style network, UNISAL, and train it jointly with image and video saliency data. We evaluate our method on the video saliency datasets DHF1K, Hollywood-2 and UCF-Sports, and the image saliency datasets SALICON and MIT300. With one set of parameters, UNISAL achieves state-of-the-art performance on all video saliency datasets and is on par with the state-of-the-art for image saliency datasets, despite faster runtime and a 5 to 20-fold smaller model size compared to all competing deep methods. We provide retrospective analyses and ablation studies which confirm the importance of the domain shift modeling. The code is available at this https URL

Comments:	Presented at the European Conference on Computer Vision (ECCV) 2020. R. Droste and J. Jiao contributed equally to this work. v3: Updated Fig. 5a) and added new MIT300 benchmark results to supp. material
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Cite as:	arXiv:2003.05477 [cs.CV]
	(or arXiv:2003.05477v3 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2003.05477
Journal reference:	In: ECCV 2020, Springer LNCS 12350, pp. 419-435
Related DOI:	https://doi.org/10.1007/978-3-030-58558-7_25

Submission history

From: Richard Droste [view email]
[v1] Wed, 11 Mar 2020 18:28:29 UTC (4,995 KB)
[v2] Sat, 18 Jul 2020 00:48:35 UTC (8,944 KB)
[v3] Sat, 7 Nov 2020 13:43:34 UTC (8,940 KB)
