GazeCLIP: Towards Enhancing Gaze Estimation via Text Guidance

Wang, Jun; Ruan, Hao; Wang, Mingjie; Zhang, Chuanghui; Li, Huachun; Zhou, Jun

arXiv:2401.00260 (cs)

[Submitted on 30 Dec 2023 (v1), last revised 26 Apr 2024 (this version, v3)]

Abstract: Over the past decade, visual gaze estimation has attracted increasing attention within the research community, owing to its wide range of applications. Although existing approaches have substantially improved prediction accuracy, they primarily infer gaze from single-image signals, neglecting the potential benefits of the currently prevalent text guidance. Notably, vision-language collaboration has been extensively explored across various visual tasks, such as image synthesis and manipulation, leveraging the strong transferability of the large-scale Contrastive Language-Image Pre-training (CLIP) model. Nevertheless, existing gaze estimation approaches overlook the rich semantic cues conveyed by linguistic signals and the priors embedded in the CLIP feature space, which limits their performance. To address this gap, we investigate the text-eye collaboration protocol and introduce a novel gaze estimation framework named GazeCLIP. Specifically, we design a linguistic description generator that produces text signals with coarse directional cues. We also present a CLIP-based backbone that characterizes text-eye pairs for gaze estimation, followed by a fine-grained multi-modal fusion module that models the interrelationships between the heterogeneous inputs. Extensive experiments on three challenging datasets demonstrate the superiority of the proposed GazeCLIP, which achieves state-of-the-art accuracy.
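
To make the idea of "text signals with coarse directional cues" concrete, the following is a minimal illustrative sketch and not the authors' implementation. The abstract does not say how the direction is obtained or how the prompt is worded, so everything here is an assumption: the function names `coarse_direction` and `describe_gaze`, the 10-degree threshold, the sign convention (positive pitch is up, positive yaw is right), and the prompt template are all hypothetical.

```python
# Hypothetical sketch: turn a rough (pitch, yaw) estimate into a coarse text prompt.
# Names, threshold, sign convention, and template are assumptions, not from the paper.

def coarse_direction(pitch_deg: float, yaw_deg: float, threshold_deg: float = 10.0) -> str:
    """Bin a gaze angle into one of nine coarse direction labels."""
    # Assumed convention: positive pitch looks up, positive yaw looks right.
    if pitch_deg > threshold_deg:
        vertical = "up"
    elif pitch_deg < -threshold_deg:
        vertical = "down"
    else:
        vertical = ""

    if yaw_deg > threshold_deg:
        horizontal = "right"
    elif yaw_deg < -threshold_deg:
        horizontal = "left"
    else:
        horizontal = ""

    parts = [p for p in (vertical, horizontal) if p]
    # Angles inside the threshold on both axes are treated as a frontal gaze.
    return "-".join(parts) if parts else "straight ahead"


def describe_gaze(pitch_deg: float, yaw_deg: float, threshold_deg: float = 10.0) -> str:
    """Build a short sentence that a text encoder (e.g. CLIP's) could consume."""
    direction = coarse_direction(pitch_deg, yaw_deg, threshold_deg)
    return f"A photo of a face looking {direction}."


if __name__ == "__main__":
    print(describe_gaze(20.0, -30.0))
    print(describe_gaze(0.0, 0.0))
```

In a pipeline of the kind the abstract outlines, a sentence like this would be encoded by the text branch and fused with the image features; the fusion step itself is not sketched here because the abstract gives no details about it.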

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2401.00260 [cs.CV]
	(or arXiv:2401.00260v3 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2401.00260

Submission history

From: Hao Ruan
[v1] Sat, 30 Dec 2023 15:24:50 UTC (552 KB)
[v2] Sun, 7 Jan 2024 04:17:20 UTC (518 KB)
[v3] Fri, 26 Apr 2024 03:59:41 UTC (513 KB)