Multi-Modal Few-Shot Temporal Action Detection

Nag, Sauradip; Xu, Mengmeng; Zhu, Xiatian; Perez-Rua, Juan-Manuel; Ghanem, Bernard; Song, Yi-Zhe; Xiang, Tao


arXiv:2211.14905 (cs)

[Submitted on 27 Nov 2022 (v1), last revised 27 Mar 2023 (this version, v2)]

Abstract: Few-shot (FS) and zero-shot (ZS) learning are two different approaches for scaling temporal action detection (TAD) to new classes. The former adapts a pretrained vision model to a new task represented by as few as a single video per class, whilst the latter requires no training examples and instead exploits a semantic description of the new class. In this work, we introduce a new multi-modality few-shot (MMFS) TAD problem, which can be considered a marriage of FS-TAD and ZS-TAD that leverages few-shot support videos and new class names jointly. To tackle this problem, we further introduce a novel MUlti-modality PromPt mETa-learning (MUPPET) method. This is enabled by efficiently bridging pretrained vision and language models whilst maximally reusing already learned capacity. Concretely, we construct multi-modal prompts by mapping support videos into the textual token space of a vision-language model using a meta-learned, adapter-equipped visual semantics tokenizer. To tackle large intra-class variation, we further design a query feature regulation scheme. Extensive experiments on ActivityNet v1.3 and THUMOS14 demonstrate that MUPPET outperforms state-of-the-art alternative methods, often by a large margin. We also show that MUPPET can be easily extended to tackle the few-shot object detection problem, where it again achieves state-of-the-art performance on the MS-COCO dataset. The code will be available at this https URL
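
The multi-modal prompt construction described in the abstract can be pictured with a small toy sketch. This is not the authors' implementation: the abstract does not specify the tokenizer architecture, so the mean pooling over frames, the single linear `adapter_weight`, the toy dimensions, and all function names below are assumptions made purely for illustration. The sketch only shows the general idea of projecting each support video into the text token space and joining the result with class-name token embeddings.

```python
import random

def mean_pool(frames):
    """Average per-frame feature vectors into one video-level vector."""
    dim = len(frames[0])
    return [sum(f[i] for f in frames) / len(frames) for i in range(dim)]

def linear(vec, weight):
    """Project vec (length d_in) with weight (d_out rows of d_in values)."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]

def build_prompt(support_videos, class_name_tokens, adapter_weight):
    """Return one visual token per support video, followed by the class-name tokens.

    Hypothetical stand-in for a visual semantics tokenizer: pool, then project
    into the text token dimension so both modalities share one sequence.
    """
    visual_tokens = [linear(mean_pool(v), adapter_weight) for v in support_videos]
    return visual_tokens + class_name_tokens

random.seed(0)
VIS_DIM, TXT_DIM = 4, 3  # toy sizes, chosen only for the example

# Assumed adapter: a single TXT_DIM x VIS_DIM matrix (would be meta-learned in practice).
adapter_weight = [[random.uniform(-1, 1) for _ in range(VIS_DIM)] for _ in range(TXT_DIM)]

# Two support videos (2 shots), each with 5 frames of VIS_DIM features.
support_videos = [[[random.random() for _ in range(VIS_DIM)] for _ in range(5)]
                  for _ in range(2)]

# Two placeholder embeddings standing in for the tokens of a class name.
class_name_tokens = [[random.random() for _ in range(TXT_DIM)] for _ in range(2)]

prompt = build_prompt(support_videos, class_name_tokens, adapter_weight)
print(len(prompt), len(prompt[0]))
```

With two support videos and two class-name tokens, the resulting prompt holds four tokens, each of the text token dimension, which is the form a language-side encoder would consume.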

Comments:	Technical Report
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM)
Cite as:	arXiv:2211.14905 [cs.CV]
	(or arXiv:2211.14905v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2211.14905

Submission history

From: Sauradip Nag
[v1] Sun, 27 Nov 2022 18:13:05 UTC (29,912 KB)
[v2] Mon, 27 Mar 2023 08:39:13 UTC (36,272 KB)
