What's Left? Concept Grounding with Logic-Enhanced Foundation Models

Hsu, Joy; Mao, Jiayuan; Tenenbaum, Joshua B.; Wu, Jiajun

arXiv:2310.16035 (cs)

[Submitted on 24 Oct 2023]

Abstract: Recent works such as VisProg and ViperGPT have smartly composed foundation models for visual reasoning, using large language models (LLMs) to produce programs that can be executed by pre-trained vision-language models. However, they operate in limited domains, such as 2D images, not fully exploiting the generalization of language: abstract concepts like "left" can also be grounded in 3D, temporal, and action data, as in moving to your left. This limited generalization stems from these inference-only methods' inability to learn or adapt pre-trained models to a new domain. We propose the Logic-Enhanced Foundation Model (LEFT), a unified framework that learns to ground and reason with concepts across domains with a differentiable, domain-independent, first-order logic-based program executor. LEFT has an LLM interpreter that outputs a program represented in a general, logic-based reasoning language, which is shared across all domains and tasks. LEFT's executor then executes the program with trainable domain-specific grounding modules. We show that LEFT flexibly learns concepts in four domains: 2D images, 3D scenes, human motions, and robotic manipulation. It exhibits strong reasoning ability in a wide variety of tasks, including those that are complex and not seen during training, and can be easily applied to new domains.
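
As a rough illustration of what executing a first-order logic program over soft concept scores could look like, the following toy sketch evaluates the query "there exists an x such that red(x) and left(x, target)" on a hand-written scene. It is not the authors' implementation: the feature names (`x`, `redness`), the grounding functions `is_red` and `left_of`, and the choice of product for conjunction and max for the existential quantifier are all assumptions made for this example. In a trained system the grounding functions would be learned, domain-specific modules rather than fixed formulas.

```python
import math
from typing import Dict, List

# Toy scene: each object is a dict of hand-written features (hypothetical data).
Scene = List[Dict[str, float]]


def sigmoid(value: float) -> float:
    return 1.0 / (1.0 + math.exp(-value))


# Hypothetical grounding modules. Fixed formulas stand in for trainable
# networks so that the example runs on its own.
def is_red(obj: Dict[str, float]) -> float:
    """Soft score in (0, 1) for the unary concept 'red'."""
    return sigmoid(10.0 * (obj["redness"] - 0.5))


def left_of(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Soft score in (0, 1) for the binary relation 'a is left of b'."""
    return sigmoid(10.0 * (b["x"] - a["x"]))


# Assumed soft logic operators: product for AND, max for EXISTS.
# Both are (sub)differentiable in their inputs.
def soft_and(p: float, q: float) -> float:
    return p * q


def soft_exists(scores: List[float]) -> float:
    return max(scores, default=0.0)


def exists_red_left_of(scene: Scene, target_index: int) -> float:
    """Soft truth value of: exists x. red(x) and left(x, target)."""
    target = scene[target_index]
    scores = [
        soft_and(is_red(obj), left_of(obj, target))
        for index, obj in enumerate(scene)
        if index != target_index
    ]
    return soft_exists(scores)


if __name__ == "__main__":
    scene = [
        {"x": 0.0, "redness": 1.0},  # red object on the left
        {"x": 1.0, "redness": 0.0},  # non-red object on the right
    ]
    print(exists_red_left_of(scene, target_index=1))  # close to 1
    print(exists_red_left_of(scene, target_index=0))  # close to 0
```

Because the query logic only calls `is_red` and `left_of` through their scores, the same program could in principle be run in another domain by swapping in grounding functions for that domain's features.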

Comments:	NeurIPS 2023. First two authors contributed equally.
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
Cite as:	arXiv:2310.16035 [cs.CV]
	(or arXiv:2310.16035v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2310.16035

Submission history

From: Jiayuan Mao
[v1] Tue, 24 Oct 2023 17:50:20 UTC (2,029 KB)
