Retrieving Texts based on Abstract Descriptions

Ravfogel, Shauli; Pyatkin, Valentina; Cohen, Amir DN; Manevich, Avshalom; Goldberg, Yoav


arXiv:2305.12517v2 (cs)

[Submitted on 21 May 2023 (v1), revised 22 Oct 2023 (this version, v2), latest version 26 Apr 2024 (v4)]

Title: Retrieving Texts based on Abstract Descriptions

Authors: Shauli Ravfogel, Valentina Pyatkin, Amir DN Cohen, Avshalom Manevich, Yoav Goldberg


Abstract: While instruction-tuned Large Language Models (LLMs) excel at extracting information from text, they are not suitable for locating texts that conform to a given description in a large document collection (semantic retrieval). Similarity search over embedding vectors does allow retrieval by query, but the similarity reflected in the embedding is ill-defined and inconsistent, and is sub-optimal for many use cases. What, then, is a good query representation for effective retrieval?
We identify the well-defined and consistent task of retrieving sentences based on abstract descriptions of their content. We demonstrate the inadequacy of current text embeddings and propose an alternative model that performs significantly better when used in standard nearest neighbor search. The model is trained using positive and negative pairs sourced by prompting an LLM. While it is easy to source the training material from an LLM, the retrieval task cannot be performed by the LLM directly. This demonstrates that data from LLMs can be used not only for distilling specialized models that are more efficient than the original LLM, but also for creating new capabilities not immediately possible using the original model.
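
The retrieval setting named in the abstract, standard nearest neighbor search over embedding vectors, can be illustrated with a minimal sketch. This is not the authors' implementation: the vectors below are hand-written toy values standing in for the outputs of an encoder, the function names `cosine_similarity` and `nearest_neighbors` are hypothetical, and cosine similarity is an assumed choice of similarity measure.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_u * norm_v)


def nearest_neighbors(query_vec, sentence_vecs, k=1):
    """Return the indices of the k sentence vectors most similar to the query."""
    scored = [
        (cosine_similarity(query_vec, vec), idx)
        for idx, vec in enumerate(sentence_vecs)
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]


# Toy vectors (assumed values) standing in for encoded sentences.
sentence_vecs = [
    [0.9, 0.1, 0.0],
    [0.1, 0.8, 0.2],
    [0.0, 0.2, 0.9],
]
# Toy vector (assumed values) standing in for an encoded abstract description.
description_vec = [0.2, 0.9, 0.1]

top = nearest_neighbors(description_vec, sentence_vecs, k=2)
print(top)
```

In a setup like the one the abstract describes, the description and the sentences would be encoded by a trained model, and the quality of the ranking would depend on how well that model places a description near the sentences it describes.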

Comments:	A preprint
Subjects:	Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Cite as:	arXiv:2305.12517 [cs.CL]
	(or arXiv:2305.12517v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2305.12517

Submission history

From: Shauli Ravfogel
[v1] Sun, 21 May 2023 17:14:31 UTC (7,351 KB)
[v2] Sun, 22 Oct 2023 17:38:42 UTC (2,396 KB)
[v3] Thu, 25 Apr 2024 08:30:17 UTC (1,603 KB)
[v4] Fri, 26 Apr 2024 08:04:59 UTC (1,603 KB)
