Human-interpretable clustering of short-text using large language models

Miller, Justin K.; Alexander, Tristram J.

arXiv:2405.07278 (cs)

[Submitted on 12 May 2024]

Abstract: Large language models have seen extraordinary growth in popularity due to their human-like content generation capabilities. We show that these models can also be used to successfully cluster human-generated content, with success defined through the measures of distinctiveness and interpretability. This success is validated by both human reviewers and ChatGPT, providing an automated means to close the 'validation gap' that has challenged short-text clustering. Comparing the machine and human approaches, we identify the biases inherent in each and question the reliance on human coding as the 'gold standard'. We apply our methodology to Twitter bios and find characteristic ways humans describe themselves, agreeing well with prior specialist work, but with interesting differences characteristic of the medium used to express identity.

Comments:	Main text: 18 pages, 8 figures. Supplementary: 21 pages, 15 figures, 3 tables
Subjects:	Computation and Language (cs.CL); Machine Learning (cs.LG)
ACM classes:	I.2.7
Cite as:	arXiv:2405.07278 [cs.CL]
	(or arXiv:2405.07278v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2405.07278

Submission history

From: Justin Miller
[v1] Sun, 12 May 2024 12:55:40 UTC (1,108 KB)
