Correction of Errors in Preference Ratings from Automated Metrics for Text Generation

Deriu, Jan; von Däniken, Pius; Tuggener, Don; Cieliebak, Mark

arXiv:2306.03866 (cs)

[Submitted on 6 Jun 2023]

Abstract: A major challenge in the field of Text Generation is evaluation: human evaluations are cost-intensive, and automated metrics often display considerable disagreement with human judgments. In this paper, we propose a statistical model of Text Generation evaluation that accounts for the error-proneness of automated metrics when used to generate preference rankings between system outputs. We show that existing automated metrics are generally over-confident in assigning significant differences between systems in this setting. However, our model enables an efficient combination of human and automated ratings to remedy the error-proneness of the automated metrics. We show that, using this combination, we require only about 50% of the human annotations typically used in evaluations to arrive at robust and statistically significant results, while yielding the same evaluation outcome as the pure human evaluation in 95% of cases. We showcase the benefits of our approach for three text generation tasks: dialogue systems, machine translation, and text summarization.

Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2306.03866 [cs.CL]
	(or arXiv:2306.03866v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2306.03866

Submission history

From: Jan Deriu
[v1] Tue, 6 Jun 2023 17:09:29 UTC (387 KB)
