CleanML: A Study for Evaluating the Impact of Data Cleaning on ML Classification Tasks

Li, Peng; Rao, Xi; Blase, Jennifer; Zhang, Yue; Chu, Xu; Zhang, Ce

arXiv:1904.09483 (cs)

[Submitted on 20 Apr 2019 (v1), last revised 5 Apr 2021 (this version, v3)]

Abstract: Data quality affects machine learning (ML) model performance, and data scientists spend a considerable amount of time on data cleaning before model training. However, to date, there is no rigorous study of how exactly cleaning affects ML: the ML community usually focuses on developing ML algorithms that are robust to particular noise types of certain distributions, while the database (DB) community has mostly studied the problem of data cleaning alone, without considering how data is consumed by downstream ML analytics. We propose CleanML, a study that systematically investigates the impact of data cleaning on ML classification tasks. The open-source and extensible CleanML study currently includes 14 real-world datasets with real errors, five common error types, seven different ML models, and multiple cleaning algorithms for each error type (including both algorithms commonly used in practice and state-of-the-art solutions from the academic literature). We control the randomness in ML experiments using statistical hypothesis testing, and we control the false discovery rate in our experiments using the Benjamini-Yekutieli (BY) procedure. We analyze the results systematically to derive many interesting and nontrivial observations. We also put forward multiple research directions for researchers.
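
As an illustration of the false discovery rate control mentioned in the abstract, the following is a minimal sketch of the standard Benjamini-Yekutieli step-up procedure in plain Python. It is not taken from the CleanML code base; the function name `benjamini_yekutieli`, the default `alpha`, and the example p-values are assumptions chosen purely for illustration.

```python
def benjamini_yekutieli(p_values, alpha=0.05):
    """Return a list of booleans, True where the null hypothesis is rejected.

    Hypothetical helper: a textbook BY step-up procedure, which controls the
    false discovery rate under arbitrary dependence between the tests.
    """
    m = len(p_values)
    if m == 0:
        return []

    # Harmonic correction factor c(m) = 1 + 1/2 + ... + 1/m.
    c_m = sum(1.0 / i for i in range(1, m + 1))

    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k with p_(k) <= k * alpha / (m * c(m)).
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / (m * c_m):
            k_max = rank

    # Step-up rule: reject every hypothesis ranked at or below k_max.
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


# Illustrative p-values (made up, not results from the study).
example = [0.001, 0.04, 0.03, 0.5]
print(benjamini_yekutieli(example))  # [True, False, False, False]
```

In this sketch each p-value would come from one comparison of a cleaned against an uncleaned configuration. Because the procedure is step-up, a hypothesis can be rejected even when its own p-value misses its rank threshold, as long as a higher-ranked p-value meets its threshold.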

Comments:	published in ICDE 2021
Subjects:	Databases (cs.DB); Machine Learning (cs.LG)
Cite as:	arXiv:1904.09483 [cs.DB]
	(or arXiv:1904.09483v3 [cs.DB] for this version)
	https://doi.org/10.48550/arXiv.1904.09483

Submission history

From: Peng Li
[v1] Sat, 20 Apr 2019 19:12:03 UTC (310 KB)
[v2] Fri, 26 Apr 2019 00:17:24 UTC (310 KB)
[v3] Mon, 5 Apr 2021 23:35:41 UTC (790 KB)
