Building Wavelet Histograms on Large Data in MapReduce

Jestes, Jeffrey; Yi, Ke; Li, Feifei

arXiv:1110.6649 (cs)

[Submitted on 30 Oct 2011]

Abstract: MapReduce is becoming the de facto framework for storing and processing massive data, due to its excellent scalability, reliability, and elasticity. In many MapReduce applications, obtaining a compact, accurate summary of the data is essential. Among the various data summarization tools, histograms have proven to be particularly important and useful, and the wavelet histogram is one of the most widely used. In this paper, we investigate the problem of building wavelet histograms efficiently on large datasets in MapReduce. We measure the efficiency of the algorithms by both end-to-end running time and communication cost. We demonstrate that straightforward adaptations of existing exact and approximate methods for building wavelet histograms to MapReduce clusters are highly inefficient. To address this, we design new algorithms for computing exact and approximate wavelet histograms and discuss their implementation in MapReduce. We illustrate our techniques in Hadoop and compare them against baseline solutions in extensive experiments on a heterogeneous Hadoop cluster of 16 nodes, using large real and synthetic datasets of up to hundreds of gigabytes. The results suggest significant (often orders of magnitude) performance improvements achieved by our new algorithms.
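For readers unfamiliar with the term, the following is a minimal, single-machine sketch of what a wavelet histogram is: the frequency vector of the keys is transformed with an orthonormal Haar wavelet transform, and only the k coefficients of largest magnitude are kept as the summary. This is not the paper's MapReduce algorithm; it only illustrates the centralized object those algorithms compute. The function names (`haar_transform`, `wavelet_histogram`, `reconstruct`), the integer key domain, and the power-of-two domain size are assumptions made for this illustration.

```python
import math

SQRT2 = math.sqrt(2.0)

def haar_transform(v):
    """Orthonormal Haar transform; len(v) is assumed to be a power of two."""
    n = len(v)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = [float(x) for x in v]
    length = n
    while length > 1:
        half = length // 2
        # Scaled pairwise sums go to the front, scaled differences behind them.
        avg = [(out[2 * i] + out[2 * i + 1]) / SQRT2 for i in range(half)]
        diff = [(out[2 * i] - out[2 * i + 1]) / SQRT2 for i in range(half)]
        out[:length] = avg + diff
        length = half
    return out

def inverse_haar(coeffs):
    """Inverse of haar_transform."""
    n = len(coeffs)
    out = list(coeffs)
    length = 2
    while length <= n:
        half = length // 2
        rec = [0.0] * length
        for i in range(half):
            a, d = out[i], out[half + i]
            rec[2 * i] = (a + d) / SQRT2
            rec[2 * i + 1] = (a - d) / SQRT2
        out[:length] = rec
        length *= 2
    return out

def wavelet_histogram(keys, domain_size, k):
    """Return {coefficient index: value} for the k largest-magnitude coefficients.

    keys are assumed to be integers in [0, domain_size).
    """
    freq = [0] * domain_size
    for key in keys:
        freq[key] += 1
    coeffs = haar_transform(freq)
    top = sorted(range(domain_size), key=lambda i: abs(coeffs[i]), reverse=True)[:k]
    return {i: coeffs[i] for i in top}

def reconstruct(hist, domain_size):
    """Approximate the frequency vector from the retained coefficients."""
    coeffs = [0.0] * domain_size
    for i, value in hist.items():
        coeffs[i] = value
    return inverse_haar(coeffs)

if __name__ == "__main__":
    data = [0, 3, 3, 5, 7, 7, 7]
    hist = wavelet_histogram(data, domain_size=8, k=3)
    print(hist)
    print(reconstruct(hist, 8))
```

Because the transform in this sketch is orthonormal, keeping the largest-magnitude coefficients minimizes the squared reconstruction error among all choices of k coefficients, which is why the selection is by absolute value. On a cluster, the difficulty the abstract refers to is that each node sees only part of the data, so the globally largest coefficients cannot be picked from local information alone without communication.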

Comments:	VLDB2012
Subjects:	Databases (cs.DB)
Cite as:	arXiv:1110.6649 [cs.DB]
	(or arXiv:1110.6649v1 [cs.DB] for this version)
	https://doi.org/10.48550/arXiv.1110.6649
Journal reference:	Proceedings of the VLDB Endowment (PVLDB), Vol. 5, No. 2, pp. 109-120 (2011)

Submission history

From: Feifei Li [via Ahmet Sacan as proxy]
[v1] Sun, 30 Oct 2011 20:21:30 UTC (352 KB)
