Random Forests for Big Data

Genuer, Robin; Poggi, Jean-Michel; Tuleau-Malot, Christine; Villa-Vialaneix, Nathalie

Statistics > Machine Learning

arXiv:1511.08327v1 (stat)

[Submitted on 26 Nov 2015 (this version), latest version 22 Mar 2017 (v2)]

Authors: Robin Genuer (ISPED, SISTM), Jean-Michel Poggi (UPD5, LM-Orsay), Christine Tuleau-Malot (JAD), Nathalie Villa-Vialaneix (MIAT INRA)

Abstract: Big Data is one of the major challenges of statistical science and has numerous consequences from both algorithmic and theoretical viewpoints. Big Data always involves massive data, but it often also includes data streams and data heterogeneity. Recently, some statistical methods have been adapted to process Big Data, such as linear regression models, clustering methods and bootstrapping schemes. Based on decision trees combined with aggregation and bootstrap ideas, random forests were introduced by Breiman in 2001. They are a powerful nonparametric statistical method that handles regression problems, as well as two-class and multi-class classification problems, in a single and versatile framework. Focusing on classification problems, this paper reviews available proposals for random forests in parallel environments as well as for online random forests. We then formulate various remarks on random forests in the Big Data context. Finally, we experiment with three variants, involving subsampling, Big Data-bootstrap and MapReduce respectively, on two massive datasets (15 and 120 million observations): a simulated one and a real-world one.
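
To make two of the ideas named in the abstract concrete, here is a minimal sketch of a forest built on a subsample and a forest built chunk by chunk and then merged, in the spirit of a MapReduce scheme. It is an illustration under stated assumptions, not the authors' implementation: the trees are one-split stumps rather than full CART trees, the chunks are processed sequentially rather than on a cluster, and all function names and parameters (`fit_stump`, `subsampled_forest`, `chunked_forest`, `mtry`, `fraction`) are hypothetical.

```python
import random
from collections import Counter


def fit_stump(X, y, mtry, rng):
    """Fit a one-split tree using the best of `mtry` randomly chosen features."""
    majority = Counter(y).most_common(1)[0][0]
    # Fallback if no valid split exists: always predict the majority class.
    best = (len(y) + 1, 0, float("inf"), majority, majority)
    for j in rng.sample(range(len(X[0])), mtry):
        for t in sorted({row[j] for row in X}):
            left = Counter(lab for row, lab in zip(X, y) if row[j] <= t)
            right = Counter(lab for row, lab in zip(X, y) if row[j] > t)
            if not right:
                continue  # threshold at the maximum leaves the right side empty
            l_lab, l_cnt = left.most_common(1)[0]
            r_lab, r_cnt = right.most_common(1)[0]
            errors = len(y) - l_cnt - r_cnt
            if errors < best[0]:
                best = (errors, j, t, l_lab, r_lab)
    return best[1:]  # (feature, threshold, left label, right label)


def fit_forest(X, y, n_trees, mtry, rng):
    """Grow each stump on a bootstrap sample (drawn with replacement)."""
    n, forest = len(X), []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], mtry, rng))
    return forest


def predict(forest, row):
    """Aggregate the stumps by majority vote."""
    votes = Counter(l if row[j] <= t else r for j, t, l, r in forest)
    return votes.most_common(1)[0][0]


def subsampled_forest(X, y, fraction, n_trees, mtry, rng):
    """Subsampling variant: fit one forest on a random fraction of the data."""
    idx = rng.sample(range(len(X)), max(1, int(fraction * len(X))))
    return fit_forest([X[i] for i in idx], [y[i] for i in idx], n_trees, mtry, rng)


def chunked_forest(X, y, n_chunks, trees_per_chunk, mtry, rng):
    """MapReduce-style variant: one small forest per chunk, then merge them."""
    idx = list(range(len(X)))
    rng.shuffle(idx)
    forest = []
    for c in range(n_chunks):
        part = idx[c::n_chunks]  # "map": each chunk is handled independently
        forest += fit_forest([X[i] for i in part], [y[i] for i in part],
                             trees_per_chunk, mtry, rng)
    return forest  # "reduce": the merged forest is the union of all stumps


def accuracy(forest, X, y):
    return sum(predict(forest, row) == lab for row, lab in zip(X, y)) / len(y)


# Toy data (an assumption for the demo): the class depends only on feature 0.
rng = random.Random(0)
X = [[rng.random(), rng.random()] for _ in range(600)]
y = [int(row[0] > 0.5) for row in X]
X_test = [[rng.random(), rng.random()] for _ in range(200)]
y_test = [int(row[0] > 0.5) for row in X_test]

# mtry equals the number of features here (plain bagging) to keep the toy demo stable.
sub = subsampled_forest(X, y, fraction=0.2, n_trees=15, mtry=2, rng=rng)
dac = chunked_forest(X, y, n_chunks=4, trees_per_chunk=5, mtry=2, rng=rng)
print("subsampling:", accuracy(sub, X_test, y_test))
print("chunked:", accuracy(dac, X_test, y_test))
```

In both functions no single tree ever sees the full dataset, which is what makes these variants attractive for massive data. The price is that each tree is fit on a smaller sample, and that chunks in the second variant must be representative of the whole, which is why the sketch shuffles the indices before splitting.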

Subjects:	Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
Cite as:	arXiv:1511.08327 [stat.ML]
	(or arXiv:1511.08327v1 [stat.ML] for this version)
	https://doi.org/10.48550/arXiv.1511.08327

Submission history

From: Nathalie Villa-Vialaneix [via CCSD proxy]
[v1] Thu, 26 Nov 2015 09:04:47 UTC (26 KB)
[v2] Wed, 22 Mar 2017 14:51:57 UTC (1,708 KB)
