Statistical methods and computing for big data. | Innovative Methods Program for Advancing Clinical Trials (IMPACT)

Title	Statistical methods and computing for big data.
Publication Type	Journal Article
Year of Publication	2016
Authors	Wang, Chun, Ming-Hui Chen, Elizabeth Schifano, Jing Wu, and Jun Yan
Journal	Stat Interface
Volume	9
Issue	4
Pagination	399-414
Date Published	2016
ISSN	1938-7989
Abstract	Big data are data on a massive scale in terms of volume, intensity, and complexity that exceed the capacity of standard analytic tools. They present opportunities as well as challenges to statisticians. The role of computational statisticians in scientific discovery from big data analyses has been under-recognized even by peer statisticians. This article summarizes recent methodological and software developments in statistics that address the big data challenges. Methodologies are grouped into three classes: subsampling-based, divide and conquer, and online updating for stream data. As a new contribution, the online updating approach is extended to variable selection with commonly used criteria, and their performances are assessed in a simulation study with stream data. Software packages are summarized with focuses on the open source R and R packages, covering recent tools that help break the barriers of computer memory and computing power. Some of the tools are illustrated in a case study with a logistic regression for the chance of airline delay.
DOI	10.4310/SII.2016.v9.n4.a1
Alternate Journal	Stat Interface
Original Publication	Statistical methods and computing for big data.
PubMed ID	27695593
PubMed Central ID	PMC5041595
Grant List	P01 CA142538 / CA / NCI NIH HHS / United States; R01 GM070335 / GM / NIGMS NIH HHS / United States

Project	Project 1.3; Project 2.2