2013-03-30
Reposted from Revolution Analytics

Lots of data != "Big Data"


by Joseph Rickert
When talking with data scientists and analysts who work with large-scale analytics platforms such as Hadoop about the best way to tackle a sophisticated modeling task, it is not uncommon for someone to say, "We have all of the data. Why not just use it all?" This sort of comment initially sounds pragmatic and reasonable to almost everyone. After all, wouldn't a model based on all of the data be better than a model based on a subsample? Well, maybe not; it depends, of course, on the problem at hand as well as on time and computational constraints. To illustrate the kinds of challenges that large data sets present, let's look at something very simple using the airlines data set from the 2009 ASA challenge.
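As a minimal sketch of the kind of fit involved (not the code used for the post), the same simple regression could be run on a random subsample with base R; the data frame name air is an assumption here.

# Minimal sketch, not the original code: fit the simple regression on a
# random subsample with base R, assuming the airline data are already
# loaded in a data frame called `air`.
set.seed(1)
idx <- sample(nrow(air), 12283)                     # random subsample of rows
fit <- lm(ArrDelay ~ CRSDepTime, data = air[idx, ])
summary(fit)                                        # coefficients and R-squared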
Here are some of the results for a regression of ArrDelay on CRSDepTime with a random sample of 12,283 records drawn from that data set:
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)   
# (Intercept) -0.85885    0.80224  -1.071    0.284
# CRSDepTime   0.56199    0.05564  10.100 2.22e-16
# Multiple R-squared: 0.008238
# Adjusted R-squared: 0.008157
And here are some results from the same model using 120,947,440 records:
#Coefficients:
#                 Estimate Std. Error t value Pr(>|t|)   
# (Intercept) -2.4021635  0.0083532  -287.6 2.22e-16 ***
# CRSDepTime   0.6990404  0.0005826  1199.9 2.22e-16 ***
# Multiple R-squared: 0.01176
# Adjusted R-squared: 0.01176
More data didn't yield an obviously better model! I don't think anyone would really find this to be much of a surprise; we are dealing with a not very good model to begin with. Nevertheless, the example does provide an opportunity to investigate how the coefficient estimates change with sample size. The next graph shows the estimated slope coefficient plotted against sample size, with sample sizes ranging from 12,283 to 12,094,709 records. Each regression was done on a random sample that includes about 12,000 more points than the previous one. The graph also shows, in red, the standard confidence interval for the coefficient at each point. Notice that after some initial instability, the coefficient estimates settle down to something close to the value of beta obtained using all of the data.
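A base-R sketch of that sample-size experiment might look like the following; it illustrates the idea rather than reproducing the actual computation (which used RevoScaleR), and the data frame name air and the step size are assumptions.

# Sketch: slope of CRSDepTime, with a 95% confidence interval, as the
# sample size grows in steps of roughly 12,000 records. `air` is an
# assumed data frame holding the airline records.
sizes <- seq(12283, 12094709, by = 12283)
res <- t(sapply(sizes, function(n) {
  fit <- lm(ArrDelay ~ CRSDepTime, data = air[sample(nrow(air), n), ])
  c(coef(fit)["CRSDepTime"], confint(fit)["CRSDepTime", ])
}))
plot(sizes, res[, 1], type = "l",
     xlab = "sample size", ylab = "slope estimate")
lines(sizes, res[, 2], col = "red")   # lower 95% bound
lines(sizes, res[, 3], col = "red")   # upper 95% bound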

The rapid approach to the full-data-set value of the coefficient is even more apparent in the following graph, which shows the difference between the estimated beta coefficient at each sample size and the value obtained using all of the data. The maximum difference from the fourth sample on is 0.07, which is pretty close indeed. In cases like this, if you believed that your samples were representative of the entire data set, working with all of the data to evaluate possible models would be a waste of time and possibly counterproductive.

I am certainly not arguing that one never wants to use all of the data. For one thing, when scoring a model or making predictions the goal is to do something with all of the records. Moreover, in more realistic modeling situations with thousands of predictor variables, 120M observations might not be enough data to conclude anything: a large model can digest degrees of freedom very quickly and severely limit the ability to make any kind of statistical inference. I do want to argue, however, that with large data sets the ability to work with random samples confers the freedom to examine several models quickly, with considerable confidence that the results are decent estimates of what would be obtained using the full data set.
I did the random sampling and regressions in my little example using functions from Revolution Analytics' RevoScaleR package. Initially, all of the data was read from the csv files that comprise the FAA data set into the binary .xdf file format used by the RevoScaleR package. The random samples were then selected using the rxDataStep function of RevoScaleR, which was designed to manipulate large data sets quickly. The code below reads each record, draws a random integer between 1 and 9999, and assigns it to the variable urns.
# Assign a random urn number between 1 and 9999 to every row of the .xdf file
rxDataStep(inData = working.file,
           outFile = working.file,
           transforms = list(urns = as.integer(runif(.rxNumRows, 1, 10000))),
           overwrite = TRUE)

Random samples for each regression were then drawn by looping through the appropriate values of the urns variable. Notice how the call to R's runif() function happens within the transforms parameter of rxDataStep. The full regression took about 33 seconds on my laptop, which made it feasible to undertake the extravagant number of calculations necessary to do the 1,000 regressions in a few hours after dinner.
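The looping step might look roughly like the sketch below. The exact code is an assumption on my part (the post does not show it); rxLinMod with its rowSelection argument is the standard RevoScaleR way to fit a linear model on a filtered subset of an .xdf file.

# Sketch (assumed, not the original code): each pass keeps only the rows
# whose urn value is at most k, so successive regressions use nested
# random samples that grow by roughly 1/10000 of the data per step.
for (k in 1:1000) {
  fit <- rxLinMod(ArrDelay ~ CRSDepTime,
                  data = working.file,
                  rowSelection = urns <= cutoff,
                  transformObjects = list(cutoff = k))
  # record the slope estimate and its standard error (e.g. from summary(fit))
}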
I think there are three main takeaways from this exercise:
  • Lots of data does not necessarily equate to “Big Data”
  • For exploratory modeling you want to work in an environment that allows for rapid prototyping and provides the statistical tools for model evaluation and visualization. There is no better environment than R for this kind of work, and Revolution's distribution of R offers the ability to work with very large samples.
  • The ability to draw random samples from large data sets is the way to balance accuracy against computational constraints.
To my way of thinking, the ability to draw random samples conveniently is the single most important capability to implement in any large-scale data platform.




All replies
2013-4-1 09:48:06
agree

2013-4-1 16:14:38
But isn't there still a question of degree? Take Figure 1, for example: without running the full data, how would I know that from around 2,500,000 onward, further increasing the sample size barely affects the results? In other words, at what size does the data become essentially equivalent to the full data set? Without the full data, what criterion do you use to decide that the sample is already large enough? And furthermore, how do you confirm that the sample you drew is itself unbiased? The whole motivation for wanting more and more data is to get broader coverage in order to find patterns, even though trimming is sometimes unavoidable.

I also agree that lots of data does not equal Big Data, but I find it hard to agree with the third take-away. With the exascale computing era almost upon us, algorithms and hardware can make computation no longer the bottleneck; what we should focus on instead is how to calibrate our computations so that accuracy is preserved as the data grow huge...

In any case, I support the OP's dialectical line of thinking; I've learned something here.

2013-4-1 21:28:19
Hamlet邵e posted on 2013-4-1 16:14:
But isn't there still a question of degree? Take Figure 1, for example: without running the full data, how would I know that from around 2,500,000 onward, further increasing ...
Haha, I didn't write this blog post, so I can hardly take credit for the line of thinking; it was reposted from the official Revolution Analytics blog.

As for your claim that "with the exascale era almost upon us, algorithms and hardware can make computation no longer the bottleneck," I don't agree. It is true that today's technology lets us use computing resources more efficiently and that the cost of computation keeps falling, but we are still very far from computation "no longer being the bottleneck." Because of my line of work, and at the risk of sounding immodest, I'm someone who has long worked with big data in the true sense (only Google, Facebook, Microsoft, and Amazon really have data at that level). On data of that scale, even today's most advanced MapReduce architectures (even heavily customized and optimized ones) still cannot support many basic modeling and estimation tasks, let alone serve as a shared system for ordinary data analysts. Take the simplest possible problem: estimating the variance of a variable's median. This quantity, about as basic as it gets in theory, is very hard to do well in practice. Top conferences and journals are still debating this basic problem (for example, Michael Jordan's bag of little bootstraps), and there is still no efficient method in wide use (the more practical approaches all use something like subsampling to approximate the bootstrap result). Deep learning, which has become ever more popular over the past two years, is even further out of reach. In my view, knowing how to subsample is an extremely important and fundamental idea when dealing with big data.
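To make the subsampling idea concrete, here is a toy sketch in R of a bag-of-little-bootstraps style estimate of the standard error of a median; the data and the subsample sizes are made up purely for illustration.

# Toy sketch of the bag-of-little-bootstraps idea for the standard error
# of a median. Every number here is illustrative, not a recommendation.
set.seed(42)
x <- rexp(1e5)                 # stand-in for a "large" data set
n <- length(x)
b <- floor(n^0.6)              # small subsample size
s <- 20                        # number of subsamples
r <- 50                        # bootstrap replicates per subsample

se_by_subsample <- sapply(seq_len(s), function(i) {
  sub <- sample(x, b)          # one small random subsample
  meds <- sapply(seq_len(r), function(j) {
    # spread n observations' worth of multinomial weight over the b points
    w <- as.vector(rmultinom(1, size = n, prob = rep(1/b, b)))
    median(rep(sub, times = w))  # fine at toy scale; use a weighted quantile in practice
  })
  sd(meds)                     # SE estimate from this subsample
})
mean(se_by_subsample)          # averaged estimate of SE(median)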


2013-4-2 09:46:22
ltx5151 posted on 2013-4-1 21:28:
Haha, I didn't write this blog post, so I can hardly take credit for the line of thinking; it was reposted from the official Revolution Analytics ...
Er, it seems I've been showing off in front of an expert again. Subsampling is indeed unavoidable, and limited computing resources are only one reason; as data grow huge, more practical problems appear, both in your example and in image recognition. My line about "with the exascale era almost upon us, algorithms and hardware can make computation no longer the bottleneck" was indeed something of an overstatement. Still, compared with the old predicament, I remain optimistic that things have improved a great deal. And you're right on another point too: there is no shared system, and ordinary data analysts have no access for this kind of work, or even for moderately sized computation. As a researcher, I don't even have the means to use GPU-accelerated computing... I was indeed overly optimistic.

2013-4-2 09:52:40
One more thing: I'm someone who deals with lots of data, not Big Data, although I do read those reports occasionally when I have spare time. So the people fighting on the front lines are definitely the more factual ones.
