An Econometric Perspective on Algorithmic Subsampling

2022-03-08

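Abstract (translated):
Terabyte-sized datasets are increasingly common, but computing bottlenecks often prevent a complete analysis of the data. Although more data are better than less, diminishing returns suggest that terabytes of data may not be needed to estimate a parameter or test a hypothesis. Which rows of the data should be analyzed, then, and can an arbitrary subset of rows preserve the features of the original data? The paper reviews a line of work grounded in theoretical computer science and numerical linear algebra. That work finds that an algorithmically desirable sketch, meaning a randomly chosen subset of the data, must preserve the eigenstructure of the data, a property known as a subspace embedding. Building on this work, the authors study how data sketching can affect prediction and inference in a linear regression setting. They show that the sketching error is small compared with the sample size effect, which the researcher can control. Because an algorithmically optimal sketch size may not be suitable for prediction and inference, they use statistical arguments to provide "inference-conscious" guides to the sketch size. When appropriately implemented, an estimator that pools over different sketches can be nearly as efficient as the infeasible estimator that uses the full sample.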
---
English title:
An Econometric Perspective on Algorithmic Subsampling
---
Authors:
Sokbae Lee, Serena Ng
---
Year of latest submission:
2020
---
Classification:

Top-level category: Economics
Subcategory: Econometrics
Category description: Econometric Theory, Micro-Econometrics, Macro-Econometrics, Empirical Content of Economic Relations discovered via New Methods, Methodological Aspects of the Application of Statistical Inference to Economic Data.
--
Top-level category: Statistics
Subcategory: Computation
Category description: Algorithms, Simulation, Visualization
--

---
Original English abstract:
Datasets that are terabytes in size are increasingly common, but computer bottlenecks often frustrate a complete analysis of the data. While more data are better than less, diminishing returns suggest that we may not need terabytes of data to estimate a parameter or test a hypothesis. But which rows of data should we analyze, and might an arbitrary subset of rows preserve the features of the original data? This paper reviews a line of work that is grounded in theoretical computer science and numerical linear algebra, and which finds that an algorithmically desirable sketch, which is a randomly chosen subset of the data, must preserve the eigenstructure of the data, a property known as a subspace embedding. Building on this work, we study how prediction and inference can be affected by data sketching within a linear regression setup. We show that the sketching error is small compared to the sample size effect which a researcher can control. As a sketch size that is algorithmically optimal may not be suitable for prediction and inference, we use statistical arguments to provide 'inference conscious' guides to the sketch size. When appropriately implemented, an estimator that pools over different sketches can be nearly as efficient as the infeasible one using the full sample.
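
To make the idea in the abstract concrete, here is a minimal sketch in Python with NumPy. It is not the authors' procedure. It assumes the simplest kind of sketch (rows drawn uniformly at random without replacement) and the simplest pooling rule (an unweighted average of OLS estimates over disjoint subsamples). The function names, the simulated data, and the sizes m and k are hypothetical choices made only for illustration.

```python
import numpy as np


def ols(X, y):
    """Least-squares coefficients of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


def sketched_ols(X, y, m, rng):
    """OLS on m rows drawn uniformly without replacement.

    Assumption: uniform row sampling stands in for the sketch;
    other sketching schemes are not shown here.
    """
    idx = rng.choice(X.shape[0], size=m, replace=False)
    return ols(X[idx], y[idx])


def pooled_sketched_ols(X, y, m, k, rng):
    """Unweighted average of OLS estimates from k disjoint subsamples of size m.

    Assumption: simple averaging is used as the pooling rule.
    """
    n = X.shape[0]
    if k * m > n:
        raise ValueError("k disjoint subsamples of size m need k * m <= n")
    perm = rng.permutation(n)
    estimates = []
    for j in range(k):
        rows = perm[j * m:(j + 1) * m]
        estimates.append(ols(X[rows], y[rows]))
    return np.mean(estimates, axis=0)


def simulate(n=20000, d=3, seed=0):
    """Simulated linear model y = X @ beta + noise (illustrative data only)."""
    rng = np.random.default_rng(seed)
    X = np.column_stack([np.ones(n), rng.normal(size=(n, d - 1))])
    beta = np.arange(1, d + 1, dtype=float)
    y = X @ beta + rng.normal(size=n)
    return X, y, beta


if __name__ == "__main__":
    X, y, beta = simulate()
    rng = np.random.default_rng(1)
    full = ols(X, y)
    single = sketched_ols(X, y, m=500, rng=rng)
    pooled = pooled_sketched_ols(X, y, m=500, k=10, rng=rng)
    # Largest coefficient error for each estimator.
    print("full sample  :", np.max(np.abs(full - beta)))
    print("single sketch:", np.max(np.abs(single - beta)))
    print("pooled (k=10):", np.max(np.abs(pooled - beta)))
```

Because the pooled estimate draws on k times as many rows as a single sketch, one would expect it to sit closer to the full-sample estimate on average. The paper itself should be consulted for the sketching schemes and the sketch-size guidance it actually proposes.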
---
PDF link:
https://arxiv.org/pdf/1907.01954
