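Abstract (translated):
The incorporation of unlabeled data into regression and classification analysis is a growing focus of the applied statistics and machine learning literature, and a number of recent examples show the potential of unlabeled data to improve predictive accuracy. The statistical basis of this semisupervised analysis does not appear to have been well described; as a result, the underlying theory and rationale may be underappreciated, especially by nonstatisticians. There is also room for statisticians to engage more fully in the vigorous research in this important area at the intersection of statistics and computer science. For example, much of the theoretical work in the literature has focused on geometric and structural properties of unlabeled data within particular algorithms rather than on probabilistic and statistical questions. This paper gives an overview of the fundamental statistical foundations of predictive modeling and of the general questions associated with unlabeled data, emphasizing the relevance of the venerable concepts of sampling design and prior specification. The theory is illustrated with a series of central examples and two substantial real-data analyses, which show precisely when, why, and how unlabeled data matter.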
---
Title:
The Use of Unlabeled Data in Predictive Modeling
---
Authors:
Feng Liang, Sayan Mukherjee, Mike West
---
Latest submission year:
2007
---
Classification:
Primary category: Statistics
Secondary category: Methodology
Category description: Design, Surveys, Model Selection, Multiple Testing, Multivariate Methods, Signal and Image Processing, Time Series, Smoothing, Spatial Statistics, Survival Analysis, Nonparametric and Semiparametric Methods
---
Original abstract:
The incorporation of unlabeled data in regression and classification analysis is an increasing focus of the applied statistics and machine learning literatures, with a number of recent examples demonstrating the potential for unlabeled data to contribute to improved predictive accuracy. The statistical basis for this semisupervised analysis does not appear to have been well delineated; as a result, the underlying theory and rationale may be underappreciated, especially by nonstatisticians. There is also room for statisticians to become more fully engaged in the vigorous research in this important area of intersection of the statistical and computer sciences. Much of the theoretical work in the literature has focused, for example, on geometric and structural properties of the unlabeled data in the context of particular algorithms, rather than probabilistic and statistical questions. This paper overviews the fundamental statistical foundations for predictive modeling and the general questions associated with unlabeled data, highlighting the relevance of venerable concepts of sampling design and prior specification. This theory, illustrated with a series of central illustrative examples and two substantial real data analyses, shows precisely when, why and how unlabeled data matter.
---
PDF link:
https://arxiv.org/pdf/0710.4618