A Kernel Method for the Two-Sample Problem

2022-03-04

Title:
A Kernel Method for the Two-Sample Problem
---
Authors:
Arthur Gretton, Karsten Borgwardt, Malte J. Rasch, Bernhard Schölkopf, Alexander J. Smola
---
Year of latest submission:
2008
---
Categories:

Top-level category: Computer Science
Subcategory: Machine Learning
Description: Papers on all aspects of machine learning research (supervised, unsupervised, reinforcement learning, bandit problems, and so on) including also robustness, explanation, fairness, and methodology. cs.LG is also an appropriate primary category for applications of machine learning methods.
--
Top-level category: Computer Science
Subcategory: Artificial Intelligence
Description: Covers all areas of AI except Vision, Robotics, Machine Learning, Multiagent Systems, and Computation and Language (Natural Language Processing), which have separate subject areas. In particular, includes Expert Systems, Theorem Proving (although this may overlap with Logic in Computer Science), Knowledge Representation, Planning, and Uncertainty in AI. Roughly includes material in ACM Subject Classes I.2.0, I.2.1, I.2.3, I.2.4, I.2.8, and I.2.11.
--

---
Abstract:
We propose a framework for analyzing and comparing distributions, allowing us to design statistical tests to determine if two samples are drawn from different distributions. Our test statistic is the largest difference in expectations over functions in the unit ball of a reproducing kernel Hilbert space (RKHS). We present two tests based on large deviation bounds for the test statistic, while a third is based on the asymptotic distribution of this statistic. The test statistic can be computed in quadratic time, although efficient linear time approximations are available. Several classical metrics on distributions are recovered when the function space used to compute the difference in expectations is allowed to be more general (e.g., a Banach space). We apply our two-sample tests to a variety of problems, including attribute matching for databases using the Hungarian marriage method, where they perform strongly. Excellent performance is also obtained when comparing distributions over graphs, for which these are the first such tests.
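
The statistic described in the abstract is commonly called the maximum mean discrepancy (MMD). The sketch below illustrates, under stated assumptions, how a quadratic-time unbiased estimate of the squared MMD and a linear-time approximation might be computed. The Gaussian kernel and its bandwidth `sigma` are illustrative choices, since the abstract does not fix a kernel. The function names (`gaussian_kernel`, `mmd2_unbiased`, `mmd2_linear`) are hypothetical and do not come from the authors' code. Turning the estimate into a test also requires a rejection threshold, which is not shown here.

```python
import math
import random


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel between two equal-length sequences of floats.

    The kernel and the default bandwidth are assumptions made for illustration.
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mmd2_unbiased(xs, ys, kernel=gaussian_kernel):
    """Quadratic-time unbiased estimate of the squared MMD between two samples."""
    m, n = len(xs), len(ys)
    if m < 2 or n < 2:
        raise ValueError("each sample needs at least two points")
    # Within-sample sums skip the diagonal (i == j) so the estimate stays unbiased.
    k_xx = sum(kernel(xs[i], xs[j]) for i in range(m) for j in range(m) if i != j)
    k_yy = sum(kernel(ys[i], ys[j]) for i in range(n) for j in range(n) if i != j)
    # The cross-sample sum uses every (x, y) pair.
    k_xy = sum(kernel(x, y) for x in xs for y in ys)
    return k_xx / (m * (m - 1)) + k_yy / (n * (n - 1)) - 2.0 * k_xy / (m * n)


def mmd2_linear(xs, ys, kernel=gaussian_kernel):
    """Linear-time estimate: average a four-kernel term over disjoint consecutive pairs.

    Assumes both samples have the same size; a trailing unpaired point is ignored.
    """
    if len(xs) != len(ys):
        raise ValueError("the linear-time estimate assumes equal sample sizes")
    pairs = len(xs) // 2
    if pairs == 0:
        raise ValueError("each sample needs at least two points")
    total = 0.0
    for i in range(pairs):
        x1, x2 = xs[2 * i], xs[2 * i + 1]
        y1, y2 = ys[2 * i], ys[2 * i + 1]
        total += kernel(x1, x2) + kernel(y1, y2) - kernel(x1, y2) - kernel(x2, y1)
    return total / pairs


if __name__ == "__main__":
    random.seed(0)
    p = [[random.gauss(0.0, 1.0)] for _ in range(100)]
    q_same = [[random.gauss(0.0, 1.0)] for _ in range(100)]
    q_shifted = [[random.gauss(2.0, 1.0)] for _ in range(100)]
    print("same distribution:   ", mmd2_unbiased(p, q_same))
    print("shifted distribution:", mmd2_unbiased(p, q_shifted))
    print("linear-time, shifted:", mmd2_linear(p, q_shifted))
```

The quadratic-time estimate evaluates the kernel on every pair of points, while the linear-time version touches each point once, trading statistical efficiency for speed. Because the unbiased estimate can be slightly negative when the two samples come from the same distribution, a value near zero should not be read as a bug.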
---
PDF link:
https://arxiv.org/pdf/0805.2368
