Explicit probabilistic models for databases and networks

2022-03-07

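Abstract (translation):
Recent work in data mining and related fields has highlighted the importance of statistically assessing data mining results. Key to this effort is choosing a non-trivial null model for the data, against which discovered patterns can be contrasted. The most influential null models to date are defined in terms of invariants of the null distribution. Such null models can be used by computationally intensive randomization methods to estimate the statistical significance of data mining results. Here, we introduce a method for constructing non-trivial probabilistic models based on the maximum entropy (MaxEnt) principle. We show how MaxEnt models allow prior information to be incorporated naturally. Moreover, they satisfy a number of desirable properties of previously introduced randomization methods. Finally, they have the added benefit that they can be represented explicitly. We argue that our approach can be used for a variety of data types. For concreteness, however, we have chosen to demonstrate it specifically for databases and networks.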
---
English title:
Explicit probabilistic models for databases and networks
---
Author:
Tijl De Bie
---
Year of latest submission:
2009
---
Classification:

Top-level category: Computer Science
Subcategory: Artificial Intelligence
Description: Covers all areas of AI except Vision, Robotics, Machine Learning, Multiagent Systems, and Computation and Language (Natural Language Processing), which have separate subject areas. In particular, includes Expert Systems, Theorem Proving (although this may overlap with Logic in Computer Science), Knowledge Representation, Planning, and Uncertainty in AI. Roughly includes material in ACM Subject Classes I.2.0, I.2.1, I.2.3, I.2.4, I.2.8, and I.2.11.
--
Top-level category: Computer Science
Subcategory: Databases
Description: Covers database management, data mining, and data processing. Roughly includes material in ACM Subject Classes E.2, E.5, H.0, H.2, and J.1.
--
Top-level category: Computer Science
Subcategory: Information Theory
Description: Covers theoretical and experimental aspects of information theory and coding. Includes material in ACM Subject Class E.4 and intersects with H.1.1.
--
Top-level category: Mathematics
Subcategory: Information Theory
Description: math.IT is an alias for cs.IT. Covers theoretical and experimental aspects of information theory and coding.
--

---
English abstract:
Recent work in data mining and related areas has highlighted the importance of the statistical assessment of data mining results. Crucial to this endeavour is the choice of a non-trivial null model for the data, to which the found patterns can be contrasted. The most influential null models proposed so far are defined in terms of invariants of the null distribution. Such null models can be used by computation intensive randomization approaches in estimating the statistical significance of data mining results. Here, we introduce a methodology to construct non-trivial probabilistic models based on the maximum entropy (MaxEnt) principle. We show how MaxEnt models allow for the natural incorporation of prior information. Furthermore, they satisfy a number of desirable properties of previously introduced randomization approaches. Lastly, they also have the benefit that they can be represented explicitly. We argue that our approach can be used for a variety of data types. However, for concreteness, we have chosen to demonstrate it in particular for databases and networks.
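
As a rough illustration of what an explicitly represented MaxEnt model can look like, the sketch below assumes a binary database whose only prior information is its expected row and column sums. Under that assumption, the maximum entropy distribution factorizes into independent Bernoulli entries with P(D_ij = 1) = exp(λ_i + μ_j) / (1 + exp(λ_i + μ_j)). The function name `fit_maxent_bernoulli`, the plain gradient-ascent fitting loop, and the example margins are illustrative choices made here, not details taken from the paper.

```python
import numpy as np

def fit_maxent_bernoulli(row_sums, col_sums, iters=20000):
    """Return P with P[i, j] = sigmoid(lam[i] + mu[j]) whose row and column
    sums approximate the requested expected margins (hypothetical helper)."""
    r = np.asarray(row_sums, dtype=float)
    c = np.asarray(col_sums, dtype=float)
    lam = np.zeros(r.size)   # one multiplier per row-sum constraint
    mu = np.zeros(c.size)    # one multiplier per column-sum constraint
    # Step size 1/L, where L = (n + m) / 4 bounds the curvature of the dual.
    step = 4.0 / (r.size + c.size)
    for _ in range(iters):
        P = 1.0 / (1.0 + np.exp(-(lam[:, None] + mu[None, :])))
        # Gradient of the concave dual: target margins minus model margins.
        lam += step * (r - P.sum(axis=1))
        mu += step * (c - P.sum(axis=0))
    return 1.0 / (1.0 + np.exp(-(lam[:, None] + mu[None, :])))

# Illustrative margins for a 3 x 4 binary database (both total 6).
P = fit_maxent_bernoulli([2.0, 1.0, 3.0], [2.0, 1.0, 2.0, 1.0])
print(np.round(P, 3))
print(P.sum(axis=1), P.sum(axis=0))
```

Because the fitted model is just a matrix of probabilities, a null database can be drawn entry by entry, for example with `np.random.default_rng().random(P.shape) < P`.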
---
PDF link:
https://arxiv.org/pdf/0906.5148
