Query Significance in Databases via Randomizations

2022-03-11

Title:
Query Significance in Databases via Randomizations
---
Authors:
Markus Ojala, Gemma C. Garriga, Aristides Gionis, Heikki Mannila
---
Year of latest submission:
2009
---
Categories:

Top-level category: Computer Science
Subcategory: Databases
Category description: Covers database management, datamining, and data processing. Roughly includes material in ACM Subject Classes E.2, E.5, H.0, H.2, and J.1.
--
Top-level category: Computer Science
Subcategory: Artificial Intelligence
Category description: Covers all areas of AI except Vision, Robotics, Machine Learning, Multiagent Systems, and Computation and Language (Natural Language Processing), which have separate subject areas. In particular, includes Expert Systems, Theorem Proving (although this may overlap with Logic in Computer Science), Knowledge Representation, Planning, and Uncertainty in AI. Roughly includes material in ACM Subject Classes I.2.0, I.2.1, I.2.3, I.2.4, I.2.8, and I.2.11.
--

---
Abstract:
Many sorts of structured data are commonly stored in a multi-relational format of interrelated tables. Under this relational model, exploratory data analysis can be done by using relational queries. As an example, in the Internet Movie Database (IMDb) a query can be used to check whether the average rank of action movies is higher than the average rank of drama movies. We consider the problem of assessing whether the results returned by such a query are statistically significant or just a random artifact of the structure in the data. Our approach is based on randomizing the tables occurring in the queries and repeating the original query on the randomized tables. It turns out that there is no unique way of randomizing in multi-relational data. We propose several randomization techniques, study their properties, and show how to find out which queries or hypotheses about our data result in statistically significant information. We give results on real and generated data and show how the significance of some queries varies between different randomizations.
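
The randomize-and-requery idea in the abstract can be illustrated with a minimal Python sketch. Everything below is an assumption made for illustration: the toy ratings and genre labels are invented, the function names are hypothetical, and the single randomization shown (shuffling which movie each genre label attaches to) is only one simple choice, not necessarily one of the techniques proposed in the paper. The empirical p-value is the usual (hits + 1) / (samples + 1) estimate.

```python
import random
from statistics import mean

# Invented toy data (not from IMDb): movie id -> rating, and (movie id, genre) rows.
MOVIE_RATINGS = {0: 8.0, 1: 7.5, 2: 7.0, 3: 8.5, 4: 7.8, 5: 7.2,
                 6: 6.0, 7: 5.5, 8: 6.5, 9: 5.0, 10: 5.8, 11: 5.2}
MOVIE_GENRES = [(i, "action") for i in range(6)] + [(i, "drama") for i in range(6, 12)]


def avg_rating_gap(ratings, genres, a="action", b="drama"):
    """Toy query: join genres to ratings, return mean(a) - mean(b)."""
    by_genre = {a: [], b: []}
    for movie_id, genre in genres:
        if genre in by_genre:
            by_genre[genre].append(ratings[movie_id])
    return mean(by_genre[a]) - mean(by_genre[b])


def randomize_genres(genres, rng):
    """Assumed randomization: shuffle the movie-id column of the genre table.

    This keeps the number of rows per genre and leaves the ratings table intact.
    """
    movie_ids = [movie_id for movie_id, _ in genres]
    rng.shuffle(movie_ids)
    return [(movie_id, genre) for movie_id, (_, genre) in zip(movie_ids, genres)]


def empirical_p_value(ratings, genres, n_samples=999, seed=0):
    """Repeat the query on randomized tables and compare with the observed result."""
    rng = random.Random(seed)
    observed = avg_rating_gap(ratings, genres)
    hits = sum(
        avg_rating_gap(ratings, randomize_genres(genres, rng)) >= observed
        for _ in range(n_samples)
    )
    return observed, (hits + 1) / (n_samples + 1)


if __name__ == "__main__":
    gap, p = empirical_p_value(MOVIE_RATINGS, MOVIE_GENRES)
    print(f"observed gap = {gap:.2f}, empirical p-value = {p:.3f}")
```

A small p-value here means that few randomized genre tables reproduce a gap as large as the observed one. Because multi-relational data admits more than one randomization, a different shuffle (for example, of the ratings table instead) could yield a different p-value for the same query.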
---
PDF link:
https://arxiv.org/pdf/0906.5485
