Fast Algorithms for Mining Interesting Frequent Itemsets without Minimum Support

2022-03-04

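Abstract translation:
Real-world datasets are sparse, dirty, and contain hundreds of items. In such settings, traditional frequent itemset mining, which takes a user-defined support threshold as input, is ill-suited to discovering interesting rules (results): without domain knowledge, a threshold set too high may output nothing, while one set too low may output a large number of redundant, uninteresting results. A recently proposed approach mines only the N-most/top-k interesting frequent itemsets, discovering the top N interesting results without any user-specified support threshold. However, mining interesting frequent itemsets without a minimum support threshold is more costly in terms of itemset search space exploration and processing. Mining efficiency therefore depends heavily on three main factors: (1) the database representation used for itemset frequency counting, (2) the projection of relevant transactions to lower-level nodes of the search space, and (3) the algorithm implementation technique. To improve the efficiency of the mining process, this paper presents two novel algorithms, N-MostMiner and Top-K-Miner. It also describes several efficient implementation techniques for N-MostMiner and Top-K-Miner that the authors arrived at during implementation. Experimental results on benchmark datasets show that N-MostMiner and Top-K-Miner are more efficient in processing time than the current best algorithms, BOMO and TFP.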
---
English title:
Fast Algorithms for Mining Interesting Frequent Itemsets without Minimum Support
---
Authors:
Shariq Bashir, Zahoor Jan, Abdul Rauf Baig
---
Latest submission year:
2009
---
Classification:

Primary category: Computer Science
Subcategory: Databases
Description: Covers database management, data mining, and data processing. Roughly includes material in ACM Subject Classes E.2, E.5, H.0, H.2, and J.1.
--
Primary category: Computer Science
Subcategory: Artificial Intelligence
Description: Covers all areas of AI except Vision, Robotics, Machine Learning, Multiagent Systems, and Computation and Language (Natural Language Processing), which have separate subject areas. In particular, includes Expert Systems, Theorem Proving (although this may overlap with Logic in Computer Science), Knowledge Representation, Planning, and Uncertainty in AI. Roughly includes material in ACM Subject Classes I.2.0, I.2.1, I.2.3, I.2.4, I.2.8, and I.2.11.
--
Primary category: Computer Science
Subcategory: Data Structures and Algorithms
Description: Covers data structures and analysis of algorithms. Roughly includes material in ACM Subject Classes E.1, E.2, F.2.1, and F.2.2.
--

---
Original English abstract:
Real-world datasets are sparse, dirty, and contain hundreds of items. In such situations, discovering interesting rules (results) using the traditional frequent itemset mining approach, which requires a user-defined input support threshold, is not appropriate: without any domain knowledge, setting the support threshold too large or too small can output nothing or a large number of redundant, uninteresting results. Recently, a novel approach of mining only the N-most/Top-K interesting frequent itemsets has been proposed, which discovers the top N interesting results without specifying any user-defined support threshold. However, mining interesting frequent itemsets without a minimum support threshold is more costly in terms of itemset search space exploration and processing cost. The efficiency of such mining therefore depends heavily on three main factors: (1) the database representation approach used for itemset frequency counting, (2) the projection of relevant transactions to lower-level nodes of the search space, and (3) the algorithm implementation technique. To improve the efficiency of the mining process, in this paper we present two novel algorithms, N-MostMiner and Top-K-Miner, using the bit-vector representation approach, which is very efficient in terms of itemset frequency counting and transaction projection. In addition, several efficient implementation techniques for N-MostMiner and Top-K-Miner, drawn from our implementation experience, are also presented. Our experimental results on benchmark datasets suggest that N-MostMiner and Top-K-Miner are very efficient in terms of processing time compared with the current best algorithms, BOMO and TFP.
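
The abstract names two ideas that are easy to illustrate in a few lines: a bit-vector representation for frequency counting and transaction projection, and top-k mining without a user-supplied support threshold. The Python sketch below is only an illustration under stated assumptions; it is not N-MostMiner or Top-K-Miner, whose details are not given here. Each item is assumed to map to an integer whose bit t is set when transaction t contains the item, projection is assumed to be a bitwise AND, and the threshold is assumed to rise to the smallest support currently held in a size-k min-heap. The names `build_bitvectors`, `support`, and `top_k_itemsets` are hypothetical.

```python
import heapq
from typing import Dict, List, Set, Tuple


def build_bitvectors(transactions: List[Set[str]]) -> Dict[str, int]:
    """Map each item to an int whose bit t is set if transaction t contains it."""
    vectors: Dict[str, int] = {}
    for tid, items in enumerate(transactions):
        for item in items:
            vectors[item] = vectors.get(item, 0) | (1 << tid)
    return vectors


def support(bits: int) -> int:
    """Support = number of set bits = number of matching transactions."""
    return bin(bits).count("1")


def top_k_itemsets(
    transactions: List[Set[str]], k: int
) -> List[Tuple[int, Tuple[str, ...]]]:
    """Return up to k (support, itemset) pairs with the highest support.

    Illustrative sketch only: depth-first search over itemsets, with the
    pruning threshold taken from the min-heap of the best k seen so far.
    """
    vectors = build_bitvectors(transactions)
    # Visit frequent items first so the threshold rises quickly.
    items = sorted(vectors, key=lambda i: (-support(vectors[i]), i))
    heap: List[Tuple[int, Tuple[str, ...]]] = []  # min-heap keyed on support

    def extend(prefix: Tuple[str, ...], prefix_bits: int, start: int) -> None:
        for idx in range(start, len(items)):
            # Projection: keep only transactions that also contain this item.
            bits = prefix_bits & vectors[items[idx]]
            sup = support(bits)
            if sup == 0:
                continue
            # Supersets never have higher support, so this branch cannot
            # improve on a full heap whose minimum is already >= sup.
            if len(heap) == k and sup <= heap[0][0]:
                continue
            itemset = prefix + (items[idx],)
            if len(heap) < k:
                heapq.heappush(heap, (sup, itemset))
            else:
                heapq.heapreplace(heap, (sup, itemset))
            extend(itemset, bits, idx + 1)

    all_transactions = (1 << len(transactions)) - 1
    extend((), all_transactions, 0)
    return sorted(heap, key=lambda entry: (-entry[0], entry[1]))


if __name__ == "__main__":
    data = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"a"}, {"b", "c"}]
    print(top_k_itemsets(data, 3))
    # prints [(4, ('a',)), (3, ('b',)), (3, ('c',))]
```

In this sketch no minimum support is supplied by the caller: the only input besides the data is k, and ties at the k-th position are broken by exploration order.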
---
PDF link:
https://arxiv.org/pdf/0904.3319
