Title:
Learning Concept Hierarchies from Text Corpora using Formal Concept Analysis
---
Authors:
P. Cimiano, A. Hotho, S. Staab
---
Year of latest submission:
2011
---
Classification:
Primary category: Computer Science
Secondary category: Artificial Intelligence
Category description: Covers all areas of AI except Vision, Robotics, Machine Learning, Multiagent Systems, and Computation and Language (Natural Language Processing), which have separate subject areas. In particular, includes Expert Systems, Theorem Proving (although this may overlap with Logic in Computer Science), Knowledge Representation, Planning, and Uncertainty in AI. Roughly includes material in ACM Subject Classes I.2.0, I.2.1, I.2.3, I.2.4, I.2.8, and I.2.11.
---
Abstract:
We present a novel approach to the automatic acquisition of taxonomies or concept hierarchies from a text corpus. The approach is based on Formal Concept Analysis (FCA), a method mainly used for the analysis of data, i.e., for investigating and processing explicitly given information. We follow Harris' distributional hypothesis and model the context of a certain term as a vector representing syntactic dependencies which are automatically acquired from the text corpus with a linguistic parser. On the basis of this context information, FCA produces a lattice that we convert into a special kind of partial order constituting a concept hierarchy. The approach is evaluated by comparing the resulting concept hierarchies with hand-crafted taxonomies for two domains: tourism and finance. We also directly compare our approach with hierarchical agglomerative clustering as well as with Bi-Section-KMeans as an instance of a divisive clustering algorithm. Furthermore, we investigate the impact of using different measures weighting the contribution of each attribute as well as of applying a particular smoothing technique to cope with data sparseness.
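As a rough illustration of the FCA step mentioned in the abstract, the sketch below enumerates the formal concepts of a small formal context. The toy data, the `CONTEXT` name, and the `formal_concepts` helper are assumptions made for this example and are not taken from the paper's code. The later steps the abstract describes (parsing the corpus, weighting attributes, smoothing, and converting the lattice into a concept hierarchy) are not shown.

```python
# Toy formal context (hypothetical data): each term is mapped to the set of
# attributes that a parser might derive from verbs it occurs with.
CONTEXT = {
    "hotel": {"bookable"},
    "apartment": {"bookable", "rentable"},
    "car": {"bookable", "rentable", "driveable"},
    "bike": {"bookable", "rentable", "driveable", "rideable"},
    "excursion": {"bookable", "joinable"},
    "trip": {"bookable", "joinable"},
}


def formal_concepts(context):
    """Return all formal concepts as (extent, intent) pairs of frozensets.

    Every concept intent is an intersection of object intents, so the
    intents are built by closing the object intents under intersection.
    The full attribute set acts as the intersection of no objects.
    """
    all_attrs = frozenset().union(*context.values())
    intents = {all_attrs}
    for attrs in context.values():
        attrs = frozenset(attrs)
        # Intersect the new object intent with every intent found so far.
        intents |= {attrs & intent for intent in intents}

    concepts = []
    for intent in intents:
        # The extent is every object that has all attributes of the intent.
        extent = frozenset(o for o, a in context.items() if intent <= a)
        concepts.append((extent, intent))

    # Most general concepts (largest extents) come first.
    concepts.sort(key=lambda c: (-len(c[0]), sorted(c[1])))
    return concepts


if __name__ == "__main__":
    for extent, intent in formal_concepts(CONTEXT):
        print(sorted(extent), sorted(intent))
```

Ordering the resulting concepts by inclusion of their extents gives the lattice. In this toy context, the concept with intent {bookable, rentable} sits above the one with intent {bookable, rentable, driveable}.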
---
PDF link:
https://arxiv.org/pdf/1109.2140