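Abstract (translated):
The amount of information available on the Web is growing at an incredibly high rate. Systems and procedures for extracting these data from Web sources already exist, and different approaches and techniques have been studied over the past few years. On the one hand, reliable solutions should provide robust Web data mining algorithms that can automatically cope with possible malfunctions or failures. On the other hand, the literature lacks solutions for the maintenance of these systems. The procedures that extract Web data may be tightly coupled to the structure of the data source itself; malfunctions or the acquisition of corrupted data may therefore be caused by structural modifications that the owners of a data source make to it. At present, verification of data integrity and maintenance are mostly managed manually to ensure that these systems work correctly and reliably. In this paper, we propose a novel approach to creating procedures that extract data from Web sources (so-called Web wrappers) that can cope with malfunctions caused by modifications to the structure of the data source and can repair themselves automatically.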
---
English title:
Intelligent Self-Repairable Web Wrappers
---
Authors:
Emilio Ferrara and Robert Baumgartner
---
Year of latest submission:
2011
---
Classification:
Primary category: Computer Science
Secondary category: Artificial Intelligence
Category description: Covers all areas of AI except Vision, Robotics, Machine Learning, Multiagent Systems, and Computation and Language (Natural Language Processing), which have separate subject areas. In particular, includes Expert Systems, Theorem Proving (although this may overlap with Logic in Computer Science), Knowledge Representation, Planning, and Uncertainty in AI. Roughly includes material in ACM Subject Classes I.2.0, I.2.1, I.2.3, I.2.4, I.2.8, and I.2.11.
--
Primary category: Computer Science
Secondary category: Information Retrieval
Category description: Covers indexing, dictionaries, retrieval, content and analysis. Roughly includes material in ACM Subject Classes H.3.0, H.3.1, H.3.2, H.3.3, and H.3.4.
--
---
English abstract:
The amount of information available on the Web grows at an incredibly high rate. Systems and procedures devised to extract these data from Web sources already exist, and different approaches and techniques have been investigated in recent years. On the one hand, reliable solutions should provide robust Web data mining algorithms that can automatically face possible malfunctions or failures. On the other, in the literature there is a lack of solutions for the maintenance of these systems. Procedures that extract Web data may be strictly interconnected with the structure of the data source itself; thus, malfunctions or the acquisition of corrupted data could be caused, for example, by structural modifications made to data sources by their owners. Nowadays, verification of data integrity and maintenance are mostly managed manually, in order to ensure that these systems work correctly and reliably. In this paper we propose a novel approach to create procedures able to extract data from Web sources -- the so-called Web wrappers -- which can face possible malfunctions caused by modifications of the structure of the data source, and can automatically repair themselves.
---
PDF link:
https://arxiv.org/pdf/1106.3967