Title:
《Design of Automatically Adaptable Web Wrappers》
---
Authors:
Emilio Ferrara and Robert Baumgartner
---
Latest submission year:
2011
---
Subject classification:
Primary: Computer Science
Secondary: Artificial Intelligence
Description: Covers all areas of AI except Vision, Robotics, Machine Learning, Multiagent Systems, and Computation and Language (Natural Language Processing), which have separate subject areas. In particular, includes Expert Systems, Theorem Proving (although this may overlap with Logic in Computer Science), Knowledge Representation, Planning, and Uncertainty in AI. Roughly includes material in ACM Subject Classes I.2.0, I.2.1, I.2.3, I.2.4, I.2.8, and I.2.11.
--
Primary: Computer Science
Secondary: Information Retrieval
Description: Covers indexing, dictionaries, retrieval, content and analysis. Roughly includes material in ACM Subject Classes H.3.0, H.3.1, H.3.2, H.3.3, and H.3.4.
--
---
Abstract:
Nowadays, the huge amount of information distributed through the Web motivates the study of techniques for extracting relevant data in an efficient and reliable way. Both academia and industry have developed several approaches to Web data extraction, for example using techniques from artificial intelligence or machine learning. Some commonly adopted procedures, namely wrappers, ensure a high degree of precision of the information extracted from Web pages and, at the same time, must prove robust so as not to compromise the quality and reliability of the data themselves. In this paper we focus on some experimental aspects related to the robustness of the data extraction process and the possibility of automatically adapting wrappers. We discuss the implementation of algorithms for finding similarities between two different versions of a Web page, in order to handle modifications, avoiding the failure of data extraction tasks and ensuring the reliability of the extracted information. Our purpose is to evaluate the performance, advantages and drawbacks of our novel system for automatic wrapper adaptation.
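The abstract mentions algorithms for scoring the similarity between two versions of a Web page. A minimal sketch of that idea, not the authors' actual implementation, is the classic "simple tree matching" algorithm, which compares two ordered labeled trees (here, DOM trees reduced to tag names, with a hypothetical `(label, children)` tuple representation) and normalizes the match count into a similarity score:

```python
def simple_tree_matching(a, b):
    """Count matching nodes between two ordered labeled trees.

    Trees are (label, [children]) tuples -- a hypothetical minimal
    stand-in for parsed DOM nodes.
    """
    if a[0] != b[0]:
        return 0
    ka, kb = a[1], b[1]
    m, n = len(ka), len(kb)
    # Dynamic program over the two ordered child forests (the classic
    # simple-tree-matching recurrence, analogous to an LCS table).
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            M[i][j] = max(
                M[i][j - 1],
                M[i - 1][j],
                M[i - 1][j - 1] + simple_tree_matching(ka[i - 1], kb[j - 1]),
            )
    return 1 + M[m][n]


def tree_size(t):
    # Total number of nodes in a (label, children) tree.
    return 1 + sum(tree_size(c) for c in t[1])


def similarity(a, b):
    # Normalize the raw match count by the mean size of the two trees,
    # giving 1.0 for identical structures and lower values as the
    # page version diverges.
    return simple_tree_matching(a, b) / ((tree_size(a) + tree_size(b)) / 2)
```

For example, comparing a page version to itself yields `similarity == 1.0`, while dropping one `div` from a two-`div` body lowers the score below 1, which a wrapper-adaptation system could use to decide where extraction rules should be re-anchored.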
---
PDF link:
https://arxiv.org/pdf/1103.1254