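Abstract (translated from the Chinese):
This paper exploits the features of data compression techniques to propose a general method for information extraction. We first define, and focus on, the so-called "dictionary" of a sequence. Dictionaries are intrinsically interesting, and studying their features is very useful for investigating the properties of the sequences from which they are extracted (e.g. DNA strings). We then describe a string comparison procedure between dictionary-created sequences (or "artificial texts"), which gives very good results in several contexts. Finally, we present some results on self-consistent classification problems.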
---
English title:
Dictionary based methods for information extraction
---
Authors:
A. Baronchelli, E. Caglioti, V. Loreto, E. Pizzi
---
Year of latest submission:
2004
---
Classification:
Primary category: Physics
Subcategory: Statistical Mechanics
Description: Phase transitions, thermodynamics, field theory, non-equilibrium phenomena, renormalization group and scaling, integrable models, turbulence
--
Primary category: Physics
Subcategory: Other Condensed Matter
Description: Work in condensed matter that does not fit into the other cond-mat classifications
--
Primary category: Computer Science
Subcategory: Information Retrieval
Description: Covers indexing, dictionaries, retrieval, content and analysis. Roughly includes material in ACM Subject Classes H.3.0, H.3.1, H.3.2, H.3.3, and H.3.4.
--
Primary category: Quantitative Biology
Subcategory: Genomics
Description: DNA sequencing and assembly; gene and motif finding; RNA editing and alternative splicing; genomic structure and processes (replication, transcription, methylation, etc.); mutational processes.
--
Primary category: Quantitative Biology
Subcategory: Other Quantitative Biology
Description: Work in quantitative biology that does not fit into the other q-bio classifications
---
English abstract:
  In this paper we present a general method for information extraction that exploits the features of data compression techniques. We first define and focus our attention on the so-called "dictionary" of a sequence. Dictionaries are intrinsically interesting and a study of their features can be of great usefulness to investigate the properties of the sequences they have been extracted from (e.g. DNA strings). We then describe a procedure of string comparison between dictionary-created sequences (or "artificial texts") that gives very good results in several contexts. We finally present some results on self-consistent classification problems. 
---
PDF link:
https://arxiv.org/pdf/cond-mat/0402581