Abstract (translation):
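Biological data consists mainly of deoxyribonucleic acid (DNA) and protein sequences. These biomolecules are present in every human cell. Because DNA is self-replicating, it is a key component of the genetic material found in all living organisms, and it carries the genetic information required for the functioning and development of all living things. Storing the DNA data of a single person requires 10 CD-ROMs, and this volume keeps growing as more and more sequences are added to public databases. This rapid growth makes it hard to extract information accurately from sequence data, because many data analysis and visualization tools cannot handle such large volumes. To reduce the size of DNA and protein sequences, researchers have proposed many sequence compression algorithms, such as compress or gzip, Context Tree Weighting (CTW), Lempel-Ziv-Welch (LZW), arithmetic coding, run-length encoding, and substitution methods, all of which have contributed substantially to shrinking biological datasets. Traditional compression techniques, however, are not well suited to this kind of sequence data. In this paper, we survey various compression techniques for large volumes of DNA sequence data. The analysis shows that effective techniques not only reduce the size of a sequence but also avoid any loss of information. The review of existing studies also shows that compressing DNA sequences matters for understanding the key characteristics of DNA data as well as for improving storage efficiency and data transmission. Compressing protein sequences, in addition, remains a challenge for the research community. The main parameters for evaluating these compression algorithms include compression ratio and running time complexity.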
---
English title:
Analysis of Compression Techniques for DNA Sequence Data
---
Authors:
Shakeela Bibi, Javed Iqbal, Adnan Iftekhar, Mir Hassan
---
Latest submission year:
2020
---
Classification:
Primary category: Quantitative Biology
Secondary category: Other Quantitative Biology
Category description: Work in quantitative biology that does not fit into the other q-bio classifications
---
English abstract:
Biological data mainly comprises deoxyribonucleic acid (DNA) and protein sequences. These biomolecules are present in all human cells. Owing to its self-replicating property, DNA is a key constituent of the genetic material that exists in all living creatures. This biomolecule contains the genetic material required for the functioning and development of all living things. Storing the DNA data of a single person requires 10 CD-ROMs. Moreover, this size is increasing constantly as more and more sequences are added to public databases. This rapid growth in sequence data poses challenges for precise information extraction, since many data analysis and visualization tools do not support processing such a large amount of data. To reduce the size of DNA and protein sequences, many scientists have introduced various sequence compression algorithms, such as compress or gzip, Context Tree Weighting (CTW), Lempel-Ziv-Welch (LZW), arithmetic coding, run-length encoding, and substitution methods. These techniques have contributed substantially to reducing the volume of biological datasets. On the other hand, traditional compression techniques are not well suited to compressing these types of sequence data. In this paper, we explore diverse techniques for compressing large amounts of DNA sequence data. The analysis reveals that efficient techniques not only reduce the size of the sequence but also avoid any information loss. The review of existing studies also shows that compressing a DNA sequence is significant for understanding the critical characteristics of DNA data, in addition to improving storage efficiency and data transmission. The compression of protein sequences, moreover, remains a challenge for the research community. The major parameters for evaluating these compression algorithms include compression ratio and running time complexity.
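
The abstract names run-length encoding and gzip among the surveyed techniques, and compression ratio as a main evaluation parameter. As a rough illustration only (not the authors' implementation), the Python sketch below compares three lossless encodings on a synthetic DNA string. Several things here are assumptions of the sketch rather than details from the paper: all function names are hypothetical, compression ratio is taken as original size divided by compressed size, `zlib` (the DEFLATE algorithm that gzip also uses) stands in for gzip, the fixed 2-bit-per-base code is a simple substitution-style baseline chosen for illustration, and the input is a random sequence rather than real genomic data.

```python
import random
import zlib

BASES = "ACGT"
CODE = {base: index for index, base in enumerate(BASES)}  # A=0, C=1, G=2, T=3


def run_length_encode(seq: str) -> str:
    """Collapse each run of identical symbols into <symbol><count>."""
    if not seq:
        return ""
    out = []
    prev, count = seq[0], 1
    for ch in seq[1:]:
        if ch == prev:
            count += 1
        else:
            out.append(f"{prev}{count}")
            prev, count = ch, 1
    out.append(f"{prev}{count}")
    return "".join(out)


def run_length_decode(encoded: str) -> str:
    """Invert run_length_encode (symbols are letters, counts are digits)."""
    out = []
    i = 0
    while i < len(encoded):
        symbol = encoded[i]
        j = i + 1
        while j < len(encoded) and encoded[j].isdigit():
            j += 1
        out.append(symbol * int(encoded[i + 1:j]))
        i = j
    return "".join(out)


def pack_2bit(seq: str) -> bytes:
    """Pack four bases per byte; the last byte is zero-padded on the right."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for ch in chunk:
            byte = (byte << 2) | CODE[ch]
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack_2bit(data: bytes, length: int) -> str:
    """Invert pack_2bit; `length` is the original number of bases."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:length])


def compression_ratio(original_size: int, compressed_size: int) -> float:
    """Assumed definition: original size divided by compressed size."""
    return original_size / compressed_size


# Synthetic input: 10,000 uniformly random bases, one ASCII byte each.
random.seed(0)
sequence = "".join(random.choice(BASES) for _ in range(10_000))
raw = sequence.encode("ascii")

results = {
    "run-length": len(run_length_encode(sequence).encode("ascii")),
    "zlib (DEFLATE)": len(zlib.compress(raw, 9)),
    "2-bit packing": len(pack_2bit(sequence)),
}
for name, size in results.items():
    print(f"{name:15s} {size:6d} bytes  ratio {compression_ratio(len(raw), size):.2f}")
```

Each encoder has a matching decoder (or, for `zlib`, `zlib.decompress`), so the no-information-loss property the abstract stresses can be checked by round-tripping the sequence. On random input like this, run-length encoding tends to expand the data because most runs are a single base long, which is one way to see why general-purpose text schemes are a poor fit for DNA.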
---
PDF link:
https://arxiv.org/pdf/2006.02232