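Abstract (translated):
In previous papers we described, with strong experimental support, the organising role that CoHSI (Conservation of Hartley-Shannon Information) plays in determining important global properties of all known proteins, from defining the length distribution to the natural emergence of very long proteins and their relationship to evolutionary time. Here we consider the insight CoHSI might bring to a different problem: the distribution of identical proteins across species. Horizontal and Vertical Gene Transfer (HGT/VGT) both lead to the replication of protein sequences across species through a variety of mechanisms, some of which remain unknown. In contrast, CoHSI predicts from fundamental theory that such systems will show power-law behaviour independently of any mechanism. Using the UniProt database, we show that the global pattern of protein re-use is linear on a log-log plot (adj. $R^{2} = 0.99, p < 2.2 \times 10^{-16}$ over 4 decades), i.e. extremely close to the predicted power law. Specifically, we find that over 6.9 million proteins in TrEMBL 18-02 are re-used, i.e. their sequence appears identically in between 2 and 9,812 species, with re-used proteins ranging in length from 7 to 14,596 amino acids. Using (DL+V) to denote the three domains of life plus viruses, 21,676 proteins are shared between two (DL+V), 22 between three (DL+V), and 5 across all four (DL+V). Although most protein re-use occurs between bacterial species, the most frequently re-used proteins occur disproportionately in viruses, which play a fundamental role in this distribution. These results suggest that the diverse mechanisms of gene transfer (including traditional inheritance) are irrelevant in determining the global distribution of protein re-use.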
---
Title:
CoHSI IV: Unifying Horizontal and Vertical Gene Transfer - is Mechanism Irrelevant?
---
Authors:
Les Hatton, Gregory Warr
---
Year of latest submission:
2018
---
Classification:
Primary category: Quantitative Biology
Secondary category: Other Quantitative Biology
Category description: Work in quantitative biology that does not fit into the other q-bio classifications
---
Abstract (original English):
In previous papers we have described, with strong experimental support, the organising role that CoHSI (Conservation of Hartley-Shannon Information) plays in determining important global properties of all known proteins, from defining the length distribution to the natural emergence of very long proteins and their relationship to evolutionary time. Here we consider the insight that CoHSI might bring to a different problem, the distribution of identical proteins across species. Horizontal and Vertical Gene Transfer (HGT/VGT) both lead to the replication of protein sequences across species through a diversity of mechanisms, some of which remain unknown. In contrast, CoHSI predicts from fundamental theory that such systems will demonstrate power law behavior independently of any mechanisms, and using the UniProt database we show that the global pattern of protein re-use is emphatically linear on a log-log plot (adj. $R^{2} = 0.99, p < 2.2 \times 10^{-16}$ over 4 decades); i.e. it is extremely close to the predicted power law. Specifically, we show that over 6.9 million proteins in TrEMBL 18-02 are re-used, i.e. their sequence appears identically in between 2 and 9,812 species, with re-used proteins varying in length from 7 to as long as 14,596 amino acids. Using (DL+V) to denote the three domains of life plus viruses, 21,676 proteins are shared between two (DL+V); 22 between three (DL+V); and 5 are shared in all four (DL+V). Although the majority of protein re-use occurs between bacterial species, those proteins most frequently re-used occur disproportionately in viruses, which play a fundamental role in this distribution. These results suggest that diverse mechanisms of gene transfer (including traditional inheritance) are irrelevant in determining the global distribution of protein re-use.
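
The abstract's central quantitative claim is that protein re-use is linear on a log-log plot with a high adjusted $R^{2}$. As a rough illustration of what such a check involves, the sketch below fits a straight line to log-transformed counts by ordinary least squares and reports the slope and adjusted $R^{2}$. The data are synthetic and purely illustrative (they are not taken from the paper or from TrEMBL), the names `fit_power_law`, `species_counts` and `protein_counts` are hypothetical, and plain least squares on logs is an assumption here, not necessarily the authors' fitting procedure.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of log10(y) = intercept + slope * log10(x).

    Returns (slope, intercept, adjusted_r_squared).
    """
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n

    # Closed-form simple linear regression on the log-transformed data.
    sxx = sum((x - mean_x) ** 2 for x in lx)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # R^2, then adjusted R^2 for one predictor (n - 2 residual degrees of freedom).
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(lx, ly))
    ss_tot = sum((y - mean_y) ** 2 for y in ly)
    r2 = 1.0 - ss_res / ss_tot
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - 2)
    return slope, intercept, adj_r2


# Synthetic example: an exact power law y = 1e6 * x^-2 (illustrative only).
# x stands for "number of species sharing a sequence",
# y for "number of sequences shared by that many species".
species_counts = [2, 5, 10, 50, 100, 500, 1000, 5000]
protein_counts = [1.0e6 * k ** -2.0 for k in species_counts]

slope, intercept, adj_r2 = fit_power_law(species_counts, protein_counts)
print(f"slope={slope:.3f} intercept={intercept:.3f} adj_R2={adj_r2:.4f}")
```

On real frequency data the points scatter around the line, so the adjusted $R^{2}$ falls below 1; a value near 0.99 over several decades, as the abstract reports, indicates a close fit to a power law.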
---
PDF link:
https://arxiv.org/pdf/1811.02526