Title:
Mining Mass Spectra: Metric Embeddings and Fast Near Neighbor Search
---
Authors:
Debojyoti Dutta, Ting Chen
---
Latest submission year:
2006
---
Classification:
Primary category: Quantitative Biology
Subcategory: Quantitative Methods
Description: All experimental, numerical, statistical and mathematical contributions of value to biology
--
Primary category: Quantitative Biology
Subcategory: Other Quantitative Biology
Description: Work in quantitative biology that does not fit into the other q-bio classifications
--
---
Abstract:
Mining large-scale, high-throughput tandem mass spectrometry data sets is an important problem in mass-spectrometry-based protein identification. A fundamental problem in large-scale mining of spectra is to design appropriate metrics and algorithms that avoid all-pairs comparisons of spectra. In this paper, we present a general framework based on vector spaces to avoid pairwise comparisons. We first robustly embed spectra in a high-dimensional space in a novel fashion and then apply fast approximate near neighbor algorithms to tasks such as constructing filters for database search, indexing, and similarity searching. We formally prove that our embedding has low distortion compared to the cosine similarity, and, along with locality sensitive hashing (LSH), we design filters for database search that can filter out more than 98.9% of peptides (118 times fewer) while missing at most 0.29% of the correct sequences. We then show how our framework can be used in similarity searching, which in turn can be used to detect tight clusters or replicates. On average, for a cluster size of 16 spectra, LSH misses only 1 spectrum and admits only 1 false spectrum. In addition, our framework, in conjunction with dimension reduction techniques, allows us to visualize large data sets in 2D space. Our framework also has the potential to embed and compare data sets with post-translational modifications (PTMs).
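
Illustrative sketch (not from the paper): the abstract describes embedding spectra as vectors and using LSH to avoid all-pairs comparisons. The Python below shows the general idea under loudly stated assumptions. The fixed-width m/z binning in `embed` is a simple stand-in, not the authors' embedding, and the random-hyperplane hash is one standard LSH family for cosine similarity, not necessarily the one used in the paper. All function names, parameters, and peak values are hypothetical.

```python
import random
from collections import defaultdict


def embed(peaks, max_mz=2000.0, bin_width=1.0):
    """Bin (m/z, intensity) peaks into a fixed-length vector.

    Assumption: plain fixed-width binning, used here only as a stand-in
    for the paper's embedding, which the abstract does not specify.
    """
    n_bins = int(max_mz / bin_width) + 1
    vec = [0.0] * n_bins
    for mz, intensity in peaks:
        if 0.0 <= mz <= max_mz:
            vec[int(mz / bin_width)] += intensity
    return vec


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random Gaussian hyperplane normals of dimension dim."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def signature(vec, hyperplanes):
    """One bit per hyperplane: the side of the hyperplane the vector lies on."""
    nonzero = [(i, v) for i, v in enumerate(vec) if v != 0.0]
    return tuple(
        sum(plane[i] * v for i, v in nonzero) >= 0.0 for plane in hyperplanes
    )


def build_index(vectors, hyperplanes):
    """Group spectrum ids by signature so that lookups skip most of the library."""
    index = defaultdict(list)
    for key, vec in vectors.items():
        index[signature(vec, hyperplanes)].append(key)
    return index


def candidates(query_vec, index, hyperplanes):
    """Return only the ids that share the query's bucket (the filter step)."""
    return index.get(signature(query_vec, hyperplanes), [])


# Hypothetical toy library of two spectra.
library = {
    "spec_a": embed([(175.1, 30.0), (304.2, 80.0), (433.2, 55.0)]),
    "spec_b": embed([(120.1, 60.0), (650.4, 20.0), (901.5, 90.0)]),
}
planes = make_hyperplanes(dim=len(library["spec_a"]), n_bits=8)
index = build_index(library, planes)

# The query is spec_a with every intensity doubled, so its cosine
# similarity to spec_a is 1 and it must hash to the same bucket.
query = embed([(175.1, 60.0), (304.2, 160.0), (433.2, 110.0)])
hits = candidates(query, index, planes)
```

Only the spectra returned in `hits` would then need a full similarity computation. A practical filter would typically use several independent hash tables to trade recall against the number of candidates.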
---
PDF link:
https://arxiv.org/pdf/q-bio/0603002