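Abstract (translated):
In many modern applications, including the analysis of gene expression and text documents, the data are noisy, high-dimensional, and unordered: the given order of the variables carries no particular meaning. Successful learning is nevertheless often possible because of sparsity: the data are typically redundant, and their underlying structure can be represented by only a few features. This paper presents treelets, a novel construction of multi-scale bases that extends wavelets to nonsmooth signals. The method is fully adaptive, in that it returns a hierarchical tree and an orthonormal basis, both of which reflect the internal structure of the data. Treelets are especially well suited as a dimensionality reduction and feature selection tool prior to regression and classification when sample sizes are small and the data are sparse, with unknown groupings of correlated or collinear variables. The method is simple to implement and to analyze theoretically. We describe a variety of situations in which treelets perform better than principal component analysis and than some common variable selection and cluster averaging schemes. We illustrate treelets on a blocked covariance model and on several data sets (hyperspectral image data, DNA microarray data, and internet advertisements) with highly complex dependencies between variables.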
---
English title:
Treelets--An adaptive multi-scale basis for sparse unordered data
---
Authors:
Ann B. Lee, Boaz Nadler, Larry Wasserman
---
Year of latest submission:
2008
---
Classification:
Primary category: Statistics
Secondary category: Methodology
Category description: Design, Surveys, Model Selection, Multiple Testing, Multivariate Methods, Signal and Image Processing, Time Series, Smoothing, Spatial Statistics, Survival Analysis, Nonparametric and Semiparametric Methods
---
English abstract:
In many modern applications, including analysis of gene expression and text documents, the data are noisy, high-dimensional, and unordered--with no particular meaning to the given order of the variables. Yet, successful learning is often possible due to sparsity: the fact that the data are typically redundant with underlying structures that can be represented by only a few features. In this paper we present treelets--a novel construction of multi-scale bases that extends wavelets to nonsmooth signals. The method is fully adaptive, as it returns a hierarchical tree and an orthonormal basis which both reflect the internal structure of the data. Treelets are especially well-suited as a dimensionality reduction and feature selection tool prior to regression and classification, in situations where sample sizes are small and the data are sparse with unknown groupings of correlated or collinear variables. The method is also simple to implement and analyze theoretically. Here we describe a variety of situations where treelets perform better than principal component analysis, as well as some common variable selection and cluster averaging schemes. We illustrate treelets on a blocked covariance model and on several data sets (hyperspectral image data, DNA microarray data, and internet advertisements) with highly complex dependencies between variables.
---
PDF link:
https://arxiv.org/pdf/0707.0481