6. Become familiar with the Hadoop sub-projects: HBase, ZooKeeper [32], Hive [33], Mahout, etc. These projects can help you store and access your data, and they scale.
Machine learning is, in the end, closely tied to big data, so the Hadoop sub-projects deserve attention, for example HBase, ZooKeeper, and Hive.
7. Learn about advanced signal processing techniques: feature extraction is one of the most important parts of machine learning. If your features suck, no matter which algorithm you choose, you're going to see horrible performance. Depending on the type of problem you are trying to solve, you may be able to use really cool advanced signal processing algorithms such as wavelets [42], shearlets [43], curvelets [44], contourlets [45], and bandlets [46]. Learn about time-frequency analysis [47], and try to apply it to your problems. If you have not read about Fourier analysis [48] and convolution [49], you will need to learn about those too. The latter two are signal processing 101 stuff, though.
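This part is mainly about feature extraction. Whether the problem is classification or regression, you have to deal with feature selection and extraction. He lists some basic feature-extraction tools such as wavelets, and says you also need to master Fourier analysis, convolution, and so on. I don't know this area well; the gist is that you need to understand signal processing, Fourier analysis and the like.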
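To make the Fourier and convolution basics concrete, here is a minimal sketch in plain Python, coded by hand in the spirit of this post rather than with a library. The function names (`dft`, `convolve`, `spectral_features`), the 64 Hz sample rate, and the two-tone test signal are all assumptions made up for illustration. The naive transform is O(n^2), so it is only meant for small toy inputs.

```python
import cmath
import math


def dft(signal):
    """Naive O(n^2) discrete Fourier transform of a sequence."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]


def convolve(signal, kernel):
    """Full linear convolution of two sequences."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(kernel):
            out[i + j] += s * h
    return out


def spectral_features(signal, sample_rate):
    """Return (dominant frequency in Hz, magnitudes of the non-negative bins)."""
    n = len(signal)
    spectrum = dft(signal)
    magnitudes = [abs(c) / n for c in spectrum[: n // 2 + 1]]
    # Skip bin 0, which only holds the constant (DC) offset.
    peak_bin = max(range(1, len(magnitudes)), key=lambda k: magnitudes[k])
    return peak_bin * sample_rate / n, magnitudes


# Hypothetical test signal: one second of a 5 Hz tone plus a weaker 12 Hz tone.
sample_rate = 64
signal = [
    math.sin(2 * math.pi * 5 * t / sample_rate)
    + 0.5 * math.sin(2 * math.pi * 12 * t / sample_rate)
    for t in range(sample_rate)
]

# A 3-point moving average is just a convolution with a flat kernel.
smoothed = convolve(signal, [1 / 3, 1 / 3, 1 / 3])

dominant_hz, magnitudes = spectral_features(signal, sample_rate)
print("dominant frequency (Hz):", dominant_hz)
```

The magnitudes (or just the few strongest bins) can serve as a feature vector for a classifier, which is the kind of feature extraction the paragraph above is pointing at.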
Finally, practice and read as much as you can. In your free time, read papers like Google MapReduce [34], Google File System [35], Google Bigtable [36], The Unreasonable Effectiveness of Data [37], etc. There are great free machine learning books online, and you should read those too [38][39][40]. Here is an awesome course I found and re-posted on GitHub [41]. Instead of using open source packages, code up your own and compare the results. If you can code an SVM from scratch, you will understand the concepts of support vectors, gamma, cost, hyperplanes, etc. It's easy to just load some data up and start training; the hard part is making sense of it all.
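As a starting point for the "code an SVM from scratch" exercise, here is a rough sketch of a linear SVM trained with stochastic subgradient descent on the hinge loss, in plain Python. It is one simple way to do it, not the method any particular package uses. The function names, the hyperparameter defaults, and the toy data are assumptions for illustration. Note that `cost` is the usual C penalty, while gamma belongs to the RBF kernel and does not appear in a linear model like this one.

```python
import random


def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))


def train_linear_svm(points, labels, cost=1.0, learning_rate=0.01, epochs=200, seed=0):
    """Fit w, b by subgradient descent on 0.5*||w||^2 + cost * hinge loss.

    Labels must be +1 or -1. This is a teaching sketch, not a tuned solver.
    """
    rng = random.Random(seed)
    w = [0.0] * len(points[0])
    b = 0.0
    order = list(range(len(points)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x, y = points[i], labels[i]
            if y * (dot(w, x) + b) < 1:
                # Point is inside the margin or misclassified: hinge term is active.
                w = [wj - learning_rate * (wj - cost * y * xj) for wj, xj in zip(w, x)]
                b += learning_rate * cost * y
            else:
                # Only the regularizer contributes, which shrinks w slightly.
                w = [wj - learning_rate * wj for wj in w]
    return w, b


def predict(w, b, x):
    """Classify by which side of the hyperplane w.x + b = 0 the point lies on."""
    return 1 if dot(w, x) + b >= 0 else -1


# Hypothetical, linearly separable toy data.
points = [
    (2.0, 2.0), (3.0, 1.5), (2.5, 3.0), (1.5, 2.5),
    (-2.0, -2.0), (-3.0, -1.5), (-2.5, -3.0), (-1.5, -2.5),
]
labels = [1, 1, 1, 1, -1, -1, -1, -1]

w, b = train_linear_svm(points, labels)
predictions = [predict(w, b, x) for x in points]
# Functional margins; the points with the smallest values sit closest to the
# hyperplane and play the role of support vectors.
margins = [y * (dot(w, x) + b) for x, y in zip(points, labels)]
print("w =", w, "b =", b)
```

Comparing `w`, `b`, and the margins against what an established package gives on the same data is a good way to check your understanding of the cost parameter.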
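In short, getting started with machine learning comes down to two things:
One is studying the algorithms, which takes a very strong mathematical foundation (really, very strong); start with SVM and work through it bit by bit.
The other is learning the tools, such as distributed-computing tools and Unix.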
Good luck.
Best wishes.
[1] http://radar.oreilly.com/2011/04...
[2] NumPy — Numpy
[3] The R Project for Statistical Computing
[4] Welcome to Apache™ Hadoop®!
[5] http://hadoop.apache.org/common/...
[6] http://en.wikipedia.org/wiki/Nai...
[7] http://en.wikipedia.org/wiki/Mix...
[8] http://en.wikipedia.org/wiki/Hid...
[9] http://en.wikipedia.org/wiki/Mea...
[10] http://en.wikipedia.org/wiki/Sup...
[11] http://en.wikipedia.org/wiki/Con...
[12] http://en.wikipedia.org/wiki/Gra...
[13] http://en.wikipedia.org/wiki/Qua...
[14] http://en.wikipedia.org/wiki/Lag...
[15] http://en.wikipedia.org/wiki/Par...
[16] http://en.wikipedia.org/wiki/Sum...
[17] http://radar.oreilly.com/2010/06...
[18] AWS | Amazon Elastic Compute Cloud (EC2)
[19] http://en.wikipedia.org/wiki/Goo...
[20] Apache Mahout: Scalable machine learning and data mining
[21] http://incubator.apache.org/whirr/
[22] http://en.wikipedia.org/wiki/Map...
[23] HBase - Apache HBase Home
[24] http://en.wikipedia.org/wiki/Cat...
[25] grep
[26] http://en.wikipedia.org/wiki/Find
[27] AWK
[28] sed
[29] http://en.wikipedia.org/wiki/Sor...
[30] http://en.wikipedia.org/wiki/Cut...
[31] http://en.wikipedia.org/wiki/Tr_...
[32] Apache ZooKeeper
[33] Apache Hive TM
[34] http://static.googleusercontent....
[35] http://static.googleusercontent....
[36] http://static.googleusercontent....
[37] http://static.googleusercontent....
[38] http://www.ics.uci.edu/~welling/...
[39] http://www.stanford.edu/~hastie/...
[40] http://infolab.stanford.edu/~ull...
[41] https://github.com/josephmisiti/...
[42] http://en.wikipedia.org/wiki/Wav...
[43] http://www.shearlet.uni-osnabrue...
[44] http://math.mit.edu/icg/papers/F...
[45] http://www.ifp.illinois.edu/~min...
[46] http://www.cmap.polytechnique.fr...
[47] http://en.wikipedia.org/wiki/Tim...
[48] http://en.wikipedia.org/wiki/Fou...
[49] http://en.wikipedia.org/wiki/Con...