2017-02-19


Automated Machine Learning (AutoML) has become a topic of considerable interest over the past year. A recent KDnuggets blog competition focused on this topic, resulting in a handful of interesting ideas and projects. Several AutoML tools have been generating notable interest and gaining respect and notoriety in this time frame as well.

This post will provide a brief explanation of AutoML, argue for its justification and adoption, present a pair of contemporary tools for its pursuit, and discuss AutoML's anticipated future and direction.

What is Automated Machine Learning?


We can talk about what automated machine learning is, and we can talk about what automated machine learning is not.

AutoML is not automated data science. While there is undoubtedly overlap, machine learning is but one of many tools in the data science toolkit, and its use does not actually factor into all data science tasks. For example, if prediction will be part of a given data science task, machine learning will be a useful component; however, machine learning may not play into a descriptive analytics task at all.

Even for predictive tasks, data science encompasses much more than the actual predictive modeling. Data scientist Sandro Saitta, when discussing the potential confusion between AutoML and automated data science, had this to say:

The misconception comes from the confusion between the whole Data Science process (see for example CRISP-DM) and the sub-tasks of data preparation (feature extraction, etc.) and modeling (algorithm selection, hyper-parameters tuning, etc.) which I call Machine Learning.

[...]

When you read news about tools that automate Data Science and Data Science competitions, people with no industry experience may be confused and think that Data Science is only modeling and can be fully automated.

He is absolutely correct, and it's not just a matter of semantics. If you want (need?) more clarification on the relationship between machine learning and data science (and several other related concepts), read this.

Further, data scientist and leading automated machine learning proponent Randy Olson states that effective machine learning design requires us to:

  • Always tune the hyperparameters for our models
  • Always try out many different models
  • Always explore numerous feature representations for our data

Taking all of the above into account, if we consider AutoML to be the tasks of algorithm selection, hyperparameter tuning, iterative modeling, and model assessment, we can start to define what AutoML actually is. There will not be total agreement on this definition (for comparison, ask 10 people to define "data science," and then compare the 11 answers you get), but it arguably starts us off on the right foot.

Why Do We Need It?


While we are done with defining concepts, as an exercise in considering why AutoML may be beneficial, let's have a look at why machine learning is hard.



[Figure omitted. Credit: S. Zayd Enam]

AI Researcher and Stanford University PhD candidate S. Zayd Enam, in a fantastic blog post titled "Why is machine learning 'hard'?," recently wrote the following (emphasis added):

[M]achine learning remains a relatively ‘hard’ problem. There is no doubt the science of advancing machine learning algorithms through research is difficult. It requires creativity, experimentation and tenacity. Machine learning remains a hard problem when implementing existing algorithms and models to work well for your new application.

Note that, while Enam is primarily referring to machine learning research, he also touches on the implementation of existing algorithms in use cases (see emphasis).

Enam goes on to elaborate on the difficulties of machine learning, and focuses on the nature of algorithms (again, emphasis added):

An aspect of this difficulty involves building an intuition for what tool should be leveraged to solve a problem. This requires being aware of available algorithms and models and the trade-offs and constraints of each one.

[...]

The difficulty is that machine learning is a fundamentally hard debugging problem. Debugging for machine learning happens in two cases: 1) your algorithm doesn't work or 2) your algorithm doesn't work well enough.[...] Very rarely does an algorithm work the first time and so this ends up being where the majority of time is spent in building algorithms.

Enam then eloquently elaborates this framed problem from the algorithm research point of view. Again, however, what he says applies to... well, applying algorithms. If an algorithm does not work, or does not do so well enough, and the process of choosing and refining becomes iterative, this exposes an opportunity for automation, hence automated machine learning.

I have previously attempted to capture AutoML's essence as follows:

If, as Sebastian Raschka has described it, computer programming is about automation, and machine learning is "all about automating automation," then automated machine learning is "the automation of automating automation." Follow me, here: programming relieves us by managing rote tasks; machine learning allows computers to learn how to best perform these rote tasks; automated machine learning allows for computers to learn how to optimize the outcome of learning how to perform these rote actions.

This is a very powerful idea; while we previously have had to worry about tuning parameters and hyperparameters, automated machine learning systems can learn the best way to tune these for optimal outcomes by a number of different possible methods.

The rationale for AutoML stems from this idea: if numerous machine learning models must be built, using a variety of algorithms and a number of differing hyperparameter configurations, then this model building can be automated, as can the comparison of model performance and accuracy.

Simple, right?
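To make the rationale above concrete, here is a minimal sketch of the loop that AutoML automates: build several candidate models, score each on held-out data, and keep the best. Everything in it is an illustrative assumption rather than part of any real tool: the toy 1-D data set, the two hand-written "algorithms" (`threshold_model`, `knn_model`), and the fixed `CANDIDATES` list are hypothetical stand-ins for real learners and a real search strategy.

```python
import random

def make_data(n, rng):
    """Toy 1-D binary data: label is 1 when x > 0.6, with 10% label noise."""
    data = []
    for _ in range(n):
        x = rng.random()
        label = int(x > 0.6)
        if rng.random() < 0.1:
            label = 1 - label
        data.append((x, label))
    return data

def threshold_model(train, threshold):
    """Hypothetical algorithm 1: a fixed cut-off (does not use the training data)."""
    return lambda x: int(x > threshold)

def knn_model(train, k):
    """Hypothetical algorithm 2: majority vote among the k nearest training points."""
    def predict(x):
        nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
        votes = sum(label for _, label in nearest)
        return int(2 * votes > k)
    return predict

def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)

# Each candidate is (algorithm name, builder, hyperparameter configuration).
CANDIDATES = [
    ("threshold", threshold_model, {"threshold": t}) for t in (0.3, 0.5, 0.7)
] + [
    ("knn", knn_model, {"k": k}) for k in (1, 5, 15)
]

def select_model(train, valid):
    """Build every candidate, score it on validation data, return the best."""
    results = []
    for name, build, params in CANDIDATES:
        model = build(train, **params)
        results.append((accuracy(model, valid), name, params))
    best = max(results, key=lambda r: r[0])
    return best, results

if __name__ == "__main__":
    rng = random.Random(0)
    train, valid = make_data(200, rng), make_data(100, rng)
    best, results = select_model(train, valid)
    for score, name, params in results:
        print(f"{name:10s} {params} -> {score:.2f}")
    print("selected:", best)
```

This exhaustive loop is the naive baseline. The tools discussed below replace the fixed candidate list with a smarter search over algorithms and hyperparameters.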

A Comparison of Select Automated Machine Learning Tools


Now that we know what AutoML is, and why we would use it... how do we do it? The following is an overview and comparison of a pair of contemporary Python AutoML tools which take different approaches in an attempt to achieve more or less the same goal, that of automating the machine learning process.

Auto-sklearn

Auto-sklearn is "an automated machine learning toolkit and a drop-in replacement for a scikit-learn estimator." It also happens to be the winner of KDnuggets' recent automated data science and machine learning blog contest.

auto-sklearn frees a machine learning user from algorithm selection and hyperparameter tuning. It leverages recent advantages in Bayesian optimization, meta-learning and ensemble construction. Learn more about the technology behind auto-sklearn by reading this paper published at the NIPS 2015.

As the above excerpt from the project's documentation notes, Auto-sklearn performs hyperparameter optimization by way of Bayesian optimization, which proceeds by iterating the following steps:

  • Build a probabilistic model to capture the relationship between hyperparameter settings and their performance
  • Use the model to select useful hyperparameter settings to try next by trading off exploration (searching in parts of the space where the model is uncertain) and exploitation (focussing on parts of the space predicted to perform well)
  • Run the machine learning algorithm with those hyperparameter settings
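The three steps above can be sketched as a small loop. This is not Auto-sklearn's implementation: as a simplifying assumption it swaps in a tiny Gaussian-process surrogate with an upper-confidence-bound rule for choosing the next point, tunes a single hyperparameter on a grid over [0, 1], and uses a made-up `validation_score` function in place of actually training a model. All function names and constants here are hypothetical.

```python
import math

def validation_score(x):
    """Hypothetical stand-in for 'train with hyperparameter x, return a score'."""
    return math.exp(-((x - 0.7) ** 2) / 0.02) + 0.3 * math.exp(-((x - 0.2) ** 2) / 0.01)

def rbf(a, b, length=0.15):
    """Squared-exponential kernel: nearby settings get similar scores."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = s / M[r][r]
    return x

def posterior(xs, ys, x_new, noise=1e-4):
    """Step 1: probabilistic model. Returns predicted mean and std at x_new."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    k_star = [rbf(a, x_new) for a in xs]
    mean_y = sum(ys) / len(ys)
    alpha = solve(K, [y - mean_y for y in ys])
    mean = mean_y + sum(k * a for k, a in zip(k_star, alpha))
    v = solve(K, k_star)
    var = max(rbf(x_new, x_new) - sum(k * w for k, w in zip(k_star, v)), 0.0)
    return mean, math.sqrt(var)

def bayesian_optimize(objective, n_iter=12, kappa=2.0):
    xs = [0.0, 0.5, 1.0]                      # initial settings to seed the model
    ys = [objective(x) for x in xs]
    grid = [i / 100 for i in range(101)]
    for _ in range(n_iter):
        untried = [x for x in grid if x not in xs]

        def ucb(x):
            # Step 2: exploitation (high mean) plus exploration (high uncertainty).
            mean, std = posterior(xs, ys, x)
            return mean + kappa * std

        x_next = max(untried, key=ucb)
        xs.append(x_next)
        ys.append(objective(x_next))          # Step 3: run with the chosen setting
    best = max(range(len(xs)), key=lambda i: ys[i])
    return xs[best], ys[best], xs

if __name__ == "__main__":
    best_x, best_y, tried = bayesian_optimize(validation_score)
    print(f"best setting {best_x:.2f} with score {best_y:.3f} after {len(tried)} runs")
```

The `kappa` parameter controls the trade-off: larger values favor uncertain regions (exploration), smaller values favor regions the model already predicts to be good (exploitation).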

Further explanation of how this process plays out follows:

This process can be generalized to jointly select algorithms, preprocessing methods, and their hyperparameters as follows: the choices of classifier / regressor and preprocessing methods are top-level, categorical hyperparameters, and based on their settings the hyperparameters of the selected methods become active. The combined space can then be searched with Bayesian optimization methods that handle such high-dimensional, conditional spaces; we use the random-forest-based SMAC, which has been shown to work best for such cases.
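A rough illustration of such a conditional space follows. The structure, names, and ranges in `SEARCH_SPACE` are invented for this sketch and are not Auto-sklearn's actual configuration space. Random sampling is used only to show how a top-level categorical choice activates its own hyperparameters; an optimizer such as SMAC would choose configurations from the same kind of space in a guided way.

```python
import random

# Hypothetical search space: each top-level slot is a categorical choice,
# and each choice carries the hyperparameters that become active with it.
SEARCH_SPACE = {
    "classifier": {
        "random_forest": {"n_estimators": (10, 500), "max_depth": (2, 20)},
        "svm": {"C": (0.01, 100.0), "gamma": (1e-4, 1.0)},
    },
    "preprocessor": {
        "none": {},
        "pca": {"n_components": (2, 50)},
    },
}

def sample_configuration(space, rng):
    """Draw one configuration; only the chosen methods' hyperparameters appear."""
    config = {}
    for slot, choices in space.items():
        choice = rng.choice(sorted(choices))      # top-level categorical hyperparameter
        config[slot] = choice
        for name, (low, high) in choices[choice].items():
            if isinstance(low, int) and isinstance(high, int):
                value = rng.randint(low, high)    # integer-valued range
            else:
                value = rng.uniform(low, high)    # real-valued range
            config[f"{slot}:{choice}:{name}"] = value
    return config

if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(sample_configuration(SEARCH_SPACE, rng))
```

A configuration that selects `svm` never contains `n_estimators`, which is what "conditional" means here: the dimensionality of the active space depends on the top-level choices.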




All replies
2017-2-19 13:20:33
Taking a look.

2017-2-19 17:51:23
Thanks for sharing.

2017-2-23 19:54:13
MATLAB Machine Learning

2017-3-21 02:59:58
Thank you.

2019-6-4 09:09:14
Taking a look.