3. Supervised Training Methods
Historically, deep neural networks were known to be difficult to train using standard random initialization and gradient descent. However, new algorithms for initializing and training deep neural networks proposed over the last decade have produced remarkable successes. Research continues in this area to better understand existing training methods and to improve them.
Dropout distillation
Rota Bulò, S., Porzi, L., & Kontschieder, P. (2016)
Dropout is a regularization technique that was proposed to prevent neural networks from overfitting. It randomly drops units from the network during training by setting their outputs to zero, thus reducing co-adaptation among the units. This procedure implicitly trains an ensemble of exponentially many smaller networks that share the same parameters. The predictions of these networks must then be averaged at test time, which is unfortunately intractable to compute exactly, but the averaging can be approximated by scaling the weights of a single network.
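To make the train/test asymmetry concrete, here is a minimal NumPy sketch of a single dropped-out layer; the keep probability, layer sizes, and function name are illustrative and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(x, W, p_keep, train=True):
    """One fully connected ReLU layer with dropout on its inputs.

    During training each input unit is kept with probability p_keep;
    at test time the weight-scaling rule multiplies activations by
    p_keep instead, approximating the average over all sub-networks.
    """
    if train:
        mask = rng.random(x.shape) < p_keep      # sample a random sub-network
        return np.maximum(0.0, (x * mask) @ W)
    # deterministic approximation of the exponential ensemble
    return np.maximum(0.0, (x * p_keep) @ W)

# toy usage: the same layer is stochastic in training, deterministic at test time
x = rng.standard_normal((4, 8))
W = rng.standard_normal((8, 3))
train_out = dropout_forward(x, W, p_keep=0.5, train=True)
test_out = dropout_forward(x, W, p_keep=0.5, train=False)
```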
However, this approximation may not be sufficiently accurate in all cases. The authors introduce a better approximation method, called dropout distillation, which uses stochastic gradient descent to find a predictor with minimal divergence from the ideal ensemble predictor. The distillation procedure can even be applied to networks already trained with dropout by exploiting unlabeled data. Their results on benchmark problems show consistent improvements over standard dropout.
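The sketch below illustrates the distillation idea in a deliberately simplified setting: a linear softmax "student" is fit with SGD to match a Monte Carlo estimate of the dropout ensemble's predictions on unlabeled inputs. The network shapes, sample counts, and loss choice are assumptions for illustration, not the authors' exact formulation.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def mc_dropout_teacher(x, W, p_keep, n_samples=50):
    """Monte Carlo estimate of the ideal dropout-ensemble prediction."""
    preds = np.zeros((x.shape[0], W.shape[1]))
    for _ in range(n_samples):
        mask = rng.random(x.shape) < p_keep
        preds += softmax((x * mask) @ W)
    return preds / n_samples

# unlabeled data is enough: the target is the teacher's own averaged prediction
X = rng.standard_normal((64, 8))
W_teacher = rng.standard_normal((8, 3))   # weights assumed already trained with dropout
W_student = W_teacher * 0.5               # start from the weight-scaled approximation
lr, p_keep = 0.1, 0.5

for _ in range(200):
    target = mc_dropout_teacher(X, W_teacher, p_keep)   # fresh Monte Carlo estimate
    student = softmax(X @ W_student)
    # gradient of the cross-entropy between teacher and student for a softmax-linear student
    grad = X.T @ (student - target) / X.shape[0]
    W_student -= lr * grad
```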
Normalization Propagation: A Parametric Technique for Removing Internal Covariate Shift in Deep Networks
Arpit, D., Zhou, Y., Kota, B., & Govindaraju, V. (2016)
One of the difficulties of training deep neural networks is that the distribution of input activations to each hidden layer may shift during training. One way to address this problem, known as internal covariate shift, is to normalize the input activations of each hidden layer using the Batch Normalization (BN) technique. However, BN has two drawbacks: (1) its estimates of the mean and standard deviation of the input activations are inaccurate, especially during the initial iterations, because they are computed from mini-batches of training data, and (2) it cannot be used with a batch size of one. To address these drawbacks, the authors introduce normalization propagation, which relies on a data-independent, closed-form estimate of the mean and standard deviation at every layer. It is based on the observation that the pre-activation values of ReLUs in deep networks follow a Gaussian distribution. The normalization property can then be forward-propagated to all hidden layers during training. The authors show that their method achieves better convergence stability than BN during training. It is also faster because it does not have to compute a running estimate of the mean and standard deviation of the hidden layer activations.
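The closed-form constants at the heart of this idea follow from the moments of a rectified standard Gaussian. The rough NumPy sketch below shows how such data-independent constants can standardize each layer without any batch statistics; it omits the paper's trainable scale and bias and its correction terms, and the layer sizes are made up.

```python
import numpy as np

rng = np.random.default_rng(2)

# Closed-form moments of ReLU(z) when the pre-activation z is standard normal:
RELU_MEAN = 1.0 / np.sqrt(2.0 * np.pi)           # E[max(0, z)]
RELU_STD = np.sqrt(0.5 * (1.0 - 1.0 / np.pi))    # sqrt(Var[max(0, z)])

def normprop_layer(x, W):
    """One hidden layer normalized with data-independent constants.

    Dividing each pre-activation by the L2 norm of its weight row keeps it
    approximately unit-Gaussian when the layer input is; the fixed ReLU
    moments then re-standardize the output without batch statistics.
    """
    pre = x @ W.T / np.linalg.norm(W, axis=1)    # roughly N(0, 1) per unit
    return (np.maximum(0.0, pre) - RELU_MEAN) / RELU_STD

# toy check: the normalization property propagates through stacked layers
x = rng.standard_normal((10000, 32))             # whitened input
h = normprop_layer(x, rng.standard_normal((64, 32)))
h2 = normprop_layer(h, rng.standard_normal((64, 64)))
print(h2.mean(), h2.std())                       # should stay roughly near 0 and 1
```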
Scalable Gradient-Based Tuning of Continuous Regularization Hyperparameters
Luketina, J., Raiko, T., Berglund, M., & Greff, K. (2016)
Tuning hyperparameters is often necessary to get good results with deep neural networks. Typically, the tuning is performed by manual trial-and-error or by grid or random search, with candidate settings evaluated on a validation set. The authors propose a gradient-based method for finding good regularization hyperparameters that is less tedious and less computationally expensive.
Unlike previous methods, their method is simple and computationally lightweight, and it updates both the hyperparameters and the model parameters with stochastic gradient descent in the same training run. The hyperparameter gradient is obtained from the cost of the unregularized model on the validation set. Although the authors show that their method is effective at finding good regularization hyperparameters, it has not yet been extended to common training techniques such as dropout regularization and learning-rate adaptation.
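As a rough illustration of the idea (not the authors' exact derivation), the toy ridge-regression example below takes one SGD step on the regularized training cost and then nudges the regularization strength using the gradient of the unregularized validation cost back-propagated through that step. All names, sizes, and step sizes are invented for the sketch.

```python
import numpy as np

rng = np.random.default_rng(3)

# toy ridge-regression setup: learn the weights theta and the L2 strength lam jointly
X_tr, X_val = rng.standard_normal((200, 10)), rng.standard_normal((100, 10))
true_w = rng.standard_normal(10)
y_tr = X_tr @ true_w + 0.5 * rng.standard_normal(200)
y_val = X_val @ true_w + 0.5 * rng.standard_normal(100)

theta = np.zeros(10)
lam, lr, lr_hyper = 0.1, 0.01, 0.001

for step in range(2000):
    # (1) parameter step on the *regularized* training cost
    grad_train = X_tr.T @ (X_tr @ theta - y_tr) / len(y_tr) + lam * theta
    theta_new = theta - lr * grad_train

    # (2) hyperparameter step: gradient of the *unregularized* validation cost,
    # back-propagated through the update above (d theta_new / d lam = -lr * theta)
    grad_val = X_val.T @ (X_val @ theta_new - y_val) / len(y_val)
    lam -= lr_hyper * grad_val @ (-lr * theta)
    lam = max(lam, 0.0)                  # keep the regularization strength non-negative

    theta = theta_new
```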
4. Deep Reinforcement Learning
The researchers at DeepMind extended the breakthrough successes of deep learning in supervised tasks to the challenging reinforcement learning domain of playing Atari 2600 games. Their basic idea was to leverage the demonstrated ability of deep learning to extract high-level features from raw high-dimensional data by training a deep convolutional network. However, reinforcement learning tasks such as playing games do not come with training data that are labeled with the correct move for each turn.
Instead, they are characterized by sparse, noisy, and delayed reward signals. Furthermore, training data are typically correlated and non-stationary. They overcame these challenges using stochastic gradient descent and experience replay to stabilize learning, essentially jump-starting the field of deep reinforcement learning.
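For readers unfamiliar with experience replay, the minimal sketch below shows the kind of fixed-size transition buffer involved: storing past transitions and sampling random mini-batches breaks the temporal correlation of consecutive game frames. The class name, capacity, and batch size are illustrative.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of past transitions for mini-batch sampling."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are evicted first

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # uniform random sampling de-correlates consecutive experiences
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones
```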
Asynchronous Methods for Deep Reinforcement Learning
Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., Kavukcuoglu (2016)
The experience replay technique stabilizes learning by making it possible to batch the training data or sample it randomly. However, it requires more memory and computation and applies only to off-policy learning algorithms such as Q-learning. In this paper, the authors introduce a new method based on asynchronously executing multiple agents on different instances of the environment. The resulting parallel algorithm effectively de-correlates the training data and makes it more stationary. Moreover, it makes it possible to extend deep learning to on-policy reinforcement learning algorithms such as SARSA and actor-critic methods. Their method, combined with the actor-critic algorithm, improved upon previous results on the Atari domain while using far fewer computational resources.
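The toy sketch below illustrates only the parallelism pattern, not the authors' actor-critic algorithm: several threads, each with its own environment instance, apply lock-free gradient updates to shared parameters, so no replay buffer is needed. The environment, the linear value model, and the learning rates are all invented for illustration.

```python
import threading
import numpy as np

# Shared parameters for a tiny linear value model; workers update them
# asynchronously, each interacting with its own environment instance.
shared_w = np.zeros(4)

class ToyEnv:
    """Stand-in environment: random features, return is a fixed linear function."""
    def __init__(self, seed):
        self.rng = np.random.default_rng(seed)
        self.true_w = np.array([1.0, -2.0, 0.5, 3.0])
    def step(self):
        s = self.rng.standard_normal(4)
        return s, float(s @ self.true_w)        # (state features, observed return)

def worker(seed, steps=5000, lr=0.01):
    env = ToyEnv(seed)                          # each thread gets its own instance
    for _ in range(steps):
        s, ret = env.step()
        # local gradient of the squared value-prediction error
        grad = (shared_w @ s - ret) * s
        shared_w[:] -= lr * grad                # lock-free update of the shared weights

threads = [threading.Thread(target=worker, args=(seed,)) for seed in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(shared_w)                                 # should approach [1, -2, 0.5, 3]
```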
Dueling Network Architectures for Deep Reinforcement Learning
Wang, Z., Schaul, T., Hessel, M., van Hasselt, H., Lanctot, M., & de Freitas, N. (2016)
This work, which won the Best Paper award, introduces a new neural network architecture that complements the algorithmic advances in deep Q-learning networks (DQN) and experience replay. The authors point out that the value of an action choice from a given state needs to be estimated only if that action has a consequence on what happens next. The dueling network architecture leverages this observation by inserting two parallel streams of fully connected layers after the final convolutional layer of a regular DQN. One stream estimates the state-value function while the other estimates the state-dependent advantage of each action. The output module of the network combines the activations of the two streams to produce a Q-value for each action. This architecture learns state-value functions more efficiently and produces better policy evaluations when actions have similar values or the number of actions is large.
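The aggregation described above can be written as Q(s, a) = V(s) + (A(s, a) - mean over a' of A(s, a')); subtracting the mean advantage keeps the two streams identifiable. The NumPy sketch below shows this combination, with single linear layers standing in for the paper's fully connected streams and with made-up sizes.

```python
import numpy as np

rng = np.random.default_rng(4)

def dueling_head(features, W_V, W_A):
    """Combine the value and advantage streams into Q-values.

    features : (batch, d) output of the shared convolutional trunk
    W_V      : (d, 1)          value-stream weights (simplified to one linear layer)
    W_A      : (d, n_actions)  advantage-stream weights (likewise simplified)

    Q(s, a) = V(s) + (A(s, a) - mean_a' A(s, a')), so the network can learn
    the state value independently of which action is evaluated.
    """
    V = features @ W_V                          # (batch, 1)
    A = features @ W_A                          # (batch, n_actions)
    return V + (A - A.mean(axis=1, keepdims=True))

# toy usage with illustrative sizes (e.g. 18 Atari actions)
phi = rng.standard_normal((32, 512))
Q = dueling_head(phi, rng.standard_normal((512, 1)), rng.standard_normal((512, 18)))
```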
Opponent Modeling in Deep Reinforcement Learning
He, H., Boyd-Graber, J., Kwok, K., & Daumé III, H. (2016)
The authors introduce an extension of the deep Q-network (DQN) called Deep Reinforcement Opponent Network (DRON) for multi-agent settings, where the action outcome of the agent being controlled depends on the actions of the other agents (opponents). If the opponents use fixed policies, then standard Q-learning is sufficient.
However, opponents' policies become non-stationary when the opponents learn and adapt their strategies over time. In this scenario, treating the opponents as part of the world in a standard Q-learning setup masks changes in opponent behavior. Therefore, the joint policy of the opponents must be taken into consideration when defining the Q-function. The DRON architecture implements this idea by employing an opponent network to learn opponent policies and a Q-network to evaluate actions for a state. The outputs of the two networks are combined using a Mixture-of-Experts network [13] to obtain the expected Q-value. DRON outperformed DQN in simulated soccer and a trivia game by discovering different strategy patterns of opponents.
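The sketch below shows one way such a Mixture-of-Experts combination can look: the Q-network provides several expert Q-value heads, the opponent network provides a gating distribution over them, and the gate-weighted sum gives the expected Q-value. The linear heads, feature sizes, and number of experts are simplifications for illustration rather than the paper's exact networks.

```python
import numpy as np

rng = np.random.default_rng(5)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def dron_moe(state_feat, opp_feat, W_experts, W_gate):
    """Mixture-of-Experts combination of a Q-network and an opponent network.

    state_feat : (batch, d_s)             features of the game state
    opp_feat   : (batch, d_o)             features describing recent opponent behavior
    W_experts  : (K, d_s, n_actions)      one Q-value head per expert
    W_gate     : (d_o, K)                 gating weights from the opponent features

    The gate softly selects which expert's Q-values to trust, so the value
    estimate adapts to the opponent strategy the agent believes it is facing.
    """
    experts = np.einsum('bd,kdn->bkn', state_feat, W_experts)   # (batch, K, n_actions)
    gate = softmax(opp_feat @ W_gate)                           # (batch, K)
    return np.einsum('bk,bkn->bn', gate, experts)               # expected Q-values

# toy usage with illustrative sizes: 4 experts, 5 actions
Q = dron_moe(rng.standard_normal((8, 64)), rng.standard_normal((8, 16)),
             rng.standard_normal((4, 64, 5)), rng.standard_normal((16, 4)))
```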
Conclusions
Deep learning is experiencing a phase of rapid growth due to its strong performance in a number of domains, producing state-of-the-art results and winning machine learning competitions. However, these successes have also contributed to a fair amount of hype. The papers presented at ICML 2016 provided an unvarnished view of a vibrant field in which researchers are working actively to make deep learning techniques more powerful and to extend their successes to other domains and larger problems.