2016-07-02

This post discusses three particularly impactful tutorial sessions from the recent ICML 2016 conference held in New York. Check out some innovative ideas on Deep Residual Networks, Memory Networks for Language Understanding, and Non-Convex Optimization.

By Robert Dionne, init.ai.

The International Conference on Machine Learning (ICML) is the leading international academic conference in machine learning, attracting 2,000+ participants. This year it was held in NYC, and I attended on behalf of Init.ai. Three of the tutorial sessions I attended were quite impactful. Anyone working on conversational apps, chatbots, or deep learning should find these topics interesting.

  • Deep Residual Networks: Deep Learning Gets Way Deeper by Kaiming He (slides)
  • Memory Networks for Language Understanding by Jason Weston (slides)
  • Recent Advances in Non-Convex Optimization and its Implications to Learning by Anima Anandkumar (slides)
Deep Residual Networks


I’ve written before about Residual Neural Network research, but listening to Kaiming He in person was informative. In the talk, he described the motivations for increasing the depth of neural networks, demonstrated the obstacles to increasing depth, and presented initial solutions. He then showed how residual networks increase accuracy with depth beyond what those initial solutions allow. Moreover, he justified using identity mappings in both the shortcut connection and the post-addition operation. Finally, he gave empirical results that ResNets yield representations that generalize to many problems.

Kaiming showed how deeper neural networks had won recent ImageNet competitions, yet extending them beyond a depth of about twenty layers decreases performance.
A few techniques are enough to get that far: careful weight initialization and batch normalization enable networks to train beyond ten layers.

Weight Initialization


Weight initialization reduces vanishing and exploding behavior in the forward and backward signals. For healthy propagation, the product of all layers’ scaled variances should be constant; thus, one sets each layer’s scaled variance to one. For a linear activation, one can use:
$$ n_l \,\mathrm{Var}[w_l] = 1, \quad \forall\, l $$

From slide 19.

For a rectified-linear (ReLU) activation, one can use:

$$ \tfrac{1}{2}\, n_l \,\mathrm{Var}[w_l] = 1, \quad \forall\, l $$

From slide 20.

For a rectified-linear network with 22 layers, initializing with the second equation converges faster. The same network with 30 layers requires the second form to progress at all. The second form makes sense because ReLU drops half of the input space.


From slide 21.
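As a rough illustration of the two rules, here is a minimal NumPy sketch; the layer sizes and the 22-layer loop are invented for the example, and only the variance formulas come from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_weights(fan_in, fan_out, relu=True):
    """Initialize a weight matrix so its scaled variance is one.

    Linear activation: Var[w] = 1 / fan_in   (n_l Var[w_l] = 1, slide 19)
    ReLU activation:   Var[w] = 2 / fan_in   ((1/2) n_l Var[w_l] = 1, slide 20),
    compensating for the half of the input space that ReLU discards.
    """
    std = np.sqrt((2.0 if relu else 1.0) / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

# Push a signal through a 22-layer ReLU stack: with the ReLU-corrected
# variance, the activation scale stays roughly constant instead of decaying
# toward zero layer after layer.
x = rng.normal(size=(64, 256))
for _ in range(22):
    x = np.maximum(0.0, x @ init_weights(256, 256, relu=True))
print(float(x.std()))
```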
Batch Normalization


Batch normalization rescales each layer for each minibatch. It reduces the training’s sensitivity to initial weights. For each layer and minibatch, one calculates the mean and standard deviation of inputs x. Then the layer rescales its input and applies a (component-wise) linear transformation with parameters γ and β.
$$ \hat{x} = \frac{x - \mu}{\sigma}, \qquad y = \gamma\, \hat{x} + \beta $$

From slide 23.
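A minimal NumPy sketch of the per-minibatch computation (training-time statistics only; the running averages used at test time are omitted):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the minibatch, then rescale.

    x:     (batch, features) minibatch of layer inputs
    gamma: (features,) learned scale
    beta:  (features,) learned shift
    """
    mu = x.mean(axis=0)                    # per-feature mean over the batch
    var = x.var(axis=0)                    # per-feature variance over the batch
    x_hat = (x - mu) / np.sqrt(var + eps)  # zero mean, unit variance
    return gamma * x_hat + beta            # component-wise linear transform

x = np.random.default_rng(0).normal(loc=3.0, scale=5.0, size=(32, 4))
y = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))  # ≈ 0 and ≈ 1
```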

Despite these techniques, increasing depth by another order of magnitude decreases performance. Yet, by construction, one can trivially add identity layers to get a deeper net with the same accuracy.

Residual learning bypasses this barrier and improves accuracy with more layers.


From slide 37.
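The core idea of a residual unit, y = activation(x + F(x)), can be sketched in a few lines of NumPy. Fully connected layers stand in for the convolutional layers of a real ResNet, and batch normalization is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(0.0, z)

def residual_block(x, w1, w2):
    """Original (post-activation) residual unit: y = ReLU(x + F(x)).

    F is a small two-layer branch; the shortcut is the identity, so a block
    whose branch outputs zero simply passes its input through unchanged.
    """
    f = relu(x @ w1) @ w2   # residual branch F(x)
    return relu(x + f)      # identity shortcut, ReLU after the addition

dim = 64
x = relu(rng.normal(size=(8, dim)))
for _ in range(50):
    w1 = rng.normal(0.0, np.sqrt(2.0 / dim), size=(dim, dim))
    w2 = np.zeros((dim, dim))  # branch starts at zero, so each block is an identity map
    x = residual_block(x, w1, w2)
print(float(x.std()))          # the signal survives 50 stacked blocks untouched
```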

To deepen networks another 10x, to 1,000 layers, He replaced the after-addition mapping with the identity function. Traditional ResNets used ReLU after the addition; deeper ResNets use the identity. He showed that several otherwise reasonable post-addition activation functions result in multiplicative behavior and reduce performance.
The identity activation smoothly propagates the signal from any earlier layer l to the L-th layer:
$$ x_L = x_l + \sum_{i=l}^{L-1} F(x_i, W_i) $$

From slide 52.

Similarly, it smoothly propagates the error from the L-th layer back to any earlier layer l:
$$ \frac{\partial \mathcal{E}}{\partial x_l} = \frac{\partial \mathcal{E}}{\partial x_L} \left( 1 + \frac{\partial}{\partial x_l} \sum_{i=l}^{L-1} F(x_i, W_i) \right) $$

From slide 54.
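As a sketch of the difference (same simplifications as above: dense layers, no batch normalization), the pre-activation unit moves all activations inside the residual branch so that nothing is applied after the addition:

```python
import numpy as np

relu = lambda z: np.maximum(0.0, z)

def post_activation_unit(x, w1, w2):
    # Original ResNet: ReLU is applied to the sum, after the addition.
    return relu(x + relu(x @ w1) @ w2)

def pre_activation_unit(x, w1, w2):
    # "Identity mapping" variant: activations (and, in the real network,
    # batch normalization) live inside the branch, so nothing touches the
    # sum and x_L = x_l + the sum of branch outputs, exactly as in the
    # propagation formulas above.
    return x + relu(relu(x) @ w1) @ w2
```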

To conclude, Kaiming showed results of transferring features from ResNets trained on image classification. Using ResNet features on localization, detection, and segmentation tasks improves accuracy by 8.5%. Human pose estimation and depth estimation also transfer well. ResNets show promise in image generation, natural language processing, speech recognition, and advertising tasks.

Here are two implementations Kaiming highlighted:



Memory Networks for Language Understanding


Jason Weston motivated building an end-to-end dialog agent. He detailed a simple model that makes headway toward this goal: Memory Networks. He provided a means to test this model against a set of toy benchmarks, which he described as an escalating sequence of tasks. Jason showed a revised memory network model that learns end-to-end without explicitly supervised attention. He gave real-world datasets where memory networks do well and where they do poorly. He described a way to scale efficiently to large datasets. He presented two revisions: one using key-value pairs and another learning from textual feedback. Finally, he asked questions motivating future research.

First, Jason introduced a set of beliefs describing an ideal dialog agent. It should use all its knowledge to perform complex tasks. It should converse at length and understand the motives underlying the dialog. It should be able to grow its capabilities while conversing. It should learn end-to-end.

Next, he introduced Memory Networks (MemNNs). Memory Networks combine inputs with attention over memories to produce reasoned outputs. He kept the first iteration’s scope as simple as possible. It consists of a recurrent controller module that accepts an initial query. To start, its memory is loaded with a set of facts. The query and facts are bag-of-words vectors. The controller predicts an attention vector (with a supervision signal) to choose a fact. It reads the chosen memory to update its hidden state. After several repetitions, or hops, it formulates an output. The output ranks possible responses from a dictionary of words. Error signals back-propagate through the network via the output and the supervised attention episodes.


From slide 9.
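A toy NumPy sketch of a single hard-attention hop over a small set of facts (the same kind of short story used in the benchmarks below); the vocabulary, dimensions, and untrained random embeddings are invented for illustration, and the supervised-attention training loop is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = {w: i for i, w in enumerate(
    "john bob was in the bedroom office went to kitchen travelled back home where is".split())}

def bow(sentence):
    """Bag-of-words vector over the toy vocabulary."""
    v = np.zeros(len(vocab))
    for word in sentence.lower().replace("?", "").split():
        v[vocab[word]] += 1.0
    return v

facts = ["John was in the bedroom", "Bob was in the office",
         "John went to the kitchen", "Bob travelled back home"]
query = "Where is John?"

embed = rng.normal(0.0, 0.1, size=(len(vocab), 32))  # untrained embedding matrix
memory = np.stack([bow(f) @ embed for f in facts])   # one slot per loaded fact
state = bow(query) @ embed                           # controller's hidden state

# One hop: score every memory against the state and read the best match.
# Untrained, the argmax only reflects rough word overlap; in the original
# MemNN the correct fact is given as a supervision signal during training.
chosen = int(np.argmax(memory @ state))
state = state + memory[chosen]
print("attended to:", facts[chosen])
```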

He described a set of toy benchmarks of increasing complexity. Each benchmark consists of a set of short stories. Each story is a sequence of statements about an evolving situation. The model should read a single story and answer one or more questions about it. Within a benchmark, the stories test the same skill. Across the different benchmarks, the skills get more difficult.

John was in the bedroom.

Bob was in the office.

John went to the kitchen.

Bob travelled back home.

Where is John? A: kitchen


(Example from slide 11.)

The benchmarks are:

  • Factoid question/answer with single supporting fact
  • Factoid QA with two supporting facts
  • Factoid QA with three supporting facts
  • Two argument relations: subject versus object
  • Three argument relations
  • Yes/no questions
  • Counting
  • Lists/sets
  • Simple negation
  • Indefinite knowledge
  • Basic coreference
  • Conjunction
  • Compound coreference
  • Time manipulation
  • Basic deduction
  • Basic induction
  • Positional reasoning
  • Reasoning about size
  • Path finding
  • Reasoning about agent’s motivation

A revised model, the End-to-end Memory Network (MemN2N), learns without attention supervision. It uses soft attention (a probability vector over memories) to read the memory. Thus, it is fully differentiable and can learn from output supervision alone. The newer model still fails on some toy benchmark tasks, yet it succeeds on several real-world benchmarks, such as children’s books and news question sets.
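A single-hop sketch of the soft-attention read in NumPy; the bag-of-words inputs and the embedding matrices A, C, and W are placeholders that would normally be learned end-to-end by backpropagation:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def memn2n_hop(query_bow, facts_bow, A, C, W):
    """One end-to-end memory hop with soft attention.

    query_bow: (V,)   bag-of-words query
    facts_bow: (N, V) bag-of-words memories
    A, C:      (V, d) input / output memory embeddings
    W:         (d, V) projection onto candidate answers
    """
    u = query_bow @ A            # controller state
    m = facts_bow @ A            # input memory representations
    c = facts_bow @ C            # output memory representations
    p = softmax(m @ u)           # soft attention: a probability over memories
    o = p @ c                    # differentiable weighted read
    return softmax((u + o) @ W)  # ranking over the answer vocabulary

# Toy usage with invented sizes: 15-word vocabulary, 4 memories, 32-dim embeddings.
V, d, N = 15, 32, 4
A, C, W = (rng.normal(0.0, 0.1, size=s) for s in [(V, d), (V, d), (d, V)])
answer_dist = memn2n_hop(rng.integers(0, 2, V).astype(float),
                         rng.integers(0, 2, (N, V)).astype(float), A, C, W)
print(answer_dist.round(3))
```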

Another revision, the Key-Value Memory Network, splits each memory cell into two parts. The first part is a lookup key used to match the incoming state vector. The second is a value that is combined with the attention weights to produce the read value. Key-Value MemNNs closely match the state of the art on some real-world question-answering datasets.
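A sketch of the key-value read, under the same toy assumptions as above (bag-of-words memories and a single shared embedding matrix A, both invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def key_value_hop(query_bow, keys_bow, values_bow, A):
    """One key-value memory read with a single shared embedding A (for brevity)."""
    u = query_bow @ A                # controller state
    p = softmax((keys_bow @ A) @ u)  # address memories by their keys
    return u + p @ (values_bow @ A)  # read the associated values

# Toy usage with invented sizes: 15-word vocabulary, 4 memories, 32-dim embeddings.
V, d, N = 15, 32, 4
A = rng.normal(0.0, 0.1, size=(V, d))
state = key_value_hop(rng.integers(0, 2, V).astype(float),
                      rng.integers(0, 2, (N, V)).astype(float),
                      rng.integers(0, 2, (N, V)).astype(float), A)
print(state.shape)  # (32,)
```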
Finally, a third revision learns only through textual feedback. It learns to predict the response of a “teacher” agent that provides feedback in words. Mismatches between predicted and actual feedback provide a training signal to the model.


From slide 86.

Explore the papers, code and datasets in slide 87. Find questions for future research in slide 10, slide 83 and slide 88.

Non-Convex Optimization


Anima Anandkumar covered methods that achieve guaranteed global optimization for non-convex problems. Machine learning problems are optimization problems, and they are often non-convex. Non-convex problems can have an exponential number of critical points, and these saddle points impede the progress of gradient descent and Newton’s method. She detailed the conditions that define different types of critical points. She gave algorithms that escape the saddle points of well-behaved functions to find local optima. Such well-behaved functions are twice-differentiable and have non-degenerate saddle points. Stochastic gradient descent and Hessian-based methods can escape saddle points efficiently. She showed how higher-order critical points impede the progress of these algorithms. She detailed specific problems for which global optima can be reached: matrix eigen-analysis and orthogonal tensor decomposition.
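As a toy illustration, the NumPy sketch below classifies a critical point by the signs of its Hessian eigenvalues and shows plain gradient descent stalling at a non-degenerate saddle while a noisy (stochastic-gradient-like) variant escapes it; the objective function is invented for the example:

```python
import numpy as np

# Toy non-convex objective: f(x, y) = x**2 + (y**2 - 1)**2.
# It has a saddle point at (0, 0) and minima at (0, +1) and (0, -1).
grad = lambda p: np.array([2.0 * p[0], 4.0 * p[1] ** 3 - 4.0 * p[1]])

# The Hessian at the origin classifies the critical point: eigenvalues of
# mixed sign mean a saddle, and no zero eigenvalue means it is non-degenerate
# (a "well-behaved" saddle in the talk's sense).
hessian_at_origin = np.diag([2.0, -4.0])
print(np.linalg.eigvalsh(hessian_at_origin))  # [-4.  2.]

rng = np.random.default_rng(0)
p_plain = np.array([1.0, 0.0])  # starts exactly on the ridge leading to the saddle
p_noisy = np.array([1.0, 0.0])
for _ in range(300):
    p_plain = p_plain - 0.05 * grad(p_plain)                              # stalls at the saddle
    p_noisy = p_noisy - 0.05 * grad(p_noisy) + 0.01 * rng.normal(size=2)  # noise breaks the symmetry
print(p_plain.round(3), p_noisy.round(3))  # ~(0, 0) vs a point near one of the minima (0, +/-1)
```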

She showed that tensor decomposition can replace popular machine learning methods that use maximum likelihood: document topic modeling, convolutional dictionary models, fast text embeddings, and neural network training. She gave steps for future research on slide 87. If these methods interest you, read this detailed post at offconvex from her research group.
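A small NumPy sketch of the orthogonal tensor decomposition case: build a symmetric third-order tensor from orthonormal components and recover one of them with tensor power iteration, one of the procedures with global guarantees from this line of work. The dimensions and weights are invented for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Build a symmetric orthogonal tensor T = sum_i lambda_i * v_i (x) v_i (x) v_i.
d, k = 5, 3
lam = np.array([3.0, 2.0, 1.0])
V, _ = np.linalg.qr(rng.normal(size=(d, k)))        # orthonormal components
T = np.einsum("i,ai,bi,ci->abc", lam, V, V, V)

# Tensor power iteration: u <- T(I, u, u) / ||T(I, u, u)||.
# For orthogonally decomposable tensors, a random start converges to one of
# the true components despite the problem being non-convex.
u = rng.normal(size=d)
u /= np.linalg.norm(u)
for _ in range(100):
    u = np.einsum("abc,b,c->a", T, u, u)
    u /= np.linalg.norm(u)

# The recovered direction matches one column of V up to sign.
print(np.abs(V.T @ u).round(3))
```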

Bio: Robert Dionne is a software developer working on backend services and deep learning infrastructure. If you’re interested in learning more about conversational interfaces, follow him and Init.ai on Medium and Twitter. And if you’re looking to create a conversational interface for your app, service, or company, check out Init.ai.



2016-7-2 10:03:42
Reply to oliyiyi (2016-7-2 09:43):
Very good, thank you for sharing.