Accelerating Deep Learning by Focusing on the Biggest Losers

What if you could cut your network's training time by training only on the hard examples? This paper proposes selecting samples with high loss and training only on those in order to speed up training.

Abstract:
This paper introduces Selective-Backprop, a technique that accelerates the training of deep neural networks (DNNs) by prioritizing examples with high loss at each iteration. Selective-Backprop uses the output of a training example's forward pass to decide whether to use that example to compute gradients and update parameters, or to skip immediately to the next example. By reducing the number of computationally-expensive backpropagation steps performed, Selective-Backprop accelerates training. Evaluation on CIFAR10, CIFAR100, and SVHN, across a variety of modern image models, shows that Selective-Backprop converges to target error rates up to 3.5x faster than with standard SGD and between 1.02--1.8x faster than a state-of-the-art importance sampling approach. Further acceleration of 26% can be achieved by using stale forward pass results for selection, thus also skipping forward passes of low priority examples.
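
For a rough sense of the mechanism, here is a minimal PyTorch-style sketch (not the authors' implementation): a cheap no-grad forward pass scores every example in the batch, and the expensive forward/backward pass is run only on the highest-loss subset. The function name, the keep_frac parameter, and the simple top-k selection are illustrative assumptions; the paper instead selects each example probabilistically based on the percentile of its loss over recent history.

```python
import torch
import torch.nn.functional as F

def selective_backprop_step(model, optimizer, images, labels, keep_frac=0.25):
    """One training step that backpropagates only the hardest examples (sketch)."""
    # Cheap scoring pass: per-example loss without building an autograd graph.
    with torch.no_grad():
        losses = F.cross_entropy(model(images), labels, reduction="none")

    # Keep the highest-loss examples (simplified stand-in for the paper's
    # percentile-based probabilistic selection).
    k = max(1, int(keep_frac * images.size(0)))
    hard_idx = torch.topk(losses, k).indices

    # Expensive forward + backward only on the selected subset.
    optimizer.zero_grad()
    loss = F.cross_entropy(model(images[hard_idx]), labels[hard_idx])
    loss.backward()
    optimizer.step()
    return loss.item()
```

The "stale forward pass" variant mentioned in the abstract would additionally reuse loss scores from earlier passes, so even the scoring forward pass can be skipped for low-priority examples.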

Authors: Angela H. Jiang, Daniel L.-K. Wong, Giulio Zhou, David G. Andersen, Jeffrey Dean, Gregory R. Ganger, Gauri Joshi, Michael Kaminsky, Michael Kozuch, Zachary C. Lipton, Padmanabhan Pillai

Comments

There are so many ML papers these days that authors have to resort to click-baity titles.

What a time to be alive.

herp_derpingson

Aside from the hard example selection, is this identical to the RevNet technique for saving memory needed for backprop?

connor-shorten

This actually seems a lot like intrinsically motivated AI. The only difference is that those agents act to seek out inputs with high loss (or a large decrease in loss), instead of selecting neurons or examples within a batch during training.

simleek

I think this will be difficult for multi-GPU training, because each GPU runs a forward pass and then has to sync results across the whole node's batch for forward and backward, so it becomes a trade-off between the extra forward-pass and sync time and the backward-pass time saved by skipping samples.

guanfuchen

Won't it just overfit to the selected hard examples and underfit the easy ones?

sehbanomer

What resources do you recommend for starting with DL? Anything in R?

superkhoy

This approach seems like a derivative of boosting.

DrAhdol

Great paper, though why hasn't anyone thought about this before?

herp_derpingson