Faster Neural Network Training with Data Echoing (Paper Explained)

CPUs are often the bottleneck in machine learning pipelines. Data fetching, loading, preprocessing, and augmentation can be slow to the point where the GPUs are mostly idle. Data Echoing is a technique that reuses data already in the pipeline to reclaim this idle time and keep the GPUs busy at all times.
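To make the idea concrete, here is a minimal, framework-agnostic sketch of example-level echoing as a chain of Python generators. The function names and the echo_factor parameter are illustrative only, not the implementation used in the paper; a real pipeline would also shuffle the echoed items so that duplicates do not land next to each other in a batch.

import itertools
import random

def upstream(paths):
    # stand-in for the slow stages: disk/network reads, decoding, augmentation
    for p in paths:
        yield ("decoded:" + p, random.random())

def echo(stream, echo_factor=2):
    # re-emit each upstream item echo_factor times before batching
    for item in stream:
        for _ in range(echo_factor):
            yield item

def batches(stream, batch_size=4):
    while True:
        batch = list(itertools.islice(stream, batch_size))
        if not batch:
            return
        yield batch

paths = ["img_%d.jpg" % i for i in range(8)]
steps = 0
for batch in batches(echo(upstream(paths), echo_factor=2)):
    steps += 1   # train_step(batch) would run here on the accelerator
print(steps)     # 4 training batches from 8 fresh examples instead of 2

The accelerator gets twice as many batches per pass over the expensive fresh data; whether those repeated examples are still useful for learning is what the paper's experiments measure.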

Abstract:
In the twilight of Moore's law, GPUs and other specialized hardware accelerators have dramatically sped up neural network training. However, earlier stages of the training pipeline, such as disk I/O and data preprocessing, do not run on accelerators. As accelerators continue to improve, these earlier stages will increasingly become the bottleneck. In this paper, we introduce "data echoing," which reduces the total computation used by earlier pipeline stages and speeds up training whenever computation upstream from accelerators dominates the training time. Data echoing reuses (or "echoes") intermediate outputs from earlier pipeline stages in order to reclaim idle capacity. We investigate the behavior of different data echoing algorithms on various workloads, for various amounts of echoing, and for various batch sizes. We find that in all settings, at least one data echoing algorithm can match the baseline's predictive performance using less upstream computation. We measured a factor of 3.25 decrease in wall-clock time for ResNet-50 on ImageNet when reading training data over a network.

Authors: Dami Choi, Alexandre Passos, Christopher J. Shallue, George E. Dahl

Links:
Comments

0:00 Overview
0:40 What's the problem?
8:30 The Data Echoing Method
13:00 Contributions
14:20 Experiments
26:15 My criticisms of the experiments
35:10 Strange validation accuracies

YannicKilcher

Reminds me of experience replay, but instead of environment transitions, it is the data in the batch preparation pipeline that gets cached. Also, if you run separate "offline" preprocessing steps (e.g. resizing, augmenting) and store the results, that is a form of "memoization" as well.

bluelng
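The "memoization" idea mentioned in the comment above can be as simple as caching the output of a deterministic preprocessing step. A small hypothetical sketch (function and file names are made up); note that it only works for deterministic steps, so random augmentation still has to be recomputed or, as in the paper, echoed.

import functools

@functools.lru_cache(maxsize=None)
def preprocess(path):
    # stand-in for expensive but deterministic work such as decode + resize
    return hash(path) % 256

paths = ["img_%d.jpg" % i for i in range(4)] * 3   # three passes over the data
features = [preprocess(p) for p in paths]
print(preprocess.cache_info())   # 4 misses (real work done), 8 cache hits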

I guess figure 2 (A) is correct. The downstream stage is processing the data it got from the previous upstream stage.

nareshr

Nice video as always, thank you!

I agree that it would have been nice to see "when it breaks", as you say. But you also have to keep in mind how many models they trained and how long each one took (see the wall-clock times). I believe this is something that should be covered in future work by the same or other people.


Edit: Also, while it is true that you CAN reach higher performance on CIFAR-10 and ImageNet, just fixing the architecture does not guarantee that you do. So far, I have not found a principled investigation into how much the initialization seed influences the outcome. Erhan et al. looked at this in 2010 for MLPs ("Why Does Unsupervised Pre-training Help Deep Learning?") and the variance is enormous. Not sure how this translates to modern architectures, but judging from my own experience, I would be surprised if the seed did not play a major role in the final outcome.

XRay

I used to run SGD on the same loaded batch on the GPU multiple times for every batch. In my experience it helps in getting to 80% (MNIST) rather quickly, because less data is being copied over, so there is less traffic on the PCIe bus.

34:00 This is a replay buffer from reinforcement learning?

herp_derpingson
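Taking several optimizer steps on a batch that is already on the GPU is essentially the paper's "batch echoing" variant. A minimal PyTorch-style sketch, where the toy model, the fake loader, and the echo_factor knob are all placeholders rather than anything from the paper:

import torch

model = torch.nn.Linear(10, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
# fake "loader": in practice this is the slow, CPU/disk-bound part
loader = [(torch.randn(32, 10), torch.randint(0, 2, (32,))) for _ in range(5)]
echo_factor = 2   # optimizer steps taken per loaded batch

for x, y in loader:                   # expensive: fetch, preprocess, copy over
    for _ in range(echo_factor):      # cheap: reuse the batch already on device
        opt.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(x), y)
        loss.backward()
        opt.step()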

Good one! Consider making a video on "Meta Pseudo Labels"

arkasaha

Correct me if I'm wrong, I want to see if I get the idea of this.
Normally when you train you have epochs where you go through all of the data. Say you train your model for 20 epochs; then it sees every data point 20 times. Here they just have something like a 1.x modifier for each epoch, so the model sees any given data point 1.x times on average per epoch. So rather than repeating data only after each epoch, you have some repetition during them.
Some of the learning convergence would make sense then, right? Because after your first epoch your data isn't quite as useful for convergence as before, but it still has a substantial impact until you begin overtraining.

Nick
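If that reading is right, the gain comes from pipeline throughput rather than from the data itself. A toy calculation with invented numbers, assuming the upstream (CPU) and downstream (accelerator) stages overlap perfectly:

# Suppose reading + preprocessing one example takes 4 time units upstream,
# while the accelerator needs only 1 time unit to train on it.
upstream_t, accel_t = 4.0, 1.0
for e in (1, 2, 4):                              # echo factor
    # per fresh example: one upstream pass, but e accelerator passes
    time_per_sgd_example = max(upstream_t, e * accel_t) / e
    print(e, time_per_sgd_example)               # 4.0, 2.0, 1.0 time units

Whether an SGD step on a repeated example is worth as much as one on a fresh example is the open question the experiments try to answer.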

I upload 4096 images to GPU memory.
I sample 16 and augment them twice, then I MixUp those and get 16*16 = 256 images.
I train the network with contrastive learning: all images that are mixed up from the same original version of an image should be close, all others far away.
I sample 4096 times from these 4096 images, and then I swap them out for other ones.

jonatani

Do you have a time machine? Your video came out faster than the original paper itself! Just kidding. Thank you for all your indefatigable efforts :)

raghebalghezi

Maybe I'm missing something here, but I don't quite understand why you would expect lower accuracy if you echo data. If you run each batch twice with half the learning rate, then with respect to the number of fresh samples you should just have slightly more accurate gradients, no?

DanielHesslow
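A toy check of that intuition on a one-parameter quadratic loss (nothing from the paper): two half-learning-rate steps on the same batch land close to, but not exactly on, a single full-learning-rate step, because the second step uses the gradient at the already-updated weights.

def grad(w):
    # gradient of the toy loss 0.5 * (w - 3) ** 2
    return w - 3.0

w_full = 0.0 - 0.2 * grad(0.0)         # one step with lr = 0.2  -> 0.60
w_half = 0.0 - 0.1 * grad(0.0)         # first half step, lr = 0.1
w_half = w_half - 0.1 * grad(w_half)   # second half step        -> 0.57
print(w_full, w_half)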

Did you see this one on reddit this morning too? :P

jrkirby

Nice video as always!
If I may, I would suggest this article for you to look at. It seems quite close to your content. Lots of references to Schmidhuber, which is a plus! :)

mikhaildoroshenko

It seems to have nothing to do with machine learning. Sorry.😪

yilu

Why is this a paper? This is common sense when you have an expensive input pipeline.

XMasterDE