Faster Neural Network Training with Data Echoing (Paper Explained)

CPUs are often the bottleneck in machine learning pipelines. Data fetching, loading, preprocessing, and augmentation can be slow to the point where the GPUs are mostly idle. Data Echoing is a technique that reuses data already in the pipeline to reclaim this idle time and keep the GPUs busy at all times.
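To make the idea concrete, here is a minimal, framework-agnostic sketch of example-level echoing as a chain of Python generators. The function names and the echo_factor parameter are illustrative only, not the implementation used in the paper; a real pipeline would also shuffle the echoed items so that duplicates do not land next to each other in a batch.

import itertools
import random

def upstream(paths):
    # stand-in for the slow stages: disk/network reads, decoding, augmentation
    for p in paths:
        yield ("decoded:" + p, random.random())

def echo(stream, echo_factor=2):
    # re-emit each upstream item echo_factor times before batching
    for item in stream:
        for _ in range(echo_factor):
            yield item

def batches(stream, batch_size=4):
    while True:
        batch = list(itertools.islice(stream, batch_size))
        if not batch:
            return
        yield batch

paths = ["img_%d.jpg" % i for i in range(8)]
steps = 0
for batch in batches(echo(upstream(paths), echo_factor=2)):
    steps += 1   # train_step(batch) would run here on the accelerator
print(steps)     # 4 training batches from 8 fresh examples instead of 2

The accelerator gets twice as many batches per pass over the expensive fresh data; whether those repeated examples are still useful for learning is what the paper's experiments measure.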

Abstract:
In the twilight of Moore's law, GPUs and other specialized hardware accelerators have dramatically sped up neural network training. However, earlier stages of the training pipeline, such as disk I/O and data preprocessing, do not run on accelerators. As accelerators continue to improve, these earlier stages will increasingly become the bottleneck. In this paper, we introduce "data echoing," which reduces the total computation used by earlier pipeline stages and speeds up training whenever computation upstream from accelerators dominates the training time. Data echoing reuses (or "echoes") intermediate outputs from earlier pipeline stages in order to reclaim idle capacity. We investigate the behavior of different data echoing algorithms on various workloads, for various amounts of echoing, and for various batch sizes. We find that in all settings, at least one data echoing algorithm can match the baseline's predictive performance using less upstream computation. We measured a factor of 3.25 decrease in wall-clock time for ResNet-50 on ImageNet when reading training data over a network.

Authors: Dami Choi, Alexandre Passos, Christopher J. Shallue, George E. Dahl

Links:
Comments

0:00 Overview
0:40 What's the problem?
8:30 The Data Echoing Method
13:00 Contributions
14:20 Experiments
26:15 My criticisms of the experiments
35:10 Strange validation accuracies

YannicKilcher

Reminds me of experience replay, but instead of environment transitions, it is the data in the batch preparation pipeline that gets cached. Also, if you run separate "offline" preprocessing steps (e.g. resizing, augmenting) and store the results, that is a form of "memoization" as well.

bluelng
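The "memoization" idea mentioned in the comment above can be as simple as caching the output of a deterministic preprocessing step. A small hypothetical sketch (function and file names are made up); note that it only works for deterministic steps, so random augmentation still has to be recomputed or, as in the paper, echoed.

import functools

@functools.lru_cache(maxsize=None)
def preprocess(path):
    # stand-in for expensive but deterministic work such as decode + resize
    return hash(path) % 256

paths = ["img_%d.jpg" % i for i in range(4)] * 3   # three passes over the data
features = [preprocess(p) for p in paths]
print(preprocess.cache_info())   # 4 misses (real work done), 8 cache hits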

I guess figure 2 (A) is correct. The downstream stage is processing the data it got from the previous upstream stage.

nareshr

Nice video as always, thank you!

I agree that it would have been nice to see "when it breaks", as you say. But you also have to keep in mind how many models they trained and how long each one took (see the wall-clock times). I believe this is something that should be covered in future work by the same or other people.


Edit: Also, while it is true that you CAN reach higher performance on CIFAR-10 and ImageNet, just fixing the architecture does not guarantee that you do. So far, I have not found a principled investigation into how much the initialization seed influences the outcome. Erhan et al. looked at this in 2010 for MLPs ("Why Does Unsupervised Pre-training Help Deep Learning?") and the variance is enormous. Not sure how this translates to modern architectures, but judging from my own experience, I would be surprised if the seed did not play a major role in the final outcome.

XRay

I used to run SGD on the same loaded batch on the GPU multiple times for every batch. In my experience it helps in getting to 80% (MNIST) rather quickly, because less data is being copied over, so there is less traffic on the PCIe bus.

34:00 This is a replay buffer from reinforcement learning?

herp_derpingson
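Taking several optimizer steps on a batch that is already on the GPU is essentially the paper's "batch echoing" variant. A minimal PyTorch-style sketch, where the toy model, the fake loader, and the echo_factor knob are all placeholders rather than anything from the paper:

import torch

model = torch.nn.Linear(10, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
# fake "loader": in practice this is the slow, CPU/disk-bound part
loader = [(torch.randn(32, 10), torch.randint(0, 2, (32,))) for _ in range(5)]
echo_factor = 2   # optimizer steps taken per loaded batch

for x, y in loader:                   # expensive: fetch, preprocess, copy over
    for _ in range(echo_factor):      # cheap: reuse the batch already on device
        opt.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(x), y)
        loss.backward()
        opt.step()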

Good one! Consider making a video on "Meta Pseudo Labels"

arkasaha

Correct me if I'm wrong, I want to see if I get the idea of this.
Normally when you train you have epochs where you go through all of the data. Say you train your model for 20 epochs; then it sees every data point 20 times. Here they just have something like a 1.x modifier for each epoch, so the model sees any given data point 1.x times on average per epoch. So rather than repeating data only after each epoch, you have some repetition during them.
Some of the learning convergence would make sense then, right? Because after your first epoch your data isn't quite as useful for convergence as before, but it still has a substantial impact until you begin overtraining.

Nick
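If that reading is right, the gain comes from pipeline throughput rather than from the data itself. A toy calculation with invented numbers, assuming the upstream (CPU) and downstream (accelerator) stages overlap perfectly:

# Suppose reading + preprocessing one example takes 4 time units upstream,
# while the accelerator needs only 1 time unit to train on it.
upstream_t, accel_t = 4.0, 1.0
for e in (1, 2, 4):                              # echo factor
    # per fresh example: one upstream pass, but e accelerator passes
    time_per_sgd_example = max(upstream_t, e * accel_t) / e
    print(e, time_per_sgd_example)               # 4.0, 2.0, 1.0 time units

Whether an SGD step on a repeated example is worth as much as one on a fresh example is the open question the experiments try to answer.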

I upload 4096 images to GPU memory.
I sample 16 and augment them twice, then I MixUp those and get 16*16 = 256 images.
I train the network with contrastive learning: all images that are mixed up from the same original version of an image should be close, all others far away.
I sample 4096 times from these 4096 images, and then I swap them out for other ones.

jonatani

Do you have a time machine? Your video came out faster than the original paper itself! Just kidding. Thank you for all your indefatigable efforts :)

raghebalghezi

Maybe I'm missing something here, but I don't quite understand why you would expect lower accuracy if you echo data. If you run each batch twice with half the learning rate, then with respect to the number of fresh samples you should just have slightly more accurate gradients, no?

DanielHesslow
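A toy check of that intuition on a one-parameter quadratic loss (nothing from the paper): two half-learning-rate steps on the same batch land close to, but not exactly on, a single full-learning-rate step, because the second step uses the gradient at the already-updated weights.

def grad(w):
    # gradient of the toy loss 0.5 * (w - 3) ** 2
    return w - 3.0

w_full = 0.0 - 0.2 * grad(0.0)         # one step with lr = 0.2  -> 0.60
w_half = 0.0 - 0.1 * grad(0.0)         # first half step, lr = 0.1
w_half = w_half - 0.1 * grad(w_half)   # second half step        -> 0.57
print(w_full, w_half)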

Did you see this one on reddit this morning too? :P

jrkirby

Nice video as always!
If I may, I would suggest this article for you to look at. It seems quite close to your content. Lots of references to Schmidhuber, which is a plus! :)

mikhaildoroshenko

It seems to have nothing to do with machine learning. Sorry.😪

yilu

Why is this a paper? This is common sense when you have an expensive input pipeline.

XMasterDE