Neural Architecture Search without Training (Paper Explained)

#ai #research #machinelearning

Neural Architecture Search is typically very slow and resource-intensive: a meta-controller has to train hundreds or thousands of candidate models to find a suitable building plan. This paper proposes using statistics of the Jacobian around data points to estimate the performance of proposed architectures at initialization. The method requires no training and speeds up NAS by orders of magnitude.

OUTLINE:
0:00 - Intro & Overview
0:50 - Neural Architecture Search
4:15 - Controller-based NAS
7:35 - Architecture Search Without Training
9:30 - Linearization Around Datapoints
14:10 - Linearization Statistics
19:00 - NAS-201 Benchmark
20:15 - Experiments
34:15 - Conclusion & Comments

Abstract:
The time and effort involved in hand-designing deep neural networks is immense. This has prompted the development of Neural Architecture Search (NAS) techniques to automate this design. However, NAS algorithms tend to be extremely slow and expensive; they need to train vast numbers of candidate networks to inform the search process. This could be remedied if we could infer a network's trained accuracy from its initial state. In this work, we examine how the linear maps induced by data points correlate for untrained network architectures in the NAS-Bench-201 search space, and motivate how this can be used to give a measure of modelling flexibility which is highly indicative of a network's trained performance. We incorporate this measure into a simple algorithm that allows us to search for powerful networks without any training in a matter of seconds on a single GPU. Code to reproduce our experiments is available at this https URL.
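To make the abstract's idea concrete, here is a minimal PyTorch sketch of the flavor of the method: compute the input Jacobian of an untrained network for each point in a minibatch, then score how decorrelated those local linear maps are across inputs. The `jacobian_score` function and its log-determinant scoring are my own illustration (the paper's actual NASWOT score is computed differently), and the small MLP is a hypothetical stand-in for a candidate architecture.

```python
import torch
import torch.nn as nn

def jacobian_score(net, x, eps=1e-5):
    """Illustrative training-free score: how decorrelated are the
    network's local linear maps (input Jacobians) across data points?
    NOT the authors' exact NASWOT formula, just the general idea."""
    x = x.clone().requires_grad_(True)
    y = net(x)
    # Gradient of sum(y) w.r.t. x gives one Jacobian row per data point.
    y.backward(torch.ones_like(y))
    J = x.grad.reshape(x.size(0), -1)              # shape (N, D)
    J = J / (J.norm(dim=1, keepdim=True) + eps)    # unit-normalize rows
    C = J @ J.t()                                  # (N, N) cosine-similarity matrix
    # More flexible architectures tend to produce less correlated maps,
    # i.e. C closer to the identity, hence a larger log-determinant.
    sign, logdet = torch.linalg.slogdet(C + eps * torch.eye(C.size(0)))
    return logdet.item()

# Toy usage with a hypothetical candidate architecture and random "data":
net = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1))
score = jacobian_score(net, torch.randn(16, 32))
```

In the paper this kind of score is computed once per untrained candidate, so ranking thousands of architectures takes seconds rather than GPU-days of training.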

Authors: Joseph Mellor, Jack Turner, Amos Storkey, Elliot J. Crowley

Links:

If you want to support me, the best thing to do is to share out the content :)

If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n
Comments

At 21:55 I think you should draw horizontal lines (not vertical) to discard architectures, since the score is on the y-axis. No?

YeshwanthReddy

These long videos are really growing on me. They don't just introduce me to papers I'm not familiar with; they also add the insight of your perspective. Thank you.

RickeyBowers

If it holds up, then pre-filtering (or rejection sampling) based on this average score is a cheap way to speed up any neural architecture search algorithm.

Miestroh

Thank you so much; you save tens of thousands of people hours of work. The impact of your work is immense even if you don't get hundreds of thousands of views. Please never stop, you're amazing!

JeroenMW

It is basically an "anti-lottery ticket hypothesis".
.
33:00 For the RL-based search models, I think we would still need some negative samples; otherwise the RL model would keep suggesting bad models for the sake of exploration.
.
Nice paper, easy to implement. Will definitely use this trick.

herp_derpingson

I think a lot of the performance gap compared to other techniques can be explained by how the NAS-Bench-201 benchmark is constructed:
We only have 15,625 different architectures: enough for the sample-efficient "train until you're done" NAS systems, but searching without training may need architectures that differ much more significantly. This would also explain why the metric "spreads out" on the more complex tasks: there just aren't enough ways the NAS-201 networks differ to make a meaningful difference observable just by looking at the initial state. Maybe one could combine this approach with something like NEAT to generate a population of architectures and score them pretty much instantly using this. That would let the system get away from the "resnet-likes" that make up NAS-Bench.

TimeofMinecraftMods

Thank you for sharing this. It's very interesting. I learned a lot.

Mahyaalizadeh

Thank you for the effort... highly appreciated!

marohs

Yet another area I'm very interested to hear about. Thank you.

Notshife

The nonlinear function in a neural network can be treated like the if...else statement in a traditional programming language.
LSTM, GRU, and attention can be treated the same way; they provide switching capability.

albertwang

I wonder if this scoring could be improved simply by replacing regular correlation with distance correlation, since that also captures non-linearities. It might make a difference, in particular in those networks where the score currently no longer tells you much.

Kram

This is very exciting and, as others mention in the comments, a super speedy tool to discard faulty architectures. Thanks for the video!

lucha

amazing, always like your inspiring interpretation

eddtsoi

Great video! Thanks for your personal interpretation too; it helps think things through. I would argue, though, that the interpretation of the pytorchcv results at 25:40 is wrong (admittedly, I don't know if it's your interpretation or the authors', since they seem to have removed this part from their most recent version). It looks like they're showing that their metric gives high scores to methods we know do well. That is, architectures that humans have found to perform well achieve a high NASWOT score (or whatever they call it).

lugae

my 6-word slogan for this paper:
Neural architecture physiognomy! And it works!

CosmiaNebula

Thank you for the very clear explanation!

dipsyteletubbie

The idea is interesting; thanks for making it so accessible. The big question is: is it useful? I read fig 3 differently from you and so come to a different conclusion. You think this method weeds out most (~90%) of the bad architectures; I think it weeds out very few. If we had an uninformative method, the correlation line in fig 3 would be vertical, i.e. the score would be uncorrelated with accuracy. By eye, I integrate vertically to get the distribution of scores that would result from a useless method, and then do the same for (say) the top 10% of scores. Scatter plots are terrible at showing density, but it looks to me like all the probability mass is at the top of the plot, so the two distributions would be very similar. The authors could have done this basic statistical check.

tonyrobinson

I somehow like these one-step methods. What I don't directly understand is how this method can predict the generalization capabilities of a network architecture (e.g. validation set accuracy) from the linear-map histogram of one mini-batch.

bluelng

I don't know if you're aware, but the paper seems to have been edited/updated since you made this video, with different graphs (correlation matrices instead of histograms) and a different equation for computing the score. Is it common for papers to be changed after publishing? Do you know whether the new equation is mathematically equivalent and preferred because it's easier to calculate, or is it just a different score that measures approximately the same thing?

drozen

Is J of shape NxD or DxN (where D is the dimensionality of x)? The shape of JJ^T would be NxN or DxD respectively in these two cases. Clearly the first makes more sense in context, but the J_{i,n} in the second line below (1) seems to indicate otherwise.

dermitdembrot