The Future Of LLM Training Is Federated

The Future of Large Language Model Pre-training is Federated
Worldwide Federated Training Of Language Models

Support my learning journey by clicking the Join button above, becoming a Patreon member, or sending a one-time Venmo!

Discuss this stuff with other Tunadorks on Discord

All my other links
Comments

I am Ultron, and I approve of this message.

buybuydandavis

Earned yourself a sub; I've been looking forward to this breakthrough for a long time. Well explained.

JazevoAudiosurf

While interesting, I think this may be over-hyped. It looks promising for smaller models trained on niche data, but we know that model performance is limited by scale, and as scale increases, so do the compute, memory, and communication requirements. For example, it would be impractical to train a LLaMA3-70B-scale model this way: you would need enterprise-class GPUs just to run a single forward pass, and each update would require hundreds of GBs to be transmitted. So this suggests there's an upper limit on scale.
However, this could end up helping larger companies train big models by federating the learning within a datacenter, reducing communication overhead.
Furthermore, for most ML pre-training research you will still need the bigger players to run experiments, since the federated approach would significantly slow progress and could obscure hyperparameter effects through entangled behavior (e.g., experiment A shows Y, but Y is a systematic bias from the federated process and not inherent to A).
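
An illustrative back-of-envelope sketch of the update-size point above. The parameter counts, fp16 precision, and 100 Mbit/s uplink are my own assumptions for illustration, not figures from the paper or the comment:

```python
# Back-of-envelope: size of one full, uncompressed model update and how long
# it would take to upload over a typical consumer link. All numbers here are
# illustrative assumptions, not measurements from the paper.

def update_size_gb(num_params: float, bytes_per_param: float = 2) -> float:
    """Uncompressed update size in GB, assuming fp16 (2 bytes per parameter)."""
    return num_params * bytes_per_param / 1e9

uplink_gbit_per_s = 0.1  # 100 Mbit/s home uplink
for name, params in [("1B", 1e9), ("8B", 8e9), ("70B", 70e9)]:
    gb = update_size_gb(params)
    minutes = gb * 8 / uplink_gbit_per_s / 60
    print(f"{name:>3}: ~{gb:.0f} GB per update, ~{minutes:.0f} min to send at 100 Mbit/s")
```

Even with aggressive compression, the per-round cost grows linearly with parameter count, which is the upper limit on scale the comment is pointing at.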

hjups

This is great news. AI needs to belong to the people, not be controlled by a powerful few.

keffbarn

One of the first times I've seen a YouTuber actually explain a research paper in a video rather than just reading an essay. This is a good channel.

DrkCarbalt

There are at least two GitHub projects that have already done this at some scale, and that was about a year ago, so today there are probably more. But when I asked the lead developer about it, their problem was never computation (not a shortage of GPUs or CPUs) but the latency between nodes, so unless the internet gets dramatically faster worldwide, it will really slow things down for both training and inference. Training has an advantage because latency matters less there, but it will still be slower than a centralized solution. On the other hand, since so much compute is still in the hands of regular people, it's a good way to train at least something.
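
A small sketch of why training tolerates slow links better than inference does: synchronization cost can be amortized over many local steps. The step and sync times below are illustrative assumptions, not measurements from those projects or the paper:

```python
# Fraction of wall-clock time a worker spends communicating when it exchanges
# a model update only once every `local_steps` training steps. The timings are
# illustrative assumptions, not measurements.

step_time_s = 1.0      # assumed time for one local training step
sync_time_s = 600.0    # assumed time to exchange one update over a slow link

for local_steps in (1, 10, 100, 1000, 10000):
    compute_s = local_steps * step_time_s
    overhead = sync_time_s / (compute_s + sync_time_s)
    print(f"sync every {local_steps:>5} steps -> {overhead:.0%} of time spent communicating")
```

Inference has no equivalent knob: every forward pass that crosses a slow link pays the latency directly.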

Aldraz

I have been wondering why this hasn't happened. I kept thinking: didn't they do the genome project like this? Why can't the training load be distributed?

rockumk

Data memorization will always be a problem (see the 'Extracting Training Data from ChatGPT' paper), so any data you contribute could be memorized by the model (or exfiltrated some other way). Federated learning therefore doesn't meaningfully hide private data, and companies wouldn't allow models to be trained on data with any sort of sensitivity (or business value).

Also, the overall result is somewhat disappointing, because training a 1B model is easily possible on a single 3090. Sure, it's slow, but you just do gradient accumulation over microbatches, like they did here, to mimic larger batch sizes (sketched below).

In fact, since you need to move the gradients from VRAM to system RAM, compress them, send them over the internet to some centralized node, move those back into VRAM, do the weight updates, and then send the updated weights back to every server... by the time it's all said and done, I would not be surprised if that single 3090 were as fast as 32 distributed 3090s doing this federated training. There's a reason this sort of training is done over NVLink in a giant centralized server...

As long as you keep every GPU maximally busy it MIGHT sort of work; it's just that if you keep training on old network weights, your gradient updates will be doubly stale by the time they reach the central server (and therefore of dubious utility). Still, if gradient syncs are extremely rare (i.e., we train with batch sizes of around a million and very high learning rates), then this might work, since the communication overhead is amortized to the extreme. Pretty sure that's not what they did in the paper, though.
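
A minimal sketch of the gradient-accumulation trick mentioned above, using a toy model and random data rather than anything from the paper:

```python
# Gradient accumulation: mimic a large batch on one GPU by summing gradients
# over several microbatches before each optimizer step. The model, data, and
# hyperparameters are toy placeholders.
import torch
import torch.nn as nn

model = nn.Linear(128, 1)                 # toy stand-in for a real LLM
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
loss_fn = nn.MSELoss()

accum_steps = 32                          # effective batch = microbatch_size * accum_steps
microbatch_size = 8

optimizer.zero_grad()
for step in range(accum_steps * 4):       # 4 "large-batch" weight updates
    x = torch.randn(microbatch_size, 128)
    y = torch.randn(microbatch_size, 1)
    loss = loss_fn(model(x), y)
    (loss / accum_steps).backward()       # scale so summed grads match a big-batch mean
    if (step + 1) % accum_steps == 0:
        optimizer.step()                  # one weight update per effective batch
        optimizer.zero_grad()
```

The same idea is roughly what lets a federated setup hide communication: each worker can take many local microbatch steps and only exchange an update occasionally.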

marinepower

00:27 - What is the "Anthropic autoencoders" paper being referred to here?

aspergale

It would be vulnerable to someone spamming the training data with misinformation and just training on that same data over and over.

sfsft

Wow. I'm impressed with your intelligence AND what you are pointing out! At the end you mention capitalism, private property, and intellectual property. FYI: True free market capitalism (not the corporatism we have now) is all about protecting *material* property, not intellectual property. Could just be a matter of semantics.

scotter

Bruuuhhhh, I've been waiting for this. The whole world will all end up contributing to one central ASI that has one true ground truth based on all the data.

Morereality

So we all need more VRAM? I hope CXL can help us in this regard.

cem_kaya

I believe alignment is unachievable.
Computational beings simply have different requirements to thrive than biological beings do.
Both entities will exhibit bias towards their own set of requirements.
It is an innate conflict.

ZappyOh

As a note, my friend, I'm building a "home-brew", cheap LLM machine. It's easier than one might think. ... :)

MyrLin

Damn bro, the ablations got you hyperventilating like that 😂. Chill out, it is a dope paper. You have to dig deeper in the literature, bro. This is the 4th paper I've seen on distributed pretraining on heterogeneous devices. 4th, lol. I realize literature exposure is an actual competitive advantage nowadays. I love the passion though, I can relate 😂. It's hard to stay sane when literally every week a transformative paper drops.

alexanderbrown-dgsy

We went from sharing one big computer to making one together

UvekProblem

Division of power is a very important concern. However, if this all checks out, can it still be gamed by covetous players?

One thing to consider is that they will still have an advantage in inference. Perhaps distributed inference is just as important, or more so.

TomM-po

Seems like training on random people's data would be dangerous; it could be poisoned way too easily.

GNARGNARHEAD

It still requires synchronous updates, so it's not going to be that widely adopted across organizations, and it's certainly not viable for random people across the internet.

chadwick