Direct Preference Optimization: Your Language Model is Secretly a Reward Model | DPO paper explained

Direct Preference Optimization (DPO) finetunes LLMs from human preferences without reinforcement learning. DPO was one of the two Outstanding Main Track Runner-Up papers at NeurIPS 2023.

Thanks to our Patrons who support us in Tier 2, 3, 4: 🙏
Dres. Trost GbR, Siltax, Vignesh Valliappan, @Mutual_Information , Kshitij

Outline:
00:00 DPO motivation
00:53 Finetuning with human feedback
01:39 RLHF explained
03:05 DPO explained
04:24 Why Reinforcement Learning in the first place?
05:58 Shortcomings
06:50 Results

▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
🔥 Optionally, pay us a coffee to help with our Coffee Bean production! ☕
▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀


#AICoffeeBreak #MsCoffeeBean #MachineLearning #AI #research

Video editing: Nils Trost
Music 🎵 : Ice & Fire - King Canyon
Comments

Wow, two videos in one week? You're spoiling us!!

DerPylz

On why RLHF came first, it was invented by OpenAI, which had focused almost exclusively on RL stuff prior to GPT. "When all you have is a hammer..." as the saying goes.

TheRyulord

I have not read the paper yet, but this sounds like supervised contrastive learning. If it is, then it's really astonishing that nobody came up with it before. I implemented some supervised contrastive learning myself... missed opportunity 😢

Neomadra

I really enjoy your videos. Please keep up the good work! My theory for why no one thought of this earlier is your first reason: they assumed there was no closed-form loss function, and that is where RL comes in.

ShihgianLee
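For reference, this is the closed-form objective the DPO paper derives (σ is the logistic sigmoid, β a temperature, y_w and y_l the preferred and dispreferred answers for prompt x, and π_ref the frozen reference model):

\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[ \log \sigma\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]

Minimizing this directly with gradient descent is what lets DPO skip the explicit reward model and the RL loop.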

Really excellent breakdown of DPO!

Given the gains we've seen from DPO in the open-source community, it makes complete sense that it was at least one of the runners-up at NeurIPS this year.

I’m really enjoying your explainer videos! Thank you for taking the time to make them!

MaJetiGizzle

These videos are great! Really well explained, thank you so much for the effort you put into them :)

poketopa

Great explanation, easy to follow and nicely simplified.

kayvanshah

Thanks for the video! Very well explained. I just began looking into DPO, and your video gives great context.

SrikanthIyer

Thanks for the pretty useful and timely explanation.

KPreddiePWSP

I’m no expert, but when RLHF was new, the most common justification I heard in explainer articles and videos was that the reward model was smaller than the LLM, so less likely to overfit on the human labels, and could be used to produce more data for the LLM to train on compared to just the expensive human-annotated data. So pretty much your second hypothesis.

alexkubiesa

Good video :) really enjoyed watching it

qzfrnfz

Thank you so much for such clear high level explanations.

IbrahimSobh

Great video and wonderful explanation. Thanks for covering the differences and thoughts about the limitations of just using DPO.

I am wondering why instruction finetuning was not mentioned. Wouldn't SFT make the whole DPO process more efficient? Especially when sampling directly from a pretrained model, it should be hard to even get good samples, since the model hasn't yet learned what questions and answers look like, no?

mkamp

Hi, I just came here because I saw you're a member of Sabine's channel. Wow, I did not expect to find another successful channel here. Your videos are very well made, even though computer science is not my field. I'll recommend your channel, all the best.

Thomas-gk

Thank you so much for your clear explanation ~~ it is really helpful :)) Hope you review other NeurIPS papers too, haha. Thanks ~~~

xkrgnhy

What about an explanation video on the MoE architecture (Mistral)?

paprikar

So the main logic lies in the custom loss function, which produces a higher loss for the next token if it is far from the positive example?

eck
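To make the shape of that loss concrete, here is a minimal PyTorch-style sketch (not the authors' code; the *_logps names are illustrative and are assumed to hold the summed token log-probabilities of each whole response under the trainable policy and under the frozen reference model). Rather than penalizing individual tokens for being far from the positive example, it contrasts the policy-vs-reference log-likelihood ratios of the entire chosen and rejected responses:

import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # How much the policy upweights the preferred answer relative to the reference...
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    # ...and how much it upweights the dispreferred answer.
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Push the margin between the two log-ratios up; beta (illustrative default) scales the strength.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

Only the policy log-probabilities need gradients; the reference log-probabilities come from the frozen SFT model.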

Great video! A follow-up question: what kind of finetuning does the OpenAI API provide, where it finetunes a model on a training set of Q&A pairs supplied by the user?

learnsomethingnew

Does this only apply to transformers or would it also work with Mamba?

sfsft

Another question I had when reading the paper subtitle "Your Language Model is Secretly a Reward Model" was: in what way do they mean the language model is a reward model? To me it seems like they're not using a reward model at all, because they figured out that once they use a contrastive loss, they don't need one.

kristoferkrus
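On the "secretly a reward model" subtitle: the paper's point is that the policy being trained defines an implicit reward via the reparameterization

r(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x),

where Z(x) depends only on the prompt and cancels when two answers to the same prompt are compared. So there is indeed no separate reward network: the language model's own likelihood ratios against the reference model play the role of the reward.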