Direct Preference Optimization: Your Language Model is Secretly a Reward Model | DPO paper explained

Direct Preference Optimization (DPO) finetunes LLMs from human preferences without reinforcement learning. DPO was one of the two Outstanding Main Track Runner-Up papers at NeurIPS 2023.

Thanks to our Patrons who support us in Tier 2, 3, 4: 🙏
Dres. Trost GbR, Siltax, Vignesh Valliappan, @Mutual_Information , Kshitij

Outline:
00:00 DPO motivation
00:53 Finetuning with human feedback
01:39 RLHF explained
03:05 DPO explained
04:24 Why Reinforcement Learning in the first place?
05:58 Shortcomings
06:50 Results

▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
🔥 Optionally, pay us a coffee to help with our Coffee Bean production! ☕
▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀


#AICoffeeBreak #MsCoffeeBean #MachineLearning #AI #research

Video editing: Nils Trost
Music 🎵 : Ice & Fire - King Canyon
Comments

Wow, two videos in one week? You're spoiling us!!

DerPylz

On why RLHF came first, it was invented by OpenAI, which had focused almost exclusively on RL stuff prior to GPT. "When all you have is a hammer..." as the saying goes.

TheRyulord

I have not read the paper yet, but this sounds like supervised contrastive learning. If it is, then it's really astonishing that nobody came up with it before. I implemented some supervised contrastive learning myself... missed opportunity 😢

Neomadra

I really enjoy your videos. Please keep up the good work! My theory for why no one thought of this earlier is your first reason: they assumed there was no closed-form loss function, and that is where RL comes in.

ShihgianLee
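For reference, this is the closed-form objective the DPO paper derives (σ is the logistic sigmoid, β a temperature, y_w and y_l the preferred and dispreferred answers for prompt x, and π_ref the frozen reference model):

\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[ \log \sigma\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]

Minimizing this directly with gradient descent is what lets DPO skip the explicit reward model and the RL loop.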

Really excellent breakdown of DPO!

Given the gains we've seen from DPO in the open-source community, it makes complete sense that it was at least one of the runners-up at NeurIPS this year.

I’m really enjoying your explainer videos! Thank you for taking the time to make them!

MaJetiGizzle

These videos are great! Really well explained, thank you so much for the effort you put into them :)

poketopa

Great explanation, easy to follow and nicely simplified.

kayvanshah

Thanks for the video! Very well explained. I just began looking into DPO, and your video gives great context.

SrikanthIyer

Thanks for the pretty useful and timely explanation.

KPreddiePWSP

I’m no expert, but when RLHF was new, the most common justification I heard in explainer articles and videos was that the reward model was smaller than the LLM, so less likely to overfit on the human labels, and could be used to produce more data for the LLM to train on compared to just the expensive human-annotated data. So pretty much your second hypothesis.

alexkubiesa

Good video :) really enjoyed watching it

qzfrnfz

Thank you so much for such clear high level explanations.

IbrahimSobh

Great video and wonderful explanation. Thanks for covering the differences and thoughts about the limitations of just using DPO.

I am wondering why instruction finetuning was not mentioned. Wouldn't SFT make the whole DPO process more efficient? Especially when sampling directly from a pretrained model, it should be hard to even get good samples, since the model hasn't yet learned what questions and answers look like, no?

mkamp

Hi, I just came here because I saw you're a member of Sabine's channel. Wow, I did not expect to find another successful channel here. Your videos are very well made, even though computer science is not my field. I'll recommend your channel, all the best.

Thomas-gk

Thank you so much for your clear explanation ~~ it is really helpful :)) Hope you review other NeurIPS papers too, haha. Thanks ~~~

xkrgnhy

What about an explanation video on the MoE architecture (Mistral)?

paprikar

So the main logic lies in the custom loss function, which produces a higher loss for the next token if it is far from the positive example?

eck
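To make the shape of that loss concrete, here is a minimal PyTorch-style sketch (not the authors' code; the *_logps names are illustrative and are assumed to hold the summed token log-probabilities of each whole response under the trainable policy and under the frozen reference model). Rather than penalizing individual tokens for being far from the positive example, it contrasts the policy-vs-reference log-likelihood ratios of the entire chosen and rejected responses:

import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # How much the policy upweights the preferred answer relative to the reference...
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    # ...and how much it upweights the dispreferred answer.
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Push the margin between the two log-ratios up; beta (illustrative default) scales the strength.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

Only the policy log-probabilities need gradients; the reference log-probabilities come from the frozen SFT model.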

Great video! A follow-up question: what kind of finetuning does the OpenAI API provide, where it finetunes a model on a training set of Q&A pairs supplied by the user?

learnsomethingnew

Does this only apply to transformers or would it also work with Mamba?

sfsft

Another question I had when reading the paper subtitle "Your Language Model is Secretly a Reward Model" was: in what way do they mean the language model is a reward model? To me it seems like they're not using a reward model at all, because they figured out that once they use a contrastive loss, they don't need one.

kristoferkrus
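On the "secretly a reward model" subtitle: the paper's point is that the policy being trained defines an implicit reward via the reparameterization

r(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x),

where Z(x) depends only on the prompt and cancels when two answers to the same prompt are compared. So there is indeed no separate reward network: the language model's own likelihood ratios against the reference model play the role of the reward.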