Direct Preference Optimization (DPO): Your Language Model is Secretly a Reward Model Explained
Direct Preference Optimization: Your Language Model is Secretly a Reward Model | DPO paper explained
Direct Preference Optimization (DPO) explained: Bradley-Terry model, log probabilities, math
Direct Preference Optimization (DPO) - How to fine-tune LLMs directly without reinforcement learning
Direct Preference Optimization: Forget RLHF (PPO)
Reinforcement Learning from Human Feedback (RLHF) & Direct Preference Optimization (DPO) Explain...
Aligning LLMs with Direct Preference Optimization
Direct Preference Optimization (DPO)
LLM Alignment: Techniques for Building Human-Aligned AI
Direct Preference Optimization (DPO) in AI
Direct Preference Optimization in One Minute
DPO Debate: Is RL needed for RLHF?
Direct Preference Optimization
Direct Preference Optimization (DPO)
Towards Reliable Use of Large Language Models: Better Detection, Consistency, and Instruction-Tuning
What is Direct Preference Optimization?
Direct Preference Optimization (DPO): How It Works and How It Topped an LLM Eval Leaderboard
DPO : Direct Preference Optimization
Direct Preference Optimization (DPO) of LLMs to Reduce Toxicity
DPO - Part1 - Direct Preference Optimization Paper Explanation | DPO an alternative to RLHF??
DPO - Part2 - Direct Preference Optimization Implementation using TRL | DPO an alternative to RLHF??
Direct Preference Optimization Your Language Model is Secretly a Reward Model
LLM training process with Direct Preference Optimization (DPO) and bypass Reward Model (Part3)
Direct Preference Optimization: Your Language Model is Secretly a Reward Model (Ko/En Subtitles)