Direct Preference Optimization (DPO) explained: Bradley-Terry model, log probabilities, math
In this video I will explain Direct Preference Optimization (DPO), an alignment technique for language models introduced in the paper "Direct Preference Optimization: Your Language Model is Secretly a Reward Model".
I start by introducing language models and how they are used for text generation. After briefly introducing the topic of AI alignment, I review Reinforcement Learning (RL), a topic that is necessary to understand the reward model and its loss function.
I derive step by step the loss function of the reward model under the Bradley-Terry model of preferences, a derivation that is missing in the DPO paper.
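For reference, the reward-model loss derived in the video under the Bradley-Terry model (as stated in the DPO paper) can be written as:

\mathcal{L}_R(r_\phi, \mathcal{D}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\big( r_\phi(x, y_w) - r_\phi(x, y_l) \big) \right]

where y_w is the preferred completion, y_l the rejected one, and \sigma is the sigmoid function.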
Using the Bradley-Terry model, I build the loss of the DPO algorithm, not only explaining its math derivation, but also giving intuition on how it works.
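The resulting DPO loss, which replaces the explicit reward with log-probability ratios against a frozen reference policy \pi_{\text{ref}}, is:

\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]

where \beta controls how far the policy may drift from the reference model.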
In the last part, I describe how to use the loss practically, that is, how to calculate the log probabilities using a Transformer model, by showing how it is implemented in the Hugging Face library.
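As a rough illustration (a minimal sketch, not the actual Hugging Face/TRL implementation shown in the video), the summed log probability of a completion under a causal language model can be computed along these lines; the model name "gpt2" and the helper completion_log_prob are placeholders of my own:

# Minimal sketch: summed log probability of a completion given a prompt
# with a Hugging Face causal LM. Not the exact implementation from the video.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # any causal LM; gpt2 is just an illustrative choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def completion_log_prob(prompt: str, completion: str) -> torch.Tensor:
    # Assumes the tokenization of the prompt is a prefix of the tokenization
    # of prompt + completion (usually true for BPE tokenizers like GPT-2's).
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + completion, return_tensors="pt").input_ids

    with torch.no_grad():
        logits = model(full_ids).logits  # shape: (1, seq_len, vocab_size)

    # The logits at position t predict the token at position t+1, so shift by one.
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    targets = full_ids[:, 1:]
    token_log_probs = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)

    # Keep only the completion tokens (mask out the prompt part) and sum.
    completion_start = prompt_ids.shape[1] - 1  # index of first predicted completion token
    return token_log_probs[:, completion_start:].sum()

print(completion_log_prob("The capital of France is", " Paris").item())

In DPO training this quantity is computed for both the chosen and the rejected completion, under both the policy and the reference model, to form the log-ratio terms of the loss.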
If you're interested in how to derive the optimal solution to the RL constrained optimization problem, I highly recommend the following paper (Appendix A, equation 36):
Chapters
00:00:00 - Introduction
00:02:10 - Intro to Language Models
00:04:08 - AI Alignment
00:05:11 - Intro to RL
00:08:19 - RL for Language Models
00:10:44 - Reward model
00:13:07 - The Bradley-Terry model
00:21:34 - Optimization Objective
00:29:52 - DPO: deriving its loss
00:41:05 - Computing the log probabilities
00:47:27 - Conclusion