Direct Preference Optimization (DPO) explained: Bradley-Terry model, log probabilities, math

In this video I will explain Direct Preference Optimization (DPO), an alignment technique for language models introduced in the paper "Direct Preference Optimization: Your Language Model is Secretly a Reward Model".
I start by introducing language models and how they are used for text generation. After a brief overview of AI alignment, I review Reinforcement Learning (RL), a topic that is necessary to understand the reward model and its loss function.
I derive, step by step, the loss function of the reward model under the Bradley-Terry model of preferences, a derivation that is missing from the DPO paper.
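As a reference for that derivation, here is a minimal PyTorch sketch of the reward-model loss under the Bradley-Terry model (the function and tensor names are my own; reward_chosen and reward_rejected stand for the scalar rewards the model assigns to the preferred and dispreferred answers):

import torch
import torch.nn.functional as F

def reward_model_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry: P(chosen is preferred over rejected) = sigmoid(r_chosen - r_rejected).
    # Maximizing the log-likelihood of the preference dataset is equivalent to
    # minimizing -log sigmoid(r_chosen - r_rejected), averaged over the batch.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()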
Using the Bradley-Terry model, I build up the loss of the DPO algorithm, not only explaining its mathematical derivation but also giving intuition on how it works.
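The resulting DPO loss can be sketched in the same style (again with names of my own choosing; beta is the KL-penalty coefficient from the paper, and the *_logps arguments are the sequence log probabilities of the chosen and rejected answers under the policy being trained and under the frozen reference model):

import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit reward of an answer: beta * (log pi_theta(y|x) - log pi_ref(y|x)).
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Plug the implicit rewards into the Bradley-Terry negative log-likelihood.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()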
In the last part, I describe how to use the loss in practice, that is, how to compute the log probabilities with a Transformer model, showing how it is implemented in the Hugging Face library.
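As a rough illustration of that last step, the sequence log probabilities can be computed along these lines (a sketch similar in spirit to what the Hugging Face TRL DPOTrainer does, not its actual code; the function and tensor names are assumptions):

import torch

def sequence_log_probs(logits, labels, completion_mask):
    # logits: (batch, seq_len, vocab) from the model; labels: (batch, seq_len) token ids;
    # completion_mask: 1 for answer tokens, 0 for prompt and padding tokens.
    # Shift so that the logits at position t predict the token at position t + 1.
    logits = logits[:, :-1, :]
    labels = labels[:, 1:]
    mask = completion_mask[:, 1:]
    # Log-softmax over the vocabulary, then gather the log probability of each label token.
    per_token_logps = torch.log_softmax(logits, dim=-1).gather(2, labels.unsqueeze(2)).squeeze(2)
    # Sum over the answer tokens to obtain log pi(y|x) for each sequence in the batch.
    return (per_token_logps * mask).sum(-1)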

If you're interested in how to derive the optimal solution to the RL constrained optimization problem, I highly recommend the following paper (Appendix A, equation 36):
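For context, the closed-form solution of that constrained objective, as stated in the DPO paper, is

\pi^*(y \mid x) = \frac{1}{Z(x)} \, \pi_{\mathrm{ref}}(y \mid x) \, \exp\left(\frac{1}{\beta} r(x, y)\right), \qquad Z(x) = \sum_{y} \pi_{\mathrm{ref}}(y \mid x) \, \exp\left(\frac{1}{\beta} r(x, y)\right),

and DPO inverts this relationship to express the reward in terms of the policy and the reference model, which is what removes the need for an explicit reward model.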

Chapters
00:00:00 - Introduction
00:02:10 - Intro to Language Models
00:04:08 - AI Alignment
00:05:11 - Intro to RL
00:08:19 - RL for Language Models
00:10:44 - Reward model
00:13:07 - The Bradley-Terry model
00:21:34 - Optimization Objective
00:29:52 - DPO: deriving its loss
00:41:05 - Computing the log probabilities
00:47:27 - Conclusion
Comments

The legend returns, always excited for your videos. I am an international student at Shanghai Jiao Tong University. Your videos have given me a very strong foundation in transformers. Many blessings your way.

Patrick-wnuj

I humbly request you to make videos on how to build a career in machine learning and AI. I am a huge fan of your videos and I thank you for all the knowledge that you have shared.

sauravrao

Wow, your explanation is so clear and complete... you are a godsend, keep doing it. You're a phenomenon.

luxorska

Thank you! It's a very clear explanation. It helps with reading the original paper. Looking forward to new topics.

mlloving

The legend is back, the GOAT. If my guess is right, next will be ORPO or Q*.

RudraPratapDhara

I believe the most evident insight of DPO is to change an RL problem into an equivalent MLE problem, while the optimal reward model is guaranteed by the human preference input, by definition. That's the meat. But the effectiveness still depends on the human annotators' consistency.

binjianxin

Very clear explanations!! Please continue making such good videos!

amanattheedge

Thanks for making these videos. Concise and clear.

cken

Thanks so much Umar, I always learn a lot from your videos!

nwanted

These lectures are amazing. Thank you!

vanmira

Awesome, thank you so much for putting this out, super helpful!

amankhurana

My kind request: please increase the volume a little bit, just a little bit. Otherwise your videos are outstanding. Best I can say.

olympus

New video 🎉 can't wait to watch, although I have been using DPO in production for a while now!

lukeskywalker

Love from India sir, you are a legend 😊😊

mrsmurf

Enjoyed the style in which the video is presented. Which video editor/tools do you use to make your videos? Thanks.

jak-zee

Thank you very much for this video, please make one on ORPO as well.

mahdisalmani

Amazing explanation. Would it be possible to make a video on the theory and implementation of automatic differentiation (autograd)?

abdullahalsaadi

Thank you for the video! Can you make a video that explains AgentQ training in detail?

AndriiLomakin

Amazing video! Please do one on SPIN (Self-Play Fine-Tuning) as well.

AptCyborg

Thanks for your lecture. I wonder if you could explain vision-language models.

tuanduc