Direct Preference Optimization (DPO) explained: Bradley-Terry model, log probabilities, math
In this video I will explain Direct Preference Optimization (DPO), an alignment technique for language models introduced in the paper "Direct Preference Optimization: Your Language Model is Secretly a Reward Model".
I start by introducing language models and how they are used for text generation. After briefly introducing the topic of AI alignment, I review Reinforcement Learning (RL), a topic that is necessary to understand the reward model and its loss function.
I derive step by step the loss function of the reward model under the Bradley-Terry model of preferences, a derivation that is missing in the DPO paper.
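For reference, the reward-model loss derived in the video under the Bradley-Terry model (as stated in the DPO paper) can be written as:

\mathcal{L}_R(r_\phi, \mathcal{D}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\big( r_\phi(x, y_w) - r_\phi(x, y_l) \big) \right]

where y_w is the preferred completion, y_l the rejected one, and \sigma is the sigmoid function.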
Using the Bradley-Terry model, I build the loss of the DPO algorithm, not only explaining its math derivation, but also giving intuition on how it works.
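The resulting DPO loss, which replaces the explicit reward with log-probability ratios against a frozen reference policy \pi_{\text{ref}}, is:

\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]

where \beta controls how far the policy may drift from the reference model.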
In the last part, I describe how to use the loss practically, that is, how to calculate the log probabilities using a Transformer model, by showing how it is implemented in the Hugging Face library.
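As a rough illustration (a minimal sketch, not the actual Hugging Face/TRL implementation shown in the video), the summed log probability of a completion under a causal language model can be computed along these lines; the model name "gpt2" and the helper completion_log_prob are placeholders of my own:

# Minimal sketch: summed log probability of a completion given a prompt
# with a Hugging Face causal LM. Not the exact implementation from the video.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # any causal LM; gpt2 is just an illustrative choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def completion_log_prob(prompt: str, completion: str) -> torch.Tensor:
    # Assumes the tokenization of the prompt is a prefix of the tokenization
    # of prompt + completion (usually true for BPE tokenizers like GPT-2's).
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + completion, return_tensors="pt").input_ids

    with torch.no_grad():
        logits = model(full_ids).logits  # shape: (1, seq_len, vocab_size)

    # The logits at position t predict the token at position t+1, so shift by one.
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    targets = full_ids[:, 1:]
    token_log_probs = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)

    # Keep only the completion tokens (mask out the prompt part) and sum.
    completion_start = prompt_ids.shape[1] - 1  # index of first predicted completion token
    return token_log_probs[:, completion_start:].sum()

print(completion_log_prob("The capital of France is", " Paris").item())

In DPO training this quantity is computed for both the chosen and the rejected completion, under both the policy and the reference model, to form the log-ratio terms of the loss.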
If you're interested in how to derive the optimal solution to the RL constrained optimization problem, I highly recommend the following paper (Appendix A, equation 36):
Chapters
00:00:00 - Introduction
00:02:10 - Intro to Language Models
00:04:08 - AI Alignment
00:05:11 - Intro to RL
00:08:19 - RL for Language Models
00:10:44 - Reward model
00:13:07 - The Bradley-Terry model
00:21:34 - Optimization Objective
00:29:52 - DPO: deriving its loss
00:41:05 - Computing the log probabilities
00:47:27 - Conclusion