Direct Preference Optimization (DPO) - How to fine-tune LLMs directly without reinforcement learning

Direct Preference Optimization (DPO) is a method used for training Large Language Models (LLMs). DPO is a direct way to train the LLM without the need for reinforcement learning, which makes it more effective and more efficient.
Learn about it in this simple video!

This is the third one in a series of 4 videos dedicated to the reinforcement learning methods used for training LLMs.

Video 3 (This one!): Direct Preference Optimization (DPO)

00:00 Introduction
01:08 RLHF vs DPO
07:19 The Bradley-Terry Model
11:25 KL Divergence
16:32 The Loss Function
14:36 Conclusion
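
For anyone who wants to see the loss function from 16:32 in code, here is a minimal sketch of the DPO objective in PyTorch. The function name, the default beta, and the assumption that you already have summed log-probabilities of the chosen and rejected responses under both the trained model (policy) and the frozen reference model are illustrative, not taken from the video.

import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit rewards: beta * log(pi_theta / pi_ref) for each response.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Bradley-Terry: the probability that the chosen answer beats the
    # rejected one is the sigmoid of the reward difference; we minimize
    # its negative log over the batch.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

The beta coefficient plays the role of the KL-divergence penalty discussed at 11:25: larger values keep the fine-tuned model closer to the reference model.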

Get the Grokking Machine Learning book!
Discount code (40%): serranoyt
(Use the discount code at checkout)
Comments

Thank you very much for the video!
Do I understand correctly that RLHF still has some advantages, namely that with it we can gather a small amount of human preference data, train a reward model on that data, and then have the reward model itself evaluate many more new examples?
So once the reward model is trained, we basically have a free human annotator that can rate endless new examples.
In the case of DPO, however, we only have the initial human preference data and that's it.

miklefeldman
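
A rough sketch of the point in the comment above, assuming a hypothetical Hugging Face-style reward model with a scalar output head (the names reward_model, tokenizer, and rank_candidates are placeholders, not anything shown in the video): once the reward model is trained on the small human-labeled set, it can score as many new responses as you like, whereas DPO consumes the pairwise preference data directly during fine-tuning.

import torch

def rank_candidates(reward_model, tokenizer, prompt, candidates):
    # Score each candidate response with the learned reward model,
    # which stands in for a human annotator, and return them best-first.
    scores = []
    for response in candidates:
        inputs = tokenizer(prompt + response, return_tensors="pt")
        with torch.no_grad():
            # Assumes the model returns a single scalar reward in .logits.
            scores.append(reward_model(**inputs).logits.squeeze().item())
    return sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)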

Hi Mr. Serrano! I am doing your Coursera course on linear algebra for machine learning at the moment and I am having so much fun! You are a brilliant teacher, and I just wanted to say thank you! I wish more teachers would bring theoretical mathematics down to a more practical level. Obviously loving the very expensive fruit examples :)

Cathiina

DPO main equation should be PPO main equation.

guzh

I'm a little confused about one thing: the reward function, even in the Bradley-Terry model, is based on the human-given scores for individual context-prediction pairs, right? And π_θ is the probability from the current iteration of the network, and π_ref is the probability from the original, untuned network?

So then after that "mathematical manipulation", how does the human-given set of scores become represented by the network's predictions all of a sudden?

IceMetalPunk
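
A sketch of the step the comment above is asking about, following the DPO paper's derivation rather than anything shown verbatim in the video. The optimal policy of the KL-constrained reward-maximization problem lets the reward be rewritten in terms of the policy itself:

r(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{ref}(y \mid x)} + \beta \log Z(x)

Substituting this into the Bradley-Terry probability, the \beta \log Z(x) term is shared by both responses and cancels in the difference:

P(y_w \succ y_l \mid x) = \sigma( r(x, y_w) - r(x, y_l) )
                        = \sigma\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{ref}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{ref}(y_l \mid x)} \right)

L_{DPO}(\theta) = - E\left[ \log \sigma\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{ref}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{ref}(y_l \mid x)} \right) \right]

So the human data enters only through which response is labeled preferred (y_w) and which rejected (y_l); the per-example scores are replaced by the model's own log-probability ratios against the reference.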

Really love the way you broke down the DPO loss; this direct approach is more welcome by my brain :). Just one question on the video: I am wondering how important it is to choose the initial transformer carefully. I suspect that if it is very bad at the task, then we will have to change the initial responses a lot, but because the loss function prevents changing too much in one iteration, we will need to perform a lot of tiny changes toward the good answer, making the training extremely long. Am I right?

frankl
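
Possibly useful context for the question above, taken from the gradient of the DPO loss as written in the DPO paper (not from the video):

\nabla_\theta L_{DPO} = -\beta \, E\Big[ \sigma\big( \hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w) \big) \big( \nabla_\theta \log \pi_\theta(y_w \mid x) - \nabla_\theta \log \pi_\theta(y_l \mid x) \big) \Big], \quad \hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{ref}(y \mid x)}

The \sigma(\cdot) factor is largest when the current model still ranks the pair incorrectly, and \beta sets how strongly the implicit reward ties the model to the reference, so in this expression a weak starting model shows up mainly as more of these corrective steps rather than as a hard cap on each step's size.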

Thanks for sharing. Is there any hands-on resource to try DPO?

subhamkundu

Thanks for the simplified explanation. Awesome as always.
The book link in the description is not working.

AravindUkrd

Great video as always. I have a question: in practice, which one works better, DPO or RLHF?

mekuzeeyo

Did anyone else expect something different from softmax for the Bradley-Terry model, as I did? 😅

frankl
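
For what it is worth, with only two candidates the Bradley-Terry preference probability is exactly a two-way softmax over the scores, which is the same thing as a sigmoid of the score difference:

P(y_1 \succ y_2) = \frac{e^{r_1}}{e^{r_1} + e^{r_2}} = \sigma(r_1 - r_2)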

It's kinda hard to remember all of these formulas and it's demotivating me from further learning.

VerdonTrigance