How DPO Works and Why It's Better Than RLHF

This week we cover the "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" paper from Stanford. It shows how you can remove the need to train a tricky, separate reward model by optimizing the LLM directly on preference data with DPO instead.
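For readers who want the objective in code, here is a minimal sketch of the DPO loss as described in the paper. The function and argument names (dpo_loss, policy_chosen_logps, etc.) are illustrative, not from the video or any library; each argument is the summed per-token log-probability of a response under the policy being trained or the frozen reference model.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit reward of each response: beta * log(pi_theta / pi_ref),
    # computed as a difference of sequence log-probabilities.
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps

    # Logistic loss on the preference pair: push the chosen response's
    # log-ratio above the rejected one's, scaled by beta.
    logits = beta * (chosen_logratios - rejected_logratios)
    return -F.logsigmoid(logits).mean()
```

No separate reward model is trained; the frozen reference model's log-probabilities take its place, which is the point of the paper's title.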


Training Language Models to Follow Instructions 📖
Comments

Watching it now. Thanks for sharing! I really like the diagram at 8:00-9:24 and would appreciate more elaboration; it is very helpful. I feel there are still very limited resources on the overall pipeline, and there are so many questions and best practices across its different parts. Hope to see more material on that.

ax

DPO is so much more elegant than RLHF, thanks for covering this paper!

ThethDoctor

This is great content, thank you for sharing!

kitanomegumi

Hello Oxen, thank you very much for the detailed explanation. I have a question: RLHF deals with huge datasets with no need to label them, since the reward model handles judging the responses. But with DPO, we have to tag/label the complete dataset with human effort, which is very time- and resource-consuming. I'm unable to understand the real benefit of DPO over RLHF here. Could you please help me understand this? I would really appreciate a direct conversation with you on your preferred platform. Thanks in advance.

vamshi-rvk

Hi Greg, can the pi_ref terms from the winner and loser be canceled out? (log(w/c) - log(l/c)) = log(w) - log(c) - (log(l) - log(c)) = log(w) - log(l). Sorry if I am missing some simple math here. Thanks for covering this paper.

sandeepsaluru
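For context on the question above, a sketch of the DPO objective as it is usually written (a restatement, not the video's wording):

$$
\mathcal{L}_{\mathrm{DPO}} = -\log \sigma\!\left(\beta\left[\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} - \log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right]\right)
$$

Expanding the bracket gives $\log\pi_\theta(y_w\mid x) - \log\pi_{\mathrm{ref}}(y_w\mid x) - \log\pi_\theta(y_l\mid x) + \log\pi_{\mathrm{ref}}(y_l\mid x)$; the two reference terms are evaluated on different responses ($y_w$ and $y_l$), so they are generally not a shared constant and do not cancel out of the loss.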