How DPO Works and Why It's Better Than RLHF

This week we cover the "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" paper from Stanford. It shows how you can remove the need to train a tricky, separate reward model by optimizing the LLM directly on preference data with DPO instead.
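For readers who want the objective in code, here is a minimal sketch of the DPO loss as described in the paper. The function and argument names (dpo_loss, policy_chosen_logps, etc.) are illustrative, not from the video or any library; each argument is the summed per-token log-probability of a response under the policy being trained or the frozen reference model.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit reward of each response: beta * log(pi_theta / pi_ref),
    # computed as a difference of sequence log-probabilities.
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps

    # Logistic loss on the preference pair: push the chosen response's
    # log-ratio above the rejected one's, scaled by beta.
    logits = beta * (chosen_logratios - rejected_logratios)
    return -F.logsigmoid(logits).mean()
```

No separate reward model is trained; the frozen reference model's log-probabilities take its place, which is the point of the paper's title.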


Training Language Models to Follow Instructions 📖
Comments

Watching it now. Thanks for sharing! I really like the diagram at 8:00-9:24 and would appreciate more elaboration; it is very helpful. I feel there are still very limited resources on the overall pipeline, and there are so many questions and best practices across its different parts. Hope to see more material on that.

ax

DPO is so much more elegant than RLHF, thanks for covering this paper!

ThethDoctor

This is great content, thank you for sharing!

kitanomegumi

Hello Oxen, thank you very much for the detailed explanation. I have a question: RLHF deals with huge datasets with no need to label them, since the reward model handles judging the responses. But with DPO, we have to tag/label the complete dataset with human effort, which is very time- and resource-consuming. I'm unable to understand the real benefit of DPO over RLHF here. Could you please help me understand this? I would really appreciate a direct conversation with you on your preferred platform. Thanks in advance.

vamshi-rvk

Hi Greg, can the pi_ref terms from the winner and loser be canceled out? (log(w/c) - log(l/c)) = log(w) - log(c) - (log(l) - log(c)) = log(w) - log(l). Sorry if I am missing some simple math here. Thanks for covering this paper.

sandeepsaluru
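For context on the question above, a sketch of the DPO objective as it is usually written (a restatement, not the video's wording):

$$
\mathcal{L}_{\mathrm{DPO}} = -\log \sigma\!\left(\beta\left[\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} - \log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right]\right)
$$

Expanding the bracket gives $\log\pi_\theta(y_w\mid x) - \log\pi_{\mathrm{ref}}(y_w\mid x) - \log\pi_\theta(y_l\mid x) + \log\pi_{\mathrm{ref}}(y_l\mid x)$; the two reference terms are evaluated on different responses ($y_w$ and $y_l$), so they are generally not a shared constant and do not cancel out of the loss.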