Direct Preference Optimization (DPO): Your Language Model is Secretly a Reward Model Explained

Comments

Dude, you're a hero for making these videos! Definitely earned a subscription from me.

rw-kbqv

I'm really glad a channel like yours exists.

FaultyTwo

Thank you so much! I truly enjoyed your video and the way you explain things. There are moments, however, where I find myself a bit lost when you don't delve deeper, and I wish you could expand on those points. I would appreciate it if you could recommend a video or article to help me become more familiar with the basic concepts behind papers in this field. Understanding these basics would make it much easier for me to grasp the material. I feel there are some gaps in my knowledge; if I work on them, understanding these papers and their mathematical notation will be much easier. I'm focusing on papers in the area of transformers and training models. Any suggestions would be greatly appreciated. Btw, you definitely earned my subscription as well.

unclecode

Nice one, thanks for sharing. The replacement of the reward model by the MLE term looks appealing when we have ground truth (a generated reply and a reference reply). Still, the advantage of reward models is mainly their potential to be used on new samples without ground truth present (i.e., no reference replies in self-play training on new datasets), so how would the MLE loss work in such scenarios?

prof_shixo
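
For reference, the loss under discussion compares a preferred ("chosen") and a dispreferred ("rejected") reply through the policy-to-reference log-probability ratio. Below is a minimal sketch, assuming PyTorch and summed per-sequence log-probabilities as inputs; the function and argument names are illustrative, not taken from the paper's code.

    import torch
    import torch.nn.functional as F

    def dpo_loss(policy_chosen_logps, policy_rejected_logps,
                 ref_chosen_logps, ref_rejected_logps, beta=0.1):
        # Implicit reward of each reply: beta * log( pi(y|x) / pi_ref(y|x) )
        chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
        rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
        # DPO objective: -log sigmoid of the reward margin (chosen minus rejected)
        return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

Here beta plays the role of the DPO temperature that controls how strongly the policy is kept close to the reference model; note that the loss only needs a labeled preference pair, which is exactly the limitation raised in the comment above when no reference replies are available.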

I was reading Zephyr. It led to this DPO paper, which landed me on your channel. I am soooo happy. Keep it up!

bibiworm

Hey @Gabriel, can you please clear up a doubt for me: why, at 12:58, can't we just directly backpropagate the loss like we do in simple fine-tuning? I'm not understanding it. Could you share any relevant resources?

YashVerma-iilx
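
On the backpropagation question above: in the RLHF setup the reward is computed on text sampled from the model, and discrete sampling has no gradient, so the reward cannot be pushed back into the model weights directly; that is presumably the point being made around 12:58. DPO sidesteps this because its loss is computed on fixed preference pairs and is fully differentiable, as in the sketch earlier in this thread. A tiny illustration of the sampling issue, assuming PyTorch:

    import torch

    logits = torch.randn(1, 5, requires_grad=True)      # toy next-token logits
    probs = torch.softmax(logits, dim=-1)
    token_id = torch.multinomial(probs, num_samples=1)  # sample a discrete token
    print(token_id.requires_grad)                       # False: sampling cut the graph

    # Any reward computed on sampled text therefore has no gradient path back to
    # the model, which is why RLHF resorts to policy-gradient methods such as PPO,
    # while DPO scores given preference pairs with a differentiable loss instead.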