Direct Preference Optimization: Forget RLHF (PPO)

DPO replaces RLHF: In this technical and informative video, we explore a groundbreaking methodology from Stanford University called direct preference optimization (DPO), which has the potential to replace reinforcement learning from human feedback in the training of GPT-style systems.

Join us as we dive into the intricacies of direct preference optimization, dissecting its technical details and highlighting its advantages over the conventional reinforcement learning approach.

Discover how this innovative technique opens new possibilities in AI training, offering more precise control and improved performance.

Direct Preference Optimization (DPO) can fine-tune large language models (LLMs) to align with human preferences as well as or better than existing methods. Notably, fine-tuning with DPO exceeds the ability of RLHF (reinforcement learning from human feedback) to control the sentiment of generations, and it improves response quality in summarization and single-turn dialogue while being substantially simpler to implement and train.
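For a concrete picture of what "simpler to implement" means, here is a minimal PyTorch sketch of the DPO objective described in the paper. The function name and the dummy inputs are illustrative assumptions, not code from the video or the authors; in practice the per-sequence log-probabilities would come from summing token log-probabilities under the trained policy and a frozen reference model.

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Sketch of the DPO objective: widen the policy's log-probability
    margin between chosen and rejected responses relative to the
    reference model, with no explicit reward model or RL loop."""
    # Log-ratios of the policy versus the frozen reference model
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    # -log sigmoid(beta * (chosen margin - rejected margin))
    losses = -F.logsigmoid(beta * (chosen_logratios - rejected_logratios))
    return losses.mean()

# Dummy summed log-probabilities for a batch of two preference pairs
policy_chosen = torch.tensor([-12.3, -15.1])
policy_rejected = torch.tensor([-14.0, -15.8])
ref_chosen = torch.tensor([-13.0, -15.5])
ref_rejected = torch.tensor([-13.5, -15.6])
print(dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected))

Because the loss is an ordinary supervised objective over preference pairs, it can be minimized with a standard optimizer, which is the practical simplification over the PPO-based RLHF pipeline.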

All rights remain with the authors of:
"Direct Preference Optimization: Your Language Model is Secretly a Reward Model"
Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn (Stanford Univ)
Comments

Can you consider doing a separate video on the math

wryltxw

Loved the video! Please include more maths!

ktajfar

Loading the paper straight onto my Kindle 👍
Thank you <3

cutmasta-kun

Great video! Did you examine this deeply, and is it as good as they promised?

gileneusz

Looking forward to seeing this scaled up.

kevon

Can't wait to start testing this on some models

MadhavanSureshRobos

I think this is unsurprising, in that the language model acts as a policy network, and there are RL methods that optimize only a policy network.

This is still reinforcement learning.

What would be surprising is if this turned out to be the best approach for reinforcement learning, contrary to what has been seen in other areas.

It makes sense that the RLHF approach is not the most practical, and this may make simplified methods the preferred choice in the short term, but it would be odd if, given sufficient resources, one could not do better on the metrics.

RL approaches have often relied on multiple networks and huge sample sizes, so it could be that most existing methods are not suitable for the LLM setting and new hybrid-like variants will be developed.

osuf