DPO - Part1 - Direct Preference Optimization Paper Explanation | DPO an alternative to RLHF??

In this video, I explain in detail the DPO paper, which proposes a method that can serve as an alternative to RLHF. DPO is a computationally efficient method: it computes the log-probabilities of preferred and dispreferred completions under the model and updates the model's parameters to increase the likelihood of the preferred responses and decrease that of the dispreferred ones, aligning the model with human preferences without a separate reward model, unlike PPO-based RLHF algorithms.
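As a rough illustration of that idea, here is a minimal PyTorch sketch of the DPO objective. This is not the code from the video's repository: the model interface (Hugging Face-style `.logits`), the batch field names (`chosen_ids`, `rejected_labels`, etc.), and the label masking convention (`-100` for prompt tokens) are all assumptions made for the example.

```python
import torch
import torch.nn.functional as F

def completion_log_prob(model, input_ids, attention_mask, labels):
    """Summed log-probability of the completion tokens under `model`.
    Tokens labeled -100 (the prompt) are ignored."""
    logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
    # Shift so that logits at position t predict token t+1.
    logits = logits[:, :-1, :]
    labels = labels[:, 1:]
    mask = labels != -100
    log_probs = F.log_softmax(logits, dim=-1)
    token_log_probs = torch.gather(
        log_probs, 2, labels.clamp(min=0).unsqueeze(-1)
    ).squeeze(-1)
    return (token_log_probs * mask).sum(-1)

def dpo_loss(policy, reference, batch, beta=0.1):
    """DPO objective: raise the policy's log-probability margin on preferred
    vs. dispreferred completions relative to a frozen reference model,
    with no reward model involved."""
    pi_chosen = completion_log_prob(policy, batch["chosen_ids"],
                                    batch["chosen_mask"], batch["chosen_labels"])
    pi_rejected = completion_log_prob(policy, batch["rejected_ids"],
                                      batch["rejected_mask"], batch["rejected_labels"])
    with torch.no_grad():  # the reference model is kept frozen
        ref_chosen = completion_log_prob(reference, batch["chosen_ids"],
                                         batch["chosen_mask"], batch["chosen_labels"])
        ref_rejected = completion_log_prob(reference, batch["rejected_ids"],
                                           batch["rejected_mask"], batch["rejected_labels"])
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -F.logsigmoid(beta * margin).mean()
```

Minimizing this loss pushes the policy to assign relatively higher probability to the preferred completion than the reference model does, which is exactly the alignment effect the video attributes to DPO.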

For any discussions, you can connect with me via the following social links:

Feel free to join the telegram group for discussions using the following link

The code will be available in the following repository:

Links of playlists of the channel:

Watch: "LoRA - Low-rank Adaption of Large Language Models Paper In-depth Explanation | NLP Research Papers"
Comments

Great motivator to finally dive into those optimization algorithms! It's all much more fun now that tooling, hardware support and quantization allow more affordable DIY stuff 😃

videopublisher

Can we train LLaMA 2 for a multilingual translation task? I think we can; I'm just confirming.

shivamkumar-qpjm