Direct Preference Optimization: Your Language Model is Secretly a Reward Model | DPO paper explained

Direct Preference Optimization (DPO) fine-tunes LLMs on human preference data directly, without reinforcement learning. DPO was one of the two Outstanding Main Track Runner-Up papers at NeurIPS 2023.
Thanks to our Patrons who support us in Tier 2, 3, 4: 🙏
Dres. Trost GbR, Siltax, Vignesh Valliappan, @Mutual_Information, Kshitij
Outline:
00:00 DPO motivation
00:53 Finetuning with human feedback
01:39 RLHF explained
03:05 DPO explained
04:24 Why Reinforcement Learning in the first place?
05:58 Shortcomings
06:50 Results
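As a rough sketch of the idea the video covers, the DPO loss compares how much the fine-tuned policy prefers the chosen answer over the rejected one, relative to a frozen reference model: -log σ(β[(log πθ(y_w) − log πref(y_w)) − (log πθ(y_l) − log πref(y_l))]). A minimal per-example computation, where the log-probability values and β are made-up assumptions for illustration:

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-example DPO loss from summed token log-probabilities.

    All inputs are hypothetical scalars; in practice they come from
    summing token log-probs of each response under the two models.
    """
    # Implicit "reward" of each response: log-ratio vs. the reference model
    chosen_ratio = policy_logp_chosen - ref_logp_chosen
    rejected_ratio = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_ratio - rejected_ratio)
    # -log sigmoid(logits), written as a numerically stable softplus(-logits)
    return math.log1p(math.exp(-logits)) if logits > -30 else -logits

# Made-up log-probabilities: policy prefers the chosen answer more
# strongly than the reference does, so the loss is below log(2).
loss = dpo_loss(-12.0, -15.0, -13.0, -14.0, beta=0.1)
```

Minimizing this loss pushes the policy to raise the chosen response's likelihood and lower the rejected one's, while the β-scaled log-ratio to the reference model plays the role RLHF assigns to the KL penalty.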
▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
🔥 Optionally, pay us a coffee to help with our Coffee Bean production! ☕
▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
🔗 Links:
#AICoffeeBreak #MsCoffeeBean #MachineLearning #AI #research
Video editing: Nils Trost
Music 🎵 : Ice & Fire - King Canyon