Direct Preference Optimization (DPO) - How to fine-tune LLMs directly without reinforcement learning
Direct Preference Optimization (DPO) is a method for fine-tuning Large Language Models (LLMs). DPO trains the LLM directly on human preference data, skipping the separate reward model and reinforcement learning step used in RLHF, which makes it simpler and more efficient.
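The loss covered in the video can be sketched in a few lines. This is a minimal per-example version of the standard DPO loss (negative log-sigmoid of the scaled log-probability margins between the policy and a frozen reference model); the function name, the toy log-probabilities, and the beta value below are illustrative assumptions, not taken from the video.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-example DPO loss.

    margin = beta * [(log pi(y_w|x) - log pi_ref(y_w|x))
                     - (log pi(y_l|x) - log pi_ref(y_l|x))]
    loss   = -log sigmoid(margin)
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Toy log-probabilities (hypothetical numbers for illustration):
# the policy favors the chosen answer more than the reference does,
# so the margin is positive and the loss is below log(2).
loss = dpo_loss(-2.0, -3.5, -2.5, -3.0, beta=0.1)
```

When the policy and reference assign identical log-probabilities, the margin is zero and the loss is exactly log(2); pushing probability mass toward the preferred answer drives the loss down.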
Learn about it in this simple video!
This is the third one in a series of 4 videos dedicated to the reinforcement learning methods used for training LLMs.
Video 3 (This one!): Direct Preference Optimization
00:00 Introduction
01:08 RLHF vs DPO
07:19 The Bradley-Terry Model
11:25 KL Divergence
14:36 The Loss Function
16:32 Conclusion
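The Bradley-Terry model from the chapter list can also be sketched directly: it turns a difference of two reward scores into the probability that one response is preferred over the other via a sigmoid. The function name and reward values below are illustrative assumptions.

```python
import math

def bradley_terry_prob(reward_w, reward_l):
    """Bradley-Terry preference probability:
    P(y_w preferred over y_l) = sigmoid(r(y_w) - r(y_l))."""
    return 1.0 / (1.0 + math.exp(-(reward_w - reward_l)))

# A response scoring one reward point higher is preferred
# about 73% of the time; equal rewards give exactly 0.5.
p = bradley_terry_prob(1.5, 0.5)
```

The two preference probabilities for a pair always sum to 1, which is what lets DPO treat each labeled pair as a single binary classification target.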
Get the Grokking Machine Learning book!
Discount code (40%): serranoyt
(Use the discount code on checkout)