Reinforcement Learning from Human Feedback (RLHF) & Direct Preference Optimization (DPO) Explained

Learn how Reinforcement Learning from Human Feedback (RLHF) actually works and why Direct Preference Optimization (DPO) and Kahneman-Tversky Optimization (KTO) are changing the game.
This video doesn't go deep on math. Instead, I provide a high-level overview of each technique to help you make practical decisions about where to focus your time and energy.
0:52 The Idea of Reinforcement Learning
1:55 Reinforcement Learning from Human Feedback (RLHF)
4:21 RLHF in a Nutshell
5:06 RLHF Variations
6:11 Challenges with RLHF
7:02 Direct Preference Optimization (DPO)
7:47 Preferences Dataset Example
8:29 DPO in a Nutshell
9:25 DPO Advantages over RLHF
10:32 Challenges with DPO
10:50 Kahneman-Tversky Optimization (KTO)
11:39 Prospect Theory
13:35 Sigmoid vs Value Function
13:49 KTO Dataset
15:28 KTO in a Nutshell
15:54 Advantages of KTO
18:03 KTO Hyperparameters
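To make the contrast from the chapters above concrete: DPO trains on paired preferences (a prompt with a chosen and a rejected completion), while KTO only needs single completions tagged as desirable or undesirable. The sketch below uses assumed field names; libraries such as TRL use similar, but not necessarily identical, keys.

```python
# Illustrative records only; exact key names depend on the training library you use.

# DPO-style preference pair: one prompt, a preferred answer, and a dispreferred answer.
dpo_example = {
    "prompt": "Summarize the return policy in one sentence.",
    "chosen": "Items can be returned within 30 days with a receipt.",
    "rejected": "Returns are handled by the returns department.",
}

# KTO-style example: a single completion with a binary desirable/undesirable label,
# so you don't need matched pairs of good and bad answers for the same prompt.
kto_example = {
    "prompt": "Summarize the return policy in one sentence.",
    "completion": "Items can be returned within 30 days with a receipt.",
    "label": True,  # True = desirable, False = undesirable
}
```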
These are the three papers referenced in the video:
The Hugging Face TRL library offers implementations of PPO, DPO, and KTO:
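For orientation, here is a minimal sketch of what DPO fine-tuning with TRL roughly looks like. The model and dataset names are placeholders, and argument names vary between TRL versions (for example, older releases take tokenizer= where newer ones take processing_class=), so treat this as an illustration rather than a drop-in script; KTOTrainer and PPOTrainer follow a similar pattern.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Placeholder base model; any causal LM with a chat-style dataset works.
model_name = "gpt2"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# DPO expects preference pairs: prompt + chosen + rejected completions.
# Placeholder dataset name; substitute your own preference data.
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

# beta controls how strongly the policy is kept close to the reference model.
config = DPOConfig(output_dir="dpo-model", beta=0.1)

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,  # older TRL versions: tokenizer=tokenizer
)
trainer.train()
```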
Want to prototype with prompts and supervised fine-tuning? Try Entry Point AI:
How about connecting? I'm on LinkedIn: