Reinforcement Learning from Human Feedback (RLHF) & Direct Preference Optimization (DPO) Explained

Learn how Reinforcement Learning from Human Feedback (RLHF) actually works and why Direct Preference Optimization (DPO) and Kahneman-Tversky Optimization (KTO) are changing the game.

This video doesn't go deep on math. Instead, I provide a high-level overview of each technique to help you make practical decisions about where to focus your time and energy.

0:52 The Idea of Reinforcement Learning
1:55 Reinforcement Learning from Human Feedback (RLHF)
4:21 RLHF in a Nutshell
5:06 RLHF Variations
6:11 Challenges with RLHF
7:02 Direct Preference Optimization (DPO)
7:47 Preferences Dataset Example
8:29 DPO in a Nutshell
9:25 DPO Advantages over RLHF
10:32 Challenges with DPO
10:50 Kahneman-Tversky Optimization (KTO)
11:39 Prospect Theory
13:35 Sigmoid vs Value Function
13:49 KTO Dataset
15:28 KTO in a Nutshell
15:54 Advantages of KTO
18:03 KTO Hyperparameters

These are the three papers referenced in the video:

The Hugging Face TRL library offers implementations of PPO, DPO, and KTO:
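For a rough idea of what using TRL looks like in practice, here is a minimal sketch of DPO fine-tuning, with a KTO-style row shown alongside for contrast. The model name, example rows, and hyperparameter values are placeholders I chose for illustration, and argument names differ between TRL versions (older releases take tokenizer= where newer ones take processing_class=), so treat this as a starting point rather than a drop-in recipe.

```python
# Illustrative sketch of preference tuning with Hugging Face TRL (not from the video).
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "Qwen/Qwen2-0.5B-Instruct"  # placeholder; any small causal LM works for a demo
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# DPO trains on *paired* preferences: one prompt with a chosen and a rejected answer.
dpo_dataset = Dataset.from_dict({
    "prompt":   ["Summarize RLHF in one sentence."],
    "chosen":   ["RLHF fine-tunes a model against a reward model learned from human preference rankings."],
    "rejected": ["RLHF is when a model trains itself."],
})

# KTO, by contrast, takes *unpaired* examples with a binary desirable/undesirable label
# (thumbs up / thumbs down); TRL's KTOTrainer expects rows shaped roughly like this:
kto_row = {
    "prompt": "Summarize RLHF in one sentence.",
    "completion": "RLHF is when a model trains itself.",
    "label": False,  # undesirable
}

args = DPOConfig(output_dir="dpo-demo", beta=0.1)  # beta limits drift from the reference model
trainer = DPOTrainer(model=model, args=args, train_dataset=dpo_dataset, processing_class=tokenizer)
trainer.train()
```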

Want to prototype with prompts and supervised fine-tuning? Try Entry Point AI:

How about connecting? I'm on LinkedIn:
Comments

Great video! Is it better to use KTO as the optimizer for a binary classification task?

liberate

Hey! Thanks for the video! I've never used these techniques, but what I really want to do is train a base or chat LLM like Llama or Phi-3 on a large body of text (The Lord of the Rings, for example). All the techniques I've seen so far require a properly prepared dataset, but who would prepare that, and how? Ask every possible question and answer them all? That seems impossible! Do you know how I can prepare a dataset to train a model on?

VerdonTrigance