Direct Preference Optimization (DPO)


Resources:

Chapters:
0:00 Direct Preference Optimisation
0:37 Video Overview
1:37 How does “normal” fine-tuning work?
3:41 How does DPO work?
8:31 DPO Datasets: UltraChat
10:59 DPO Datasets: Helpful and Harmless
14:00 DPO vs RLHF
15:25 Required datasets and SFT models
18:26 DPO Notebook Run through
28:22 DPO Evaluation Results
31:15 Weights and Biases Results Interpretation
35:16 Runpod Setup for 1 epoch Training Run
41:58 Resources
Comments

For example, one example can contain more than one dialog. When I ask the model a question, will it answer and then also generate the next human question and the assistant's reply to it? This issue worries and confuses me. But if I separate each human-assistant pair into its own example, the model won't understand the topic. What should I do?

example : "Human: What kind of noises did dinosaurs make? Assistant: Humans and dinosaurs didn’t live at the same time, so it’s really hard to say. The best place to find out what noises dinosaurs made would be Human: yes they did Assistant: to guess, and that would probably require lots of reading and a certain amount of imagination, so we’re not really prepared to do that. Human: you cant read Assistant: You can read?"

cagataydemirbas
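
One common way to handle this (a rough sketch, not something shown in the video): keep the whole conversation so far in the prompt and train only on the final assistant turn, so the model is never taught to produce the "Human:" side itself. The helper below splits an HH-style transcript on its role markers; the "prompt"/"response" keys are just illustrative, not a required schema.

# Minimal sketch: convert one HH-style transcript into (prompt, response) examples
# where the prompt carries all earlier turns and the response is a single assistant turn.
import re

def split_hh_transcript(text: str):
    # The HH dataset marks turns with literal "Human:" / "Assistant:" prefixes.
    parts = re.split(r"(Human:|Assistant:)", text)
    turns = list(zip(parts[1::2], parts[2::2]))  # (role marker, turn content) pairs
    examples = []
    for i, (role, content) in enumerate(turns):
        if role == "Assistant:":
            history = " ".join(f"{r} {c.strip()}" for r, c in turns[:i])
            examples.append({"prompt": f"{history} Assistant:", "response": content.strip()})
    return examples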

Although the DPO algorithm borrows some elements of reinforcement learning, it does not fully conform to the framework of traditional reinforcement learning algorithms, right?

StevenPack-nhns
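
For reference, the loss from the DPO paper (Rafailov et al., 2023), where pi_theta is the policy being trained, pi_ref is the frozen SFT reference model, (x, y_w, y_l) is a prompt with its chosen and rejected responses, sigma is the sigmoid, and beta is a temperature:

% DPO objective: a classification-style loss over preference pairs,
% regularized implicitly toward the reference policy via the log-ratio terms.
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[ \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right]

There is no reward model, sampling loop, or value function at training time, which is why DPO is usually described as preference optimization rather than reinforcement learning in the traditional sense.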

Consider putting a high pass filter on your audio :) There are some low frequency computer noises you can easily filter out.

GrahamAnderson-zx

Hi, again

In preparing a dataset for DPO (Direct Preference Optimization) training, should the “prompt” be repeated in the “chosen” and “rejected” columns?

I’ve come across some conflicting information regarding the proper formatting of the dataset for DPO training. Some sources suggest that the prompt should be included in both the “chosen” and “rejected” responses to provide full context, while others state that the prompt should be kept separate and not repeated in these columns.

Additionally, when working with multi-turn dialogue data, I’m unsure how to properly format the dataset. Should the “chosen” and “rejected” columns include the entire conversation history up to that point, or just the assistant’s most recent response following the latest user input?

Could someone clarify the correct approach for formatting the dataset? Should the “chosen” and “rejected” columns contain only the assistant’s responses following the prompt, or should they include the prompt as well? And how should I handle multi-turn dialogues in this context?

cagataydemirbas
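
A minimal sketch of the "standard" prompt/chosen/rejected layout documented for TRL's DPOTrainer, which matches the second convention described above: the prompt sits in its own column and is not repeated, "chosen" and "rejected" hold only the assistant's candidate final responses, and for multi-turn data the earlier turns are folded into the prompt. The response strings below are made up for illustration, and exact expectations can differ between TRL versions, so check the docs for the version you are using.

# One preference record in the prompt/chosen/rejected layout (illustrative values).
# The prompt, including earlier turns, is not repeated inside chosen/rejected.
record = {
    "prompt": (
        "Human: What kind of noises did dinosaurs make? "
        "Assistant: Humans and dinosaurs didn't live at the same time, so it's hard to say. "
        "Human: yes they did "
        "Assistant:"
    ),
    "chosen": " We can only make educated guesses from fossil anatomy, since no recordings exist.",
    "rejected": " You can read?",
}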

Why are you doing SFT first? Can't we apply the DPOTrainer directly to LLAMA2?

firsfnamelastname
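
The usual recipe is SFT first, because DPO only nudges relative probabilities between a chosen and a rejected completion against a frozen reference copy of the same model; a raw base model that cannot follow the chat format to begin with gives it very little to work with. A rough sketch of the second stage, with placeholder model name, data, and hyperparameters (DPOTrainer argument names differ between TRL versions, e.g. older releases use tokenizer= instead of processing_class=):

# Rough sketch: DPO on top of an SFT chat checkpoint (placeholders throughout).
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

sft_model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # placeholder: an already SFT'd chat model
model = AutoModelForCausalLM.from_pretrained(sft_model_id)
tokenizer = AutoTokenizer.from_pretrained(sft_model_id)

# Tiny illustrative preference set; a real run would use a full dataset
# such as those discussed in the video.
train_dataset = Dataset.from_list([
    {
        "prompt": "Human: Give me one tip for writing clear emails. Assistant:",
        "chosen": " Lead with the action you need from the reader in the first sentence.",
        "rejected": " Emails are a form of electronic communication.",
    },
])

config = DPOConfig(output_dir="tinyllama-dpo", beta=0.1, per_device_train_batch_size=1)
trainer = DPOTrainer(model=model, args=config, train_dataset=train_dataset, processing_class=tokenizer)
trainer.train()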

I'm a novice LLM user and am confused by how a limited set of specific DPO pairs can align a model to a near infinite number of diverse user prompts.

I know there's some generalization going on, because when I use aligned LLMs I keep running into alignment that misfires. For example, I'm warned against engaging in celebrity gossip when asking about fictional characters in a TV show, simply because I used the actors' names to help keep the LLM on track (avoid hallucinations). It sees real celebrity names and questions about ex-wives (even though the questions are about fictional characters in a show) and triggers an alignment response, despite there being no specific example like that in the DPO, RLHF... training data.

brandon

I'm guessing here. Perhaps the DPO experiment (for TinyLlama) didn't produce the final results you wanted? Would you consider another DPO tutorial where you get good results that are worth the effort (and the longer compute time) of using DPO? Thanks.

GrahamAnderson-zx

Kindly also explain the maths behind every topic in future videos, as that would make them even more helpful... 💜 Do you have any book?

imranullah

I ran my own script, but it only runs for a few steps and then the session crashes for an unknown reason; I don't know why.

The second thing: do we need to quantize the LoRA adapter? Loading the adapter also loads the base model, so I just merge the adapter with the base model. Is that a good approach?

imranullah
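
On the second question, a minimal sketch of the merge-then-save approach with peft, assuming a standard LoRA adapter (paths and model name are placeholders): merge into a non-quantized (e.g. fp16) copy of the base model, and only quantize the merged weights afterwards if you need to.

# Sketch: merge a trained LoRA adapter into its base model, then save the result.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"   # placeholder base model
adapter_dir = "dpo-lora-adapter"                  # placeholder: directory with the trained adapter

base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base, adapter_dir)
merged = model.merge_and_unload()                 # folds the LoRA deltas into the base weights

merged.save_pretrained("merged-dpo-model")
AutoTokenizer.from_pretrained(base_id).save_pretrained("merged-dpo-model")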

So if the quality of your fine-tuning data set is already very high, do you even need to do these types of reinforcement learning?

timetravellingtoad

Hey! Did you make a video (or have a link to one) about what the TinyLlama results mean? I read the README and understood nothing. Thank you!

tomiwaibrahim

Loved the video, thanks for giving out such gr8 content for free 🫶

vivekpadman