Direct Preference Optimization (DPO)


Resources:

Chapters:
0:00 Direct Preference Optimisation
0:37 Video Overview
1:37 How does “normal” fine-tuning work?
3:41 How does DPO work?
8:31 DPO Datasets: UltraChat
10:59 DPO Datasets: Helpful and Harmless
14:00 DPO vs RLHF
15:25 Required datasets and SFT models
18:26 DPO Notebook Run through
28:22 DPO Evaluation Results
31:15 Weights and Biases Results Interpretation
35:16 Runpod Setup for 1 epoch Training Run
41:58 Resources
Comments

For example, one example can contain more than one dialog. When I ask the model a question, will it answer and then also generate the next human question and the assistant's reply to it? This issue worries and confuses me. But if I separate each human-assistant pair into its own example, the model won't understand the topic. What should I do?

example : "Human: What kind of noises did dinosaurs make? Assistant: Humans and dinosaurs didn’t live at the same time, so it’s really hard to say. The best place to find out what noises dinosaurs made would be Human: yes they did Assistant: to guess, and that would probably require lots of reading and a certain amount of imagination, so we’re not really prepared to do that. Human: you cant read Assistant: You can read?"

cagataydemirbas
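
One common way to handle this (a rough sketch, not something shown in the video): keep the whole conversation so far in the prompt and train only on the final assistant turn, so the model is never taught to produce the "Human:" side itself. The helper below splits an HH-style transcript on its role markers; the "prompt"/"response" keys are just illustrative, not a required schema.

# Minimal sketch: convert one HH-style transcript into (prompt, response) examples
# where the prompt carries all earlier turns and the response is a single assistant turn.
import re

def split_hh_transcript(text: str):
    # The HH dataset marks turns with literal "Human:" / "Assistant:" prefixes.
    parts = re.split(r"(Human:|Assistant:)", text)
    turns = list(zip(parts[1::2], parts[2::2]))  # (role marker, turn content) pairs
    examples = []
    for i, (role, content) in enumerate(turns):
        if role == "Assistant:":
            history = " ".join(f"{r} {c.strip()}" for r, c in turns[:i])
            examples.append({"prompt": f"{history} Assistant:", "response": content.strip()})
    return examples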

Although the DPO algorithm borrows some elements of reinforcement learning, it does not fully conform to the framework of traditional reinforcement learning algorithms, right?

StevenPack-nhns
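
For reference, the loss from the DPO paper (Rafailov et al., 2023), where pi_theta is the policy being trained, pi_ref is the frozen SFT reference model, (x, y_w, y_l) is a prompt with its chosen and rejected responses, sigma is the sigmoid, and beta is a temperature:

% DPO objective: a classification-style loss over preference pairs,
% regularized implicitly toward the reference policy via the log-ratio terms.
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[ \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right]

There is no reward model, sampling loop, or value function at training time, which is why DPO is usually described as preference optimization rather than reinforcement learning in the traditional sense.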

Consider putting a high pass filter on your audio :) There are some low frequency computer noises you can easily filter out.

GrahamAnderson-zx

Hi, again

In preparing a dataset for DPO (Direct Preference Optimization) training, should the “prompt” be repeated in the “chosen” and “rejected” columns?

I’ve come across some conflicting information regarding the proper formatting of the dataset for DPO training. Some sources suggest that the prompt should be included in both the “chosen” and “rejected” responses to provide full context, while others state that the prompt should be kept separate and not repeated in these columns.

Additionally, when working with multi-turn dialogue data, I’m unsure how to properly format the dataset. Should the “chosen” and “rejected” columns include the entire conversation history up to that point, or just the assistant’s most recent response following the latest user input?

Could someone clarify the correct approach for formatting the dataset? Should the “chosen” and “rejected” columns contain only the assistant’s responses following the prompt, or should they include the prompt as well? And how should I handle multi-turn dialogues in this context?

cagataydemirbas
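
A minimal sketch of the "standard" prompt/chosen/rejected layout documented for TRL's DPOTrainer, which matches the second convention described above: the prompt sits in its own column and is not repeated, "chosen" and "rejected" hold only the assistant's candidate final responses, and for multi-turn data the earlier turns are folded into the prompt. The response strings below are made up for illustration, and exact expectations can differ between TRL versions, so check the docs for the version you are using.

# One preference record in the prompt/chosen/rejected layout (illustrative values).
# The prompt, including earlier turns, is not repeated inside chosen/rejected.
record = {
    "prompt": (
        "Human: What kind of noises did dinosaurs make? "
        "Assistant: Humans and dinosaurs didn't live at the same time, so it's hard to say. "
        "Human: yes they did "
        "Assistant:"
    ),
    "chosen": " We can only make educated guesses from fossil anatomy, since no recordings exist.",
    "rejected": " You can read?",
}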

Why are you doing SFT first? Can't we apply the DPOTrainer directly to LLAMA2?

firsfnamelastname
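
The usual recipe is SFT first, because DPO only nudges relative probabilities between a chosen and a rejected completion against a frozen reference copy of the same model; a raw base model that cannot follow the chat format to begin with gives it very little to work with. A rough sketch of the second stage, with placeholder model name, data, and hyperparameters (DPOTrainer argument names differ between TRL versions, e.g. older releases use tokenizer= instead of processing_class=):

# Rough sketch: DPO on top of an SFT chat checkpoint (placeholders throughout).
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

sft_model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # placeholder: an already SFT'd chat model
model = AutoModelForCausalLM.from_pretrained(sft_model_id)
tokenizer = AutoTokenizer.from_pretrained(sft_model_id)

# Tiny illustrative preference set; a real run would use a full dataset
# such as those discussed in the video.
train_dataset = Dataset.from_list([
    {
        "prompt": "Human: Give me one tip for writing clear emails. Assistant:",
        "chosen": " Lead with the action you need from the reader in the first sentence.",
        "rejected": " Emails are a form of electronic communication.",
    },
])

config = DPOConfig(output_dir="tinyllama-dpo", beta=0.1, per_device_train_batch_size=1)
trainer = DPOTrainer(model=model, args=config, train_dataset=train_dataset, processing_class=tokenizer)
trainer.train()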

I'm a novice LLM user and am confused by how a limited set of specific DPO pairs can align a model to a near infinite number of diverse user prompts.

I know there's some generalization going on, because when I use aligned LLMs I keep running into alignment that misfires. For example, I'm warned against engaging in celebrity gossip when asking about fictional characters in a TV show, simply because I used the actors' names to help keep the LLM on track (avoid hallucinations). It sees real celebrity names and questions about ex-wives (even though the questions are about fictional characters in a show) and triggers an alignment response, despite there being no specific example like that in the DPO, RLHF... training data.

brandon

I'm guessing here. Perhaps the DPO experiment (for TinyLlama) didn't produce the final results you wanted? Would you consider another DPO tutorial where you get good results that are worth the effort (and the longer compute time) of using DPO? Thanks.

GrahamAnderson-zx

Kindly also explain the maths behind every topic in future videos, as that would make them even more helpful... 💜 Do you have any book?

imranullah

I ran my own script, but it only runs for a few steps and then the session crashes for an unknown reason; I don't know why.

The second thing: do we need to quantize the LoRA adapter? Loading the adapter also loads the base model, so I just merge the adapter with the base model. Is that a good approach?

imranullah
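
On the second question, a minimal sketch of the merge-then-save approach with peft, assuming a standard LoRA adapter (paths and model name are placeholders): merge into a non-quantized (e.g. fp16) copy of the base model, and only quantize the merged weights afterwards if you need to.

# Sketch: merge a trained LoRA adapter into its base model, then save the result.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"   # placeholder base model
adapter_dir = "dpo-lora-adapter"                  # placeholder: directory with the trained adapter

base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base, adapter_dir)
merged = model.merge_and_unload()                 # folds the LoRA deltas into the base weights

merged.save_pretrained("merged-dpo-model")
AutoTokenizer.from_pretrained(base_id).save_pretrained("merged-dpo-model")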

So if the quality of your fine-tuning data set is already very high, do you even need to do these types of reinforcement learning?

timetravellingtoad

Hey! Did you make a video (or have a link to one) about what the TinyLlama results mean? I read the README and understood nothing. Thank you!

tomiwaibrahim

Loved the video, thanks for giving out such gr8 content for free 🫶

vivekpadman