ORPO: Monolithic Preference Optimization without Reference Model (Paper Explained)


Abstract:
While recent preference alignment algorithms for language models have demonstrated promising results, supervised fine-tuning (SFT) remains imperative for achieving successful convergence. In this paper, we study the crucial role of SFT within the context of preference alignment, emphasizing that a minor penalty for the disfavored generation style is sufficient for preference-aligned SFT. Building on this foundation, we introduce a straightforward and innovative reference model-free monolithic odds ratio preference optimization algorithm, ORPO, eliminating the necessity for an additional preference alignment phase. We demonstrate, both empirically and theoretically, that the odds ratio is a sensible choice for contrasting favored and disfavored styles during SFT across the diverse sizes from 125M to 7B. Specifically, fine-tuning Phi-2 (2.7B), Llama-2 (7B), and Mistral (7B) with ORPO on the UltraFeedback alone surpasses the performance of state-of-the-art language models with more than 7B and 13B parameters: achieving up to 12.20% on AlpacaEval2.0 (Figure 1), 66.19% on IFEval (instruction-level loose, Table 6), and 7.32 in MT-Bench (Figure 12). We release code and model checkpoints for Mistral-ORPO-α (7B) and Mistral-ORPO-β (7B).
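As a concrete illustration of the objective described above, here is a minimal sketch of an ORPO-style loss in PyTorch. The names orpo_loss, logp_chosen, logp_rejected and lam are placeholders of mine, and the convention that each response's likelihood is its length-normalized (average per-token) log-probability is an assumption from my reading of the paper:

import torch
import torch.nn.functional as F

def orpo_loss(logp_chosen, logp_rejected, lam=0.1):
    # logp_*: average per-token log-likelihood of each response under the model being trained.
    # lam weights the odds-ratio term; 0.1 is just a placeholder value.
    # log odds(y|x) = log p - log(1 - p); log1p(-exp(.)) keeps this numerically stable.
    log_odds_chosen = logp_chosen - torch.log1p(-torch.exp(logp_chosen))
    log_odds_rejected = logp_rejected - torch.log1p(-torch.exp(logp_rejected))
    # Odds-ratio term: -log sigmoid(log of the odds ratio of chosen over rejected)
    l_or = -F.logsigmoid(log_odds_chosen - log_odds_rejected)
    # Plain SFT (NLL) term on the chosen response
    l_sft = -logp_chosen
    return (l_sft + lam * l_or).mean()

The point of the method is that this single loss is applied directly during SFT, so no separate alignment phase and no frozen reference model are needed.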

Authors: Jiwoo Hong, Noah Lee, James Thorne

Links:

If you want to support me, the best thing to do is to share out the content :)

If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n
Comments

Glad you're back to technical content this time. Any AI YouTuber can give us the latest AI news, but you're just about the only one that can give technical insight into the stories.

rt

6 videos in 7 days, I'm on holiday and this is such a perfectly timed treat.

lone

Thank you for being awesome, Yannic. I send people from the classes that I "TA" for to you because you're reliably strong with your analysis.

EternalKernel

I really like the more technical content from you. I usually read tech news on Telegram, and your ML News episodes are great, but fairly ordinary and simple. Paper explanations like this have real impact on the DS community; such videos seed new ideas and deepen understanding of the field for those who try to dive deeper. Of course it's less popular due to the complexity of the material for the audience, but it's much more interesting. So thank you for this format.

borisbondarenko

Great to see research from my homeland of South Korea represented!

tensorturtle

26:30 that 'really?' and the following struggle with basic math is WAAAAY too relatable

peach

The main loss function (7) looks like it can be meaningfully simplified with school-level math.
Lor = -log(sigm( log ( odds(y_w|x) / odds(y_l|x)))), where sigm(a) = 1/(1 + exp(-a)) = exp(a) / (1 + exp(a))
Let's assume that both odds(y_w|x) and odds(y_l|x) are positive (because softmax)

By plugging in the sigmoid, we get
Lor = - log (exp(log(odds(y_w|x) / odds(y_l|x) )) / (1 + exp(log(odds(y_w|x) / odds(y_l|x)))) )
Note that exp(log(odds(y_w|x) / odds(y_l|x))) = odds(y_w|x) / odds(y_l|x). We use this to simplify:
Lor = - log( [odds(y_w|x) / odds(y_l|x)] / (1 + odds(y_w|x) / odds(y_l|x)) )
Finally, multiply both numerator and denominator by odds(y_l|x) to get

Lor = - log(odds(y_w|x) / (odds(y_w|x) + odds(y_l|x)) )

Intuitively, this is the negative log of (the odds of the good response) / (the odds of the good response + the odds of the bad response).
If you minimize the average loss over multiple texts, it's the same as maximizing the odds that the model picks the winning response in every (winning, losing) pair.

justheuristic
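
A quick numeric sanity check of the simplification above (the odds values here are just arbitrary positive numbers, not model outputs):

import math, random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

for _ in range(5):
    odds_w, odds_l = random.uniform(0.01, 10.0), random.uniform(0.01, 10.0)
    original = -math.log(sigmoid(math.log(odds_w / odds_l)))
    simplified = -math.log(odds_w / (odds_w + odds_l))
    print(abs(original - simplified) < 1e-9)   # True: the two forms agree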

Nice, I was waiting for this after you mentioned ORPO in ML News :))

blender

Really appreciate your explanation, very helpful. Now I see the alignment process as widening the upper part of the Y shape, where x branches into y_w and y_l. Thanks!

ZhousiChen-hp

Thanks for explaining basic terms along with the more complex stuff, for dilettantes like myself. Cheers.

I---I

I liked the self-deprecation at 32:00 haha

yannickpezeu

16:12 Not sure I follow the intuition behind supervised fine-tuning not being able to penalize the “wrong” token that is the opposite of what we want the model to mimic. I'm confused because, in my view, the wrong but highly probable token contributes more to the loss, so it will be penalized more heavily than the more meaningless, random output tokens. Can someone clarify this for me?

simaogoncalves
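
One way to see the mechanics behind the question above: with plain cross-entropy (the SFT loss), the gradient with respect to the logits is softmax(logits) minus the one-hot target, so every non-target token is pushed down only in proportion to its own current probability; there is no additional, contrastive term that singles out a specific dispreferred continuation. A minimal sketch with made-up toy logits:

import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 1.5, 0.1, -1.0]], requires_grad=True)  # toy vocabulary of 4 tokens
target = torch.tensor([0])                                          # the "correct" token
F.cross_entropy(logits, target).backward()
print(logits.grad)                                                   # softmax(logits) - one_hot(target)
print(F.softmax(logits, dim=-1) - F.one_hot(target, num_classes=4))  # same values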

Thank you Mr Kilcher for delving into the paper, ORPO: Monolithic Preference Optimization without Reference Model

Mordenor

What's going on, is it Yannic bonanza time of the year?! Loving these addictive videos

wwkk

Where do y_w and y_l come from? Are they taken from the training dataset, or does the LLM being trained generate them, with humans or reward models labelling them as W and L?

syeshwanth

"Specifically, 1 - p(y|x) in the denominators amplifies the gradients when the corresponding side of the likelihood p(y|x) is low." I think (1 - p(y|x)) has two different meanings here: it is, by coincidence, both what falls out of the differentiation and the "corresponding side" of the likelihood, i.e., 1 - p(y|x). So when it says the "corresponding side" of p(y|x) is low, it means that 1 - p(y|x) is low.

MyCiaoatutti
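
For what it's worth, the (1 - p(y|x)) in the denominators does fall directly out of differentiating the log-odds: grad log odds(y|x) = grad log p(y|x) / (1 - p(y|x)). A small symbolic check of that identity (a sketch using sympy, with theta standing in for the model parameters):

import sympy as sp

theta = sp.Symbol('theta')
p = sp.Function('p')(theta)                # p_theta(y|x) as a function of the parameters
lhs = sp.diff(sp.log(p / (1 - p)), theta)  # gradient of the log-odds
rhs = sp.diff(sp.log(p), theta) / (1 - p)  # gradient of the log-likelihood, scaled by 1/(1 - p)
print(sp.simplify(lhs - rhs))              # 0, so the two expressions are identical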

18:47 I wish they had shown some loss curves from training in the paper, unless I missed it. Whenever you divide things like that in the loss function, the loss curve goes crazy. It still trains, but it can blow up because for some samples the denominator might be close to zero.
.
19:33 There is no ablation in the paper without SFT, since the loss is L_sft + lambda * L_or. I think we'll soon see a follow-up paper, "ORPO is all you need", which just drops the SFT term. I think it will work great.
.
31:30 One of my colleagues tried the probability-ratio thing before. I don't remember what came of it. Haven't checked in with him for a while.

herp_derpingson

Would be interesting to see how it compares to KTO. I would guess that KTO outperforms and is easier to implement, as you don't need pairs of inputs.

mantasorantas

That log of the probability also acts as a power transform, often used to narrow or widen a distribution.

maxxba
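
Concretely: dividing log-probabilities by a temperature T and renormalizing is the same as raising the probabilities to the power 1/T, which narrows the distribution for T < 1 and widens it for T > 1. A tiny sketch with an arbitrary toy distribution:

import numpy as np

p = np.array([0.6, 0.3, 0.1])              # arbitrary toy distribution
T = 0.5                                    # temperature; T < 1 sharpens, T > 1 flattens
via_logs = np.exp(np.log(p) / T)
via_logs /= via_logs.sum()                 # renormalized exp(log p / T)
via_power = p ** (1.0 / T)
via_power /= via_power.sum()               # renormalized p^(1/T)
print(np.allclose(via_logs, via_power))    # True: same transform, two views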

You should make a video just focusing on the log and explaining its role in neural networks.

jondo