ORPO: New DPO Alignment and SFT Method for LLMs

Instead of the classical SFT-then-DPO alignment for training our LLMs, there is a new method available: ORPO, an innovative "reference model-free", monolithic odds ratio preference optimization algorithm that eliminates the need for a separate preference alignment phase.

A new preference-aligned SFT method.

We explore this idea from a theoretical-physics perspective and notice a similarity to regularization-term methodologies. We further explore the conceptual similarity between a Lagrange multiplier and the new correction term that ORPO adds to the classical SFT loss functional.
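To make the correction term concrete, here is a minimal PyTorch sketch of the ORPO idea: the usual SFT negative log-likelihood plus a log-odds-ratio penalty that favors the chosen completion over the rejected one, with no reference model involved. The function name `orpo_loss`, the weight `beta`, and the assumption that the inputs are length-normalized log-probabilities are illustrative choices, not an official implementation.

```python
import torch
import torch.nn.functional as F

def orpo_loss(chosen_logps: torch.Tensor,
              rejected_logps: torch.Tensor,
              nll_loss: torch.Tensor,
              beta: float = 0.1) -> torch.Tensor:
    """Sketch of the ORPO objective: SFT loss plus a weighted log-odds-ratio term.

    chosen_logps / rejected_logps: length-normalized log-probabilities of the
    preferred and rejected completions under the policy being trained.
    nll_loss: the standard SFT negative log-likelihood on the chosen completion.
    """
    # log odds(y|x) = log p - log(1 - p); log1p(-exp(log p)) computes log(1 - p)
    log_odds_chosen = chosen_logps - torch.log1p(-torch.exp(chosen_logps))
    log_odds_rejected = rejected_logps - torch.log1p(-torch.exp(rejected_logps))
    # Odds-ratio term: push the odds of the chosen completion above the rejected one
    or_term = F.logsigmoid(log_odds_chosen - log_odds_rejected)
    # Monolithic objective: SFT loss plus the weighted odds-ratio correction term
    return nll_loss - beta * or_term.mean()

# Toy usage with hypothetical per-token-averaged log-probabilities
chosen = torch.tensor([-0.9])
rejected = torch.tensor([-1.4])
loss = orpo_loss(chosen, rejected, nll_loss=torch.tensor(0.9))
```

Note how the odds-ratio term plays the role of the correction (Lagrange-multiplier-like) term discussed above: it regularizes the plain SFT objective toward preferred responses in a single training phase.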

Performance figures for ORPO are given in comparison to Llama 2 and Mistral 7B models.

ORPO: Monolithic Preference Optimization without Reference Model
Comments

Nice! This is exactly what I didn’t understand about the SFT-then-DPO method.

mshonle

Thanks ... super cool. Will be interesting to see an implementation of it.

SheIsSinging

Thank you! As neither a mechanical engineer nor an ML engineer, but with an interest in the latter (yet far from being a green grasshopper), I found your analogy very illustrative (supplemented with some sidebar GPT-4 explanations).

IdPreferNot

Would the new model, with its language skills, be capable of generating a relevant UltraFeedback-style dataset from specialized long-form scientific text? Such text is long form but has a repeatable macroscopic structure, as scientific publications do.

wdonno