ORPO: New DPO Alignment and SFT Method for LLMs

Instead of the classical SFT-then-DPO alignment for training our LLMs, there is a new method available: ORPO, an innovative "reference model-free", monolithic odds ratio preference optimization algorithm that eliminates the need for a separate preference alignment phase.

A new preference-aligned SFT method.

We explore this idea from a theoretical-physics perspective and notice a similarity to regularization-term methodologies. We further explore the conceptual similarity between a Lagrange multiplier and the new correction term that ORPO adds to the classical SFT loss functional.
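To make the correction term concrete, here is a minimal PyTorch sketch of the ORPO idea: the usual SFT negative log-likelihood plus a log-odds-ratio penalty that favors the chosen completion over the rejected one, with no reference model involved. The function name `orpo_loss`, the weight `beta`, and the assumption that the inputs are length-normalized log-probabilities are illustrative choices, not an official implementation.

```python
import torch
import torch.nn.functional as F

def orpo_loss(chosen_logps: torch.Tensor,
              rejected_logps: torch.Tensor,
              nll_loss: torch.Tensor,
              beta: float = 0.1) -> torch.Tensor:
    """Sketch of the ORPO objective: SFT loss plus a weighted log-odds-ratio term.

    chosen_logps / rejected_logps: length-normalized log-probabilities of the
    preferred and rejected completions under the policy being trained.
    nll_loss: the standard SFT negative log-likelihood on the chosen completion.
    """
    # log odds(y|x) = log p - log(1 - p); log1p(-exp(log p)) computes log(1 - p)
    log_odds_chosen = chosen_logps - torch.log1p(-torch.exp(chosen_logps))
    log_odds_rejected = rejected_logps - torch.log1p(-torch.exp(rejected_logps))
    # Odds-ratio term: push the odds of the chosen completion above the rejected one
    or_term = F.logsigmoid(log_odds_chosen - log_odds_rejected)
    # Monolithic objective: SFT loss plus the weighted odds-ratio correction term
    return nll_loss - beta * or_term.mean()

# Toy usage with hypothetical per-token-averaged log-probabilities
chosen = torch.tensor([-0.9])
rejected = torch.tensor([-1.4])
loss = orpo_loss(chosen, rejected, nll_loss=torch.tensor(0.9))
```

Note how the odds-ratio term plays the role of the correction (Lagrange-multiplier-like) term discussed above: it regularizes the plain SFT objective toward preferred responses in a single training phase.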

Performance figures for ORPO are given in comparison to Llama 2 and Mistral 7B models.

ORPO: Monolithic Preference Optimization without Reference Model
Comments

Nice! This is exactly what I didn’t understand about the SFT-then-DPO method.

mshonle

Thanks ... super cool. Will be interesting to see an implementation of it.

SheIsSinging

Thank you! As neither a mechanical engineer nor an ML engineer, but with an interest in the latter (yet far from being a green grasshopper), I found your analogy very illustrative (supplemented with some sidebar GPT-4 explanations).

IdPreferNot

Would the new model, with its language skills, be capable of generating a relevant UltraFeedback-style dataset from specialized long-form scientific text? Such text is long form but has a repeatable macroscopic structure, as scientific publications do.

wdonno