ORPO: Monolithic Preference Optimization without Reference Model (Paper Explained)
17,614 Views • May 1, 2024
Paper: arxiv.org/abs/2403.07691

Abstract:
While recent preference alignment algorithms for language models have demonstrated promising results, supervised fine-tuning (SFT) remains imperative for achieving successful convergence. In this paper, we study the crucial role of SFT within the context of preference alignment, emphasizing that a minor penalty for the disfavored generation style is sufficient for preference-aligned SFT. Building on this foundation, we introduce a straightforward and innovative reference model-free monolithic odds ratio preference optimization algorithm, ORPO, eliminating the necessity for an additional preference alignment phase. We demonstrate, both empirically and theoretically, that the odds ratio is a sensible choice for contrasting favored and disfavored styles during SFT across the diverse sizes from 125M to 7B. Specifically, fine-tuning Phi-2 (2.7B), Llama-2 (7B), and Mistral (7B) with ORPO on the UltraFeedback alone surpasses the performance of state-of-the-art language models with more than 7B and 13B parameters: achieving up to 12.20% on AlpacaEval2.0 (Figure 1), 66.19% on IFEval (instruction-level loose, Table 6), and 7.32 in MT-Bench (Figure 12). We release code and model checkpoints for Mistral-ORPO-α (7B) and Mistral-ORPO-β (7B).
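The abstract's core idea, a reference-model-free odds-ratio term added on top of the SFT loss, can be sketched in a few lines. Below is a minimal PyTorch sketch assuming length-normalized average token log-probabilities as inputs; the function names and the weight lam are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def orpo_loss(logp_chosen, logp_rejected, lam=0.1):
    """Sketch of the ORPO objective for a single preference pair.

    logp_chosen / logp_rejected: length-normalized average token
    log-probabilities of the favored (y_w) and disfavored (y_l)
    responses under the model being fine-tuned (assumed inputs).
    """
    # odds(y|x) = p / (1 - p), so log odds = log p - log(1 - p);
    # computed stably from log p with log1p
    def log_odds(logp):
        return logp - torch.log1p(-torch.exp(logp))

    # L_OR = -log sigmoid(log(odds(y_w|x) / odds(y_l|x)))
    l_or = -F.logsigmoid(log_odds(logp_chosen) - log_odds(logp_rejected))

    # Standard NLL on the favored response plays the role of the SFT term
    l_sft = -logp_chosen

    # Monolithic objective: SFT loss plus a minor penalty on the
    # disfavored generation style, weighted by lam (value assumed)
    return l_sft + lam * l_or

# Example: the loss drops as the favored response becomes more likely
print(orpo_loss(torch.tensor(-0.5), torch.tensor(-1.5)))
```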

Authors: Jiwoo Hong, Noah Lee, James Thorne

Links:
Homepage: ykilcher.com/
Merch: ykilcher.com/merch
YouTube: youtube.com/c/yannickilcher
Twitter: twitter.com/ykilcher
Discord: ykilcher.com/discord
LinkedIn: www.linkedin.com/in/ykilcher

If you want to support me, the best thing to do is to share out the content :)

If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: www.subscribestar.com/yannickilcher
Patreon: www.patreon.com/yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n
Metadata And Engagement

Views: 17,614
Genre: Science & Technology
Date of upload: May 1, 2024


Rating: 4.925 (10/526 LTDR)
RYD date created: 2024-05-18T02:30:51.752578Z

YouTube Comments - 44 Comments

Top Comments of this video!! :3

@r9999t

2 weeks ago

Glad you're back to technical content this time. Any AI YouTuber can give us the latest AI news, but you're just about the only one who can give technical insight into the stories.

23 |

@lone0017

2 weeks ago

6 videos in 7 days; I'm on holiday, and this is such a perfectly timed treat.

20 |

@EternalKernel

2 weeks ago

Thank you for being awesome, Yannic. I send people from the classes I "TA" for to you, because you're reliably strong with your analysis.

4 |

@peach412

2 weeks ago

26:30 that 'really?' and the following struggle with basic math is WAAAAY too relatable

15 |

@justheuristic

2 weeks ago

The main loss function (7) looks like it can be meaningfully simplified with school-level math.

Start from
Lor = -log(sigm(log(odds(y_w|x) / odds(y_l|x)))), where sigm(a) = 1/(1 + exp(-a)) = exp(a) / (1 + exp(a)).

Assume both odds(y_w|x) and odds(y_l|x) are positive (because softmax). Plugging in the sigmoid:
Lor = -log( exp(log(odds(y_w|x) / odds(y_l|x))) / (1 + exp(log(odds(y_w|x) / odds(y_l|x)))) )

Note that exp(log(odds(y_w|x) / odds(y_l|x))) = odds(y_w|x) / odds(y_l|x). Using this to simplify:
Lor = -log( [odds(y_w|x) / odds(y_l|x)] / (1 + odds(y_w|x) / odds(y_l|x)) )

Finally, multiply both numerator and denominator by odds(y_l|x) to get
Lor = -log( odds(y_w|x) / (odds(y_w|x) + odds(y_l|x)) )

Intuitively, this is the negative log of (odds of the good response) / (odds of the good response + odds of the bad response). Minimizing the average loss over multiple texts is the same as maximizing the odds that the model chooses the winning response in every (winning, losing) pair.

12 |
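A quick numerical sanity check of the simplification above, with arbitrary positive odds values in plain Python:

```python
import math

odds_w, odds_l = 2.5, 0.7  # arbitrary positive odds for y_w and y_l

# Original form: -log(sigm(log(odds_w / odds_l)))
a = math.log(odds_w / odds_l)
original = -math.log(1.0 / (1.0 + math.exp(-a)))

# Simplified form: -log(odds_w / (odds_w + odds_l))
simplified = -math.log(odds_w / (odds_w + odds_l))

print(original, simplified)  # both ~0.2469, so the two forms agree
```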

@tensorturtle1566

2 weeks ago

Great to see research from my homeland of South Korea represented!

12 |

@borisbondarenko314

2 weeks ago

I really like the more technical content from you. I usually read tech news on Telegram, and your ML News episodes are great, but fairly plain and simple. Paper explanations like this make a real impact on the DS community: such videos seed new ideas and deepen understanding of the field for those trying to dive deep. Of course it's less popular, given the complexity of the material for the audience, but it's much more interesting. So thank you for this format.

1 |

@I-0-0-I

2 weeks ago

Thanks for explaining basic terms along with the more complex stuff, for dilettantes like myself. Cheers.

|

@blender6426

2 weeks ago

Nice I was waiting for this after you mentioned ORPO in ML News :))

1 |

@max0x7ba

1 day ago

That log of probability is also a power transform often used to narrow or widen a distribution.

|
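A small illustration of that point, with arbitrary values: raising probabilities to a power beta and renormalizing, which is the same as scaling log-probabilities by beta, narrows the distribution for beta > 1 and widens it for beta < 1.

```python
import numpy as np

p = np.array([0.6, 0.3, 0.1])  # an arbitrary distribution

def power_transform(p, beta):
    """Raise probabilities to a power and renormalize;
    equivalent to multiplying log-probabilities by beta."""
    q = p ** beta
    return q / q.sum()

print(power_transform(p, 2.0))  # ~[0.78, 0.20, 0.02]: narrower
print(power_transform(p, 0.5))  # ~[0.47, 0.33, 0.19]: wider
```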

@fearnworks

2 weeks ago

You are on fire!

4 |

@kaikapioka9711

2 weeks ago

Thx again yan! 🎉

2 |

@Zed_Oud

1 week ago

27:57 "the corresponding side": maybe they mistakenly switched the w and l givens in the denominators?

1 |

@gauranshsoni4011

2 weeks ago

Keep them comin

1 |

@meselfobviouslyme6292

2 weeks ago

Thank you Mr Kilcher for delving into the paper, ORPO: Monolithic Preference Optimization without Reference Model.

1 |

@syeshwanth6790

2 weeks ago

Where do y_w and y_l come from? Are they from the training dataset, or does the LLM being trained generate them, with humans or reward models labelling them as w and l?

1 |
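For context on the question above: as the paper describes it, y_w and y_l come from a static preference dataset (UltraFeedback in the experiments), with responses ranked ahead of time rather than sampled from the model during ORPO training. A hypothetical record might look like the sketch below; the field names are illustrative assumptions, not the exact dataset schema.

```python
# Hypothetical preference-pair record (field names assumed)
example = {
    "prompt": "Explain the odds ratio in one sentence.",
    "chosen": "The odds ratio compares the odds of an event under two conditions.",  # y_w
    "rejected": "It is a number.",                                                   # y_l
}
```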

@wwkk4964

2 weeks ago

What's going on, is it Yannic bonanza time of the year?! Loving these addictive videos.

|

@MyCiaoatutti

2 weeks ago

"Specifically, 1 - p(y|x) in the denominators amplifies the gradients when the corresponding side of the likelihood p(y|x) is low". I think that (1 - p(y|x)) have two different meanings here: it can be the result of differentiation by coincidence and also the "corresponding side" of the likelihood, i.e., 1 - p(y|x). So, when it says the "corresponding side" of p(y|x) is low, it means that 1 - p(y|x) is low.

1 |
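One way to check that reading (a sketch, not the paper's code): the derivative of the log-odds with respect to log p(y|x) works out to 1/(1 - p(y|x)), so the gradient is amplified exactly when 1 - p(y|x) is small, i.e., when p(y|x) is already high.

```python
import torch

# Gradient of log-odds w.r.t. log p equals 1/(1 - p): it blows up as
# p -> 1, i.e., when 1 - p(y|x) ("the corresponding side") is low.
for p_val in (0.1, 0.5, 0.9, 0.99):
    logp = torch.tensor(p_val).log().requires_grad_(True)
    log_odds = logp - torch.log1p(-torch.exp(logp))
    log_odds.backward()
    print(p_val, logp.grad.item(), 1 / (1 - p_val))  # grad matches 1/(1-p)
```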

@chrise8153

2 weeks ago

Wow good timing to go on youtube

|
