[QA] Mission Impossible: A Statistical Perspective on Jailbreaking LLMs

This paper analyzes preference alignment and jailbreaking in large language models, proposing E-RLHF as a cost-effective method to enhance safety without compromising performance.

Arxiv Papers
