[Monday evening short video] Summary of two new amazing LLM benchmarking papers: GAIA and GPQA

Sharing a summary of two amazing LLM benchmarking papers published just last week:
- GAIA: the General AI Assistant benchmark (disclaimer: I'm a co-author with amazing co-authors)
- GPQA: the Graduate-Level Google-Proof Q&A benchmark (disclaimer: the authors are also awesome)

When two teams (covering a diverse range of actors like Anthropic, Cohere, New York University, Hugging Face, Meta AI) independently come up with benchmarks that share so many aspects (while being really different in goals and approaches), you know the future of LLM benchmarking is changing right before your eyes.

Both are super difficult (~30% GPT-4 success rate), both are small (450 questions) and carefully hand-crafted question by question, with a single gold answer and a strong focus on the reasoning itself rather than on memorization. Both are very challenging test beds for the capabilities of coming models.

And above all: I'm super excited that these are open-source benchmarks, giving us common ground for comparing the coming frontier models. On to the future (of open evaluation)!
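
Side note for anyone who wants to poke at the data: both benchmarks are distributed on the Hugging Face Hub, and a minimal sketch of loading them with the datasets library could look like the snippet below. The repo IDs, config names and splits are assumptions from memory (and GAIA access may be gated, requiring you to accept its terms and authenticate first), so double-check the exact identifiers on each paper's Hub page.

from datasets import load_dataset

# Hypothetical repo IDs / config names -- verify against the official Hub pages.
gaia = load_dataset("gaia-benchmark/GAIA", "2023_all", split="validation")
gpqa = load_dataset("Idavidrein/gpqa", "gpqa_main", split="train")

print(len(gaia), "GAIA validation questions")
print(len(gpqa), "GPQA questions")
print(gaia[0])  # question text, difficulty level, gold answer, attached files, ...
print(gpqa[0])  # question text, correct answer, incorrect options, ...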

Papers and more information:
Comments

Vegas in the building. Could use better videos explaining how to use Hugging Face to autotrain, or how to use Docker. Can't find a full video that explains things in great detail for learning purposes. Everyone sends you to a website to read pages of work. Help plz 🙏

bumlifeBomblifeManagement

After reading the GPQA, I'm pretty sure I can't do a single question (BSc Biology)

zaf