Meetup: Evaluating LLMs: Needle in a Haystack

LLM evaluation is a discipline where confusion reigns and foundation model builders are effectively grading their own homework.

Building on the viral threads on X/Twitter, Greg Kamradt, Robert Nishihara, and Jason Lopatecki discuss highlights from Arize AI's ongoing research on how major foundation models, from OpenAI’s GPT-4 to Mistral and Anthropic’s Claude, stack up against each other on important tasks and emerging LLM use cases. They cover the results of Needle in a Haystack tests and explain why those results matter, along with other eval results on hallucination detection on private data, question answering, code functionality, and more.

Curious which foundation models your company should be using for a specific use case – and which to avoid? You won’t want to miss this meetup!

Anyscale

Related videos

Comments

Good content, thanks for your research

antonidabrowski

Evaluating Retrieval in RAGs - Maria Knorps | WiMLDS Poznań 23rd Meetup, Fandom Office

Holistic Evaluation of Generative AI Systems // Jineet Doshi // MLOps Podcast #280

Anthropic 2024 Updates including Claude 3 + GenAI Observability and LLM Evaluation with Truera

Make TechTalks: From Data to Intelligence: Embedding Company Knowledge in LLMs

Learning at test time in LLMs

Gen AI Journey to Production - Expert Panel

LLM Evaluation with Arize AI's Aparna Dhinakaran // MLOps Podcast #210

Navigating the AI Frontier // Boris Selitser // MLOps Podcast #241

Gen AI London - LLM Agents For the Enterprise

Prototyping with Generative AI with Ben Lerner - nyhackr February Meetup

CLIQ-ai.quebec NLP Meetups - December 2024

[80] Solving NLP (Natural Language Processing) Tasks Using Chat GPTs & LLMs (Large Language Mode...

Yuandong Tian | Efficient Inference of LLMs with Long Context Support

SQL Generation Evals: LLMs-as-a-Judge

AI Meetup (Dallas) 7/23/2024

AI Agents for Data Analysis with Shreya Shankar - 703

All the Hard Stuff with LLMs in Product Development // Phillip Carter // MLOps Podcast #170

How Far Can We Scale AI? Gen 3, Claude 3.5 Sonnet and AI Hype

The Needle in the Haystack: Harnessing the Power of AI to Identify the Sickest Cancer Patients

Computer Vision Meetup: Fast and Flexible Data Discovery & Mining at Petabyte Scale

Austin Deep Learning Meetup - Llama 3 Candidate Paper | Self-Rewarding Language Models

SF Unstructured Data Meetup May 21 2024

Boosting AI with Python: Using Click, Jinja2, and GPT Libraries