LLM Evals - Part 1: Evaluating Performance

OTHER TRELIS LINKS:

TIMESTAMPS:
00:00 Introduction to LLM Evaluation
03:21 Understanding Evaluation Pipelines
09:56 Building a Demo Application
15:21 Creating Evaluation Datasets
23:52 Practical Evaluation Task / Question Development
27:40 Running and Analyzing Evaluations
30:24 Comparing LLM Model Performance using Evals
34:09 Conclusion and Next Steps
Comments

From video launch onwards, the multi-repo bundle (see Trelis.com for more details) will include ADVANCED-fine-tuning, ADVANCED-inference, ADVANCED-transcription (incl. speech to text and text to speech), ADVANCED-vision (includes multi-modal and diffusion models), and now ADVANCED-evals.

Those who have already purchased the Trelis Multi-Repo bundle will gain free access to the ADVANCED-evals repo. Check your GitHub activity page once this video goes live!

Said simply: whether you have already purchased the multi-repo bundle or purchase it now, you will get access to ADVANCED-evals when the video goes live.

TrelisResearch

I can't believe it, I guess there is some connection: I was searching for tools and lectures on how to evaluate responses from an LLM for my own little fine-tuned model, and here comes Trelis with exactly that.
Thanks a lot man, you saved me dozens of hours.

fatshaddy-rzwn

Genuinely appreciate your systematic and calm style of explaining... Will watch many more of your videos in the coming days. Thank you man.

colosys

Super useful video, really appreciate your contributions. They're worth so much!

MrMoonsilver

Can't wait for the follow-up video on evals; this was very useful.

dennismaorwe

Thank you so much for this. Can't wait to watch the whole video. I am grateful for the information ❤

KopikoArepo

Thank you for such a practical video. If you do end up making a part 2, advice on how to use evals to improve pipelines and prompts would be helpful. Everybody knows how to vibe-check responses and trial-and-error their way to improved prompts, but I'm wondering if there's a more rigorous, structured approach. Like DSPy, but less complicated.

deoxykev
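
On the question above about something more structured than vibe-checking, one simple pattern is to score each prompt variant against the same small eval set and keep the highest scorer. The sketch below is illustrative only: `run_model`, the eval items, and the exact-match scoring are hypothetical placeholders, not the video's or the repo's actual setup.

```python
# Minimal sketch: score a few prompt variants against the same eval set
# and keep the best one. `run_model` is a hypothetical stand-in for whatever
# LLM client you use; exact-match scoring is a placeholder metric.

EVAL_SET = [
    {"question": "What does RAG stand for?", "answer": "retrieval-augmented generation"},
    {"question": "What does MCQ stand for?", "answer": "multiple-choice question"},
]

PROMPT_VARIANTS = [
    "Answer concisely: {question}",
    "You are a precise assistant. Reply with only the expanded term: {question}",
]

def run_model(prompt: str) -> str:
    # Hypothetical stand-in: swap in a real client (OpenAI, Ollama, etc.).
    return "retrieval-augmented generation"  # dummy reply so the sketch runs

def score_variant(template: str) -> float:
    correct = 0
    for item in EVAL_SET:
        reply = run_model(template.format(question=item["question"]))
        correct += int(item["answer"].lower() in reply.lower())
    return correct / len(EVAL_SET)

scores = {template: score_variant(template) for template in PROMPT_VARIANTS}
best = max(scores, key=scores.get)
print(f"best prompt ({scores[best]:.0%} exact match): {best}")
```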

Nice approach.

Just a few suggestions/requests:
1. If you could also include a UI to move away from the terminal, it would be very helpful.
2. If you could include at least one example of usage with open-source models, like Ollama for instance (a minimal sketch follows after this comment).
3. In future videos, if you could also show a way to evaluate datasets (data quality check approaches).

THE-AI_INSIDER
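
On suggestion 2 above, here is a minimal sketch of running one eval question against a local open-source model, assuming Ollama is running with its OpenAI-compatible endpoint at http://localhost:11434/v1 and a model (e.g. llama3) has already been pulled. The question, reference string, and model name are illustrative, not the repo's code.

```python
# Minimal sketch: run one eval question against a local open-source model via
# Ollama's OpenAI-compatible endpoint and check the reply for a reference string.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible API
    api_key="ollama",                      # any non-empty string works locally
)

question = "In one sentence, what is retrieval-augmented generation?"
reference = "retrieval"  # crude keyword check, just for illustration

response = client.chat.completions.create(
    model="llama3",  # example name; use whatever you've pulled with `ollama pull`
    messages=[{"role": "user", "content": question}],
    temperature=0,
)
reply = response.choices[0].message.content
print("PASS" if reference.lower() in reply.lower() else "FAIL", "|", reply)
```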

Just two questions:

1. I am still missing the connection from running a baseline eval on a pretrained LLM of choice (with the intention of fine-tuning it), to preparing the training data, to training, to running the evaluation on the fine-tuned LLM. I am getting bits and pieces but not entirely how it all connects.

2. How can I use the repo to achieve the above approach? Unless I am thinking about it rather naively.

dennismaorwe
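
On question 1 above, the usual loop is: run the eval once on the pretrained model to get a baseline, prepare training data that is disjoint from the eval set, fine-tune, then run the identical eval on the fine-tuned checkpoint and compare the two scores. A minimal sketch of that shape, where `run_eval` and the model names are hypothetical placeholders rather than the repo's API:

```python
# Minimal sketch of the baseline -> fine-tune -> re-evaluate loop.
# `run_eval` is a hypothetical helper that runs the same eval set against a
# model and returns accuracy; replace it with a real eval harness.

EVAL_SET = "evals/my_task.jsonl"  # held out from the training data

def run_eval(model_name: str, eval_path: str) -> float:
    # Placeholder: call your eval harness here and return accuracy in [0, 1].
    return 0.0

baseline = run_eval("base-model", EVAL_SET)
# ... prepare training data (disjoint from EVAL_SET) and fine-tune the model ...
finetuned = run_eval("my-finetuned-model", EVAL_SET)

print(f"baseline: {baseline:.1%}  fine-tuned: {finetuned:.1%}  delta: {finetuned - baseline:+.1%}")
```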

Hey, this is gonna be a sort of long question so bear with me, but what is your opinion on creating MCQ datasets for automatic and objective evaluation, without a human/LLM judge in the loop?

Essentially it would mean generating question-answer pairs using an LLM, and then using those pairs to generate 3-4 "dummy" answers which are slightly reworded to be wrong. Then you run an eval with your setup on the MCQ set and get an objective measure of whether an answer is correct or not.

If the LLM used to generate the questions cannot do it reliably, then we could maybe use the positive anchors generated during RAG fine-tuning.

You take a question and anchor, then ask an LLM to generate an answer to this question using the anchor as context.

After this you generate the dummy answers and so on.

My thoughts were that it might be too easy for LLMs to answer correctly, or too hard to reliably generate good dummy answers; however, from a RAG evaluation standpoint I think it could still work. Imagine an 8B model with RAG gets 90% correct and without RAG gets 85% correct. We wouldn't really see what effect our RAG has on the evaluation, as the model can answer well enough on its own anyway. However, if we swap that out for, say, a 3B model, essentially handicapping the model intentionally, and then view the difference between base and RAG, maybe that could work.

Anyway, would love to know your opinions on this.

LorenzEhrlich
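
A minimal sketch of the MCQ idea described in the comment above: pre-generate distractors for each question, present the shuffled options, and score by exact option match so no human or LLM judge is needed. The `ask_model` helper, the example items, and the letter parsing are illustrative assumptions, not the repo's code.

```python
# Minimal sketch: objective multiple-choice scoring without a judge model.
# Distractors are assumed to be pre-generated (e.g. by prompting an LLM for
# slightly reworded but wrong variants of the reference answer).
import random

MCQ_SET = [
    {
        "question": "What does RAG stand for?",
        "correct": "Retrieval-augmented generation",
        "distractors": [
            "Retrieval-aligned generation",
            "Recursive answer generation",
            "Retrieval-augmented grounding",
        ],
    },
]

def ask_model(prompt: str) -> str:
    # Hypothetical stand-in: swap in a real client and return its reply.
    return "A"

correct = 0
for item in MCQ_SET:
    options = item["distractors"] + [item["correct"]]
    random.shuffle(options)  # avoid positional bias toward the correct answer
    letters = "ABCD"[: len(options)]
    prompt = (
        item["question"]
        + "\n"
        + "\n".join(f"{letter}. {opt}" for letter, opt in zip(letters, options))
        + "\nAnswer with a single letter."
    )
    choice = ask_model(prompt).strip()[:1].upper()
    if choice in letters and options[letters.index(choice)] == item["correct"]:
        correct += 1

print(f"accuracy: {correct / len(MCQ_SET):.0%}")
```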

We need benchmarks for minimum model size. In other words, we should be running our agents' functions on the smallest possible models. If a tiny model can get 100% accuracy on a given function, that's the model we should use for that kind of function. I'm unaware of anyone doing this kind of work.

dr.mikeybee
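
A minimal sketch of that selection idea: evaluate candidate models from smallest to largest on the same function-calling eval and pick the first one that clears the accuracy bar. `run_function_eval` and the candidate names are hypothetical placeholders, not an existing benchmark.

```python
# Minimal sketch: pick the smallest model that clears the accuracy bar on a
# given function-calling eval. The candidate names and `run_function_eval`
# are hypothetical placeholders, ordered smallest to largest.

CANDIDATES = ["tiny-0.5b", "small-3b", "medium-8b", "large-70b"]
TARGET = 1.0  # require 100% accuracy on this function's eval set

def run_function_eval(model_name: str) -> float:
    # Placeholder: run the eval for one function and return accuracy in [0, 1].
    return 0.0

chosen = None
for model in CANDIDATES:
    accuracy = run_function_eval(model)
    print(f"{model}: {accuracy:.0%}")
    if accuracy >= TARGET:
        chosen = model
        break

print(f"use: {chosen}" if chosen else "no candidate met the target; fall back to the largest model")
```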