LLM Evals - Part 1: Evaluating Performance

OTHER TRELIS LINKS:

TIMESTAMPS:
00:00 Introduction to LLM Evaluation
03:21 Understanding Evaluation Pipelines
09:56 Building a Demo Application
15:21 Creating Evaluation Datasets
23:52 Practical Evaluation Task / Question Development
27:40 Running and Analyzing Evaluations
30:24 Comparing LLM Model Performance using Evals
34:09 Conclusion and Next Steps
Comments

From video launch onwards, the multi-repo bundle (see Trelis.com for more details) will include ADVANCED-fine-tuning, ADVANCED-inference, ADVANCED-transcription (incl. speech to text and text to speech), ADVANCED-vision (includes multi-modal and diffusion models), and now ADVANCED-evals.

Those who have already purchased the Trelis Multi-Repo bundle will gain free access to the ADVANCED-evals repo. Check your GitHub activity page once this video goes live!

Said simply: whether you have already purchased the multi-repo bundle or purchase it now, you will get access to ADVANCED-evals when the video goes live.

TrelisResearch

I can't believe it, I guess there is some connection: I was searching for tools and lectures on how to evaluate responses from an LLM for my own little fine-tuned model, and here comes Trelis with exactly that.
Thanks a lot man, you saved me dozens of hours.

fatshaddy-rzwn

Genuinely appreciate your systematic and calm style of explaining... Will watch many more of your videos in the coming days. Thank you man.

colosys

Super useful video, really appreciate your contributions. They're worth so much!

MrMoonsilver

Can't wait for the follow-up video on evals; this was very useful.

dennismaorwe

Thank you so much for this. Can't wait to watch the whole video. I am grateful for the information ❤

KopikoArepo

Thank you for such a practical video. If you do end up making a part 2, advice on how to use evals to improve pipelines and prompts would be helpful. Everybody knows how to vibe-check responses and trial-and-error their way to improved prompts, but I'm wondering if there's a more rigorous, structured approach. Like DSPy, but less complicated.

deoxykev
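
On the question above about something more structured than vibe-checking, one simple pattern is to score each prompt variant against the same small eval set and keep the highest scorer. The sketch below is illustrative only: `run_model`, the eval items, and the exact-match scoring are hypothetical placeholders, not the video's or the repo's actual setup.

```python
# Minimal sketch: score a few prompt variants against the same eval set
# and keep the best one. `run_model` is a hypothetical stand-in for whatever
# LLM client you use; exact-match scoring is a placeholder metric.

EVAL_SET = [
    {"question": "What does RAG stand for?", "answer": "retrieval-augmented generation"},
    {"question": "What does MCQ stand for?", "answer": "multiple-choice question"},
]

PROMPT_VARIANTS = [
    "Answer concisely: {question}",
    "You are a precise assistant. Reply with only the expanded term: {question}",
]

def run_model(prompt: str) -> str:
    # Hypothetical stand-in: swap in a real client (OpenAI, Ollama, etc.).
    return "retrieval-augmented generation"  # dummy reply so the sketch runs

def score_variant(template: str) -> float:
    correct = 0
    for item in EVAL_SET:
        reply = run_model(template.format(question=item["question"]))
        correct += int(item["answer"].lower() in reply.lower())
    return correct / len(EVAL_SET)

scores = {template: score_variant(template) for template in PROMPT_VARIANTS}
best = max(scores, key=scores.get)
print(f"best prompt ({scores[best]:.0%} exact match): {best}")
```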

Nice approach.

Just a few suggestions/requests:
1. If you could also include a UI to move away from the terminal, it would be very helpful.
2. If you could include at least one example of usage with open-source models, like Ollama for instance (a minimal sketch follows after this comment).
3. In future videos, if you could also show a way to evaluate datasets (data quality check approaches).

THE-AI_INSIDER
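
On suggestion 2 above, here is a minimal sketch of running one eval question against a local open-source model, assuming Ollama is running with its OpenAI-compatible endpoint at http://localhost:11434/v1 and a model (e.g. llama3) has already been pulled. The question, reference string, and model name are illustrative, not the repo's code.

```python
# Minimal sketch: run one eval question against a local open-source model via
# Ollama's OpenAI-compatible endpoint and check the reply for a reference string.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible API
    api_key="ollama",                      # any non-empty string works locally
)

question = "In one sentence, what is retrieval-augmented generation?"
reference = "retrieval"  # crude keyword check, just for illustration

response = client.chat.completions.create(
    model="llama3",  # example name; use whatever you've pulled with `ollama pull`
    messages=[{"role": "user", "content": question}],
    temperature=0,
)
reply = response.choices[0].message.content
print("PASS" if reference.lower() in reply.lower() else "FAIL", "|", reply)
```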

Just two questions:

1. I am still missing the connection from running a baseline eval on a pretrained LLM of choice (with the intention of fine-tuning it), to preparing the training data, to training, to running the evaluation on the fine-tuned LLM. I am getting bits and pieces but not entirely how it all connects.

2. How can I use the repo to achieve the above approach? Unless I am thinking about it rather naively.

dennismaorwe
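
On question 1 above, the usual loop is: run the eval once on the pretrained model to get a baseline, prepare training data that is disjoint from the eval set, fine-tune, then run the identical eval on the fine-tuned checkpoint and compare the two scores. A minimal sketch of that shape, where `run_eval` and the model names are hypothetical placeholders rather than the repo's API:

```python
# Minimal sketch of the baseline -> fine-tune -> re-evaluate loop.
# `run_eval` is a hypothetical helper that runs the same eval set against a
# model and returns accuracy; replace it with a real eval harness.

EVAL_SET = "evals/my_task.jsonl"  # held out from the training data

def run_eval(model_name: str, eval_path: str) -> float:
    # Placeholder: call your eval harness here and return accuracy in [0, 1].
    return 0.0

baseline = run_eval("base-model", EVAL_SET)
# ... prepare training data (disjoint from EVAL_SET) and fine-tune the model ...
finetuned = run_eval("my-finetuned-model", EVAL_SET)

print(f"baseline: {baseline:.1%}  fine-tuned: {finetuned:.1%}  delta: {finetuned - baseline:+.1%}")
```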

Hey, this is gonna be a sort of long question so bear with me, but what is your opinion on creating MCQ datasets for automatic and objective evaluation, without a human/LLM judge in the loop?

Essentially it would mean generating question-answer pairs using an LLM, and then using those pairs to generate 3-4 "dummy" answers which are slightly reworded to be wrong. Then you run an eval with your setup on the MCQ set and get an objective measure of whether an answer is correct or not.

If the LLM used to generate the questions cannot do it reliably, then we could maybe use the positive anchors generated during RAG fine-tuning.

You take a question and anchor, then ask an LLM to generate an answer to this question using the anchor as context.

After this you generate the dummy answers and so on.

My thoughts were that it might be too easy for LLMs to answer correctly, or too hard to reliably generate good dummy answers; however, from a RAG evaluation standpoint I think it could still work. Imagine an 8B model with RAG gets 90% correct and without RAG gets 85% correct. We wouldn't really see what effect our RAG has on the evaluation, as the model can answer well enough on its own anyway. However, if we swap that out for, say, a 3B model, essentially handicapping the model intentionally, and then view the difference between base and RAG, maybe that could work.

Anyway, would love to know your opinions on this.

LorenzEhrlich
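
A minimal sketch of the MCQ idea described in the comment above: pre-generate distractors for each question, present the shuffled options, and score by exact option match so no human or LLM judge is needed. The `ask_model` helper, the example items, and the letter parsing are illustrative assumptions, not the repo's code.

```python
# Minimal sketch: objective multiple-choice scoring without a judge model.
# Distractors are assumed to be pre-generated (e.g. by prompting an LLM for
# slightly reworded but wrong variants of the reference answer).
import random

MCQ_SET = [
    {
        "question": "What does RAG stand for?",
        "correct": "Retrieval-augmented generation",
        "distractors": [
            "Retrieval-aligned generation",
            "Recursive answer generation",
            "Retrieval-augmented grounding",
        ],
    },
]

def ask_model(prompt: str) -> str:
    # Hypothetical stand-in: swap in a real client and return its reply.
    return "A"

correct = 0
for item in MCQ_SET:
    options = item["distractors"] + [item["correct"]]
    random.shuffle(options)  # avoid positional bias toward the correct answer
    letters = "ABCD"[: len(options)]
    prompt = (
        item["question"]
        + "\n"
        + "\n".join(f"{letter}. {opt}" for letter, opt in zip(letters, options))
        + "\nAnswer with a single letter."
    )
    choice = ask_model(prompt).strip()[:1].upper()
    if choice in letters and options[letters.index(choice)] == item["correct"]:
        correct += 1

print(f"accuracy: {correct / len(MCQ_SET):.0%}")
```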

We need benchmarks for minimum model size. In other words, we should be running our agents' functions on the smallest possible models. If a tiny model can get 100% accuracy on a given function, that's the model we should use for that kind of function. I'm unaware of anyone doing this kind of work.

dr.mikeybee
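
A minimal sketch of that selection idea: evaluate candidate models from smallest to largest on the same function-calling eval and pick the first one that clears the accuracy bar. `run_function_eval` and the candidate names are hypothetical placeholders, not an existing benchmark.

```python
# Minimal sketch: pick the smallest model that clears the accuracy bar on a
# given function-calling eval. The candidate names and `run_function_eval`
# are hypothetical placeholders, ordered smallest to largest.

CANDIDATES = ["tiny-0.5b", "small-3b", "medium-8b", "large-70b"]
TARGET = 1.0  # require 100% accuracy on this function's eval set

def run_function_eval(model_name: str) -> float:
    # Placeholder: run the eval for one function and return accuracy in [0, 1].
    return 0.0

chosen = None
for model in CANDIDATES:
    accuracy = run_function_eval(model)
    print(f"{model}: {accuracy:.0%}")
    if accuracy >= TARGET:
        chosen = model
        break

print(f"use: {chosen}" if chosen else "no candidate met the target; fall back to the largest model")
```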