Autonomous Open Source LLM Evaluator (Ollama) - Full Guide

Today I take a look at my Autonomous Open Source LLM Evaluator using Ollama and GPT-4. This is a neat tool for testing open source LLMs on different tasks, such as problem solving and code.

00:00 Ollama LLM Eval Intro
00:21 Ollama LLM Eval Flowchart
01:28 LLM Evaluator Code 1
06:24 Test 1
08:30 LLM Evaluator Code 2
09:13 Test 2
10:53 Conclusion

Comments

I've built a similar system, but I noticed that the judge model sometimes hallucinates and gives high marks to obviously wrong solutions. I tried using a jury of multiple judges (different big models). This improved judging quality but made it 8x slower. Also, with multiple judges you need to fuse their judgments into a consensus. It's just pretty slow, and all models do hallucinate.

ArseniyPotapov
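
One way to fuse several judges' scores into a consensus, as described in the comment above, is to take the median and flag cases where the judges disagree widely. This is only a minimal sketch, not the commenter's or the video's implementation: the function name `fuse_scores`, the numeric score scale, and the `max_spread` threshold are all assumptions made for illustration.

```python
from statistics import median


def fuse_scores(scores, max_spread=3.0):
    """Fuse per-judge scores into one consensus score (hypothetical helper).

    Returns (consensus, agreed). The consensus is the median, which a single
    hallucinating judge cannot drag far. `agreed` is False when the gap
    between the highest and lowest score exceeds `max_spread` (an assumed
    threshold), so the case can be reviewed by hand.
    """
    if not scores:
        raise ValueError("need at least one judge score")
    consensus = median(scores)
    agreed = (max(scores) - min(scores)) <= max_spread
    return consensus, agreed


# Two judges rate a solution highly, one rates it as wrong.
print(fuse_scores([8, 9, 2]))
```

The median keeps one outlier judge from deciding the result, while the disagreement flag marks exactly the cases where a hallucinated high or low mark is most likely. It does not address the slowdown the comment mentions.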

aya:35b blows everything out of the water. Not ten times better than ChatGPT but one hundred times better. It's slow, since it's a 35B model run locally, but I love it. Besides that, I use llama3 for most everyday tasks.

ProfessorCrumbs

In the bubble sort evaluation, all the models that were evaluated as wrong (Mistral, Codestral, etc.) had a syntax error on line 1, because the model's output text was included as a line of code. The code itself was sound in every case. So it is not a proper eval: the failures came from a simple syntax error introduced by your code, not by the LLM's code, and you need to check why it worked for a couple of models but not the others. Other than that, it's a cool idea.

thenarrowgate
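
The failure described in the comment above (reply text being run as if it were code) can usually be avoided by extracting only the fenced code block from the model's reply before executing it. The sketch below is an assumption about how that could be done, not the evaluator's actual code: `extract_code` and `is_valid_python` are hypothetical helper names, and the reply format (a markdown fence around the code) is assumed.

```python
import re

FENCE = "`" * 3  # markdown code fence marker
# Opening fence with an optional language tag, then everything up to the next fence.
CODE_BLOCK = re.compile(FENCE + r"[\w+-]*[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_code(reply: str) -> str:
    """Return the first fenced code block in `reply`.

    Falls back to the whole reply (stripped) when no fence is found.
    """
    match = CODE_BLOCK.search(reply)
    if match:
        return match.group(1).strip()
    return reply.strip()


def is_valid_python(source: str) -> bool:
    """Check that `source` compiles, without running it."""
    try:
        compile(source, "<llm-reply>", "exec")
        return True
    except SyntaxError:
        return False


reply = (
    "Here is the bubble sort:\n"
    + FENCE + "python\n"
    + "def f(xs):\n    return sorted(xs)\n"
    + FENCE + "\nHope it helps!"
)
print(is_valid_python(reply))                # raw reply: prose breaks line 1
print(is_valid_python(extract_code(reply)))  # extracted block only
```

Running a compile check on the extracted block before judging also separates "the harness passed prose to the interpreter" from "the model wrote broken code".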

May I ask what your roadmap for this channel is?

tonywhite

What is the point of evaluating many models with a more powerful model if this is required for each problem? It would be much faster to just ask GPT-4 for the answer to the problem.

JohnDoe-zxbu