NEW Open-Source LLM Tops The Rankings...But Is It Actually Good?

Cohere released Command R+ with open weights! It is currently the top open model according to the LMSYS leaderboard, but let's test it ourselves. This model is optimized for retrieval and tool use, with a focus on enterprise use cases.

Join My Newsletter for Regular AI Updates 👇🏼

Need AI Consulting? ✅

My Links 🔗

Rent a GPU (MassedCompute) 🚀
USE CODE "MatthewBerman" for 50% discount

Media/Sponsorship Inquiries 📈

Links:
Comments

When Matthew puts "Is it any good?" in the title, you know it's garbage.

avi

Thank you Matthew Berman for respecting our time by keeping this video under 10 minutes while still serving our preference to hear it from you.

aloveofsurf

"Ok well YOU needed to do that"

🤣

phobes

Testing out command-r v1 and command-r-plus out of domain, hmm... I mean, as you stated, the model is fine-tuned for grounding and citation in RAG. Wouldn't it make sense, then, to extend your eval dataset with RAG tests? RAGAS would be very easy to implement. Grounding and RAG are the most common business use cases for LLMs.

JanBadertscher
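
For anyone curious what that suggestion might look like in practice, here is a minimal sketch using the open-source RAGAS package. The sample question, contexts, and answer below are made up purely for illustration, and RAGAS's LLM-judged metrics assume an LLM API key (OpenAI by default) is configured.

```python
# Minimal sketch of a RAGAS-style eval (pip install ragas datasets).
# The data here is a made-up toy sample; RAGAS scores each row with
# an LLM judge, so an API key must be configured in the environment.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

sample = {
    "question": ["What is Command R+ optimized for?"],
    "contexts": [[
        "Command R+ is optimized for retrieval-augmented generation "
        "and tool use in enterprise settings."
    ]],
    "answer": ["Command R+ is tuned for RAG and tool use."],
}

results = evaluate(
    Dataset.from_dict(sample),
    metrics=[faithfulness, answer_relevancy],
)
print(results)  # per-metric scores between 0 and 1
```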

Why are people ignoring the fact that it’s meant to be used for RAG?

Dygit

LLM is for RAG.
"Do Snake in Python"

Dreamslol

As we start seeing more specialized models that are less generalized, it may help to reconsider the testing methods used. I like the practicality in your reviews and would like to see that extend to tailoring your challenges (or weighting what you have) to see how well these models (especially open source ones) do what they claim to be best at.
Thank you for another great video!

jim-i-am

I managed to put this 104-billion-parameter model on my phone, and it works, YES.
Granted, it's a 24 GB RAM phone (a OnePlus 12 straight from China, as the global version is still limited to 16 GB) and very heavily quantized (Q1), but nevertheless, on that Snapdragon 8 Gen 3 SoC it produces very good answers where Command-R+ shines, i.e. code, and it does so with only about 12 seconds of initial latency, then answers at about half normal reading speed 🙂

SasskiaLudin
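
As a rough illustration of that kind of setup, here is a minimal sketch of loading a heavily quantized Command R+ GGUF with llama-cpp-python. The file name and settings are assumptions; a true ~1-bit quant would correspond to llama.cpp's IQ1_S/IQ1_M quant types.

```python
# Minimal sketch: running a heavily quantized Command R+ locally with
# llama-cpp-python (pip install llama-cpp-python). The model path is
# hypothetical; you would first download or produce an IQ1-quantized
# GGUF of the 104B weights.
from llama_cpp import Llama

llm = Llama(
    model_path="command-r-plus-104b.IQ1_S.gguf",  # hypothetical local file
    n_ctx=4096,   # modest context window to fit phone-class RAM
    n_threads=8,  # match the SoC's performance cores
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write FizzBuzz in Python."}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```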

The ability to reason and to write code are not the only benchmarks for measuring a model's real-life applications.

Creative writing is one too. I use AI in a setting where I'm interested in the entertainment value of the replies, not only in whether they are logically correct.

The best open-source model I've found for this is Llama 2 chat 13B. It writes the most fun answers by far. It even uses emoji in its replies in a natural way, without being prompted to do so.

I even compared it to Gemini Pro, a much bigger and faster model, and despite its amazing inference time thanks to cloud computing, the answers it wrote were just boring.

brunodangelo

When you see so many fails, you start wondering if the test scores were a bit cooked, like a VW emissions test!

brianmi

Yeah, I don't think this model is really designed for the type of work you were trying to do here. The LangChain channel already put out a video about this model, and they seemed impressed. I think you'll find more models coming out where the regular tests look horrible but, for the edge use case the model is built for, it is great. That's how I see agent workflows working anyway: I don't think you'll be calling one model; you'll have agents using specific models for specific jobs. It's really no different from real life, where you have specialists who are very good at the jobs they are trained to do. AGI won't be a single model; it will be many models working together. I've seen a few videos about tiny models designed for specific tasks that outperform much larger models.

pin

Don't forget it's also open source and can be deployed locally, which is crucial for some organizations for privacy reasons. So it may be the best solution in some cases, even though it's not the smartest model in the field.

AvizStudio

Sorry, Matthew, for my negative comment, I usually love your videos, but this was a kinda useless set of tests; I was expecting tests of its strengths: function calling and RAG.
"Hey, we're gonna test the components in this ice cream, but we don't have any, so we're using butter for the tests."
C'mon buddy, don't be lazy 😂

splitpierre

These green and red screens for Pass and Fail... here's a suggestion: make them flicker, last longer, and perhaps add a siren sound. I almost got a seizure, but not quite; I feel you need to push it a bit harder.

moamber

R+: I am for RAG
M: Okay. Then can you make a pea soup game in Python?
R+: Bro, I am for RAG
M: FAIL

spookymv

Additionally, there is Command-R Plus, which is 104B and offers significant improvements over Command-R. Notably, Ollama runs it flawlessly.

Canna_Science_and_Technology
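
For reference, a minimal sketch of talking to Command R+ through Ollama's Python client; it assumes the `command-r-plus` model tag has already been pulled locally and the Ollama server is running.

```python
# Minimal sketch: chatting with Command R+ via the Ollama Python client
# (pip install ollama). Assumes `ollama pull command-r-plus` has already
# downloaded the ~104B model.
import ollama

reply = ollama.chat(
    model="command-r-plus",
    messages=[{"role": "user", "content": "Summarize RAG in one sentence."}],
)
print(reply["message"]["content"])
```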

I've only been trying out the API for a few days, and I'm impressed with the capabilities of this command-r-plus model. Besides the connector function that is already integrated with web search by default, there's the multi-turn conversation capability, which is very, very easy to use without having to design my own schema to make it possible. The main thing is that, so far, I haven't found any answers that are "unsatisfactory, tend to be hallucinatory, and don't provide enough insight" for me. It's going to be tough competition!

muhammadlufti
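
For context, here is a minimal sketch of the two features that comment mentions, using the cohere Python SDK's chat endpoint with the built-in web-search connector and multi-turn chat history. The API key and conversation content below are placeholders.

```python
# Minimal sketch: Cohere's chat API with the built-in web-search
# connector and multi-turn history (pip install cohere). The API key
# and conversation are placeholders.
import cohere

co = cohere.Client("YOUR_API_KEY")

response = co.chat(
    model="command-r-plus",
    message="What is Command R+ optimized for?",
    connectors=[{"id": "web-search"}],  # grounding via built-in web search
    chat_history=[                      # multi-turn without a custom schema
        {"role": "USER", "message": "Who builds Command R+?"},
        {"role": "CHATBOT", "message": "Cohere."},
    ],
)
print(response.text)
```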

In my experience, Cohere did a great job building their models for RAG and search use cases. Their reranker and embedding models are a good starting point for rapid prototyping.

KoenigNord
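
As an illustration, a minimal sketch of those two endpoints with the cohere Python SDK; the model names are Cohere's documented v3 ones, and the query and documents are made up.

```python
# Minimal sketch: Cohere's rerank and embed endpoints for RAG
# prototyping (pip install cohere). API key, query, and documents
# are placeholders.
import cohere

co = cohere.Client("YOUR_API_KEY")

docs = [
    "Command R+ is optimized for RAG and tool use.",
    "Llama 2 13B is a general-purpose chat model.",
]

# Rerank candidate documents against a query.
reranked = co.rerank(
    model="rerank-english-v3.0",
    query="Which model is built for RAG?",
    documents=docs,
    top_n=1,
)
print(reranked.results[0].index)  # index of the best-matching document

# Embed the same documents for vector search.
emb = co.embed(
    texts=docs,
    model="embed-english-v3.0",
    input_type="search_document",  # v3 embed models require an input_type
)
print(len(emb.embeddings[0]))  # vector dimensionality
```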

AI benchmarks - ❌
Matthew Berman testing - ✅

tejeshwar.p

Well, that's unfortunate on a few fronts... I'll still work on testing it locally and comparing results with some other options. I'm also looking forward to testing its 128k context window to see how well it handles large scripts to edit.

Dundell