NEW Open-Source LLM Tops The Rankings...But Is It Actually Good?

Cohere released Command R+ with open weights! It is currently the top open model according to the LMSYS leaderboard, but let's test it ourselves. This model is optimized for retrieval and tool use, with a focus on enterprise use cases.

Join My Newsletter for Regular AI Updates 👇🏼

Need AI Consulting? ✅

My Links 🔗

Rent a GPU (MassedCompute) 🚀
USE CODE "MatthewBerman" for 50% discount

Media/Sponsorship Inquiries 📈

Links:
Comments

When Matthew puts "Is it any good?" in the title, you know it's garbage.

avi

Thank you Matthew Berman for respecting our time by keeping this video under 10 minutes while still serving our preference to hear it from you.

aloveofsurf

"Ok well YOU needed to do that"

🤣

phobes

Testing out command-r v1 and command-r-plus out of domain, hmm... I mean, as you stated, the model is fine-tuned for grounding and citation in RAG. Wouldn't it make sense, then, to extend your eval dataset with RAG tests? RAGAS would be very easy to implement. Grounding and RAG are the most common business use cases for LLMs.

JanBadertscher
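
For anyone curious what that suggestion might look like in practice, here is a minimal sketch using the open-source RAGAS package. The sample question, contexts, and answer below are made up purely for illustration, and RAGAS's LLM-judged metrics assume an LLM API key (OpenAI by default) is configured.

```python
# Minimal sketch of a RAGAS-style eval (pip install ragas datasets).
# The data here is a made-up toy sample; RAGAS scores each row with
# an LLM judge, so an API key must be configured in the environment.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

sample = {
    "question": ["What is Command R+ optimized for?"],
    "contexts": [[
        "Command R+ is optimized for retrieval-augmented generation "
        "and tool use in enterprise settings."
    ]],
    "answer": ["Command R+ is tuned for RAG and tool use."],
}

results = evaluate(
    Dataset.from_dict(sample),
    metrics=[faithfulness, answer_relevancy],
)
print(results)  # per-metric scores between 0 and 1
```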

Why are people ignoring the fact that it’s meant to be used for RAG?

Dygit

LLM is for RAG.
"Do Snake in Python"

Dreamslol

As we start seeing more specialized models that are less generalized, it may help to reconsider the testing methods used. I like the practicality in your reviews and would like to see that extend to tailoring your challenges (or weighting what you have) to see how well these models (especially open source ones) do what they claim to be best at.
Thank you for another great video!

jim-i-am

I managed to put this 104-billion-parameter model on my phone, and it works, YES.
Granted, it's a 24 GB RAM phone (a OnePlus 12 straight from China, as the global version is still limited to 16 GB) and very heavily quantized (Q1), but nevertheless, on that Snapdragon 8 Gen 3 SoC it produces very good answers where Command-R+ shines, i.e. code, and it does so with only about 12 seconds of initial latency, then answers at about half normal reading speed 🙂

SasskiaLudin
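
As a rough illustration of that kind of setup, here is a minimal sketch of loading a heavily quantized Command R+ GGUF with llama-cpp-python. The file name and settings are assumptions; a true ~1-bit quant would correspond to llama.cpp's IQ1_S/IQ1_M quant types.

```python
# Minimal sketch: running a heavily quantized Command R+ locally with
# llama-cpp-python (pip install llama-cpp-python). The model path is
# hypothetical; you would first download or produce an IQ1-quantized
# GGUF of the 104B weights.
from llama_cpp import Llama

llm = Llama(
    model_path="command-r-plus-104b.IQ1_S.gguf",  # hypothetical local file
    n_ctx=4096,   # modest context window to fit phone-class RAM
    n_threads=8,  # match the SoC's performance cores
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write FizzBuzz in Python."}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```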

The ability to reason and to write code are not the only benchmarks for measuring a model's real-life applications.

Creative writing is one too. I use AI in a setting where I'm interested in the entertainment value of the replies, not only in whether they are logically correct.

The best open-source model I've found for this is Llama 2 chat 13B. It writes the most fun answers by far. It even uses emoji in its replies in a natural way, without being prompted to do so.

I even compared it to Gemini Pro, a much bigger and faster model, and despite its amazing inference time thanks to cloud computing, the answers it wrote were just boring.

brunodangelo

When you see so many fails, you start wondering if the test scores were a bit cooked, like a VW emissions test!

brianmi

Yeah, I don't think this model is really designed for the type of work you were trying to do here. The LangChain channel already put out a video about this model, and they seemed impressed. I think you'll find more models coming out where the regular tests look horrible but, for the edge use case the model is built for, it is great. That's how I see agent workflows working anyway: I don't think you'll be calling one model; you'll have agents using specific models for specific jobs. It's really no different from real life, where you have specialists who are very good at the jobs they are trained to do. AGI won't be a single model; it will be many models working together. I've seen a few videos about tiny models designed for specific tasks that outperform much larger models.

pin

Don't forget it's also open source and can be deployed locally, which is crucial for some organizations for privacy reasons. So it may be the best solution in some cases, even though it's not the smartest model in the field.

AvizStudio

Sorry, Matthew, for my negative comment, I usually love your videos, but this was a kinda useless set of tests; I was expecting tests of its strengths: function calling and RAG.
"Hey, we're gonna test the components in this ice cream, but we don't have any, so we're using butter for the tests."
C'mon buddy, don't be lazy 😂

splitpierre

These green and red screens for Pass and Fail... here's a suggestion: make them flicker, last longer, and perhaps add a siren sound. I almost got a seizure, but not quite; I feel you need to push it a bit harder.

moamber

R+: I am for RAG
M: Okay. Then can you make a pea soup game in Python?
R+: Bro, I am for RAG
M: FAIL

spookymv

Additionally, there is Command-R Plus, which is 104B and offers significant improvements over Command-R. Notably, Ollama runs it flawlessly.

Canna_Science_and_Technology
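
For reference, a minimal sketch of talking to Command R+ through Ollama's Python client; it assumes the `command-r-plus` model tag has already been pulled locally and the Ollama server is running.

```python
# Minimal sketch: chatting with Command R+ via the Ollama Python client
# (pip install ollama). Assumes `ollama pull command-r-plus` has already
# downloaded the ~104B model.
import ollama

reply = ollama.chat(
    model="command-r-plus",
    messages=[{"role": "user", "content": "Summarize RAG in one sentence."}],
)
print(reply["message"]["content"])
```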

I've only been trying out the API for a few days, and I'm impressed with the capabilities of this command-r-plus model. Besides the connector function that is already integrated with web search by default, there's the multi-turn conversation capability, which is very, very easy to use without having to design my own schema to make it possible. The main thing is that, so far, I haven't found any answers that are "unsatisfactory, tend to be hallucinatory, and don't provide enough insight" for me. It's going to be tough competition!

muhammadlufti
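
For context, here is a minimal sketch of the two features that comment mentions, using the cohere Python SDK's chat endpoint with the built-in web-search connector and multi-turn chat history. The API key and conversation content below are placeholders.

```python
# Minimal sketch: Cohere's chat API with the built-in web-search
# connector and multi-turn history (pip install cohere). The API key
# and conversation are placeholders.
import cohere

co = cohere.Client("YOUR_API_KEY")

response = co.chat(
    model="command-r-plus",
    message="What is Command R+ optimized for?",
    connectors=[{"id": "web-search"}],  # grounding via built-in web search
    chat_history=[                      # multi-turn without a custom schema
        {"role": "USER", "message": "Who builds Command R+?"},
        {"role": "CHATBOT", "message": "Cohere."},
    ],
)
print(response.text)
```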

In my experience, Cohere did a great job building their models for RAG and search use cases. Their reranker and embedding models are a good starting point for rapid prototyping.

KoenigNord
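
As an illustration, a minimal sketch of those two endpoints with the cohere Python SDK; the model names are Cohere's documented v3 ones, and the query and documents are made up.

```python
# Minimal sketch: Cohere's rerank and embed endpoints for RAG
# prototyping (pip install cohere). API key, query, and documents
# are placeholders.
import cohere

co = cohere.Client("YOUR_API_KEY")

docs = [
    "Command R+ is optimized for RAG and tool use.",
    "Llama 2 13B is a general-purpose chat model.",
]

# Rerank candidate documents against a query.
reranked = co.rerank(
    model="rerank-english-v3.0",
    query="Which model is built for RAG?",
    documents=docs,
    top_n=1,
)
print(reranked.results[0].index)  # index of the best-matching document

# Embed the same documents for vector search.
emb = co.embed(
    texts=docs,
    model="embed-english-v3.0",
    input_type="search_document",  # v3 embed models require an input_type
)
print(len(emb.embeddings[0]))  # vector dimensionality
```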

AI benchmarks - ❌
Matthew Berman testing - ✅

tejeshwar.p

Well, that's unfortunate on a few fronts... I'll still work on testing it locally and comparing results with some other options. I'm also looking forward to testing its 128k context window to see how well it handles large scripts to edit.

Dundell