MLX Mixtral 8x7B on M3 Max 128GB | Better than ChatGPT?

Here is the code to run Mixtral 8x7B on a Mac
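
For reference, a minimal sketch of what running Mixtral on a Mac with MLX can look like, using the mlx-lm Python package (pip install mlx-lm). The Hugging Face repo name below is an assumption, and the video may use the original mlx-examples scripts instead:

from mlx_lm import load, generate

# Load an MLX-converted, 4-bit quantized Mixtral checkpoint (repo name is
# illustrative; any MLX-format Mixtral model from the Hub works the same way).
model, tokenizer = load("mlx-community/Mixtral-8x7B-Instruct-v0.1-4bit")

# Generate a completion; verbose=True streams tokens and reports tokens/sec.
response = generate(
    model,
    tokenizer,
    prompt="Write a haiku about Tokyo.",
    max_tokens=200,
    verbose=True,
)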
Comments

I bought myself an M2 Max 32 GB notebook recently and am very happy with it for small 7B models. Thanks to your advice, I'm just using the small local LLMs for minor stuff, and whenever I need quality responses, ChatGPT's API really is highly useful to me.

Dominik-K

For a web UI, if you run the model in Ollama, Ollama Web UI is pretty slick and very simple to set up.

robertotomas
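
For the Ollama route mentioned above, a minimal sketch of querying a locally running Ollama server from Python, assuming the default port 11434 and that "ollama pull mixtral" has already been run:

import requests

# Ask the local Ollama server for a single, non-streamed completion.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "mixtral",   # the tag pulled with "ollama pull mixtral"
        "prompt": "Explain mixture-of-experts in two sentences.",
        "stream": False,      # return one JSON object instead of a token stream
    },
    timeout=600,
)
resp.raise_for_status()
print(resp.json()["response"])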

This video is outstandingly helpful. Thank you for the clarity of the tests and your insightful closing thoughts. 💯

clementcardonnel

Thanks so much for doing this test. It's basically what I wanted to know to decide whether I should get the M3 Max at 128 GB. Although everything you said is true and the inference speed is slow, the idea for me is that I can check the quality of the bigger models locally, even 70Bs, to see whether the quality difference is enough to justify a deployment. I can check full precision and the quants pretty easily, and maybe even do some tuning, all with the assumption that in a production deployment I can back it with two RTX 6000s or an A100.

The main issue with ChatGPT is of course privacy. Although the API is supposedly private, I don't think most enterprises are comfortable with that, and GPT-4 is still very expensive. Anyway, thanks!

stephenthumb

I agree. I was going to purchase an old Precision 7810 with a Xeon Silver and 256 GB of RAM, with the objective of showing my organization how to run open-source models privately hosted on our own data and hardware. However, using the OpenAI API or ChatGPT is far faster and more reasonable right now.

datpspguy

Your recent video content is a perfect fit for my needs, and I'm thinking about whether to buy a top-of-the-line M3 Max for one of my hobbies: cloning myself with an LLM. To be honest, though, your videos make me hesitate, because there doesn't seem to be a way to balance portability, fine-tuning, and output speed. Although the Mac is a good choice at this point, it hasn't reached the state I want.

laobaGao-yf

You are comparing too many things simultaneously: a local model vs. a cloud model, free vs. paid. This is not a good strategy.

dsblue

Thanks for the video. Did you try fine-tuning, e.g. LoRA, with your M3 Mac, and if so, how fast is it?

daReturn

Great video! Just subscribed, thank you.

rchuhk

A single A100 80GB could run this model fine in 8-bit quantization, which doesn't take much of an accuracy hit; a single A100 40GB in 6-bit, also without much accuracy difference; and a single 24GB GPU in 4-bit. 8-bit takes a bit over 42GB, 6-bit a bit over 32GB, and 4-bit a bit over 21GB. You can see the accuracy difference between them in the model repository: turboderp/Mixtral-8x7B-exl2

Anthonyg
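
The memory figures in the comment above follow from a simple back-of-the-envelope calculation, assuming roughly 46.7B total parameters for Mixtral 8x7B (the experts share attention layers, so "8x7B" undercounts the true total):

# Rough weight-only memory estimate for Mixtral 8x7B at different bit widths.
# Real quantized files add some overhead for scales and metadata, and inference
# also needs room for the KV cache and activations.
N_PARAMS = 46.7e9  # approximate total parameter count (assumption)

for bits in (16, 8, 6, 4):
    bytes_total = N_PARAMS * bits / 8   # bytes needed for the weights alone
    gib = bytes_total / (1024 ** 3)     # convert to GiB
    print(f"{bits}-bit: ~{gib:.1f} GiB of weights")

# Roughly: 16-bit ~87 GiB, 8-bit ~43.5 GiB, 6-bit ~32.6 GiB, 4-bit ~21.7 GiB,
# which lines up with the "a bit over 42/32/21 GB" figures quoted above.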

Just from the perspective of running LLMs (not considering portability), would the M2 Ultra with 192 GB perform better?

laobaGao-yf

With how cheap GDDR6 is now, they really should make cards with a lot more RAM to catch up to this advantage. We can't have puny Macs outperforming the most expensive PC stuff.
The model spelled Tokyo with an "i", though. Could be a fluke.

DanFrederiksen

The problem with ChatGPT, Claude 3.5, and other strong models is that they are all heavily censored. That's why I believe hardware developers will rise to the task, and in the future there will be affordable hardware that supports running private language models at home.

gaborfeher

I have just a 48 GB M3 Max and can run a Dolphin Q5_K_M version of it (via Ollama), and I was not thrilled with it for programming. I feel like DeepSeek Coder 33B Q4 is already much better. Is the fp16 much better? I almost felt like something was wrong, the version I can run performed so poorly.

robertotomas

This is great info, exactly what I was looking for on MLX. But realistically, if your budget is $5k, you can buy a bunch of Tesla cards and run it on an old Xeon workstation for way less than that, probably less than half the price. You can use M10 GPUs, which don't even need a motherboard with above-4G decoding, and stick three of them in one ancient system for 32 × 3 = 96 GB. Such a system would cost much less than $1k and is one type of system I may end up building. However, I'm not really sure that large models are nearly as useful as a RAG system with a mixture of experts, and in that case you can use almost any GPU and get nice results.

MattJonesYT

Thanks for the video, but I think you're completely missing the point here... First, you don't run models locally to save money but to avoid exposing your data to OpenAI, Microsoft, or any other third party. Most commercial companies prefer to pay more to keep their privacy. Second, from my experience GPT works much worse on some prompts, e.g. writing code; Mistral-7B wins in this area IMO. So in most cases I'd prefer to wait longer but get a better response.

pawel

Hey, really great content, thanks 🙏 I am still looking forward to the Grandma modelling video.

vimuser

Can you make a video on how to run the model embedded in an app? I mean that the Xcode project would contain the 7B model in the bundle, so the app runs the LLM locally, not remotely. Is that even possible, or do we need to wait for Apple to release it as a Core ML model?

nat.serrano

Thanks for sharing. I want to buy an M3 Max MacBook Pro, and the only question is how much memory to choose. What is your recommendation? I want to do some development with LLMs, like building my own knowledge system.

zhihmeng

Can it run on the 14-inch Mac, or does it have to be the 16-inch? Thank you.

nothingnobest