Phi 4 on Ollama - is it REALLY better than Claude 3.5?

PHI 4 is a new 14-billion-parameter model from Microsoft that is claimed to beat GPT-4o, Llama 3.3 and Claude 3.5 in math, and to be comparable on other benchmarks. What makes this model so good, however (for its size), is its chain of thought.

In this video we get started with PHI 4 by getting it up and running locally using Ollama and testing it with OpenWebUI. We look at its math capabilities, its coding capabilities, its role-playing capabilities and the chain of thought it generates for puzzles such as sudoku or tic-tac-toe, comparing it with DeepSeek R1, Claude 3.5 Sonnet, GPT-4o-mini and Llama 3.3 on various tasks. We also look at how Microsoft is priming itself for agents but isn't quite there yet due to the lack of tool calls.
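
If you want to try the same prompts outside of OpenWebUI, below is a minimal sketch of querying a locally running Ollama server from Python over its HTTP API. It assumes the default port 11434 and that the model was pulled with the `phi4` tag; adjust both if your setup differs.

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/chat"  # default Ollama chat endpoint


def ask_phi4(prompt: str) -> str:
    """Send a single user message to the locally running phi4 model."""
    payload = {
        "model": "phi4",  # assumes `ollama pull phi4` has already been run
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # return one JSON object instead of a token stream
    }
    response = requests.post(OLLAMA_URL, json=payload, timeout=300)
    response.raise_for_status()
    return response.json()["message"]["content"]


if __name__ == "__main__":
    print(ask_phi4("What is 254*752-3+(2-7)+5? Show your working."))
```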

00:00 - Introduction
00:55 - Installing PHI-4 with Ollama
02:12 - Using OpenWebUI
03:06 - Math Problems
06:48 - Sudoku
11:38 - Tic-Tac-Toe
15:05 - Function Calling and Tools
17:20 - Role Playing
18:55 - Creating Stories
21:09 - Coding
24:37 - Final Thoughts
Comments

please note at 13:46, mini got the answer right, as did llama3.3 at 14:11. i did point out earlier in the video that position 2 is a correct answer. in the flow of the video i was focused on position 3, even though i had pointed out position 2 was also correct. you've all figured this out for yourselves. i'm not editing this video as it doesn't change the narrative. apologies for missing this in the flow of the video

chrishayuk

Saved me time and heartache figuring this out the hard way. I can ask for nothing more. Thanks!

Junon

No tool calling? Qwen will they be able to?

patruff

Nemotron got it right on the first try when responding to '254*752-3+(2-7)+5':

To evaluate this expression, we need to follow the order of operations, often remembered by the acronym PEMDAS (Parentheses, Exponents, Multiplication and Division from left to right, Addition and Subtraction from left to right). Here's how to break it down:

Given expression: `254 * 752 - 3 + (2 - 7) + 5`

1. **Parentheses first**: Solve the expression inside the parentheses.
   - `(2 - 7) = -5`

2. **Update the expression**: Replace the solved parentheses with its result.
   - `254 * 752 - 3 - 5 + 5`

3. **Multiplication next** (since there are no exponents):
   - `254 * 752 = 191008`

4. **Update the expression again**:
   - `191008 - 3 - 5 + 5`

5. **Finally, handle Addition and Subtraction from left to right**:
   - `191008 - 3 = 191005`
   - `191005 - 5 = 191000`
   - `191000 + 5 = 191005`

So, the final result of the expression `254 * 752 - 3 + (2 - 7) + 5` is **191005**.
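
For what it's worth, the breakdown above matches what plain Python gives for the same expression; a quick sanity check, nothing model-specific:

```python
# Quick sanity check of the arithmetic quoted above
result = 254 * 752 - 3 + (2 - 7) + 5
print(result)  # 191005, matching the step-by-step answer
```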

gustavo

FYI, on Mac at least, the BoltAI GUI app will connect to locally-running Ollama and LM Studio models served by those apps

pmarreck

what's 254*752-3+(2-7)+5? first convert it to simpler parts, then start solving it
This model is very good at CoT, so if you make it think step by step, it gives you the right answer. For a better test of this model, you should make it break the problem down and think about it as much as possible.
so the System Prompt should be: always break down any question into simpler parts, then solve it step by step.
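
A minimal sketch of wiring that suggested system prompt into an Ollama request, under the same assumptions as the earlier snippet (default port, model pulled as `phi4`); the system-prompt wording is just the suggestion above, not something Phi-4 requires:

```python
import requests

# System prompt suggested above; purely an assumption about what helps CoT
SYSTEM_PROMPT = (
    "Always break down any question into simpler parts, "
    "then solve it step by step."
)

payload = {
    "model": "phi4",
    "messages": [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "what's 254*752-3+(2-7)+5?"},
    ],
    "stream": False,
}

response = requests.post("http://localhost:11434/api/chat", json=payload, timeout=300)
print(response.json()["message"]["content"])
```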

seniormcyt

Google Fi, Phi, but when are they going to release Fo, and Fum? As an Englishman by blood please respond.

patruff

not yet available when searching the Hugging Face library, but if someone has a pull command from ollama, please say

QuizmasterLaw

I think Sam Altman said “no one wants last year's model”, or the second best model. Unless it's really faster or better, most will stick to one of the others. Good to see though.

thegrumpydeveloper

in OpenWebUI how did you set up dark mode? it is so useful

artoke

What version of OpenWebUI are you using?

stoicescucatalin

does OpenWebUI cost money? can it run *anything*?

dot

If you're really good at using the model, could you get a lot of work done with it, or does it have too many limitations and weaknesses?

Corteum

The 14B-parameter model may give worse answers under test-time compute, as smaller models generally do not reason as well as larger models

FalconStudioWin

I used a Gen10 A2 GPU, and it actually did a really good job on the latest model. It used about 10GB of VRAM at most, and compared with o1 it was not far off being a really good model

DrWaldonHendricks

The Ollama model is quantized to Q4_K_M, so you will lose a lot of quality, especially context. It is not a fair comparison. Also, you may need some guidance to force the thinking that o1 and Claude 3.5 are doing, but Ollama doesn't support that yet.


At this point, Supernova Medius is better?🤔

husanaaulia

13:46 GPT4o mini got the RIGHT answer. position 2 is a winning move as much as 3 (you said it yourself earlier in the video) and you mistakenly discarded the GPT4o mini answer as invalid.
Again at 14:11 you wrongly stated Llama 3.3 gets it wrong when it gets it right!
You had 1 job! Please fix the video or carry your reputation as a shill.

alelondon