Ask Ollama Many Questions at the SAME TIME!

Up until now, it's been impossible to get Ollama to answer more than one question at a time. Version 0.1.33 changes EVERYTHING!! (A quick setup sketch follows the chapter list below.)

(They have a pretty URL because they pay at least $100 per month for Discord. If you help get more viewers to this channel, I can afford that too.)

00:00 - Start
00:41 - This new version changes everything
01:33 - What's The New Change?
01:50 - What is ollama_num_parallel?
02:16 - What is max loaded models?
02:44 - Let's See It In Action
04:55 - But it's not all Roses and Ponies
05:44 - What else is in this release?
06:02 - The New Models
06:46 - Time for a config file??
07:10 - The Technovangelist Newsletter
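For anyone who wants to try this right away: a minimal sketch of enabling the new settings, assuming Ollama 0.1.33 or later is installed and that OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS (covered at 01:50 and 02:16) are the knobs you want. The values below are arbitrary examples, and you would normally export the variables in your shell or service unit rather than launch the server from Python.

    import os
    import subprocess

    env = os.environ.copy()
    env["OLLAMA_NUM_PARALLEL"] = "4"       # answer up to 4 requests to one model at once
    env["OLLAMA_MAX_LOADED_MODELS"] = "2"  # keep up to 2 different models in memory

    # Start the server with the new settings applied.
    subprocess.run(["ollama", "serve"], env=env)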
Comments

Finally, I've been waiting for this from ollama!! Thanks for always keeping us up to date

giannisanrochman

This will be game changing for agent frameworks.

nathank

Hi Matt, another idea comes to mind for a possible future video: going deeper into the context window SIZE of the models available in Ollama

solyarisoftware

The more I work with AI, the more I respect and appreciate the work that Open Source groups do, like Ollama. I hope they continue to improve and innovate for the rest of us to enjoy. Thanks for breaking down this exciting update Matt. Cheers!

OscarTheStrategist

This is huge. Even if it's experimental, it really changes everything. Thanks for the news on the update. Love you

arskas

I really enjoy your videos. They are informative, they are entertaining, they are wise.

jayd

Thanks for bringing the really great news. We'll set up a family server for openwebui with this!

MyAmazingUsername

Matt, you're currently making some of the best and most accessible videos about Ollama. Could you make a video going over RAG/using documents, specifically with the webui, in some more depth? Like how to format documents, prompts, and what different embedding models offer?

Rushil

Wow, this week is gonna be a lot of fun for me, thank you Ollama team!

NLPprompter

This is great when chatting with documents in open-webui when using RAG.
In open-webui I'm using my Ollama engine to run the embedding models too, and before, every time I asked a question it would load the embedding model to create an embedding of my question and then load the chat model to respond. Now it can have both models in memory at the same time, and I don't have to wait for the models to load each time I ask a question.

outofahat
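The RAG flow described above maps to two API calls that can now stay resident together. A rough sketch, assuming a local server started with OLLAMA_MAX_LOADED_MODELS=2 and with nomic-embed-text and llama3 already pulled (both model names are just examples); only the Python standard library is used.

    import json
    import urllib.request

    def post(path: str, payload: dict) -> dict:
        req = urllib.request.Request("http://localhost:11434" + path,
                                     data=json.dumps(payload).encode(),
                                     headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())

    question = "What does the report say about Q3 revenue?"

    # Embed the question; with two model slots this no longer evicts the chat model.
    vector = post("/api/embeddings", {"model": "nomic-embed-text", "prompt": question})["embedding"]

    # ...vector search over the document chunks would go here...
    context = "<retrieved chunks>"

    # Answer with the chat model, which is still in memory, so there is no reload pause.
    answer = post("/api/generate", {"model": "llama3", "stream": False,
                                    "prompt": f"Context:\n{context}\n\nQuestion: {question}"})
    print(answer["response"])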

Thanks for the awesome video. This is one of the best updates that has ever come out. Totally changes the game.

Slimpickens

Great video as always. The biggest takeaway for me was Zellij. I don't know how I miss tools like this, especially when I used to be a very heavy tmux user
😂

brinkoo

Yay that's fantastic! I'm hoping to use an embedding model + llama 3 in something I build, as well as running some prompts in parallel in an agentic flow.

Swapping between models back and forth would have been a huge slowdown, especially for llama3-70b, so that's a fantastic feature!

But like you described, environment variables are not gonna cut it. I would need to manage the concurrency and multi-model loading from my app via the API, while also being aware of the available VRAM, to optimize the flow.

supercurioTube
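Driving the concurrency from an app, as the comment above describes, mostly comes down to firing requests at the same time instead of one after another. A minimal sketch, assuming a local 0.1.33+ server running with OLLAMA_NUM_PARALLEL greater than 1 and llama3 pulled; the prompts are placeholders and only the standard library is used.

    import json
    import urllib.request
    from concurrent.futures import ThreadPoolExecutor

    def ask(prompt: str) -> str:
        body = json.dumps({"model": "llama3", "prompt": prompt, "stream": False}).encode()
        req = urllib.request.Request("http://localhost:11434/api/generate", data=body,
                                     headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())["response"]

    prompts = ["Why is the sky blue?",
               "Name three uses for a brick.",
               "Summarize TCP in one line."]

    # Each thread sends its own HTTP request; with parallelism enabled the server
    # interleaves the generations instead of queueing them one after another.
    with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
        for answer in pool.map(ask, prompts):
            print(answer)

How many requests you let in flight at once still depends on available VRAM, which a later comment asks about.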

Because the other comment was so... ugh... thank you for your efforts! I think the user experience is HORRIBLE, but the software itself is great. ^_^

solsticeprojekt

Hi Matt, if I understand correctly, this new feature allows the Ollama server to run a certain number of requests "in parallel", at the cost of extra RAM to manage each request's "context". Is that correct? Anyway, I suggest including server concurrency architectures as a topic in a new Ollama course. Thanks for all your dissemination work!

solyarisoftware

Is there a way of knowing how many concurrent users we can have with an LLM on a given card?
I mean, if I have an RTX 3080 with Llama3-8B or whatever I could load in Ollama and I push it to prod, how many users could I serve with it? Thank you again for your videos

tecnopadre
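There is no exact answer, but a back-of-envelope estimate helps frame it. Every number below is an assumption (a 10 GB RTX 3080, roughly 4.7 GB for a 4-bit Llama3-8B, and very roughly 1 GB of KV cache per 8K-token slot); real capacity also depends on quantization, context length, and how much queueing your users will tolerate.

    GPU_VRAM_GB    = 10.0  # RTX 3080
    WEIGHTS_GB     = 4.7   # Llama3-8B at 4-bit quantization, approximate
    KV_PER_SLOT_GB = 1.0   # fp16 KV cache per 8K-token context, very approximate

    parallel_slots = int((GPU_VRAM_GB - WEIGHTS_GB) // KV_PER_SLOT_GB)
    print(parallel_slots)  # roughly 5 generations in flight before spilling out of VRAM

Concurrent users can be higher than that, because at any moment most of them are reading or typing and their requests simply queue; the slot count is about how many generations run at the same time.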

A few videos ago I suggested you write your TypeScript/Python code to make requests in parallel instead of sequentially, to take advantage of parallel processing if possible. You had said that this wouldn't make it run any faster, but it sounds like that's no longer the case?

ischmitty

By the way I was not referring to Ollama, but to tools like Zellij that you mentioned in this video. Thanks.

martinlightheart

My preferred AI tool of all time! The Ollama team is awesome! Parallel processing is a really exciting thing!

AlexandreBarbosaIT

Finally support for concurrency! Thanks for the update 👍

henkhbit