Ask Ollama Many Questions at the SAME TIME!

Up until now, it's been impossible to get Ollama to answer more than one question at a time. Version 0.1.33 changes EVERYTHING!! (A quick setup sketch follows the chapter list below.)

(They have a pretty URL because they pay at least $100 per month for Discord. If you help get more viewers to this channel, I can afford that too.)

00:00 - Start
00:41 - This new version changes everything
01:33 - What's The New Change?
01:50 - What is ollama_num_parallel?
02:16 - What is max loaded models?
02:44 - Let's See It In Action
04:55 - But it's not all Roses and Ponies
05:44 - What else is in this release?
06:02 - The New Models
06:46 - Time for a config file??
07:10 - The Technovangelist Newsletter
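For anyone who wants to try this right away: a minimal sketch of enabling the new settings, assuming Ollama 0.1.33 or later is installed and that OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS (covered at 01:50 and 02:16) are the knobs you want. The values below are arbitrary examples, and you would normally export the variables in your shell or service unit rather than launch the server from Python.

    import os
    import subprocess

    env = os.environ.copy()
    env["OLLAMA_NUM_PARALLEL"] = "4"       # answer up to 4 requests to one model at once
    env["OLLAMA_MAX_LOADED_MODELS"] = "2"  # keep up to 2 different models in memory

    # Start the server with the new settings applied.
    subprocess.run(["ollama", "serve"], env=env)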
Comments

Finally, I've been waiting for this from ollama!! Thanks for always keeping us up to date

giannisanrochman

This will be game changing for agent frameworks.

nathank

Hi Matt, another idea comes to mind for a possible future video: going deeper into the context window SIZE of the models available in Ollama

solyarisoftware

The more I work with AI, the more I respect and appreciate the work that Open Source groups do, like Ollama. I hope they continue to improve and innovate for the rest of us to enjoy. Thanks for breaking down this exciting update Matt. Cheers!

OscarTheStrategist

This is huge. Even if it's experimental, it really changes everything. Thanks for the news on the update. Love you

arskas

I really enjoy your videos. They are informative, they are entertaining, they are wise.

jayd

Thanks for bringing the really great news. We'll set up a family server for openwebui with this!

MyAmazingUsername

Matt, you're currently making some of the best and most accessible videos about Ollama. Could you make a video going over RAG/using documents, specifically with the webui, in some more depth? Like how to format documents, prompts, and what different embedding models offer?

Rushil

Wow, this week is gonna be a lot of fun for me, thank you Ollama team!

NLPprompter

This is great when chatting with documents in open-webui when using RAG.
In open-webui I'm using my Ollama engine to run the embedding models too, and before, every time I asked a question it would load the embedding model to create an embedding of my question and then load the chat model to respond. Now it can have both models in memory at the same time, and I don't have to wait for the models to load each time I ask a question.

outofahat
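The RAG flow described above maps to two API calls that can now stay resident together. A rough sketch, assuming a local server started with OLLAMA_MAX_LOADED_MODELS=2 and with nomic-embed-text and llama3 already pulled (both model names are just examples); only the Python standard library is used.

    import json
    import urllib.request

    def post(path: str, payload: dict) -> dict:
        req = urllib.request.Request("http://localhost:11434" + path,
                                     data=json.dumps(payload).encode(),
                                     headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())

    question = "What does the report say about Q3 revenue?"

    # Embed the question; with two model slots this no longer evicts the chat model.
    vector = post("/api/embeddings", {"model": "nomic-embed-text", "prompt": question})["embedding"]

    # ...vector search over the document chunks would go here...
    context = "<retrieved chunks>"

    # Answer with the chat model, which is still in memory, so there is no reload pause.
    answer = post("/api/generate", {"model": "llama3", "stream": False,
                                    "prompt": f"Context:\n{context}\n\nQuestion: {question}"})
    print(answer["response"])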

Thanks for the awesome video. This is one of the best updates that has ever come out. Totally changes the game.

Slimpickens

Great video as always. The biggest takeaway for me was Zellij. I don't know how I miss tools like this, especially when I used to be a very heavy tmux user
😂

brinkoo

Yay that's fantastic! I'm hoping to use an embedding model + llama 3 in something I build, as well as running some prompts in parallel in an agentic flow.

Swapping between models back and forth would have been a huge slowdown, especially for llama3-70b, so that's a fantastic feature!

But like you described, environment variables are not gonna cut it. I would need to manage the concurrency and multi-model loading from my app via the API, while also being aware of the available VRAM, to optimize the flow.

supercurioTube
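Driving the concurrency from an app, as the comment above describes, mostly comes down to firing requests at the same time instead of one after another. A minimal sketch, assuming a local 0.1.33+ server running with OLLAMA_NUM_PARALLEL greater than 1 and llama3 pulled; the prompts are placeholders and only the standard library is used.

    import json
    import urllib.request
    from concurrent.futures import ThreadPoolExecutor

    def ask(prompt: str) -> str:
        body = json.dumps({"model": "llama3", "prompt": prompt, "stream": False}).encode()
        req = urllib.request.Request("http://localhost:11434/api/generate", data=body,
                                     headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())["response"]

    prompts = ["Why is the sky blue?",
               "Name three uses for a brick.",
               "Summarize TCP in one line."]

    # Each thread sends its own HTTP request; with parallelism enabled the server
    # interleaves the generations instead of queueing them one after another.
    with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
        for answer in pool.map(ask, prompts):
            print(answer)

How many requests you let in flight at once still depends on available VRAM, which a later comment asks about.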

Because the other comment was so... ugh... thank you for your efforts! I think the user experience is HORRIBLE, but the software itself is great. ^_^

solsticeprojekt

Hi Matt, if I understand correctly, this new feature allows the Ollama server to run a certain number of requests "in parallel", at the cost of extra RAM to manage each request's "context". Is that correct? Anyway, I suggest including server concurrency architectures as a topic in a new Ollama course. Thanks for all your dissemination work!

solyarisoftware

Is there a way of knowing how many concurrent users we can have with an LLM on a given card?
I mean, if I have an RTX 3080 with Llama3-8B or whatever I could load in Ollama and I push it to prod, how many users could I serve with it? Thank you again for your videos

tecnopadre
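There is no exact answer, but a back-of-envelope estimate helps frame it. Every number below is an assumption (a 10 GB RTX 3080, roughly 4.7 GB for a 4-bit Llama3-8B, and very roughly 1 GB of KV cache per 8K-token slot); real capacity also depends on quantization, context length, and how much queueing your users will tolerate.

    GPU_VRAM_GB    = 10.0  # RTX 3080
    WEIGHTS_GB     = 4.7   # Llama3-8B at 4-bit quantization, approximate
    KV_PER_SLOT_GB = 1.0   # fp16 KV cache per 8K-token context, very approximate

    parallel_slots = int((GPU_VRAM_GB - WEIGHTS_GB) // KV_PER_SLOT_GB)
    print(parallel_slots)  # roughly 5 generations in flight before spilling out of VRAM

Concurrent users can be higher than that, because at any moment most of them are reading or typing and their requests simply queue; the slot count is about how many generations run at the same time.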

A few videos ago I suggested you write your TypeScript/Python code to make requests in parallel instead of sequentially, to take advantage of parallel processing if possible. You had said that this wouldn't make it run any faster, but it sounds like that's no longer the case?

ischmitty

By the way I was not referring to Ollama, but to tools like Zellij that you mentioned in this video. Thanks.

martinlightheart

My preferred AI tool of all time! The Ollama team is awesome! Parallel processing is a really exciting thing!

AlexandreBarbosaIT

Finally support for concurrency! Thanks for the update 👍

henkhbit