Deploy Open LLMs with LLAMA-CPP Server

Learn how to install LLAMA CPP on your local machine, set up the server, and serve multiple users with a single LLM and GPU. We'll walk through installation via Homebrew, setting up the LLAMA server, and making POST requests using curl, the OpenAI client, and the Python requests package. By the end, you'll know how to deploy and interact with different models like a pro.
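
As a taste of the requests covered in the video, here is a minimal sketch that POSTs a chat completion to a locally running LLAMA CPP server with the Python requests package. It assumes llama-server is listening on its default http://localhost:8080 and that you use the OpenAI-compatible /v1/chat/completions route; the model name is a placeholder, since the server answers with whichever GGUF model it was started with, and the same endpoint can also be hit with curl or the OpenAI client.

```python
# Minimal sketch: POST a chat request to a local llama.cpp server.
# Assumes `llama-server` is running on the default http://localhost:8080.
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "local-model",  # placeholder; the server uses the model it was launched with
        "messages": [{"role": "user", "content": "Summarize llama.cpp in one sentence."}],
        "max_tokens": 128,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```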

#llamacpp #deployment #llm_deployment

💻 RAG Beyond Basics Course:

Sign up for the newsletter, localgpt:

LINKS:

TIMESTAMPS:
00:00 Introduction to LLM Deployment Series
00:22 Overview of LLAMA CPP
01:40 Installing LLAMA CPP
02:02 Setting Up the LLAMA CPP Server
03:08 Making Requests to the Server
05:30 Practical Examples and Demonstrations
07:04 Advanced Server Options
09:38 Using OpenAI Client with LLAMA CPP (see the sketch after these timestamps)
11:14 Concurrent Requests with Python
12:47 Conclusion and Next Steps
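
The 09:38 chapter points the official OpenAI Python client at the LLAMA CPP server instead of the OpenAI API. Here is a minimal sketch of that pattern, assuming the default http://localhost:8080 endpoint and a server started without an API key requirement; the model name and the dummy key are placeholders, not values from the video.

```python
# Minimal sketch: use the OpenAI Python client against a local llama.cpp server
# by overriding base_url (assumes the default http://localhost:8080 endpoint).
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="sk-no-key-required",  # dummy value; the client requires a key even for a local server
)

completion = client.chat.completions.create(
    model="local-model",  # placeholder; the server answers with the model it was launched with
    messages=[{"role": "user", "content": "What is a GGUF file?"}],
)
print(completion.choices[0].message.content)
```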

All Interesting Videos:

Comments:

👏 I'm glad to see you're focusing on DevOps options for AI apps. In my opinion, LlamaCpp will remain the best way to launch a production LLM server. One notable feature is its support for hardware-level concurrency. Using the `-np 4` (or `--parallel 4`) flag allows running 4 slots in parallel, where 4 can be any number of concurrent requests you want.

One thing to remember: the context window is divided across the slots. For example, if you pass `-c 4096` with 4 slots, each slot gets a context size of 1024. Adding the `--n-gpu-layers` (`-ngl 99`) flag offloads the model layers to your GPU for the best performance. So a command with `-c 4096 -np 4 -ngl 99` will give excellent concurrency on a machine with a 4090 GPU.

unclecode
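
Here is a minimal sketch of exercising the parallel slots described in the comment above, assuming the server was launched with something like `llama-server -m model.gguf -c 4096 -np 4 -ngl 99` on the default port 8080 (the model path and exact flags are illustrative, not taken from the video).

```python
# Minimal sketch: send 4 concurrent requests to exercise the parallel slots
# created by `-np 4` (assumes llama-server is running on the default port 8080;
# the model path and flags in the lead-in are illustrative).
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:8080/v1/chat/completions"
PROMPTS = [
    "What is a GGUF file?",
    "What does -ngl 99 do?",
    "Explain the KV cache briefly.",
    "What is quantization?",
]

def ask(prompt: str) -> str:
    r = requests.post(
        URL,
        json={
            "model": "local-model",  # placeholder; the server uses the model it was launched with
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 64,
        },
        timeout=120,
    )
    r.raise_for_status()
    return r.json()["choices"][0]["message"]["content"]

# With -np 4 the four requests are handled concurrently, one per slot,
# each slot working with a 4096 / 4 = 1024-token context.
with ThreadPoolExecutor(max_workers=4) as pool:
    for answer in pool.map(ask, PROMPTS):
        print(answer.strip(), "\n---")
```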

Is there a way to get the response from CPP-CLI directly for local app development?

eduardorivadeneira

Mozilla's Llamafile format is very flexible for deploying LLMs across operating systems. NIM has the advantage of bundling other types of models, like audio or video.

johnkost

Any idea how to use it with the LiteLLM simple proxy? In LiteLLM you need to specify the provider... would that be Ollama?

themaxgo

Can we fine-tune it using LoRA? I need it to be about AI, so I have downloaded data about AI and I want to add it to this model.

thecodingchallengeshow

Bro, I wanna ask: do I need a GPU to run this?

andreawijayakusuma