Open Source RAG Chatbot with Gemma and Langchain | (Deploy LLM on-prem)

In this video, I show how to serve an open-source LLM and embedding model on-prem to build a Retrieval Augmented Generation (RAG) chatbot. Starting from the RAG-GPT chatbot, I swap out the GPT model for *Google Gemma 7B* as the LLM and replace text-embedding-ada-002 with *BAAI/bge-large-en* from Hugging Face. I use Flask to develop a web server that serves the LLM for real-time inference, and I show how to use Postman to develop and debug this kind of project.
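To make the serving setup concrete, here is a minimal sketch of a Flask endpoint wrapping Gemma 7B via Hugging Face transformers. This is not the video's exact code: the `/generate` route, JSON schema, and port are illustrative assumptions, and a 7B model needs a sizeable GPU (or quantization) to load.

```python
# Minimal sketch: a Flask server exposing Gemma 7B for inference.
# Route name, JSON fields, and port are assumptions for illustration.
from flask import Flask, request, jsonify
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

app = Flask(__name__)

MODEL_ID = "google/gemma-7b-it"  # instruction-tuned Gemma; gated on the Hub
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,  # half precision to fit the model on one GPU
    device_map="auto",
)

@app.route("/generate", methods=["POST"])
def generate():
    prompt = request.json["prompt"]
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=512)
    # Slice off the prompt tokens so only the completion is returned
    completion = outputs[0][inputs["input_ids"].shape[1]:]
    return jsonify({"response": tokenizer.decode(completion, skip_special_tokens=True)})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8888)
```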

00:00 Intro
01:22 Demo
03:12 RAG-GPT schema
04:39 RAG-Gemma schema
05:19 Challenges of open-source LLMs for chatbots
07:15 A possible solution for serving LLMs on-prem
08:02 Lost in the middle (a tip for context length; see the sketch after the chapter list)
09:26 Project structure
11:18 How to load and interact with Gemma
13:06 Developing LLM web server with Flask
20:38 Testing and debugging the LLM web server with Postman
25:00 Testing the RAG chatbot
27:22 GPU usage of this chatbot
27:48 Do we need a web server for the embedding model?
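On the "lost in the middle" tip (08:02): long-context models attend best to the beginning and end of a prompt, so it helps to reorder retrieved chunks so the most relevant ones sit at the edges. A minimal sketch using LangChain's built-in transformer (the sample documents are placeholders for real retriever output):

```python
from langchain_core.documents import Document
from langchain_community.document_transformers import LongContextReorder

# Stand-in for retriever output, sorted most-relevant-first
docs = [Document(page_content=f"chunk {i}") for i in range(6)]

# Pushes the highest-ranked chunks to the start and end of the list,
# leaving the least relevant ones "lost in the middle"
reordered = LongContextReorder().transform_documents(docs)
print([d.page_content for d in reordered])
```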

🎓 *Models that are used in this chatbot:*

📚 *Extra Resources:*

#opensource #llm #rag #chatbot #huggingface #google #gemma #gradio #langchain #Flask #python #GUI #postman
Comments

I have a question. Once you start using this app and uploading documents with the 'upload PDF or doc file' button, do the documents stay in the app's data so you can use them whenever you want, or will they be deleted? (Sorry if you answered this in the video; I probably missed it.)

NinVibe

Just another fantastic video! Thanks Farzad

omidsa

Hi. Thanks for your detailed, step-by-step walkthrough of the code. I'm enjoying and learning at the same time. 🙏

ginisksam

Great stuff. Instead of using PDFs, can you do a tutorial on using a large CSV file for RAG?

horyekhunley

Thanks for your efforts. I really like your explanation style. I have a question that really bugs me :) How do you control the LLM's response and make sure it comes from the RAG context? Are there any specific techniques for that? Also, how do you give the LLM feedback about its responses? I can see you showing thumbs up and down here; where is this feedback saved, and how does it inform the LLM? Sorry if my question is basic (maybe for others), but for me these points are very important to understand. Thank you again.

zfurlho
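On the grounding question above: a common technique (not necessarily the one used in this video) is to constrain the model with a prompt that forbids answering from outside the retrieved context. As for thumbs up/down, that kind of feedback is typically just logged to a file or database for later analysis or fine-tuning rather than fed back to the model live. A sketch of such a grounding prompt, with hypothetical names:

```python
# Hypothetical grounding prompt; the refusal instruction keeps answers tied
# to the retrieved chunks instead of the model's own knowledge
GROUNDED_PROMPT = """Answer the user's question using ONLY the context below.
If the answer is not in the context, reply exactly:
"I don't know based on the provided documents."

Context:
{context}

Question: {question}
Answer:"""

def build_prompt(retrieved_chunks: list[str], question: str) -> str:
    # Join the retriever's chunks into one context block
    return GROUNDED_PROMPT.format(
        context="\n\n".join(retrieved_chunks), question=question
    )

print(build_prompt(["The notice period is 30 days."], "What is the notice period?"))
```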

This is great, but I think it would be better if you did your projects on Colab.

tk-ttbw

This is awesome, thank you! One thing though: by now almost everyone has said that Gemma is a weak model, and many experiments on YouTube show it underperforming. How about using a different open LLM? Can you experiment with LM Studio?

Mike-Denver

21:22 Where did you get that URL for the POST request? I don't see any URL like that in the output.

KhanhLe-puwx
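On the URL question above: the address Postman hits is not printed in the model's output; it is the host and port passed to `app.run()` plus the path in the `@app.route` decorator. The equivalent request in Python, assuming the illustrative `/generate` endpoint and port from the sketch near the top of this page:

```python
import requests

# Hypothetical endpoint/port matching the Flask sketch earlier on this page
resp = requests.post(
    "http://localhost:8888/generate",
    json={"prompt": "Summarize the uploaded document."},
    timeout=120,  # generation on a 7B model can take a while
)
print(resp.json()["response"])
```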

I appreciate this example of a local LLM RAG deployment as well as your attention to detail. I wish every presenter was this thorough!

I have the Gradio UI working for queries against the documents that I ingested in the manual step. However, when I attempt to upload PDFs from the interface, I get an "Error" message on the screen. When I look at the app.py terminal, the error is: "TypeError: isfile: path should be string, bytes, os.PathLike or integer, not _TemporaryFileWrapper"

Any idea what might be going wrong with the filepath and what needs to be changed?

FYI, I had to pin Gradio to version 3.48.0 to overcome the error "cannot import name 'RootModel' from 'pydantic'"; perhaps this caused the problem above?

doctorbill
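On the `_TemporaryFileWrapper` error above: in Gradio 3.x, upload components pass your callback a tempfile wrapper object rather than a path string, so any `os.path` call on it fails; unwrapping the `.name` attribute is the usual fix (and the pinned Gradio version plausibly explains why the behavior differs from the video). A sketch of the pattern, with a stand-in ingestion helper:

```python
import os

def ingest_pdf(path: str) -> None:
    print(f"ingesting {path}")  # stand-in for the project's real ingestion step

def process_uploaded_files(files):
    for f in files:
        # Gradio 3.x passes tempfile._TemporaryFileWrapper objects, not path
        # strings; the real path on disk lives in the .name attribute
        path = f if isinstance(f, str) else f.name
        if os.path.isfile(path):
            ingest_pdf(path)
```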

Hi Farzad, how are you?
I'm testing RAG platforms. Currently I'm using privateGPT + Ollama (Mistral Instruct fp16 v0.2) + bge-m3 + bge-reranker-large, and I have a 3090. Inference is super fast, but the results are not satisfying.

My question is: what's the best RAG platform right now if you want to go fully open source? Also, please try your RAG with more complex PDF files, not easy text files like a story.

saeednsp

Hey! Thanks for the lovely videos.
Just a question: how do I run this without a GPU? It's asking me for an NVIDIA driver, which is not installed on my system, and I don't intend to install it either. Is there any workaround you can suggest?

ashwinisivanandan

Hi, I have a question. I already implemented RAG with Mistral Dolphin 7B and tested some advanced RAG techniques like ensemble, parent-child, and multi-query retrieval. I don't have a GPU, so I run my LLM on an LM Studio server. In your video you say we need a GPU; can we also use a CPU and build the same interface and project? I also want to deploy my app on a server. In your video you use Flask to deploy the app, but only locally; what do I have to do to deploy it on a server? I also came across the Heroku platform for deploying applications with a GPU. I am confused: can I build the application like you did on my local system without a GPU for testing, and then deploy it on a server?

AliFarooq-ygfn
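On the GPU questions above: a 7B model can run on CPU, just slowly, and quantized GGUF builds served through llama.cpp-style runtimes such as LM Studio are usually the more practical CPU path. A sketch of a plain CPU load in transformers (expect on the order of 28 GB of RAM at float32):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b-it")
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-7b-it",
    torch_dtype=torch.float32,  # CPUs generally lack fast bfloat16 kernels
)
# With no device_map and no .to("cuda"), transformers keeps the model on CPU

inputs = tokenizer("What is RAG?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```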

These videos are amazing! Can you release a version with the document-summary task added back, subject to the LLM model? Thanks!

musumo

Are there any good open-source embedding models? If someone wants to keep their data private, won't ada require you to send your data to OpenAI?

stbrfex
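On the embedding question above: yes, text-embedding-ada-002 requires sending your text to OpenAI's API, which is exactly why this video swaps in BAAI/bge-large-en, which runs fully locally. A sketch of using it through LangChain's Hugging Face wrapper (requires the sentence-transformers package):

```python
from langchain_community.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="BAAI/bge-large-en",
    encode_kwargs={"normalize_embeddings": True},  # recommended for bge models
)
vector = embeddings.embed_query("What does the contract say about termination?")
print(len(vector))  # bge-large produces 1024-dimensional vectors
```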

Hi,
How can open-source tools and frameworks be used to evaluate the performance of a Retrieval-Augmented Generation (RAG) system that integrates large language models (LLMs) like Google Gemma?

TooyAshy-
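On the evaluation question above: one open-source option is Ragas, which scores a RAG pipeline on metrics like faithfulness, answer relevancy, and context precision. A sketch with placeholder data; note that Ragas column names and defaults shift between versions, and it uses an OpenAI judge model by default, so a fully open-source setup means configuring a local LLM and embeddings for `evaluate()`:

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

# Placeholder evaluation set: each row pairs a question with the chatbot's
# answer, the chunks it retrieved, and a reference answer
data = {
    "question": ["What is the notice period?"],
    "answer": ["The notice period is 30 days."],
    "contexts": [["Either party may terminate with 30 days written notice."]],
    "ground_truth": ["30 days"],
}

result = evaluate(
    Dataset.from_dict(data),
    metrics=[faithfulness, answer_relevancy, context_precision],
)
print(result)  # dict-like mapping of metric name to score
```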