Local RAG with llama.cpp

Comments

I can already tell from the first few seconds, this guy knows his stuff and explains it really well! Thank you

LondonSoundDimension

Thank you Mark! I was today years old when I learned my M2 mini/M3 Air Macs indeed have the capacity for the Metal option. No wonder my queries drained the machine of all its RAM without generating a response anyway!
I haven't finished this one yet (had to pause to write down that Metal epiphany), but I'm definitely going to watch more of your stuff as I'm just starting out making tech videos too. Love seeing what others do!

madsciai

Thanks again for another fantastic video! Quick question: do you know the best way to format prompts when running the llama.cpp server with the chat_format chatml parameter for RAG? I'm hosting the server myself and using the OpenAI client to create completions, so everything runs locally on my machine. My current setup has a system role that includes a system prompt and the relevant context, and a user role with only the query. However, sometimes the model just returns the system prompt instead of answering the question. Any ideas why that happens or how to fix it? Thanks a lot in advance!
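
A minimal sketch of one way to lay out the messages for this kind of setup, assuming llama-cpp-python's OpenAI-compatible server started with the chatml chat format; the base_url, model name, retrieved_context and question below are placeholders, not values from the video. If the model echoes the system prompt, one thing worth trying is keeping only the instructions in the system role and moving the retrieved context into the user message alongside the question.

```python
# Hypothetical client-side sketch: querying a local llama.cpp / llama-cpp-python
# server (started with chat_format=chatml) through the OpenAI client library.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")  # placeholder URL

retrieved_context = "...text returned by your retriever..."  # placeholder
question = "What does the document say about X?"             # placeholder

response = client.chat.completions.create(
    model="local-model",  # local servers typically ignore or alias this name
    messages=[
        # Instructions plus the retrieved context in the system role...
        {"role": "system",
         "content": "You are a helpful assistant. Answer only from the context below.\n\n"
                    f"Context:\n{retrieved_context}"},
        # ...and just the question in the user role.
        {"role": "user", "content": question},
    ],
    temperature=0.2,
)
print(response.choices[0].message.content)
```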


Hi Mark, I don't understand why you're chunking the documents with your "chunk" function. Can't we just feed all 247 documents to the LLM to create the embeddings? Something like:
document_embeddings = llm.create_embedding([item.page_content for item in documents])
We get the embedding back for each document and that's it. Am I missing something?
Or are you doing it just to have 3 batches (100, 100 and 47) and embed them in parallel?
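
For what it's worth, a "chunk" function in this context is usually just a batching helper; the sketch below is a hypothetical stand-in, not the function from the video, and it assumes llm is a llama_cpp.Llama created with embedding=True, that load_documents is a placeholder loader, and that each document exposes a .page_content attribute. A single create_embedding call over all 247 texts also works; batching mostly keeps each call to a manageable size.

```python
from llama_cpp import Llama

# Placeholders standing in for the earlier steps of the tutorial.
llm = Llama(model_path="path/to/embedding-model.gguf", embedding=True, verbose=False)
documents = load_documents()  # hypothetical loader; items expose .page_content

def chunk(items, batch_size=100):
    """Yield successive batches of at most batch_size items (100, 100, 47 for 247 docs)."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

texts = [item.page_content for item in documents]

document_embeddings = []
for batch in chunk(texts, batch_size=100):
    result = llm.create_embedding(batch)  # one embedding call per batch
    document_embeddings.extend(d["embedding"] for d in result["data"])
```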

Sendero-ypgi

Thank you Mark, great video! I learnt a lot. May I ask one silly question? I can see you used a simpler embedding model to extract the embeddings, but later on, when you switch to the Llama 3 model, I didn't understand why the same embeddings are reused rather than regenerated by the bigger and better Llama 3 model. Does it mean that "search" is not the difficult part in RAG, but compiling the final answer is?
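
A minimal sketch of the split being asked about, assuming llama-cpp-python; the model paths, query, and retrieval step are placeholders. The embedding model is only used to turn text into vectors for the similarity search, while the generation model only ever sees the retrieved text and the question, so switching the generator to Llama 3 doesn't by itself require re-embedding (though a stronger embedding model can still improve what gets retrieved).

```python
from llama_cpp import Llama

# Small model used only for embeddings / similarity search (placeholder path).
embedder = Llama(model_path="path/to/small-embedding-model.gguf",
                 embedding=True, verbose=False)

# Larger model used only to write the final answer (placeholder path).
generator = Llama(model_path="path/to/llama-3-instruct.gguf",
                  n_ctx=4096, verbose=False)

query = "What does the report say about Q3 revenue?"  # placeholder
query_vector = embedder.create_embedding(query)["data"][0]["embedding"]

# ...compare query_vector against the stored document vectors and pick the closest chunks...
retrieved_context = "the most similar chunks, concatenated"  # placeholder

answer = generator(
    f"Answer using only this context:\n{retrieved_context}\n\nQuestion: {query}\nAnswer:",
    max_tokens=256,
)
print(answer["choices"][0]["text"])
```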

MrWerewolf

Hi Mark, I can't find your chunk function on the GitHub page you mentioned in the description. Could you help me with that? Sorry, I'm new to all this, so this might be a silly ask at the moment. Thanks a lot!

MuhammadZubair-flwd

Hi Mark, great tutorial. I have been playing around a bit and tried to use my already existing ChromaDB as a retriever. Unfortunately, simply pointing the retriever at my existing DB did not work: I received "ValueError: Requested tokens (941) exceed context window of 512". Do you happen to know how to expand the context window, or how to fix this otherwise?
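
For reference, that error usually comes from the model being loaded with the default 512-token context; a minimal sketch of raising it with llama-cpp-python is below (the model path is a placeholder, and the model itself has to support the larger window). Retrieving fewer or shorter chunks also helps stay under the limit.

```python
from llama_cpp import Llama

llm = Llama(
    model_path="path/to/model.gguf",  # placeholder path
    n_ctx=4096,        # raise the context window from the 512-token default
    n_gpu_layers=-1,   # optional: offload layers to Metal/GPU if available
)
```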

inf-co

Hi Mark, thanks for sharing! How did you choose the embedding_llm? Is there a best practice or a guideline on how to choose it? I'm testing and I was wondering what I should use for embedding... any help would be greatly appreciated!

ElisaPiccin

Hi Mark, how much quicker should inference be when setting n_gpu_layers = 1? I am on a Mac M1 Pro with a 16GB GPU, and if I set n_gpu_layers = 1 it is actually slower than not using it. Do you have an explanation for that, or a way to check what is happening? Cheers!
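
One thing worth checking: depending on the llama.cpp build, n_gpu_layers = 1 offloads only a single layer, so most of the work still runs on the CPU and the extra copying (plus memory pressure on a 16 GB machine) can make it slower overall. A minimal sketch, assuming llama-cpp-python with a Metal build; the model path is a placeholder.

```python
from llama_cpp import Llama

llm = Llama(
    model_path="path/to/model.gguf",  # placeholder path
    n_gpu_layers=-1,   # offload every layer rather than just one
    verbose=True,      # prints Metal/offload messages so you can confirm the GPU is used
)
```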

Sendero-ypgi

Do you know what the "correct" way is to prompt the GGUF? I have been using Llama 3 through a GGUF, Ollama, and ChatOllama, and I feel like the GGUF gives less lively answers than the Ollama versions. Do you know why this is happening? Do I need to configure more or change the prompt?
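
One common cause is the raw GGUF being prompted without the Llama 3 chat template, while Ollama applies the template (and its own default sampling settings) automatically. A minimal sketch, assuming a recent llama-cpp-python that registers the "llama-3" chat format; the model path and messages are placeholders.

```python
from llama_cpp import Llama

llm = Llama(
    model_path="path/to/Meta-Llama-3-8B-Instruct.gguf",  # placeholder path
    chat_format="llama-3",  # apply the Llama 3 chat template to the messages
    n_ctx=4096,
)

out = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain RAG in one paragraph."},
    ],
    temperature=0.7,
)
print(out["choices"][0]["message"]["content"])
```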

inf-co