Multi-modal RAG: Chat with Docs containing Images

Learn how to build a multimodal RAG system using the CLIP model.
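
The core idea is to embed text and images with the same CLIP model so that one query can retrieve both modalities. A minimal sketch, assuming the Hugging Face transformers implementation of CLIP; the checkpoint name and image path are illustrative, not necessarily what the video uses:

# Embed a text chunk and an image into the same CLIP vector space.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

text_inputs = processor(text=["a diagram of the system architecture"],
                        return_tensors="pt", padding=True)
image_inputs = processor(images=Image.open("figure1.png"), return_tensors="pt")

with torch.no_grad():
    text_emb = model.get_text_features(**text_inputs)     # shape (1, 512)
    image_emb = model.get_image_features(**image_inputs)  # shape (1, 512)

# Cosine similarity in the shared space lets a text query score images too.
score = torch.cosine_similarity(text_emb, image_emb)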

LINKS:
Flow charts in the paper:

💻 RAG Beyond Basics Course:

Let's Connect:

Sign up for the newsletter, localgpt:

00:00 Introduction to Multimodal RAG Systems
01:24 First Approach: Unified Vector Space
02:23 Second Approach: Grounding Modalities to Text
03:57 Third Approach: Separate Vector Stores
06:26 Code Implementation: Setting Up
09:05 Code Implementation: Downloading Data
11:13 Code Implementation: Creating Vector Stores
14:00 Querying the Vector Store
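
For the third approach (separate text and image vector stores queried together), here is a minimal sketch assuming LlamaIndex with a local Qdrant instance, as in the video; the collection names and data directory are illustrative:

# Build a multimodal index backed by two Qdrant collections.
import qdrant_client
from llama_index.core import SimpleDirectoryReader, StorageContext
from llama_index.core.indices import MultiModalVectorStoreIndex
from llama_index.vector_stores.qdrant import QdrantVectorStore

client = qdrant_client.QdrantClient(path="qdrant_db")  # local on-disk Qdrant

text_store = QdrantVectorStore(client=client, collection_name="text_collection")
image_store = QdrantVectorStore(client=client, collection_name="image_collection")
storage_context = StorageContext.from_defaults(vector_store=text_store,
                                               image_store=image_store)

# Text chunks go to the text store, images (embedded with CLIP) to the image store.
documents = SimpleDirectoryReader("./data").load_data()
index = MultiModalVectorStoreIndex.from_documents(documents,
                                                  storage_context=storage_context)

# One query retrieves top text chunks and top images in a single call.
retriever = index.as_retriever(similarity_top_k=3, image_similarity_top_k=3)
results = retriever.retrieve("What does the flow chart in the paper show?")

Keeping the two collections separate lets each modality use its natural embedding model instead of forcing all content through CLIP's limited text encoder.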

All Interesting Videos:

Comments

This is the best AI channel out there, PERIOD. Thanks for sharing your knowledge.

rubencabrera

A nice open-source and self-hosted version would be great.

ilaydelrey

Keep going with this approach, it is something I have been struggling with.

aerotheory

Such insightful information. Eagerly waiting for more multimodal approaches.

AI-Teamone

Thanks, is there a video of the same project but with LangChain instead of LlamaIndex?

b.lem.

I appreciate your effort. Please create one on fine-tuning the model for efficient retrieval, if possible with LangChain.

ai-touch

My use case is to extract the relevant text along with the images available in a file using generative AI: for any given prompt, the relevant text and image should be displayed in the response.

AyishaAshraf-sf

Very nice video, but if you could do it with an open-source embedding model, that would be very cool. Thank you for the video.

legendchdou

Hi, your videos are very helpful, thank you.

ArdeniusYT

What about doing the same but using Llama 3 or a smaller local LLM?

Technmanac

Can you please dive deeper into why Qdrant was used, and into the limitations of other vector DBs for storing both text and image embeddings? Thanks.

vinayakaholla

Thanks, your videos are very helpful. I have several gigabytes of PDF ebooks that I would like to process with RAG. Which approach do you think would be best, this one or GraphRAG? In my case I'm looking only at local models, as the costs would otherwise be very high. What if I converted all PDF pages into images first, processed them with a local model like Phi-3 Vision, and then ran the output through GraphRAG; would that work?

BACA

Need to do it all in open source. No API keys.

ScottzPlaylists

Can you make it using completely open-source models?

avinashnair

Out of interest, what is the application called that you used to illustrate the flows (2:53 in the video)? Thanks.

BarryMarkGee

Do you think all of this is now replaced by Gemini?

RedCloudServices

Is it better than GraphRAG? How does the output quality compare to it?

codelucky

Can we do this method using LangChain?

amanharis

It is essential to conduct thorough preprocessing of the documents before ingesting them into the RAG system. This involves extracting the text, tables, and images, and processing the latter through a vision module. Additionally, it is crucial to maintain content coherence by ensuring that references to tables and images are correctly preserved in the text. Only after this processing should the documents be passed to an LLM. A sketch of such a pass follows below.

ignaciopincheira
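
A minimal sketch of the preprocessing pass the comment above describes, assuming PyMuPDF (fitz) for extraction; caption_image is a hypothetical stand-in for whatever vision module produces text descriptions of the images:

# Extract text and images from a PDF, grounding images to text via captions.
import fitz  # PyMuPDF

def caption_image(image_bytes: bytes) -> str:
    # Hypothetical: call a vision model (e.g. a local VLM) and return a caption.
    raise NotImplementedError

doc = fitz.open("report.pdf")
chunks = []
for page in doc:
    chunks.append(page.get_text())  # plain text; tables come out flattened
    for img in page.get_images(full=True):
        xref = img[0]
        image_bytes = doc.extract_image(xref)["image"]
        # Caption the image so it can live in the same text index as the prose.
        chunks.append(caption_image(image_bytes))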

What if the user query contains text + an image?

cristiantironi