LlamaIndex Workshop: Multimodal + Advanced RAG Workshop with Gemini

The Google Gemini release included both exciting multimodal capabilities and semantic retrieval.

In this workshop, we cover two cool LLM + RAG use cases with Google Gemini:

1️⃣ Multi-modal RAG: Use the Gemini model to extract structured outputs from images. Then learn how to index these texts + images and build a QA system over them (also using Gemini).

2️⃣ Advanced RAG: Learn how to use the brand-new Semantic Retrieval API. You can decompose it into different components - custom embedding-based retrieval and custom response synthesis.

We had the pleasure of co-hosting this with folks from the Google Labs team (Cher Hu, Lawrence Tsang, Michael Chen).

Timeline:
00:00-27:20 Advanced RAG
27:20-52:59 Multimodal
Comments

When we create a simple Google index in the first simple use case, in which Google region is the index created?

ramih

### Summary:

In this special edition of the LlamaIndex webinar series, the focus was on presenting multimodal and advanced retrieval-augmented generation (RAG) use cases built on Google's API offerings, specifically Google Gemini and LlamaIndex. The session provided insights into semantic retrieval and how to build an advanced RAG pipeline with LlamaIndex components, followed by a workshop on building multimodal use cases with Google Gemini and LlamaIndex.

#### Part 1: Advanced RAG with Llama Index and Google Gemini

**Presenters:** Lawrence, Michael, and Cher from Google Labs

The presentation covered RAG use cases for both novice and advanced users, including:
- A simple RAG pattern introduction for context setting.
- Google's developer RAG offerings.
- Advanced techniques for customizing use cases and improving quality.
- A demonstration of the RAG process.

**Simple RAG Pattern:**
- Ingestion phase with embeddings and Vector store.
- Retrieval step with user query and Vector store.
- Response synthesis with an LLM to arrive at an answer.
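
The three steps above can be sketched in miniature. This is an illustrative toy, not the actual workshop code: the bag-of-characters `embed` function and the two-document corpus are hypothetical stand-ins for a real embedding model and a managed vector store.

```python
import math

def embed(text: str) -> list[float]:
    # Toy bag-of-characters embedding, normalized to unit length;
    # a real pipeline would call an embedding model here.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    # Vectors are unit-normalized, so the dot product is cosine similarity.
    return sum(x * y for x, y in zip(a, b))

# Ingestion phase: chunk documents and store (embedding, text) pairs.
corpus = ["Gemini supports multimodal input", "RAG retrieves relevant context"]
store = [(embed(chunk), chunk) for chunk in corpus]

# Retrieval step: embed the user query and rank chunks by similarity.
def retrieve(query: str, k: int = 1) -> list[str]:
    q = embed(query)
    ranked = sorted(store, key=lambda entry: cosine(q, entry[0]), reverse=True)
    return [text for _, text in ranked[:k]]

# Response synthesis would then pass the query plus retrieve(query)
# to an LLM to produce the final answer.
print(retrieve("what does RAG retrieve?"))
```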

**Google's Offerings:**
- Google Vector store - a managed vector database with embeddings, designed for simplicity, flexibility, and production readiness. It's optimized for small corpora of up to 1 million chunks.
- AQA (Attributed Question Answering) model - provides grounded answers, attributions, an answerability probability, answer styles, and safety settings.
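
The answerability probability is the interesting knob here: it lets an application refuse to answer when the retrieved documents don't support one. The sketch below is a hedged illustration of that gating logic; the response dictionary shape, field names, and the 0.5 threshold are assumptions for demonstration, not the actual API schema.

```python
def grounded_answer(response: dict, threshold: float = 0.5) -> str:
    # Only surface the answer when the model judges it grounded in the
    # retrieved passages; otherwise fall back gracefully instead of
    # hallucinating.
    if response.get("answerable_probability", 0.0) >= threshold:
        cites = ", ".join(response.get("attributions", []))
        return f"{response['answer']} (sources: {cites})"
    return "I could not find an answer in the provided documents."

confident = {"answer": "Gemini was announced in December 2023.",
             "answerable_probability": 0.92,
             "attributions": ["announcement-post.md"]}
unsure = {"answer": "Maybe 42?", "answerable_probability": 0.12}

print(grounded_answer(confident))
print(grounded_answer(unsure))
```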

**Advanced Techniques:**
- Breaking down complex queries into focused sub-questions for better retrieval.
- Re-ranking to refine the retrieval process by comparing textual content in the question and retrieved documents.
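
The re-ranking idea can be pictured with a minimal lexical-overlap scorer. Real systems use an LLM or cross-encoder to compare question and document text; the word-overlap `score` and the sample chunks below are simplifications invented for illustration.

```python
def rerank(question: str, chunks: list[str]) -> list[str]:
    # Score each retrieved chunk by how many question terms it shares,
    # then reorder so the most relevant chunk comes first.
    q_terms = set(question.lower().split())

    def score(chunk: str) -> int:
        return len(q_terms & set(chunk.lower().split()))

    return sorted(chunks, key=score, reverse=True)

chunks = ["pricing for the enterprise tier",
          "how retrieval works in rag pipelines",
          "company holiday schedule"]
print(rerank("how does retrieval work in rag", chunks))
```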

**Demonstration:**
- A live demo showed how Google's AQA model and Llama index can be used to answer complex questions and handle cases where an answer is not available in the provided documents.

#### Part 2: Multimodal RAG with Google Gemini and Llama Index

**Presenters:** Jerry and Howan from LlamaIndex

This section focused on leveraging multimodal data (text and images) to enhance RAG use cases. The presenters discussed the integration of the Gemini Pro Vision model with LlamaIndex, which supports text and image inputs to generate text outputs.

**Multimodal RAG:**
- Indexing both text and images.
- Retrieving relevant information using queries that include text and/or images.
- Re-ranking and synthesizing responses that incorporate multimodal data.

**Image Indexing:**
- Extracting structured text from images using a multimodal model.
- Generating image embeddings and storing them in a vector store.

**Multimodal Retrieval and Generation:**
- Retrieving and synthesizing responses based on text and image inputs.
- Using structured data extraction to create structured metadata from images.
- Leveraging this structured output to build a knowledge base for RAG.
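
The structured-output step can be illustrated by flattening extracted records into text chunks ready for indexing. The restaurant dictionaries below are hypothetical examples of the kind of metadata a multimodal model might extract from Maps screenshots, not data from the actual demo.

```python
restaurants = [
    {"name": "Trattoria Roma", "rating": 4.5, "cuisine": "Italian",
     "nearby": ["Colosseum"]},
    {"name": "Sushi Kai", "rating": 4.8, "cuisine": "Japanese",
     "nearby": ["City Museum"]},
]

def to_document(meta: dict) -> str:
    # Flatten structured fields into a retrievable text chunk; a real
    # pipeline would also attach the fields as filterable metadata on
    # the indexed node.
    nearby = ", ".join(meta["nearby"])
    return (f"{meta['name']} serves {meta['cuisine']} food, "
            f"rated {meta['rating']}. Nearby: {nearby}.")

docs = [to_document(r) for r in restaurants]
print(docs[0])
```

These text chunks would then be embedded and indexed exactly like any other document in the RAG knowledge base.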

**Demonstration:**
- A case study showed how Google Maps screenshots of restaurants were used to extract structured metadata, which was then indexed and used to answer queries about restaurant recommendations, including nearby tourist places.

**Final Q&A:**
- The possibility of fine-tuning Gemini for improved capabilities.
- Uncertainty about Gemini's ability to process video and audio.

The webinar ended with encouragement for the audience to provide feedback and explore the shared notebooks.

chaoticblankness

Is there a way to retrieve images from a folder of images using a text query? Using Gemini, not OpenAI.

RuturajHange-kz

Very helpful, thanks, and can you share the code for the first demo?

unclecode

All these techniques work quite well for general content and knowledge. For niche domains, though, problems pop up: the pre-trained encoders lack accuracy and the VQA is not very helpful. Fine-tuning the encoders is mandatory... but here again the curse of labelling is present. Although fine-tuning datasets are smaller than pre-training ones, building them is still a big challenge for many companies. Again and again, the source of progress is the labeled data and the labeling resources, which are now made up of Subject Matter Experts.

Is the Google code already available to developers on Google Cloud?

chrsl

Great content thx. Can you share the slides?

Jmstr-ph