Building long context RAG with RAPTOR from scratch

The rise of long context LLMs and embeddings will change RAG pipeline design. Instead of splitting docs and indexing doc chunks, it will become feasible to index full documents. RAG approaches will need to flexibly answer lower-level questions from single documents or higher-level questions that require information across many documents.

RAPTOR (Sarthi et al.) is one approach to tackle this by building a tree of document summaries: docs are clustered, and each cluster is summarized to capture higher-level information across similar docs.
This is repeated recursively, resulting in a tree of summaries that runs from individual docs as leaves, to intermediate summaries of related docs, to high-level summaries of the full doc collection.
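
A minimal sketch of that recursive loop, with `embed` and `summarize` as placeholder callables; the fixed cluster-count heuristic here stands in for the paper's GMM model selection (RAPTOR picks the number of clusters via BIC), so treat this as an illustration rather than the exact implementation:

```python
# A rough sketch of RAPTOR's recursive cluster-and-summarize loop.
# `embed` and `summarize` are placeholder callables; the cluster-count
# heuristic below stands in for the paper's BIC-based model selection.
import numpy as np
from sklearn.mixture import GaussianMixture

def build_raptor_tree(texts, embed, summarize, max_levels=3):
    """Return {level: [texts]}; level 0 holds the original leaf docs."""
    tree = {0: texts}
    for level in range(1, max_levels + 1):
        docs = tree[level - 1]
        if len(docs) <= 1:                       # collapsed to a root summary
            break
        vectors = np.array([embed(d) for d in docs])
        n_clusters = max(1, len(docs) // 5)      # crude stand-in heuristic
        labels = GaussianMixture(n_components=n_clusters).fit_predict(vectors)
        tree[level] = [
            summarize("\n\n".join(d for d, l in zip(docs, labels) if l == c))
            for c in range(n_clusters)
        ]
    return tree
```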

In this video, we build RAPTOR from scratch and test it on 33 web pages (each ranging from 2k to 12k tokens) of LangChain docs, using the recently released Claude 3 model from Anthropic to build the summarization tree. The pages and the tree of summaries are indexed together for RAG with Claude 3, enabling QA on lower-level questions or higher-level concepts (captured in summaries that span related pages).
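
In outline, the "indexed together" step can look like the sketch below; it assumes the `tree` dict from the previous sketch, and the vectorstore, embedding model, and Claude model name are illustrative choices, not the exact notebook code:

```python
# Sketch: put leaf pages and every tree summary into one vectorstore,
# then answer questions with Claude 3. Module and model names are
# illustrative assumptions, not the video's exact notebook code.
from langchain_anthropic import ChatAnthropic
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

# `tree` is the {level: texts} dict from the tree-building sketch above;
# level 0 holds the leaf pages, higher levels hold the summaries.
all_texts = [t for level_docs in tree.values() for t in level_docs]
vectorstore = Chroma.from_texts(all_texts, OpenAIEmbeddings())
retriever = vectorstore.as_retriever()

llm = ChatAnthropic(model="claude-3-opus-20240229")
question = "How does LangChain Expression Language compose runnables?"
context = "\n\n".join(d.page_content for d in retriever.invoke(question))
answer = llm.invoke(f"Answer from this context:\n\n{context}\n\nQuestion: {question}")
print(answer.content)
```

Because summaries and leaves share one index, retrieval can surface a high-level summary for broad questions or a single page for narrow ones, with no routing logic needed.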

This idea can scale to large collections of documents or to documents of arbitrary size (up to the embedding / LLM context window).

Code:

Paper:
Comments

Lance is killing it with these videos. Keep it up!

maxi-g

That was so useful. Thanks! I'd love to see more advanced techniques like that.

danielschoenbohm

Excellent approach and very well explained.

One challenge that comes to mind with this summarisation hierarchy is maintaining it as the source content changes or is revised. I am thinking of scenarios where there are hundreds of millions of documents to index.

johnnydubrovnic

I think this approach is very interesting, and was very well presented, thank you for the video.
One thing, though: this works when we have a "closed" context, so we know we will query ONLY these 31 pages, let's say.
If we are in an environment where this is dynamic, the clustering approach might not work so well.
When we add more documents, we would have to run the clustering again, not simply load the model and predict the cluster, because new documents might get added that contain completely new information. This becomes a problem when scaling this up, both in terms of time spent and the cost of running the summarization again.

cnmoro

Fantastic video. Thanks heaps for the content. It really feels like you could present a series of these talks. I want to learn more about the implementation of some of these ideas.

MrPlatinum

Hilarious, I just came up with this idea a few months ago for a project. It really makes me think I should just get into doing research in this field, since over the last few years my ideas seem to keep becoming common concepts. 😊 Such a cool field

Novacasa

This is great. Long context is a tool for a specific use case. Until the cost and latency of long context match RAG, RAG will be what most apps use.

jaysonp

First, I want to mention I like your explanations/videos. Thanks for your great work.

On this occasion I was blocked (but I will solve that) by the following:

1. Claude is not available in some regions (like mine, Belgium) - I'm on the waiting list.
2. I tried GPT-4 as an alternative, but I forgot that you must put money on the account (I still have most of the $5 free test credit, but that's limited to GPT-3.5).

isa-bv

Some of the readers have commented that we would need to run the entire clustering algorithm again if we get a new set of documents, or need it to be dynamic.

I DO NOT think we need to do this. Here is why:

Lance (the speaker) shows how the documents are clustered recursively until the process reaches n clusters or a single cluster.
So let us say there are 10,000 clusters and the new documents impact only 4 clusters (see 06:33, where he talks about the Gaussian Mixture Model - AFAIK, this means a point can belong to multiple clusters). Then we have two cases:
1. No new clusters are created: only those 4 clusters have to be rebuilt, and their changes need to be propagated up through the chain to the root node, right? We continue to have 10,000 clusters.

2. Let us say it ends up expanding the number of clusters from 4 to 6. Then only the impacted clusters have to be rebuilt from that point to the root cluster. We will now have 10,002 clusters.

If this is true, we do not need to rebuild everything, only the clusters that get impacted. It's like rebalancing the tree.
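
One way to make this concrete (a hypothetical sketch, not something shown in the video): keep the fitted mixture model around and use its soft assignments to find which clusters a new document touches, then re-summarize only those and their ancestors.

```python
# Hypothetical sketch of "rebalance only the impacted clusters":
# soft-assign a new document with the already-fitted GMM and collect
# the clusters whose membership probability crosses a threshold.
import numpy as np

def impacted_clusters(gmm, embed, new_doc, threshold=0.1):
    """Return indices of clusters the new doc meaningfully belongs to
    (a GMM gives soft assignments, so several clusters may qualify)."""
    vec = np.asarray(embed(new_doc)).reshape(1, -1)
    probs = gmm.predict_proba(vec)[0]
    return [i for i, p in enumerate(probs) if p >= threshold]

# Only these clusters, and their ancestor summaries up to the root,
# would need to be re-summarized when new_doc is added.
```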

HealthLucid

F yes, it's Lance from LangChain again. It is going to be a good day.

torrence-carmichael

Thank you for your awesome presentation :)

Armenian-abrank

This approach and implementation are amazing for alleviating the 3 issues you mentioned, thanks! One query though: have you checked the accuracy of the output against putting the entire content into a single prompt with a long context LLM?

paraconscious

Hey, I've got an issue: what if the sum of a cluster's documents exceeds the maximum token limit of the summarization chain?

YueNan

An enhancement here would be to have it expand the summarized nodes into the original nodes.

StephenRayner

One key question about your approach is how to define the summary so that it offers adequate information to be used in RAG. If the summary does not include some minor information points, it would be impossible for RAG to identify the document as relevant based solely on the summary. And moreover, what if the document itself contains too much scattered info and is hard to summarize? The approach would run into many problems there. I do believe in using this approach for many docs, but it does have some prerequisites...

perrygoldman

Indeed an interesting approach that is not limited by the context length of the LLM. I have some remarks: a) Is choosing the threshold not the same as choosing the K parameter of KNN? (Can a Kohonen map not be used? It's also unsupervised clustering...) b) Don't you get a performance impact retrieving from a long embedded text and also from the summarization clusters? c) As already pointed out in some of the comments: how to update efficiently when adding new docs? (Of course you can do it, for example, by using a copy of the vectorstore, doing the update, and switching over when done.) d) Have you tested the results using the "standard" method without summarization against this "RAPTOR" method, timing the inference of both?
BTW: using long context is NOT very cost effective if you are using the big commercial AI companies.

henkhbit

So in the example you're adding a batch of 30 pages, and they're clustered and summarized. What happens when you add another batch, or even just one extra doc? Is it added to an existing cluster and summary, or does it become a new cluster summary?

jeffsteyn

7:56 What does it mean to "embed" the document?

HashimWarren

If a higher-level summary is being used as the context during generation, how would one go about providing references? Specifically, in use cases where answers have to be 100% factual and references are necessary for transparency. Thanks!

ShoiebAhmedChowdhury

Will you make videos about RAG with PDFs (containing not only text but also tables and images)? That would be a very helpful video for me. Thank you for the great work!

anhvunguyen