The 5 Levels Of Text Splitting For Retrieval


Greg’s Info:

Outline:
0:00 - Intro
3:42 - Theory
6:57 - Level 1: Character Split
16:04 - Level 2: Recursive Character Split
20:59 - Level 3: Document Specific Splitting
32:10 - Level 4: Semantic Splitting (With Embeddings)
48:02 - Level 5: Agentic Splitting
1:02:47 - Bonus Level: Alternative Representation
Comments

Both LangChain and LlamaIndex have added Semantic Chunking (level 4) to their libraries.

DataIndependent
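
For readers who want a feel for what level 4 does, here is a minimal sketch of embedding-based splitting: compare each sentence with the next one and start a new chunk where they differ too much. It is an illustration only, not the implementation in the video, LangChain, or LlamaIndex. The `bag_of_words` function is a toy stand-in for a real embedding model, and `semantic_chunks` and its `max_distance` threshold are hypothetical names chosen for this example.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(text, max_distance=0.9):
    """Start a new chunk whenever two neighbouring sentences are too dissimilar.

    max_distance is an assumed cutoff on (1 - cosine similarity); a real
    pipeline would tune it, for example as a percentile of all distances.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        distance = 1.0 - cosine(bag_of_words(prev), bag_of_words(cur))
        if distance > max_distance:
            chunks.append([cur])      # topic shift: open a new chunk
        else:
            chunks[-1].append(cur)    # same topic: extend the current chunk
    return [" ".join(chunk) for chunk in chunks]


text = (
    "Cats purr when happy. Cats sleep a lot. "
    "Stocks fell on Monday. Stocks rose on Tuesday."
)
result = semantic_chunks(text)
```

With a real embedding model the same loop applies; only `bag_of_words` and the threshold would change.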

First video I came across that actually explains LangChain in detail, so that a layman can understand how it actually works.

AshWickramasinghe

Why did YouTube take so long to recommend me this channel? Incredible work!

stavroskyriakidis

With the continuous influx of short form content, props to you for making this so interesting to watch. Didn't even realise it was an hour long. Loved every second of it. Thanks!

adityasankhla

Hi Greg, many thanks for the work you put into this and for helping all of us learn. Great clarity, depth and tempo! 💪

artislove

Wow! I hadn't even thought about Agentic Chunking! I need to try this. I did some extensive experimentation with chunking on a project at work for a clinical knowledge base and I found that chunking strategies can make the difference between an ok retrieval and an awesome retrieval that works across a higher percentage of queries.

NadaaTaiyab

This is an insanely detailed, from-first-principles tutorial. Thank you for taking the time to put this together.

stonedizzleful

Thanks for this Greg. I've been looking at agentic chunking for a while and this video really helped me with implementation. Not heard of you before I searched but now subbed. Thanks a lot :)

truthwillout

Nice vid, Greg! You're on the cutting edge with some of these splitting techniques. Well done. 😎

andreyseas

I thought the explanation and showing your experimentation for semantic splitting was creative. Thank you very much.

kenchang

Man, it took me 3 weeks to find you. Thank you, please keep them coming.

JoanApita

What the ___. How good can a tutorial be? Such a gem of a video. Thx for making this. New to ML and found this very helpful.

drakongames

Amazing!! I am fascinated by how document specific splitting or the bonus level also ties in with how we structure our data schema. E.g. extracting metadata like "Introduction" in level 3, or applying a summary to the podcast and indexing that to then link to the raw clip in the bonus level. All amazing, super useful stuff. I am a bit skeptical of embedding-based splitting though, maybe I just need to dive in further! Mostly bullish on level 5: agentic splitting with multimodal LLMs that kind of blend levels 3 and 5.

connorshorten
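
The bonus-level idea mentioned above (index a summary, then link back to the raw clip) can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the code from the video: `Entry`, `word_overlap`, and `retrieve_raw` are hypothetical names, and plain word overlap stands in for the embedding search a real system would use.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    summary: str  # short text that gets searched (the alternative representation)
    raw: str      # original content that is returned to the caller


def word_overlap(query, text):
    """Toy relevance score: number of lowercase words shared by query and text."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve_raw(query, entries):
    """Search over the summaries, but hand back the raw text of the best match."""
    best = max(entries, key=lambda entry: word_overlap(query, entry.summary))
    return best.raw


entries = [
    Entry("discussion of chunk size tradeoffs", "RAW CLIP 1"),
    Entry("interview about vector databases", "RAW CLIP 2"),
]
answer = retrieve_raw("vector databases", entries)
```

The point of the design is that the searched text and the returned text do not have to be the same thing.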

Thanks. I was thinking about solving my own retrieval problem. I already have a small, crude proof of concept using just simple chunking, embedding, RAG, etc. Now I need to handle bigger user inputs that come in bigger PDF files. I thought of using agents to get around the context window; your agentic chunker is a good starting point and makes intuitive sense. I will try this route.

JunYamog

Hi Greg, thanks for the video. It's awesome to have someone publishing good content who's doing the exact same thing as me. Hope to see more videos on advanced topics like this!

Jonathan-rmkt

You really deserve that like button. Really, thanks for this out-of-this-world content.

chakerayachi

Liked this semantic splitting! Cool stuff you've done there!! Also agentic chunking. Pretty cool!!!

mgqkclf

Thanks Greg! Love the long-form instructional video :D Greatly appreciated.

srikanthganta

Awesome video, thanks so much. It's so informative and clear to follow. Well done.

maria-whkm

Love your videos, especially this one. The information density and presentation are off the charts. It is so altruistic of you to put this out there for free.

I am especially interested in the semantic chunking. One use case is transcripts, which often have distinct conversation blocks or question-answer pairs. Since it is important to capture both the question and the answer for full context, I was wondering what methodology might work best.

Alternatively, semantically chunking a document against predefined themes, which is sort of the opposite direction from the agentic chunker. First generate or define the overarching themes or buckets, then assign chunks to them.

It seems that there is some real possibility in the semantic chunking methods. 🎉 Looking forward to experimenting more.

Thank you again.

robxmccarthy
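
The theme-first idea in the comment above (define the buckets, then assign chunks to them) could look roughly like this. It is only a sketch under stated assumptions: `assign_to_themes` and the example themes are hypothetical, and `bag_of_words` is a toy stand-in for a real embedding model.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def assign_to_themes(chunks, themes):
    """Put each chunk in the bucket of the theme whose description it resembles most."""
    theme_vectors = {name: bag_of_words(desc) for name, desc in themes.items()}
    buckets = {name: [] for name in themes}
    for chunk in chunks:
        vector = bag_of_words(chunk)
        best = max(theme_vectors, key=lambda name: cosine(vector, theme_vectors[name]))
        buckets[best].append(chunk)
    return buckets


# Hypothetical themes: name -> short description used for matching.
themes = {
    "pricing": "price cost subscription billing plan",
    "support": "help support ticket bug issue",
}
chunks = [
    "How much does the subscription plan cost per month?",
    "I opened a support ticket about a bug last week.",
]
buckets = assign_to_themes(chunks, themes)
```

A chunk that matches no theme still lands in the first bucket here; a real version would probably want a minimum-similarity cutoff and an "other" bucket.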