LangChain Data Loaders, Tokenizers, Chunking, and Datasets - Data Prep 101

In this video, we're going to focus on preparing our text using LangChain data loaders, tokenization using the tiktoken tokenizers, chunking with LangChain text splitters, and storing data with Hugging Face datasets. Naturally, the focus here is on OpenAI embedding and completion models, but we can apply the same logic to other LLMs like those available via Hugging Face, Cohere, and so on.
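The token-counting step the video handles with tiktoken can be sketched without the library. This is a rough stand-in: a regex split is only a crude proxy for the BPE token ids the real API returns via `tiktoken.get_encoding("cl100k_base").encode(text)`, but it shows where a length function plugs into the chunking step:

```python
import re

def approx_token_len(text: str) -> int:
    # Crude proxy for a BPE tokenizer: count word runs and punctuation marks.
    # The video uses tiktoken for exact counts against OpenAI models.
    return len(re.findall(r"\w+|[^\w\s]", text))

print(approx_token_len("LangChain makes data prep easier."))  # prints 6
```

A function with this shape (text in, integer out) is exactly what LangChain's splitters accept as a `length_function`.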

🔗 Notebook link:

🎙️ Support me on Patreon:

🎨 AI Art:

🤖 70% Discount on the NLP With Transformers in Python course:

🎉 Subscribe for Article and Video Updates!

👾 Discord:

00:00 Data preparation for LLMs
00:45 Downloading the LangChain docs
03:29 Using LangChain document loaders
05:54 How much text can we fit in LLMs?
11:57 Using tiktoken tokenizer to find length of text
16:02 Initializing the recursive text splitter in LangChain
17:25 Why we use chunk overlap
20:23 Chunking with RecursiveCharacterTextSplitter
21:37 Creating the dataset
24:50 Saving and loading with JSONL file
28:40 Data prep is important
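The chunk-overlap idea from the chapters above can be sketched in plain Python. This is a simplified sliding window, not LangChain's actual RecursiveCharacterTextSplitter (which recursively splits on separators like `"\n\n"`, `"\n"`, and `" "` before falling back to character cuts):

```python
def chunk_with_overlap(tokens, chunk_size=400, chunk_overlap=20):
    # Advance by (chunk_size - chunk_overlap) so consecutive chunks share
    # chunk_overlap tokens; the shared tail keeps context that a hard cut
    # at the chunk boundary would otherwise lose.
    step = chunk_size - chunk_overlap
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), step)]

chunks = chunk_with_overlap(list(range(10)), chunk_size=4, chunk_overlap=2)
# chunks[1] begins with the last 2 tokens of chunks[0]
```

The same trade-off from the video applies here: more overlap means less risk of splitting an idea across chunks, but more duplicated tokens to embed and store.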
Comments

The LangChain docs have moved, so the original wget command in this video will no longer download everything; now you need to use:

jamesbriggs

Fascinating channel, thanks! Remarkable to learn about LLMs, how to interact with LLMs, what can be built, and what could be possible over time. I look forward to more.

videowatching

Great video. Finally somebody who goes in depth into data prep. I've always wondered about unnecessary (key, value) pairs in JSON files.

ADHDOCD
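The (key, value) records mentioned in the comment above are what the video ultimately writes out as JSONL, one JSON object per line. A stdlib-only sketch of that save/load step, with illustrative field names standing in for the chunked docs:

```python
import json

# Hypothetical records standing in for the chunked LangChain docs.
docs = [
    {"id": "doc-0", "text": "first chunk of the docs"},
    {"id": "doc-1", "text": "second chunk, sharing some overlap"},
]

# Save: one JSON object per line (JSONL).
with open("train.jsonl", "w") as f:
    for d in docs:
        f.write(json.dumps(d) + "\n")

# Load it back line by line; Hugging Face datasets can read the same file
# with load_dataset("json", data_files="train.jsonl").
with open("train.jsonl") as f:
    loaded = [json.loads(line) for line in f]
```

Keeping only the fields you actually need per record (id, text, maybe a source URL) avoids exactly the unnecessary key/value bloat the commenter mentions.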

Thank you for sharing this informative video showcasing LangChain's powerful text chunking capabilities using the RecursiveCharacterTextSplitter. Previously, I had to write several functions to tokenize and split text while managing context overlap to avoid missing crucial information. However, accomplishing the same task now only requires a few lines of code. Very impressive.

dikshyakasaju

Great video series. Appreciate you sharing your thought process as we go - this is the part most online tech content creators miss. They cover the how, and more often than not miss the why. Thanks again. Enjoying all the videos in this playlist.

harleenmann

Chunking is the most important idea and largely ignored. Thanks James, love your technical depth.

wyfcvij

Thank you, James, for the in-depth explanation of data prep. Learning a lot from your videos.

redfield

I need to chunk text for retrieval augmentation and did a search on YouTube and found... James Briggs' video. I know I will find in it what I need. Nice!

fgfanta

Thank you for sharing your knowledge. These are some of the best videos on LangChain.

grandplazaunited

This was great! I made the terrible mistake of chunking without considering this simple math, and embedded and indexed into Pinecone at the larger size. Now I have to go redo them all, having realized that at their current sizes they aren't quite suitable for LangChain retrieval.

SnowyMango

Very helpful and very well explained. Thanks for sharing your knowledge about this! LangChain really feels like the missing glue between the open web and all those new AI models popping up.

alvinpinoy

I'm @ 12:34 and this is an amazing explanation thus far. Thank you!

temiwale

Thanks for the tutorial, really clear explanation!

codecritique

Uff, the video that I was waiting for! Thank youuu!

eRiicBelleT

James dropping the great content as usual.

fraternitas

Great content once again, thanks for sharing. I wish I had this a couple weeks ago :D


Awesome, I was just trying to figure out how to do this with the LangChain docs so that I can learn it quicker!

lf

You always know what I am looking for, thanks for this 🙏

muhammadhammadkhan

Ayo, I was literally looking for how to prepare my data for the past hour. Thank you for making these.

siamhasan

James you are helping a lot in my activities. Thank you.

MatheusGamer