LangChain Data Loaders, Tokenizers, Chunking, and Datasets - Data Prep 101

In this video, we focus on preparing text for LLMs: loading documents with LangChain data loaders, tokenizing with the tiktoken tokenizer, chunking with LangChain text splitters, and storing the results with Hugging Face datasets. The focus here is on OpenAI embedding and completion models, but the same logic applies to other LLMs, such as those available via Hugging Face, Cohere, and so on.
🔗 Notebook link:
🎙️ Support me on Patreon:
🎨 AI Art:
🤖 70% Discount on the NLP With Transformers in Python course:
🎉 Subscribe for Article and Video Updates!
👾 Discord:
00:00 Data preparation for LLMs
00:45 Downloading the LangChain docs
03:29 Using LangChain document loaders
05:54 How much text can we fit in LLMs?
11:57 Using tiktoken tokenizer to find length of text
16:02 Initializing the recursive text splitter in LangChain
17:25 Why we use chunk overlap
20:23 Chunking with RecursiveCharacterTextSplitter
21:37 Creating the dataset
24:50 Saving and loading with JSONL file
28:40 Data prep is important