LangChain Data Loaders, Tokenizers, Chunking, and Datasets - Data Prep 101

In this video, we're going to focus on preparing our text using LangChain data loaders, tokenization using the tiktoken tokenizers, chunking with LangChain text splitters, and storing data with Hugging Face datasets. Naturally, the focus here is on OpenAI embedding and completion models, but we can apply the same logic to other LLMs like those available via Hugging Face, Cohere, and so on.
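The token-counting step the video handles with tiktoken can be sketched without the library. This is a rough stand-in: a regex split is only a crude proxy for the BPE token ids the real API returns via `tiktoken.get_encoding("cl100k_base").encode(text)`, but it shows where a length function plugs into the chunking step:

```python
import re

def approx_token_len(text: str) -> int:
    # Crude proxy for a BPE tokenizer: count word runs and punctuation marks.
    # The video uses tiktoken for exact counts against OpenAI models.
    return len(re.findall(r"\w+|[^\w\s]", text))

print(approx_token_len("LangChain makes data prep easier."))  # prints 6
```

A function with this shape (text in, integer out) is exactly what LangChain's splitters accept as a `length_function`.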

🔗 Notebook link:

🎙️ Support me on Patreon:

🎨 AI Art:

🤖 70% Discount on the NLP With Transformers in Python course:

🎉 Subscribe for Article and Video Updates!

👾 Discord:

00:00 Data preparation for LLMs
00:45 Downloading the LangChain docs
03:29 Using LangChain document loaders
05:54 How much text can we fit in LLMs?
11:57 Using tiktoken tokenizer to find length of text
16:02 Initializing the recursive text splitter in LangChain
17:25 Why we use chunk overlap
20:23 Chunking with RecursiveCharacterTextSplitter
21:37 Creating the dataset
24:50 Saving and loading with JSONL file
28:40 Data prep is important
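The chunk-overlap idea from the chapters above can be sketched in plain Python. This is a simplified sliding window, not LangChain's actual RecursiveCharacterTextSplitter (which recursively splits on separators like `"\n\n"`, `"\n"`, and `" "` before falling back to character cuts):

```python
def chunk_with_overlap(tokens, chunk_size=400, chunk_overlap=20):
    # Advance by (chunk_size - chunk_overlap) so consecutive chunks share
    # chunk_overlap tokens; the shared tail keeps context that a hard cut
    # at the chunk boundary would otherwise lose.
    step = chunk_size - chunk_overlap
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), step)]

chunks = chunk_with_overlap(list(range(10)), chunk_size=4, chunk_overlap=2)
# chunks[1] begins with the last 2 tokens of chunks[0]
```

The same trade-off from the video applies here: more overlap means less risk of splitting an idea across chunks, but more duplicated tokens to embed and store.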
Comments

The LangChain docs have moved, so the original wget command in this video will no longer download everything; now you need to use:

jamesbriggs

Fascinating channel, thanks! Remarkable to learn about LLMs, how to interact with LLMs, what can be built, and what could be possible over time. I look forward to more.

videowatching

Great video. Finally somebody who goes in depth into data prep. I've always wondered about unnecessary (key, value) pairs in JSON files.

ADHDOCD
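The (key, value) records mentioned in the comment above are what the video ultimately writes out as JSONL, one JSON object per line. A stdlib-only sketch of that save/load step, with illustrative field names standing in for the chunked docs:

```python
import json

# Hypothetical records standing in for the chunked LangChain docs.
docs = [
    {"id": "doc-0", "text": "first chunk of the docs"},
    {"id": "doc-1", "text": "second chunk, sharing some overlap"},
]

# Save: one JSON object per line (JSONL).
with open("train.jsonl", "w") as f:
    for d in docs:
        f.write(json.dumps(d) + "\n")

# Load it back line by line; Hugging Face datasets can read the same file
# with load_dataset("json", data_files="train.jsonl").
with open("train.jsonl") as f:
    loaded = [json.loads(line) for line in f]
```

Keeping only the fields you actually need per record (id, text, maybe a source URL) avoids exactly the unnecessary key/value bloat the commenter mentions.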

Thank you for sharing this informative video showcasing LangChain's powerful text chunking capabilities using the RecursiveCharacterTextSplitter. Previously, I had to write several functions to tokenize and split text while managing context overlap to avoid missing crucial information. However, accomplishing the same task now only requires a few lines of code. Very impressive.

dikshyakasaju

Great video series. Appreciate you sharing your thought process as we go - this is the part most online tech content creators miss. They cover the how, and more often than not miss the why. Thanks again. Enjoying all the videos in this playlist.

harleenmann

Chunking is the most important idea and largely ignored. Thanks James, love your technical depth.

wyfcvij

Thank you, James, for the in-depth explanation of data prep. Learning a lot from your videos.

redfield

I need to chunk text for retrieval augmentation and did a search on YouTube and found... James Briggs' video. I know I will find in it what I need. Nice!

fgfanta

Thank you for sharing your knowledge. These are some of the best videos on LangChain.

grandplazaunited

This was great! I made the terrible mistake of chunking without considering this simple math, and embedded and indexed into Pinecone at the larger size. Now I have to go redo them all, having realized that at their current sizes they aren't quite suitable for LangChain retrieval.

SnowyMango

Very helpful and very well explained. Thanks for sharing your knowledge about this! LangChain really feels like the missing glue between the open web and all those new AI models popping up.

alvinpinoy

I'm @ 12:34 and this is an amazing explanation thus far. Thank you!

temiwale

Thanks for the tutorial, really clear explanation!

codecritique

Uff, the video that I was waiting for! Thank youuu!

eRiicBelleT

James dropping the great content as usual.

fraternitas

Great content once again, thanks for sharing. I wish I had this a couple weeks ago :D


Awesome, I was just trying to figure out how to do this with the LangChain docs so that I can learn it quicker!

lf

You always know what I am looking for, thanks for this 🙏

muhammadhammadkhan

Ayo, I was literally looking for how to prepare my data for the past hour. Thank you for making these.

siamhasan

James you are helping a lot in my activities. Thank you.

MatheusGamer