How To Create Datasets for Finetuning From Multiple Sources! Improving Finetunes With Embeddings.

Today, we delve into the process of setting up datasets for fine-tuning large language models (LLMs). Starting from the initial considerations needed before dataset construction, we work through common pipeline questions, such as whether you need embeddings at all. We discuss how to structure raw text data for fine-tuning, illustrated with real coding and medical-appeals scenarios.
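To make that concrete, here is a minimal sketch of one common way to lay out a fine-tuning dataset: one JSON object per line (JSONL) with instruction, input, and output fields. The field names and the sample appeal text are illustrative assumptions; different training frameworks expect different schemas.

```python
# Minimal sketch of a JSONL fine-tuning dataset (illustrative schema;
# field names vary by training framework).
import json

examples = [
    {
        "instruction": "Summarize the denial reason in this Medicare appeal.",
        "input": "The claim was denied because the prescribed dosage exceeds plan limits.",
        "output": "Denied for exceeding the plan's dosage limit.",
    },
]

# One JSON object per line is the usual JSONL convention.
with open("train.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```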

We also explore how to leverage embeddings to provide additional context to our models, a crucial step in building more general and robust models. The video then shows how to turn books into structured datasets using LLMs, with 'Twenty Thousand Leagues Under the Sea' converted into a question-and-answer format as the example.
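As a rough sketch of that book-to-dataset step: split the raw text into passages and ask a model to write a question-and-answer pair for each one. The `ask_llm` helper below is a hypothetical stand-in for whichever local model you call, and the chunk size is an arbitrary choice, not the one used in the video.

```python
# Sketch: turn raw book text into Q&A training rows.
# `ask_llm` is a hypothetical callable (prompt string -> completion string).
import json

def chunk_text(text: str, words_per_chunk: int = 300):
    """Yield fixed-size word chunks of the raw text."""
    words = text.split()
    for i in range(0, len(words), words_per_chunk):
        yield " ".join(words[i : i + words_per_chunk])

def build_qa_dataset(book_text: str, ask_llm) -> list:
    prompt = "Write one question and its answer based only on this passage:\n\n{chunk}"
    rows = []
    for chunk in chunk_text(book_text):
        qa = ask_llm(prompt.format(chunk=chunk))  # model writes the Q&A pair
        rows.append({"passage": chunk, "qa": qa})
    return rows

# Usage sketch: dump the rows as JSONL for the fine-tuning step above.
# with open("book_qa.jsonl", "w") as f:
#     for row in build_qa_dataset(open("20k_leagues.txt").read(), ask_llm):
#         f.write(json.dumps(row) + "\n")
```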

In addition, we look at the process of fine-tuning LLMs to write in specific programming languages, showing a practical application with a Cypher query for graph databases. Lastly, we demonstrate how to enhance the performance of a medical application by retrieving embedded information with Superbooga.
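The retrieval idea behind tools like Superbooga reduces to: embed your reference chunks, find the ones closest to the user's question, and prepend them to the prompt. Here is a minimal sketch of that pattern using the sentence-transformers library; the model name and the sample appeal-rule chunks are assumptions for illustration, not what the video itself uses.

```python
# Sketch of embedding-based context retrieval (the general pattern that
# Superbooga automates inside text-generation-webui).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

# Hypothetical reference chunks for a medical-appeals assistant.
chunks = [
    "Appeals must be filed within 60 days of the denial notice.",
    "A formulary exception requires a supporting statement from the prescriber.",
]
chunk_embeddings = model.encode(chunks, convert_to_tensor=True)

def augment_prompt(question: str, top_k: int = 1) -> str:
    """Prepend the most similar reference chunks to the question."""
    q_emb = model.encode(question, convert_to_tensor=True)
    scores = util.cos_sim(q_emb, chunk_embeddings)[0]  # cosine similarities
    best = scores.topk(min(top_k, len(chunks))).indices.tolist()
    context = "\n".join(chunks[i] for i in best)
    return f"Context:\n{context}\n\nQuestion: {question}"

print(augment_prompt("How long do I have to file an appeal?"))
```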

Whether you're interested in coding, medical applications, book conversion, or fine-tuning LLMs in general, this video provides comprehensive insights. Tune in to discover how to augment your models with advanced techniques and tools, and join us on the live stream for a deep dive into broadening the context of local models and the results from our book-training and comedy datasets.

0:00 Intro
0:44 Considerations For Finetuning Datasets
2:45 Reviewing Embeddings
5:35 Finetuning With Embeddings
8:31 Creating Datasets From Raw/Books
12:08 Coding Finetuning Example
14:02 Medicare/Medicaid Appeals Example
17:01 Outro

#machinelearning #ArtificialIntelligence #LargeLanguageModels #FineTuning #DataPreprocessing #Embeddings
Comments

This content is top notch among the ML and AI content on YouTube, showing us how it really works!

cesarsantos

Okay, so after a cup of coffee and watching a couple of times: WOW. You helped me so much, thank you. This has been driving me nuts, and you make it look so easy to fix. I wish I was as smart as you. Thank you again. 🎉

timothymaggenti

Comedy dataset update! I have found an approach I think I like for it, though I didn't have time to complete it for this video. So, I will also cover that in today's live stream!

AemonAlgiz

You’re literally a genius! I appreciate you taking the time to share the knowledge with us! Exactly what I was looking for… how to create a dataset and in such a well put together video. Thank you

RAGNetwork

Amazing, thanks a lot for sharing your reflections on your work and experience! It is much appreciated! This is the first time I've quickly browsed something like this and it stuck, without having to review, study, and come back later. I'm able to get a bird's-eye view of the topic, the options available for work, and the underlying purpose. 🥇 Pure gold. Definitely subscribed!

flowers

Dude, seriously, your content is so clear and easy to follow. Keep it up!

HistoryIsAbsurd

Finally, a freaking great tutorial! Practical, straight to the point, and it works!!

fabsync

I would pay a lot of money for this information, thank you.

boogfromopenseason

I very much appreciate that you always have this way of listing the most important bullet points at the beginning.

leont.

Great explanation with the right level of details and depth. Good stuff. Thanks!

rosenangelow

Amazing work... this channel is pure gold: exactly the right amount of concepts, and everything is spot on. Nothing beats teaching from experience like you do.

pelaus

I knew I subscribed here for good reason. This is consistently extremely high-quality information -- not the regurgitated stuff. This is super educational and has immensely improved my understanding.

Please keep going bud, this is great.

smellslikeupdog

Awesome content!! Thank you very much!!👏🏻👏🏻👍🏻

redbaron

Wow, how do you make everything look so easy? Nice, thanks. So East Coast, man, you're an early bird.

timothymaggenti

Great explanations! Thanks a lot for your efforts making this great content!

babyfox

That's awesome! And you can even save the new appeal to create more data!

Hypersniper

The appeal has been processed by the approval AI... And it passed! The prescription will now be covered. 😊
(Thank you for the video! I think datasets and installing dependencies are ML's greatest pain points at the moment.)

jonmichaelgalindo

How would building a training set on a codebase look? Is there a good example of automating the generation of a Q&A training set based on code? How do you chunk it to fit in the context window - break it up by functions and classes? Where would the extraneous stuff go, like requirements, imports, etc...? Thanks for the great content!

kenfink

This video was awesome! I'm finally starting to wrap my head around this stuff. At the same time, I'm realising the power that is being unleashed onto the world!
BTW, did you see this new paper: SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression? Looks like it's right up your alley!

arinco

Hey man, thanks for your videos; they are instructive. I am new to LLMs, and I think there is a significant gap in YouTube content on the new LLMs. I know there are videos on fine-tuning GPT-3, but I can't find anything like a walkthrough of fine-tuning a larger, new open-source model like Falcon-40B Instruct. If there were a playlist going through the process - Q&A fine-tune data definition, synthetic data production, fine-tuning, and testing - I am sure others like myself would be very keen followers.

danielmz