Coding a Multimodal (Vision) Language Model from scratch in PyTorch with full explanation

Full coding of a Multimodal (Vision) Language Model from scratch using only Python and PyTorch.

We will be coding the PaliGemma Vision Language Model from scratch while explaining all the concepts behind it:
- Transformer model (Embeddings, Positional Encoding, Multi-Head Attention, Feed Forward Layer, Logits, Softmax)
- Vision Transformer model
- Contrastive learning (CLIP, SigLip)
- Numerical stability of the Softmax and the Cross Entropy Loss
- Rotary Positional Embedding
- Multi-Head Attention
- Grouped Query Attention
- Normalization layers (Batch, Layer and RMS)
- KV-Cache (prefilling and token generation)
- Attention masks (causal and non-causal)
- Weight tying
- Top-P Sampling and Temperature
and much more!
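To give a taste of the numerical-stability topic listed above: a minimal sketch (my own illustration, not the video's code) of the max-subtraction trick that keeps the softmax from overflowing.

```python
import torch

def naive_softmax(x: torch.Tensor) -> torch.Tensor:
    # exp() overflows for large logits: exp(1000.0) is inf in float32,
    # and inf / inf produces nan.
    e = torch.exp(x)
    return e / e.sum(dim=-1, keepdim=True)

def stable_softmax(x: torch.Tensor) -> torch.Tensor:
    # Subtracting the row-wise max leaves the result unchanged
    # (the factor cancels in the ratio) but bounds every exponent by 0.
    x = x - x.max(dim=-1, keepdim=True).values
    e = torch.exp(x)
    return e / e.sum(dim=-1, keepdim=True)

logits = torch.tensor([[1000.0, 1001.0, 1002.0]])
print(naive_softmax(logits))   # nan values from inf / inf
print(stable_softmax(logits))  # finite probabilities that sum to 1
```

The same shift is what makes the log-sum-exp inside the cross-entropy loss safe to compute.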
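Likewise, Top-P (nucleus) sampling with temperature, covered near the end of the video, can be sketched in a few lines. This is a generic illustration with made-up parameter values, not the exact implementation from the video:

```python
import torch

def sample_top_p(logits: torch.Tensor, temperature: float = 0.8, p: float = 0.9) -> int:
    # Temperature < 1 sharpens the distribution, > 1 flattens it.
    probs = torch.softmax(logits / temperature, dim=-1)
    # Sort descending and keep the smallest prefix whose total mass exceeds p.
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Mask tokens once the mass accumulated *before* them already passes p.
    mask = cumulative - sorted_probs > p
    sorted_probs[mask] = 0.0
    sorted_probs /= sorted_probs.sum()  # renormalize the surviving nucleus
    next_token = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_idx[next_token].item()
```

With a strongly peaked distribution the nucleus collapses to a single token, so sampling becomes deterministic; with flatter logits it draws from the few most probable tokens only.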

All the topics will be explained using materials developed by me. For Multi-Head Attention, I have also drawn all the tensor operations that we perform in the code, so that we have a visual representation of what happens under the hood.
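The tensor operations at the heart of multi-head attention boil down to a few reshapes and matrix products. Here is a minimal shape-focused sketch (with made-up sizes, and reusing the input as Q, K, and V instead of learned projections, purely to illustrate the reshaping):

```python
import torch

batch, seq_len, d_model, n_heads = 2, 6, 64, 8
head_dim = d_model // n_heads

x = torch.randn(batch, seq_len, d_model)
# In a real model Q, K, V come from learned linear projections of x;
# here we reuse x itself to focus on the shape manipulation.
# Split d_model into heads: (batch, n_heads, seq_len, head_dim)
q = k = v = x.view(batch, seq_len, n_heads, head_dim).transpose(1, 2)

# Scaled dot-product attention, computed for all heads in parallel.
scores = q @ k.transpose(-2, -1) / head_dim**0.5  # (batch, n_heads, seq_len, seq_len)
attn = torch.softmax(scores, dim=-1)              # each row sums to 1
out = attn @ v                                    # (batch, n_heads, seq_len, head_dim)

# Merge the heads back: (batch, seq_len, d_model)
out = out.transpose(1, 2).contiguous().view(batch, seq_len, d_model)
print(out.shape)  # torch.Size([2, 6, 64])
```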

Prerequisites:

🚀🚀 Join Writer 🚀🚀
Writer is the full-stack generative AI platform for enterprises. We make it easy for organizations to deploy AI apps and workflows that deliver impactful ROI.
We train our own models and we are looking for amazing researchers to join us!

Chapters
00:00:00 - Introduction
00:05:52 - Contrastive Learning and CLIP
00:16:50 - Numerical stability of the Softmax
00:23:00 - SigLip
00:26:30 - Why a Contrastive Vision Encoder?
00:29:13 - Vision Transformer
00:35:38 - Coding SigLip
00:54:25 - Batch Normalization, Layer Normalization
01:05:28 - Coding SigLip (Encoder)
01:16:12 - Coding SigLip (FFN)
01:20:45 - Multi-Head Attention (Coding + Explanation)
02:15:40 - Coding SigLip
02:18:30 - PaliGemma Architecture review
02:21:19 - PaliGemma input processor
02:40:56 - Coding Gemma
02:43:44 - Weight tying
02:46:20 - Coding Gemma
03:08:54 - KV-Cache (Explanation)
03:33:35 - Coding Gemma
03:52:05 - Image features projection
03:53:17 - Coding Gemma
04:02:45 - RMS Normalization
04:09:50 - Gemma Decoder Layer
04:12:44 - Gemma FFN (MLP)
04:16:02 - Multi-Head Attention (Coding)
04:18:30 - Grouped Query Attention
04:38:35 - Multi-Head Attention (Coding)
04:43:26 - KV-Cache (Coding)
04:47:44 - Multi-Head Attention (Coding)
04:56:00 - Rotary Positional Embedding
05:23:40 - Inference code
05:32:50 - Top-P Sampling
05:40:40 - Inference code
05:43:40 - Conclusion
Comments

My favorite pizza (one of my favorites) is actually "Pizza with mozzarella di bufala", also known as "bufalina" in Italy 😆😋

umarjamilai

You, Sir, are a source of pride for all of us Italian computer scientists. Auguri! Grazie!

flavioferlin

I have no words to thank you. Two weeks ago I was wondering why there are no books for VLMs like there are for LLMs, and today I found your comprehensive explanation video.

hamzawi

5 hours of top-tier content, completely for free! Thank you so much! Please keep uploading such content.

harshwardhanfartale

You are the best YouTuber on the internet, the best! Not one of the best! I have listened to a bunch of programming videos, and none of them are like yours: so good, so up to date, so amazing.

zhuoranlu

Finito! Thank you a lot: I was very curious to learn how multimodal algorithms even work, and it has been a very good challenge to follow the flow of information from input to output, one math operation at a time. Kudos!

DanieleO.

This channel and video are the real deal. Amazing quality. Can't wait to watch the whole thing. Can't believe it's completely free - we have no excuse! Keep up the great work, and Assalamu Alaikum from Austin, TX!

Bbb

Your contribution to the world is immeasurable...

dfrqwmn

You and Andrej are the two guys inspiring me a lot. Respect!

dinhluongnguyen

You're the best at explaining papers with code. Keep it up, bro 👏. I hope the next one is about *ControlNet* from scratch.

nasirnr

The Stable Diffusion video was great. I bet this is even better. Nice to see your videos, man. Welcome back.

Reverberie

You are the best ML engineer, bro. There is no other full explanation with PyTorch code for a multimodal LM in all of YouTube. May God preserve you.

tamineabderrahmane

Super well-commented and structured code, and a well-explained video. Superb quality, completely free!

linhvu

Thank you very much. I'm doing my master's thesis on Vision Language Models, and this video is such an amazing resource for completing it. Excellent work!

machiniram

I wanted to spend a few days reading about how multimodal LMs work, but your video broke all my plans 😅. As always, perfect timing and explanation. Keep up the great work!

arturstupa

I haven't seen the video yet, but I'm sure it's amazing, like all your videos. Here's a little thank-you.

marsupilami

Can't thank you enough. You are simply the best guy on YouTube in this field.

bhaweshs

Before watching the video, I want to thank you for the great effort! Your videos always answer my questions!!

Yo-rwmq

Man, you are a saviour... please keep up the good work.

danish

I love this content so much. I don't care if it's 6 hours long, man. I appreciate the effort.

santiagopazbedoya