Coding a Multimodal (Vision) Language Model from scratch in PyTorch with full explanation

Full coding of a Multimodal (Vision) Language Model from scratch using only Python and PyTorch.
We will be coding the PaliGemma Vision Language Model from scratch while explaining all the concepts behind it:
- Transformer model (Embeddings, Positional Encoding, Multi-Head Attention, Feed Forward Layer, Logits, Softmax)
- Vision Transformer model
- Contrastive learning (CLIP, SigLip)
- Numerical stability of the Softmax and the Cross Entropy Loss (see the short sketch after this list)
- Rotary Positional Embedding
- Multi-Head Attention
- Grouped Query Attention
- Normalization layers (Batch, Layer and RMS)
- KV-Cache (prefilling and token generation)
- Attention masks (causal and non-causal)
- Weight tying
- Top-P Sampling and Temperature (a short sketch follows the chapter list below)
and much more!
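To give a taste of the numerical-stability topic listed above, here is a minimal PyTorch sketch (not taken from the video's code) of the standard max-subtraction trick that keeps the softmax from overflowing:

```python
import torch

def naive_softmax(x: torch.Tensor, dim: int = -1) -> torch.Tensor:
    # exp() overflows as soon as any logit is large (roughly > 88 in float32)
    exps = torch.exp(x)
    return exps / exps.sum(dim=dim, keepdim=True)

def stable_softmax(x: torch.Tensor, dim: int = -1) -> torch.Tensor:
    # Subtracting the max logit leaves the result unchanged
    # (softmax is shift-invariant) but keeps every exponent <= 0,
    # so exp() can never overflow.
    x = x - x.max(dim=dim, keepdim=True).values
    exps = torch.exp(x)
    return exps / exps.sum(dim=dim, keepdim=True)

logits = torch.tensor([[1000.0, 1001.0, 1002.0]])
print(naive_softmax(logits))   # tensor([[nan, nan, nan]])
print(stable_softmax(logits))  # tensor([[0.0900, 0.2447, 0.6652]])
```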
All the topics are explained using materials I developed myself. For Multi-Head Attention, I have also drawn all the tensor operations that the code performs, so we have a visual representation of what happens under the hood.
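As a companion to those drawings, this is a minimal, self-contained sketch (with assumed shapes, not the video's actual module) of the tensor reshaping that Multi-Head Attention performs under the hood:

```python
import torch

batch, seq_len, d_model, n_heads = 2, 16, 512, 8
head_dim = d_model // n_heads

x = torch.randn(batch, seq_len, d_model)

# In the real model these are learned nn.Linear projections;
# random weights are enough to show the shapes.
w_q = torch.randn(d_model, d_model)
w_k = torch.randn(d_model, d_model)
w_v = torch.randn(d_model, d_model)

q = x @ w_q   # (batch, seq_len, d_model)
k = x @ w_k
v = x @ w_v

# Split d_model into (n_heads, head_dim) and move heads next to the batch dim:
# (batch, seq_len, d_model) -> (batch, n_heads, seq_len, head_dim)
q = q.view(batch, seq_len, n_heads, head_dim).transpose(1, 2)
k = k.view(batch, seq_len, n_heads, head_dim).transpose(1, 2)
v = v.view(batch, seq_len, n_heads, head_dim).transpose(1, 2)

# Scaled dot-product attention, computed independently per head
scores = (q @ k.transpose(-2, -1)) / head_dim**0.5  # (batch, n_heads, seq_len, seq_len)
attn = torch.softmax(scores, dim=-1)
out = attn @ v                                       # (batch, n_heads, seq_len, head_dim)

# Merge the heads back: (batch, seq_len, d_model)
out = out.transpose(1, 2).contiguous().view(batch, seq_len, d_model)
print(out.shape)  # torch.Size([2, 16, 512])
```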
Prerequisites:
🚀🚀 Join Writer 🚀🚀
Writer is the full-stack generative AI platform for enterprises. We make it easy for organizations to deploy AI apps and workflows that deliver impactful ROI.
We train our own models and we are looking for amazing researchers to join us!
Chapters
00:00:00 - Introduction
00:05:52 - Contrastive Learning and CLIP
00:16:50 - Numerical stability of the Softmax
00:23:00 - SigLip
00:26:30 - Why a Contrastive Vision Encoder?
00:29:13 - Vision Transformer
00:35:38 - Coding SigLip
00:54:25 - Batch Normalization, Layer Normalization
01:05:28 - Coding SigLip (Encoder)
01:16:12 - Coding SigLip (FFN)
01:20:45 - Multi-Head Attention (Coding + Explanation)
02:15:40 - Coding SigLip
02:18:30 - PaliGemma Architecture review
02:21:19 - PaliGemma input processor
02:40:56 - Coding Gemma
02:43:44 - Weight tying
02:46:20 - Coding Gemma
03:08:54 - KV-Cache (Explanation)
03:33:35 - Coding Gemma
03:52:05 - Image features projection
03:53:17 - Coding Gemma
04:02:45 - RMS Normalization
04:09:50 - Gemma Decoder Layer
04:12:44 - Gemma FFN (MLP)
04:16:02 - Multi-Head Attention (Coding)
04:18:30 - Grouped Query Attention
04:38:35 - Multi-Head Attention (Coding)
04:43:26 - KV-Cache (Coding)
04:47:44 - Multi-Head Attention (Coding)
04:56:00 - Rotary Positional Embedding
05:23:40 - Inference code
05:32:50 - Top-P Sampling
05:40:40 - Inference code
05:43:40 - Conclusion
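For reference alongside the Top-P Sampling chapter, here is a minimal sketch (independent of the video's inference code) of temperature scaling followed by nucleus (top-p) filtering:

```python
import torch

def sample_top_p(logits: torch.Tensor, temperature: float = 0.8, top_p: float = 0.9) -> torch.Tensor:
    # Temperature rescales the logits before softmax: <1 sharpens, >1 flattens.
    probs = torch.softmax(logits / temperature, dim=-1)

    # Sort probabilities and keep the smallest set whose cumulative mass exceeds top_p.
    sorted_probs, sorted_idx = torch.sort(probs, descending=True, dim=-1)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    mask = cumulative - sorted_probs > top_p   # tokens lying entirely beyond the nucleus
    sorted_probs[mask] = 0.0
    sorted_probs /= sorted_probs.sum(dim=-1, keepdim=True)  # renormalize the nucleus

    # Sample within the nucleus and map back to the original token ids.
    next_token = torch.multinomial(sorted_probs, num_samples=1)
    return torch.gather(sorted_idx, -1, next_token)

logits = torch.randn(1, 32000)      # fake vocabulary logits for illustration
print(sample_top_p(logits).item())  # id of the sampled token
```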