LLaVA - the first instruction-following multi-modal model (paper explained)

There is growing interest in developing multimodal foundation models, analogous to the large language models (LLMs) that serve as foundation models for text. LLaVA, which stands for Large Language and Vision Assistant, is the first paper to apply instruction tuning to visual data, thereby pushing the possibilities of Large Multimodal Models (LMMs). This video explains the first paper in the LLaVA series, which also includes LLaVA-RLHF, LLaVA-Med, and the latest, LLaVA 1.5.
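As a rough illustration of the idea covered in the video: LLaVA connects a pretrained CLIP vision encoder to an LLM (Vicuna) through a learned projection, so image features become tokens the LLM can attend to alongside the instruction text. Below is a minimal PyTorch-style sketch of that projection step, not the official implementation; the class name, variable names, and feature dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class VisualProjector(nn.Module):
    """Sketch of LLaVA's projection idea: map vision-encoder patch features
    into the LLM's token-embedding space (a single linear layer in LLaVA 1.0)."""
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim) from a frozen CLIP encoder
        return self.proj(patch_features)  # (batch, num_patches, llm_dim)

# Illustrative usage with dummy tensors standing in for real encoder outputs
projector = VisualProjector()
image_tokens = projector(torch.randn(1, 256, 1024))       # projected visual "tokens"
text_tokens = torch.randn(1, 32, 4096)                     # embedded instruction text
llm_input = torch.cat([image_tokens, text_tokens], dim=1)  # sequence fed to the LLM
```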

RELATED LINKS

🛠 🛠 🛠 MY SOFTWARE TOOLS 🛠 🛠 🛠

📚 📚 📚 BOOKS I HAVE READ, REFER AND RECOMMEND 📚 📚 📚

MY KEY LINKS

WHO AM I?
I am a Machine Learning Researcher / Practitioner who has seen the grind of academia and start-ups equally. I started my career as a software engineer 15 years ago. Because of my love for Mathematics (coupled with a glimmer of luck), I graduated with a Master's in Computer Vision and Robotics in 2016, just as the current AI revolution was getting started. Life has changed for the better ever since.

#machinelearning #deeplearning #aibites
Comments

This was a super helpful video - high level, but still detailed enough for me to understand and feel confident that I can try to reproduce their work. Thanks!

Hello-txug

this is great, I was waiting for research like this to come out!

IntrospectiveMinds

Can you please explain the SpeechX paper from Microsoft?

MohamedEmad-td