LLaVA: A large multi-modal language model

In this video, we'll learn about LLaVA (Large Language and Vision Assistant), a multimodal model that integrates a CLIP vision encoder with the Vicuna LLM.
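If you want to try something similar yourself, here is a minimal sketch of asking LLaVA to describe an image via the Hugging Face transformers port of the model. The llava-hf/llava-1.5-7b-hf checkpoint, the prompt template, and the image URL are assumptions for illustration, not necessarily what the video uses; check the model card for exact details.

```python
# Minimal sketch: ask a LLaVA checkpoint to describe an image.
# Assumes the Hugging Face "llava-hf/llava-1.5-7b-hf" port; adjust the
# model id, prompt format, and image to whatever you actually use.
import requests
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id, device_map="auto")

# Any local or remote image works; this URL is just a placeholder.
image = Image.open(requests.get("https://example.com/cartoon_cat.png", stream=True).raw)
prompt = "USER: <image>\nDescribe this image in one sentence. ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```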

We'll see how well it does at describing a cartoon cat, a photo of me with AI-generated parrots, and a bunch of images created with the Midjourney generative AI tool.

And most importantly, we'll find out whether it knows who Cristiano Ronaldo is!

#AI #MultimodalModels #llava #GPT4 #ImageRecognition #Streamlit #MachineLearning #AndrewNg #llms

Comments

So cool. GenAI is a never-ending stream of fun.

aragaodan

Can you do a video on fine-tuning a multimodal LLM (Video-LLaMA, LLaVA, or CLIP) with a custom multimodal dataset of images and text for relation extraction or another specific task? Could you use an open-source multimodal LLM and open multimodal datasets, like Video-LLaMA's, so anyone can build on your tutorial for their own experiments? Could you also cover how to boost the performance of the fine-tuned model with prompt tuning in the same video?

thisurawz

Thanks for posting. I have it working, but when I run it in Cygwin I see an error about a missing cl.exe, even though the file exists; it seems to work anyway.

kenbajema

Me, immediately homing in on the misspelling of "instruction" at the 17-second mark. 🫠

PeterCorless