What is KOSMOS-2?

"KOSMOS-2: Grounding Multimodal Large Language Models to the World" is a new preprint from Microsoft research that illustrates multimodal grounding abilities in a large vision-language model.

Timestamps:
00:00 - KOSMOS-2
00:12 - Grounding Multimodal Large Language Models to the World
01:25 - KOSMOS-1
02:01 - KOSMOS-2 overview
03:42 - The Grounded Image-Text Pairs (GrIT) dataset
05:07 - Kosmos-2: Model and training details
07:00 - Testing the model on some tricky images
07:57 - Hallucinations (Flamingo reference)
09:34 - Phrase grounding
09:59 - Referring expression comprehension
10:43 - Flamingo - where art thou?
11:56 - Language tasks
12:20 - The Ethics Statement
13:43 - Closing thoughts

Topics: #LLMs #ai #microsoft #KOSMOS-2

For related content:

(Optional) if you'd like to support the channel:

Acknowledgements:
Comments:

This is so informative, thanks! I enjoyed the update and not having to read the paper. 😅

AICoffeeBreak

Please keep making these videos, thank you so much!

KennethAngelikas

Your videos are amazing! They remind me of TwoMinutePapers in its early days. Keep it up!

nathabonfim

Hey, thanks for the time you put into these. You are awesome and I really appreciate the depth of info here!!! ❤

oliviarojas

Great overview! This is next-gen computer vision. Even if the current implementation is far from perfect, this is definitely a huge step forward.

geocow

You're AWESOME as usual, Mr. Samuel! I have some suggestions that might improve the performance of KOSMOS-2:
1. Merge it with a depth-map model: this would help the model extract more information from the provided images.
2. Merge it with an upscaling model: this would (also) help the model extract more data (more pixels) from the provided images.
3. Merge it with a text-to-image model through a linear layer: this would (probably) act as a kind of reversed feedback.
And as always, I look forward to your opinion 🙂🙂

younesprog

I think that if they instruction-tune the model using the LRV-Instruction dataset, it will be less prone to hallucinations.

Ez-sedl

I guess filling out all those Captchas is finally paying off (for someone)

cholst