Flamingo: a Visual Language Model for Few-Shot Learning

DeepMind's Flamingo model was introduced in the paper "Flamingo: a Visual Language Model for Few-Shot Learning" by J.-B. Alayrac et al. (NeurIPS 2022). This video walks through the work in detail.

Timestamps:
00:00 - Flamingo: a Visual Language Model for Few-Shot Learning
00:21 - Outline
01:10 - Motivation
04:46 - Challenges for multimodal generative modelling
07:42 - Related Work
14:23 - Flamingo Model
17:26 - Vision encoder: pixels to features
18:48 - Vision encoder details
21:41 - Perceiver resampler
23:30 - Conditioning the language model
25:31 - Per-image/video attention masking
29:01 - Flamingo - training data
32:32 - Flamingo training objective
33:16 - Task adaptation with few-shot in-context learning
35:24 - Few-shot in-context learning details
40:06 - Flamingo models
41:51 - Few-shot evaluation benchmarks
44:23 - Flamingo: dataset deduplication
46:53 - Flamingo: nuts and bolts training details
50:17 - Few-shot: comparison to SotA
53:50 - Few-shot: further analysis
59:07 - Contrastive pretraining: zero-shot retrieval
59:58 - Fine-tuning Flamingo
01:01:58 - Ablation studies
01:12:50 - Qualitative results
01:17:43 - Qualitative results - dialogue
01:21:33 - Qualitative results - video
01:22:16 - Qualitative results - more videos
01:22:36 - Flamingo limitations
01:24:58 - Flamingo failures: hallucinations/ungrounded guesses
01:25:52 - Trade-offs of few-shot learning methods
01:29:07 - Flamingo opportunities
01:30:25 - Flamingo benefits
01:31:29 - Flamingo risks and mitigation strategies
01:35:01 - Summary

Particular thanks to Antoine Miech for his help in clarifying several details of the work.

For related content:

For (optional) coffee donations:
Comments:

Great explanation of Flamingo. Probably the best currently on YouTube.

piratepartyftw

The explanation was excellent. Well done, sir.

sajjadayobi

Very nice and detailed digest! Awesome! May I ask how you make these beautiful slides?

jiaruixu

Excellent video! Two questions for you: (1) How exactly are the learned latent arrays being learned? Are they using some kind of clustering algorithm to learn a reduced dimensional representation of the flattened input features (from the Vision Encoder) + temporal encodings (Xf)? If so, what clustering algorithm do they use to do this? (2) The diagram of the Perceiver Resampler (pg. 11) seems to suggest that each frame (e.g., t=0, t=1, t=2, etc.) is processed through a separate Vision Encoder. Is this a correct understanding?

robboswell
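
On the latent-array question above: the learned latents in the Perceiver Resampler are ordinary trainable parameters, optimised end-to-end by backpropagation together with the rest of the model; no clustering is involved. And each frame does pass through the vision encoder separately, but it is one encoder with shared weights rather than a separate encoder per frame. A minimal PyTorch-style sketch of the idea (hypothetical names, single block, not the official implementation):

```python
import torch
import torch.nn as nn

class PerceiverResamplerSketch(nn.Module):
    # Minimal single-block sketch (hypothetical, not the official implementation).
    def __init__(self, dim=1024, num_latents=64, num_heads=8):
        super().__init__()
        # The "learned latent array": a plain trainable parameter, updated by
        # gradient descent with the rest of the network -- no clustering.
        self.latents = nn.Parameter(0.02 * torch.randn(num_latents, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffw = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, visual_feats):
        # visual_feats: (batch, T*S, dim) -- per-frame features from one shared
        # vision encoder (the same weights applied to every frame), flattened
        # over time and space, with temporal encodings already added.
        b = visual_feats.size(0)
        x = self.latents.unsqueeze(0).expand(b, -1, -1)   # (batch, num_latents, dim)
        # As in the paper, keys/values are the visual features concatenated
        # with the latent queries themselves.
        kv = torch.cat([visual_feats, x], dim=1)
        x = x + self.cross_attn(x, kv, kv, need_weights=False)[0]
        x = x + self.ffw(x)
        return x   # a fixed number of visual tokens, independent of input length
```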

I do understand how next-token prediction works, but I don't understand how likelihood computations (and rankings) can be done. Is it based on the per-token softmax computations?

Sciencehub-oqgo
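
On the likelihood question above: yes, for close-ended tasks the candidate completions are ranked by their log-likelihood under the model, i.e. the sum of the per-token log-softmax scores of the answer tokens given the prompt (and the in-context examples), and the highest-scoring candidate is taken as the prediction. A minimal sketch of that ranking, assuming a decoder-style model that returns per-position logits; the helper name and interface are hypothetical:

```python
import torch
import torch.nn.functional as F

def rank_by_loglikelihood(model, prompt_ids, candidate_ids_list):
    # Hypothetical helper (not Flamingo's code): score each candidate answer by
    # the sum of per-token log-softmax probabilities the model assigns to its
    # tokens given the prompt, then return the index of the best candidate.
    scores = []
    for cand_ids in candidate_ids_list:
        ids = torch.cat([prompt_ids, cand_ids])            # prompt followed by candidate
        logits = model(ids.unsqueeze(0)).squeeze(0)        # (seq_len, vocab_size)
        log_probs = F.log_softmax(logits, dim=-1)
        start = prompt_ids.size(0)
        # Logits at position i predict token i+1, so the candidate tokens
        # ids[start:] are scored by positions start-1 ... seq_len-2.
        token_lp = log_probs[start - 1 : ids.size(0) - 1].gather(
            1, ids[start:].unsqueeze(1)
        ).squeeze(1)
        scores.append(token_lp.sum())                      # candidate log-likelihood
    return int(torch.stack(scores).argmax())
```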

Awesome. Can the slides be shared? I'd really like to go through them multiple times. :D

mohammadmahdiderakhshani