Flamingo: a Visual Language Model for Few-Shot Learning

DeepMind's Flamingo model was introduced in the paper "Flamingo: a Visual Language Model for Few-Shot Learning" by J.-B. Alayrac et al. (NeurIPS 2022). This video walks through the work in detail.

Timestamps:
00:00 - Flamingo: a Visual Language Model for Few-Shot Learning
00:21 - Outline
01:10 - Motivation
04:46 - Challenges for multimodal generative modelling
07:42 - Related Work
14:23 - Flamingo Model
17:26 - Vision encoder: pixels to features
18:48 - Vision encoder details
21:41 - Perceiver resampler
23:30 - Conditioning the language model
25:31 - Per-image/video attention masking
29:01 - Flamingo - training data
32:32 - Flamingo training objective
33:16 - Task adaptation with few-shot in-context learning
35:24 - Few-shot in-context learning details
40:06 - Flamingo models
41:51 - Few-shot evaluation benchmarks
44:23 - Flamingo: dataset deduplication
46:53 - Flamingo: nuts and bolts training details
50:17 - Few-shot: comparison to SotA
53:50 - Few-shot: further analysis
59:07 - Contrastive pretraining: zero-shot retrieval
59:58 - Fine-tuning Flamingo
01:01:58 - Ablation studies
01:12:50 - Qualitative results
01:17:43 - Qualitative results - dialogue
01:21:33 - Qualitative results - video
01:22:16 - Qualitative results - more videos
01:22:36 - Flamingo limitations
01:24:58 - Flamingo failures: hallucinations/ungrounded guesses
01:25:52 - Trade-offs of few-shot learning methods
01:29:07 - Flamingo opportunities
01:30:25 - Flamingo benefits
01:31:29 - Flamingo risks and mitigation strategies
01:35:01 - Summary

Particular thanks to Antoine Miech for his help in clarifying several details of the work.

For related content:

For (optional) coffee donations:
Comments:

Great explanation of Flamingo. Probably the best currently on YouTube.

piratepartyftw

The explanation was excellent. Well done, sir.

sajjadayobi

Very nice and detailed digest! Awesome! May I ask how you make these beautiful slides?

jiaruixu

Excellent video! Two questions for you: (1) How exactly are the learned latent arrays being learned? Are they using some kind of clustering algorithm to learn a reduced dimensional representation of the flattened input features (from the Vision Encoder) + temporal encodings (Xf)? If so, what clustering algorithm do they use to do this? (2) The diagram of the Perceiver Resampler (pg. 11) seems to suggest that each frame (e.g., t=0, t=1, t=2, etc.) is processed through a separate Vision Encoder. Is this a correct understanding?

robboswell
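
On the latent-array question above: the learned latents in the Perceiver Resampler are ordinary trainable parameters, optimised end-to-end by backpropagation together with the rest of the model; no clustering is involved. And each frame does pass through the vision encoder separately, but it is one encoder with shared weights rather than a separate encoder per frame. A minimal PyTorch-style sketch of the idea (hypothetical names, single block, not the official implementation):

```python
import torch
import torch.nn as nn

class PerceiverResamplerSketch(nn.Module):
    # Minimal single-block sketch (hypothetical, not the official implementation).
    def __init__(self, dim=1024, num_latents=64, num_heads=8):
        super().__init__()
        # The "learned latent array": a plain trainable parameter, updated by
        # gradient descent with the rest of the network -- no clustering.
        self.latents = nn.Parameter(0.02 * torch.randn(num_latents, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffw = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, visual_feats):
        # visual_feats: (batch, T*S, dim) -- per-frame features from one shared
        # vision encoder (the same weights applied to every frame), flattened
        # over time and space, with temporal encodings already added.
        b = visual_feats.size(0)
        x = self.latents.unsqueeze(0).expand(b, -1, -1)   # (batch, num_latents, dim)
        # As in the paper, keys/values are the visual features concatenated
        # with the latent queries themselves.
        kv = torch.cat([visual_feats, x], dim=1)
        x = x + self.cross_attn(x, kv, kv, need_weights=False)[0]
        x = x + self.ffw(x)
        return x   # a fixed number of visual tokens, independent of input length
```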

I do understand how next-token prediction works, but I don't understand how likelihood computations (and rankings) can be done. Is it based on the per-token softmax computations?

Sciencehub-oqgo
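
On the likelihood question above: yes, for close-ended tasks the candidate completions are ranked by their log-likelihood under the model, i.e. the sum of the per-token log-softmax scores of the answer tokens given the prompt (and the in-context examples), and the highest-scoring candidate is taken as the prediction. A minimal sketch of that ranking, assuming a decoder-style model that returns per-position logits; the helper name and interface are hypothetical:

```python
import torch
import torch.nn.functional as F

def rank_by_loglikelihood(model, prompt_ids, candidate_ids_list):
    # Hypothetical helper (not Flamingo's code): score each candidate answer by
    # the sum of per-token log-softmax probabilities the model assigns to its
    # tokens given the prompt, then return the index of the best candidate.
    scores = []
    for cand_ids in candidate_ids_list:
        ids = torch.cat([prompt_ids, cand_ids])            # prompt followed by candidate
        logits = model(ids.unsqueeze(0)).squeeze(0)        # (seq_len, vocab_size)
        log_probs = F.log_softmax(logits, dim=-1)
        start = prompt_ids.size(0)
        # Logits at position i predict token i+1, so the candidate tokens
        # ids[start:] are scored by positions start-1 ... seq_len-2.
        token_lp = log_probs[start - 1 : ids.size(0) - 1].gather(
            1, ids[start:].unsqueeze(1)
        ).squeeze(1)
        scores.append(token_lp.sum())                      # candidate log-likelihood
    return int(torch.stack(scores).argmax())
```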

Awesome. Can the slides be shared? I'd really like to go through them multiple times. :D

mohammadmahdiderakhshani