Simple Diffusion Language Models

Short tutorial on text diffusion.

* Simplified and Generalized Masked Diffusion for Discrete Data

Errata:

* 7:32: I say q is ‘denoising’ but I meant ‘noising’ (a sketch of the masking process follows the errata).
* 9:16 - 10:03: There’s a term missing in the loss. See the paper for the full version, which uses slightly different notation.
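
To make the first erratum concrete, here is a rough Python sketch of the absorbing-state (masking) forward process q: each token is independently replaced by a mask token with a probability that grows with the noise level t. The linear schedule and the MASK_ID placeholder are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

MASK_ID = -1  # hypothetical id for the [MASK] token

def q_noise(tokens: np.ndarray, t: float, rng: np.random.Generator) -> np.ndarray:
    """Noise a sequence by independently masking each token.

    At t = 0 nothing is masked; at t = 1 everything is masked.
    Uses a linear schedule alpha_t = 1 - t (an assumption for this sketch).
    """
    keep = rng.random(tokens.shape) < (1.0 - t)  # keep each token with prob alpha_t
    return np.where(keep, tokens, MASK_ID)

rng = np.random.default_rng(0)
x0 = np.array([12, 7, 99, 3, 41])
print(q_noise(x0, t=0.5, rng=rng))  # roughly half the tokens become MASK_ID
```

The denoising model is then trained to predict the original tokens at the masked positions, which is where the loss discussed in the second erratum comes in.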
Comments

@srush_nlp Great explanation! How do you think discrete diffusion models should be modified to enable long-context sequence generation comparable to LLMs?

ASarkar-ML

Really cool stuff! It’s a shame it’s not quite at the level of autoregressive models (especially for DNA), but I’m excited about future work in the field. Love the explanation; it made the paper much more digestible.

sarthak-ti

I made a retrieval-based chatbot from scratch (I'm not a professional). The main component was compressing the vocabulary with synonyms and training the model on the compressed vocabulary so it groks faster. I have a feeling that approach would allow for very small and intelligent models. What do you think about compressing the vocabulary?

MagusArtStudios
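
A minimal sketch of the vocabulary-compression idea from the comment above; the synonym map and tokens are made up for illustration, and a real system would build the map from a thesaurus or embedding clusters.

```python
# Hypothetical synonym map: every surface form points to one canonical token.
synonym_to_canonical = {
    "large": "big", "huge": "big",
    "quick": "fast", "rapid": "fast",
}

def compress(tokens: list[str]) -> list[str]:
    """Map each token to its canonical synonym, shrinking the effective vocabulary."""
    return [synonym_to_canonical.get(tok, tok) for tok in tokens]

print(compress(["a", "huge", "and", "rapid", "model"]))
# ['a', 'big', 'and', 'fast', 'model']
```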

Why do this by masking and unmasking whole tokens or words? Why not pretrain some kind of latent space for each token/word and then do the diffusion in the latent space? Then the diffusion becomes much simpler. Of course you still need to convert from the latent space into the best token/word after that, but that should be relatively straightforward as well.

jrkirby

Making this process discrete seems very strange to me. Why not noise the token embeddings themselves? E.g., at pure noise levels a given token embedding is a mixture of the embeddings of all tokens, and at zero noise it is a one-hot vector as usual. And as you do diffusion you can update this token-embedding probability space, since you have the logits.

After n inference steps you will probably end up with positions that don't converge to a single token but instead map to some subset of tokens that are roughly equivalent in semantic space, so you can just sample from that distribution based on the final logits. Tokens you've already generated will be one-hot; noised tokens will be blended as described.

marinepower
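
A rough sketch of the embedding-space idea from the comment above: each position holds a probability distribution over the vocabulary, its embedding is the probability-weighted mixture of token embeddings, and the final output is sampled from that distribution. All sizes and values below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim, seq_len = 50, 16, 4
embedding_table = rng.normal(size=(vocab_size, dim))

# Per-position distributions: pure noise is uniform, a decoded token is one-hot.
probs = np.full((seq_len, vocab_size), 1.0 / vocab_size)
probs[0] = np.eye(vocab_size)[7]  # position 0 already resolved to token 7

# Blended embeddings fed to the model at this noise level.
blended = probs @ embedding_table  # shape (seq_len, dim)

# After the last step, sample each still-blended position from its distribution.
samples = [int(rng.choice(vocab_size, p=p)) for p in probs]
print(blended.shape, samples)
```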

Really good video. I have to improve my math, but I get the general idea. Will try to implement it.

john_olu

Really liked it! This could work better on ARC.

wwkk