Pre-train with patches for huge compute savings

PATCH-LEVEL TRAINING FOR LARGE LANGUAGE MODELS

My Twitter, LinkedIn, Discord, Patreon, consultation booking page, etc:

Timestamps:
00:00 intro
00:55 background/motivation
06:29 experiments
11:31 scaling trends
13:45 comparing patch sizes
15:21 why it works
16:55 outro
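
For readers who want the gist in code, below is a minimal sketch of how patch-level training might look, written from my reading of the paper rather than from any released code. It assumes patches are formed by mean-pooling the embeddings of K consecutive tokens and that each patch position is trained to predict every token of the next patch; `model.embed`, `model.backbone`, and `model.lm_head` are placeholder names, not the paper's API.

```python
# Minimal sketch of patch-level pre-training (not the authors' code).
# Assumptions: patches are mean-pooled embeddings of K consecutive tokens,
# and each patch position predicts all K tokens of the *next* patch.
import torch
import torch.nn.functional as F

K = 4  # patch size (the video's k=4)

def patch_level_loss(model, token_ids):
    """token_ids: (B, T) with T divisible by K."""
    emb = model.embed(token_ids)                      # (B, T, D) token embeddings
    B, T, D = emb.shape
    patches = emb.view(B, T // K, K, D).mean(dim=2)   # (B, T/K, D) mean-pool into patches
    hidden = model.backbone(patches)                  # same transformer, patch-level inputs
    logits = model.lm_head(hidden)                    # (B, T/K, vocab)

    targets = token_ids.view(B, T // K, K)[:, 1:, :]  # tokens of each *next* patch
    log_probs = F.log_softmax(logits[:, :-1, :], dim=-1)
    # every patch position predicts the K tokens of the next patch;
    # the loss is the average cross-entropy over those K tokens
    return -log_probs.gather(-1, targets).mean()
```

As I understand it, the compute savings come from the patch-level phase processing roughly K times fewer sequence positions; after that phase, training continues at the ordinary token level from the resulting weights. Treat the details above as assumptions until checked against the paper.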
Comments

This made me think about how to build a transformer that learns tokenization automatically (instead of using a fixed k=4).

Take a sequence of bytes / characters and output a split / don't-split decision for each character. Thresholding these decisions gives us a sequence of tokens. This is done by the first transformer.

Where we actually split, we then feed forward / backpropagate through a second transformer that does the actual language modeling task. The loss backpropagates all the way back to each token, and then out to the original characters. Basically, the token-level loss is distributed to the chosen characters in proportion to the magnitude of their split predictions. E.g. if the characters are 'T', 'H', 'E', and the magnitudes are 0.1 for T, 0.1 for H, and 0.9 for E, then the loss backpropagated is 0.1/1.1 for T, 0.1/1.1 for H, and 0.9/1.1 for E, since the word taken / processed by the second transformer was 'THE'. (A rough sketch of this appears below.)

marinepower
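
To make the weighting in the comment above concrete, here is a rough sketch of that learned-tokenization idea. Everything in it is hypothetical: `splitter`, `lm`, and `char_embed_layer` are made-up names, not from the paper or the video, and the hard split decision itself is not differentiable here, so gradients reach the characters only through the score-weighted pooling, mirroring the 0.1/1.1, 0.1/1.1, 0.9/1.1 weights in the 'THE' example.

```python
# Hypothetical sketch of the commenter's two-transformer tokenization idea.
import torch

def segment_and_pool(char_emb, scores, threshold=0.5):
    """char_emb: (T, D) character embeddings; scores: (T,) split scores in [0, 1].
    Each token is a weighted sum of the characters it covers, with weights
    proportional to those characters' split scores (e.g. 0.1/1.1, 0.1/1.1, 0.9/1.1)."""
    token_vecs, start = [], 0
    for t in range(char_emb.size(0)):
        last = t == char_emb.size(0) - 1
        if scores[t] > threshold or last:          # close the current token here
            w = scores[start:t + 1]
            w = w / w.sum()                        # normalize weights within the token
            token_vecs.append((w.unsqueeze(1) * char_emb[start:t + 1]).sum(dim=0))
            start = t + 1
    return torch.stack(token_vecs)

def char_lm_loss(char_ids, char_embed_layer, splitter, lm):
    """char_ids: (T,) character ids. `splitter` is the first (small) transformer
    scoring split / don't-split per character; `lm` is the second transformer,
    assumed to return its language-modeling loss for a (1, num_tokens, D) input."""
    emb = char_embed_layer(char_ids)                   # (T, D)
    scores = torch.sigmoid(splitter(emb)).squeeze(-1)  # (T,) split probabilities
    tokens = segment_and_pool(emb, scores)             # (num_tokens, D)
    return lm(tokens.unsqueeze(0))
```

A real implementation would still need something like a straight-through estimator for the split decisions themselves; the sketch only shows how the loss weighting in the example could be realized.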

Super cool. This makes so much sense. I wonder if it would help with character-level transformers, since during patch training the characters would be grouped. You could maybe use a tiny NLP model to split into patches, e.g. by word, but then ultimately train on raw characters for the final model.

tornyu

How do they get the embedding function E without pre-training at the token level?

winwin-gwrn

Wew, there are a lot of crypto scam bots here in the comments.

jondoe