Build a Custom Transformer Tokenizer - Transformers From Scratch #2

How can we build our own custom transformer models?

Maybe we'd like our model to understand a less common language. How many transformer models out there have been trained on Piemontese or the Nahuatl languages?

In that case, we need to do something different. We need to build our own model - from scratch.

In this video, we'll learn how to use Hugging Face's tokenizers library to build our own custom transformer tokenizer.
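For reference, here's a minimal sketch of the kind of training step the video walks through, using the tokenizers library's ByteLevelBPETokenizer. The data path, vocabulary size, and output directory ("bertius", borrowed from a comment below) are assumptions rather than the video's exact values:

```python
from pathlib import Path
from tokenizers import ByteLevelBPETokenizer

# Gather plain-text training files (path is a placeholder).
paths = [str(p) for p in Path("./data/text").glob("*.txt")]

# Initialize and train a byte-level BPE tokenizer.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=paths,
    vocab_size=30_522,  # assumed size, tune for your corpus
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# save_model needs an existing directory; it writes vocab.json and merges.txt.
Path("./bertius").mkdir(exist_ok=True)
tokenizer.save_model("./bertius")
```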

---

🤖 70% Discount on the NLP With Transformers in Python course:

📙 Medium article:

📖 If membership is too expensive - here's a free link:

👾 Discord

🕹️ Free AI-Powered Code Refactoring with Sourcery:
Comments

This video is all I need.
I was searching for this content the whole day and finally found it.
The internet is such a blessing and fascinates me sometimes.

LokeshSharma-mepg

Thank you for this video. Just one query: does it support tokenization only for Latin-related languages, or can we do it for any other language or script?

ajitkumar

Thank you for kindly explaining the video. I wonder what program you are using. It seems like you can see several autocomplete options while typing. Is that JupyterLab, or something else?

hjpark

I've been able to save the tokenizer locally (merges and vocab files); however, when I come to initialise them (tokenizer = ...) I get an OSError, even though both vocab and merges files are in the directory. Any ideas why this would happen? Otherwise, great set of tutorials :)

fgbcior
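One likely cause of that OSError, offered as a hedged guess: from_pretrained expects the files inside the checkpoint directory to be named exactly vocab.json and merges.txt (not merge.txt). A sketch, assuming the ./bertius directory used above:

```python
from transformers import RobertaTokenizer

# Option 1: point from_pretrained at the directory; the files inside
# must be named exactly vocab.json and merges.txt.
tokenizer = RobertaTokenizer.from_pretrained("./bertius")

# Option 2: pass the two files explicitly, bypassing the name lookup.
tokenizer = RobertaTokenizer(
    vocab_file="./bertius/vocab.json",
    merges_file="./bertius/merges.txt",
)
```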

How can I get those files from the local disk?

hemanthkumar-tjhs
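If the goal is to load the trained files straight from disk with the tokenizers library rather than transformers, a minimal sketch (paths are assumptions):

```python
from tokenizers import ByteLevelBPETokenizer

# Rebuild the tokenizer from the saved vocab and merges files.
tokenizer = ByteLevelBPETokenizer(
    "./bertius/vocab.json",
    "./bertius/merges.txt",
)
print(tokenizer.encode("hello world").tokens)
```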

Hello, can I use this tokenizer to train my own XLM-R model? :v

thangphanchau
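One caveat worth noting: XLM-R uses a SentencePiece Unigram tokenizer rather than byte-level BPE, so a tokenizer trained in that format is the better match. A hedged sketch with the tokenizers library's SentencePieceUnigramTokenizer (the corpus path and vocab size are assumptions):

```python
from tokenizers import SentencePieceUnigramTokenizer

# Train a Unigram tokenizer, the scheme XLM-R expects.
tokenizer = SentencePieceUnigramTokenizer()
tokenizer.train(
    files=["./data/text/corpus.txt"],  # placeholder corpus
    vocab_size=30_000,
    special_tokens=["<s>", "</s>", "<unk>", "<pad>", "<mask>"],
)

# Serialize the full tokenizer to a single JSON file.
tokenizer.save("./xlmr-tokenizer.json")
```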

Hi James, what BPE should I use for English tokenizing? It seems that following this tokenization makes tokens for Latin. Thank you in advance!

etherealshift
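Short answer, hedged: byte-level BPE is language-agnostic, so the same ByteLevelBPETokenizer works for English; the tokens simply reflect whatever text it was trained on. A tiny self-contained check with an in-memory English corpus (all values here are illustrative):

```python
from tokenizers import ByteLevelBPETokenizer

# Train on a toy English corpus; English subwords emerge in the vocab.
corpus = ["the quick brown fox", "the lazy dog", "quick thinking"] * 100
tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(corpus, vocab_size=500, min_frequency=2)

# "Ġ" in the output marks a leading space in byte-level BPE.
print(tokenizer.encode("the quick fox").tokens)
```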

Hi James. Your tutorials are f... amazing. I tested it over a very small vocabulary (only 1K). When I checked merges.txt (see @ 9:55), I found (ra zione), (la zione), (ca zione) and (ta zione), meaning four separated pairs, which joined would form (razione), (lazione), (cazione), and (tazione). Then I checked inside vocab.json and found (zione), (razione), (lazione), (cazione) and (tazione), each with its own ID. Does that mean merges.txt indicates pieces of strings that are joined to create another, larger string?? Thanks again. Sincerely, F. Andutta.

fernandoandutta
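That reading is correct: each line of merges.txt is a rule "A B" saying that adjacent tokens A and B merge into AB, applied from the top of the file, which is exactly how (ra zione) yields razione. A toy, simplified illustration in plain Python (real BPE selects merges by learned priority, but the composition idea is the same; the rules below are hypothetical):

```python
# Toy merge rules in priority order (hypothetical values).
merges = [("zi", "one"), ("ra", "zione")]

def apply_merges(tokens, merges):
    # Apply each rule left-to-right wherever adjacent tokens match it.
    for a, b in merges:
        i = 0
        while i < len(tokens) - 1:
            if tokens[i] == a and tokens[i + 1] == b:
                tokens[i:i + 2] = [a + b]  # join the pair into one token
            else:
                i += 1
    return tokens

print(apply_merges(["ra", "zi", "one"], merges))  # ['razione']
```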

Worth asking if this is truly from scratch with so many imports

MVR_

file bertius/config.json not found
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'BertTokenizer'.
The class this function is called from is 'RobertaTokenizer'.

Getting the above error while using ...

parmeetsingh
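A hedged note on that warning: AutoTokenizer (and some older from_pretrained paths) consults config.json / tokenizer_config.json to choose the tokenizer class, and a directory containing only vocab.json and merges.txt can make it fall back to the wrong one. Loading the concrete class that matches the saved files usually clears it, and re-saving writes a consistent config. A sketch, assuming the ./bertius directory:

```python
from transformers import RobertaTokenizer

# Byte-level BPE files (vocab.json + merges.txt) pair with RobertaTokenizer,
# so load that class directly instead of letting the library guess.
tokenizer = RobertaTokenizer.from_pretrained("./bertius")

# Re-saving writes tokenizer_config.json, so later loads pick the right class.
tokenizer.save_pretrained("./bertius")
```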