Python TF2: BERT model | Code your WordPiece - Tokenizer (w/ HuggingFace)
Python TF2 code (w/ JupyterLab) to train your WordPiece tokenizer: Tokenizers are one of the core components of the NLP pipeline. They serve one purpose: to translate text into data that can be processed by the BERT model.
Why would you need a new & improved tokenizer?
That's because Transformer models very often use subword tokenization algorithms, and these need to be trained to identify the word parts that occur frequently in your input corpus (sentences, paragraphs, documents, ...), i.e. the sentences you are interested in, in order to build your optimized vocabulary.
WordPiece is a subword tokenization algorithm quite similar to BPE and is used mainly by Google in models like BERT. It tries to keep whole words first, splitting them into multiple tokens only when the whole word does not exist in the vocabulary. This differs from BPE, which starts from characters and merges them into larger tokens where possible.
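A minimal sketch of this training step with HuggingFace's tokenizers library; the corpus file name ("my_corpus.txt") and the vocabulary size are placeholder assumptions, not the exact values used in the video:

from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.trainers import WordPieceTrainer
from tokenizers.pre_tokenizers import Whitespace

# Start from an empty WordPiece model with BERT's usual unknown token.
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Train on your own text corpus to build the optimized subword vocabulary.
trainer = WordPieceTrainer(
    vocab_size=30000,  # assumed size, adjust to your corpus
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
tokenizer.train(files=["my_corpus.txt"], trainer=trainer)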
Be aware: Training a tokenizer is not (!) the same as training a DL BERT model.
All w/ TensorFlow2 code (JupyterLab).
A special case of a newly trained WordPiece tokenizer (see also HuggingFace's Tokenizers library)
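Once trained, the tokenizer can encode your sentences and be saved to a single JSON file; the example sentence and the file name in this sketch are assumptions, not the ones from the video:

# Encode a sentence: subword tokens plus the integer ids fed to the model.
encoding = tokenizer.encode("Tokenizers translate text into data for BERT.")
print(encoding.tokens)  # subword pieces, continuations prefixed with "##"
print(encoding.ids)     # integer ids

# Persist vocabulary + configuration as one JSON file.
tokenizer.save("my-wordpiece.json")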
#Tokenizer
#HuggingFace
#WordPiece
00:00 WordPiece Tokenizer
04:20 WordPiece model for Tokenizer
09:25 Train your WordPiece Tokenizer
13:35 Encode your sentences
17:00 Save your new Tokenizer
19:15 Use your new Tokenizer
23:10 Output (hidden layers)
24:53 Example for sentiment analysis
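A rough sketch of the last chapters (use your new tokenizer, hidden-layer output, sentiment analysis): the saved tokenizer can be wrapped as a PreTrainedTokenizerFast and fed to a TF2 BERT model. The bert-base-uncased checkpoint and the 2-label head are illustrative assumptions; remember that a newly trained vocabulary only lines up with a checkpoint's embeddings after you train or fine-tune the model on it.

import tensorflow as tf
from transformers import (PreTrainedTokenizerFast, TFBertModel,
                          TFBertForSequenceClassification)

# Wrap the saved tokenizers JSON so it behaves like a regular HF tokenizer.
fast_tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="my-wordpiece.json",  # file saved in the previous step
    unk_token="[UNK]", cls_token="[CLS]", sep_token="[SEP]",
    pad_token="[PAD]", mask_token="[MASK]",
)
inputs = fast_tokenizer(["I really liked this movie."],
                        return_tensors="tf", padding=True)

# Hidden states from a BERT encoder (checkpoint name is an assumption).
bert = TFBertModel.from_pretrained("bert-base-uncased")
print(bert(inputs).last_hidden_state.shape)  # (batch, seq_len, hidden_size)

# Example for sentiment analysis: BERT with a 2-class classification head.
classifier = TFBertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)
print(tf.nn.softmax(classifier(inputs).logits, axis=-1))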