Python TF2: BERT model | Code your WordPiece - Tokenizer (w/ HuggingFace)

Python TF2 code (w/ JupyterLab) to train your WordPiece tokenizer: Tokenizers are one of the core components of the NLP pipeline. They serve one purpose: to translate text into data that can be processed by the BERT model.

Why would you need a new & improved tokenizer?

That's because Transformer models very often use subword tokenization algorithms, and these algorithms need to be trained to identify the word parts that occur frequently in your input corpus (sentences, paragraphs, documents, ...), i.e. the text you are interested in, in order to build a vocabulary optimized for it.
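
As a rough sketch of what such training looks like with HuggingFace's Tokenizers library (the file name my_corpus.txt and the parameter values below are placeholders, not taken from the video):

from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Build an empty WordPiece model and learn its vocabulary from your own corpus.
tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.WordPieceTrainer(
    vocab_size=30_000,   # placeholder vocabulary size
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.train(files=["my_corpus.txt"], trainer=trainer)   # your text data

tokenizer.save("my_wordpiece_tokenizer.json")   # reusable later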

WordPiece is a subword tokenization algorithm quite similar to BPE, used mainly by Google in models like BERT. It tries to match long words first and splits a word into multiple tokens only when the whole word is not in the vocabulary. This differs from BPE, which starts from single characters and builds the largest tokens it can.
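
For example, with the standard pretrained bert-base-uncased tokenizer (shown here only to illustrate the splitting behaviour; the exact pieces depend on the vocabulary you train):

from transformers import BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-uncased")
# A word that is not in the vocabulary as a whole is split into pieces;
# continuation pieces carry the "##" WordPiece prefix.
print(tok.tokenize("tokenization"))   # e.g. ['token', '##ization']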

Be aware: Training a tokenizer is not (!) the same as training a DL BERT model.
All w/ TensorFlow2 code (JupyterLab).

A special case: a newly trained WordPiece tokenizer (see also HuggingFace's Tokenizers library)

#Tokenizer
#HuggingFace
#WordPiece

00:00 WordPiece Tokenizer
04:20 WordPiece model for Tokenizer
09:25 Train your WordPiece Tokenizer
13:35 Encode your sentences
17:00 Save your new Tokenizer
19:15 Use your new Tokenizer
23:10 Output (hidden layers)
24:53 Example for sentiment analysis
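
The chapters above map roughly onto the following workflow. This is only a minimal sketch, assuming the tokenizer file trained earlier (my_wordpiece_tokenizer.json) and standard pretrained BERT weights; a vocabulary that differs from the original one would normally require (re-)training the model's embeddings:

from transformers import PreTrainedTokenizerFast, TFBertModel

# Wrap the trained tokenizer file so transformers can use it,
# declaring BERT's special tokens explicitly.
tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="my_wordpiece_tokenizer.json",
    unk_token="[UNK]", pad_token="[PAD]", cls_token="[CLS]",
    sep_token="[SEP]", mask_token="[MASK]",
)

# Encode a sentence as TensorFlow tensors (input_ids, attention_mask, ...).
enc = tokenizer("The movie was surprisingly good", return_tensors="tf")

# Feed the encoding to a TF2 BERT model and read out the hidden states,
# e.g. as features for a downstream sentiment-analysis classifier.
model = TFBertModel.from_pretrained("bert-base-uncased")
outputs = model(enc)
print(outputs.last_hidden_state.shape)   # (1, sequence_length, 768)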
Comments

Hello, I am getting an error on:

optimizer = AdamW(model.parameters(), lr=5e-5)

AttributeError: 'TFBertModel' object has no attribute 'parameters'

shamnaseer
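
A likely cause: model.parameters() and AdamW are the PyTorch pattern, while TFBertModel is a tf.keras.Model. A rough TF2 equivalent (sketch, not from the video) would be:

import tensorflow as tf
from transformers import TFBertModel

model = TFBertModel.from_pretrained("bert-base-uncased")

# Keras models expose model.trainable_variables instead of .parameters(),
# and take a Keras optimizer rather than torch.optim.AdamW.
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)

# Then either compile/fit the Keras way, or apply gradients manually:
# grads = tape.gradient(loss, model.trainable_variables)
# optimizer.apply_gradients(zip(grads, model.trainable_variables))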