BERT Document Classification Tutorial with Code

==== Free Course & Notebook ====

Learn how to fine-tune BERT for document classification. We'll be using the Wikipedia Personal Attacks benchmark as our example.
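If you just want the shape of the code, here is a minimal sketch of the fine-tuning setup. It uses the current Hugging Face transformers API (which may differ slightly from the notebook), and the two example comments and labels below are placeholders:

import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

texts = ["You are an idiot.", "Thanks for the helpful edit!"]  # placeholder comments
labels = torch.tensor([1, 0])                                  # 1 = personal attack, 0 = not

# Tokenize, pad/truncate to a fixed length, and build attention masks in one call.
batch = tokenizer(texts, padding="max_length", truncation=True,
                  max_length=128, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
outputs = model(**batch, labels=labels)  # forward pass returns loss and logits
outputs.loss.backward()                  # one training step on the toy batch
optimizer.step()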

Bonus - In Part 3, we'll also look briefly at how we can apply BERT to search for "semantically similar" comments in the dataset.
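As a rough preview of that part, one way to do the similarity search is to embed each comment with BERT and rank by cosine similarity. This sketch assumes a recent transformers version and mean-pools the last hidden state; the comments and query are made up:

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

comments = ["You are a complete moron.",
            "Please stop vandalizing this page.",
            "Great work on the article!"]
query = "This user keeps insulting everyone."

def embed(text):
    # Mean-pool the last hidden state over the tokens of the text.
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        out = model(**enc)
    return out.last_hidden_state.mean(dim=1).squeeze(0)

query_vec = embed(query)
sims = [(c, torch.cosine_similarity(query_vec, embed(c), dim=0).item()) for c in comments]
print(sorted(sims, key=lambda x: -x[1]))  # most similar comments first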

==== Pre-reqs ====
This tutorial builds on my “BERT Fine-Tuning Tutorial with PyTorch”. If you want to learn more about the basics of fine-tuning BERT, check it out!

==== References ====
Here are the links for the dataset (these are also provided in my notebook):

==== Updates ====
==== Comments ====

Your explanation is amazing. What's more amazing is your voice. Wow :)

chethanbabu

Thank you Chris, I had been looking for this semantic search part for the last two weeks and you saved me. I have a lot of data for classification but not for semantic search, and you explained exactly what I needed.

VijayMauryavm

These tutorials are gold. Please keep posting.

kazakx

Your tutorial is easy to follow and somehow entertaining. Loved the dataset used here!

beatlekim

Thank you! I do think the F1 score is a better overall metric than ROC/AUC.

harisjaved

I guess when we pass max_length to the encoder, it will take care of padding by itself? Is the attention mask also taken care of by the encoder?

pratik
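For what it's worth, in recent versions of the transformers tokenizer a single call can both pad to max_length and return the attention mask; a quick check:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
enc = tokenizer("I have a brown cat",
                padding="max_length", truncation=True, max_length=10,
                return_attention_mask=True)
print(enc["input_ids"])       # token ids, padded with 0s up to length 10
print(enc["attention_mask"])  # 1 for real tokens, 0 for the padding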

Can I use a BERT model for classifying audio files?

zaynabmuneef

Can you apply quantization to the BERT model to reduce its size for this task? These models are huge.

cbrao
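Dynamic quantization in PyTorch is one option worth trying; the sketch below (not from the tutorial) converts BERT's Linear layers to int8 and compares the serialized sizes:

import os
import torch
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model.eval()

# Post-training dynamic quantization: only the nn.Linear weights become int8.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8)

def size_mb(m):
    # Serialize the state dict to disk and measure the file size.
    torch.save(m.state_dict(), "tmp.pt")
    mb = os.path.getsize("tmp.pt") / 1e6
    os.remove("tmp.pt")
    return mb

print("original: %.1f MB, quantized: %.1f MB" % (size_mb(model), size_mb(quantized)))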

Can you do a tutorial on using BioBERT, or some other BERT variant, for medical NER?

at

Where is the Colab link for this document classification task? Can you please share it?

ammaarahmad

Why am I getting encoded_layers as a str object??

tlpunisher

Does Hugging Face BERT allow training on TPU?

TechVizTheDataScienceGuy

Hi, it is an amazing tutorial, but I have a question about truncating text. If a text is longer than max_length (e.g., the 5-token text "I have a brown cat" with max_length = 3), which split should I make? (a) [("I have a"), ("brown cat [PAD]")] or (b) [("I have a"), ("have a brown"), ("a brown cat")]? Which one is better?

junhyuklee
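A common answer is something like option (b): overlapping windows (a "stride"), so that context at the chunk boundaries isn't lost, with the per-chunk predictions pooled afterwards. A toy sketch, with made-up chunk sizes:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer.tokenize("I have a brown cat")   # pretend this is a long text

max_len, stride = 3, 1                               # toy chunk size and overlap
step = max_len - stride
chunks = [tokens[i:i + max_len] for i in range(0, len(tokens), step)]
print(chunks)  # [['i', 'have', 'a'], ['a', 'brown', 'cat'], ['cat']]
# Classify each chunk separately, then pool (e.g., average) the chunk predictions.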

A question: I think the [SEP] token may be truncated out of the input if we simply call the function pad_sequences on the original input_ids. Am I correct? Thank you.

hduanacduan
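That concern is reasonable: truncating after the special tokens have been added can cut off the final [SEP]. One way around it (assuming the transformers tokenizer) is to let encode_plus truncate before it appends [CLS] and [SEP]:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
enc = tokenizer.encode_plus("a very long comment " * 100,
                            truncation=True, max_length=64,
                            padding="max_length")
ids = enc["input_ids"]
print(ids[0] == tokenizer.cls_token_id)   # True: [CLS] kept at the start
print(tokenizer.sep_token_id in ids)      # True: [SEP] survives the truncation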

I have followed your steps. I want to produce only word embeddings (unsupervised). Could you show how?

cendradevayanaputra
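Getting contextual word embeddings without any fine-tuning only needs the base BertModel; a minimal sketch, assuming a recent transformers version:

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

enc = tokenizer("The bank raised interest rates", return_tensors="pt")
with torch.no_grad():
    out = model(**enc)

token_vectors = out.last_hidden_state[0]   # [num_tokens, 768], contextual embeddings
tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
for tok, vec in zip(tokens, token_vectors):
    print(tok, vec.shape)                   # each token maps to a 768-dim vector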

In a future video, can you talk more about what the hidden state output looks like? How should I interpret the dimensions, and what is the index of the [CLS] token for each sentence?

pratik
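For reference, a small sketch of the shapes (assuming a recent transformers version): last_hidden_state is [batch_size, padded_length, hidden_size], and because [CLS] is always the first token, its vector sits at index 0 of every sequence:

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

batch = tokenizer(["first sentence", "a second, longer sentence"],
                  padding=True, return_tensors="pt")
with torch.no_grad():
    out = model(**batch)

print(out.last_hidden_state.shape)             # (batch, padded_length, 768), e.g. [2, 7, 768]
cls_vectors = out.last_hidden_state[:, 0, :]   # the [CLS] embedding of each sentence
print(cls_vectors.shape)                       # torch.Size([2, 768])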

Let's say we need to perform sentiment analysis on a document. This approach (truncation) might work if the sentiment does not change throughout the document, but if classifying the sentiment requires reading the whole document, then it is not the best way to do it.

alizhadigerov
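One common workaround (not covered in the video) is to classify chunks of the document and pool the chunk-level logits. A sketch, assuming a fine-tuned BertForSequenceClassification checkpoint; the base model and the repeated text below are only stand-ins:

import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# In practice, load your fine-tuned checkpoint here instead of the base model.
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model.eval()

document = "a long review " * 400            # stand-in for a document > 512 tokens
words = document.split()
chunk_logits = []
for i in range(0, len(words), 400):          # rough word-level chunks
    chunk = " ".join(words[i:i + 400])
    enc = tokenizer(chunk, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        chunk_logits.append(model(**enc).logits)

doc_logits = torch.cat(chunk_logits).mean(dim=0)   # pool the chunk predictions
print(doc_logits.softmax(dim=-1))                  # document-level probabilities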

Hi Chris, I have a big doubt: don't you need to do the text cleaning / preprocessing that we usually do for normal NLP tasks, like stemming and stop-word removal, for BERT? I know it can handle punctuation.

Sandeep-sllp
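For what it's worth, BERT's WordPiece tokenizer is trained on raw text, so stemming and stop-word removal are usually skipped; a quick check of how it handles messy input:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# Punctuation, casing (for the uncased model) and inflected forms are handled by
# the tokenizer itself, so no stemming or stop-word removal is done beforehand.
print(tokenizer.tokenize("Don't remove stop-words; BERT handles 'playing' and 'played'!"))
# e.g. ['don', "'", 't', 'remove', 'stop', '-', 'words', ';', 'bert', 'handles', ...]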

Chris, great lecture, as usual. It would be nice to see the results from BERT if you hadn't fine-tuned it. I thought Jacob Devlin's comment was insightful, but it would be interesting to see how much improvement was made by the fine-tuning.

jimcrotinger

What would you suggest if I want to build a question answering model with BERT on long documents (> 512 tokens)?

tomc
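One common approach is the sliding-window ("stride") trick: split the long context into overlapping windows, run the QA model on each, and keep the best-scoring span. A rough sketch, assuming a fast tokenizer and a BertForQuestionAnswering model (untrained here, so the printed answer is meaningless); the question and context are placeholders:

import torch
from transformers import BertTokenizerFast, BertForQuestionAnswering

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForQuestionAnswering.from_pretrained("bert-base-uncased")
model.eval()

question = "What benchmark is used?"
context = "word " * 2000 + "The Wikipedia Personal Attacks benchmark is used. " + "word " * 2000

# Tokenize the (question, context) pair into overlapping 384-token windows.
enc = tokenizer(question, context,
                truncation="only_second", max_length=384, stride=128,
                return_overflowing_tokens=True, padding="max_length",
                return_tensors="pt")
enc.pop("overflow_to_sample_mapping")        # bookkeeping field, not a model input

best = None
for i in range(enc["input_ids"].shape[0]):   # one forward pass per window
    window = {k: v[i:i + 1] for k, v in enc.items()}
    with torch.no_grad():
        out = model(**window)
    score = out.start_logits.max() + out.end_logits.max()
    start, end = out.start_logits.argmax(), out.end_logits.argmax()
    if best is None or score > best[0]:
        best = (score, tokenizer.decode(window["input_ids"][0][start:end + 1]))
print(best[1])   # answer text from the best-scoring window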