BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding


Abstract:
We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT representations can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.
BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE benchmark to 80.4% (7.6% absolute improvement), MultiNLI accuracy to 86.7% (5.6% absolute improvement) and the SQuAD v1.1 question answering Test F1 to 93.2 (1.5% absolute improvement), outperforming human performance by 2.0%.

Authors:
Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova
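
As a concrete illustration of the "one additional output layer" fine-tuning described in the abstract, here is a minimal sketch, assuming the Hugging Face transformers library and the released bert-base-uncased checkpoint (neither is part of the paper itself):

    # Fine-tuning sketch: a pre-trained BERT encoder plus a single classification layer.
    import torch
    from transformers import BertTokenizer, BertForSequenceClassification

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2  # one linear output layer on top of [CLS]
    )

    # A single labeled example; real fine-tuning iterates over a task dataset.
    batch = tokenizer("The movie was surprisingly good.", return_tensors="pt")
    labels = torch.tensor([1])

    out = model(**batch, labels=labels)
    out.loss.backward()  # gradients flow into the new layer AND the pre-trained encoder

The same pattern, with a task-specific head swapped in, covers the question answering and inference tasks mentioned in the abstract.
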
Comments

I watch the ads entirely just to show my support for this amazing channel

teodorflorianafrim

It's so kind of you to introduce these papers to us in such an accessible way. Thanks a lot.

zhangc

“the man went to [MASK] store” makes a lot of sense these days.

kevinnejad
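
The "[MASK]" example above is the paper's masked-language-model input; a quick way to poke at it, assuming the Hugging Face transformers library (the exact completions depend on the checkpoint):

    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model="bert-base-uncased")
    for pred in fill_mask("the man went to [MASK] store."):
        print(pred["token_str"], round(pred["score"], 3))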

This is one of your best NLP videos for me: a very quick but clear recap of language models, RNNs, word vectors, and attention, all to explain the BERT revolution. This is awesome, and I would love a series of recap videos like this. Kudos!

marcobuiani

I like the way he knows he isn't the best at explaining stuff but still tries 110% to explain! Thanks, man, for the amazing papers.

niduttbhuptani

This is my second comment on your videos. I am really thankful to you for creating such an informative video on BERT. Now I can go through the paper with some confidence.

shrikanthsingh

Fantastic overview. Really appreciate your patient and detailed walk-through of the paper.

ramiyer

"the problem is that a character in itself doesn't really have a meaning"

f

sofia.eris.bauhaus

Special thanks for the tokenization detour and the deeper dive into the fine-tuning/evaluation tasks!

Alex-msyd

At 25:10 you're talking about character-level tokens. Does that refer to the "Enriching Word Vectors with Subword Information" paper?

vinayreddy

Thank you. Regarding 10:51: although ELMo concatenates the left and right representations, I think that when a prediction is made through a softmax, the error back-propagated to the left side should be affected by the right side, and vice versa. That said, I understand what you mean by saying they're not that tightly coupled.

tempvariable

Thanks a lot. I searched a lot and read the paper, but I had difficulty understanding it until I watched your video; you make everything easy.

sasna

Very nice explanation. Could you please elaborate on the token embeddings used in BERT? Are they the same 300-dimensional GloVe vectors, or are the embeddings trained from scratch in BERT? How the base embeddings are obtained is something I am not able to understand. Thanks in advance for clarifying.

saurabhgoel
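
On the embedding question just above: BERT does not reuse 300-dimensional GloVe vectors; it learns its own WordPiece vocabulary and embedding matrix from scratch during pre-training. A hedged sketch, assuming the Hugging Face transformers library:

    from transformers import BertTokenizer, BertModel

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertModel.from_pretrained("bert-base-uncased")

    print(tokenizer.tokenize("tokenization"))   # WordPiece subwords, e.g. ['token', '##ization']
    emb = model.get_input_embeddings().weight
    print(emb.shape)                            # torch.Size([30522, 768]) for bert-base: learned from scratch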

Hey, what BERT claims is in fact very similar to the workings of a Transformer encoder layer as described in the "Attention Is All You Need" paper. The encoder sub-model is allowed to peek at future tokens as well.

tuhinmukherjee
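
The point above, that an encoder layer may attend to future tokens, can be seen in a toy self-attention computation (plain PyTorch, no trained weights; just an illustration):

    import torch

    seq_len, d = 5, 8
    x = torch.randn(1, seq_len, d)
    scores = x @ x.transpose(-2, -1) / d ** 0.5        # raw attention scores

    bidirectional = torch.softmax(scores, dim=-1)      # encoder-style: every token sees left and right
    causal = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
    unidirectional = torch.softmax(scores.masked_fill(causal, float("-inf")), dim=-1)

    print(bidirectional[0, 0])    # non-zero weights on future positions
    print(unidirectional[0, 0])   # zero weight on future positions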

Does BERT take fixed-length sequences for the question-and-paragraph task? If not, how is variable-length input handled? Basically, what is the size of the data fed into the network?

thak
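
On the fixed-length question above: the released BERT models have a maximum of 512 positions, and in practice shorter inputs are padded (and longer ones truncated) to a chosen length, with an attention mask marking the real tokens. A sketch, assuming the Hugging Face transformers tokenizer:

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    question = "Who wrote the paper?"
    paragraph = "BERT was introduced by Devlin, Chang, Lee and Toutanova."

    enc = tokenizer(question, paragraph, padding="max_length",
                    truncation=True, max_length=64, return_tensors="pt")
    print(enc["input_ids"].shape)          # torch.Size([1, 64]) regardless of the raw lengths
    print(enc["attention_mask"][0, :20])   # 1 for real tokens, 0 for padding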

Thank you so much for the amazing paper explanation!
Does the part at 16:28 mean they pre-train on the two tasks at the same time (predicting the mask "and" the isNext label), or do they train in order (pre-train on task 1, then on task 2)?

tusov
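
On the question above: the two objectives are optimized together on each batch, not one after the other. A hedged sketch, assuming the Hugging Face transformers library, whose BertForPreTraining head returns the sum of the masked-LM loss and the next-sentence loss:

    import torch
    from transformers import BertTokenizer, BertForPreTraining

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertForPreTraining.from_pretrained("bert-base-uncased")

    enc = tokenizer("the man went to [MASK] store", "he bought a gallon of milk",
                    return_tensors="pt")
    mlm_labels = enc["input_ids"].clone()     # toy labels; real pre-training masks ~15% of tokens
    is_next = torch.tensor([0])               # 0 = sentence B really follows sentence A

    out = model(**enc, labels=mlm_labels, next_sentence_label=is_next)
    print(out.loss)                           # masked-LM loss + next-sentence loss, one backward pass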

I have a question:

When you train a BERT model, let's say for a named-entity recognition task on a sentence like "Subscribe to Pewdiepie", does the BERT model automatically map the words 'Subscribe', 'to', 'Pewdiepie' to its already-trained word embeddings learned from the corpus? If it does, that means the BERT model comes with its huge bag of word embeddings.

fahds
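
Regarding the question above: BERT ships with a WordPiece vocabulary and an input embedding table rather than a bag of whole-word vectors; a rare word like 'Pewdiepie' is typically split into known subwords and then re-encoded in context by the encoder. A sketch, assuming the Hugging Face transformers library:

    from transformers import BertTokenizer, BertModel

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertModel.from_pretrained("bert-base-uncased")

    enc = tokenizer("Subscribe to Pewdiepie", return_tensors="pt")
    print(tokenizer.convert_ids_to_tokens(enc["input_ids"][0]))  # rare words become subword pieces
    hidden = model(**enc).last_hidden_state
    print(hidden.shape)                                          # (1, number_of_subword_tokens, 768), contextual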

Dear Yannic,
Could you please tell me how to fine-tune BERT on a regression task? My data looks like:
input: a sentence about 30 words long
output: a score in [0, 5].
Is it a good idea to use BERT for a dataset like this? I found some documents saying that transfer learning is effective when the new dataset is similar to the source task/dataset.
Thank you!

tamvominh
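
For the regression question above, a hedged sketch assuming the Hugging Face transformers library: with num_labels=1 the sequence-classification head becomes a single regression output trained with mean-squared error, which is essentially how the STS-B task in GLUE (similarity scores in [0, 5]) is handled:

    import torch
    from transformers import BertTokenizer, BertForSequenceClassification

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=1, problem_type="regression"
    )

    batch = tokenizer(["a sentence of roughly thirty words goes here"],
                      padding=True, truncation=True, max_length=64, return_tensors="pt")
    targets = torch.tensor([3.7])            # gold score in [0, 5]

    out = model(**batch, labels=targets)
    out.loss.backward()                      # mean-squared-error loss on the single output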

Somehow the comparison image in Figure 1 is different in the 2019 arXiv version of the paper?

elnazsn

Sir, wonderful and clear explanation. I have a doubt: is a QA system built with the BERT technique supervised or unsupervised? And is BERT a pre-trained model?

dr.deepayogish
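
On the last question: BERT's pre-training is unsupervised (self-supervised), but the QA systems built on it are supervised, fine-tuned on labeled question/answer spans such as SQuAD. A sketch, assuming the Hugging Face transformers library and its default extractive-QA checkpoint:

    from transformers import pipeline

    qa = pipeline("question-answering")   # loads a checkpoint already fine-tuned on SQuAD-style data
    result = qa(question="What does BERT stand for?",
                context="BERT stands for Bidirectional Encoder Representations from Transformers.")
    print(result["answer"], result["score"])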