Distilling BERT | Sam Sucik

Presented by Sam Sucik, Machine Learning Researcher at Rasa, at Rasa's Level 3 AI Assistant Conference. The popular BERT model handles natural language understanding tasks well, but it is too slow for many practical applications such as real-time human-bot conversations. One solution is knowledge distillation: letting tiny "student" models learn from the "teacher" BERT. While it works well, it is opaque, just like the neural models involved. I explore what learning actually takes place. Using different model interpretation techniques, I analyse what the students do and do not learn well. On easier tasks, I show that even 1000x smaller student models perform on par with BERT, even though their language understanding capabilities differ from the teacher's. On difficult tasks where a solid understanding of meaning is required, I demonstrate that it is only the most sophisticated skills that the students learn poorly. My findings can help us improve models and choose the right model for a task, instead of automatically using BERT.
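For context, knowledge distillation typically trains the student to match the teacher's softened output distribution in addition to the ground-truth labels. The sketch below illustrates this generic soft-target loss in PyTorch; it is not the exact setup from the talk, and the temperature and weighting values are assumptions for illustration only.

```python
# Generic soft-target distillation loss (Hinton-style), shown for illustration.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend teacher-guided soft-target loss with hard-label cross-entropy."""
    # Soften both distributions with the temperature, then match them via KL divergence.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Standard supervised loss on the ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```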