Accelerate Transformer inference on CPU with Optimum and ONNX

In this video, I show you how to accelerate Transformer inference with Optimum, an open source library by Hugging Face, and ONNX.

I start from a DistilBERT model fine-tuned for text classification, export it to ONNX format, then optimize it, and finally quantize it. Running benchmarks on an AWS c6i instance (Intel Ice Lake architecture), we speed up the original model by more than 2.5x and halve its size, with just a few lines of simple Python code and no accuracy drop!
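If you want to follow along, here is a minimal sketch of the export, optimize, and quantize steps using Optimum's ONNX Runtime integration. It assumes a recent version of optimum[onnxruntime]; the checkpoint name and output directory are placeholders, not necessarily the ones used in the video, and the exact API may differ slightly between Optimum versions.

# Sketch of the export -> optimize -> quantize workflow described above.
# The checkpoint and save directory are placeholders.
from transformers import AutoTokenizer
from optimum.onnxruntime import (
    ORTModelForSequenceClassification,
    ORTOptimizer,
    ORTQuantizer,
)
from optimum.onnxruntime.configuration import AutoQuantizationConfig, OptimizationConfig

model_id = "distilbert-base-uncased-finetuned-sst-2-english"  # placeholder checkpoint
save_dir = "distilbert_onnx"

# 1. Export the fine-tuned model to ONNX and save it locally
#    (older Optimum versions use from_transformers=True instead of export=True)
model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)

# 2. Apply ONNX Runtime graph optimizations (operator fusion, constant folding, ...)
optimizer = ORTOptimizer.from_pretrained(save_dir)
optimizer.optimize(
    optimization_config=OptimizationConfig(optimization_level=99),
    save_dir=save_dir,
)

# 3. Dynamically quantize the optimized graph for AVX-512 VNNI CPUs (e.g. Ice Lake)
quantizer = ORTQuantizer.from_pretrained(save_dir, file_name="model_optimized.onnx")
quantizer.quantize(
    quantization_config=AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False),
    save_dir=save_dir,
)

With the default file suffixes, the quantized model is saved as model_optimized_quantized.onnx in the same directory and can be loaded back with ORTModelForSequenceClassification.from_pretrained for benchmarking or use in a regular transformers pipeline.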

⭐️⭐️⭐️ Don't forget to subscribe to be notified of future videos ⭐️⭐️⭐️

Comments

Thanks a lot for creating this video. It saved me a month of work!

geekyprogrammer

To the point! Great explanation, thanks 😀

youssefbenhachem

How do you export to ONNX using CUDA? It seems Optimum doesn't support it; is there an alternative?

Gerald-izmv

Are there any optimization methods applied to the word2vec 2.0 model? And can I apply these methods to word2vec 2.0?

ahlamhusni

I am trying to follow along. There have been many updates to the code, so unfortunately I'm getting a lot of errors.

TheBontenbal

What's the difference between setfit.exporters.onnx and optimum.onnxruntime (optimizer = optimizer.optimize()), etc.?

Gerald-xgrq