Fast T5 transformer model CPU inference with ONNX conversion and quantization

The Colab notebook shown in the video is available in the course.

By converting the T5 transformer model to ONNX and quantizing it, you can reduce the model size by about 3x and speed up inference by up to 5x. This makes it possible to deploy a question-generation model like T5 on a CPU with sub-second latency.

A neat visualization with a Gradio app is also included.
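
For illustration, a minimal sketch of such a Gradio wrapper is shown below. The generate_question function here is a hypothetical placeholder for the actual model call; only the Interface/launch usage reflects standard Gradio.

import gradio as gr

def generate_question(context, answer):
    # Hypothetical placeholder: call the T5 question-generation model here
    # and return the generated question for the given context and answer.
    return f"(generated question about: {answer})"

# Two text inputs (context and answer) and one text output, similar to the demo.
demo = gr.Interface(fn=generate_question,
                    inputs=["text", "text"],
                    outputs="text",
                    title="T5 Question Generation")

demo.launch()  # launch(share=True) additionally prints a public link in Colab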

We use the T5 transformer model from Hugging Face and the FastT5 library for ONNX conversion and quantization.
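
As a rough sketch of that workflow (not the exact notebook code), the fastT5 one-liner export looks like the following; "t5-small" is only an illustrative checkpoint, and the library quantizes the exported ONNX files by default.

from transformers import AutoTokenizer
from fastT5 import export_and_get_onnx_model

# Illustrative checkpoint; substitute the question-generation model used in the video.
model_name = "t5-small"

# Exports the encoder/decoder to ONNX and quantizes them (default behaviour),
# writing the .onnx files to a local models folder for later reuse.
model = export_and_get_onnx_model(model_name)

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokens = tokenizer("translate English to French: The house is wonderful.",
                   return_tensors="pt")

output_ids = model.generate(input_ids=tokens["input_ids"],
                            attention_mask=tokens["attention_mask"],
                            num_beams=2)
print(tokenizer.decode(output_ids.squeeze(), skip_special_tokens=True))

The exported and quantized files from this step are what the video later stores in Drive and reloads for the Gradio app.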

With optimizations like these, thousands of dollars could be saved on the production deployment of these systems.

Timestamps:
00:00 Introduction and Agenda
01:07 Install the transformers library from Hugging Face
02:18 Download the Hugging Face model
02:40 Sample of generating a question (see the sketch after the timestamps)
04:00 Gradio app deployment in the GUI
08:11 Convert T5 PyTorch to ONNX & quantize with FastT5
17:22 Store the model in Drive
18:30 Run the Gradio app with the new model
21:55 Future episode & Conclusion
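
For the "Sample of generating a question" step, a minimal sketch with plain transformers (before the ONNX conversion) could look like the following; the checkpoint name and the prompt format are placeholders and depend on how the actual model in the video was fine-tuned.

from transformers import T5ForConditionalGeneration, T5Tokenizer

# Placeholder checkpoint: substitute the question-generation T5 model used in the video.
checkpoint = "your-t5-question-generation-checkpoint"
tokenizer = T5Tokenizer.from_pretrained(checkpoint)
model = T5ForConditionalGeneration.from_pretrained(checkpoint)

context = "The Eiffel Tower was completed in 1889 and is located in Paris."
answer = "1889"
# Many T5 question-generation checkpoints expect an "answer: ... context: ..."
# style prompt; the exact format depends on the fine-tuning.
prompt = f"answer: {answer} context: {context}"

inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(inputs["input_ids"],
                            attention_mask=inputs["attention_mask"],
                            num_beams=4,
                            max_length=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
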
Comments

Very high quality tutorial. Thank you. Keep it up

adnanahmad

How do you run the quantized model on a GPU?

afrinpeshimam

Also, one question: can we use the same procedure to convert a custom DistilBERT model saved on local disk to ONNX format?

debjyotibanerjee

You are running Colab on GPU, right? What if I run this inference on CPU with the quantized ONNX model? How long would it take?

debjyotibanerjee