Fast T5 transformer model CPU inference with ONNX conversion and quantization

The Colab notebook shown in the video is available in the course.

By converting the T5 transformer model to ONNX and quantizing it, you can reduce the model size by about 3x and speed up inference by up to 5x. This makes it possible to deploy a question-generation model like T5 on a CPU with sub-second latency.

A neat visualization with a Gradio app is also included.
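
For illustration, a minimal sketch of such a Gradio wrapper is shown below. The generate_question function here is a hypothetical placeholder for the actual model call; only the Interface/launch usage reflects standard Gradio.

import gradio as gr

def generate_question(context, answer):
    # Hypothetical placeholder: call the T5 question-generation model here
    # and return the generated question for the given context and answer.
    return f"(generated question about: {answer})"

# Two text inputs (context and answer) and one text output, similar to the demo.
demo = gr.Interface(fn=generate_question,
                    inputs=["text", "text"],
                    outputs="text",
                    title="T5 Question Generation")

demo.launch()  # launch(share=True) additionally prints a public link in Colab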

We use the T5 transformer model from Hugging Face and the FastT5 library for ONNX conversion and quantization.
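
As a rough sketch of that workflow (not the exact notebook code), the fastT5 one-liner export looks like the following; "t5-small" is only an illustrative checkpoint, and the library quantizes the exported ONNX files by default.

from transformers import AutoTokenizer
from fastT5 import export_and_get_onnx_model

# Illustrative checkpoint; substitute the question-generation model used in the video.
model_name = "t5-small"

# Exports the encoder/decoder to ONNX and quantizes them (default behaviour),
# writing the .onnx files to a local models folder for later reuse.
model = export_and_get_onnx_model(model_name)

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokens = tokenizer("translate English to French: The house is wonderful.",
                   return_tensors="pt")

output_ids = model.generate(input_ids=tokens["input_ids"],
                            attention_mask=tokens["attention_mask"],
                            num_beams=2)
print(tokenizer.decode(output_ids.squeeze(), skip_special_tokens=True))

The exported and quantized files from this step are what the video later stores in Drive and reloads for the Gradio app.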

With optimizations like these, thousands of dollars could be saved on the production deployment of these systems.

Timestamps:
00:00 Introduction and Agenda
01:07 Install the transformers library from Hugging Face
02:18 Download the Hugging Face model
02:40 Sample of generating a question (see the sketch after the timestamps)
04:00 Gradio app deployment in the GUI
08:11 Convert T5 PyTorch to ONNX & quantize with FastT5
17:22 Store the model in Drive
18:30 Run the Gradio app with the new model
21:55 Future episode & Conclusion
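
For the "Sample of generating a question" step, a minimal sketch with plain transformers (before the ONNX conversion) could look like the following; the checkpoint name and the prompt format are placeholders and depend on how the actual model in the video was fine-tuned.

from transformers import T5ForConditionalGeneration, T5Tokenizer

# Placeholder checkpoint: substitute the question-generation T5 model used in the video.
checkpoint = "your-t5-question-generation-checkpoint"
tokenizer = T5Tokenizer.from_pretrained(checkpoint)
model = T5ForConditionalGeneration.from_pretrained(checkpoint)

context = "The Eiffel Tower was completed in 1889 and is located in Paris."
answer = "1889"
# Many T5 question-generation checkpoints expect an "answer: ... context: ..."
# style prompt; the exact format depends on the fine-tuning.
prompt = f"answer: {answer} context: {context}"

inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(inputs["input_ids"],
                            attention_mask=inputs["attention_mask"],
                            num_beams=4,
                            max_length=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
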
Comments

Very high quality tutorial. Thank you. Keep it up

adnanahmad

How do you run the quantized model on a GPU?

afrinpeshimam

Also, one question: can we use the same procedure to convert a custom DistilBERT model saved on local disk to ONNX format?

debjyotibanerjee

You are running Colab on GPU, right? What if I run this inference on CPU with the quantized ONNX model? How long would it take?

debjyotibanerjee