Fast And Portable LLM Inference With WebAssembly And Rust by Michael Yuan

Fast And Portable LLM Inference With WebAssembly And Rust by Michael Yuan at #LambdaConf2024

It is estimated that inference consumes over 90% of computing resources for AI workloads. LLMs exacerbate this heavy resource consumption, and companies struggle to meet customers’ inference demands. Traditional AI inference apps are written in Python and then wrapped in a container or VM for cloud deployment. Those apps are heavyweight (10GB+) and slowed down by Python-based data processing. Wasm has emerged as a strong alternative runtime for AI inference workloads. Developers write inference functions in Rust, JavaScript, or Python and then run them in Wasm sandboxes. Wasm functions are lightweight, fast, safe for the cloud, and portable.

In this talk, Michael will explain how to reach zero Python dependencies in LLM inference and how to create Llama2 inference functions and extensions in Rust. He will also discuss how the WasmEdge community has leveraged and built up the Wasm container infrastructure for LLM plugins.