Exploring the Latency/Throughput & Cost Space for LLM Inference // Timothée Lacroix // CTO Mistral

// Abstract
Getting the right LLM inference stack means choosing the right model for your task, and running it on the right hardware, with proper inference code. This talk will go through popular inference stacks and set-ups, detailing what makes inference costly. We'll talk about the current generation of open-source models and how to make the best use of them, but we will also touch on features currently missing from the open-source serving stack as well as what the future generations of models will unlock.

// Bio
Timothée Lacroix, aged 31, is Chief Technical Officer in charge of technical issues relating to product efficacy and research. He started as an engineer at Facebook AI Research in New York in 2015, where he completed his thesis between 2016 and 2019, in collaboration with École des Ponts, on tensor factorization for recommender systems. He continued his career at Meta until 2023, when he co-founded @Mistral-AI.

Comments

There seems to be a mistake in the cost estimate at 21:53. It uses the price for the A10 but the throughput of the H100. I believe the actual cost estimate would be $48, not $15.

iandanforth

This is awesome. Thanks for sharing, super useful.

mndflctzn

The math around 6:50 for A100 batch size isn't working out. It would be great if the values used to calculate the 400 batch size were provided.

Based on the equations provided for compute time and model load time, the point of intersection is FLOPS/(2*MemoryBand), not the (2*FLOPS)/MemoryBand shown in the video.
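
A minimal sketch of the intersection discussed in this comment, assuming a per-step compute time of 2 * P * B / FLOPS and a weight-load time of bytes_per_param * P / bandwidth. The function name, the `bytes_per_param` parameter, and the A100 figures (312 TFLOPS dense FP16 and 1555 GB/s for the 40 GB card, from NVIDIA's published specs) are assumptions for illustration, not values confirmed by the talk.

```python
def crossover_batch_size(flops, mem_bandwidth, bytes_per_param=1.0):
    """Batch size at which compute time equals weight-load time.

    Assumed cost model (not confirmed by the talk):
      compute time = 2 * P * B / flops
      load time    = bytes_per_param * P / mem_bandwidth
    Setting the two equal, P cancels out:
      B = bytes_per_param * flops / (2 * mem_bandwidth)
    """
    return bytes_per_param * flops / (2.0 * mem_bandwidth)


# Assumed nominal A100 (40 GB) figures.
A100_FLOPS = 312e12      # FLOP/s, dense FP16 tensor cores
A100_BANDWIDTH = 1.555e12  # bytes/s

b_1byte = crossover_batch_size(A100_FLOPS, A100_BANDWIDTH, bytes_per_param=1.0)
b_2byte = crossover_batch_size(A100_FLOPS, A100_BANDWIDTH, bytes_per_param=2.0)
video_formula = 2.0 * A100_FLOPS / A100_BANDWIDTH

print(round(b_1byte), round(b_2byte), round(video_formula))
```

With these assumed figures, FLOPS/(2*MemoryBand) gives a batch size of about 100, while (2*FLOPS)/MemoryBand gives about 400, which may be where the number in the video comes from.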

eduardoalvarez

Great talk! Is there a link to the slides for this talk?

frank

Hi, what benchmark did he run to generate the plots? Any open-source GitHub links?

Gerald-izmv

@5:40 Why do we need to load the entire model all the time? Can't we just load it once? If so, we could reduce the memory movement needed, and the intersection would shift left.

janilbolswong

It's possible that I'm misunderstanding, but given our use of a significantly large key-value cache (2GB multiplied by the batch size), can we still assert that the memory bandwidth is solely influenced by the model's weights?
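
One way to reason about this question is to add the KV cache reads to the per-step memory traffic. The sketch below assumes each decode step reads all the weights once plus the full KV cache of every sequence in the batch. The function names and the 14 GB weight size are hypothetical; the 2 GB per-sequence figure is the one quoted in the comment.

```python
def bytes_moved_per_step(weight_bytes, kv_bytes_per_seq, batch_size):
    """Assumed memory traffic for one decode step: weights plus all KV caches."""
    return weight_bytes + kv_bytes_per_seq * batch_size


def kv_share(weight_bytes, kv_bytes_per_seq, batch_size):
    """Fraction of the per-step memory traffic that comes from the KV cache."""
    kv_bytes = kv_bytes_per_seq * batch_size
    return kv_bytes / bytes_moved_per_step(weight_bytes, kv_bytes_per_seq, batch_size)


WEIGHT_BYTES = 14e9      # hypothetical: 14 GB of weights
KV_BYTES_PER_SEQ = 2e9   # 2 GB per sequence, as quoted in the comment

for batch_size in (1, 7, 32):
    share = kv_share(WEIGHT_BYTES, KV_BYTES_PER_SEQ, batch_size)
    print(f"batch={batch_size}: KV cache share of traffic = {share:.2f}")
```

Under these assumptions the KV cache accounts for half of the traffic at a batch size of 7 and most of it beyond that, so the weights alone would not determine the memory-bound time at large batch sizes.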

boussouarsari

What a horrible unethical response on the ethics of training data

AbdulK-krjv
