All Publications

vLLM Office Hours - vLLM’s 2024 Wrapped and 2025 Vision - December 19, 2024


vLLM Office Hours - Exploring Machete, a Mixed-Input GEMM Kernel for Hopper GPUs - December 5, 2024

vLLM Office Hours - Disaggregated Prefill and KV Cache Storage in vLLM - November 14, 2024

vLLM Office Hours - SOTA Tool-Calling Implementation in vLLM - November 7, 2024

vLLM Office Hours - Deep Dive into Mistral on vLLM - October 17, 2024

vLLM Office Hours - Speculative Decoding in vLLM - October 3, 2024

vLLM Office Hours - Advanced Techniques for Maximizing vLLM Performance - September 19, 2024

vLLM Office Hours - Using NVIDIA CUTLASS for High-Performance Inference - September 5, 2024

vLLM Office Hours - vLLM on AMD GPUs and Google TPUs - August 21, 2024

vLLM Office Hours - Multimodal Models in vLLM with Roblox - August 8, 2024

vLLM Office Hours - Model Quantization for Efficient vLLM Inference - July 25, 2024

Deploy LLMs More Efficiently with vLLM and Neural Magic

vLLM Office Hours - FP8 Quantization Deep Dive - July 9, 2024

vLLM Office Hours - June 20, 2024

vLLM and Neural Magic Office Hours - June 5, 2024

Are MLOps disappearing?

5x Faster YOLOv8 on CPUs

Deploy Fast and Accurate YOLOv8 Object Detection Models on CPUs You Already Have

Unlock Faster and More Efficient LLMs with SparseGPT

Pruning and Quantizing ML Models With One Shot Without Retraining

Sparse Transferring Hugging Face Models With SparseML

Apply Second-Order Pruning Algorithms for SOTA Model Compression

Use Sparse Transfer Learning to Create Sparse Models Fine-Tuned to Your Datasets