Fine tuning Pixtral - Multi-modal Vision and Text Model


VIDEO RESOURCES:

TIMESTAMPS:
0:00 How to fine-tune Pixtral
0:43 Video overview
1:27 Pixtral architecture and design choices
3:51 Mistral’s custom image encoder - trained from scratch
8:35 Fine-tuning Pixtral in a Jupyter notebook
9:33 GPU setup for notebook fine-tuning and VRAM requirements
12:23 Getting a “transformers” version of Pixtral for fine-tuning
15:00 Loading Pixtral (see the loading sketch below)
16:21 Dataset loading and preparation
18:08 Chat templating (somewhat advanced, but recommended; see the templating sketch below)
23:33 Inspecting and evaluating baseline performance on the custom data
26:34 Setting up data collation (including for multi-turn training)
31:09 Training on completions only (tricky, but improves performance; see the label-masking sketch below)
35:08 Setting up LoRA fine-tuning (see the LoRA sketch below)
41:04 Setting up training arguments: batch size, learning rate, gradient checkpointing (see the TrainingArguments sketch below)
43:36 Setting up TensorBoard
46:48 Evaluating the trained model
47:46 Merging LoRA adapters and pushing the model to the Hub (see the merging sketch below)
49:07 Measuring performance on OCR (optical character recognition)
50:28 Inferencing Pixtral with vLLM and setting up an API endpoint (see the vLLM sketch below)
55:17 Video resources
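
For reference, here is a minimal sketch of loading a transformers-format Pixtral checkpoint. The `mistral-community/pixtral-12b` repo id and the dtype/device settings are assumptions, not necessarily what the notebook uses:

```python
# Minimal sketch: load a transformers-compatible Pixtral checkpoint.
# The repo id below is an assumption; the notebook may use a different one.
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "mistral-community/pixtral-12b"

processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # bf16: roughly 24 GB just for the 12B weights
    device_map="auto",
)
```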
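A sketch of prompting via the processor's chat template and running a quick generation. The message schema follows the general transformers multimodal convention and is an assumption; the video builds its own custom template, which may differ:

```python
# Sketch: build a prompt with the processor's chat template and generate.
# Assumes `processor` and `model` from the loading sketch above; the
# message format may need adapting to the checkpoint's own template.
from PIL import Image

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=[Image.open("example.jpg")], text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```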
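On training on completions only, the core idea is to compute loss only on the assistant's reply. A rough, text-only sketch of the label masking follows; the video's collator also handles images and multi-turn chats and will differ in detail:

```python
# Rough sketch of completions-only label masking: everything except the
# assistant's reply is set to -100 so the loss ignores it.
import torch

def mask_prompt_labels(input_ids: torch.Tensor, prompt_len: int, pad_token_id: int) -> torch.Tensor:
    labels = input_ids.clone()
    labels[:, :prompt_len] = -100           # ignore the prompt / user turns
    labels[labels == pad_token_id] = -100   # ignore padding tokens too
    return labels
```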
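A sketch of a PEFT LoRA setup; the rank, alpha, and target modules below are common defaults, not necessarily the values chosen in the video:

```python
# Sketch: attach LoRA adapters with PEFT. Hyperparameters are illustrative.
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # sanity check: only a small fraction is trainable
```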
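A sketch of the training arguments covering the knobs discussed (batch size, learning rate, gradient checkpointing) plus TensorBoard logging; all values are placeholders:

```python
# Sketch: Hugging Face TrainingArguments with TensorBoard logging.
# Values are placeholders, not the video's exact settings.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="pixtral-lora",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,   # effective batch size of 8
    learning_rate=1e-4,
    num_train_epochs=1,
    gradient_checkpointing=True,     # trades extra compute for a large VRAM saving
    bf16=True,
    logging_steps=10,
    report_to="tensorboard",         # then: tensorboard --logdir pixtral-lora
)
```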
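A sketch of merging the LoRA adapters back into the base weights and pushing the result to the Hub; the repo name is a placeholder:

```python
# Sketch: merge LoRA deltas into the base weights and upload to the Hub.
# The repo id is a placeholder.
merged_model = model.merge_and_unload()
merged_model.push_to_hub("your-username/pixtral-12b-finetuned")
processor.push_to_hub("your-username/pixtral-12b-finetuned")
```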
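Finally, a sketch of querying a vLLM OpenAI-compatible endpoint serving the merged model. The server is assumed to be started separately (e.g. with `vllm serve <repo-id>`); the URL and model name are placeholders:

```python
# Sketch: query a vLLM OpenAI-compatible endpoint with an image + text prompt.
# Assumes a server is already running; URL and model name are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="your-username/pixtral-12b-finetuned",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What text appears in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/receipt.png"}},
        ],
    }],
)
print(response.choices[0].message.content)
```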
Comments


And for those with access to Trelis.com/ADVANCED-vision, I've uploaded a fine-tuning script for Llama 3.2 Vision.

TrelisResearch

Yours is definitely one of the best applied ML channels out there. Great work as usual!

Xaelum

Great tutorial! Currently only GPT4o-V can correctly digest my Morrowind screenshots, but maybe with sufficient training Pixtral might have a shot.
Thanks for the informative video, mate, keep it coming! 🚀

JaredWoodruff

Thanks mate! I'd been prepping a small feasibility study on using a LLaVA + Mixtral model for spatial planning, and then Pixtral came out, combining the two better than I ever could. Developments move fast; you're a gem for sharing your insights.

Any tips on how best to approach fine-tuning multiple chained-together models? I am trying to use Pixtral to generate textual descriptions of which shapes need to be generated on a map (basically coordinates), which are then passed to a code-generating LLM/FM to produce that shape in GML (a niche XML dialect for spatial contexts). I'm getting a bit lost in the weeds given the sheer number and complexity of parts that can be altered and fine-tuned (the models themselves, their interaction, input/output formatting, training parameters...). How would you structure an approach so that you're tackling this kind of ensemble fine-tuning effectively?

Thanks again for your insights already within this video (:

fabian

Man your content is good!.. stay this way..

Let me ask you something: does YouTube provide you with stats on who your viewers are? Because it seems to me most of them would be mid-to-senior-level devs.

picklenickil

May I ask why we need to create a custom chat template instead of using the same one the model was trained with? I wonder how it works after changing the original chat template. Is there a principled reason for it? Thanks for sharing such nice content.

maulikmadhavi

Is annotation necessary for the model to learn the details in an image, or just beneficial?

For example, when I provide an image where there is a speed sign on the left side, children eating ice cream on the right side, a stork flying above, sunny weather, and asphalt and pavement on the ground, do I need to annotate all of these for the model to learn them?

cagataydemirbas

I see the code ```max_tokens_per_img = 4096``` in the advanced demo of the vLLM offline-inference example. Does this basically mean the maximum number of 16x16-pixel patches it would support per image?

strongcourage-ce

Amazing tutorial, thanks for that! Would you mind posting the `requirements.txt`? I am getting some strange inconsistencies with `tokenizer.padding_side` and I'm wondering if it's due to different package versions.

AntonMilan

Hey! I wanted to know the VRAM required to fine-tune it in BF16 precision for the best performance possible. Could you give me the approximate requirement to perform a LoRA fine-tune on Pixtral 12B?

Will an 80GB A100 be sufficient, or would I require something more? I am aware that it will definitely be fine if I set the batch size to 1, but will that be enough? I want to build an OCR system out of it that is able to detect and correctly interpret the worst document handwriting out there.

byron-ihge

Is there a multimodal model that can be fine-tuned for sentiment analysis from audio, image, text, and video?

IsmailIfakir

Hi, can you provide links to the architecture diagram and other details?

danish

I’ve tried Pixtral with food labels and nutritional tables with no luck; the model answered with random data (for example, for “how many grams of protein per 100g?” it answers 12.5g even if there was 0.3g). Will a fine-tuning process on food labels fix this behavior?

dami

Can you do Molmo by AllenAI? The largest model is based on Qwen2-VL 72B and on par with GPT-4o, so I think it is quite good! I think I will definitely buy the repo, you are awesome ❤

poisonza