Fine tuning Pixtral - Multi-modal Vision and Text Model


VIDEO RESOURCES:

TIMESTAMPS:
0:00 How to fine-tune Pixtral
0:43 Video overview
1:27 Pixtral architecture and design choices
3:51 Mistral’s custom image encoder - trained from scratch
8:35 Fine-tuning Pixtral in a Jupyter notebook
9:33 GPU setup for notebook fine-tuning and VRAM requirements
12:23 Getting a “transformers” version of Pixtral for fine-tuning
15:00 Loading Pixtral (see the loading sketch below)
16:21 Dataset loading and preparation
18:08 Chat templating (somewhat advanced, but recommended; see the templating sketch below)
23:33 Inspecting and evaluating baseline performance on the custom data
26:34 Setting up data collation (including for multi-turn training)
31:09 Training on completions only (tricky, but improves performance; see the label-masking sketch below)
35:08 Setting up LoRA fine-tuning (see the LoRA sketch below)
41:04 Setting up training arguments: batch size, learning rate, gradient checkpointing (see the TrainingArguments sketch below)
43:36 Setting up TensorBoard
46:48 Evaluating the trained model
47:46 Merging LoRA adapters and pushing the model to the Hub (see the merging sketch below)
49:07 Measuring performance on OCR (optical character recognition)
50:28 Inferencing Pixtral with vLLM and setting up an API endpoint (see the vLLM sketch below)
55:17 Video resources
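
For reference, here is a minimal sketch of loading a transformers-format Pixtral checkpoint. The `mistral-community/pixtral-12b` repo id and the dtype/device settings are assumptions, not necessarily what the notebook uses:

```python
# Minimal sketch: load a transformers-compatible Pixtral checkpoint.
# The repo id below is an assumption; the notebook may use a different one.
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "mistral-community/pixtral-12b"

processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # bf16: roughly 24 GB just for the 12B weights
    device_map="auto",
)
```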
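A sketch of prompting via the processor's chat template and running a quick generation. The message schema follows the general transformers multimodal convention and is an assumption; the video builds its own custom template, which may differ:

```python
# Sketch: build a prompt with the processor's chat template and generate.
# Assumes `processor` and `model` from the loading sketch above; the
# message format may need adapting to the checkpoint's own template.
from PIL import Image

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=[Image.open("example.jpg")], text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```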
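On training on completions only, the core idea is to compute loss only on the assistant's reply. A rough, text-only sketch of the label masking follows; the video's collator also handles images and multi-turn chats and will differ in detail:

```python
# Rough sketch of completions-only label masking: everything except the
# assistant's reply is set to -100 so the loss ignores it.
import torch

def mask_prompt_labels(input_ids: torch.Tensor, prompt_len: int, pad_token_id: int) -> torch.Tensor:
    labels = input_ids.clone()
    labels[:, :prompt_len] = -100           # ignore the prompt / user turns
    labels[labels == pad_token_id] = -100   # ignore padding tokens too
    return labels
```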
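A sketch of a PEFT LoRA setup; the rank, alpha, and target modules below are common defaults, not necessarily the values chosen in the video:

```python
# Sketch: attach LoRA adapters with PEFT. Hyperparameters are illustrative.
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # sanity check: only a small fraction is trainable
```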
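A sketch of the training arguments covering the knobs discussed (batch size, learning rate, gradient checkpointing) plus TensorBoard logging; all values are placeholders:

```python
# Sketch: Hugging Face TrainingArguments with TensorBoard logging.
# Values are placeholders, not the video's exact settings.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="pixtral-lora",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,   # effective batch size of 8
    learning_rate=1e-4,
    num_train_epochs=1,
    gradient_checkpointing=True,     # trades extra compute for a large VRAM saving
    bf16=True,
    logging_steps=10,
    report_to="tensorboard",         # then: tensorboard --logdir pixtral-lora
)
```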
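A sketch of merging the LoRA adapters back into the base weights and pushing the result to the Hub; the repo name is a placeholder:

```python
# Sketch: merge LoRA deltas into the base weights and upload to the Hub.
# The repo id is a placeholder.
merged_model = model.merge_and_unload()
merged_model.push_to_hub("your-username/pixtral-12b-finetuned")
processor.push_to_hub("your-username/pixtral-12b-finetuned")
```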
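Finally, a sketch of querying a vLLM OpenAI-compatible endpoint serving the merged model. The server is assumed to be started separately (e.g. with `vllm serve <repo-id>`); the URL and model name are placeholders:

```python
# Sketch: query a vLLM OpenAI-compatible endpoint with an image + text prompt.
# Assumes a server is already running; URL and model name are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="your-username/pixtral-12b-finetuned",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What text appears in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/receipt.png"}},
        ],
    }],
)
print(response.choices[0].message.content)
```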
Comments


And for those with access to Trelis.com/ADVANCED-vision, I've uploaded a fine-tuning script for Llama 3.2 Vision.

TrelisResearch

Yours is definitely one of the best applied ML channels out there. Great work as usual!

Xaelum

Great tutorial! Currently only GPT4o-V can correctly digest my Morrowind screenshots, but maybe with sufficient training Pixtral might have a shot.
Thanks for the informative video, mate, keep it coming! 🚀

JaredWoodruff

Thanks mate! I'd been prepping a small feasibility study on using a LLaVA + Mixtral model for spatial planning, and then Pixtral came out, combining the two better than I ever could. Developments move fast; you're a gem for sharing your insights.

Any tips on how best to approach fine-tuning multiple chained-together models? I am trying to use Pixtral to generate textual descriptions of which shapes need to be generated on a map (basically coordinates), which are then passed to a code-generating LLM/FM to produce that shape in GML (a niche XML dialect for spatial contexts). I'm getting a bit lost in the weeds given the sheer number and complexity of parts that can be altered and fine-tuned (the models themselves, their interaction, input/output formatting, training parameters...). How would you structure an approach so that you're tackling this kind of ensemble fine-tuning effectively?

Thanks again for your insights already within this video (:

fabian

Man your content is good!.. stay this way..

Let me ask you something: does YouTube provide you with stats on who your viewers are? Because it seems to me most of them would be mid-to-senior-level devs.

picklenickil

May I ask why we need to create a custom chat template instead of using the same one the model was trained with? I wonder how it works after changing the original chat template. Is there a principled reason for it? Thanks for sharing such nice content.

maulikmadhavi

Is annotation necessary for the model to learn the details in an image, or just beneficial?

For example, when I provide an image where there is a speed sign on the left side, children eating ice cream on the right side, a stork flying above, sunny weather, and asphalt and pavement on the ground, do I need to annotate all of these for the model to learn them?

cagataydemirbas

I see the code ```max_tokens_per_img = 4096``` in the advanced demo of the vLLM offline-inference example. Does this basically mean the maximum number of 16x16-pixel patches it would support per image?

strongcourage-ce

Amazing tutorial, thanks for that! Would you mind posting the `requirements.txt`? I am getting some strange inconsistencies with `tokenizer.padding_side` and I'm wondering if it's due to different package versions.

AntonMilan

Hey! I wanted to know the VRAM required to fine-tune it in BF16 precision for the best performance possible. Could you give me the approximate requirement to perform a LoRA fine-tune on Pixtral 12B?

Will an 80GB A100 be sufficient, or would I require something more? I am aware that it will definitely be fine if I set the batch size to 1, but will that be enough? I want to build an OCR system out of it that is able to detect and correctly interpret the worst document handwriting out there.

byron-ihge

Is there a multimodal model that can be fine-tuned for sentiment analysis from audio, image, text, and video?

IsmailIfakir

Hi, can you provide links to the architecture diagram and other details?

danish

I’ve tried Pixtral with food labels and nutritional tables with no luck; the model answered with random data (for example, for “how many grams of protein per 100g?” it answers 12.5g even if there was 0.3g). Will a fine-tuning process on food labels fix this behavior?

dami

Can you do Molmo by AllenAI? The largest model is based on Qwen2-VL 72B and on par with GPT-4o, so I think it is quite good! I think I will definitely buy the repo, you are awesome ❤

poisonza