Florence-2: Fine-tune Microsoft’s Multimodal Model

Learn how to fine-tune Microsoft's Florence-2, a powerful open-source Vision Language Model, for custom object detection tasks. This in-depth tutorial guides you through setting up your environment in Google Colab, preparing datasets, and optimizing the model using LoRA.
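As a rough illustration of why LoRA makes fine-tuning cheaper, the sketch below counts the trainable parameters a single low-rank adapter adds to one weight matrix, compared with training the full matrix. The layer size (1024 x 1024) and rank (8) are illustrative assumptions, not values taken from the tutorial, and the function names are hypothetical.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Number of weights in a dense (d_out x d_in) matrix."""
    return d_out * d_in


def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Weights added by one LoRA adapter.

    LoRA keeps the original matrix frozen and learns two small matrices:
    A with shape (rank x d_in) and B with shape (d_out x rank).
    """
    return rank * d_in + d_out * rank


# Assumed, illustrative sizes; not taken from Florence-2 itself.
d_out, d_in, rank = 1024, 1024, 8

full = full_params(d_out, d_in)
lora = lora_trainable_params(d_out, d_in, rank)

print(f"full fine-tuning: {full} trainable weights")
print(f"LoRA (rank {rank}): {lora} trainable weights")
print(f"fraction trained: {lora / full:.4f}")
```

The saving grows with layer size, because the adapter scales with `d_in + d_out` while the full matrix scales with `d_in * d_out`.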

Chapters:

- 00:00 Introduction: Unlock the Power of Florence-2
- 01:09 Getting Started: Prepare for VLM Fine-Tuning
- 03:55 Florence-2 in Action: Explore Pre-trained Capabilities
- 07:00 Dataset Deep Dive: PyTorch Data Loading for Florence-2
- 13:02 LoRA: Optimize Your VLM Training
- 14:21 Fine-Tuning: Unleash Florence-2's Custom Object Detection
- 17:30 Model Evaluation: Measure Your VLM's Success
- 21:37 Florence-2 vs Other Computer Vision Models
- 24:09 Conclusion and Next Steps
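
For the dataset chapter, it may help to see what a detection target can look like as text. The sketch below assumes the convention in which box coordinates are quantized into 1000 bins and written as `<loc_N>` tokens after the class name; the helper names are hypothetical and this is not the exact code from the video.

```python
def to_loc_token(value: float, size: int, bins: int = 1000) -> str:
    """Quantize a pixel coordinate into one of `bins` location tokens."""
    idx = int(value * bins / size)
    # Clamp so a coordinate on the right or bottom edge stays in range.
    idx = max(0, min(bins - 1, idx))
    return f"<loc_{idx}>"


def format_detection_target(boxes, width: int, height: int) -> str:
    """Build a target string from (label, (x1, y1, x2, y2)) pixel boxes."""
    parts = []
    for label, (x1, y1, x2, y2) in boxes:
        parts.append(
            label
            + to_loc_token(x1, width)
            + to_loc_token(y1, height)
            + to_loc_token(x2, width)
            + to_loc_token(y2, height)
        )
    return "".join(parts)


# Hypothetical annotation for a 640x480 image.
boxes = [("card", (64, 48, 320, 240))]
target = format_detection_target(boxes, width=640, height=480)
print(target)
```

In a training pair, a string like this would serve as the expected output for a detection prompt, so the model learns to emit boxes as ordinary tokens.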

Comments

I've been waiting for this tutorial for days.

Thank you again for being the first to comprehensively review this new model.

Super excited! 🎉🥳

abdshomad

Thank you, Roboflow, for providing such nice and lovely tutorials for free, with such clear instructions.

jk_c

Thank you for this tutorial. I was working on this kind of setup for a couple of days. This could definitely save a lot of time.

artem-ywkm

Thanks a ton for this awesome video! Every single term is explained so clearly—it's super helpful.

I can't wait to dive into the code and start putting this knowledge to use!

SatyamKumar-cbmt

Very informative video. Thanks for making such a valuable video free of cost. Just one request: when you make tutorials, if possible, try to do inference, training, or fine-tuning on agricultural or satellite data.

VLM

How can I train this model on a custom dataset for OCR?

SridharanS-vzre

Thanks for the video tutorial.

Though this model can handle multiple tasks, all the videos cover a single task.

Can you explain how to tune the model for two different tasks, for example OCR and OD?

NaveenKumarLaskari

Hello bro! Thank you for your selfless sharing all along. While fine-tuning Florence-2, I ran into some issues, and I would like to ask for your advice.
Resolving accuracy issues in Chinese output for Florence-2 fine-tuned with LoRA: using the llava-instruct-chinese dataset, I froze the image encoder weights and fine-tuned the language part of Florence-2 with LoRA. On the "CAPTION" task the model can output Chinese, but the accuracy of the answers is zero. How can this issue be resolved?

yjrljjw

Thank you for the video tutorial, you are 👏👏👏
I hope there will be a version of this tutorial using Jupyter Notebook 😁

arifahnurainia

Thank you for the awesome tutorial! I wonder how the detection accuracy compares to a YOLO-based model?

kylewang

Thanks, Sir. Please do fine-tuning for OCR, captioning, and segmentation tasks.

geniusxbyofejiroagbaduta

9:35 how did you see this embedding vector projection thing for the Roboflow 100 datasets?

nikilragav

I would really, really like to see how you train on multiple datasets for different tasks, like OD, OCR, REGION_PROPOSAL, and maybe something like OPEN_VOCABULARY on one set and MORE DETAILED CAPTION on another. It would be great to see whether the knowledge transfers effectively: for example, whether captions start to include things that are not in the caption dataset but are in the other one, or whether it improves OCR in image descriptions.

barderino

Hi, I'm looking to fine-tune Florence-2 for a segmentation task. Would appreciate your insights!

TheVarun

Hey guys, do you have an example of fine-tuning an OCR model with Florence-2?

hegalzhang

Wonderful tutorial! Could you make a tutorial about how to fine-tune Florence-2 for the segmentation task?

sandrojunioraraujo

Master, could you please tell me if Florence-2 can perform SER (Semantic Entity Recognition) and RE (Relation Extraction) tasks? If so, what should my dataset look like? 🤔

dabaizhang-xb

Why are the Florence model's results different when you re-run the code?

indranilcool

For the community session I have a couple of (beginner) questions:
- The Google Colab notebooks on Roboflow seem to be Linux-based. Is there an easy way to make them work on Windows?
- In general, how do I download a model (YOLO) to use in a Python app (on Windows)?
- Are there models that would run real-time video detection on a regular laptop with an integrated GPU?
- I am planning to use a YOLO model for a sports live stream, but only have a simple 3-year-old mid-range laptop with me. If the laptop is underpowered, would it be better to send the stream to my desktop PC with an RTX 3060 Ti 8GB, run the model there, and send the detections back to sync on the laptop?
- For simple applications, like your real-time sports detection, would it be better to run it on my own hardware or to look into cloud servers for inference?

Thank you very much for your tutorials, they help a lot!

-P

Hello Sir, thanks for all your videos and efforts. I am following your channel, and I request that you please upload a detailed video on how to fine-tune a YOLOv5 model for custom image classification.

mctgpfi
