Fine-Tuning Multimodal LLMs (LLAVA) for Image Data Parsing

In this video, we'll fine-tune LLaVA, an open-source multimodal LLM available on Hugging Face, to extract structured information from receipt images and output it as JSON. By the end, we'll deploy the model behind a Flask API and build a Streamlit dashboard for the task.
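The JSON-output step described above needs a small post-processing helper, since a generative model may wrap the JSON in extra tokens or prose. Here is a minimal sketch of that step; the `extract_json` helper and the receipt field names (`store`, `total`, `items`) are illustrative assumptions, not the video's actual code:

```python
import json
import re

def extract_json(model_output: str) -> dict:
    """Pull the first JSON object out of raw model text.

    A fine-tuned LLaVA may surround the JSON with role tags or
    commentary, so we locate the outermost braces before parsing.
    """
    match = re.search(r"\{.*\}", model_output, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    return json.loads(match.group(0))

# Illustrative raw output for a parsed receipt (field names assumed)
raw = 'ASSISTANT: {"store": "ACME", "total": "12.50", "items": [{"name": "milk", "price": "3.20"}]}'
print(extract_json(raw)["total"])  # -> 12.50
```

In a real pipeline this would run on the decoded text returned by the fine-tuned model's `generate` call.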
00:00 Intro
00:42 Dashboard demo
01:55 LLaVA background
02:44 LLaVA playground
04:23 Fine-tuning pipeline schema
06:21 Hardware requirements (Hyperstack GPUs)
07:59 Sample datasets (CORD-v2 and DocVQA)
12:09 LLaVA architecture
15:07 Project code overview
15:57 Testing LLaVA 7B to 34B
23:38 This video's pipeline overview
25:12 Data preparation
37:29 Model preparation and training
45:33 Testing the fine-tuned model
48:18 Model deployment and dashboard design
#hyperstack #gpu #huggingface #pytorch #streamlit
#llm #python #llava
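The deployment step (48:18) can be sketched as a minimal Flask endpoint that accepts an uploaded receipt image and returns the parsed fields as JSON. This is an illustrative sketch only: the `/parse` route name is an assumption, and `run_model` is a stub standing in for the fine-tuned LLaVA inference call:

```python
import io

from flask import Flask, jsonify, request

app = Flask(__name__)

def run_model(image_bytes: bytes) -> dict:
    # Stub standing in for fine-tuned LLaVA inference;
    # the real version would preprocess the image, call
    # model.generate, and parse the JSON from the output.
    return {"store": "ACME", "total": "12.50"}

@app.route("/parse", methods=["POST"])
def parse_receipt():
    # Expect a multipart upload with an "image" file field.
    if "image" not in request.files:
        return jsonify({"error": "no image uploaded"}), 400
    fields = run_model(request.files["image"].read())
    return jsonify(fields)

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```

A Streamlit dashboard would then POST the user's uploaded image to this endpoint and render the returned fields.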
📚 Extra Resources:
Fine-tune Multi-modal LLaVA Vision and Language Models
How To Fine-tune LLaVA Model (From Your Laptop!)
Finetune MultiModal LLaVA
How LLaVA works 🌋 A Multimodal Open Source LLM for image recognition and chat.
LLaVA - the first instruction following multi-modal model (paper explained)
Fine Tune Vision Model LlaVa on Custom Dataset
LLaVA - This Open Source Model Can SEE Just like GPT-4-V
First open-source multimodal math dataset boosts MLLM performance - Podcast
Fine Tuning LLaVA
Visual Instruction Tuning using LLaVA
Fine Tune Multimodal LLM 'Idefics 2' using QLoRA
Fine Tune a Multimodal LLM 'IDEFICS 9B' for Visual Question Answering
Fine tuning Pixtral - Multi-modal Vision and Text Model
How do Multimodal AI models work? Simple explanation
Tiny Text + Vision Models - Fine tuning and API Setup
LLaVA: A large multi-modal language model
👑 LLaVA - The NEW Open Access MultiModal KING!!!
Building a Custom LLM for your domain based on LLaVA-Med
Fine-Tune Large LLMs with QLoRA (Free Colab Tutorial)
How To Install LLaVA 👀 Open-Source and FREE 'ChatGPT Vision'
LLaVA LLM: Visual and Language Multimodal Model Chatbot
Are LLaVA variants better than original?
“LLAMA2 supercharged with vision & hearing?!” | Multimodal 101 tutorial