NEW Llama 3.2 90B - TEST for Image Reasoning (careful!)

I test the brand-new Llama 3.2 90B VISION model from META and compare it with the new MOLMO multimodal model from AllenAI.
Test #1 is simple; both models fail my Test #2.

All rights with authors:
Llama 3.2: Revolutionizing edge AI and vision with open, customizable models

MOLMO
Multimodal Open Language Model built by Ai2

00:00 NEW Llama 3.2 and AllenAI MOLMO Model
01:21 TEST 1 for Image reasoning - MOLMO
02:03 TEST 1 Llama 3.2 90B Instruct VISION
03:02 TEST 2 Llama 3.2 90B Instruct VISION
04:44 TEST 2 on MOLMO Multimodal Model
08:46 TEST 2 Llama 3.2 90B continued and FAIL

#ai
#meta
#artificialintelligence
Comments

Hi community,
as requested: I just ran test #2 on Pixtral 12B. The exact same image, the exact same text for the prompt, and the result of Mistral Pixtral is:

"Interpretation of Visual Data:
Performance Trends:
GPT-4o: Shows a steady decline in cumulative average score as the number of tokens increases.
Claude 3 Opus: Starts with a high score but shows a significant drop as the number of tokens increases.
Claude 3.5 Sonnet: Exhibits a similar trend to Claude 3 Opus but starts at a lower score.
Gemini 1.5 Flash (08/27): Maintains a relatively stable score with a slight decline as the number of tokens increases.
Gemini 1.5 Pro (08/27): Starts with a high score and maintains it relatively well compared to other models, showing a slight decline."

THEREFORE: in short .... yes, MISTRAL Pixtral 12B provides a correct description of my data visualization. 😃
Smile.

Additional output by Pixtral:

"Comparison of Models:
Gemini 1.5 Pro (08/27) consistently outperforms the other models across all token ranges.
Gemini 1.5 Flash (08/27) also performs well but slightly lower than Gemini 1.5 Pro.
Claude 3 Opus and Claude 3.5 Sonnet show similar performance trends but start at different scores.


Scientific Insight:
The graph illustrates the impact of context length on the performance of different language models in multi-round coreference resolution tasks. As the number of tokens increases, most models show a decline in cumulative average scores, indicating that longer contexts pose a challenge for maintaining high performance. However, Gemini 1.5 Pro (08/27) demonstrates superior robustness, maintaining higher scores even with increased context length. This suggests that Gemini 1.5 Pro (08/27) has a better capability to handle long contexts and maintain coreference resolution accuracy compared to other models."

codeAI

Video request - A comparison of Molmo with other VLMs for extracting entities and relationships from images and text, and then generating Cypher statements to merge the extracted data into open source Neo4j.

The ability of VLMs to act as an interface with hybrid vector and graph databases is crucial.

johnkintree
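A minimal sketch of what that requested pipeline could look like, assuming the official neo4j Python driver; the URI, credentials, and triples are hypothetical placeholders, and the VLM extraction step is assumed to already return (head, relation, tail) triples:

```python
# Sketch: merge VLM-extracted entities/relationships into Neo4j with Cypher.
# Assumes the official `neo4j` Python driver; the triples below stand in for
# whatever a VLM actually extracts from an image or document.
from neo4j import GraphDatabase

URI = "bolt://localhost:7687"   # local Neo4j instance (placeholder)
AUTH = ("neo4j", "password")    # placeholder credentials

def merge_triples(triples):
    """Idempotently upsert (head)-[RELATED]->(tail) triples via MERGE."""
    query = (
        "MERGE (h:Entity {name: $head}) "
        "MERGE (t:Entity {name: $tail}) "
        "MERGE (h)-[r:RELATED {type: $relation}]->(t)"
    )
    with GraphDatabase.driver(URI, auth=AUTH) as driver:
        with driver.session() as session:
            for head, relation, tail in triples:
                session.run(query, head=head, relation=relation, tail=tail)

# Example: triples a VLM might return for a comparison chart.
merge_triples([("Llama 3.2 90B", "COMPARED_WITH", "Molmo 7B")])
```

Because MERGE is an upsert, re-running the extraction over the same image does not create duplicate nodes or relationships.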

3 videos in one day?!?! You are an animal!

irbsurfer

Nice tests! I'm really curious how Pixtral 12B and closed source models (Gemini Flash, Claude Haiku, GPT Mini) perform on the second task. It would have been nice to have included them as a reference, to see how existing models compare, and whether they reflect the benchmarks.

capybaracloud

I am literally downloading those two models and I see this pop up in my feed...❤

AaronALAI

I did some testing as well on some images and asked Molmo about things like which direction something is leaning. Every test showed the same result: it interpreted a mirrored image. For example, if a person was leaning against a wall and, as I see it on my computer, is clearly leaning to the "right" in the picture, Molmo always answered "left". Hence I got the impression that it interpreted a mirrored version of the image.

MrKolten

For me, the Llama 3.2 90B VISION model answers correctly on the first try!

"The image presents a line graph illustrating the Multi-Round Coreference Resolution (MRCR) performance of various models, with the x-axis representing the number of tokens in context and the y-axis representing the cumulative average score. The graph features five distinct lines, each corresponding to a different model: GPT 4o, Claude 3 Opus, Claude 3.5 Sonnet, Gemini 1.5 Flash (08/27), and Gemini 1.5 Pro (08/27).
Key Observations:
Initial Performance: At the starting point (2K tokens), all models exhibit similar performance, with scores ranging from approximately 0.95 to 1.00.
Decline in Performance: As the number of tokens increases, the performance of all models gradually declines. However, the rate of decline varies among the models.
Gemini Models: The Gemini models (Gemini 1.5 Flash and Gemini 1.5 Pro) consistently outperform the other models across all token counts. Notably, they maintain a relatively stable performance level, with scores remaining above 0.85 even at higher token counts.
Claude Models: The Claude models (Claude 3 Opus and Claude 3.5 Sonnet) demonstrate a more pronounced decline in performance as the number of tokens increases. Their scores drop below 0.80 at higher token counts.
GPT 4o: The GPT 4o model exhibits a unique pattern, with its performance initially declining rapidly before stabilizing at around 0.75 for higher token counts.

Main Insight:
The main insight from this graph is that the Gemini models (Gemini 1.5 Flash and Gemini 1.5 Pro) consistently outperform the other models in terms of MRCR performance across all token counts. This suggests that the Gemini models are better equipped to handle larger contexts and maintain their performance levels even when faced with increasing amounts of data. In contrast, the Claude models and GPT 4o exhibit more significant declines in performance as the number of tokens increases, indicating potential limitations in their ability to handle complex contexts."

UncleDao

I want to see AI models be able to detect deceptive data graphs, or indications that the data is not being analyzed correctly to reach a conclusion, or data anomalies apparent in an individual graph that suggest the data underlying the graph should be inspected and tested, and even recommend some data tests to validate it.

It would also be nice for it to formulate some theories on how the graphic could be improved, for both data accuracy and communication.

kabaduck

Maybe if we give it a pair of Meta's glasses 🕶️ it would help.

gnsdgabriel

Did you try to lower the temperature? Contradictory statements are more likely with higher temperature. (At the moment I do not have the numbers to back up my claim, but based on my experience the likelihood of contradictory statements decreases by lowering the temperature to, say, 0.01.)

ilmionomefattimiei
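For anyone who wants to try the low-temperature suggestion above, a minimal sketch using Hugging Face transformers; the checkpoint name and prompt are placeholders, and the same generate() kwargs apply to the vision-instruct checkpoints:

```python
# Sketch: rerun the same prompt at a very low sampling temperature (e.g. 0.01)
# to check whether contradictory statements become less frequent.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-3B-Instruct"   # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tokenizer(
    "Describe the trend shown in the chart.", return_tensors="pt"
).to(model.device)

output = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,     # sampling must be enabled for temperature to apply
    temperature=0.01,   # near-greedy decoding, as the comment suggests
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```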

Ugh, I think we need the community to fix them, since they are open source!

tails_the_god

Molmo states on their blog that their demo is ONLY the 7B model.

yinxing

Hi, I'm new to local LLMs and LLMs in general and trying to learn. How are you running the 90B at such high speed? I tried running 3.1 40B and it was too slow to use realistically. I'm on a 13900K, a 4080, and 128 GB RAM, although it was not using my full CPU, only about 40%. Do you have a dedicated crazy AI setup, or am I doing something wrong? I'd like to be able to use the larger models if possible.

tangibleplanetvisualmedia
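Some back-of-envelope math helps here: 90B parameters are roughly 180 GB at fp16 and still about 45 GB at 4-bit, so on a single 16 GB GPU most layers get offloaded to CPU RAM, which is why local generation is slow. A minimal loading sketch, assuming transformers with bitsandbytes:

```python
# Sketch: load Llama 3.2 90B Vision in 4-bit (assumes transformers + bitsandbytes installed
# and access to the gated checkpoint). Even at 4-bit the weights are ~45 GB, so
# device_map="auto" places what fits on the GPU and offloads the rest to CPU RAM.
import torch
from transformers import AutoProcessor, BitsAndBytesConfig, MllamaForConditionalGeneration

model_id = "meta-llama/Llama-3.2-90B-Vision-Instruct"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

processor = AutoProcessor.from_pretrained(model_id)
model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
```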

the graph does show an upward trend... in an upside-down world 😎

themaxgo

I bet we get much better results once there is a vision model for Qwen2.5

justtiredthings

Can you point me in the direction of LMMs that I can train to compare two images based on criteria that I'll set up in my training data?

therobotocracy

Could it be that the vectors associated with "downward" and "upward" are close together, and it's not able to select one with high confidence? I noticed all these AI models seem to handle words like "and" and "in" poorly; I wonder if the same thing is happening with these words.

kabaduck
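A cheap way to sanity-check that hunch is to compare embeddings of the phrases with an off-the-shelf text embedder (a proxy for, not the same as, the VLM's own representation space); a minimal sketch assuming the sentence-transformers package:

```python
# Sketch: check how close "upward"/"downward" style phrases sit in an embedding space.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")   # small general-purpose text embedder
phrases = ["upward trend", "downward trend", "steady increase", "steady decline"]
embeddings = model.encode(phrases, convert_to_tensor=True)

# Pairwise cosine similarities; values near 1.0 mean the phrases sit close together.
similarities = util.cos_sim(embeddings, embeddings)
for i, a in enumerate(phrases):
    for j, b in enumerate(phrases):
        if i < j:
            print(f"{a!r} vs {b!r}: {similarities[i][j].item():.3f}")
```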

It seems like everyone, including Meta, is trying to get these smaller models to be as capable as the large foundation models. They never will be, nor should they be. We need to stop equating the LLM with AI. It's not. The LLM is no more AI than my prefrontal cortex is me. This 90B model is perfect for narrow domains; I bet with proper fine-tuning in a specific domain it will perform as expected in that domain. It's not the ability to read and understand a graph on LLM performance that is important. What matters is that the LLM understands it's been handed a graph, and that graphs are used to present data and can be interpreted. Those are the very basic tools. Give us models with this very basic understanding of the world and how to interact with it, and we will do the rest.

ToddWBucy-lfyz

This is a generalist model. You have to guide it through the prompt by helping it understand what you expect.

Don't expect a generalist model to solve this problem in zero shot.

Even human experts need some guidance sometimes.

stephanembatchou

I see the point you're trying to make, but getting emotional about this is not useful to any of us. Show what it can do consistently and where it fails consistently, instead of acting like you could make a better LLM alone in your basement.

chosencode