NEW Llama 3.2 90B - TEST for Image Reasoning (careful!)

I test the brand-new Llama 3.2 90B VISION model from META and compare it with the new MOLMO multimodal model from AllenAI.
Test #1 is simple; both models fail my Test #2.

All rights with authors:
Llama 3.2: Revolutionizing edge AI and vision with open, customizable models

MOLMO
Multimodal Open Language Model built by Ai2

00:00 NEW Llama 3.2 and AllenAI MOLMO Model
01:21 TEST 1 for Image reasoning - MOLMO
02:03 TEST 1 Llama 3.2 90B Instruct VISION
03:02 TEST 2 Llama 3.2 90B Instruct VISION
04:44 TEST 2 on MOLMO Multimodal Model
08:46 TEST 2 Llama 3.2 90B continued and FAIL

#ai
#meta
#artificialintelligence
Comments

Hi community,
as requested: I just ran test #2 on Pixtral 12B. The exact same image, the exact same text for the prompt, and the result of Mistral Pixtral is:

"Interpretation of Visual Data:
Performance Trends:
GPT-4o: Shows a steady decline in cumulative average score as the number of tokens increases.
Claude 3 Opus: Starts with a high score but shows a significant drop as the number of tokens increases.
Claude 3.5 Sonnet: Exhibits a similar trend to Claude 3 Opus but starts at a lower score.
Gemini 1.5 Flash (08/27): Maintains a relatively stable score with a slight decline as the number of tokens increases.
Gemini 1.5 Pro (08/27): Starts with a high score and maintains it relatively well compared to other models, showing a slight decline."

THEREFORE: in short .... yes, MISTRAL Pixtral 12B provides a correct description of my data visualization. 😃
Smile.

Additional output by Pixtral:

"Comparison of Models:
Gemini 1.5 Pro (08/27) consistently outperforms the other models across all token ranges.
Gemini 1.5 Flash (08/27) also performs well but slightly lower than Gemini 1.5 Pro.
Claude 3 Opus and Claude 3.5 Sonnet show similar performance trends but start at different scores.


Scientific Insight:
The graph illustrates the impact of context length on the performance of different language models in multi-round coreference resolution tasks. As the number of tokens increases, most models show a decline in cumulative average scores, indicating that longer contexts pose a challenge for maintaining high performance. However, Gemini 1.5 Pro (08/27) demonstrates superior robustness, maintaining higher scores even with increased context length. This suggests that Gemini 1.5 Pro (08/27) has a better capability to handle long contexts and maintain coreference resolution accuracy compared to other models."

codeAI

Video request - A comparison of Molmo with other VLMs for extracting entities and relationships from images and text, and then generating Cypher statements to merge the extracted data into open source Neo4j.

The ability of VLMs to act as an interface with hybrid vector and graph databases is crucial.

johnkintree
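A minimal sketch of what that requested pipeline could look like, assuming the official neo4j Python driver; the URI, credentials, and triples are hypothetical placeholders, and the VLM extraction step is assumed to already return (head, relation, tail) triples:

```python
# Sketch: merge VLM-extracted entities/relationships into Neo4j with Cypher.
# Assumes the official `neo4j` Python driver; the triples below stand in for
# whatever a VLM actually extracts from an image or document.
from neo4j import GraphDatabase

URI = "bolt://localhost:7687"   # local Neo4j instance (placeholder)
AUTH = ("neo4j", "password")    # placeholder credentials

def merge_triples(triples):
    """Idempotently upsert (head)-[RELATED]->(tail) triples via MERGE."""
    query = (
        "MERGE (h:Entity {name: $head}) "
        "MERGE (t:Entity {name: $tail}) "
        "MERGE (h)-[r:RELATED {type: $relation}]->(t)"
    )
    with GraphDatabase.driver(URI, auth=AUTH) as driver:
        with driver.session() as session:
            for head, relation, tail in triples:
                session.run(query, head=head, relation=relation, tail=tail)

# Example: triples a VLM might return for a comparison chart.
merge_triples([("Llama 3.2 90B", "COMPARED_WITH", "Molmo 7B")])
```

Because MERGE is an upsert, re-running the extraction over the same image does not create duplicate nodes or relationships.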

3 videos in one day?!?! You are an animal!

irbsurfer

Nice tests! I'm really curious how Pixtral 12B and closed source models (Gemini Flash, Claude Haiku, GPT Mini) perform on the second task. It would have been nice to have included them as a reference, to see how existing models compare, and whether they reflect the benchmarks.

capybaracloud

I am literally downloading those two models and I see this pop up in my feed...❤

AaronALAI

I did some testing as well on some images and asked Molmo about things like which direction something is leaning. Every test showed the same result: it interpreted a mirrored image. For example, if a person was leaning against a wall and, as I see it on my computer, is clearly leaning to the "right" in the picture, Molmo always answered "left". Hence I got the impression that it interpreted a mirrored version of the image.

MrKolten

For me, the Llama 3.2 90B VISION model answers correctly on the first try!

"The image presents a line graph illustrating the Multi-Round Coreference Resolution (MRCR) performance of various models, with the x-axis representing the number of tokens in context and the y-axis representing the cumulative average score. The graph features five distinct lines, each corresponding to a different model: GPT 4o, Claude 3 Opus, Claude 3.5 Sonnet, Gemini 1.5 Flash (08/27), and Gemini 1.5 Pro (08/27).
Key Observations:
Initial Performance: At the starting point (2K tokens), all models exhibit similar performance, with scores ranging from approximately 0.95 to 1.00.
Decline in Performance: As the number of tokens increases, the performance of all models gradually declines. However, the rate of decline varies among the models.
Gemini Models: The Gemini models (Gemini 1.5 Flash and Gemini 1.5 Pro) consistently outperform the other models across all token counts. Notably, they maintain a relatively stable performance level, with scores remaining above 0.85 even at higher token counts.
Claude Models: The Claude models (Claude 3 Opus and Claude 3.5 Sonnet) demonstrate a more pronounced decline in performance as the number of tokens increases. Their scores drop below 0.80 at higher token counts.
GPT 4o: The GPT 4o model exhibits a unique pattern, with its performance initially declining rapidly before stabilizing at around 0.75 for higher token counts.

Main Insight:
The main insight from this graph is that the Gemini models (Gemini 1.5 Flash and Gemini 1.5 Pro) consistently outperform the other models in terms of MRCR performance across all token counts. This suggests that the Gemini models are better equipped to handle larger contexts and maintain their performance levels even when faced with increasing amounts of data. In contrast, the Claude models and GPT 4o exhibit more significant declines in performance as the number of tokens increases, indicating potential limitations in their ability to handle complex contexts."

UncleDao

I want to see AI models be able to detect deceptive data graphs, or indications that the data is not being analyzed correctly to reach a conclusion, or data anomalies apparent in an individual graph that suggest the data underlying the graph should be inspected and tested, and even recommend some data tests to validate it.

It would also be nice for it to formulate some theories on how the graphic could be improved, for both data accuracy and communication.

kabaduck

Maybe if we give it a pair of Meta's glasses 🕶️ it would help.

gnsdgabriel

Did you try to lower the temperature? Contradictory statements are more likely with higher temperature. (At the moment I do not have the numbers to back up my claim, but based on my experience the likelihood of contradictory statements decreases by lowering the temperature to, say, 0.01.)

ilmionomefattimiei
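For anyone who wants to try the low-temperature suggestion above, a minimal sketch using Hugging Face transformers; the checkpoint name and prompt are placeholders, and the same generate() kwargs apply to the vision-instruct checkpoints:

```python
# Sketch: rerun the same prompt at a very low sampling temperature (e.g. 0.01)
# to check whether contradictory statements become less frequent.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-3B-Instruct"   # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tokenizer(
    "Describe the trend shown in the chart.", return_tensors="pt"
).to(model.device)

output = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,     # sampling must be enabled for temperature to apply
    temperature=0.01,   # near-greedy decoding, as the comment suggests
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```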

Ugh, I think we need the community to fix them, since they are open source!

tails_the_god

Molmo states on their blog that their demo is ONLY the 7B model.

yinxing

Hi, I'm new to local LLMs and LLMs in general and trying to learn. How are you running the 90B at such high speed? I tried running 3.1 40B and it was too slow to use realistically. I'm on a 13900K, a 4080, and 128 GB RAM, although it was not using my full CPU, only about 40%. Do you have a dedicated crazy AI setup, or am I doing something wrong? I'd like to be able to use the larger models if possible.

tangibleplanetvisualmedia
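Some back-of-envelope math helps here: 90B parameters are roughly 180 GB at fp16 and still about 45 GB at 4-bit, so on a single 16 GB GPU most layers get offloaded to CPU RAM, which is why local generation is slow. A minimal loading sketch, assuming transformers with bitsandbytes:

```python
# Sketch: load Llama 3.2 90B Vision in 4-bit (assumes transformers + bitsandbytes installed
# and access to the gated checkpoint). Even at 4-bit the weights are ~45 GB, so
# device_map="auto" places what fits on the GPU and offloads the rest to CPU RAM.
import torch
from transformers import AutoProcessor, BitsAndBytesConfig, MllamaForConditionalGeneration

model_id = "meta-llama/Llama-3.2-90B-Vision-Instruct"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

processor = AutoProcessor.from_pretrained(model_id)
model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
```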

the graph does show an upward trend... in an upside-down world 😎

themaxgo

I bet we get much better results once there is a vision model for Qwen2.5

justtiredthings

Can you point me in the direction of LMMs that I can train to compare two images based on criteria that I'll set up in my training data?

therobotocracy

Could it be that the vectors associated with "downward" and "upward" are close together, and it's not able to select one with high confidence? I noticed all these AI models seem to handle words like "and" and "in" poorly; I wonder if the same thing is happening with these words.

kabaduck
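A cheap way to sanity-check that hunch is to compare embeddings of the phrases with an off-the-shelf text embedder (a proxy for, not the same as, the VLM's own representation space); a minimal sketch assuming the sentence-transformers package:

```python
# Sketch: check how close "upward"/"downward" style phrases sit in an embedding space.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")   # small general-purpose text embedder
phrases = ["upward trend", "downward trend", "steady increase", "steady decline"]
embeddings = model.encode(phrases, convert_to_tensor=True)

# Pairwise cosine similarities; values near 1.0 mean the phrases sit close together.
similarities = util.cos_sim(embeddings, embeddings)
for i, a in enumerate(phrases):
    for j, b in enumerate(phrases):
        if i < j:
            print(f"{a!r} vs {b!r}: {similarities[i][j].item():.3f}")
```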

It seems like everyone, including Meta, is trying to get these smaller models to be as capable as the large foundation models. They never will be, nor should they be. We need to stop equating the LLM with AI. It's not. The LLM is no more AI than my prefrontal cortex is me. This 90B model is perfect for narrow domains; I bet with proper fine-tuning in a specific domain it will perform as expected in that domain. It's not the ability to read and understand a graph on LLM performance that is important. What matters is that the LLM understands it's been handed a graph, and that graphs are used to present data and can be interpreted. Those are the very basic tools. Give us models with this very basic understanding of the world and how to interact with it, and we will do the rest.

ToddWBucy-lfyz

This is a generalist model. You have to guide it through the prompt by helping it understand what you expect.

Don't expect a generalist model to solve this problem in zero shot.

Even human experts need some guidance sometimes.

stephanembatchou

I see the point you're trying to make, but getting emotional about this is not useful to any of us. Show what it can do consistently and where it fails consistently, instead of acting like you could make a better LLM alone in your basement.

chosencode