This New AI Can Find Your Dog In A Video! 🐩


📝 The paper "MTTR - End-to-End Referring Video Object Segmentation with Multimodal Transformers" is available here:

🙏 We would like to thank our generous Patreon supporters who make Two Minute Papers possible:
Aleksandr Mashrabov, Alex Balfanz, Alex Haro, Andrew Melnychuk, Angelos Evripiotis, Benji Rabhan, Bryan Learn, Christian Ahlin, Eric Martel, Gordon Child, Ivo Galic, Jace O'Brien, Javier Bustamante, John Le, Jonas, Jonathan, Kenneth Davis, Klaus Busse, Lorin Atzberger, Lukas Biewald, Matthew Allen Fisher, Michael Albrecht, Michael Tedder, Nikhil Velpanur, Owen Campbell-Moore, Owen Skarpness, Rajarshi Nigam, Ramsey Elbasheer, Steef, Taras Bobrovytsky, Thomas Krcmar, Timothy Sum Hon Mun, Torsten Reil, Tybie Fitzhugh, Ueli Gallizzi.

Thumbnail background image credit:

Károly Zsolnai-Fehér's links:
Comments

I would like to be able to point my camera at the forest and have it pick out and highlight any animals in the picture, or any mushrooms, or any specific types of plants for me, so I can just walk up to those plants or animals or whatever.

ChadKovac

A bit confused. All of these videos were of two subjects with clear distinctions, yet the search key for selecting them had these elaborate, highly specific descriptions. For instance, the search keys 'man' and 'surfboard' would have been completely sufficient for selecting the two. Is the AI actually capable of understanding these complex descriptions, or were all those descriptors unnecessary fluff to make it seem more impressive? If I gave it the same surfing video but with 5 more surfers, all with different appearances and different colored boards, would it actually select the correct one? If so, these videos were poorly picked to demonstrate that, imo. Also, admittedly a bit confused by this being compared to pose estimation. Image segmentation and pose estimation are rather tenuously related. Odd comparison.

radpugguy

I see this being used to make a comprehensive, searchable database of all human video. Something akin to what Google did a few years ago by scanning every book they could get their hands on. With YouTube already under their Alphabet umbrella, they have a big starting database.

felekar

Whoa!
I've said it before and I'll say it again: multimodal AIs are the future.
This is insane!
I wanna see all these things (this one, CLIP / CLOOB and/or DALL-E, CM3, and more) combined into one big end-to-end any-modality-to-any-modality model

Kram

Amazing! This can be used in robotics to make robots really smart, e.g. "Jarvis, bring me a bottle of water"

michaelvechtomov

Just imagine: the end of rotoscoping. I could tell the video editing program what I wanted to cut out and it would rotoscope a mask of exactly what I wanted. Literal hours of busy work saved

willnine

This will be such a huge plus to add into Getty Images or any other stock footage company, to detect exactly what a person needs

InnoSang

Can you imagine what this implementation could do in the world of medicine? For example, if we use it as a predictive model for the progression of a mobility-debilitating issue, we could catch it early on without using invasive techniques, making diagnostic medicine so much more affordable.
Mind blowing

Mufasa

Find: Crouching Tiger, Hidden Dragon.

I can’t wait for a generative version of this:
"Add Dr Károly, holding onto his papers"

Will-ktjk

The Hungarian accent has really grown on me over the past year. It's so endearing! I remember how the first video I watched drew me in; it was the one on the hide-and-seek AI.

jfk_the_second

I'm confused by the demo videos, because they don't show the "extra" description being used in any way at all. You could say "man", "skateboard", "racket", etc. and it would still find what's there. I think the demo is really lacking

SeyHan

Are there examples of them telling it to look for the wrong thing? Like, in the examples where you give a specific color, if you named the wrong color, would it decline to highlight anything, as expected? Or are we just being shown the correct way?

robertrynard

With every video, the "yea, we are simulated" becomes more apparent

ТуанНгуен-ьп

Alexa, bring me the yellow surfboard from the left!

ccosmin

This will be/is a great tool for autonomous driving and AI applications

beautolan

If I'm not mistaken, the technique in this paper doesn't do tracking or pose estimation, but rather segmentation. That's a somewhat different task.

Vassay

Does this work for determining whether an object actually is within a video? Or does it just apply the colouring to whatever it thinks is closest? If it is the former, this could be great for sifting through video data, say, security footage, to find occurrences of events.

shayboual

Can't wait for the next paper: "replace the person riding a bike by a capybara wearing a hat and sunglasses"

pasikavecpruhovany

Wow, end-to-end trivial to learn surveillance!

StephenRoseDuo

I'm not actually questioning the validity of the software here, more like questioning the efficacy of the demonstrations. Let's say at 4:19, the thing we are searching for is "a tennis racket in the hand of a player with a red skirt"; however, simply going off this video, for all we know it is only tracking a tennis racket, because there is nothing else in the scene that it would really need that prompt to differentiate between. Aside from that, it really is awesome

blankblank