【S3E6】Generalist Embodied AI in an Open World
#artificialintelligence #computervision #chatgpt #robot
Title: Generalist Embodied AI in an Open World
Abstract:
From generalist manipulators to humanoids, robotics and embodied AI are at center stage again, but surrounded by a completely different AI landscape, in which large pretrained models such as LLMs and VLMs are making rapid advances on multiple fronts of human intelligence. Indeed, embodied AI itself is also experiencing a paradigm shift: from closed-world, static settings to more realistic, open-world, dynamic environments. In this talk, I will present some of our recent efforts to bring more open-endedness to the world of embodied agents. We will first cover SQA3D, a new benchmark for embodied reasoning in 3D scenes. It combines the best of both worlds with open-vocabulary, knowledge-intensive, and situated reasoning, and poses substantial challenges to existing ML models, including LLMs. Building on this groundwork, I will provide some updates on developing open-world generalist embodied agents by leveraging these large models and their principles. Specifically, we explore some key ingredients in developing a vision-based multi-task agent controller in Minecraft, including multimodal fusion and horizon prediction. To further enable solving complex long-term tasks, we propose a hierarchical goal execution agent architecture based on large models, which has become one of the best agents so far on the “ObtainDiamond” challenge. Finally, I will review some ongoing and possible future directions.
Bio:
Xiaojian Ma is a research scientist at the Beijing Institute for General Artificial Intelligence (BIGAI). He received his Ph.D. in Computer Science from UCLA and a bachelor's degree in Computer Science from Tsinghua University. His research interests focus primarily on large-scale multimodal learning for understanding, reasoning, and skill learning. In particular, he is interested in building models/agents that can learn from 2D/3D vision and text data and perform a wide range of reasoning, embodied planning, and control tasks. He has worked at DeepMind, NVIDIA Research, and Google Brain Robotics with a focus on large-scale machine learning. His research has been recognized with a best paper award at an ICML workshop and with research fellowships.