Watch a robot navigate the Google DeepMind offices using Gemini

Generative AI has already shown a lot of promise in robotics. Applications include natural language interactions, robot learning, no-code programming and even design. Google’s DeepMind Robotics team this week is showcasing another potential sweet spot between the two disciplines: navigation.

In a paper titled “Mobility VLA: Multimodal Instruction Navigation with Long-Context VLMs and Topological Graphs,” the team demonstrates how it has used Google Gemini 1.5 Pro to teach a robot to respond to commands and navigate around an office. Naturally, DeepMind used some of the Every Day Robots that have been hanging around since Google shuttered the project amid widespread layoffs last year.

In a series of videos attached to the project, DeepMind employees open with a smart assistant-style “OK, Robot,” before asking the system to perform different tasks around the 9,000-square-foot office space.

Image Credits: Google DeepMind

In one example, a Googler asks the robot to take him somewhere to draw things. “OK,” the robot responds, wearing a jaunty yellow bowtie, “give me a minute. Thinking with Gemini …” The robot then proceeds to lead the human to a wall-sized whiteboard. In a second video, a different person tells the robot to follow the directions on the whiteboard.

A simple map shows the robot how to get to the “Blue Area.” Again, the robot thinks for a moment before taking a long walk to what turns out to be a robotics testing area. “I’ve successfully followed the directions on the whiteboard,” the robot announces with a level of self-confidence most humans can only dream of.

Prior to these videos, the robots were familiarized with the space using what the team calls “Multimodal Instruction Navigation with demonstration Tours (MINT).” Effectively, that means walking the robot around the office while pointing out different landmarks with speech. Next, the team uses a hierarchical Vision-Language-Action (VLA) policy that “combin[es] the environment understanding and common sense reasoning power.” Once the processes are combined, the robot can respond to written and drawn commands, as well as gestures.
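
For readers who want a feel for the two-level idea, here is a rough Python sketch. It is not DeepMind's code, and everything in it is an assumption made for illustration: the `TOUR` data, the function names, and especially `pick_goal`, which uses simple word overlap as a stand-in for the long-context VLM. The sketch only shows the shape of the approach: a tour becomes a graph of connected places, a high-level step picks a destination from an instruction, and a low-level step finds a route through the graph.

```python
from collections import deque

# Hypothetical demonstration tour: each stop has an id and a spoken note.
TOUR = [
    {"id": 0, "note": "entrance"},
    {"id": 1, "note": "kitchen"},
    {"id": 2, "note": "whiteboard where people draw"},
    {"id": 3, "note": "robotics testing area"},
]


def build_graph(tour, extra_edges=()):
    """Link consecutive tour stops; extra_edges adds any known shortcuts."""
    graph = {stop["id"]: set() for stop in tour}
    for a, b in zip(tour, tour[1:]):
        graph[a["id"]].add(b["id"])
        graph[b["id"]].add(a["id"])
    for a, b in extra_edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def pick_goal(tour, instruction):
    """Stand-in for the VLM: pick the stop whose note shares the most words
    with the instruction. A real system would reason over images and text."""
    words = set(instruction.lower().split())
    best = max(tour, key=lambda s: len(words & set(s["note"].lower().split())))
    return best["id"]


def shortest_path(graph, start, goal):
    """Breadth-first search over the topological graph."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for nxt in sorted(graph[node]):
            if nxt not in parents:
                parents[nxt] = node
                queue.append(nxt)
    return None  # goal is unreachable from start


if __name__ == "__main__":
    graph = build_graph(TOUR)
    goal = pick_goal(TOUR, "Take me somewhere to draw")
    print(shortest_path(graph, 0, goal))
```

In this toy version, the request to go "somewhere to draw" resolves to the whiteboard stop, and the route passes through each stop the tour visited on the way there.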