RT-2, Google’s New Breakthrough To Build Wall-E
Achieving Embodied Intelligence
Since the release of ChatGPT in November 2022, it seems the whole world revolves around AI.
But don’t worry, this isn’t yet another ‘ChatGPT is amazing’ article; today we’re talking about something far more revolutionary.
Google DeepMind’s brand-new robot, RT-2.
But even though robotic arms aren’t “new”, RT-2 is unparalleled in what it can do. In fact, to create RT-2, Google had to build a new class of AI models, something completely unseen until now.
Embodied intelligence is here.
And, maybe, so is Wall-E.
A New Model Class
What society has achieved with AI in the last six months is absolutely incredible.
In simple terms, we’ve democratized the very first machines that communicate with humans the same way humans do, via natural language.
But the potential of AI far exceeds simple text, and many researchers have set their eyes on an even bigger prize.
On a path to building multimodality
In a recent podcast, AI luminary Andrew Ng commented that Computer Vision, the AI field that trains models to process and understand images, was around “two years behind text-prompting”, though he expected it to spark a revolution on par with those models.
And what he was referring to with “text-prompting” was none other than Large Language Models, or LLMs, of which ChatGPT is the greatest example.
But everybody knows that, if we want to build machines with human-grade capabilities, we need much more than that, because we humans are multimodal.
In layman’s terms, we don’t build our understanding of our world based only on text. We have eyes, we have ears, we have touch… all our senses help us build a representation of what the world is.
In fact, these senses teach us about the world when we’re young, years before we learn to read our very first sentence.
Consequently, it’s only natural for AI researchers to want to build models that not only…