Dear Reader,

The empty fryer basket hit the oil, and the CEO’s face turned red with embarrassment.

I’d been invited out to see a demo of a “robot cook.” It was supposed to automate the messy work of using the fryer and grill in fast-food restaurants.

The robot’s french fry hopper had run dry during the demo. But it was still tending to an empty frying basket as if it were full of sizzling fries.

It was a good idea. But the tech still had a ways to go.

Last week, AI researchers showed off a robot that wouldn’t have made that same mistake.

And it has important implications for the future of work…

Meet RT-2

On July 28, DeepMind, Google’s AI research lab, unveiled Robotic Transformer 2 (RT-2), an AI model built to control robots.

Unlike previous generations of robots that had limited application, RT-2 can use reasoning to understand commands and interact with the world. It does this with two technologies.

First, it uses vision-language models to reason through commands and make sense of the world.

That means it sees the world and assigns labels to everything within it. It can then process a natural language command and act on it.

Today, most robotic automation is designed to complete simple, repetitive tasks. What makes RT-2 unique is that it can “see,” “hear,” and understand the world around it.

Let me offer you an example from DeepMind’s presentation.

RT-2 was asked to pick up an extinct animal from a table littered with toy animals, fruits, balls, and other objects. Within seconds of being prompted, RT-2 reached for the dinosaur and held it up.

The AI was able to identify all of the objects and reasoned that the toy brachiosaurus was the extinct animal. Importantly, RT-2 wasn’t programmed to perform that specific task. It had to “figure out” the command.

Source: Google DeepMind

That means its AI correctly identified all of the objects on the table, assigned meaningful labels, and translated the command into the correct action.
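
For readers who like to see things spelled out, here is a toy Python sketch of those three steps: label the objects, match the command against the labels, and pick an action. Everything in it is hypothetical. The object list, the KNOWN_FACTS table, and the choose_action function are my own stand-ins for illustration. RT-2 itself relies on learned AI models, not a hand-written lookup table like this one.

```python
# Toy illustration only: a hand-written stand-in for what RT-2 does with learned models.

# Step 1: labels a vision model might assign to objects on the table (hypothetical).
objects_on_table = ["banana", "toy dinosaur", "tennis ball", "toy horse"]

# Hypothetical background knowledge linking each label to its properties.
KNOWN_FACTS = {
    "toy dinosaur": {"animal", "extinct"},
    "toy horse": {"animal"},
    "banana": {"fruit"},
    "tennis ball": {"ball"},
}


def choose_action(command_properties, labels):
    """Return a pick-up action for the first object that has every requested property."""
    for label in labels:
        if command_properties <= KNOWN_FACTS.get(label, set()):
            return ("pick_up", label)
    return ("do_nothing", None)


# Steps 2 and 3: "pick up the extinct animal" reduced to the properties it asks for,
# then translated into an action.
action = choose_action({"extinct", "animal"}, objects_on_table)
print(action)  # ('pick_up', 'toy dinosaur')
```

The toy horse is an animal but not extinct, so it is skipped. If nothing on the table matches, the sketch returns a "do_nothing" action instead of guessing.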

That’s an incredible breakthrough.

This is the first time, to my knowledge, that a robot has shown this level of reasoning and response. And what’s most interesting is how RT-2 “thinks.”

Real-Time Learning

Previous versions of robots were trained with large visual data sets. Imagine teaching a robot what everything is with a giant stack of flashcards.

In a controlled lab setting, it might do alright. But if you take it out in the real world, it’ll struggle to make sense of what it sees.

RT-2 doesn’t work that way.

Using two different AI models that were trained on web-scale data, it effectively draws on a snapshot of the web to recognize the objects in its view. In some sense, it learns in real time.

RT-2 uses PaLI-X and PaLM-E to make sense of commands and what it sees.

PaLI-X is a variant of the Pathways Language and Image model. This is what powers its ability to make sense of what it sees. In essence, it adds captions to everything within its field of view.

PaLM-E is the Pathways Language Model, Embodied. This is a multimodal (language and images) AI that has been “embodied” to make sense of the world through its robotic sensors.

The PaLI-X version of RT-2 uses 55 billion parameters and the PaLM-E version uses 12 billion. Parameters are the numerical values a model learns during training. Together, they make up the final AI model.
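
To make “parameters” a little more concrete, here is a small Python sketch that counts them for a tiny, made-up neural network. The layer sizes are my own assumption, chosen only for illustration. They have nothing to do with the actual design of PaLI-X or PaLM-E, which are vastly larger.

```python
def count_parameters(layer_sizes):
    """Count weights and biases in a simple fully connected network."""
    total = 0
    for inputs, outputs in zip(layer_sizes, layer_sizes[1:]):
        total += inputs * outputs  # one weight per input-output connection
        total += outputs           # one bias per output
    return total


# Hypothetical network: 4 inputs, one hidden layer of 8 units, 2 outputs.
tiny_network = [4, 8, 2]
print(count_parameters(tiny_network))  # 58
```

Each of those 58 numbers gets adjusted during training. A model with billions of parameters works on the same principle, just at a far greater scale.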

Despite the difference in the number of parameters used, the “reasoning” abilities of these models surpass previous AI-robot models.


The purple and green bars are RT-2. The gray bar is RT-1, the previous version from DeepMind. And VC1 is the classic, visually trained “flashcard” model.

You’ll notice a dramatic increase in capabilities. The old visually trained robot never scored higher than 20%. RT-2 scored between 30% and 80%.

I’ll admit, that still isn’t high enough for real-world use. We certainly don’t want robots banging around with only a 50% success rate.
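
A quick back-of-the-envelope calculation shows why. The Python sketch below assumes each task succeeds or fails independently of the others. That is my simplifying assumption, not something DeepMind reported, and the function name is made up for this example.

```python
def chance_all_succeed(success_rate, num_tasks):
    """Probability that every task in a chain succeeds, assuming independent tasks."""
    return success_rate ** num_tasks


# A robot that gets each task right half the time, asked to do several in a row.
for num_tasks in (1, 3, 5):
    print(num_tasks, chance_all_succeed(0.5, num_tasks))
# 1 0.5
# 3 0.125
# 5 0.03125
```

Under that assumption, a chore with five steps would go right only about 3% of the time. Per-task success rates need to climb much higher before chains of tasks become dependable.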

But what RT-2 shows is a breakthrough in how to program robots. RT-2 isn’t the end… It’s the very beginning of interactive robots.

Let’s take a moment to consider the practical applications of RT-2.

RT in the Real World

I can see an AI-powered robot like RT-2 being used in Amazon warehouses. It would be able to receive items in the warehouse, sort them into bins that go on the shelves, and also retrieve items for shipping.

It could use both visual identification and barcode scanning to make sure that it’s handling the correct item. Similarly, robots like RT-2 could be used to restock store shelves overnight.

Those are mass commercial applications.

I also think these robots could make their way into our homes. Imagine the convenience of being able to ask RT-2 to sort and fold your clothing… set the table for dinner… or let the dog out.

Those might seem like mundane uses, but what if these robots allowed the elderly to stay independent for longer? Having an untiring helper always there to assist with daily chores preserves independence and even offers a layer of safety.

I’d feel better knowing that my grandma isn’t getting up on a stool to reach dishes.

I don’t think these are far-flung dreams. DeepMind unveiled RT-1 in December 2022. Not even a year later, we have RT-2… a 2x-3x improvement.

Subsequent versions will get even better. And competitors will use this framework to create rivals that will have an edge in one way or another.

RT-2 shows that AI isn’t just capable of doing white-collar work like writing reports or creating images. It can also interact with the physical world.

That means we’re heading for a world where certain forms of physical labor are divorced from human effort. I can just imagine robots laying bricks, picking fruits, loading trucks, and so much more… That would create a leap in productivity that we can hardly imagine today.

Now, here’s my question to you: Would you invite a future version of RT-2 into your home as an assistant?

Share your thoughts with me at [email protected].


Colin Tedards
Editor, The Bleeding Edge