Researchers at Carnegie Mellon University have released a preprint paper presenting their latest system, MT-ACT: Multi-Task Action Chunking Transformer. The system is designed to train robots from minimal data by breaking manipulation tasks into 12 specific “skills,” such as pick, place, slide, and wipe. By combining these skills, the robots can perform complex movements across a variety of scenarios.
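The “action chunking” idea behind MT-ACT is that the policy predicts a short sequence (chunk) of future actions at every step, and overlapping predictions for the same timestep are blended. The sketch below is a toy illustration of that blending step with a stand-in policy; the chunk length, weighting constant, and `fake_policy` function are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

CHUNK = 4  # hypothetical chunk length: actions predicted per policy call

def fake_policy(t):
    """Stand-in for the learned policy: returns a chunk of CHUNK future
    actions starting at timestep t (here just a deterministic ramp)."""
    return np.array([[t + i] for i in range(CHUNK)], dtype=float)

def temporal_ensemble(horizon, m=0.1):
    """At each step the policy re-predicts a whole chunk, so several past
    predictions cover the current timestep; average them with exponential
    weights exp(-m * age) before executing."""
    preds = [[] for _ in range(horizon + CHUNK)]
    for t in range(horizon):
        chunk = fake_policy(t)
        for i, action in enumerate(chunk):
            preds[t + i].append(action)  # this call's guess for step t+i
    executed = []
    for t in range(horizon):
        guesses = preds[t]
        w = np.exp(-m * np.arange(len(guesses)))
        w /= w.sum()
        executed.append(sum(wi * g for wi, g in zip(w, guesses)))
    return np.array(executed)

acts = temporal_ensemble(8)
```

Because the stand-in policy is deterministic, every overlapping prediction agrees and the ensembled action at step `t` is simply `t`; with a real stochastic policy, the averaging smooths out disagreement between chunks.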
In their study, the researchers report that RoboAgent, after being trained on just 7,500 trajectories, successfully executed 12 complex manipulation skills across 38 tasks. The system also generalized these skills to hundreds of new scenarios, even ones involving unfamiliar objects, tasks, and kitchens, and it can continue to improve its capabilities through new experiences.
The researchers were particularly impressed by the system’s robustness to minor variations in a kitchen scene, such as changes in object position, lighting, and background texture, and the introduction of new objects. This is noteworthy because most vision-based systems struggle with such variations, which is why many robots remain confined to repetitive tasks in controlled environments.
To help RoboAgent generalize, the system uses Meta’s Segment Anything Model (SAM) to semantically augment its training data, segmenting objects in a scene so they can be varied. This modular approach, combined with SAM, has proven more successful than the traditional method of training a system for every potential variation.
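The augmentation idea can be illustrated without SAM itself: given a segmentation mask for an object, you can swap that object's pixels for a different one, multiplying the variety of training scenes. The snippet below is a minimal toy sketch; the hand-made mask and `swap_object` helper are assumptions for illustration (in the real pipeline the mask would come from SAM on camera images).

```python
import numpy as np

def swap_object(image, mask, replacement):
    """Toy semantic augmentation: wherever `mask` is True, replace the
    original object's pixels with pixels from `replacement`."""
    out = image.copy()
    out[mask] = replacement[mask]
    return out

# A 4x4 grayscale "scene" with a 2x2 "object" in the top-left corner.
scene = np.zeros((4, 4))
obj_mask = np.zeros((4, 4), dtype=bool)
obj_mask[:2, :2] = True          # hand-made mask standing in for SAM output
new_object = np.full((4, 4), 9.0)

augmented = swap_object(scene, obj_mask, new_object)
```

Each mask-and-swap pass yields a new scene with the same layout but a different object, which is the kind of variation the text credits for RoboAgent's robustness to unfamiliar objects.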
For those interested in further exploration, the datasets are freely available on the project’s GitHub site. You can also read the preprint paper in PDF format and watch the accompanying video below for more insights.
Image Source: Carnegie Mellon University