Visual augmentation has become a crucial technique for enhancing the visual robustness of imitation learning. However, existing methods are often limited by prerequisites such as camera calibration or the need for controlled environments (e.g., green screen setups). In this work, we introduce RoboEngine, the first plug-and-play visual robot data augmentation toolkit: users can effortlessly generate physics- and task-aware robot scenes with just a few lines of code. To achieve this, we present a novel robot scene segmentation dataset, a generalizable high-quality robot segmentation model, and a fine-tuned background generation model, which together form the core components of the out-of-the-box toolkit. Using RoboEngine, we demonstrate that robot manipulation policies trained on demonstrations collected in a single scene generalize to six entirely new scenes, achieving a more than 200% performance improvement over the no-augmentation baseline. All datasets, model weights, and the toolkit will be publicly released.
We propose RoboSeg, a novel dataset with high-quality robot scene segmentation annotations. RoboSeg contains 3,800 images randomly selected from over 35 robot datasets, covering a broad range of robot types (e.g., Franka, WidowX, HelloRobot, UR5, Sawyer, and Xarm), camera views, and background environments.
Based on RoboSeg, we fine-tune the state-of-the-art (SoTA) language-conditioned segmentation model EVF-SAM to create a new robot segmentation model, Robo-SAM, which achieves high-quality open-world robot segmentation.
Given a robot scene image, we first generate the robot mask and the masks of task-related objects using Robo-SAM and EVF-SAM (the latter conditioned on the object names in the task instruction). We then use a generative model to create a physics- and task-aware background based on the previously generated masks. For this, we use BackGround-Diffusion, which, given a foreground mask and a scene description, generates a foreground-aware background that respects physical constraints. We fine-tune this model on our RoboSeg dataset to suppress the occasionally implausible backgrounds produced by the off-the-shelf model.
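The final augmentation step implied by this pipeline, keeping the masked foreground (robot plus task objects) and replacing everything else with the generated background, can be sketched in plain NumPy. This is a minimal illustration under assumed array shapes, not the actual RoboEngine implementation; all function and variable names here are hypothetical.

```python
import numpy as np

def composite_augmented_scene(image, robot_mask, object_masks, background):
    """Paste the masked foreground onto a generated background (illustrative sketch).

    image:        (H, W, 3) original robot scene
    robot_mask:   (H, W) boolean mask from a robot segmentation model
    object_masks: list of (H, W) boolean masks for task-relevant objects
    background:   (H, W, 3) newly generated background image
    """
    # The union of the robot mask and all object masks defines the
    # foreground region that must be preserved from the original image.
    foreground = robot_mask.copy()
    for m in object_masks:
        foreground |= m
    # Keep original pixels inside the foreground, new background elsewhere.
    augmented = np.where(foreground[..., None], image, background)
    return augmented, foreground
```

Running this once per generated background yields a set of augmented training images that share the same robot and object pixels but differ in scene context.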
@article{yuan2025roboengine,
  title={RoboEngine: Plug-and-Play Robot Data Augmentation with Semantic Robot Segmentation and Background Generation},
  author={Yuan, Chengbo and Joshi, Suraj and Zhu, Shaoting and Su, Hang and Zhao, Hang and Gao, Yang},
  journal={arXiv preprint arXiv:2503.18738},
  year={2025}
}