We present RoboCurate, a neural trajectory generation framework that increases data diversity via controllable video generation and filters low-quality samples by checking motion consistency between each generated video and its simulator replay. Specifically, RoboCurate replays the predicted actions in a simulator and scores action quality by how closely the resulting rollout matches the generated video. In addition, we extend observation diversity beyond the available dataset via image-to-image editing, and apply action-preserving video-to-video transfer to further augment appearance.
1. Generation Stage
We expand observation diversity with two components: (1) image-to-image (I2I) editing of the initial frame for scene-level variation, and (2) video-to-video (V2V) transfer for appearance diversity while preserving the original motion.
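The two-step flow above can be sketched as follows. This is a minimal, hypothetical orchestration sketch: `i2i_edit` and `v2v_transfer` stand in for the learned I2I and V2V models (their real interfaces are not specified in the text), and frames are plain dictionaries so the action-preservation property is easy to check.

```python
# Placeholder generators: in the actual system these would be learned
# I2I and V2V models; here they are hypothetical stubs that tag frames
# with the applied edit so the data flow is visible and testable.
def i2i_edit(initial_frame, prompt):
    """Stand-in for an image-to-image model editing the scene."""
    return {**initial_frame, "scene_edit": prompt}

def v2v_transfer(video, style):
    """Stand-in for an action-preserving video-to-video model:
    appearance changes per frame, the motion (actions) is untouched."""
    return [{**frame, "style": style} for frame in video]

def generate_diverse_video(video, scene_prompt, style):
    """Augmentation flow sketched in the text:
    1) edit the first frame for scene-level variation,
    2) restyle the whole clip while keeping the motion."""
    edited_first = i2i_edit(video[0], scene_prompt)
    video = [edited_first] + video[1:]
    return v2v_transfer(video, style)

clip = [{"t": t, "action": (0.1 * t, 0.0)} for t in range(4)]
augmented = generate_diverse_video(clip, "wooden table", "night lighting")
# Actions survive both edits; only appearance metadata changes.
assert [f["action"] for f in augmented] == [f["action"] for f in clip]
```

The key invariant is that neither edit touches the action channel: I2I only rewrites the first observation, and V2V only restyles appearance frame by frame.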
2. Filtering Stage
We filter out suboptimal synthetic trajectories with inaccurate actions by replaying the predicted actions in a simulator and scoring the motion consistency between the rollout and the generated video. To measure this similarity, we train an attentive probe on top of a frozen video encoder, using positive and negative pairs constructed automatically from real data.
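One plausible instantiation of such an attentive probe is sketched below. The text does not specify the probe's architecture, so this is an assumption: a single learned query cross-attends over the frozen per-clip features, and a linear head maps the pooled vectors of both clips to a consistency score. NumPy replaces a deep-learning framework, and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class AttentiveProbe:
    """Minimal attentive probe (hypothetical): a learned query attends
    over frozen per-frame features; a linear head maps the pooled
    rollout and video vectors to a motion-consistency score.
    Weights are random here; in practice they would be trained on
    positive/negative pairs."""
    def __init__(self, dim):
        self.query = rng.normal(size=dim)       # learned attention query
        self.w = rng.normal(size=2 * dim)       # linear scoring head
        self.b = 0.0

    def pool(self, feats):
        attn = softmax(feats @ self.query)      # (T,) attention weights
        return attn @ feats                     # (dim,) pooled vector

    def score(self, rollout_feats, video_feats):
        z = np.concatenate([self.pool(rollout_feats),
                            self.pool(video_feats)])
        logit = self.w @ z + self.b
        return 1.0 / (1.0 + np.exp(-logit))     # consistency in (0, 1)

probe = AttentiveProbe(dim=16)
rollout = rng.normal(size=(8, 16))  # frozen-encoder features, 8 frames
video = rng.normal(size=(8, 16))
s = probe.score(rollout, video)
assert 0.0 < s < 1.0
```

At filtering time, a trajectory would be kept only if its score exceeds a threshold, i.e. the simulator rollout and the generated video exhibit consistent motion.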
Accurate action: Simulator rollout ≈ Synthetic video
Inaccurate action: Simulator rollout ≠ Synthetic video
Examples of positive and negative pairs used to train the attentive probe.
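The text says the probe's training pairs are generated automatically from real data but does not say how; a natural scheme, assumed here, is to treat each real episode paired with the replay of its own actions as a positive, and the same video re-paired with a replay of another episode's actions as a negative (motions disagree, as in the "inaccurate action" case above).

```python
import random

def make_probe_pairs(real_episodes, seed=0):
    """Hypothetical pair construction: each real episode yields its
    (video, replay-of-own-actions) as a positive pair, and a negative
    pair re-pairs the video with a replay of actions drawn from a
    different episode, so the two motions disagree."""
    rng = random.Random(seed)
    pairs = []
    for i, ep in enumerate(real_episodes):
        pairs.append((ep["video"], ep["replay"], 1))  # matched -> positive
        j = rng.choice([k for k in range(len(real_episodes)) if k != i])
        pairs.append((ep["video"], real_episodes[j]["replay"], 0))  # mismatched
    return pairs

episodes = [{"video": f"vid{i}", "replay": f"replay{i}"} for i in range(3)]
pairs = make_probe_pairs(episodes)
assert len(pairs) == 6  # one positive and one negative per episode
```

This mismatching trick requires no manual labels, which matches the claim that the pairs come automatically from real data.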
In-distribution task — Pick and Place Can
Out-of-distribution task (Novel Object) — Pick and Place Cup
Out-of-distribution task (Novel Behavior) — Pour Can
We report the average success rate (%) over 50 trials across 24 tasks (18 rearrangement tasks and 6 articulated-object tasks).
We report the average success rate (%) over 50 trials across 6 tasks (3 GR-1 Humanoid and 3 Bimanual Panda Arms with Dexterous Hands), trained with 100 demonstrations per task.