December 02, 2025
Seed GR-RL
A reinforcement learning framework for long-horizon dexterous manipulation, enabling robots to stably execute multi-step, high-precision tasks in real-world environments.
Technical Report
GR-RL Framework

We present GR-RL, a robotic learning framework that turns a generalist vision-language-action (VLA) policy into a highly capable specialist for long-horizon dexterous manipulation. Existing VLA policies largely rest on the assumption that human demonstrations are optimal. We argue that for highly dexterous, high-precision manipulation tasks this assumption breaks down: human demonstrations are often noisy and sub-optimal.

GR-RL performs long-horizon, dexterous, high-precision manipulation, demonstrated on the task of shoe lacing, through a multi-stage training pipeline with three stages: 1) offline data filtering, 2) physics-symmetry augmentation, and 3) online steering reinforcement learning.
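Stage 2 exploits the left-right symmetry of the manipulation setup. As a rough illustration of what such an augmentation can look like, the sketch below mirrors a single image-action pair about the robot's sagittal plane. The camera placement, the simplified [dx, dy, dz, grip] action layout, and the `mirror_transition` helper are assumptions made for illustration, not the report's actual interface; a full implementation would also need to mirror rotation components of the action and, for a two-arm setup, swap arm assignments.

```python
import numpy as np

def mirror_transition(rgb, action):
    """Reflect one transition about the robot's sagittal (x-z) plane.

    Assumed, simplified data layout (not the report's actual interface):
    - rgb:    HxWx3 image from a roughly front-facing camera
    - action: [dx, dy, dz, grip] end-effector delta, with y pointing left
    """
    mirrored_rgb = rgb[:, ::-1, :].copy()      # flip the image left-right
    mirrored_action = action.copy()
    mirrored_action[1] = -mirrored_action[1]   # negate the lateral (y) delta
    return mirrored_rgb, mirrored_action

# Example: double a tiny demonstration buffer with mirrored copies.
demo = [(np.zeros((224, 224, 3), np.uint8), np.array([0.01, 0.02, 0.0, 1.0]))]
augmented = demo + [mirror_transition(rgb, act) for rgb, act in demo]
```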

GR-RL's multi-stage training pipeline filters and augments the demonstrations, then refines the policy with reinforcement learning. First, GR-RL learns a vision-language-conditioned task-progress function, filters the demonstration trajectories, and keeps only the transitions that contribute positively to progress. Specifically, we show that by directly applying offline RL with a sparse reward, the resulting Q-values can be treated as a robust progress function. Next, we devise a series of simple yet effective augmentation tricks that greatly improve the performance of GR-RL. Lastly, to better align the VLA policy with its deployment-time behavior for high-precision control, we perform online RL that learns a noise predictor in the policy's latent space.
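As a rough sketch of the first stage, the snippet below shows how Q-values from a critic trained with sparse-reward offline RL could be read as a progress signal and used to keep only progress-increasing transitions. The toy network, its dimensions, the `filter_by_progress` helper, and the exact scoring rule are illustrative assumptions; the report does not specify these details.

```python
import torch
import torch.nn as nn

class QNet(nn.Module):
    """Toy critic stand-in; the real critic is trained with offline RL on a sparse success reward."""
    def __init__(self, obs_dim=32, act_dim=14):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(obs_dim + act_dim, 256), nn.ReLU(), nn.Linear(256, 1)
        )

    def forward(self, obs, action):
        return self.mlp(torch.cat([obs, action], dim=-1))

@torch.no_grad()
def filter_by_progress(trajectory, q_net, margin=0.0):
    """Keep transitions whose estimated task progress (Q-value) increases."""
    kept = []
    for obs, action, next_obs in trajectory:
        # The Q-value is treated as a progress function; scoring the next state
        # with the same action is a simplification made for this sketch.
        before = q_net(obs, action).item()
        after = q_net(next_obs, action).item()
        if after - before > margin:
            kept.append((obs, action, next_obs))
    return kept

# Toy usage: score and filter a single fake transition.
obs, act, nxt = torch.randn(32), torch.randn(14), torch.randn(32)
kept = filter_by_progress([(obs, act, nxt)], QNet())
```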

With this pipeline, GR-RL is—to our knowledge—the first learning-based policy that can autonomously lace up a shoe by threading shoelaces through multiple eyelets with an 83.3% success rate, a task requiring long-horizon reasoning, millimeter-level precision, and compliant soft-body interaction. We hope GR-RL provides a step toward enabling generalist robot foundation models to specialize into reliable real-world experts.

The GR-RL Model

Dexterous Long-Horizon Capabilities

The multi-stage training pipeline yielded substantial performance improvements. The baseline model trained with behavior cloning achieved a success rate of 45.7%. Task-progress-based data filtering raised it to 61.6%, and data augmentation further boosted it to 72.7%.

Left: the success rate of our multi-stage training recipe. Data filtering, mirror augmentation, and online tuning all contribute to the final performance. Right: the binary success signal per episode (dots) and the moving average of success rate (curve) during online finetuning. The performance increases rapidly after an offline-to-online adaptation phase.

Subsequent online reinforcement learning fine-tuning initially showed transient performance fluctuations due to policy distribution shift, followed by rapid recovery and sustained improvement. Starting from the enhanced offline-trained model, GR-RL's success rate ultimately rose to 83.3% after roughly 150 real-world closed-loop exploration and correction trajectories.
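The latent-space noise predictor used for this online phase can be pictured as a small head that perturbs the frozen VLA policy's latent and is reinforced with the sparse, episode-level success signal collected from these real-world trials. The sketch below is a minimal illustration under assumed names, latent dimension, and a plain REINFORCE-style update; it is not the report's actual training procedure.

```python
import torch
import torch.nn as nn

class LatentNoisePredictor(nn.Module):
    """Gaussian perturbation head over the frozen VLA policy's latent.

    Latent dimension, architecture, and the REINFORCE-style update below are
    illustrative assumptions; the report only states that online steering RL
    learns a noise predictor in the policy's latent space.
    """
    def __init__(self, latent_dim=64):
        super().__init__()
        self.head = nn.Linear(latent_dim, 2 * latent_dim)  # mean and log-std

    def forward(self, latent):
        mean, log_std = self.head(latent).chunk(2, dim=-1)
        return torch.distributions.Normal(mean, log_std.clamp(-5.0, 2.0).exp())

def steering_update(predictor, optimizer, latents, noises, episode_success, baseline=0.5):
    """One policy-gradient step on the latent noise applied during an episode."""
    dist = predictor(latents)                      # per-step Gaussian over latent noise
    log_prob = dist.log_prob(noises).sum(dim=-1)   # log-likelihood of the noise actually applied
    advantage = float(episode_success) - baseline  # sparse episode-level success minus a baseline
    loss = -(advantage * log_prob).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At deployment, the perturbed latent would be decoded by the frozen VLA policy, e.g. sampling `noise = predictor(latent).sample()` and acting from `latent + noise`.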

Detailed stage analysis revealed that data filtering and online RL substantially improved reliability during the critical eyelet-threading phase, while data augmentation provided consistent performance gains across all operational stages.

Detailed success rates of different models for completing intermediate stages. The height of each hatched area denotes the decrease in success rate from the previous stage to the current stage.

Generalization to Shoes of Varying Colors

GR-RL can handle shoes of different colors and sizes, and it continues to correctly identify the shoe’s structure and perform stable grasping, adjustment, and threading actions even when the material textures or visual features vary.

Robust Behavior in Various Cases

GR-RL demonstrates robust behavior across various cases. The model automatically retries when the shoelace accidentally drops or when it misses the eyelet. In cases where the two ends of the shoelace are crossed and the correct end is underneath, the model can identify the correct end and pull it out.

GR-RL can also actively manipulate the scene to make the task easier:

When the initial grasping point is far from the shoelace tip, the model places the shoelace on top of the deformable shoe and regrasps closer to the tip before threading. In another case, the model first reorients the shoe from the left side to straighten it before it begins threading.

When a shoe is placed on the far side of the table, the robot can pull it closer, adjust the shoelace, and complete the task.