Probe, Learn, Distill

Self-Improving Vision-Language-Action Models with Data Generation via Residual RL

Wenli Xiao*, Haotian Lin*, Andy Peng, Haoru Xue, Tairan He, Yuqi Xie, Fengyuan Hu, Jimmy Wu, Zhengyi Luo,
Linxi "Jim" Fan, Guanya Shi, Yuke Zhu

NVIDIA, Carnegie Mellon University, UC Berkeley, UT Austin, †GEAR Team Lead


PLD is plug-and-play: you can use it to improve your own VLA models through data generation via residual RL.

Abstract

Supervised fine-tuning (SFT) has become the de facto post-training strategy for large vision-language-action (VLA) models, but its reliance on costly human demonstrations limits scalability and generalization. We propose Probe, Learn, Distill (PLD), a three-stage plug-and-play framework that improves VLAs through residual reinforcement learning (RL) and distribution-aware data collection. In Stage 1 (specialist acquisition), we freeze the VLA backbone and train lightweight residual actors via off-policy RL. These specialists take over in states where the base policy fails, thereby probing failure regions of the VLA generalist. In Stage 2 (data collection), we employ a hybrid rollout scheme that biases residual interventions toward states frequently visited by the base policy, aligning collected trajectories with the generalist's deployment distribution while capturing recovery behaviors. In Stage 3 (fine-tuning), these curated trajectories are distilled back into the generalist with standard SFT, applicable to both flow-matching and autoregressive heads. We evaluate PLD across diverse settings: it achieves a near-saturated 99% task success rate on the LIBERO benchmark, delivers over 50% performance gains in SimplerEnv, and demonstrates a 100% success rate on real-world Franka arm and YAM arm dexterous manipulation tasks. We further provide ablations showing that residual policy probing and distribution-aware replay are key to collecting deployment-aligned data that improves VLAs' capabilities on both seen and unseen tasks. Our results demonstrate that RL-generated, policy-aligned data can surpass teleoperation-only demonstrations, offering a scalable path toward self-improving VLA models.
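
For readers who prefer pseudocode, the sketch below lays out the three stages as a plain Python skeleton. It is a minimal illustration assuming generic policy, environment, and trainer interfaces; the function names (`train_residual_specialist`, `collect_hybrid_rollouts`, `supervised_finetune`) are hypothetical placeholders, not a released API.

```python
# Minimal sketch of the PLD loop (Probe, Learn, Distill), assuming generic
# policy/env/trainer interfaces; names are illustrative, not the authors' code.

def probe_learn_distill(base_vla, tasks, make_env,
                        train_residual_specialist,   # Stage 1: off-policy RL
                        collect_hybrid_rollouts,     # Stage 2: hybrid rollouts
                        supervised_finetune):        # Stage 3: SFT
    dataset = []
    for task in tasks:
        env = make_env(task)
        # Stage 1 (probe): freeze the VLA backbone and train a lightweight
        # residual actor that corrects the base policy where it fails.
        residual = train_residual_specialist(base_vla, env)
        # Stage 2 (collect): gather trajectories that stay close to the base
        # policy's visitation distribution while capturing recovery behavior.
        dataset += collect_hybrid_rollouts(base_vla, residual, env)
    # Stage 3 (distill): fine-tune the generalist on the curated data with
    # standard SFT (flow-matching or autoregressive action head).
    return supervised_finetune(base_vla, dataset)
```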

Empirical Results

99% success rate on LIBERO (near-saturated performance)
Over 50% gains in SimplerEnv compared to baseline methods
30/30 success on real-world Franka arm and YAM arm tasks

YAM Arm Demonstrations

We demonstrate a high success rate and recovery capabilities on two YAM arms in a precise GPU insertion task.

Explore our complete GPU insertion task across all four stages, from initial pickup to final placement.


Franka Arm Demonstrations

We demonstrate a high success rate and recovery capabilities on a Franka Panda arm in precise manipulation tasks such as opening a beer bottle and inserting a peg. We also show scene robustness on standard object pickup tasks.

Q: Open the beer bottle

Q: Insert the white peg

Q: Insert the yellow peg

Q: Pickup the blue cube

Visualization of PLD Dataset

To better understand how PLD balances recovery behavior and optimal behavior, we visualize trajectories from Self-Bootstrapping, RL, and PLD side by side across different tasks (feel free to load different tasks).


PLD Initialization Comparison

Explore how different PLD probing initialization values (0.2 to 0.8) affect the collected trajectories; use the slider to compare them.


Method

PLD Data Generation Pipeline

Our pipeline consists of three stages: 1) learning a specialist residual policy for each task via online off-policy RL, with efficient exploration guided by a frozen VLA generalist; 2) automatically generating hybrid trajectories by letting the VLA roll out for the first t steps and then having the specialist take over to produce recovery data; 3) supervised fine-tuning of the generalist on the collected multi-task PLD data. The fine-tuned generalist is then deployed zero-shot to diverse manipulation tasks. A sketch of the hybrid rollout step follows below.
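
The hybrid rollout of stage 2 can be sketched as follows. This is a minimal illustration under assumed interfaces (a Gym-style `env`, `base_policy.act`, `residual_policy.act`); the takeover-fraction range mirrors the probing initialization values shown earlier (0.2 to 0.8), but the exact scheduling and filtering used in the paper may differ.

```python
import random

def hybrid_rollout(env, base_policy, residual_policy,
                   max_steps=300, takeover_frac=(0.2, 0.8)):
    """Roll out the frozen VLA for the first t steps, then let the residual
    specialist take over so the trajectory records recovery behavior.
    Returns (obs, action) pairs if the episode succeeds, else None.
    Sketch only: interfaces and the takeover schedule are assumptions."""
    obs = env.reset()
    # Sample how long the base policy drives the episode, so collected data
    # stays close to the generalist's own visitation distribution.
    t_switch = int(max_steps * random.uniform(*takeover_frac))
    traj, info = [], {}
    for t in range(max_steps):
        a_base = base_policy.act(obs)
        if t < t_switch:
            action = a_base                              # base policy in control
        else:
            action = a_base + residual_policy.act(obs)   # specialist corrects
        traj.append((obs, action))
        obs, reward, done, info = env.step(action)
        if done:
            break
    return traj if info.get("success", False) else None
```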

BC Policy Failure Mode

The BC base policy (the VLA) can perform most steps of a task, but fails at those that require reactive and precise manipulation.


The Effect of the Residual Policy

The residual policy is trained with reinforcement learning in the real world to execute corrective actions during recovery steps. As shown, it intervenes only when necessary; most of the time, it outputs zero actions.
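
Concretely, the deployed command at each control step is the base VLA's delta end-effector action plus the learned correction. The snippet below is a minimal sketch of that composition, assuming 6-DoF delta-EEF actions, an illustrative scaling factor `alpha`, and a hypothetical clipping bound `DELTA_LIMIT`; none of these values come from the paper.

```python
import numpy as np

DELTA_LIMIT = 0.05   # illustrative per-step bound on delta-EEF commands (m / rad)

def compose_action(a_base, a_residual, alpha=1.0):
    """Combine the frozen VLA's 6-DoF delta end-effector action with the
    residual specialist's correction. When the base action is already good,
    a well-trained residual outputs values near zero and the command is
    essentially unchanged. Sketch only; scaling and clipping are assumptions."""
    a = np.asarray(a_base) + alpha * np.asarray(a_residual)
    return np.clip(a, -DELTA_LIMIT, DELTA_LIMIT)

# Example: a zero residual leaves the base command (x, y, z, roll, pitch, yaw) intact.
base = np.array([0.01, 0.0, -0.02, 0.0, 0.0, 0.01])
print(compose_action(base, np.zeros(6)))
```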


Live readout: delta end-effector action (X, Y, Z, roll, pitch, yaw).

More Analysis

Residual RL Design Choices

Here we provide a comparison of different RL algorithms and design choices.

Methods Comparison

Comparative analysis of different methods and their performance across training steps.

Action Scale vs. Training Steps

A detailed view of performance for different action scales, measured in training steps rather than episodes.
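
To give a sense of what the action-scale knob controls, the sketch below bounds a residual actor's output with a tanh squashing scaled by a single hyperparameter: a small scale keeps corrections conservative, while a larger one gives the specialist more authority. This is a generic parameterization we assume for illustration (PyTorch, placeholder layer sizes), not the exact architecture or values used in the paper.

```python
import torch
import torch.nn as nn

class ResidualActor(nn.Module):
    """Lightweight residual actor with a tanh-bounded output.
    `action_scale` caps the magnitude of the correction added to the base
    VLA action; the layer sizes here are placeholders for illustration."""

    def __init__(self, obs_dim: int, act_dim: int = 6, action_scale: float = 0.02):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, act_dim),
        )
        self.action_scale = action_scale

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # tanh keeps each residual dimension in [-action_scale, action_scale].
        return self.action_scale * torch.tanh(self.net(obs))
```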

Acknowledgements

We are grateful to Jason Liu, Tony Tao, Colin Li, Max Fu, Yuhui Chen, Ajay Mandlekar, You Liang Tan, Dennis Da, Haoyu Xiong, Stephanie Chen, Charles Xu, Guanzhi Wang, Avnish Narayan for their insightful discussions and technical support. We also thank Tri Cao, Jeremy Chimienti, Lion Park for their assistance with data collection and mechanical setup. Finally, we thank the NVIDIA GEAR Team and CMU LeCAR Lab for their continuous support.

Citation

@misc{xiao2025selfimprovingvisionlanguageactionmodelsdata,
      title={Self-Improving Vision-Language-Action Models with Data Generation via Residual RL}, 
      author={Wenli Xiao and Haotian Lin and Andy Peng and Haoru Xue and Tairan He and Yuqi Xie and Fengyuan Hu and Jimmy Wu and Zhengyi Luo and Linxi "Jim" Fan and Guanya Shi and Yuke Zhu},
      year={2025},
      eprint={2511.00091},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2511.00091}, 
}