Supervised fine-tuning (SFT) has become the de facto post-training strategy for large vision-language-action (VLA) models, but its reliance on costly human demonstrations limits scalability and generalization. We propose Probe, Learn, Distill (PLD), a three-stage plug-and-play framework that improves VLAs through residual reinforcement learning (RL) and distribution-aware data collection. In Stage 1 (specialist acquisition), we freeze the VLA backbone and train lightweight residual actors via off-policy RL. These specialists take over in states where the base policy fails, thereby probing failure regions of the VLA generalist. In Stage 2 (data collection), we employ a hybrid rollout scheme that biases residual interventions toward states frequently visited by the base policy, aligning collected trajectories with the generalist's deployment distribution while capturing recovery behaviors. In Stage 3 (fine-tuning), these curated trajectories are distilled back into the generalist with standard SFT, applicable to both flow-matching and autoregressive heads. We evaluate PLD across diverse settings: it achieves a near-saturated 99% task success rate on the LIBERO benchmark, delivers over 50% performance gains in SimplerEnv, and demonstrates a 100% success rate on real-world Franka arm and YAM arm dexterous manipulation tasks. We further provide ablations showing that residual policy probing and distribution-aware replay are key to collecting deployment-aligned data that improves VLAs' capabilities on both seen and unseen tasks. Our results demonstrate that RL-generated, policy-aligned data can surpass teleoperation-only demonstrations, offering a scalable path toward self-improving VLA models.
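To make the Stage 3 distillation step concrete, here is a minimal sketch of supervised fine-tuning on PLD trajectories, assuming a continuous-action head trained with a simple behavior-cloning loss; the names (`policy`, `batch`) are illustrative, and a flow-matching or autoregressive head would substitute its own SFT objective.

```python
import torch.nn.functional as F

def sft_step(policy, optimizer, batch):
    """One distillation step on PLD data (sketch): behavior cloning on the
    (observation, executed action) pairs collected in Stage 2."""
    pred = policy(batch["obs"])                 # generalist's predicted action
    loss = F.mse_loss(pred, batch["action"])    # stand-in for the head's own SFT loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```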
We demonstrate high success rates and recovery capabilities with two YAM arms on precise GPU insertion tasks.
Explore our complete GPU insertion task across all four stages, from initial pickup to final placement.
We demonstrate high success rates and recovery capabilities on a Franka Panda arm in precise manipulation tasks such as opening a beer bottle and inserting a peg. We also show scene robustness on standard object pickup tasks.
To better understand how PLD balances recovery behavior and optimal behavior, we visualize the trajectories of Self-Bootstrapping, RL, and PLD side by side across different tasks (feel free to load different tasks).
Explore how different PLD probing initialization values (0.2 to 0.8) affect the collected trajectories. Use the slider to compare them.
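As a rough sketch of what the slider controls, assume the probing initialization value is the fraction of the episode handled by the base VLA before the residual specialist takes over; the jitter and function names below are illustrative assumptions, not the paper's exact scheme.

```python
import random

def takeover_step(horizon: int, probe_init: float, jitter: float = 0.1) -> int:
    """Map a probing-initialization fraction (the 0.2-0.8 slider value) to the step
    at which the residual specialist takes over from the base VLA rollout (sketch)."""
    frac = min(max(probe_init + random.uniform(-jitter, jitter), 0.0), 1.0)
    return int(frac * horizon)

# e.g., with a 400-step horizon and probe_init=0.5, takeover happens near step 200
print(takeover_step(horizon=400, probe_init=0.5))
```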
Our pipeline consists of three stages: 1) learning a specialist residual policy for each task via online off-policy RL, with efficient exploration guided by a frozen VLA generalist; 2) automatic generation of hybrid trajectories by rolling out the VLA for the first t steps and letting the specialist take over to generate recovery data; 3) supervised fine-tuning of the generalist on the collected multi-task PLD data. The fine-tuned generalist is then deployed zero-shot on diverse manipulation tasks.
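Below is a minimal sketch of the Stage 2 hybrid rollout, assuming a gym-style environment and callable policies; `env`, `vla`, `specialist`, and the success flag in `info` are illustrative names, not the released interface.

```python
def collect_hybrid_trajectory(env, vla, specialist, takeover_t, max_steps=400):
    """Stage 2 (sketch): the base VLA rolls out the first `takeover_t` steps so data
    stays on the generalist's deployment distribution, then the residual specialist
    takes over to produce recovery and completion behavior."""
    obs = env.reset()
    traj, info = [], {}
    for t in range(max_steps):
        action = vla(obs) if t < takeover_t else specialist(obs)  # hybrid rollout
        next_obs, reward, done, info = env.step(action)
        traj.append((obs, action))   # log executed actions for later distillation
        obs = next_obs
        if done:
            break
    # keep only successful trajectories for supervised fine-tuning
    return traj if info.get("success", False) else None
```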
The BC base policy (VLA) can perform most of the steps but fails on those that require reactive and precise manipulation.
The residual policy is trained with reinforcement learning in the real world to execute corrective actions during recovery steps. As shown, it intervenes only when necessary; most of the time, it outputs zero actions.
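A minimal sketch of how the residual can stay near zero by construction: the correction is bounded by a small scale on top of the frozen VLA action, so the specialist only nudges the base policy during recovery. The class, scale value, and function names are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ResidualActor(nn.Module):
    """Lightweight residual head: outputs a small, bounded correction to the frozen VLA action."""
    def __init__(self, obs_dim, act_dim, hidden=256, scale=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim), nn.Tanh(),   # bounded in [-1, 1]
        )
        self.scale = scale  # keeps corrections small; near-zero output leaves the VLA in control

    def forward(self, obs, base_action):
        return self.scale * self.net(torch.cat([obs, base_action], dim=-1))

def specialist_action(vla, residual, obs):
    """Executed action = frozen VLA proposal + bounded residual correction (sketch)."""
    with torch.no_grad():
        a_base = vla(obs)              # frozen generalist proposal
    a_corr = residual(obs, a_base)     # trained online with off-policy RL
    return torch.clamp(a_base + a_corr, -1.0, 1.0)
```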
Here we provide a comparison of different RL algorithms and design choices.
Comparative analysis of different methods and their performance across training steps.
Detailed view of action scale performance measured in training steps rather than episodes.
@misc{xiao2025selfimprovingvisionlanguageactionmodelsdata,
title={Self-Improving Vision-Language-Action Models with Data Generation via Residual RL},
author={Wenli Xiao and Haotian Lin and Andy Peng and Haoru Xue and Tairan He and Yuqi Xie and Fengyuan Hu and Jimmy Wu and Zhengyi Luo and Linxi "Jim" Fan and Guanya Shi and Yuke Zhu},
year={2025},
eprint={2511.00091},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2511.00091},
}