Human-in-the-loop Online Rejection Sampling for Robotic Manipulation

1Tsinghua Shenzhen International Graduate School   2Tencent Robotics X

Abstract

Reinforcement learning (RL) is widely used to produce robust robotic manipulation policies, but fine-tuning vision-language-action (VLA) models with RL can be unstable due to inaccurate value estimates and sparse supervision at intermediate steps. In contrast, imitation learning (IL) is easy to train but often underperforms due to its offline nature. In this paper, we propose Human-in-the-loop Online Rejection Sampling (Hi-ORS), a simple yet effective post-training method that utilizes rejection sampling to achieve both training stability and high robustness. Hi-ORS stabilizes value estimation by filtering out negatively rewarded samples during online fine-tuning, and adopts a reward-weighted supervised training objective to provide dense intermediate-step supervision. For systematic study, we develop an asynchronous inference-training framework that supports flexible online human-in-the-loop corrections, which serve as explicit guidance for learning error-recovery behaviors. Across three real-world tasks and two embodiments, Hi-ORS fine-tunes a pi-zero base policy to master contact-rich manipulation in just 1.5 hours of real-world training, outperforming RL and IL baselines by a substantial margin in both effectiveness and efficiency. Notably, the fine-tuned policy exhibits strong test-time scalability by reliably executing complex error-recovery behaviors to achieve better performance.

Method

Overall Pipeline

The overall pipeline of Hi-ORS consists of a rejection sampling framework, a reward-weighted supervised training objective, a varied-frequency strategy, and an asynchronous inference-training infrastructure. Together, these components provide both training stability and high robustness when post-training VLAs for real-world robotic manipulation. Here we take the flow matching-based policy pi-zero as an example.
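The core idea can be sketched in a few lines: filter out negatively rewarded rollouts collected online, then train with a supervised imitation loss weighted by each retained rollout's reward, so every intermediate step receives dense supervision. The sketch below is illustrative only and not the authors' implementation; the episode dictionary layout and the `per_step_loss` callback are assumptions introduced for this example.

```python
def rejection_sample(episodes):
    """Keep only non-negatively rewarded episodes (hypothetical layout:
    each episode is {"reward": float, "steps": [...]})."""
    return [ep for ep in episodes if ep["reward"] > 0]

def reward_weighted_loss(episodes, per_step_loss):
    """Reward-weighted supervised objective: the per-step imitation loss
    of every retained episode is scaled by that episode's reward, giving
    dense supervision at intermediate steps without a learned value
    function. `per_step_loss` is an assumed callback mapping one step to
    a scalar loss (e.g., a flow-matching loss in the actual method)."""
    total, count = 0.0, 0
    for ep in rejection_sample(episodes):
        for step in ep["steps"]:
            total += ep["reward"] * per_step_loss(step)
            count += 1
    return total / max(count, 1)
```

Because failed rollouts never enter the objective, the loss stays bounded and well-behaved, which is the stability argument made above; human-in-the-loop corrections simply enter the buffer as positively rewarded episodes.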


Results

Quantitative Results

We report the evaluation success-rate curves of different methods on three real-world robotic manipulation tasks with different embodiments.


Test-time Scaling

We show that larger trial budgets at evaluation time yield higher test performance, indicating a potential test-time scaling effect in robotic manipulation.
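The intuition behind the trial-budget effect can be made concrete with a simple independence model (an illustrative assumption, not a claim from the paper): if each trial succeeds with probability p and the policy can recover and retry, the chance of at least one success in k trials grows as 1 - (1 - p)^k.

```python
def success_with_budget(p, k):
    """Probability of at least one success in k independent trials,
    a toy model of how a larger evaluation budget lifts performance.
    Real trials are not fully independent, so this is an upper-bound
    style intuition rather than a prediction."""
    return 1.0 - (1.0 - p) ** k
```

Under this model, a policy with a 50% single-trial success rate reaches 87.5% with a budget of three trials, which matches the qualitative trend in the curves above.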


Spatial Generalization

Hi-ORS demonstrates strong spatial generalization capabilities across different object positions and orientations in real-world manipulation tasks.

Position 1

Position 2

Position 3

Position 4

Error Recovery

We illustrate the error-recovery behaviors of Hi-ORS, which boost its robustness in real-world deployments.

Pick Error Recovery

Back Motion Recovery

BibTeX

@article{lu2025hiors,
  title={Human-in-the-loop Online Rejection Sampling for Robotic Manipulation},
  author={Lu, Guanxing and Zhao, Rui and Lin, Haitao and Zhang, He and Tang, Yansong},
  journal={arXiv preprint},
  year={2025}
}