RA-L 2026 Under Review

MoRI: Mixture of RL and IL Experts for Long-Horizon Manipulation Tasks

Yaohang Xu1, Lianjie Ma1, Gewei Zuo1, Wentao Zhang2, Han Ding1, Lijun Zhu1†

1 Huazhong University of Science and Technology    2The University of Tokyo    Corresponding Author

Our MoRI framework enables adaptive and precise long-horizon robotic manipulation across diverse real-world contact-rich scenarios: (a) drawer storage assembly, (b) lidded object packing, (c) tight plug-socket insertion, (d) deformable towel folding, and (e) coarse-to-fine motion transition.

Abstract

Reinforcement Learning (RL) and Imitation Learning (IL) are the standard frameworks for policy acquisition in manipulation. While IL offers efficient policy derivation, it suffers from compounding errors and distribution shift. Conversely, RL facilitates autonomous exploration but is frequently hindered by low sample efficiency and the high cost of trial and error. Since existing hybrid methods often struggle with complex tasks, we introduce Mixture of RL and IL Experts (MoRI). This system dynamically switches between IL and RL experts based on the variance of expert actions to handle coarse movements and fine-grained manipulations. MoRI employs an offline pre-training stage followed by online fine-tuning to accelerate convergence. To maintain exploration safety and minimize human intervention, the system applies IL-based regularization to the RL component. Evaluation across four complex real-world tasks shows that MoRI achieves an average success rate of 97.5% within 2 to 5 hours of fine-tuning. Compared to baseline RL algorithms, MoRI reduces human intervention by 85.8% and shortens convergence time by 21%, demonstrating its capability in robotic manipulation.


Method Overview

MoRI consists of two core stages: advantage-weighted offline pre-training and MoE-driven online fine-tuning. The framework unites behavioral cloning (IL) and SAC-based RL experts, with a dedicated gating network routing expert selection based on action variance. IL handles deterministic coarse-grained movements, while RL is responsible for uncertain fine-grained contact manipulation. IL regularization further restricts RL policy within demonstration distribution, ensuring smooth expert switching and stable exploration.

MoRI System Overview

All tasks involve both coarse global motion and precise contact-rich local manipulation, with random initial pose perturbations to test generalization. The policy runs at 10Hz, implemented in JAX and trained on a single RTX 4090 GPU.


Real Environment Evaluation


Main Quantitative Comparison

We compare MoRI against the state-of-the-art baseline ConRFT on all four real-world tasks, including training time, success rate, episode length, human intervention ratio and autonomous success ratio.

Learning Curves & Quantitative Trends

Performance Comparison: MoRI vs ConRFT

MoRI outperforms baseline ConRFT in all metrics: average success rate reaches 97.5%, convergence time is reduced by 21.0%, human intervention proportion drops from 49.6% to 7.0%, and autonomous successful trajectory ratio rises from 37.0% to 78.3%. MoRI maintains stable high performance across different manipulation difficulties, and significantly improves data efficiency and real-world deployment friendliness.

Task Training Time (min) Success Rate (%) Demo Ratio (%) Auto-Success Ratio (%)
ConRFT MoRI ConRFT MoRI ConRFT MoRI ConRFT MoRI
Place Block in Drawer 165 142 92 100 12.8 5.5 72.3 85.4
Put Towel in Lidded Box 258 221 86 95 15.6 6.2 68.5 79.2
Insert Two Sockets 312 275 88 95 18.3 6.8 65.7 77.6
Double-Fold the Towel 286 248 84 100 16.9 5.9 66.2 81.5
Average 255.3 221.5 87.5 97.5 15.9 6.1 68.2 80.9

Ablation Study & Model Analysis

1. MoRI vs Individual BC / RL Experts

We conduct ablation to verify the necessity of MoE hybrid design by comparing MoRI with standalone BC and RL policies. MoRI achieves the highest average success rate 97.5% and the shortest average episode length, fully exceeding individual experts. It demonstrates that dynamic fusion of IL and RL can complement their respective weaknesses and adapt to task-varying motion requirements.

Task Success Rate (%) Episode Length
BC RL MoRI BC RL MoRI
Place Block in Drawer 80 60 100 136.0 149.5 125.2
Put Towel in Lidded Box 60 75 95 254.0 204.1 169.1
Insert Two Sockets 10 90 95 228.0 201.9 180.2
Double-Fold the Towel 0 85 100 144.2 171.2
Average 37.5 77.5 97.5 206.0 174.9 161.4

2. Expert Scheduling & Action Variance Analysis

We analyze the Q-value distribution and action variance of RL and BC experts during training. MoRI automatically assigns low-variance deterministic subtasks to BC expert (e.g., drawer opening/closing), and high-uncertainty fine contact manipulation to RL expert. With online training proceeding, the selection ratio of RL expert gradually rises from 50%–60% to 70%–80%, enabling a natural transition from conservative imitation to efficient exploratory learning.

Q-value & Variance Ablation


Conclusion

This work presents MoRI, a novel Mixture-of-Experts framework that dynamically fuses IL and RL for long-horizon robotic manipulation. The offline pre-training and online fine-tuning pipeline boosts sample efficiency, while action variance-based gating and IL regularization balance exploration, stability and safety. Real-robot experiments demonstrate that MoRI achieves high success rates, faster convergence and drastically reduced human intervention. Future work will integrate vision-language models for high-level task reasoning and world models for imagination-based simulation training to further improve generalization and autonomous deployment.

BibTeX

@article{xu2026morimixturerlil,
    title={MoRI: Mixture of RL and IL Experts for Long-Horizon Manipulation Tasks}, 
    author={Yaohang Xu and Lianjie Ma and Gewei Zuo and Wentao Zhang and Han Ding and Lijun Zhu},
    year={2026},
    eprint={2604.10165},
    archivePrefix={arXiv},
    primaryClass={cs.RO},
    url={https://arxiv.org/abs/2604.10165}, 
}