MoRI: Mixture of RL and IL Experts for Long-Horizon Manipulation Tasks

Abstract

Reinforcement Learning (RL) and Imitation Learning (IL) are the standard frameworks for policy acquisition in manipulation. While IL offers efficient policy derivation, it suffers from compounding errors and distribution shift. Conversely, RL facilitates autonomous exploration but is frequently hindered by low sample efficiency and the high cost of trial and error. Since existing hybrid methods often struggle with complex tasks, we introduce Mixture of RL and IL Experts (MoRI). This system dynamically switches between IL and RL experts based on the variance of expert actions to handle coarse movements and fine-grained manipulations. MoRI employs an offline pre-training stage followed by online fine-tuning to accelerate convergence. To maintain exploration safety and minimize human intervention, the system applies IL-based regularization to the RL component. Evaluation across four complex real-world tasks shows that MoRI achieves an average success rate of 97.5% within 2 to 5 hours of fine-tuning. Compared to baseline RL algorithms, MoRI reduces human intervention by 85.8% and shortens convergence time by 21%, demonstrating its capability in robotic manipulation.

Method Overview

MoRI consists of two core stages: advantage-weighted offline pre-training and MoE-driven online fine-tuning. The framework unites behavioral cloning (IL) and SAC-based RL experts, with a dedicated gating network routing expert selection based on action variance. IL handles deterministic coarse-grained movements, while RL is responsible for uncertain fine-grained contact manipulation. IL regularization further restricts RL policy within demonstration distribution, ensuring smooth expert switching and stable exploration.

MoRI System Overview

All tasks involve both coarse global motion and precise contact-rich local manipulation, with random initial pose perturbations to test generalization. The policy runs at 10Hz, implemented in JAX and trained on a single RTX 4090 GPU.

Real Environment Evaluation

Main Quantitative Comparison

We compare MoRI against the state-of-the-art baseline ConRFT on all four real-world tasks, including training time, success rate, episode length, human intervention ratio and autonomous success ratio.

Learning Curves & Quantitative Trends

Performance Comparison: MoRI vs ConRFT

MoRI outperforms baseline ConRFT in all metrics: average success rate reaches 97.5%, convergence time is reduced by 21.0%, human intervention proportion drops from 49.6% to 7.0%, and autonomous successful trajectory ratio rises from 37.0% to 78.3%. MoRI maintains stable high performance across different manipulation difficulties, and significantly improves data efficiency and real-world deployment friendliness.

Task	Training Time (min)		Success Rate (%)		Demo Ratio (%)		Auto-Success Ratio (%)
Task	ConRFT	MoRI	ConRFT	MoRI	ConRFT	MoRI	ConRFT	MoRI
Place Block in Drawer	165	142	92	100	12.8	5.5	72.3	85.4
Put Towel in Lidded Box	258	221	86	95	15.6	6.2	68.5	79.2
Insert Two Sockets	312	275	88	95	18.3	6.8	65.7	77.6
Double-Fold the Towel	286	248	84	100	16.9	5.9	66.2	81.5
Average	255.3	221.5	87.5	97.5	15.9	6.1	68.2	80.9

Ablation Study & Model Analysis

1. MoRI vs Individual BC / RL Experts

We conduct ablation to verify the necessity of MoE hybrid design by comparing MoRI with standalone BC and RL policies. MoRI achieves the highest average success rate 97.5% and the shortest average episode length, fully exceeding individual experts. It demonstrates that dynamic fusion of IL and RL can complement their respective weaknesses and adapt to task-varying motion requirements.

Task	Success Rate (%)			Episode Length
Task	BC	RL	MoRI	BC	RL	MoRI
Place Block in Drawer	80	60	100	136.0	149.5	125.2
Put Towel in Lidded Box	60	75	95	254.0	204.1	169.1
Insert Two Sockets	10	90	95	228.0	201.9	180.2
Double-Fold the Towel	0	85	100	—	144.2	171.2
Average	37.5	77.5	97.5	206.0	174.9	161.4

2. Expert Scheduling & Action Variance Analysis

We analyze the Q-value distribution and action variance of RL and BC experts during training. MoRI automatically assigns low-variance deterministic subtasks to BC expert (e.g., drawer opening/closing), and high-uncertainty fine contact manipulation to RL expert. With online training proceeding, the selection ratio of RL expert gradually rises from 50%–60% to 70%–80%, enabling a natural transition from conservative imitation to efficient exploratory learning.

Q-value & Variance Ablation

Conclusion

This work presents MoRI, a novel Mixture-of-Experts framework that dynamically fuses IL and RL for long-horizon robotic manipulation. The offline pre-training and online fine-tuning pipeline boosts sample efficiency, while action variance-based gating and IL regularization balance exploration, stability and safety. Real-robot experiments demonstrate that MoRI achieves high success rates, faster convergence and drastically reduced human intervention. Future work will integrate vision-language models for high-level task reasoning and world models for imagination-based simulation training to further improve generalization and autonomous deployment.