Official implementation of MINTO, introduced in "Use the Online Network If You Can: Towards Fast and Stable Reinforcement Learning" [ICLR 2026].
MINTO is a simple yet effective target-bootstrapping method for temporal-difference RL that enables faster, more stable learning and consistently improves performance across algorithms and benchmarks.
MINTO computes the target value as the MINimum estimate between the Target and Online networks. This injects fresher, more recent value estimates in a stable manner, since taking the minimum mitigates the overestimation bias that can arise from bootstrapping with the online network.
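The target computation above can be sketched as follows for a DQN-style update. This is a minimal illustration, not the repository's code: the function name and the choice of selecting the greedy action with the online network are assumptions for the sketch.

```python
import numpy as np

def minto_target(reward, discount, done, q_online_next, q_target_next):
    """Sketch of a MINTO-style bootstrap target.

    q_online_next, q_target_next: arrays of shape (batch, n_actions)
    holding next-state Q-values from the online and target networks.
    The bootstrap value is the element-wise minimum of the two
    estimates, guarding against overestimation from the online net.
    """
    # Greedy next action (chosen here with the online network; the
    # paper's exact action-selection rule may differ).
    next_actions = np.argmax(q_online_next, axis=1)
    batch = np.arange(len(next_actions))
    # MINimum between Target and Online estimates at the chosen action.
    bootstrap = np.minimum(q_target_next[batch, next_actions],
                           q_online_next[batch, next_actions])
    return reward + discount * (1.0 - done) * bootstrap
```

Replacing the bootstrap value in any TD loss with this minimum is the only change needed, which is why the method slots into value-based and actor-critic agents alike.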
MINTO integrates easily into value-based and actor-critic methods with minimal overhead. Hence, we evaluate it across diverse benchmarks, spanning online and offline RL, as well as discrete and continuous action spaces. To conduct our experiments, we utilized variants of three different repositories:
- Online RL (discrete): Based on slimDQN.
- Offline RL: Based on slimCQL.
- Online RL (continuous): Based on SimbaV2.
To reproduce the main results in the paper, see the corresponding subfolders and their installation guides.
Subfolders:
- `online_rl_discrete/` for online RL (Atari, discrete).
- `offline_rl/` for offline RL (Atari, discrete).
- `online_rl_continuous/` for continuous control (e.g., MuJoCo).
Example (online RL, discrete control):

```bash
cd online_rl_discrete
conda create -n minto python=3.10
conda activate minto
pip install --upgrade pip setuptools wheel
pip install -e .[dev,gpu]
bash run_dqn.sh min Breakout
```

If you use this codebase or find our work helpful, please consider citing our paper as follows:
@inproceedings{hendawy2025use,
title={Use the Online Network If You Can: Towards Fast and Stable Reinforcement Learning},
author={Hendawy, Ahmed and Metternich, Henrik and Vincent, Th{\'e}o and Kallel, Mahdi and Peters, Jan and D'Eramo, Carlo},
booktitle={International Conference on Learning Representations (ICLR)},
year={2026}
}
