Imagine2Act: Leveraging Object-Action Motion Consistency from Imagined Goals for Robotic Manipulation
Imagine2Act is a 3D imitation learning framework that incorporates semantic and geometric object constraints into policy learning for high-precision manipulation. It generates imagined goal images conditioned on language instructions and reconstructs corresponding 3D point clouds as semantic–geometric priors for the policy. An object–action consistency strategy with soft pose supervision explicitly aligns predicted end-effector motions with generated object transformations, enabling accurate and consistent 3D action prediction across diverse tasks. Experiments in both simulation and the real world demonstrate that Imagine2Act outperforms previous state-of-the-art policies.
Create a conda environment with the following command:
# initiate conda env
> conda update conda
> conda env create -f environment.yaml
> conda activate imagine2act
# install diffuser
> pip install diffusers["torch"]
# install dgl (https://www.dgl.ai/pages/start.html)
> pip install dgl==1.1.3+cu116 -f https://data.dgl.ai/wheels/cu116/dgl-1.1.3%2Bcu116-cp38-cp38-manylinux1_x86_64.whl
# install flash attention (https://github.com/Dao-AILab/flash-attention#installation-and-features)
> pip install packaging
> pip install ninja
> pip install flash-attn==2.5.9.post1 --no-build-isolation
# Install open3D
> pip install open3d
# Install PyRep (https://github.com/stepjam/PyRep?tab=readme-ov-file#install)
> cd imagine2act
> git clone https://github.com/stepjam/PyRep.git
> cd PyRep/
> wget https://www.coppeliarobotics.com/files/V4_1_0/CoppeliaSim_Edu_V4_1_0_Ubuntu20_04.tar.xz
> tar -xf CoppeliaSim_Edu_V4_1_0_Ubuntu20_04.tar.xz;
> echo "export COPPELIASIM_ROOT=$(pwd)/CoppeliaSim_Edu_V4_1_0_Ubuntu20_04" >> $HOME/.bashrc;
> echo "export LD_LIBRARY_PATH=\$LD_LIBRARY_PATH:\$COPPELIASIM_ROOT" >> $HOME/.bashrc;
> echo "export QT_QPA_PLATFORM_PLUGIN_PATH=\$COPPELIASIM_ROOT" >> $HOME/.bashrc;
> source $HOME/.bashrc;
> conda activate imagine2act
> pip install -r requirements.txt; pip install -e .; cd ..
# Install RLBench (Note: there are different forks of RLBench)
# PerAct setup
> git clone https://github.com/LiangHeng121/RLBench.git
> cd RLBench; git checkout -b peract --track origin/peract; pip install -r requirements.txt; pip install -e .; cd ../..;
We use three external modules for imagined goal generation, 3D reconstruction, and pose alignment.
Each module should be installed in its own conda environment to avoid dependency conflicts.
| Module | Repository | Environment | Description |
|---|---|---|---|
| Grounded-Segment-Anything (GroundedSAM) | LiangHeng121/Grounded-Segment-Anything | groundedsam |
2D segmentation and text grounding |
| FoundationPose | LiangHeng121/FoundationPose | foundationpose |
6D object pose estimation |
| TripoSR | LiangHeng121/TripoSR | triposr |
3D reconstruction from a single image |
Each can be installed following the official instructions in their repositories.
Download the testing episodes from here repo. Extract the zip files to ./imagine2act/data/raw/test/
Download the packaged training/validation demonstrations from here. Extract the zip file to ./imagine2act/data/packaged/
If you want to generate your own high-resolution dataset with imagined goals and corresponding 3D object transformations, follow these steps:
Step 0 – Generate base RLBench episodes
Generate raw RLBench episodes and package them into .dat files for one or multiple tasks.
cd imagine2act
bash scripts/data_generation.sh <split> [task1] [task2] [task3] ...
# Example:
bash scripts/data_generation.sh train phone_on_base place_cups stack_cups
cd ..Step 1 – Image-to-3D Generation Pipeline
Run the following command with your target task name (phone_on_base, place_cups, put_knife_in_knife_block, put_plate_in_colored_dish_rack, put_toilet_roll_on_stand, stack_cups, stack_wine):
bash pipeline.sh <task_name>
# Example:
bash pipeline.sh phone_on_baseStep 2 – Scene Alignment and Fusion Pipeline
Run the following command with your target task name and dataset split (train, val, or test):
bash pipeline_2.sh <task_name> <split>
# Example:
bash pipeline_2.sh phone_on_base trainStep 3 – Merge into new packaged .dat files
Combine the original packaged RLBench episodes with the imagined 3D goals and transformation annotations to produce new .dat files under ./imagine2act/data/packaged/{split}_imagine/.
python ./utils/merge.py <task_name> <split>
# Example:
python ./utils/merge.py phone_on_base trainYou can encode RLBench task instructions using our CLIP-based script:
cd imagine2act
python ./data_preprocessing/preprocess_rlbench_instructions.py \
--tasks phone_on_base place_cups put_knife_in_knife_block \
put_plate_in_colored_dish_rack put_toilet_roll_on_stand \
stack_cups stack_wine \
--output instructions.pklTrain Imagine2Act on RLBench
cd imagine2act
bash scripts/train.shThis script launches single-view training using our imagined goal representations and transformation supervision on all selected RLBench tasks.
After training, checkpoints and logs will be saved under ./imagine2act/train_logs/.
Pretrained Models
You can also download our pretrained Imagine2Act models on RLBench from here.
To evaluate a trained or pretrained model, run:
cd imagine2act
bash scripts/eval.shThe evaluation results, including success rates and visualizations, will be saved under ./imagine2act/eval_logs/.
If you find this code useful for your research, please consider citing our paper "Imagine2Act: Leveraging Object-Action Motion Consistency from Imagined Goals for Robotic Manipulation".
@misc{heng2025imagine2actleveragingobjectactionmotion,
title={Imagine2Act: Leveraging Object-Action Motion Consistency from Imagined Goals for Robotic Manipulation},
author={Liang Heng and Jiadong Xu and Yiwen Wang and Xiaoqi Li and Muhe Cai and Yan Shen and Juan Zhu and Guanghui Ren and Hao Dong},
year={2025},
eprint={2509.17125},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2509.17125},
}
This code base is released under the MIT License (refer to the LICENSE file for details).
Parts of this codebase have been adapted from 3D Diffuser Actor.
