In this work, we propose a motion-based self-reflection framework for fine-grained robotic action correction, which leverages motion instruction as a bridge to connect high-level semantic reflection with low-level robotic action correction.
Concretely, we leverage the reasoning capability of Multi-modal Large Language Models (MLLMs) for motion instruction correction and propose a multi-task motion-conditioned diffusion policy for low-level robotic action correction.
Experimental results demonstrate the effectiveness of our method.
First, install the simulation environment; please refer to the MimicGen documentation for more details:
mkdir deps
cd ${project_path}/deps
git clone https://github.com/NVlabs/mimicgen.git
cd mimicgen
pip install -e .
cd ${project_path}/deps
git clone https://github.com/ARISE-Initiative/robosuite.git
cd robosuite
git checkout b9d8d3de5e3dfd1724f4a0e6555246c460407daa
pip install -e .
cd ${project_path}/deps
git clone https://github.com/ARISE-Initiative/robomimic.git
cd robomimic
git checkout d0b37cf214bd24fb590d182edb6384333f67b661
pip install -e .
cd ${project_path}/deps
git clone https://github.com/ARISE-Initiative/robosuite-task-zoo
cd robosuite-task-zoo
git checkout 74eab7f88214c21ca1ae8617c2b2f8d19718a9ed
pip install -e .
Next, install LLaVA; please refer to the LLaVA documentation for more details:
cd ${project_path}/deps
git clone https://github.com/haotian-liu/LLaVA.git
cd LLaVA
pip install -e .
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
To train our motion prediction module, motion correction module, and the corresponding motion-conditioned diffusion policy, we need to annotate the trajectories with their corresponding motion instructions.
- MimicGen Dataset Download
Here we use the "Coffee_D0" task as an example:
cd ${project_path}/deps/mimicgen
python mimicgen/scripts/download_datasets.py --dataset_type core --tasks coffee_d0
mv coffee_d0 ${project_path}/diffusion_policy_for_action/train_dataset/origin_datasets
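The downloaded dataset follows the robomimic hdf5 convention. As a quick sanity check, you can inspect it with h5py; the file path below is an assumption, so adjust it to the actual download location:

```python
# Quick sanity check of a robomimic-style hdf5 dataset (the file path is an assumption).
import h5py

with h5py.File("origin_datasets/coffee_d0/demo.hdf5", "r") as f:
    demos = list(f["data"].keys())                    # e.g. ["demo_0", "demo_1", ...]
    print(f"{len(demos)} demonstrations")
    demo = f["data"][demos[0]]
    print("actions:", demo["actions"].shape)          # (T, action_dim) low-level actions
    print("obs keys:", list(demo["obs"].keys()))      # per-timestep observation streams
```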
- Motion Instruction Generation
cd ${project_path}/diffusion_policy_for_action/train_dataset
python create_5hz_dataset_new_motion.py
This script will generate the motion information and save it in the adjust_llava_motion folder.
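The exact labeling logic lives in create_5hz_dataset_new_motion.py; conceptually, a coarse motion instruction can be derived from the end-effector displacement and gripper change over each 5 Hz window. The sketch below only illustrates this idea; the thresholds, axis conventions, and wording are assumptions, not the repo's actual scheme.

```python
# Illustrative derivation of a coarse motion instruction from one trajectory window
# (thresholds, axis conventions, and wording are assumptions, not the repo's scheme).
import numpy as np

def motion_instruction(eef_positions, gripper_openness, threshold=0.01):
    """eef_positions: (T, 3) xyz positions over one window; gripper_openness: (T,)."""
    delta = eef_positions[-1] - eef_positions[0]
    axis_words = [("backward", "forward"), ("right", "left"), ("down", "up")]
    directions = [pos if d > 0 else neg
                  for d, (neg, pos) in zip(delta, axis_words) if abs(d) > threshold]
    parts = ["move " + " and ".join(directions)] if directions else []
    if gripper_openness[-1] < gripper_openness[0] - 1e-3:
        parts.append("close the gripper")
    elif gripper_openness[-1] > gripper_openness[0] + 1e-3:
        parts.append("open the gripper")
    return ", ".join(parts) if parts else "keep still"

print(motion_instruction(np.array([[0.0, 0.0, 0.0], [0.03, 0.0, -0.02]]),
                         np.array([1.0, 0.0])))  # -> "move forward and down, close the gripper"
```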
- Data Format Conversion
cd ${project_path}/diffusion_policy_for_action/train_dataset
python create_speed_dataset.py
To improve training efficiency and GPU utilization, we change how trajectories are stored. The original hdf5 format keeps an entire trajectory as a single entry, so every __getitem__ call reads redundant observation and action data. We therefore store each timestep as its own entry: this increases the storage overhead, but it effectively improves GPU utilization.
Specifically, create_speed_dataset.py generates the converted data and saves it in the speed_adjust_llava_motion folder.
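A minimal sketch of the idea behind create_speed_dataset.py is shown below; the group and observation key names are assumptions. Instead of one group per trajectory, every sample index maps to a single timestep, so a __getitem__ call only reads the small slice it actually needs.

```python
# Illustrative conversion from per-trajectory to per-timestep storage
# (group and observation key names are assumptions).
import h5py

def convert_to_per_timestep(src_path, dst_path, horizon=16):
    with h5py.File(src_path, "r") as src, h5py.File(dst_path, "w") as dst:
        idx = 0
        for demo_name in src["data"]:
            demo = src["data"][demo_name]
            actions = demo["actions"][...]
            images = demo["obs"]["agentview_image"][...]
            for t in range(len(actions) - horizon + 1):
                g = dst.create_group(f"timestep_{idx}")
                g.create_dataset("obs", data=images[t])                   # one frame only
                g.create_dataset("actions", data=actions[t:t + horizon])  # short action chunk
                idx += 1
```

With this layout, the dataset's __getitem__ resolves to one small read instead of loading a whole trajectory, at the cost of duplicating overlapping action chunks on disk.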
- Generate the motion JSON file
cd ${project_path}/diffusion_policy_for_action/train_dataset
python generate_llava_json_dataset.py
This script will generate the motion JSON file for LLaVA fine-tuning and save it in the adjust_llava_motion folder.
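generate_llava_json_dataset.py produces data in LLaVA's conversation-style fine-tuning format. One entry looks roughly like the example below; the prompt and answer strings are illustrative and not the exact text emitted by the script:

```python
# One entry in LLaVA's fine-tuning JSON format (prompt/answer text is illustrative).
example_entry = {
    "id": "coffee_d0_demo_0_t0012",
    "image": "coffee_d0/demo_0/frame_0012.png",
    "conversations": [
        {"from": "human",
         "value": "<image>\nThe task is to make coffee. What motion should the robot perform next?"},
        {"from": "gpt",
         "value": "move forward and down, keep the gripper open"},
    ],
}
```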
- Render the RGB images for LLaVA
cd ${project_path}/diffusion_policy_for_action/train_dataset
python generate_render_llava_dataset.py
Based on the generated motion instructions, we can train the multi-task motion-conditioned diffusion policy:
cd ${project_path}/diffusion_policy_for_action
bash scripts/train_speed_dataset.sh
To train the motion prediction module (MPM), fine-tune the MLLM on the generated motion data:
cd ${project_path}/llava_for_motion_instruction
bash scripts/finetune_motion_prediction_module.sh
Utilizing the MPM and the motion-conditioned diffusion policy, we can leverage the common knowledge of MLLMs for low-level robotic action prediction. However, the MPM is trained only on successful expert data and therefore often struggles to recover from failure scenarios. Thus, we develop the motion correction module (MCM) to identify failure scenarios and correct the robot's behavior in such situations.
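Conceptually, the resulting closed loop looks like the sketch below; mpm, mcm, and policy are placeholders for the actual MLLM and diffusion-policy calls, not the repo's API:

```python
# Conceptual sketch of the motion-based self-reflection loop (names are placeholders).
def rollout(env, mpm, mcm, policy, max_steps=400, replan_every=8):
    obs = env.reset()
    motion = None
    for step in range(max_steps):
        if step % replan_every == 0:
            # High-level reflection: predict a motion instruction, then let the
            # correction module revise it if the current state looks like a failure.
            motion = mpm(obs)
            motion = mcm(obs, motion)
        # Low-level control: the diffusion policy conditions on the (corrected)
        # motion instruction to produce the next robot action.
        action = policy(obs, motion)
        obs, reward, done, info = env.step(action)
        if done:
            break
    return obs
```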
To gather interaction correction data, we design an online human-intervention approach. Since it runs on the server side, we save images so that the operator can intervene based on the rendered views; if a local display is available, the script can be adapted for real-time rendering instead:
cd ${project_path}/diffusion_policy_for_action
bash scripts/collect_human_feedback.sh
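The sketch below illustrates the idea of server-side, image-based intervention; the observation key, file layout, and function names are placeholder assumptions rather than the implementation in collect_human_feedback.sh:

```python
# Illustrative server-side human-intervention loop (observation key, file layout,
# and function names are placeholder assumptions, not the repo's implementation).
import json
import imageio

def collect_feedback(env, mpm, policy, out_dir, max_steps=400, replan_every=8):
    obs = env.reset()
    records, motion = [], None
    for step in range(max_steps):
        if step % replan_every == 0:
            image_name = f"step_{step:04d}.png"
            imageio.imwrite(f"{out_dir}/{image_name}", obs["agentview_image"])
            motion = mpm(obs)
            answer = input(f"[{step}] predicted '{motion}'; type a correction or press Enter: ")
            if answer:
                records.append({"image": image_name, "predicted": motion, "corrected": answer})
                motion = answer
        obs, reward, done, info = env.step(policy(obs, motion))
        if done:
            break
    with open(f"{out_dir}/feedback.json", "w") as f:
        json.dump(records, f, indent=2)
```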
With the collected correction data, fine-tune the motion correction module (MCM):
cd ${project_path}/llava_for_motion_instruction
bash scripts/finetune_motion_correction_module.sh
To accelerate inference, we design a multi-environment inference script.
The following inference script uses only the MPM for motion instruction prediction:
cd ${project_path}/diffusion_policy_for_action
bash scripts/inference.sh
The following inference script additionally uses the MCM for motion correction:
cd ${project_path}/diffusion_policy_for_action
bash scripts/inference_with_auto_feedback.sh