This repository contains the official implementation of the paper "RADI: LLMs as World Models for Robotic Action Decomposition and Imagination".
RADI (Robotic Action Decomposition and Imagination) is a unified framework that leverages memory-augmented hierarchical decomposition and environmental imagination to simulate and verify action outcomes. When discrepancies are detected, self-reflective loops are triggered to iteratively re-optimize action plans without external supervision; a sketch of this loop follows the feature list below.
- Multi-layer Decomposition: A three-layer strategy that decomposes goals into tasks and tasks into actions
- Environmental Imagination: Uses LLMs as world models to predict action outcomes
- Memory and Reflection: Experience learning and error analysis through WorldModelMemory
- Self-Correction: Automatic plan correction based on verification results
- Multi-model Support: Supports GPT-4o and open-source LLMs (Llama-2, etc.)
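The decompose-imagine-verify-reflect loop can be summarized as follows. This is an illustrative sketch only: the helper names (`decompose`, `imagine`, `verify`, `reflect_and_replan`, `record`) are hypothetical stand-ins, not the repository's actual API (see `world_model.py` and `llm_policy.py` below for the real entry points).

```python
# Illustrative sketch of the RADI control flow; all helpers are hypothetical.
def radi_plan(goal, llm, memory, max_rounds=3):
    plan = llm.decompose(goal)          # goal -> tasks -> actions (three layers)
    for _ in range(max_rounds):
        predicted = llm.imagine(plan)   # LLM-as-world-model predicts outcomes
        ok, feedback = llm.verify(plan, predicted)  # detect discrepancies
        if ok:
            return plan                 # plan survives imagined execution
        memory.record(feedback)         # store error analysis for reflection
        plan = llm.reflect_and_replan(plan, feedback, memory)  # self-correction
    return plan
```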
- Python 3.10
- PyTorch >= 1.12.0
- transformers >= 4.21.0
- sentence-transformers
- openai >= 0.27.0 (for ChatGPT)
```bash
# Create and activate conda environment
conda create -n RADI python=3.10
conda activate RADI
# Clone repository
git clone https://github.com/anonymous/RADI.git
cd RADI
# Install dependencies
pip install -r requirement.txt
```

We conduct experiments in VirtualHome. Download the VirtualHome executable file (v2.2.5) from [Download VirtualHome here] and unzip it to `RADI/RADI/virtualhome/`:
```
# Download and extract VirtualHome to the following directory structure
RADI/
└── RADI/
    └── virtualhome/
        └── simulation/
            └── unity_simulator/
                └── v2.2.5/
                    └── linux_exec.v2.2.5_beta.x86_64
```

For open-source LLMs, download the models to the `pretrain/` directory:
```
RADI/
└── pretrain/
    ├── llama-2-7b-chat-hf/
    ├── llama-2-13b-chat-hf/
    ├── Meta-Llama-3-8B-Instruct/
    ├── bloom-3b/
    ├── bloom-7b/
    ├── chatglm3-6b/
    └── chatglm3-6b-32k/
```

For ChatGPT, configure your API key in the `test_openai.py` file:
```python
# Edit RADI/RADI/task-planning/test_openai.py and replace with your API key
import openai

client = openai.OpenAI(api_key="your_openai_api_key_here")
```

Note: the framework uses the OpenAI client directly through the `test_openai.py` module rather than environment variables.
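To sanity-check the configuration, a minimal round-trip call with the same client can be used (a sketch; the model name here is only an example):

```python
# Minimal API connectivity check using the OpenAI 1.x client shown above.
import openai

client = openai.OpenAI(api_key="your_openai_api_key_here")
resp = client.chat.completions.create(
    model="gpt-4o",  # example model name
    messages=[{"role": "user", "content": "Reply with 'ok'."}],
)
print(resp.choices[0].message.content)
```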
We evaluate our method on four datasets: three from LID (In-Distribution, NovelScenes, NovelTasks) and one created by ourselves (LongTasks). Download the four datasets from [Download datasets here] and unzip them to `RADI/RADI/data/test_init_env/`:
```
# Dataset directory structure
RADI/
└── RADI/
    └── data/
        └── test_init_env/
            ├── InDistributation.p
            ├── NovelScenes.p
            ├── NovelTasks.p
            └── LongTasks.p
```

You can run `RADI/RADI/data/create_long.py` to create the LongTasks dataset yourself. We removed some samples from LID due to environment bugs.
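For example, from the repository root (assuming the script writes into the directory shown above):

```bash
# Regenerate the LongTasks dataset
python RADI/RADI/data/create_long.py
```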
We employ LLMs of various scales as the backbone, including Llama-2-chat (7B, 13B), bloom (3B, 7B), ChatGLM3-6B, etc. For open-source LLMs, please download the models to `RADI/pretrain/`, as sketched after this list:
- Llama-2-7b-chat-hf
- Llama-2-13b-chat-hf
- Meta-Llama-3-8B-Instruct
- bloom-3b, bloom-7b
- chatglm3-6b, etc.
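One way to fetch the open-source weights is via the Hugging Face CLI (an assumption, not a repository requirement; the Llama-2 repositories are gated, so you must accept Meta's license on the Hub and run `huggingface-cli login` first):

```bash
# Example: download Llama-2-13b-chat into pretrain/ (run from the repository root)
huggingface-cli download meta-llama/Llama-2-13b-chat-hf \
    --local-dir pretrain/llama-2-13b-chat-hf
```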
For closed-source models like ChatGPT, please configure your API key in the test_openai.py file by replacing the API key string.
Go to RADI/Instruction-Tuning/. Run run_bloom-3b.sh or run_bloom-7b.sh to fine-tune bloom. Run run_chatglm.sh to fine-tune ChatGLM3-6B or ChatGLM3-6B-32K. Run run_llama-7b.sh or run_llama-13b.sh to fine-tune Llama-2-chat or LongAlpaca. You can modify parameters such as "dataset", "train_batch_size", and "accumulation_steps" to fit your own training.
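A typical invocation (the tunable variables live inside each script, so the exact contents may differ):

```bash
cd RADI/Instruction-Tuning/
# Fine-tune bloom-3b; edit "dataset", "train_batch_size", and
# "accumulation_steps" inside the script to fit your own training.
bash run_bloom-3b.sh
```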
Go to RADI/RADI/task-planning/. Run bash scripts/llm_eval.sh to evaluate open-source LLMs for "RADI", "Embodied", "ReAct", "RADI-goal", and "RADI-task". Run bash scripts/llm_eval_demo.sh to evaluate open-source LLMs for "RADI-ft". Run bash scripts/gpt_eval.sh to evaluate closed-source LLMs for "RADI", "Embodied", "ReAct".
```bash
cd RADI/RADI/task-planning/

# First, configure your API key in test_openai.py
# Edit test_openai.py and replace "your_openai_api_key_here" with your actual key

# Evaluate on a single dataset
python llm_eval.py \
    --llm gpt-4o \
    --mode multi-layer \
    --subset NovelTasks \
    --max_retry 3 \
    --base-port 8679

# Using a Llama-2 model
python llm_eval.py \
    --llm ../../pretrain/llama-2-13b-chat-hf \
    --mode multi-layer \
    --subset NovelTasks \
    --max_retry 1
```

The world-model verification API can also be called directly from Python:

```python
from world_model import verify_action_plan, WorldModelMemory
# Initialize the memory manager that stores verification experience
memory = WorldModelMemory(memory_file="worldmodel_memory.json")

# Verify an action plan against the world model description
world_model_str = "..."  # world model description
action_plan = "..."      # action plan to verify
verification_result, is_executable = verify_action_plan(
    world_model_str, action_plan, memory=memory
)
```

The LLM policy drives multi-layer plan generation and action execution:

```python
from llm_policy import LLMPolicy
# Initialize the LLM policy
llm_policy = LLMPolicy(args, logging)
llm_policy.reset(ids=task_id)
llm_policy.set_graph(env_graph)
llm_policy.set_goal(task_goal)

# Generate a multi-layer decomposition plan
llm_policy.generate_multi_layer_plan()

# Step through the generated actions
while True:
    action = llm_policy.get_action_from_llm()
    if action == 'DONE':
        break
    # Execute action...
```

The evaluation pipeline contains the complete RADI framework process:
- World model construction and enhancement
- Action plan verification and correction
- Multi-round reflection mechanism
- Performance metric calculation
Key parameters for llm_eval.py:

- base_port: port number for the VirtualHome environment
- llm: path to the LLM backbone (use "gpt-4o" for ChatGPT)
- lora: path to the LoRA weights; "None" to use the LLM backbone alone or for closed-source LLMs
- mode: task-planning method: "multi-layer" for "RADI", "react" for "ReAct", "embodied" for "Embodied"
- demo: add this flag to use demonstrations
- max_retry: number of attempts the task-planning model may make. We set 1 in our experiments; a larger value gives a higher success rate at the cost of longer inference time, which is useful for generating more training corpus
Note: API key configuration is handled in test_openai.py, not through command line parameters.
```bash
# Run RADI evaluation on all datasets (configure the API key in test_openai.py first)
for subset in InDistributation NovelScenes NovelTasks LongTasks; do
    python llm_eval.py \
        --llm gpt-4o \
        --mode multi-layer \
        --subset $subset \
        --max_retry 99 \
        --base-port $((8679 + $RANDOM % 100))
done
```

The program will output the following performance metrics:
```
**************** Current Evaluation Metric ****************
Successful / Executable / Current / Total: 45 / 52 / 60 / 100
Success Rate: 75.00
Executability: 86.67
```
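Both percentages are computed over the number of tasks evaluated so far ("Current"), which the example numbers confirm:

```python
# Reproduce the metrics from the counters above: 45 / 52 / 60 / 100
successful, executable, current, total = 45, 52, 60, 100
print(f"Success Rate: {100 * successful / current:.2f}")   # 75.00
print(f"Executability: {100 * executable / current:.2f}")  # 86.67
```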
Output files:

- success_rate_results.txt: success rate summary for each dataset
- worldmodel_memory.json: world model memory file
- log.log: detailed execution logs
Q: The VirtualHome simulator fails to start or connect?

A: Check the executable file path and port settings:
```bash
# Ensure the path is correct
ls RADI/RADI/virtualhome/simulation/unity_simulator/v2.2.5/
# Try a different port
python llm_eval.py --base-port 8680
```

Q: OpenAI API calls fail?

A: Check the API key configuration in test_openai.py:
```bash
# Edit test_openai.py and ensure your API key is correctly set
# Test the API connection
python test_openai.py
```

Q: Dataset files are missing or not found?

A: Ensure the dataset paths are correct:
```bash
# Check dataset files
ls RADI/RADI/data/test_init_env/
# Should contain: InDistributation.p, NovelScenes.p, NovelTasks.p, LongTasks.p
```

Project structure:

```
RADI/
├── RADI/
│   ├── task-planning/
│   │   ├── llm_eval.py               # Main evaluation script
│   │   ├── interactive_evaluation.py # RADI evaluation logic
│   │   ├── llm_policy.py             # LLM policy implementation
│   │   ├── world_model.py            # World model and verification
│   │   ├── arguments.py              # Parameter configuration
│   │   ├── init_path.py              # Path initialization
│   │   └── sim_compute.py            # Similarity computation
│   ├── data/                         # Data directory
│   └── virtualhome/                  # VirtualHome environment
└── pretrain/                         # Pre-trained model directory
```