DexWM: World Models for Learning Dexterous Hand-Object Interactions from Human Videos
Official PyTorch Implementation
This repo contains the official PyTorch implementation of DexWM: World Models for Learning Dexterous Hand-Object Interactions from Human Videos.
Authors:
Raktim Gautam Goswami1,2, Amir Bar1, David Fan1, Tsung-Yen Yang1, Gaoyue Zhou1,2, Prashanth Krishnamurthy2, Michael Rabbat1, Farshad Khorrami2, Yann LeCun1,2
1 Meta-FAIR 2 New York University
Download the repo and set up the environment:
git clone https://github.com/facebookresearch/dexwm
conda create -n dexwm python=3.11
conda activate dexwm
pip install -r requirements.txt
DexWM is pre-trained on the EgoDex and DROID datasets and fine-tuned on exploratory sequences from the RoboCasa simulation data. Download the EgoDex, DROID, and RoboCasa Random datasets. See the end of this README for the expected directory structure inside each dataset folder.
Note:
Change the egodex_root_folder and droid_root_folder locations in the config file before running the code.
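For example, the dataset entries in the config might look like this (the key names are taken from the note above; the paths are placeholders for your local setup):

```yaml
# Dataset locations -- edit to match your machine (placeholder paths)
egodex_root_folder: /path/to/egodex
droid_root_folder: /path/to/DROID
```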
Using torchrun:
bash scripts/train_torchrun.sh --job_dir <job_dir>
Update the script variables to match your available compute resources (e.g., number of nodes, GPUs per node, and host address). Defaults are 1 node, 8 GPUs per node, and localhost.
Or using submitit and slurm:
bash scripts/train_submitit.sh
Update the script variables to match your available compute resources and job_dir. By default, this script trains the model on 32 nodes with 8 GPUs each.
Or locally on one GPU for debug:
python train_wm.py --config configs/egodex_and_droid.yaml --job_dir <job_dir>
On the first training run, the code generates split_indices_droid.json to define a DROID validation split. This file is only used to report/track validation loss and is not used elsewhere.
Change the root_folder location and resume path to the pre-trained model in the config file before running the code.
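As with pre-training, the fine-tuning config should point at your local data and the pre-trained checkpoint. A sketch (the root_folder key is named in the instruction above; the exact key name for the resume path and both paths shown are assumptions):

```yaml
# Fine-tuning config sketch (placeholder paths; the resume key name may differ)
root_folder: /path/to/robocasa_random_data
resume: /path/to/pretrained/checkpoint.pth
```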
Using torchrun:
bash scripts/multistep_train_torchrun.sh --job_dir <job_dir>
Update the script variables to match your available compute resources (e.g., number of nodes, GPUs per node, and host address). Defaults are 1 node, 1 GPU per node, and localhost.
Or using submitit and slurm:
bash scripts/multistep_train_submitit.sh
Update the script variables to match your available compute resources and job_dir. By default, this script trains the model on 1 node with 1 GPU.
Or locally on one GPU for debug:
python train_multistep_wm.py --config configs/robocasa_random_multistep.yaml --job_dir <job_dir>
On the first training run, the code generates split_indices_robocasa_random.json to define a RoboCasa Random validation split. This file is only used to report/track validation loss and is not used elsewhere.
- Set the model checkpoint: Edit test_scripts/test_script.sh and update the model checkpoint path to the checkpoint you want to evaluate.
- Download the keypoint model: Evaluation also uses a separately trained keypoint model to predict keypoints from the world model's predicted latent states. Download this model from the checkpoint download page and configure its path in test_scripts/test_script.sh as well.
- (Optional) Visualization: The test script can visualize predicted states. To enable this, you must train a decoder and configure the decoder path/settings in the code.
bash test_scripts/test_script.sh
This writes two rollout metrics to the output_dir specified in test_scripts/test_script.sh:
- L2 Error
- PCK (Percentage of Correct Keypoints)
Each metric is saved as an array evaluated every 0.2 seconds, from 0.2s up to 4.0s.
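As a minimal sketch, assuming each saved metric is a 20-element array, the i-th entry corresponds to the following prediction horizon:

```python
# Map each index of a 20-element rollout-metric array to its prediction
# horizon in seconds: entries are evaluated every 0.2 s, from 0.2 s to 4.0 s.
horizons = [round(0.2 * (i + 1), 2) for i in range(20)]
print(horizons[0], horizons[-1])  # → 0.2 4.0
```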
To view the aggregated losses similar to the format reported in the paper, run
python test_scripts/result_stats.py --output_dir <output_dir>
- Install and configure the RoboCasa simulator with the MURP robot following the instructions here.
- Download the Pick-and-Place dataset. It provides the visual goal images used for the manipulation tasks.
- Run evaluation
Before running, update the script variables to match your compute setup (e.g., number of nodes/GPUs), job_dir, and any other relevant settings. By default, the script uses 1 node with 8 GPUs.
conda activate robot_sim_dexwm
bash scripts/test_robot_sim.sh
- At the end of evaluation, a res.json file is generated in the job_dir, containing a dictionary that maps each task name to its success/failure outcome.
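To aggregate these results into an overall success rate, a small script along these lines could be used (how res.json encodes success/failure is an assumption here; adapt the predicate to the actual file):

```python
import json
from pathlib import Path

def success_rate(res_path):
    """Compute the fraction of successful tasks from a res.json-style dict.

    Assumes the file maps task names to truthy (success) / falsy (failure)
    values; adjust if the repo encodes outcomes differently.
    """
    results = json.loads(Path(res_path).read_text())
    return sum(bool(v) for v in results.values()) / len(results)

# Example with a stand-in dictionary instead of a real res.json:
example = {"pick_mug": True, "place_bowl": False, "pick_can": True}
rate = sum(bool(v) for v in example.values()) / len(example)
print(f"{rate:.1%}")  # → 66.7%
```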
The EgoDex and DROID datasets are arranged as follows:
egodex
├── train
│ ├── <task_1>
│ │ ├── 0.hdf5
│ │ ├── 0.mp4
│ │ ├── 1.hdf5
│ │ └── 1.mp4
│ │ ...
│ ├── <task_2>
│ │ ├── 0.hdf5
│ │ ├── 0.mp4
│ │ ├── 1.hdf5
│ │ └── 1.mp4
│ │ ...
│ ...
├── test
│ ├── <task_k>
│ │ ├── 0.hdf5
│ │ ├── 0.mp4
│ │ ├── 1.hdf5
│ │ └── 1.mp4
│ │ ...
│ ...
DROID
├── <lab_name>
│ ├── success
│ │ ├── <date_1>
│ │ │ ├── <time_1>
│ │ │ │ ├── recordings
│ │ │ │ │ ├── MP4
│ │ │ │ │ │ └── ...
│ │ │ │ │ ├── SVO
│ │ │ │ │ │ └── ...
│ │ │ │ ├── metadata_....json
│ │ │ │ └── ...
│ │ └── ...
│ │ ...
│ ├── failure
│ │ ├── <date_i>
│ │ │ ├── <time_j>
│ │ │ │ ├── recordings
│ │ │ │ │ ├── MP4
│ │ │ │ │ │ └── ...
│ │ │ │ │ ├── SVO
│ │ │ │ │ │ └── ...
│ │ │ │ ├── metadata_....json
│ │ │ │ └── ...
│ │ └── ...
│ │ ...
robocasa_random_data
├── exploratory_movements
│ ├── combine_demos_0.hdf5
│ └── combine_demos_1.hdf5
│ ...
├── gripper_open_and_close
│ ├── combine_demos_0.hdf5
│ └── combine_demos_1.hdf5
│ ...
├── pick-and-place-2.0
│ ├── combine_demos_0.hdf5
│ └── combine_demos_1.hdf5
│ ...
DexWM is licensed under CC-BY-NC.
@article{goswami2025world,
title={World Models for Learning Dexterous Hand-Object Interactions from Human Videos},
author={Goswami, Raktim Gautam and Bar, Amir and Fan, David and Yang, Tsung-Yen and Zhou, Gaoyue and Krishnamurthy, Prashanth and Rabbat, Michael and Khorrami, Farshad and LeCun, Yann},
journal={arXiv preprint arXiv:2512.13644},
year={2025}
}