# ACT (Action Chunking with Transformers)

ACT is a **lightweight and efficient policy for imitation learning**, especially well-suited for fine-grained manipulation tasks. It's the **first model we recommend when you're starting out** with LeRobot due to its fast training time, low computational requirements, and strong performance.

<div class="video-container">
  <iframe
    width="100%"
    height="415"
    src="https://www.youtube.com/embed/ft73x0LfGpM"
    title="LeRobot ACT Tutorial"
    frameborder="0"
    allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture"
    allowfullscreen
  ></iframe>
</div>

_Watch this tutorial from the LeRobot team to learn how ACT works: [LeRobot ACT Tutorial](https://www.youtube.com/watch?v=ft73x0LfGpM)_

## Model Overview

Action Chunking with Transformers (ACT) was introduced in the paper [Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware](https://arxiv.org/abs/2304.13705) by Zhao et al. The policy was designed to enable precise, contact-rich manipulation tasks using affordable hardware and minimal demonstration data.

### Why ACT is Great for Beginners

ACT stands out as an excellent starting point for several reasons:

- **Fast Training**: Trains in a few hours on a single GPU
- **Lightweight**: Only ~80M parameters, making it efficient and easy to work with
- **Data Efficient**: Often achieves high success rates with just 50 demonstrations

### Architecture

ACT uses a transformer-based architecture with three main components:

1. **Vision Backbone**: ResNet-18 processes images from multiple camera viewpoints
2. **Transformer Encoder**: Synthesizes information from camera features, joint positions, and a learned latent variable
3. **Transformer Decoder**: Generates coherent action sequences using cross-attention

The policy takes as input:

- Multiple RGB images (e.g., from wrist cameras, front/top cameras)
- Current robot joint positions
- A latent style variable `z` (learned during training, set to zero during inference)

It outputs a chunk of the next `k` actions, rather than predicting a single action per step.
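
Because a new chunk is predicted at every timestep, several overlapping predictions exist for the current step. The ACT paper smooths these with *temporal ensembling*: an exponentially weighted average with weights `w_i = exp(-m * i)`, where `i = 0` is the oldest prediction. Here is a minimal sketch with scalar actions; the function name is ours, and `m = 0.01` follows the paper's default:

```python
import math

def temporal_ensemble(predictions, m=0.01):
    """Average the overlapping predictions for one timestep.

    `predictions` holds every action predicted for the current timestep,
    ordered oldest first. Weights follow w_i = exp(-m * i), so older
    predictions count slightly more, as described in the ACT paper.
    """
    weights = [math.exp(-m * i) for i in range(len(predictions))]
    return sum(w * a for w, a in zip(weights, predictions)) / sum(weights)

# With m = 0 the ensemble reduces to a plain average: (2.0 + 4.0) / 2 = 3.0
print(temporal_ensemble([2.0, 4.0], m=0.0))
```

A larger `m` trusts older predictions more; `m = 0` weights all overlapping chunks equally.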

## Installation Requirements

1. Install LeRobot by following our [Installation Guide](./installation).
2. ACT is included in the base LeRobot installation, so no additional dependencies are needed!

## Training ACT

ACT works seamlessly with the standard LeRobot training pipeline. Here's a complete example for training ACT on your dataset:

```bash
lerobot-train \
  --dataset.repo_id=${HF_USER}/your_dataset \
  --policy.type=act \
  --output_dir=outputs/train/act_your_dataset \
  --job_name=act_your_dataset \
  --policy.device=cuda \
  --wandb.enable=true \
  --policy.repo_id=${HF_USER}/act_policy
```

### Training Tips

1. **Start with defaults**: ACT's default hyperparameters work well for most tasks
2. **Training duration**: Expect a few hours for 100k training steps on a single GPU
3. **Batch size**: Start with batch size 8 and adjust based on your GPU memory
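
A run that overrides the step count and batch size might look like the following. This is a sketch only: `--batch_size` and `--steps` are assumed here to be LeRobot's top-level training options, so verify the exact flag names with `lerobot-train --help` on your installed version:

```bash
lerobot-train \
  --dataset.repo_id=${HF_USER}/your_dataset \
  --policy.type=act \
  --output_dir=outputs/train/act_your_dataset \
  --batch_size=8 \
  --steps=100000 \
  --policy.device=cuda
```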

### Train using Google Colab

If your local computer doesn't have a powerful GPU, you can utilize Google Colab to train your model by following the [ACT training notebook](./notebooks#training-act).

## Evaluating ACT

Once training is complete, you can evaluate your ACT policy using the `lerobot-record` command with your trained policy. This will run inference and record evaluation episodes:

```bash
lerobot-record \
  --robot.type=so100_follower \
  --robot.port=/dev/ttyACM0 \
  --robot.id=my_robot \
  --robot.cameras="{ front: {type: opencv, index_or_path: 0, width: 640, height: 480, fps: 30}}" \
  --display_data=true \
  --dataset.repo_id=${HF_USER}/eval_act_your_dataset \
  --dataset.num_episodes=10 \
  --dataset.single_task="Your task description" \
  --policy.path=${HF_USER}/act_policy
```