From 4ccf28437a785e453888de8e4b415dc9d35ac4e0 Mon Sep 17 00:00:00 2001
From: Pepijn <138571049+pkooij@users.noreply.github.com>
Date: Wed, 8 Oct 2025 20:07:14 +0200
Subject: [PATCH] Add act documentation (#2139)

* Add act documentation
* remove citation as we link the paper
* simplify docs
* fix pre commit
---
 docs/source/_toctree.yml |  2 +
 docs/source/act.mdx      | 92 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 94 insertions(+)
 create mode 100644 docs/source/act.mdx

diff --git a/docs/source/_toctree.yml b/docs/source/_toctree.yml
index 36eaea16..3b6cccc9 100644
--- a/docs/source/_toctree.yml
+++ b/docs/source/_toctree.yml
@@ -27,6 +27,8 @@
       title: Porting Large Datasets
     title: "Datasets"
 - sections:
+  - local: act
+    title: ACT
   - local: smolvla
     title: SmolVLA
   - local: pi0

diff --git a/docs/source/act.mdx b/docs/source/act.mdx
new file mode 100644
index 00000000..e3294ca6
--- /dev/null
+++ b/docs/source/act.mdx
@@ -0,0 +1,92 @@

# ACT (Action Chunking with Transformers)

ACT is a **lightweight and efficient policy for imitation learning**, especially well-suited for fine-grained manipulation tasks. It's the **first model we recommend when you're starting out** with LeRobot due to its fast training time, low computational requirements, and strong performance.

_Watch this tutorial from the LeRobot team to learn how ACT works: [LeRobot ACT Tutorial](https://www.youtube.com/watch?v=ft73x0LfGpM)_

## Model Overview

Action Chunking with Transformers (ACT) was introduced in the paper [Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware](https://arxiv.org/abs/2304.13705) by Zhao et al. The policy was designed to enable precise, contact-rich manipulation tasks using affordable hardware and minimal demonstration data.

### Why ACT is Great for Beginners

ACT stands out as an excellent starting point for several reasons:

- **Fast Training**: Trains in a few hours on a single GPU
- **Lightweight**: Only ~80M parameters, making it efficient and easy to work with
- **Data Efficient**: Often achieves high success rates with just 50 demonstrations

### Architecture

ACT uses a transformer-based architecture with three main components:

1. **Vision Backbone**: A ResNet-18 processes images from multiple camera viewpoints
2. **Transformer Encoder**: Synthesizes information from camera features, joint positions, and a learned latent variable
3. **Transformer Decoder**: Generates coherent action sequences using cross-attention

The policy takes as input:

- Multiple RGB images (e.g., from wrist cameras, front/top cameras)
- Current robot joint positions
- A latent style variable `z` (learned during training, set to zero during inference)

From these inputs it outputs a chunk of `k` future actions in a single forward pass, rather than one action at a time; the sketch below illustrates this interface.

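To make the interface concrete, here is a minimal inference sketch. It assumes a recent LeRobot release in which `ACTPolicy` is importable from `lerobot.policies.act.modeling_act` and exposes `from_pretrained`, `reset`, and `select_action`; the observation keys, tensor shapes, and repo id are illustrative placeholders that must match your own dataset and trained policy.

```python
# Minimal sketch: querying a trained ACT policy on a dummy observation.
# Import path, observation keys, and shapes are assumptions; check the
# LeRobot version you have installed and your policy's config.
import torch

from lerobot.policies.act.modeling_act import ACTPolicy

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the policy you trained (placeholder repo id, matching the training command below).
policy = ACTPolicy.from_pretrained("${HF_USER}/act_policy")
policy.to(device)
policy.eval()
policy.reset()  # clear the internal action queue between episodes

# Illustrative observation: one RGB camera plus joint positions.
observation = {
    "observation.images.front": torch.rand(1, 3, 480, 640, device=device),  # floats in [0, 1]
    "observation.state": torch.zeros(1, 6, device=device),  # e.g. 6 joint positions
}

with torch.no_grad():
    action = policy.select_action(observation)  # one action per call

print(action.shape)  # (1, action_dim)
```

In your control loop you simply call `select_action` once per step; the chunking itself is handled inside the policy.
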
## Installation Requirements

1. Install LeRobot by following our [Installation Guide](./installation).
2. ACT is included in the base LeRobot installation, so no additional dependencies are needed!

## Training ACT

ACT works seamlessly with the standard LeRobot training pipeline. Here's a complete example for training ACT on your dataset:

```bash
lerobot-train \
    --dataset.repo_id=${HF_USER}/your_dataset \
    --policy.type=act \
    --output_dir=outputs/train/act_your_dataset \
    --job_name=act_your_dataset \
    --policy.device=cuda \
    --wandb.enable=true \
    --policy.repo_id=${HF_USER}/act_policy
```

### Training Tips

1. **Start with defaults**: ACT's default hyperparameters work well for most tasks
2. **Training duration**: Expect a few hours for 100k training steps on a single GPU
3. **Batch size**: Start with batch size 8 and adjust based on your GPU memory

### Train using Google Colab

If your local computer doesn't have a powerful GPU, you can use Google Colab to train your model by following the [ACT training notebook](./notebooks#training-act).

## Evaluating ACT

Once training is complete, you can evaluate your ACT policy by passing it to the `lerobot-record` command. This runs inference on the robot and records the evaluation episodes:

```bash
lerobot-record \
    --robot.type=so100_follower \
    --robot.port=/dev/ttyACM0 \
    --robot.id=my_robot \
    --robot.cameras="{ front: {type: opencv, index_or_path: 0, width: 640, height: 480, fps: 30}}" \
    --display_data=true \
    --dataset.repo_id=${HF_USER}/eval_act_your_dataset \
    --dataset.num_episodes=10 \
    --dataset.single_task="Your task description" \
    --policy.path=${HF_USER}/act_policy
```
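
As a quick sanity check after recording, you can load the evaluation episodes back as a regular dataset. The snippet below is a small sketch, assuming the `LeRobotDataset` class from `lerobot.datasets.lerobot_dataset` (the import path has moved between releases) and reusing the placeholder repo id from the command above; attribute names such as `num_episodes`, `num_frames`, and `fps` follow recent releases.

```python
# Sketch: inspect the evaluation episodes recorded by lerobot-record.
# Import path and attribute names are assumptions based on recent
# LeRobot releases; adjust them to the version you have installed.
from lerobot.datasets.lerobot_dataset import LeRobotDataset

dataset = LeRobotDataset("${HF_USER}/eval_act_your_dataset")  # placeholder repo id

print(f"episodes: {dataset.num_episodes}")
print(f"frames:   {dataset.num_frames}")
print(f"fps:      {dataset.fps}")

# Each frame is a dict of tensors (camera images, robot state, actions, ...)
# using the same feature keys the policy was trained and evaluated with.
frame = dataset[0]
print(sorted(frame.keys()))
```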