# Multi-GPU Training

This guide shows you how to train policies on multiple GPUs using [Hugging Face Accelerate](https://huggingface.co/docs/accelerate).

## Installation

First, ensure you have accelerate installed:

```bash
pip install accelerate
```
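
To confirm that the install worked and that your GPUs are visible, the commands below are a quick sanity check (`accelerate env` ships with accelerate; `nvidia-smi` assumes an NVIDIA driver setup and is not part of accelerate):

```bash
# Print accelerate's view of your environment (version, detected hardware, saved config)
accelerate env

# List the GPUs visible to this shell (NVIDIA-specific)
nvidia-smi --list-gpus
```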

## Training with Multiple GPUs

You can launch training in two ways:

### Option 1: Without config (specify parameters directly)

You can specify all parameters directly in the command without running `accelerate config`:

```bash
accelerate launch \
  --multi_gpu \
  --num_processes=2 \
  $(which lerobot-train) \
  --dataset.repo_id=${HF_USER}/my_dataset \
  --policy.type=act \
  --policy.repo_id=${HF_USER}/my_trained_policy \
  --output_dir=outputs/train/act_multi_gpu \
  --job_name=act_multi_gpu \
  --wandb.enable=true
```

**Key accelerate parameters:**

- `--multi_gpu`: Enable multi-GPU training
- `--num_processes=2`: Number of GPUs to use
- `--mixed_precision=fp16`: Use fp16 mixed precision (or `bf16` if supported)
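
For example, to add bf16 mixed precision on top of the Option 1 command (a sketch reusing the same illustrative arguments; adjust `--num_processes` to your GPU count):

```bash
accelerate launch \
  --multi_gpu \
  --num_processes=2 \
  --mixed_precision=bf16 \
  $(which lerobot-train) \
  --dataset.repo_id=${HF_USER}/my_dataset \
  --policy.type=act \
  --policy.repo_id=${HF_USER}/my_trained_policy \
  --output_dir=outputs/train/act_multi_gpu \
  --job_name=act_multi_gpu \
  --wandb.enable=true
```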

### Option 2: Using accelerate config

If you prefer to save a reusable configuration for your hardware setup, run:

```bash
accelerate config
```

This interactive setup asks about your training environment (number of GPUs, mixed precision settings, etc.) and saves the configuration for future use. For a simple multi-GPU setup on a single machine, you can use these recommended settings:

- Compute environment: This machine
- Number of machines: 1
- Number of processes: (number of GPUs you want to use)
- GPU ids to use: (leave empty to use all)
- Mixed precision: fp16 or bf16 (recommended for faster training)

Then launch training with:

```bash
accelerate launch $(which lerobot-train) \
  --dataset.repo_id=${HF_USER}/my_dataset \
  --policy.type=act \
  --policy.repo_id=${HF_USER}/my_trained_policy \
  --output_dir=outputs/train/act_multi_gpu \
  --job_name=act_multi_gpu \
  --wandb.enable=true
```
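
If you maintain more than one saved configuration (for example, one per machine), you can point the launcher at a specific file with `--config_file`. The file name below is purely illustrative; by default, accelerate saves its config under `~/.cache/huggingface/accelerate/`:

```bash
# Launch with an explicit accelerate config file instead of the default one
accelerate launch \
  --config_file ~/.cache/huggingface/accelerate/multi_gpu.yaml \
  $(which lerobot-train) \
  --dataset.repo_id=${HF_USER}/my_dataset \
  --policy.type=act \
  --policy.repo_id=${HF_USER}/my_trained_policy \
  --output_dir=outputs/train/act_multi_gpu \
  --job_name=act_multi_gpu \
  --wandb.enable=true
```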

## How It Works

When you launch training with accelerate:

1. **Automatic detection**: LeRobot automatically detects if it's running under accelerate
2. **Data distribution**: Your batch is automatically split across GPUs
3. **Gradient synchronization**: Gradients are synchronized across GPUs during backpropagation
4. **Single process logging**: Only the main process logs to wandb and saves checkpoints
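
To see the per-process setup for yourself, you can launch a trivial command and print the rank information the launcher injects into each process (a sketch assuming a machine with at least two GPUs; these environment variables come from the underlying torch.distributed launcher, not from LeRobot):

```bash
# Each launched process receives its own rank/world-size environment variables
accelerate launch --multi_gpu --num_processes=2 --no_python \
  bash -c 'echo "rank=$RANK local_rank=$LOCAL_RANK world_size=$WORLD_SIZE"'
```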

## Learning Rate and Training Steps Scaling

**Important:** LeRobot does **NOT** automatically scale learning rates or training steps based on the number of GPUs. This gives you full control over your training hyperparameters.

### Why No Automatic Scaling?

Many distributed training frameworks automatically scale the learning rate by the number of GPUs (e.g., `lr = base_lr × num_gpus`). However, LeRobot keeps the learning rate exactly as you specify it.

### When and How to Scale

If you want to scale your hyperparameters when using multiple GPUs, you should do it manually:

**Learning Rate Scaling:**

```bash
# Example: 2 GPUs with linear LR scaling
# Base LR: 1e-4, with 2 GPUs -> 2e-4
accelerate launch --num_processes=2 $(which lerobot-train) \
  --optimizer.lr=2e-4 \
  --dataset.repo_id=lerobot/pusht \
  --policy.type=act
```

**Training Steps Scaling:**

Since the effective batch size increases with multiple GPUs (`batch_size × num_gpus`), you may want to reduce the number of training steps proportionally:

```bash
# Example: 2 GPUs with an effective batch size 2x larger
# Original: batch_size=8, steps=100000
# With 2 GPUs: batch_size=8 (16 in total), steps=50000
accelerate launch --num_processes=2 $(which lerobot-train) \
  --batch_size=8 \
  --steps=50000 \
  --dataset.repo_id=lerobot/pusht \
  --policy.type=act
```
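
If you prefer to keep this arithmetic in one place, a small script like the following can compute the scaled values before launching (illustrative numbers only):

```bash
# Compute the effective batch size and proportionally scaled steps, then launch
NUM_GPUS=2
BATCH_SIZE=8
BASE_STEPS=100000

EFFECTIVE_BATCH=$((BATCH_SIZE * NUM_GPUS))  # 16
SCALED_STEPS=$((BASE_STEPS / NUM_GPUS))     # 50000
echo "Effective batch size: ${EFFECTIVE_BATCH}, steps: ${SCALED_STEPS}"

accelerate launch --num_processes=${NUM_GPUS} $(which lerobot-train) \
  --batch_size=${BATCH_SIZE} \
  --steps=${SCALED_STEPS} \
  --dataset.repo_id=lerobot/pusht \
  --policy.type=act
```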

## Notes

- The `--policy.use_amp` flag in `lerobot-train` is only used when **not** running with accelerate. When using accelerate, mixed precision is controlled by accelerate's configuration.
- Training logs, checkpoints, and hub uploads are only done by the main process to avoid conflicts. Non-main processes have console logging disabled to prevent duplicate output.
- The effective batch size is `batch_size × num_gpus`. If you use 4 GPUs with `--batch_size=8`, your effective batch size is 32.
- Learning rate scheduling is handled correctly across multiple processes: LeRobot sets `step_scheduler_with_optimizer=False` to prevent accelerate from adjusting scheduler steps based on the number of processes.
- When saving or pushing models, LeRobot automatically unwraps the model from accelerate's distributed wrapper to ensure compatibility.
- WandB integration automatically initializes only on the main process, preventing multiple runs from being created.

For more advanced configurations and troubleshooting, see the [Accelerate documentation](https://huggingface.co/docs/accelerate). If you want to learn more about training on a large number of GPUs, check out this awesome guide: [Ultrascale Playbook](https://huggingface.co/spaces/nanotron/ultrascale-playbook).