Add offline inference to visualize_dataset

Use training.eval_freq=-1 for real-world data training + Add diffusion_real.yaml
Refactor make_env
2024-05-29 15:33:52 +00:00 · 2024-05-29 15:30:43 +00:00 · 2024-05-29 15:26:56 +00:00 · 2024-05-29 15:20:44 +00:00 · 2024-05-29 11:09:44 +00:00 · 2024-05-29 11:09:16 +00:00
649 changed files with 5038 additions and 1807 deletions
--- a/.gitattributes
+++ b/.gitattributes
@@ -1,2 +1,6 @@
 *.memmap filter=lfs diff=lfs merge=lfs -text
 *.stl filter=lfs diff=lfs merge=lfs -text
+*.safetensors filter=lfs diff=lfs merge=lfs -text
+*.mp4 filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.json filter=lfs diff=lfs merge=lfs -text
--- a/.github/workflows/test.yml
+++ b/.github/workflows/test.yml
@@ -29,6 +29,8 @@ jobs:
      MUJOCO_GL: egl
    steps:
      - uses: actions/checkout@v4
+        with:
+          lfs: true  # Ensure LFS files are pulled

      - name: Install EGL
        run: sudo apt-get update && sudo apt-get install -y libegl1-mesa-dev
@@ -57,6 +59,40 @@ jobs:
            && rm -rf tests/outputs outputs


+  pytest-minimal:
+    name: Pytest (minimal install)
+    runs-on: ubuntu-latest
+    env:
+      DATA_DIR: tests/data
+      MUJOCO_GL: egl
+    steps:
+      - uses: actions/checkout@v4
+        with:
+          lfs: true  # Ensure LFS files are pulled
+
+      - name: Install poetry
+        run: |
+          pipx install poetry && poetry config virtualenvs.in-project true
+          echo "${{ github.workspace }}/.venv/bin" >> $GITHUB_PATH
+
+      - name: Set up Python 3.10
+        uses: actions/setup-python@v5
+        with:
+          python-version: "3.10"
+
+      - name: Install poetry dependencies
+        run: |
+          poetry install --extras "test"
+
+      - name: Test with pytest
+        run: |
+          pytest tests -v --cov=./lerobot --durations=0 \
+            -W ignore::DeprecationWarning:imageio_ffmpeg._utils:7 \
+            -W ignore::UserWarning:torch.utils.data.dataloader:558 \
+            -W ignore::UserWarning:gymnasium.utils.env_checker:247 \
+            && rm -rf tests/outputs outputs
+
+
  end-to-end:
    name: End-to-end
    runs-on: ubuntu-latest
@@ -65,6 +101,8 @@ jobs:
      MUJOCO_GL: egl
    steps:
      - uses: actions/checkout@v4
+        with:
+          lfs: true  # Ensure LFS files are pulled

      - name: Install EGL
        run: sudo apt-get update && sudo apt-get install -y libegl1-mesa-dev
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -139,13 +139,11 @@ Follow these steps to start contributing:

   To develop on 🤗 LeRobot, you will at least need to install the `dev` and `test` extras dependencies along with the core library:
   ```bash
-   pip install poetry
   poetry install --sync --extras "dev test"
   ```

   You can also install the project with all its dependencies (including environments):
   ```bash
-   pip install poetry
   poetry install --sync --all-extras
   ```

@@ -197,6 +195,11 @@ Follow these steps to start contributing:
   git commit
   ```

+   Note, if you already commited some changes that have a wrong formatting, you can use:
+   ```bash
+   pre-commit run --all-files
+   ```
+
   Please write [good commit messages](https://chris.beams.io/posts/git-commit/).

   It is a good idea to sync your copy of the code with the original
--- a/7
+++ b/7
@@ -22,9 +22,8 @@ test-end-to-end:
 	${MAKE} test-act-ete-eval
 	${MAKE} test-diffusion-ete-train
 	${MAKE} test-diffusion-ete-eval
-	# TODO(rcadene, alexander-soare): enable end-to-end tests for tdmpc
-	# ${MAKE} test-tdmpc-ete-train
-	# ${MAKE} test-tdmpc-ete-eval
+	${MAKE} test-tdmpc-ete-train
+	${MAKE} test-tdmpc-ete-eval
 	${MAKE} test-default-ete-eval

 test-act-ete-train:
@@ -80,7 +79,7 @@ test-tdmpc-ete-train:
 		policy=tdmpc \
 		env=xarm \
 		env.task=XarmLift-v0 \
-		dataset_repo_id=lerobot/xarm_lift_medium_replay \
+		dataset_repo_id=lerobot/xarm_lift_medium \
 		wandb.enable=False \
 		training.offline_steps=2 \
 		training.online_steps=2 \
--- a/README.md
+++ b/README.md
@@ -57,7 +57,6 @@
 - Thanks to Tony Zaho, Zipeng Fu and colleagues for open sourcing ACT policy, ALOHA environments and datasets. Ours are adapted from [ALOHA](https://tonyzhaozh.github.io/aloha) and [Mobile ALOHA](https://mobile-aloha.github.io).
 - Thanks to Cheng Chi, Zhenjia Xu and colleagues for open sourcing Diffusion policy, Pusht environment and datasets, as well as UMI datasets. Ours are adapted from [Diffusion Policy](https://diffusion-policy.cs.columbia.edu) and [UMI Gripper](https://umi-gripper.github.io).
 - Thanks to Nicklas Hansen, Yunhai Feng and colleagues for open sourcing TDMPC policy, Simxarm environments and datasets. Ours are adapted from [TDMPC](https://github.com/nicklashansen/tdmpc) and [FOWM](https://www.yunhaifeng.com/FOWM).
- Thanks to Vincent Moens and colleagues for open sourcing [TorchRL](https://github.com/pytorch/rl). It allowed for quick experimentations on the design of `LeRobot`.
 - Thanks to Antonio Loquercio and Ashish Kumar for their early support.


@@ -93,6 +92,8 @@ To use [Weights and Biases](https://docs.wandb.ai/quickstart) for experiment tra
 wandb login
 ```

+(note: you will also need to enable WandB in the configuration. See below.)
+
 ## Walkthrough

 ```
@@ -159,13 +160,13 @@ See `python lerobot/scripts/eval.py --help` for more instructions.

 Check out [example 3](./examples/3_train_policy.py) that illustrates how to start training a model.

-In general, you can use our training script to easily train any policy. To use wandb for logging training and evaluation curves, make sure you ran `wandb login`. Here is an example of training the ACT policy on trajectories collected by humans on the Aloha simulation environment for the insertion task:
+In general, you can use our training script to easily train any policy. Here is an example of training the ACT policy on trajectories collected by humans on the Aloha simulation environment for the insertion task:
 ```bash
 python lerobot/scripts/train.py \
    policy=act \
    env=aloha \
    env.task=AlohaInsertion-v0 \
-    dataset_repo_id=lerobot/aloha_sim_insertion_human
+    dataset_repo_id=lerobot/aloha_sim_insertion_human \
 ```

 The experiment directory is automatically generated and will show up in yellow in your terminal. It looks like `outputs/train/2024-05-05/20-21-12_aloha_act_default`. You can manually specify an experiment directory by adding this argument to the `train.py` python command:
@@ -173,17 +174,17 @@ The experiment directory is automatically generated and will show up in yellow i
    hydra.run.dir=your/new/experiment/dir
 ```

-A link to the wandb logs for the run will also show up in yellow in your terminal. Here is an example of logs from wandb:
-![](media/wandb.png)
+To use wandb for logging training and evaluation curves, make sure you've run `wandb login` as a one-time setup step. Then, when running the training command above, enable WandB in the configuration by adding:

-You can deactivate wandb by adding these arguments to the `train.py` python command:
 ```bash
-    wandb.disable_artifact=true \
-    wandb.enable=false
+    wandb.enable=true
 ```

-Note: For efficiency, during training every checkpoint is evaluated on a low number of episodes. After training, you may want to re-evaluate your best checkpoints on more episodes or change the evaluation settings. See `python lerobot/scripts/eval.py --help` for more instructions.
+A link to the wandb logs for the run will also show up in yellow in your terminal. Here is an example of what they look like in your browser:

+![](media/wandb.png)
+
+Note: For efficiency, during training every checkpoint is evaluated on a low number of episodes. After training, you may want to re-evaluate your best checkpoints on more episodes or change the evaluation settings. See `python lerobot/scripts/eval.py --help` for more instructions.

 ## Contribute

@@ -196,11 +197,11 @@ To add a dataset to the hub, you need to login using a write-access token, which
 huggingface-cli login --token ${HUGGINGFACE_TOKEN} --add-to-git-credential
 ```

-Then move your dataset folder in `data` directory (e.g. `data/aloha_ping_pong`), and push your dataset to the hub with:
+Then move your dataset folder in `data` directory (e.g. `data/aloha_static_pingpong_test`), and push your dataset to the hub with:
 ```bash
 python lerobot/scripts/push_dataset_to_hub.py \
 --data-dir data \
--dataset-id aloha_ping_ping \
+--dataset-id aloha_static_pingpong_test \
 --raw-format aloha_hdf5 \
 --community-id lerobot
 ```
--- a/docker/lerobot-gpu/Dockerfile
+++ b/docker/lerobot-gpu/Dockerfile
@@ -7,6 +7,11 @@ ARG DEBIAN_FRONTEND=noninteractive
 # Install apt dependencies
 RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential cmake \
+    git git-lfs openssh-client \
+    nano vim ffmpeg \
+    htop atop nvtop \
+    sed gawk grep curl wget \
+    tcpdump sysstat screen \
    libglib2.0-0 libgl1-mesa-glx libegl1-mesa \
    python${PYTHON_VERSION} python${PYTHON_VERSION}-venv \
    && apt-get clean && rm -rf /var/lib/apt/lists/*
@@ -18,7 +23,8 @@ ENV PATH="/opt/venv/bin:$PATH"
 RUN echo "source /opt/venv/bin/activate" >> /root/.bashrc

 # Install LeRobot
-COPY . /lerobot
+RUN git lfs install
+RUN git clone https://github.com/huggingface/lerobot.git
 WORKDIR /lerobot
 RUN pip install --upgrade --no-cache-dir pip
 RUN pip install --no-cache-dir ".[test, aloha, xarm, pusht]"
--- a/examples/4_calculate_validation_loss.py
+++ b/examples/4_calculate_validation_loss.py
@@ -0,0 +1,90 @@
+"""This script demonstrates how to slice a dataset and calculate the loss on a subset of the data.
+
+This technique can be useful for debugging and testing purposes, as well as identifying whether a policy
+is learning effectively.
+
+Furthermore, relying on validation loss to evaluate performance is generally not considered a good practice,
+especially in the context of imitation learning. The most reliable approach is to evaluate the policy directly
+on the target environment, whether that be in simulation or the real world.
+"""
+
+import math
+from pathlib import Path
+
+import torch
+from huggingface_hub import snapshot_download
+
+from lerobot.common.datasets.lerobot_dataset import LeRobotDataset
+from lerobot.common.policies.diffusion.modeling_diffusion import DiffusionPolicy
+
+device = torch.device("cuda")
+
+# Download the diffusion policy for pusht environment
+pretrained_policy_path = Path(snapshot_download("lerobot/diffusion_pusht"))
+# OR uncomment the following to evaluate a policy from the local outputs/train folder.
+# pretrained_policy_path = Path("outputs/train/example_pusht_diffusion")
+
+policy = DiffusionPolicy.from_pretrained(pretrained_policy_path)
+policy.eval()
+policy.to(device)
+
+# Set up the dataset.
+delta_timestamps = {
+    # Load the previous image and state at -0.1 seconds before current frame,
+    # then load current image and state corresponding to 0.0 second.
+    "observation.image": [-0.1, 0.0],
+    "observation.state": [-0.1, 0.0],
+    # Load the previous action (-0.1), the next action to be executed (0.0),
+    # and 14 future actions with a 0.1 seconds spacing. All these actions will be
+    # used to calculate the loss.
+    "action": [-0.1, 0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4],
+}
+
+# Load the last 10% of episodes of the dataset as a validation set.
+# - Load full dataset
+full_dataset = LeRobotDataset("lerobot/pusht", split="train")
+# - Calculate train and val subsets
+num_train_episodes = math.floor(full_dataset.num_episodes * 90 / 100)
+num_val_episodes = full_dataset.num_episodes - num_train_episodes
+print(f"Number of episodes in full dataset: {full_dataset.num_episodes}")
+print(f"Number of episodes in training dataset (90% subset): {num_train_episodes}")
+print(f"Number of episodes in validation dataset (10% subset): {num_val_episodes}")
+# - Get first frame index of the validation set
+first_val_frame_index = full_dataset.episode_data_index["from"][num_train_episodes].item()
+# - Load frames subset belonging to validation set using the `split` argument.
+#   It utilizes the `datasets` library's syntax for slicing datasets.
+#   For more information on the Slice API, please see:
+#   https://huggingface.co/docs/datasets/v2.19.0/loading#slice-splits
+train_dataset = LeRobotDataset(
+    "lerobot/pusht", split=f"train[:{first_val_frame_index}]", delta_timestamps=delta_timestamps
+)
+val_dataset = LeRobotDataset(
+    "lerobot/pusht", split=f"train[{first_val_frame_index}:]", delta_timestamps=delta_timestamps
+)
+print(f"Number of frames in training dataset (90% subset): {len(train_dataset)}")
+print(f"Number of frames in validation dataset (10% subset): {len(val_dataset)}")
+
+# Create dataloader for evaluation.
+val_dataloader = torch.utils.data.DataLoader(
+    val_dataset,
+    num_workers=4,
+    batch_size=64,
+    shuffle=False,
+    pin_memory=device != torch.device("cpu"),
+    drop_last=False,
+)
+
+# Run validation loop.
+loss_cumsum = 0
+n_examples_evaluated = 0
+for batch in val_dataloader:
+    batch = {k: v.to(device, non_blocking=True) for k, v in batch.items()}
+    output_dict = policy.forward(batch)
+
+    loss_cumsum += output_dict["loss"].item()
+    n_examples_evaluated += batch["index"].shape[0]
+
+# Calculate the average loss over the validation set.
+average_loss = loss_cumsum / n_examples_evaluated
+
+print(f"Average loss on validation set: {average_loss:.4f}")
--- a/lerobot/init.py
+++ b/lerobot/init.py
@@ -1,3 +1,18 @@
+#!/usr/bin/env python
+
+# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
 """
 This file contains lists of available environments, dataset and policies to reflect the current state of LeRobot library.
 We do not want to import all the dependencies, but instead we keep it lightweight to ensure fast access to these variables.
@@ -46,13 +61,21 @@ available_datasets_per_env = {
        "lerobot/aloha_sim_insertion_scripted",
        "lerobot/aloha_sim_transfer_cube_human",
        "lerobot/aloha_sim_transfer_cube_scripted",
+        "lerobot/aloha_sim_insertion_human_image",
+        "lerobot/aloha_sim_insertion_scripted_image",
+        "lerobot/aloha_sim_transfer_cube_human_image",
+        "lerobot/aloha_sim_transfer_cube_scripted_image",
    ],
-    "pusht": ["lerobot/pusht"],
+    "pusht": ["lerobot/pusht", "lerobot/pusht_image"],
    "xarm": [
        "lerobot/xarm_lift_medium",
        "lerobot/xarm_lift_medium_replay",
        "lerobot/xarm_push_medium",
        "lerobot/xarm_push_medium_replay",
+        "lerobot/xarm_lift_medium_image",
+        "lerobot/xarm_lift_medium_replay_image",
+        "lerobot/xarm_push_medium_image",
+        "lerobot/xarm_push_medium_replay_image",
    ],
 }

--- a/lerobot/version.py
+++ b/lerobot/version.py
@@ -1,3 +1,18 @@
+#!/usr/bin/env python
+
+# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
 """To enable `lerobot.__version__`"""

 from importlib.metadata import PackageNotFoundError, version
--- a/lerobot/common/datasets/_video_benchmark/run_video_benchmark.py
+++ b/lerobot/common/datasets/_video_benchmark/run_video_benchmark.py
@@ -1,3 +1,18 @@
+#!/usr/bin/env python
+
+# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
 import json
 import random
 import shutil
--- a/lerobot/common/datasets/factory.py
+++ b/lerobot/common/datasets/factory.py
@@ -1,3 +1,18 @@
+#!/usr/bin/env python
+
+# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
 import logging

 import torch
--- a/lerobot/common/datasets/lerobot_dataset.py
+++ b/lerobot/common/datasets/lerobot_dataset.py
@@ -1,3 +1,18 @@
+#!/usr/bin/env python
+
+# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
 import os
 from pathlib import Path

@@ -5,17 +20,19 @@ import datasets
 import torch

 from lerobot.common.datasets.utils import (
+    calculate_episode_data_index,
    load_episode_data_index,
    load_hf_dataset,
    load_info,
    load_previous_and_future_frames,
    load_stats,
    load_videos,
+    reset_episode_index,
 )
 from lerobot.common.datasets.video_utils import VideoFrame, load_from_videos

 DATA_DIR = Path(os.environ["DATA_DIR"]) if "DATA_DIR" in os.environ else None
-CODEBASE_VERSION = "v1.3"
+CODEBASE_VERSION = "v1.4"


 class LeRobotDataset(torch.utils.data.Dataset):
@@ -39,7 +56,11 @@ class LeRobotDataset(torch.utils.data.Dataset):
        # TODO(rcadene, aliberts): implement faster transfer
        # https://huggingface.co/docs/huggingface_hub/en/guides/download#faster-downloads
        self.hf_dataset = load_hf_dataset(repo_id, version, root, split)
-        self.episode_data_index = load_episode_data_index(repo_id, version, root)
+        if split == "train":
+            self.episode_data_index = load_episode_data_index(repo_id, version, root)
+        else:
+            self.episode_data_index = calculate_episode_data_index(self.hf_dataset)
+            self.hf_dataset = reset_episode_index(self.hf_dataset)
        self.stats = load_stats(repo_id, version, root)
        self.info = load_info(repo_id, version, root)
        if self.video:
--- a/lerobot/common/datasets/push_dataset_to_hub/_diffusion_policy_replay_buffer.py
+++ b/lerobot/common/datasets/push_dataset_to_hub/_diffusion_policy_replay_buffer.py
@@ -1,3 +1,18 @@
+#!/usr/bin/env python
+
+# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
 """Helper code for loading PushT dataset from Diffusion Policy (https://diffusion-policy.cs.columbia.edu/)

 Copied from the original Diffusion Policy repository and used in our `download_and_upload_dataset.py` script.
--- a/lerobot/common/datasets/push_dataset_to_hub/_download_raw.py
+++ b/lerobot/common/datasets/push_dataset_to_hub/_download_raw.py
@@ -1,3 +1,18 @@
+#!/usr/bin/env python
+
+# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
 """
 This file contains all obsolete download scripts. They are centralized here to not have to load
 useless dependencies when using datasets.
@@ -9,17 +24,16 @@ import shutil
 from pathlib import Path

 import tqdm
-
-ALOHA_RAW_URLS_DIR = "lerobot/common/datasets/push_dataset_to_hub/_aloha_raw_urls"
+from huggingface_hub import snapshot_download


 def download_raw(raw_dir, dataset_id):
-    if "pusht" in dataset_id:
+    if "aloha" in dataset_id or "image" in dataset_id:
+        download_hub(raw_dir, dataset_id)
+    elif "pusht" in dataset_id:
        download_pusht(raw_dir)
    elif "xarm" in dataset_id:
        download_xarm(raw_dir)
-    elif "aloha" in dataset_id:
-        download_aloha(raw_dir, dataset_id)
    elif "umi" in dataset_id:
        download_umi(raw_dir)
    else:
@@ -88,37 +102,13 @@ def download_xarm(raw_dir: Path):
    zip_path.unlink()


-def download_aloha(raw_dir: Path, dataset_id: str):
-    import gdown
-
-    subset_id = dataset_id.replace("aloha_", "")
-    urls_path = Path(ALOHA_RAW_URLS_DIR) / f"{subset_id}.txt"
-    assert urls_path.exists(), f"{subset_id}.txt not found in '{ALOHA_RAW_URLS_DIR}' directory."
-
-    with open(urls_path) as f:
-        # strip lines and ignore empty lines
-        urls = [url.strip() for url in f if url.strip()]
-
-    # sanity check
-    for url in urls:
-        assert (
-            "drive.google.com/drive/folders" in url or "drive.google.com/file" in url
-        ), f"Wrong url provided '{url}' in file '{urls_path}'."
-
+def download_hub(raw_dir: Path, dataset_id: str):
    raw_dir = Path(raw_dir)
    raw_dir.mkdir(parents=True, exist_ok=True)

-    logging.info(f"Start downloading from google drive for {dataset_id}")
-    for url in urls:
-        if "drive.google.com/drive/folders" in url:
-            # when a folder url is given, download up to 50 files from the folder
-            gdown.download_folder(url, output=str(raw_dir), remaining_ok=True)
-
-        elif "drive.google.com/file" in url:
-            # because of the 50 files limit per folder, we download the remaining files (file by file)
-            gdown.download(url, output=str(raw_dir), fuzzy=True)
-
-    logging.info(f"End downloading from google drive for {dataset_id}")
+    logging.info(f"Start downloading from huggingface.co/cadene for {dataset_id}")
+    snapshot_download(f"cadene/{dataset_id}_raw", repo_type="dataset", local_dir=raw_dir)
+    logging.info(f"Finish downloading from huggingface.co/cadene for {dataset_id}")


 def download_umi(raw_dir: Path):
@@ -133,21 +123,30 @@ def download_umi(raw_dir: Path):
 if __name__ == "__main__":
    data_dir = Path("data")
    dataset_ids = [
+        "pusht_image",
+        "xarm_lift_medium_image",
+        "xarm_lift_medium_replay_image",
+        "xarm_push_medium_image",
+        "xarm_push_medium_replay_image",
+        "aloha_sim_insertion_human_image",
+        "aloha_sim_insertion_scripted_image",
+        "aloha_sim_transfer_cube_human_image",
+        "aloha_sim_transfer_cube_scripted_image",
        "pusht",
        "xarm_lift_medium",
        "xarm_lift_medium_replay",
        "xarm_push_medium",
        "xarm_push_medium_replay",
+        "aloha_sim_insertion_human",
+        "aloha_sim_insertion_scripted",
+        "aloha_sim_transfer_cube_human",
+        "aloha_sim_transfer_cube_scripted",
        "aloha_mobile_cabinet",
        "aloha_mobile_chair",
        "aloha_mobile_elevator",
        "aloha_mobile_shrimp",
        "aloha_mobile_wash_pan",
        "aloha_mobile_wipe_wine",
-        "aloha_sim_insertion_human",
-        "aloha_sim_insertion_scripted",
-        "aloha_sim_transfer_cube_human",
-        "aloha_sim_transfer_cube_scripted",
        "aloha_static_battery",
        "aloha_static_candy",
        "aloha_static_coffee",
--- a/lerobot/common/datasets/push_dataset_to_hub/_umi_imagecodecs_numcodecs.py
+++ b/lerobot/common/datasets/push_dataset_to_hub/_umi_imagecodecs_numcodecs.py
@@ -1,3 +1,18 @@
+#!/usr/bin/env python
+
+# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
 # imagecodecs/numcodecs.py

 # Copyright (c) 2021-2022, Christoph Gohlke
--- a/lerobot/common/datasets/push_dataset_to_hub/aloha_dora_format.py
+++ b/lerobot/common/datasets/push_dataset_to_hub/aloha_dora_format.py
@@ -0,0 +1,215 @@
+#!/usr/bin/env python
+
+# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+Contains utilities to process raw data format from dora-record
+"""
+
+import logging
+import re
+from pathlib import Path
+
+import pandas as pd
+import torch
+from datasets import Dataset, Features, Image, Sequence, Value
+
+from lerobot.common.datasets.utils import (
+    hf_transform_to_torch,
+)
+from lerobot.common.datasets.video_utils import VideoFrame
+from lerobot.common.utils.utils import init_logging
+
+
+def check_format(raw_dir) -> bool:
+    assert raw_dir.exists()
+
+    leader_file = list(raw_dir.glob("*.parquet"))
+    if len(leader_file) == 0:
+        raise ValueError(f"Missing parquet files in '{raw_dir}'")
+    return True
+
+
+def load_from_raw(raw_dir: Path, out_dir: Path, fps: int):
+    # Load data stream that will be used as reference for the timestamps synchronization
+    reference_key = "observation.images.cam_right_wrist"
+    reference_df = pd.read_parquet(raw_dir / f"{reference_key}.parquet")
+    reference_df = reference_df[["timestamp_utc", reference_key]]
+
+    # Merge all data stream using nearest backward strategy
+    df = reference_df
+    for path in raw_dir.glob("*.parquet"):
+        key = path.stem  # action or observation.state or ...
+        if key == reference_key or "failed_episode_index" in key:
+            continue
+        modality_df = pd.read_parquet(path)
+        modality_df = modality_df[["timestamp_utc", key]]
+        df = pd.merge_asof(
+            df,
+            modality_df,
+            on="timestamp_utc",
+            direction="nearest",
+            tolerance=pd.Timedelta(f"{1/fps} seconds"),
+        )
+
+    # Remove rows with episode_index -1 which indicates data that correspond to in-between episodes
+    df = df[df["episode_index"] != -1]
+
+    image_keys = [key for key in df if "observation.images." in key]
+
+    def get_episode_index(row):
+        episode_index_per_cam = {}
+        for key in image_keys:
+            path = row[key][0]["path"]
+            match = re.search(r"_(\d{6}).mp4", path)
+            if not match:
+                raise ValueError(path)
+            episode_index = int(match.group(1))
+            episode_index_per_cam[key] = episode_index
+        assert (
+            len(set(episode_index_per_cam.values())) == 1
+        ), f"All cameras are expected to belong to the same episode, but getting {episode_index_per_cam}"
+        return episode_index
+
+    df["episode_index"] = df.apply(get_episode_index, axis=1)
+
+    # dora only use arrays, so single values are encapsulated into a list
+    df["frame_index"] = df.groupby("episode_index").cumcount()
+    df = df.reset_index()
+    df["index"] = df.index
+
+    # set 'next.done' to True for the last frame of each episode
+    df["next.done"] = False
+    df.loc[df.groupby("episode_index").tail(1).index, "next.done"] = True
+
+    df["timestamp"] = df["timestamp_utc"].map(lambda x: x.timestamp())
+    # each episode starts with timestamp 0 to match the ones from the video
+    df["timestamp"] = df.groupby("episode_index")["timestamp"].transform(lambda x: x - x.iloc[0])
+
+    del df["timestamp_utc"]
+
+    # sanity check
+    has_nan = df.isna().any().any()
+    assert not has_nan
+
+    # sanity check episode indices go from 0 to n-1
+    ep_ids = [ep_idx for ep_idx, _ in df.groupby("episode_index")]
+    expected_ep_ids = list(range(df["episode_index"].max() + 1))
+    assert ep_ids == expected_ep_ids, f"Episodes indices go from {ep_ids} instead of {expected_ep_ids}"
+
+    # Create symlink to raw videos directory (that needs to be absolute not relative)
+    out_dir.mkdir(parents=True, exist_ok=True)
+    videos_dir = out_dir / "videos"
+    videos_dir.symlink_to((raw_dir / "videos").absolute())
+
+    # sanity check the video paths are well formated
+    for key in df:
+        if "observation.images." not in key:
+            continue
+        for ep_idx in ep_ids:
+            video_path = videos_dir / f"{key}_episode_{ep_idx:06d}.mp4"
+            assert video_path.exists(), f"Video file not found in {video_path}"
+
+    data_dict = {}
+    for key in df:
+        # is video frame
+        if "observation.images." in key:
+            # we need `[0] because dora only use arrays, so single values are encapsulated into a list.
+            # it is the case for video_frame dictionary = [{"path": ..., "timestamp": ...}]
+            data_dict[key] = [video_frame[0] for video_frame in df[key].values]
+
+            # sanity check the video path is well formated
+            video_path = videos_dir.parent / data_dict[key][0]["path"]
+            assert video_path.exists(), f"Video file not found in {video_path}"
+        # is number
+        elif df[key].iloc[0].ndim == 0 or df[key].iloc[0].shape[0] == 1:
+            data_dict[key] = torch.from_numpy(df[key].values)
+        # is vector
+        elif df[key].iloc[0].shape[0] > 1:
+            data_dict[key] = torch.stack([torch.from_numpy(x.copy()) for x in df[key].values])
+        else:
+            raise ValueError(key)
+
+    # Get the episode index containing for each unique episode index
+    first_ep_index_df = df.groupby("episode_index").agg(start_index=("index", "first")).reset_index()
+    from_ = first_ep_index_df["start_index"].tolist()
+    to_ = from_[1:] + [len(df)]
+    episode_data_index = {
+        "from": from_,
+        "to": to_,
+    }
+
+    return data_dict, episode_data_index
+
+
+def to_hf_dataset(data_dict, video) -> Dataset:
+    features = {}
+
+    keys = [key for key in data_dict if "observation.images." in key]
+    for key in keys:
+        if video:
+            features[key] = VideoFrame()
+        else:
+            features[key] = Image()
+
+    features["observation.state"] = Sequence(
+        length=data_dict["observation.state"].shape[1], feature=Value(dtype="float32", id=None)
+    )
+    if "observation.velocity" in data_dict:
+        features["observation.velocity"] = Sequence(
+            length=data_dict["observation.velocity"].shape[1], feature=Value(dtype="float32", id=None)
+        )
+    if "observation.effort" in data_dict:
+        features["observation.effort"] = Sequence(
+            length=data_dict["observation.effort"].shape[1], feature=Value(dtype="float32", id=None)
+        )
+    features["action"] = Sequence(
+        length=data_dict["action"].shape[1], feature=Value(dtype="float32", id=None)
+    )
+    features["episode_index"] = Value(dtype="int64", id=None)
+    features["frame_index"] = Value(dtype="int64", id=None)
+    features["timestamp"] = Value(dtype="float32", id=None)
+    features["next.done"] = Value(dtype="bool", id=None)
+    features["index"] = Value(dtype="int64", id=None)
+
+    hf_dataset = Dataset.from_dict(data_dict, features=Features(features))
+    hf_dataset.set_transform(hf_transform_to_torch)
+    return hf_dataset
+
+
+def from_raw_to_lerobot_format(raw_dir: Path, out_dir: Path, fps=None, video=True, debug=False):
+    init_logging()
+
+    if debug:
+        logging.warning("debug=True not implemented. Falling back to debug=False.")
+
+    # sanity check
+    check_format(raw_dir)
+
+    if fps is None:
+        fps = 30
+    else:
+        raise NotImplementedError()
+
+    if not video:
+        raise NotImplementedError()
+
+    data_df, episode_data_index = load_from_raw(raw_dir, out_dir, fps)
+    hf_dataset = to_hf_dataset(data_df, video)
+
+    info = {
+        "fps": fps,
+        "video": video,
+    }
+    return hf_dataset, episode_data_index, info
--- a/lerobot/common/datasets/push_dataset_to_hub/aloha_hdf5_format.py
+++ b/lerobot/common/datasets/push_dataset_to_hub/aloha_hdf5_format.py
@@ -1,8 +1,23 @@
+#!/usr/bin/env python
+
+# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
 """
 Contains utilities to process raw data format of HDF5 files like in: https://github.com/tonyzhaozh/act
 """

-import re
+import gc
 import shutil
 from pathlib import Path

@@ -64,10 +79,8 @@ def load_from_raw(raw_dir, out_dir, fps, video, debug):
    episode_data_index = {"from": [], "to": []}

    id_from = 0
-
-    for ep_path in tqdm.tqdm(hdf5_files, total=len(hdf5_files)):
+    for ep_idx, ep_path in tqdm.tqdm(enumerate(hdf5_files), total=len(hdf5_files)):
        with h5py.File(ep_path, "r") as ep:
-            ep_idx = int(re.search(r"episode_(\d+)", ep_path.name).group(1))
            num_frames = ep["/action"].shape[0]

            # last step of demonstration is considered done
@@ -76,6 +89,10 @@ def load_from_raw(raw_dir, out_dir, fps, video, debug):

            state = torch.from_numpy(ep["/observations/qpos"][:])
            action = torch.from_numpy(ep["/action"][:])
+            if "/observations/qvel" in ep:
+                velocity = torch.from_numpy(ep["/observations/qvel"][:])
+            if "/observations/effort" in ep:
+                effort = torch.from_numpy(ep["/observations/effort"][:])

            ep_dict = {}

@@ -116,6 +133,10 @@ def load_from_raw(raw_dir, out_dir, fps, video, debug):
                    ep_dict[img_key] = [PILImage.fromarray(x) for x in imgs_array]

            ep_dict["observation.state"] = state
+            if "/observations/velocity" in ep:
+                ep_dict["observation.velocity"] = velocity
+            if "/observations/effort" in ep:
+                ep_dict["observation.effort"] = effort
            ep_dict["action"] = action
            ep_dict["episode_index"] = torch.tensor([ep_idx] * num_frames)
            ep_dict["frame_index"] = torch.arange(0, num_frames, 1)
@@ -131,6 +152,8 @@ def load_from_raw(raw_dir, out_dir, fps, video, debug):

        id_from += num_frames

+        gc.collect()
+
        # process first episode only
        if debug:
            break
@@ -152,6 +175,14 @@ def to_hf_dataset(data_dict, video) -> Dataset:
    features["observation.state"] = Sequence(
        length=data_dict["observation.state"].shape[1], feature=Value(dtype="float32", id=None)
    )
+    if "observation.velocity" in data_dict:
+        features["observation.velocity"] = Sequence(
+            length=data_dict["observation.velocity"].shape[1], feature=Value(dtype="float32", id=None)
+        )
+    if "observation.effort" in data_dict:
+        features["observation.effort"] = Sequence(
+            length=data_dict["observation.effort"].shape[1], feature=Value(dtype="float32", id=None)
+        )
    features["action"] = Sequence(
        length=data_dict["action"].shape[1], feature=Value(dtype="float32", id=None)
    )
--- a/lerobot/common/datasets/push_dataset_to_hub/compute_stats.py
+++ b/lerobot/common/datasets/push_dataset_to_hub/compute_stats.py
@@ -1,3 +1,18 @@
+#!/usr/bin/env python
+
+# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
 from copy import deepcopy
 from math import ceil

--- a/lerobot/common/datasets/push_dataset_to_hub/pusht_zarr_format.py
+++ b/lerobot/common/datasets/push_dataset_to_hub/pusht_zarr_format.py
@@ -1,3 +1,18 @@
+#!/usr/bin/env python
+
+# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
 """Process zarr files formatted like in: https://github.com/real-stanford/diffusion_policy"""

 import shutil
--- a/lerobot/common/datasets/push_dataset_to_hub/umi_zarr_format.py
+++ b/lerobot/common/datasets/push_dataset_to_hub/umi_zarr_format.py
@@ -1,3 +1,18 @@
+#!/usr/bin/env python
+
+# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
 """Process UMI (Universal Manipulation Interface) data stored in Zarr format like in: https://github.com/real-stanford/universal_manipulation_interface"""

 import logging
--- a/lerobot/common/datasets/push_dataset_to_hub/utils.py
+++ b/lerobot/common/datasets/push_dataset_to_hub/utils.py
@@ -1,3 +1,18 @@
+#!/usr/bin/env python
+
+# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
 from concurrent.futures import ThreadPoolExecutor
 from pathlib import Path

--- a/lerobot/common/datasets/push_dataset_to_hub/xarm_pkl_format.py
+++ b/lerobot/common/datasets/push_dataset_to_hub/xarm_pkl_format.py
@@ -1,3 +1,18 @@
+#!/usr/bin/env python
+
+# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
 """Process pickle files formatted like in: https://github.com/fyhMer/fowm"""

 import pickle
--- a/lerobot/common/datasets/utils.py
+++ b/lerobot/common/datasets/utils.py
@@ -1,5 +1,22 @@
+#!/usr/bin/env python
+
+# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
 import json
+import re
 from pathlib import Path
+from typing import Dict

 import datasets
 import torch
@@ -64,7 +81,23 @@ def hf_transform_to_torch(items_dict):
 def load_hf_dataset(repo_id, version, root, split) -> datasets.Dataset:
    """hf_dataset contains all the observations, states, actions, rewards, etc."""
    if root is not None:
-        hf_dataset = load_from_disk(str(Path(root) / repo_id / split))
+        hf_dataset = load_from_disk(str(Path(root) / repo_id / "train"))
+        # TODO(rcadene): clean this which enables getting a subset of dataset
+        if split != "train":
+            if "%" in split:
+                raise NotImplementedError(f"We dont support splitting based on percentage for now ({split}).")
+            match_from = re.search(r"train\[(\d+):\]", split)
+            match_to = re.search(r"train\[:(\d+)\]", split)
+            if match_from:
+                from_frame_index = int(match_from.group(1))
+                hf_dataset = hf_dataset.select(range(from_frame_index, len(hf_dataset)))
+            elif match_to:
+                to_frame_index = int(match_to.group(1))
+                hf_dataset = hf_dataset.select(range(to_frame_index))
+            else:
+                raise ValueError(
+                    f'`split` ({split}) should either be "train", "train[INT:]", or "train[:INT]"'
+                )
    else:
        hf_dataset = load_dataset(repo_id, revision=version, split=split)
    hf_dataset.set_transform(hf_transform_to_torch)
@@ -230,6 +263,84 @@ def load_previous_and_future_frames(
    return item


+def calculate_episode_data_index(hf_dataset: datasets.Dataset) -> Dict[str, torch.Tensor]:
+    """
+    Calculate episode data index for the provided HuggingFace Dataset. Relies on episode_index column of hf_dataset.
+
+    Parameters:
+    - hf_dataset (datasets.Dataset): A HuggingFace dataset containing the episode index.
+
+    Returns:
+    - episode_data_index: A dictionary containing the data index for each episode. The dictionary has two keys:
+        - "from": A tensor containing the starting index of each episode.
+        - "to": A tensor containing the ending index of each episode.
+    """
+    episode_data_index = {"from": [], "to": []}
+
+    current_episode = None
+    """
+    The episode_index is a list of integers, each representing the episode index of the corresponding example.
+    For instance, the following is a valid episode_index:
+      [0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 2]
+
+    Below, we iterate through the episode_index and populate the episode_data_index dictionary with the starting and
+    ending index of each episode. For the episode_index above, the episode_data_index dictionary will look like this:
+        {
+            "from": [0, 3, 7],
+            "to": [3, 7, 12]
+        }
+    """
+    if len(hf_dataset) == 0:
+        episode_data_index = {
+            "from": torch.tensor([]),
+            "to": torch.tensor([]),
+        }
+        return episode_data_index
+    for idx, episode_idx in enumerate(hf_dataset["episode_index"]):
+        if episode_idx != current_episode:
+            # We encountered a new episode, so we append its starting location to the "from" list
+            episode_data_index["from"].append(idx)
+            # If this is not the first episode, we append the ending location of the previous episode to the "to" list
+            if current_episode is not None:
+                episode_data_index["to"].append(idx)
+            # Let's keep track of the current episode index
+            current_episode = episode_idx
+        else:
+            # We are still in the same episode, so there is nothing for us to do here
+            pass
+    # We have reached the end of the dataset, so we append the ending location of the last episode to the "to" list
+    episode_data_index["to"].append(idx + 1)
+
+    for k in ["from", "to"]:
+        episode_data_index[k] = torch.tensor(episode_data_index[k])
+
+    return episode_data_index
+
+
+def reset_episode_index(hf_dataset: datasets.Dataset) -> datasets.Dataset:
+    """
+    Reset the `episode_index` of the provided HuggingFace Dataset.
+
+    `episode_data_index` (and related functionality such as `load_previous_and_future_frames`) requires the
+    `episode_index` to be sorted, continuous (1,1,1 and not 1,2,1) and start at 0.
+
+    This brings the `episode_index` to the required format.
+    """
+    if len(hf_dataset) == 0:
+        return hf_dataset
+    unique_episode_idxs = torch.stack(hf_dataset["episode_index"]).unique().tolist()
+    episode_idx_to_reset_idx_mapping = {
+        ep_id: reset_ep_id for reset_ep_id, ep_id in enumerate(unique_episode_idxs)
+    }
+
+    def modify_ep_idx_func(example):
+        example["episode_index"] = episode_idx_to_reset_idx_mapping[example["episode_index"].item()]
+        return example
+
+    hf_dataset = hf_dataset.map(modify_ep_idx_func)
+    return hf_dataset
+
+
 def cycle(iterable):
    """The equivalent of itertools.cycle, but safe for Pytorch dataloaders.

--- a/lerobot/common/datasets/video_utils.py
+++ b/lerobot/common/datasets/video_utils.py
@@ -1,3 +1,18 @@
+#!/usr/bin/env python
+
+# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
 import logging
 import subprocess
 import warnings
--- a/lerobot/common/envs/factory.py
+++ b/lerobot/common/envs/factory.py
@@ -1,3 +1,18 @@
+#!/usr/bin/env python
+
+# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
 import importlib

 import gymnasium as gym
@@ -12,14 +27,6 @@ def make_env(cfg: DictConfig, n_envs: int | None = None) -> gym.vector.VectorEnv
    if n_envs is not None and n_envs < 1:
        raise ValueError("`n_envs must be at least 1")

-    kwargs = {
-        "obs_type": "pixels_agent_pos",
-        "render_mode": "rgb_array",
-        "max_episode_steps": cfg.env.episode_length,
-        "visualization_width": 384,
-        "visualization_height": 384,
-    }
-
    package_name = f"gym_{cfg.env.name}"

    try:
@@ -31,12 +38,16 @@ def make_env(cfg: DictConfig, n_envs: int | None = None) -> gym.vector.VectorEnv
        raise e

    gym_handle = f"{package_name}/{cfg.env.task}"
+    gym_kwgs = cfg.env.get("gym", {})
+
+    if cfg.env.get("episode_length"):
+        gym_kwgs["max_episode_steps"] = cfg.env.episode_length

    # batched version of the env that returns an observation of shape (b, c)
    env_cls = gym.vector.AsyncVectorEnv if cfg.eval.use_async_envs else gym.vector.SyncVectorEnv
    env = env_cls(
        [
-            lambda: gym.make(gym_handle, disable_env_checker=True, **kwargs)
+            lambda: gym.make(gym_handle, disable_env_checker=True, **gym_kwgs)
            for _ in range(n_envs if n_envs is not None else cfg.eval.batch_size)
        ]
    )
--- a/lerobot/common/envs/utils.py
+++ b/lerobot/common/envs/utils.py
@@ -1,3 +1,18 @@
+#!/usr/bin/env python
+
+# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
 import einops
 import numpy as np
 import torch
--- a/lerobot/common/logger.py
+++ b/lerobot/common/logger.py
@@ -1,3 +1,18 @@
+#!/usr/bin/env python
+
+# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
 # TODO(rcadene, alexander-soare): clean this file
 """Borrowed from https://github.com/fyhMer/fowm/blob/main/src/logger.py"""

@@ -82,13 +97,9 @@ class Logger:
            # Also save the full Hydra config for the env configuration.
            OmegaConf.save(self._cfg, save_dir / "config.yaml")
            if self._wandb and not self._disable_wandb_artifact:
-                # note wandb artifact does not accept ":" in its name
+                # note wandb artifact does not accept ":" or "/" in its name
                artifact = self._wandb.Artifact(
-                    self._group.replace(":", "_").replace("/", "__")
-                    + "-"
-                    + str(self._seed)
-                    + "-"
-                    + str(identifier),
+                    f"{self._group.replace(':', '_').replace('/', '_')}-{self._seed}-{identifier}",
                    type="model",
                )
                artifact.add_file(save_dir / SAFETENSORS_SINGLE_FILE)
@@ -98,9 +109,10 @@ class Logger:
        self._buffer_dir.mkdir(parents=True, exist_ok=True)
        fp = self._buffer_dir / f"{str(identifier)}.pkl"
        buffer.save(fp)
-        if self._wandb:
+        if self._wandb and not self._disable_wandb_artifact:
+            # note wandb artifact does not accept ":" or "/" in its name
            artifact = self._wandb.Artifact(
-                self._group + "-" + str(self._seed) + "-" + str(identifier),
+                f"{self._group.replace(':', '_').replace('/', '_')}-{self._seed}-{identifier}",
                type="buffer",
            )
            artifact.add_file(fp)
@@ -118,6 +130,8 @@ class Logger:
        assert mode in {"train", "eval"}
        if self._wandb is not None:
            for k, v in d.items():
+                if not isinstance(v, (int, float, str)):
+                    continue
                self._wandb.log({f"{mode}/{k}": v}, step=step)

    def log_video(self, video_path: str, step: int, mode: str = "train"):
--- a/lerobot/common/policies/act/configuration_act.py
+++ b/lerobot/common/policies/act/configuration_act.py
@@ -1,3 +1,18 @@
+#!/usr/bin/env python
+
+# Copyright 2024 Tony Z. Zhao and The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
 from dataclasses import dataclass, field


@@ -10,6 +25,14 @@ class ACTConfig:
    The parameters you will most likely need to change are the ones which depend on the environment / sensors.
    Those are: `input_shapes` and 'output_shapes`.

+    Notes on the inputs and outputs:
+        - "observation.state" is required as an input key.
+        - At least one key starting with "observation.image is required as an input.
+        - If there are multiple keys beginning with "observation.image" they are treated as multiple camera
+          views.
+          Right now we only support all images having the same shape.
+        - "action" is required as an output key.
+
    Args:
        n_obs_steps: Number of environment steps worth of observations to pass to the policy (takes the
            current step and additional steps going back).
@@ -18,15 +41,15 @@ class ACTConfig:
            This should be no greater than the chunk size. For example, if the chunk size size 100, you may
            set this to 50. This would mean that the model predicts 100 steps worth of actions, runs 50 in the
            environment, and throws the other 50 out.
-        input_shapes: A dictionary defining the shapes of the input data for the policy.
-            The key represents the input data name, and the value is a list indicating the dimensions
-            of the corresponding data. For example, "observation.images.top" refers to an input from the
-            "top" camera with dimensions [3, 96, 96], indicating it has three color channels and 96x96 resolution.
-            Importantly, shapes doesn't include batch dimension or temporal dimension.
-        output_shapes: A dictionary defining the shapes of the output data for the policy.
-            The key represents the output data name, and the value is a list indicating the dimensions
-            of the corresponding data. For example, "action" refers to an output shape of [14], indicating
-            14-dimensional actions. Importantly, shapes doesn't include batch dimension or temporal dimension.
+        input_shapes: A dictionary defining the shapes of the input data for the policy. The key represents
+            the input data name, and the value is a list indicating the dimensions of the corresponding data.
+            For example, "observation.image" refers to an input from a camera with dimensions [3, 96, 96],
+            indicating it has three color channels and 96x96 resolution. Importantly, `input_shapes` doesn't
+            include batch dimension or temporal dimension.
+        output_shapes: A dictionary defining the shapes of the output data for the policy. The key represents
+            the output data name, and the value is a list indicating the dimensions of the corresponding data.
+            For example, "action" refers to an output shape of [14], indicating 14-dimensional actions.
+            Importantly, `output_shapes` doesn't include batch dimension or temporal dimension.
        input_normalization_modes: A dictionary with key representing the modality (e.g. "observation.state"),
            and the value specifies the normalization mode to apply. The two available modes are "mean_std"
            which subtracts the mean and divides by the standard deviation and "min_max" which rescale in a
@@ -51,8 +74,12 @@ class ACTConfig:
            documentation in the policy class).
        latent_dim: The VAE's latent dimension.
        n_vae_encoder_layers: The number of transformer layers to use for the VAE's encoder.
-        use_temporal_aggregation: Whether to blend the actions of multiple policy invocations for any given
-            environment step.
+        temporal_ensemble_momentum: Exponential moving average (EMA) momentum parameter (α) for ensembling
+            actions for a given time step over multiple policy invocations. Updates are calculated as:
+            x⁻ₙ = αx⁻ₙ₋₁ + (1-α)xₙ. Note that the ACT paper and original ACT code describes a different
+            parameter here: they refer to a weighting scheme wᵢ = exp(-m⋅i) and set m = 0.01. With our
+            formulation, this is equivalent to α = exp(-0.01) ≈ 0.99. When this parameter is provided, we
+            require `n_action_steps == 1` (since we need to query the policy every step anyway).
        dropout: Dropout to use in the transformer layers (see code for details).
        kl_weight: The weight to use for the KL-divergence component of the loss if the variational objective
            is enabled. Loss is then calculated as: `reconstruction_loss + kl_weight * kld_loss`.
@@ -100,6 +127,9 @@ class ACTConfig:
    dim_feedforward: int = 3200
    feedforward_activation: str = "relu"
    n_encoder_layers: int = 4
+    # Note: Although the original ACT implementation has 7 for `n_decoder_layers`, there is a bug in the code
+    # that means only the first layer is used. Here we match the original implementation by setting this to 1.
+    # See this issue https://github.com/tonyzhaozh/act/issues/25#issue-2258740521.
    n_decoder_layers: int = 1
    # VAE.
    use_vae: bool = True
@@ -107,7 +137,7 @@ class ACTConfig:
    n_vae_encoder_layers: int = 4

    # Inference.
-    use_temporal_aggregation: bool = False
+    temporal_ensemble_momentum: float | None = None

    # Training and loss computation.
    dropout: float = 0.1
@@ -119,21 +149,13 @@ class ACTConfig:
            raise ValueError(
                f"`vision_backbone` must be one of the ResNet variants. Got {self.vision_backbone}."
            )
-        if self.use_temporal_aggregation:
-            raise NotImplementedError("Temporal aggregation is not yet implemented.")
+        if self.temporal_ensemble_momentum is not None and self.n_action_steps > 1:
+            raise NotImplementedError(
+                "`n_action_steps` must be 1 when using temporal ensembling. This is "
+                "because the policy needs to be queried every step to compute the ensembled action."
+            )
        if self.n_action_steps > self.chunk_size:
            raise ValueError(
                f"The chunk size is the upper bound for the number of action steps per model invocation. Got "
                f"{self.n_action_steps} for `n_action_steps` and {self.chunk_size} for `chunk_size`."
            )
-        if self.n_obs_steps != 1:
-            raise ValueError(
-                f"Multiple observation steps not handled yet. Got `nobs_steps={self.n_obs_steps}`"
-            )
-        # Check that there is only one image.
-        # TODO(alexander-soare): generalize this to multiple images.
-        if (
-            sum(k.startswith("observation.images.") for k in self.input_shapes) != 1
-            or "observation.images.top" not in self.input_shapes
-        ):
-            raise ValueError('For now, only "observation.images.top" is accepted for an image input.')
--- a/lerobot/common/policies/act/modeling_act.py
+++ b/lerobot/common/policies/act/modeling_act.py
@@ -1,3 +1,18 @@
+#!/usr/bin/env python
+
+# Copyright 2024 Tony Z. Zhao and The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
 """Action Chunking Transformer Policy

 As per Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware (https://arxiv.org/abs/2304.13705).
@@ -21,6 +36,9 @@ from torchvision.ops.misc import FrozenBatchNorm2d

 from lerobot.common.policies.act.configuration_act import ACTConfig
 from lerobot.common.policies.normalize import Normalize, Unnormalize
+from lerobot.common.policies.utils import (
+    populate_queues,
+)


 class ACTPolicy(nn.Module, PyTorchModelHubMixin):
@@ -46,7 +64,8 @@ class ACTPolicy(nn.Module, PyTorchModelHubMixin):
        super().__init__()
        if config is None:
            config = ACTConfig()
-        self.config = config
+        self.config: ACTConfig = config
+
        self.normalize_inputs = Normalize(
            config.input_shapes, config.input_normalization_modes, dataset_stats
        )
@@ -56,12 +75,26 @@ class ACTPolicy(nn.Module, PyTorchModelHubMixin):
        self.unnormalize_outputs = Unnormalize(
            config.output_shapes, config.output_normalization_modes, dataset_stats
        )
+
+        # queues are populated during rollout of the policy, they contain the n latest observations and actions
+        self._queues = None
+
        self.model = ACT(config)

+        self.expected_image_keys = [k for k in config.input_shapes if k.startswith("observation.image")]
+
+        self.reset()
+
    def reset(self):
        """This should be called whenever the environment is reset."""
-        if self.config.n_action_steps is not None:
-            self._action_queue = deque([], maxlen=self.config.n_action_steps)
+        if self.config.temporal_ensemble_momentum is not None:
+            self._ensembled_actions = None
+        else:
+            self._queues = {
+                "observation.images": deque(maxlen=self.config.n_obs_steps),
+                "observation.state": deque(maxlen=self.config.n_obs_steps),
+                "action": deque(maxlen=self.config.n_action_steps),
+            }

    @torch.no_grad
    def select_action(self, batch: dict[str, Tensor]) -> Tensor:
@@ -71,66 +104,80 @@ class ACTPolicy(nn.Module, PyTorchModelHubMixin):
        environment. It works by managing the actions in a queue and only calling `select_actions` when the
        queue is empty.
        """
-        assert "observation.images.top" in batch
-        assert "observation.state" in batch
-
        self.eval()

        batch = self.normalize_inputs(batch)
-        self._stack_images(batch)
+        batch["observation.images"] = torch.stack([batch[k] for k in self.expected_image_keys], dim=-4)
+        # Note: It's important that this happens after stacking the images into a single key.
+        self._queues = populate_queues(self._queues, batch)

-        if len(self._action_queue) == 0:
-            # `self.model.forward` returns a (batch_size, n_action_steps, action_dim) tensor, but the queue
-            # effectively has shape (n_action_steps, batch_size, *), hence the transpose.
-            actions = self.model(batch)[0][: self.config.n_action_steps]
+        # If we are doing temporal ensembling, keep track of the exponential moving average (EMA), and return
+        # the first action.
+        if self.config.temporal_ensemble_momentum is not None:
+            # stack n latest observations from the queue
+            batch = {k: torch.stack(list(self._queues[k]), dim=1) for k in batch if k in self._queues}
+
+            actions = self.model(batch)[0]  # (batch_size, chunk_size, action_dim)
+            actions = self.unnormalize_outputs({"action": actions})["action"]
+            if self._ensembled_actions is None:
+                # Initializes `self._ensembled_action` to the sequence of actions predicted during the first
+                # time step of the episode.
+                self._ensembled_actions = actions.clone()
+            else:
+                # self._ensembled_actions will have shape (batch_size, chunk_size - 1, action_dim). Compute
+                # the EMA update for those entries.
+                alpha = self.config.temporal_ensemble_momentum
+                self._ensembled_actions = alpha * self._ensembled_actions + (1 - alpha) * actions[:, :-1]
+                # The last action, which has no prior moving average, needs to get concatenated onto the end.
+                self._ensembled_actions = torch.cat([self._ensembled_actions, actions[:, -1:]], dim=1)
+            # "Consume" the first action.
+            action, self._ensembled_actions = self._ensembled_actions[:, 0], self._ensembled_actions[:, 1:]
+            return action
+
+        # Action queue logic for n_action_steps > 1. When the action_queue is depleted, populate it by
+        # querying the policy.
+        if len(self._queues["action"]) == 0:
+            # stack n latest observations from the queue
+            batch = {k: torch.stack(list(self._queues[k]), dim=1) for k in batch if k in self._queues}
+
+            actions = self.model(batch)[0][:, : self.config.n_action_steps]

            # TODO(rcadene): make _forward return output dictionary?
            actions = self.unnormalize_outputs({"action": actions})["action"]

-            self._action_queue.extend(actions.transpose(0, 1))
-        return self._action_queue.popleft()
+            # `self.model.forward` returns a (batch_size, n_action_steps, action_dim) tensor, but the queue
+            # effectively has shape (n_action_steps, batch_size, *), hence the transpose.
+            self._queues["action"].extend(actions.transpose(0, 1))
+        return self._queues["action"].popleft()

    def forward(self, batch: dict[str, Tensor]) -> dict[str, Tensor]:
        """Run the batch through the model and compute the loss for training or validation."""
        batch = self.normalize_inputs(batch)
+        batch["observation.images"] = torch.stack([batch[k] for k in self.expected_image_keys], dim=-4)
        batch = self.normalize_targets(batch)
-        self._stack_images(batch)
        actions_hat, (mu_hat, log_sigma_x2_hat) = self.model(batch)

-        l1_loss = (
-            F.l1_loss(batch["action"], actions_hat, reduction="none") * ~batch["action_is_pad"].unsqueeze(-1)
-        ).mean()
+        bsize = actions_hat.shape[0]
+        l1_loss = F.l1_loss(batch["action"], actions_hat, reduction="none")
+        l1_loss = l1_loss * ~batch["action_is_pad"].unsqueeze(-1)
+        l1_loss = l1_loss.view(bsize, -1).mean(dim=1)
+
+        out_dict = {}
+        out_dict["actions"] = self.unnormalize_outputs({"action": actions_hat})["action"]
+        out_dict["l1_loss"] = l1_loss

-        loss_dict = {"l1_loss": l1_loss}
        if self.config.use_vae:
            # Calculate Dₖₗ(latent_pdf || standard_normal). Note: After computing the KL-divergence for
            # each dimension independently, we sum over the latent dimension to get the total
            # KL-divergence per batch element, then take the mean over the batch.
            # (See App. B of https://arxiv.org/abs/1312.6114 for more details).
-            mean_kld = (
-                (-0.5 * (1 + log_sigma_x2_hat - mu_hat.pow(2) - (log_sigma_x2_hat).exp())).sum(-1).mean()
-            )
-            loss_dict["kld_loss"] = mean_kld
-            loss_dict["loss"] = l1_loss + mean_kld * self.config.kl_weight
+            kld_loss = (-0.5 * (1 + log_sigma_x2_hat - mu_hat.pow(2) - (log_sigma_x2_hat).exp())).sum(-1)
+            out_dict["kld_loss"] = kld_loss
+            out_dict["loss"] = l1_loss + kld_loss * self.config.kl_weight
        else:
-            loss_dict["loss"] = l1_loss
+            out_dict["loss"] = l1_loss

-        return loss_dict
-
-    def _stack_images(self, batch: dict[str, Tensor]) -> dict[str, Tensor]:
-        """Stacks all the images in a batch and puts them in a new key: "observation.images".
-
-        This function expects `batch` to have (at least):
-        {
-            "observation.state": (B, state_dim) batch of robot states.
-            "observation.images.{name}": (B, C, H, W) tensor of images.
-        }
-        """
-        # Stack images in the order dictated by input_shapes.
-        batch["observation.images"] = torch.stack(
-            [batch[k] for k in self.config.input_shapes if k.startswith("observation.images.")],
-            dim=-4,
-        )
+        return out_dict


 class ACT(nn.Module):
@@ -161,10 +208,10 @@ class ACT(nn.Module):
              │ encoder │ │     │ │Transf.│             │
              │         │ │     │ │encoder│             │
              └───▲─────┘ │     │ │       │             │
-                  │       │     │ └───▲───┘             │
-                  │       │     │     │                 │
-                inputs    └─────┼─────┘                 │
-                                │                       │
+                  │       │     │ └▲──▲─▲─┘             │
+                  │       │     │  │  │ │               │
+                inputs    └─────┼──┘  │ image emb.      │
+                                │    state emb.         │
                                └───────────────────────┘
    """

@@ -173,25 +220,28 @@ class ACT(nn.Module):
        self.config = config
        # BERT style VAE encoder with input [cls, *joint_space_configuration, *action_sequence].
        # The cls token forms parameters of the latent's distribution (like this [*means, *log_variances]).
+        self.has_state = "observation.state" in config.input_shapes
+        self.latent_dim = config.latent_dim
        if self.config.use_vae:
            self.vae_encoder = ACTEncoder(config)
            self.vae_encoder_cls_embed = nn.Embedding(1, config.dim_model)
            # Projection layer for joint-space configuration to hidden dimension.
-            self.vae_encoder_robot_state_input_proj = nn.Linear(
-                config.input_shapes["observation.state"][0], config.dim_model
-            )
+            if self.has_state:
+                self.vae_encoder_robot_state_input_proj = nn.Linear(
+                    config.input_shapes["observation.state"][0], config.dim_model
+                )
            # Projection layer for action (joint-space target) to hidden dimension.
            self.vae_encoder_action_input_proj = nn.Linear(
-                config.input_shapes["observation.state"][0], config.dim_model
+                config.output_shapes["action"][0], config.dim_model
            )
-            self.latent_dim = config.latent_dim
            # Projection layer from the VAE encoder's output to the latent distribution's parameter space.
            self.vae_encoder_latent_output_proj = nn.Linear(config.dim_model, self.latent_dim * 2)
            # Fixed sinusoidal positional embedding the whole input to the VAE encoder. Unsqueeze for batch
            # dimension.
+            num_input_token_encoder = 1 + 1 + config.chunk_size if self.has_state else 1 + config.chunk_size
            self.register_buffer(
                "vae_encoder_pos_enc",
-                create_sinusoidal_pos_embedding(1 + 1 + config.chunk_size, config.dim_model).unsqueeze(0),
+                create_sinusoidal_pos_embedding(num_input_token_encoder, config.dim_model).unsqueeze(0),
            )

        # Backbone for image feature extraction.
@@ -205,21 +255,30 @@ class ACT(nn.Module):
        # Note: The forward method of this returns a dict: {"feature_map": output}.
        self.backbone = IntermediateLayerGetter(backbone_model, return_layers={"layer4": "feature_map"})

+        if self.config.n_obs_steps > 1:
+            self.factorized_conv3d = FactorizedConv3d(
+                backbone_model.fc.in_features,
+                backbone_model.fc.in_features,
+                backbone_model.fc.in_features,
+            )
+
        # Transformer (acts as VAE decoder when training with the variational objective).
        self.encoder = ACTEncoder(config)
        self.decoder = ACTDecoder(config)

        # Transformer encoder input projections. The tokens will be structured like
        # [latent, robot_state, image_feature_map_pixels].
-        self.encoder_robot_state_input_proj = nn.Linear(
-            config.input_shapes["observation.state"][0], config.dim_model
-        )
+        if self.has_state:
+            self.encoder_robot_state_input_proj = nn.Linear(
+                config.input_shapes["observation.state"][0], config.dim_model
+            )
        self.encoder_latent_input_proj = nn.Linear(self.latent_dim, config.dim_model)
        self.encoder_img_feat_input_proj = nn.Conv2d(
            backbone_model.fc.in_features, config.dim_model, kernel_size=1
        )
        # Transformer encoder positional embeddings.
-        self.encoder_robot_and_latent_pos_embed = nn.Embedding(2, config.dim_model)
+        num_input_token_decoder = 2 if self.has_state else 1
+        self.encoder_robot_and_latent_pos_embed = nn.Embedding(num_input_token_decoder, config.dim_model)
        self.encoder_cam_feat_pos_embed = ACTSinusoidalPositionEmbedding2d(config.dim_model // 2)

        # Transformer decoder.
@@ -258,7 +317,7 @@ class ACT(nn.Module):
                "action" in batch
            ), "actions must be provided when using the variational objective in training mode."

-        batch_size = batch["observation.state"].shape[0]
+        batch_size = batch["observation.images"].shape[0]

        # Prepare the latent for input to the transformer encoder.
        if self.config.use_vae and "action" in batch:
@@ -266,11 +325,16 @@ class ACT(nn.Module):
            cls_embed = einops.repeat(
                self.vae_encoder_cls_embed.weight, "1 d -> b 1 d", b=batch_size
            )  # (B, 1, D)
-            robot_state_embed = self.vae_encoder_robot_state_input_proj(batch["observation.state"]).unsqueeze(
-                1
-            )  # (B, 1, D)
+            if self.has_state:
+                robot_state_embed = self.vae_encoder_robot_state_input_proj(batch["observation.state"])
+                robot_state_embed = robot_state_embed.unsqueeze(1)  # (B, 1, D)
            action_embed = self.vae_encoder_action_input_proj(batch["action"])  # (B, S, D)
-            vae_encoder_input = torch.cat([cls_embed, robot_state_embed, action_embed], axis=1)  # (B, S+2, D)
+
+            if self.has_state:
+                vae_encoder_input = [cls_embed, robot_state_embed, action_embed]  # (B, S+2, D)
+            else:
+                vae_encoder_input = [cls_embed, action_embed]
+            vae_encoder_input = torch.cat(vae_encoder_input, axis=1)

            # Prepare fixed positional embedding.
            # Note: detach() shouldn't be necessary but leaving it the same as the original code just in case.
@@ -290,6 +354,7 @@ class ACT(nn.Module):
        else:
            # When not using the VAE encoder, we set the latent to be all zeros.
            mu = log_sigma_x2 = None
+            # TODO(rcadene, alexander-soare): remove call to `.to` to speedup forward ; precompute and use buffer
            latent_sample = torch.zeros([batch_size, self.latent_dim], dtype=torch.float32).to(
                batch["observation.state"].device
            )
@@ -299,25 +364,44 @@ class ACT(nn.Module):
        all_cam_features = []
        all_cam_pos_embeds = []
        images = batch["observation.images"]
+
        for cam_index in range(images.shape[-4]):
-            cam_features = self.backbone(images[:, cam_index])["feature_map"]
+            if self.config.n_obs_steps > 1:
+                assert images.ndim == 6
+                assert images.shape[1] == self.config.n_obs_steps
+                cam_images = images[:, :, cam_index]
+                b, t, c_in, h_in, w_in = cam_images.shape
+                cam_images = cam_images.reshape(b * t, c_in, h_in, w_in)
+                cam_features = self.backbone(cam_images)["feature_map"]
+                bt_, c_out, h_out, w_out = cam_features.shape
+                cam_features = cam_features.view(b, t, c_out, h_out, w_out)
+                cam_features = cam_features.permute(0, 2, 1, 3, 4).contiguous()
+                cam_features = self.factorized_conv3d(cam_features)
+                cam_features = cam_features.mean(2)
+            else:
+                assert images.ndim == 5
+                cam_image = images[:, cam_index]
+                cam_features = self.backbone(cam_image)["feature_map"]
+            # TODO(rcadene, alexander-soare): remove call to `.to` to speedup forward ; precompute and use buffer
            cam_pos_embed = self.encoder_cam_feat_pos_embed(cam_features).to(dtype=cam_features.dtype)
            cam_features = self.encoder_img_feat_input_proj(cam_features)  # (B, C, h, w)
            all_cam_features.append(cam_features)
            all_cam_pos_embeds.append(cam_pos_embed)
        # Concatenate camera observation feature maps and positional embeddings along the width dimension.
-        encoder_in = torch.cat(all_cam_features, axis=3)
-        cam_pos_embed = torch.cat(all_cam_pos_embeds, axis=3)
+        encoder_in = torch.cat(all_cam_features, axis=-1)
+        cam_pos_embed = torch.cat(all_cam_pos_embeds, axis=-1)

        # Get positional embeddings for robot state and latent.
-        robot_state_embed = self.encoder_robot_state_input_proj(batch["observation.state"])
-        latent_embed = self.encoder_latent_input_proj(latent_sample)
+        if self.has_state:
+            robot_state_embed = self.encoder_robot_state_input_proj(batch["observation.state"])  # (B, C)
+        latent_embed = self.encoder_latent_input_proj(latent_sample)  # (B, C)

        # Stack encoder input and positional embeddings moving to (S, B, C).
+        encoder_in_feats = [latent_embed, robot_state_embed] if self.has_state else [latent_embed]
        encoder_in = torch.cat(
            [
-                torch.stack([latent_embed, robot_state_embed], axis=0),
-                encoder_in.flatten(2).permute(2, 0, 1),
+                torch.stack(encoder_in_feats, axis=0),
+                einops.rearrange(encoder_in, "b c h w -> (h w) b c"),
            ]
        )
        pos_embed = torch.cat(
@@ -330,6 +414,7 @@ class ACT(nn.Module):

        # Forward pass through the transformer modules.
        encoder_out = self.encoder(encoder_in, pos_embed=pos_embed)
+        # TODO(rcadene, alexander-soare): remove call to `device` ; precompute and use buffer
        decoder_in = torch.zeros(
            (self.config.chunk_size, batch_size, self.config.dim_model),
            dtype=pos_embed.dtype,
@@ -579,3 +664,24 @@ def get_activation_fn(activation: str) -> Callable:
    if activation == "glu":
        return F.glu
    raise RuntimeError(f"activation should be relu/gelu/glu, not {activation}.")
+
+
+class FactorizedConv3d(nn.Module):
+    def __init__(self, in_channels, hidden_dim, out_channels):
+        super().__init__()
+        self.in_channels = in_channels
+        self.hidden_dim = hidden_dim
+        self.out_channels = out_channels
+        self.spatial_conv = nn.Conv3d(in_channels, hidden_dim, kernel_size=(1, 3, 3), padding=(0, 1, 1))
+        self.temporal_conv = nn.Conv3d(hidden_dim, out_channels, kernel_size=(3, 1, 1), padding=(1, 0, 0))
+
+    def forward(self, x):
+        assert x.ndim == 5, f"Expected shape is b,t,c,h,w but {x.shape} given."
+        assert (
+            x.shape[1] == self.in_channels
+        ), f"Expected number channels as input is {self.in_channels}, but {x.shape[2]} given."
+        x = self.spatial_conv(x)
+        x = F.relu(x, inplace=True)
+        x = self.temporal_conv(x)
+        x = F.relu(x, inplace=True)
+        return x
--- a/lerobot/common/policies/diffusion/configuration_diffusion.py
+++ b/lerobot/common/policies/diffusion/configuration_diffusion.py
@@ -1,3 +1,19 @@
+#!/usr/bin/env python
+
+# Copyright 2024 Columbia Artificial Intelligence, Robotics Lab,
+# and The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
 from dataclasses import dataclass, field


@@ -10,21 +26,29 @@ class DiffusionConfig:
    The parameters you will most likely need to change are the ones which depend on the environment / sensors.
    Those are: `input_shapes` and `output_shapes`.

+    Notes on the inputs and outputs:
+        - "observation.state" is required as an input key.
+        - At least one key starting with "observation.image is required as an input.
+        - If there are multiple keys beginning with "observation.image" they are treated as multiple camera
+          views.
+          Right now we only support all images having the same shape.
+        - "action" is required as an output key.
+
    Args:
        n_obs_steps: Number of environment steps worth of observations to pass to the policy (takes the
            current step and additional steps going back).
        horizon: Diffusion model action prediction size as detailed in `DiffusionPolicy.select_action`.
        n_action_steps: The number of action steps to run in the environment for one invocation of the policy.
            See `DiffusionPolicy.select_action` for more details.
-        input_shapes: A dictionary defining the shapes of the input data for the policy.
-            The key represents the input data name, and the value is a list indicating the dimensions
-            of the corresponding data. For example, "observation.image" refers to an input from
-            a camera with dimensions [3, 96, 96], indicating it has three color channels and 96x96 resolution.
-            Importantly, shapes doesnt include batch dimension or temporal dimension.
-        output_shapes: A dictionary defining the shapes of the output data for the policy.
-            The key represents the output data name, and the value is a list indicating the dimensions
-            of the corresponding data. For example, "action" refers to an output shape of [14], indicating
-            14-dimensional actions. Importantly, shapes doesnt include batch dimension or temporal dimension.
+        input_shapes: A dictionary defining the shapes of the input data for the policy. The key represents
+            the input data name, and the value is a list indicating the dimensions of the corresponding data.
+            For example, "observation.image" refers to an input from a camera with dimensions [3, 96, 96],
+            indicating it has three color channels and 96x96 resolution. Importantly, `input_shapes` doesn't
+            include batch dimension or temporal dimension.
+        output_shapes: A dictionary defining the shapes of the output data for the policy. The key represents
+            the output data name, and the value is a list indicating the dimensions of the corresponding data.
+            For example, "action" refers to an output shape of [14], indicating 14-dimensional actions.
+            Importantly, `output_shapes` doesn't include batch dimension or temporal dimension.
        input_normalization_modes: A dictionary with key representing the modality (e.g. "observation.state"),
            and the value specifies the normalization mode to apply. The two available modes are "mean_std"
            which subtracts the mean and divides by the standard deviation and "min_max" which rescale in a
@@ -51,6 +75,7 @@ class DiffusionConfig:
        use_film_scale_modulation: FiLM (https://arxiv.org/abs/1709.07871) is used for the Unet conditioning.
            Bias modulation is used be default, while this parameter indicates whether to also use scale
            modulation.
+        noise_scheduler_type: Name of the noise scheduler to use. Supported options: ["DDPM", "DDIM"].
        num_train_timesteps: Number of diffusion steps for the forward diffusion schedule.
        beta_schedule: Name of the diffusion beta schedule as per DDPMScheduler from Hugging Face diffusers.
        beta_start: Beta value for the first forward-diffusion step.
@@ -110,6 +135,7 @@ class DiffusionConfig:
    diffusion_step_embed_dim: int = 128
    use_film_scale_modulation: bool = True
    # Noise scheduler.
+    noise_scheduler_type: str = "DDPM"
    num_train_timesteps: int = 100
    beta_schedule: str = "squaredcos_cap_v2"
    beta_start: float = 0.0001
@@ -130,17 +156,34 @@ class DiffusionConfig:
            raise ValueError(
                f"`vision_backbone` must be one of the ResNet variants. Got {self.vision_backbone}."
            )
-        if (
-            self.crop_shape[0] > self.input_shapes["observation.image"][1]
-            or self.crop_shape[1] > self.input_shapes["observation.image"][2]
-        ):
-            raise ValueError(
-                f'`crop_shape` should fit within `input_shapes["observation.image"]`. Got {self.crop_shape} '
-                f'for `crop_shape` and {self.input_shapes["observation.image"]} for '
-                '`input_shapes["observation.image"]`.'
-            )
+        image_keys = {k for k in self.input_shapes if k.startswith("observation.image")}
+        if self.crop_shape is not None:
+            for image_key in image_keys:
+                if (
+                    self.crop_shape[0] > self.input_shapes[image_key][1]
+                    or self.crop_shape[1] > self.input_shapes[image_key][2]
+                ):
+                    raise ValueError(
+                        f"`crop_shape` should fit within `input_shapes[{image_key}]`. Got {self.crop_shape} "
+                        f"for `crop_shape` and {self.input_shapes[image_key]} for "
+                        "`input_shapes[{image_key}]`."
+                    )
+        # Check that all input images have the same shape.
+        first_image_key = next(iter(image_keys))
+        for image_key in image_keys:
+            if self.input_shapes[image_key] != self.input_shapes[first_image_key]:
+                raise ValueError(
+                    f"`input_shapes[{image_key}]` does not match `input_shapes[{first_image_key}]`, but we "
+                    "expect all image shapes to match."
+                )
        supported_prediction_types = ["epsilon", "sample"]
        if self.prediction_type not in supported_prediction_types:
            raise ValueError(
                f"`prediction_type` must be one of {supported_prediction_types}. Got {self.prediction_type}."
            )
+        supported_noise_schedulers = ["DDPM", "DDIM"]
+        if self.noise_scheduler_type not in supported_noise_schedulers:
+            raise ValueError(
+                f"`noise_scheduler_type` must be one of {supported_noise_schedulers}. "
+                f"Got {self.noise_scheduler_type}."
+            )
--- a/lerobot/common/policies/diffusion/modeling_diffusion.py
+++ b/lerobot/common/policies/diffusion/modeling_diffusion.py
@@ -1,7 +1,22 @@
+#!/usr/bin/env python
+
+# Copyright 2024 Columbia Artificial Intelligence, Robotics Lab,
+# and The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
 """Diffusion Policy as per "Diffusion Policy: Visuomotor Policy Learning via Action Diffusion"

 TODO(alexander-soare):
-  - Remove reliance on Robomimic for SpatialSoftmax.
  - Remove reliance on diffusers for DDPMScheduler and LR scheduler.
 """

@@ -10,12 +25,13 @@ from collections import deque
 from typing import Callable

 import einops
+import numpy as np
 import torch
 import torch.nn.functional as F  # noqa: N812
 import torchvision
+from diffusers.schedulers.scheduling_ddim import DDIMScheduler
 from diffusers.schedulers.scheduling_ddpm import DDPMScheduler
 from huggingface_hub import PyTorchModelHubMixin
-from robomimic.models.base_nets import SpatialSoftmax
 from torch import Tensor, nn

 from lerobot.common.policies.diffusion.configuration_diffusion import DiffusionConfig
@@ -66,12 +82,14 @@ class DiffusionPolicy(nn.Module, PyTorchModelHubMixin):

        self.diffusion = DiffusionModel(config)

+        self.expected_image_keys = [k for k in config.input_shapes if k.startswith("observation.image")]
+
+        self.reset()
+
    def reset(self):
-        """
-        Clear observation and action queues. Should be called on `env.reset()`
-        """
+        """Clear observation and action queues. Should be called on `env.reset()`"""
        self._queues = {
-            "observation.image": deque(maxlen=self.config.n_obs_steps),
+            "observation.images": deque(maxlen=self.config.n_obs_steps),
            "observation.state": deque(maxlen=self.config.n_obs_steps),
            "action": deque(maxlen=self.config.n_action_steps),
        }
@@ -98,16 +116,14 @@ class DiffusionPolicy(nn.Module, PyTorchModelHubMixin):
        "horizon" may not the best name to describe what the variable actually means, because this period is
        actually measured from the first observation which (if `n_obs_steps` > 1) happened in the past.
        """
-        assert "observation.image" in batch
-        assert "observation.state" in batch
-
        batch = self.normalize_inputs(batch)
-
+        batch["observation.images"] = torch.stack([batch[k] for k in self.expected_image_keys], dim=-4)
+        # Note: It's important that this happens after stacking the images into a single key.
        self._queues = populate_queues(self._queues, batch)

        if len(self._queues["action"]) == 0:
            # stack n latest observations from the queue
-            batch = {key: torch.stack(list(self._queues[key]), dim=1) for key in batch}
+            batch = {k: torch.stack(list(self._queues[k]), dim=1) for k in batch if k in self._queues}
            actions = self.diffusion.generate_actions(batch)

            # TODO(rcadene): make above methods return output dictionary?
@@ -121,29 +137,44 @@ class DiffusionPolicy(nn.Module, PyTorchModelHubMixin):
    def forward(self, batch: dict[str, Tensor]) -> dict[str, Tensor]:
        """Run the batch through the model and compute the loss for training or validation."""
        batch = self.normalize_inputs(batch)
+        batch["observation.images"] = torch.stack([batch[k] for k in self.expected_image_keys], dim=-4)
        batch = self.normalize_targets(batch)
        loss = self.diffusion.compute_loss(batch)
        return {"loss": loss}


+def _make_noise_scheduler(name: str, **kwargs: dict) -> DDPMScheduler | DDIMScheduler:
+    """
+    Factory for noise scheduler instances of the requested type. All kwargs are passed
+    to the scheduler.
+    """
+    if name == "DDPM":
+        return DDPMScheduler(**kwargs)
+    elif name == "DDIM":
+        return DDIMScheduler(**kwargs)
+    else:
+        raise ValueError(f"Unsupported noise scheduler type {name}")
+
+
 class DiffusionModel(nn.Module):
    def __init__(self, config: DiffusionConfig):
        super().__init__()
        self.config = config

        self.rgb_encoder = DiffusionRgbEncoder(config)
+        num_images = len([k for k in config.input_shapes if k.startswith("observation.image")])
        self.unet = DiffusionConditionalUnet1d(
            config,
-            global_cond_dim=(config.output_shapes["action"][0] + self.rgb_encoder.feature_dim)
+            global_cond_dim=(config.output_shapes["action"][0] + self.rgb_encoder.feature_dim * num_images)
            * config.n_obs_steps,
        )

-        self.noise_scheduler = DDPMScheduler(
+        self.noise_scheduler = _make_noise_scheduler(
+            config.noise_scheduler_type,
            num_train_timesteps=config.num_train_timesteps,
            beta_start=config.beta_start,
            beta_end=config.beta_end,
            beta_schedule=config.beta_schedule,
-            variance_type="fixed_small",
            clip_sample=config.clip_sample,
            clip_sample_range=config.clip_sample_range,
            prediction_type=config.prediction_type,
@@ -183,24 +214,34 @@ class DiffusionModel(nn.Module):

        return sample

+    def _prepare_global_conditioning(self, batch: dict[str, Tensor]) -> Tensor:
+        """Encode image features and concatenate them all together along with the state vector."""
+        batch_size, n_obs_steps = batch["observation.state"].shape[:2]
+        # Extract image feature (first combine batch, sequence, and camera index dims).
+        img_features = self.rgb_encoder(
+            einops.rearrange(batch["observation.images"], "b s n ... -> (b s n) ...")
+        )
+        # Separate batch dim and sequence dim back out. The camera index dim gets absorbed into the feature
+        # dim (effectively concatenating the camera features).
+        img_features = einops.rearrange(
+            img_features, "(b s n) ... -> b s (n ...)", b=batch_size, s=n_obs_steps
+        )
+        # Concatenate state and image features then flatten to (B, global_cond_dim).
+        return torch.cat([batch["observation.state"], img_features], dim=-1).flatten(start_dim=1)
+
    def generate_actions(self, batch: dict[str, Tensor]) -> Tensor:
        """
-        This function expects `batch` to have (at least):
+        This function expects `batch` to have:
        {
            "observation.state": (B, n_obs_steps, state_dim)
-            "observation.image": (B, n_obs_steps, C, H, W)
+            "observation.images": (B, n_obs_steps, num_cameras, C, H, W)
        }
        """
-        assert set(batch).issuperset({"observation.state", "observation.image"})
        batch_size, n_obs_steps = batch["observation.state"].shape[:2]
        assert n_obs_steps == self.config.n_obs_steps

-        # Extract image feature (first combine batch and sequence dims).
-        img_features = self.rgb_encoder(einops.rearrange(batch["observation.image"], "b n ... -> (b n) ..."))
-        # Separate batch and sequence dims.
-        img_features = einops.rearrange(img_features, "(b n) ... -> b n ...", b=batch_size)
-        # Concatenate state and image features then flatten to (B, global_cond_dim).
-        global_cond = torch.cat([batch["observation.state"], img_features], dim=-1).flatten(start_dim=1)
+        # Encode image features and concatenate them all together along with the state vector.
+        global_cond = self._prepare_global_conditioning(batch)  # (B, global_cond_dim)

        # run sampling
        sample = self.conditional_sample(batch_size, global_cond=global_cond)
@@ -219,28 +260,23 @@ class DiffusionModel(nn.Module):
        This function expects `batch` to have (at least):
        {
            "observation.state": (B, n_obs_steps, state_dim)
-            "observation.image": (B, n_obs_steps, C, H, W)
+            "observation.images": (B, n_obs_steps, num_cameras, C, H, W)
            "action": (B, horizon, action_dim)
            "action_is_pad": (B, horizon)
        }
        """
        # Input validation.
-        assert set(batch).issuperset({"observation.state", "observation.image", "action", "action_is_pad"})
-        batch_size, n_obs_steps = batch["observation.state"].shape[:2]
+        assert set(batch).issuperset({"observation.state", "observation.images", "action", "action_is_pad"})
+        n_obs_steps = batch["observation.state"].shape[1]
        horizon = batch["action"].shape[1]
        assert horizon == self.config.horizon
        assert n_obs_steps == self.config.n_obs_steps

-        # Extract image feature (first combine batch and sequence dims).
-        img_features = self.rgb_encoder(einops.rearrange(batch["observation.image"], "b n ... -> (b n) ..."))
-        # Separate batch and sequence dims.
-        img_features = einops.rearrange(img_features, "(b n) ... -> b n ...", b=batch_size)
-        # Concatenate state and image features then flatten to (B, global_cond_dim).
-        global_cond = torch.cat([batch["observation.state"], img_features], dim=-1).flatten(start_dim=1)
-
-        trajectory = batch["action"]
+        # Encode image features and concatenate them all together along with the state vector.
+        global_cond = self._prepare_global_conditioning(batch)  # (B, global_cond_dim)

        # Forward diffusion.
+        trajectory = batch["action"]
        # Sample noise to add to the trajectory.
        eps = torch.randn(trajectory.shape, device=trajectory.device)
        # Sample a random noising timestep for each item in the batch.
@@ -268,13 +304,89 @@ class DiffusionModel(nn.Module):
        loss = F.mse_loss(pred, target, reduction="none")

        # Mask loss wherever the action is padded with copies (edges of the dataset trajectory).
-        if self.config.do_mask_loss_for_padding and "action_is_pad" in batch:
+        if self.config.do_mask_loss_for_padding:
+            if "action_is_pad" not in batch:
+                raise ValueError(
+                    "You need to provide 'action_is_pad' in the batch when "
+                    f"{self.config.do_mask_loss_for_padding=}."
+                )
            in_episode_bound = ~batch["action_is_pad"]
            loss = loss * in_episode_bound.unsqueeze(-1)

        return loss.mean()


+class SpatialSoftmax(nn.Module):
+    """
+    Spatial Soft Argmax operation described in "Deep Spatial Autoencoders for Visuomotor Learning" by Finn et al.
+    (https://arxiv.org/pdf/1509.06113). A minimal port of the robomimic implementation.
+
+    At a high level, this takes 2D feature maps (from a convnet/ViT) and returns the "center of mass"
+    of activations of each channel, i.e., keypoints in the image space for the policy to focus on.
+
+    Example: take feature maps of size (512x10x12). We generate a grid of normalized coordinates (10x12x2):
+    -----------------------------------------------------
+    | (-1., -1.)   | (-0.82, -1.)   | ... | (1., -1.)   |
+    | (-1., -0.78) | (-0.82, -0.78) | ... | (1., -0.78) |
+    | ...          | ...            | ... | ...         |
+    | (-1., 1.)    | (-0.82, 1.)    | ... | (1., 1.)    |
+    -----------------------------------------------------
+    This is achieved by applying channel-wise softmax over the activations (512x120) and computing the dot
+    product with the coordinates (120x2) to get expected points of maximal activation (512x2).
+
+    The example above results in 512 keypoints (corresponding to the 512 input channels). We can optionally
+    provide num_kp != None to control the number of keypoints. This is achieved by a first applying a learnable
+    linear mapping (in_channels, H, W) -> (num_kp, H, W).
+    """
+
+    def __init__(self, input_shape, num_kp=None):
+        """
+        Args:
+            input_shape (list): (C, H, W) input feature map shape.
+            num_kp (int): number of keypoints in output. If None, output will have the same number of channels as input.
+        """
+        super().__init__()
+
+        assert len(input_shape) == 3
+        self._in_c, self._in_h, self._in_w = input_shape
+
+        if num_kp is not None:
+            self.nets = torch.nn.Conv2d(self._in_c, num_kp, kernel_size=1)
+            self._out_c = num_kp
+        else:
+            self.nets = None
+            self._out_c = self._in_c
+
+        # we could use torch.linspace directly but that seems to behave slightly differently than numpy
+        # and causes a small degradation in pc_success of pre-trained models.
+        pos_x, pos_y = np.meshgrid(np.linspace(-1.0, 1.0, self._in_w), np.linspace(-1.0, 1.0, self._in_h))
+        pos_x = torch.from_numpy(pos_x.reshape(self._in_h * self._in_w, 1)).float()
+        pos_y = torch.from_numpy(pos_y.reshape(self._in_h * self._in_w, 1)).float()
+        # register as buffer so it's moved to the correct device.
+        self.register_buffer("pos_grid", torch.cat([pos_x, pos_y], dim=1))
+
+    def forward(self, features: Tensor) -> Tensor:
+        """
+        Args:
+            features: (B, C, H, W) input feature maps.
+        Returns:
+            (B, K, 2) image-space coordinates of keypoints.
+        """
+        if self.nets is not None:
+            features = self.nets(features)
+
+        # [B, K, H, W] -> [B * K, H * W] where K is number of keypoints
+        features = features.reshape(-1, self._in_h * self._in_w)
+        # 2d softmax normalization
+        attention = F.softmax(features, dim=-1)
+        # [B * K, H * W] x [H * W, 2] -> [B * K, 2] for spatial coordinate mean in x and y dimensions
+        expected_xy = attention @ self.pos_grid
+        # reshape to [B, K, 2]
+        feature_keypoints = expected_xy.view(-1, self._out_c, 2)
+
+        return feature_keypoints
+
+
 class DiffusionRgbEncoder(nn.Module):
    """Encoder an RGB image into a 1D feature vector.

@@ -315,11 +427,19 @@ class DiffusionRgbEncoder(nn.Module):

        # Set up pooling and final layers.
        # Use a dry run to get the feature map shape.
+        # The dummy input should take the number of image channels from `config.input_shapes` and it should
+        # use the height and width from `config.crop_shape`.
+        image_keys = [k for k in config.input_shapes if k.startswith("observation.image")]
+        # Note: we have a check in the config class to make sure all images have the same shape.
+        image_key = image_keys[0]
+        if config.crop_shape is None:
+            dummy_input = torch.zeros(size=(1, *config.input_shapes[image_key]))
+        else:
+            dummy_input = torch.zeros(size=(1, config.input_shapes[image_key][0], *config.crop_shape))
        with torch.inference_mode():
-            feat_map_shape = tuple(
-                self.backbone(torch.zeros(size=(1, *config.input_shapes["observation.image"]))).shape[1:]
-            )
-        self.pool = SpatialSoftmax(feat_map_shape, num_kp=config.spatial_softmax_num_keypoints)
+            dummy_feature_map = self.backbone(dummy_input)
+        feature_map_shape = tuple(dummy_feature_map.shape[1:])
+        self.pool = SpatialSoftmax(feature_map_shape, num_kp=config.spatial_softmax_num_keypoints)
        self.feature_dim = config.spatial_softmax_num_keypoints * 2
        self.out = nn.Linear(config.spatial_softmax_num_keypoints * 2, self.feature_dim)
        self.relu = nn.ReLU()
--- a/lerobot/common/policies/factory.py
+++ b/lerobot/common/policies/factory.py
@@ -1,4 +1,20 @@
+#!/usr/bin/env python
+
+# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
 import inspect
+import logging

 from omegaconf import DictConfig, OmegaConf

@@ -8,9 +24,10 @@ from lerobot.common.utils.utils import get_safe_torch_device

 def _policy_cfg_from_hydra_cfg(policy_cfg_class, hydra_cfg):
    expected_kwargs = set(inspect.signature(policy_cfg_class).parameters)
-    assert set(hydra_cfg.policy).issuperset(
-        expected_kwargs
-    ), f"Hydra config is missing arguments: {set(expected_kwargs).difference(hydra_cfg.policy)}"
+    if not set(hydra_cfg.policy).issuperset(expected_kwargs):
+        logging.warning(
+            f"Hydra config is missing arguments: {set(expected_kwargs).difference(hydra_cfg.policy)}"
+        )
    policy_cfg = policy_cfg_class(
        **{
            k: v
@@ -62,11 +79,18 @@ def make_policy(

    policy_cls, policy_cfg_class = get_policy_and_config_classes(hydra_cfg.policy.name)

+    policy_cfg = _policy_cfg_from_hydra_cfg(policy_cfg_class, hydra_cfg)
    if pretrained_policy_name_or_path is None:
-        policy_cfg = _policy_cfg_from_hydra_cfg(policy_cfg_class, hydra_cfg)
+        # Make a fresh policy.
        policy = policy_cls(policy_cfg, dataset_stats)
    else:
-        policy = policy_cls.from_pretrained(pretrained_policy_name_or_path)
+        # Load a pretrained policy and override the config if needed (for example, if there are inference-time
+        # hyperparameters that we want to vary).
+        # TODO(alexander-soare): This hack makes use of huggingface_hub's tooling to load the policy with, pretrained
+        # weights which are then loaded into a fresh policy with the desired config. This PR in huggingface_hub should
+        # make it possible to avoid the hack: https://github.com/huggingface/huggingface_hub/pull/2274.
+        policy = policy_cls(policy_cfg)
+        policy.load_state_dict(policy_cls.from_pretrained(pretrained_policy_name_or_path).state_dict())

    policy.to(get_safe_torch_device(hydra_cfg.device))

--- a/lerobot/common/policies/normalize.py
+++ b/lerobot/common/policies/normalize.py
@@ -1,3 +1,18 @@
+#!/usr/bin/env python
+
+# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
 import torch
 from torch import Tensor, nn

--- a/lerobot/common/policies/policy_protocol.py
+++ b/lerobot/common/policies/policy_protocol.py
@@ -1,3 +1,18 @@
+#!/usr/bin/env python
+
+# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
 """A protocol that all policies should follow.

 This provides a mechanism for type-hinting and isinstance checks without requiring the policies classes
@@ -38,7 +53,8 @@ class Policy(Protocol):
    def forward(self, batch: dict[str, Tensor]) -> dict:
        """Run the batch through the model and compute the loss for training or validation.

-        Returns a dictionary with "loss" and maybe other information.
+        Returns a dictionary with "loss" and potentially other information. Apart from "loss" which is a Tensor, all
+        other items should be logging-friendly, native Python types.
        """

    def select_action(self, batch: dict[str, Tensor]):
--- a/lerobot/common/policies/tdmpc/configuration_tdmpc.py
+++ b/lerobot/common/policies/tdmpc/configuration_tdmpc.py
@@ -1,3 +1,19 @@
+#!/usr/bin/env python
+
+# Copyright 2024 Nicklas Hansen, Xiaolong Wang, Hao Su,
+# and The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
 from dataclasses import dataclass, field


@@ -15,6 +31,15 @@ class TDMPCConfig:
        n_action_repeats: The number of times to repeat the action returned by the planning. (hint: Google
            action repeats in Q-learning or ask your favorite chatbot)
        horizon: Horizon for model predictive control.
+        input_shapes: A dictionary defining the shapes of the input data for the policy. The key represents
+            the input data name, and the value is a list indicating the dimensions of the corresponding data.
+            For example, "observation.image" refers to an input from a camera with dimensions [3, 96, 96],
+            indicating it has three color channels and 96x96 resolution. Importantly, `input_shapes` doesn't
+            include batch dimension or temporal dimension.
+        output_shapes: A dictionary defining the shapes of the output data for the policy. The key represents
+            the output data name, and the value is a list indicating the dimensions of the corresponding data.
+            For example, "action" refers to an output shape of [14], indicating 14-dimensional actions.
+            Importantly, `output_shapes` doesn't include batch dimension or temporal dimension.
        input_normalization_modes: A dictionary with key representing the modality (e.g. "observation.state"),
            and the value specifies the normalization mode to apply. The two available modes are "mean_std"
            which subtracts the mean and divides by the standard deviation and "min_max" which rescale in a
@@ -47,7 +72,7 @@ class TDMPCConfig:
        elite_weighting_temperature: The temperature to use for softmax weighting (by trajectory value) of the
            elites, when updating the gaussian parameters for CEM.
        gaussian_mean_momentum: Momentum (α) used for EMA updates of the mean parameter μ of the gaussian
-            paramters optimized in CEM. Updates are calculated as μ⁻ ← αμ⁻ + (1-α)μ.
+            parameters optimized in CEM. Updates are calculated as μ⁻ ← αμ⁻ + (1-α)μ.
        max_random_shift_ratio: Maximum random shift (as a proportion of the image size) to apply to the
            image(s) (in units of pixels) for training-time augmentation. If set to 0, no such augmentation
            is applied. Note that the input images are assumed to be square for this augmentation.
@@ -131,12 +156,18 @@ class TDMPCConfig:

    def __post_init__(self):
        """Input validation (not exhaustive)."""
-        if self.input_shapes["observation.image"][-2] != self.input_shapes["observation.image"][-1]:
+        # There should only be one image key.
+        image_keys = {k for k in self.input_shapes if k.startswith("observation.image")}
+        if len(image_keys) != 1:
+            raise ValueError(
+                f"{self.__class__.__name__} only handles one image for now. Got image keys {image_keys}."
+            )
+        image_key = next(iter(image_keys))
+        if self.input_shapes[image_key][-2] != self.input_shapes[image_key][-1]:
            # TODO(alexander-soare): This limitation is solely because of code in the random shift
            # augmentation. It should be able to be removed.
            raise ValueError(
-                "Only square images are handled now. Got image shape "
-                f"{self.input_shapes['observation.image']}."
+                f"Only square images are handled now. Got image shape {self.input_shapes[image_key]}."
            )
        if self.n_gaussian_samples <= 0:
            raise ValueError(
--- a/lerobot/common/policies/tdmpc/modeling_tdmpc.py
+++ b/lerobot/common/policies/tdmpc/modeling_tdmpc.py
@@ -1,3 +1,19 @@
+#!/usr/bin/env python
+
+# Copyright 2024 Nicklas Hansen, Xiaolong Wang, Hao Su,
+# and The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
 """Implementation of Finetuning Offline World Models in the Real World.

 The comments in this code may sometimes refer to these references:
@@ -96,13 +112,12 @@ class TDMPCPolicy(nn.Module, PyTorchModelHubMixin):
            config.output_shapes, config.output_normalization_modes, dataset_stats
        )

-    def save(self, fp):
-        """Save state dict of TOLD model to filepath."""
-        torch.save(self.state_dict(), fp)
+        image_keys = [k for k in config.input_shapes if k.startswith("observation.image")]
+        # Note: This check is covered in the post-init of the config but have a sanity check just in case.
+        assert len(image_keys) == 1
+        self.input_image_key = image_keys[0]

-    def load(self, fp):
-        """Load a saved state dict from filepath into current agent."""
-        self.load_state_dict(torch.load(fp))
+        self.reset()

    def reset(self):
        """
@@ -121,10 +136,8 @@ class TDMPCPolicy(nn.Module, PyTorchModelHubMixin):
    @torch.no_grad()
    def select_action(self, batch: dict[str, Tensor]):
        """Select a single action given environment observations."""
-        assert "observation.image" in batch
-        assert "observation.state" in batch
-
        batch = self.normalize_inputs(batch)
+        batch["observation.image"] = batch[self.input_image_key]

        self._queues = populate_queues(self._queues, batch)

@@ -303,13 +316,11 @@ class TDMPCPolicy(nn.Module, PyTorchModelHubMixin):
        device = get_device_from_parameters(self)

        batch = self.normalize_inputs(batch)
+        batch["observation.image"] = batch[self.input_image_key]
        batch = self.normalize_targets(batch)

        info = {}

-        # TODO(alexander-soare): Refactor TDMPC and make it comply with the policy interface documentation.
-        batch_size = batch["index"].shape[0]
-
        # (b, t) -> (t, b)
        for key in batch:
            if batch[key].ndim > 1:
@@ -337,6 +348,7 @@ class TDMPCPolicy(nn.Module, PyTorchModelHubMixin):
        # Run latent rollout using the latent dynamics model and policy model.
        # Note this has shape `horizon+1` because there are `horizon` actions and a current `z`. Each action
        # gives us a next `z`.
+        batch_size = batch["index"].shape[0]
        z_preds = torch.empty(horizon + 1, batch_size, self.config.latent_dim, device=device)
        z_preds[0] = self.model.encode(current_observation)
        reward_preds = torch.empty_like(reward, device=device)
--- a/lerobot/common/policies/utils.py
+++ b/lerobot/common/policies/utils.py
@@ -1,9 +1,28 @@
+#!/usr/bin/env python
+
+# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
 import torch
 from torch import nn


 def populate_queues(queues, batch):
    for key in batch:
+        # Ignore keys not in the queues already (leaving the responsibility to the caller to make sure the
+        # queues have the keys they want).
+        if key not in queues:
+            continue
        if len(queues[key]) != queues[key].maxlen:
            # initialize by copying the first observation several times until the queue is full
            while len(queues[key]) != queues[key].maxlen:
--- a/lerobot/common/utils/import_utils.py
+++ b/lerobot/common/utils/import_utils.py
@@ -1,3 +1,18 @@
+#!/usr/bin/env python
+
+# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
 import importlib
 import logging

--- a/lerobot/common/utils/io_utils.py
+++ b/lerobot/common/utils/io_utils.py
@@ -1,3 +1,18 @@
+#!/usr/bin/env python
+
+# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
 import warnings

 import imageio
--- a/lerobot/common/utils/utils.py
+++ b/lerobot/common/utils/utils.py
@@ -1,8 +1,25 @@
+#!/usr/bin/env python
+
+# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
 import logging
 import os.path as osp
 import random
+from contextlib import contextmanager
 from datetime import datetime
 from pathlib import Path
+from typing import Generator

 import hydra
 import numpy as np
@@ -39,6 +56,31 @@ def set_global_seed(seed):
    torch.cuda.manual_seed_all(seed)


+@contextmanager
+def seeded_context(seed: int) -> Generator[None, None, None]:
+    """Set the seed when entering a context, and restore the prior random state at exit.
+
+    Example usage:
+
+    ```
+    a = random.random()  # produces some random number
+    with seeded_context(1337):
+        b = random.random()  # produces some other random number
+    c = random.random()  # produces yet another random number, but the same it would have if we never made `b`
+    ```
+    """
+    random_state = random.getstate()
+    np_random_state = np.random.get_state()
+    torch_random_state = torch.random.get_rng_state()
+    torch_cuda_random_state = torch.cuda.random.get_rng_state()
+    set_global_seed(seed)
+    yield None
+    random.setstate(random_state)
+    np.random.set_state(np_random_state)
+    torch.random.set_rng_state(torch_random_state)
+    torch.cuda.random.set_rng_state(torch_cuda_random_state)
+
+
 def init_logging():
    def custom_format(record):
        dt = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
--- a/lerobot/configs/default.yaml
+++ b/lerobot/configs/default.yaml
@@ -26,6 +26,8 @@ training:
  save_freq: ???
  log_freq: 250
  save_model: true
+  num_workers: 4
+  batch_size: ???

 eval:
  n_episodes: 1
@@ -35,7 +37,7 @@ eval:
  use_async_envs: false

 wandb:
-  enable: true
+  enable: false
  # Set to true to disable saving an artifact despite save_model == True
  disable_artifact: false
  project: lerobot
--- a/lerobot/configs/env/aloha.yaml
+++ b/lerobot/configs/env/aloha.yaml
@@ -5,10 +5,13 @@ fps: 50
 env:
  name: aloha
  task: AlohaInsertion-v0
-  from_pixels: True
-  pixels_only: False
  image_size: [3, 480, 640]
-  episode_length: 400
-  fps: ${fps}
  state_dim: 14
  action_dim: 14
+  fps: ${fps}
+  episode_length: 400
+  gym:
+    obs_type: pixels_agent_pos
+    render_mode: rgb_array
+    visualization_width: 384
+    visualization_height: 384
--- a/lerobot/configs/env/aloha2_real.yaml
+++ b/lerobot/configs/env/aloha2_real.yaml
@@ -0,0 +1,13 @@
+# @package _global_
+
+fps: 30
+
+env:
+  name: dora
+  task: DoraAloha2-v0
+  state_dim: 14
+  action_dim: 14
+  fps: ${fps}
+  episode_length: 400
+  gym:
+    fps: ${fps}
--- a/lerobot/configs/env/pusht.yaml
+++ b/lerobot/configs/env/pusht.yaml
@@ -5,10 +5,13 @@ fps: 10
 env:
  name: pusht
  task: PushT-v0
-  from_pixels: True
-  pixels_only: False
  image_size: 96
-  episode_length: 300
-  fps: ${fps}
  state_dim: 2
  action_dim: 2
+  fps: ${fps}
+  episode_length: 300
+  gym:
+    obs_type: pixels_agent_pos
+    render_mode: rgb_array
+    visualization_width: 384
+    visualization_height: 384
--- a/lerobot/configs/env/xarm.yaml
+++ b/lerobot/configs/env/xarm.yaml
@@ -5,10 +5,13 @@ fps: 15
 env:
  name: xarm
  task: XarmLift-v0
-  from_pixels: True
-  pixels_only: False
  image_size: 84
-  episode_length: 25
-  fps: ${fps}
  state_dim: 4
  action_dim: 4
+  fps: ${fps}
+  episode_length: 25
+  gym:
+    obs_type: pixels_agent_pos
+    render_mode: rgb_array
+    visualization_width: 384
+    visualization_height: 384
--- a/lerobot/configs/policy/act.yaml
+++ b/lerobot/configs/policy/act.yaml
@@ -3,6 +3,12 @@
 seed: 1000
 dataset_repo_id: lerobot/aloha_sim_insertion_human

+override_dataset_stats:
+  observation.images.top:
+    # stats from imagenet, since we use a pretrained vision model
+    mean: [[[0.485]], [[0.456]], [[0.406]]]  # (c,1,1)
+    std: [[[0.229]], [[0.224]], [[0.225]]]  # (c,1,1)
+
 training:
  offline_steps: 80000
  online_steps: 0
@@ -18,12 +24,6 @@ training:
  grad_clip_norm: 10
  online_steps_between_rollouts: 1

-  override_dataset_stats:
-    observation.images.top:
-      # stats from imagenet, since we use a pretrained vision model
-      mean: [[[0.485]], [[0.456]], [[0.406]]]  # (c,1,1)
-      std: [[[0.229]], [[0.224]], [[0.225]]]  # (c,1,1)
-
  delta_timestamps:
    action: "[i / ${fps} for i in range(${policy.chunk_size})]"

@@ -66,6 +66,9 @@ policy:
  dim_feedforward: 3200
  feedforward_activation: relu
  n_encoder_layers: 4
+  # Note: Although the original ACT implementation has 7 for `n_decoder_layers`, there is a bug in the code
+  # that means only the first layer is used. Here we match the original implementation by setting this to 1.
+  # See this issue https://github.com/tonyzhaozh/act/issues/25#issue-2258740521.
  n_decoder_layers: 1
  # VAE.
  use_vae: true
@@ -73,7 +76,7 @@ policy:
  n_vae_encoder_layers: 4

  # Inference.
-  use_temporal_aggregation: false
+  temporal_ensemble_momentum: null

  # Training and loss computation.
  dropout: 0.1
--- a/lerobot/configs/policy/act_real.yaml
+++ b/lerobot/configs/policy/act_real.yaml
@@ -0,0 +1,101 @@
+# @package _global_
+
+seed: 1000
+dataset_repo_id: cadene/aloha_v2_static_dora_test
+
+override_dataset_stats:
+  observation.images.cam_right_wrist:
+    # stats from imagenet, since we use a pretrained vision model
+    mean: [[[0.485]], [[0.456]], [[0.406]]]  # (c,1,1)
+    std: [[[0.229]], [[0.224]], [[0.225]]]  # (c,1,1)
+  observation.images.cam_left_wrist:
+    # stats from imagenet, since we use a pretrained vision model
+    mean: [[[0.485]], [[0.456]], [[0.406]]]  # (c,1,1)
+    std: [[[0.229]], [[0.224]], [[0.225]]]  # (c,1,1)
+  observation.images.cam_high:
+    # stats from imagenet, since we use a pretrained vision model
+    mean: [[[0.485]], [[0.456]], [[0.406]]]  # (c,1,1)
+    std: [[[0.229]], [[0.224]], [[0.225]]]  # (c,1,1)
+  observation.images.cam_low:
+    # stats from imagenet, since we use a pretrained vision model
+    mean: [[[0.485]], [[0.456]], [[0.406]]]  # (c,1,1)
+    std: [[[0.229]], [[0.224]], [[0.225]]]  # (c,1,1)
+
+training:
+  offline_steps: 80000
+  online_steps: 0
+  eval_freq: -1
+  save_freq: 10000
+  log_freq: 100
+  save_model: true
+
+  batch_size: 8
+  lr: 1e-5
+  lr_backbone: 1e-5
+  weight_decay: 1e-4
+  grad_clip_norm: 10
+  online_steps_between_rollouts: 1
+
+  delta_timestamps:
+    action: "[i / ${fps} for i in range(${policy.chunk_size})]"
+
+eval:
+  n_episodes: 50
+  batch_size: 50
+
+# See `configuration_act.py` for more details.
+policy:
+  name: act
+
+  # Input / output structure.
+  n_obs_steps: 1
+  chunk_size: 100 # chunk_size
+  n_action_steps: 100
+
+  input_shapes:
+    # TODO(rcadene, alexander-soare): add variables for height and width from the dataset/env?
+    observation.images.cam_right_wrist: [3, 480, 640]
+    observation.images.cam_left_wrist: [3, 480, 640]
+    observation.images.cam_high: [3, 480, 640]
+    observation.images.cam_low: [3, 480, 640]
+    observation.state: ["${env.state_dim}"]
+  output_shapes:
+    action: ["${env.action_dim}"]
+
+  # Normalization / Unnormalization
+  input_normalization_modes:
+    observation.images.cam_right_wrist: mean_std
+    observation.images.cam_left_wrist: mean_std
+    observation.images.cam_high: mean_std
+    observation.images.cam_low: mean_std
+    observation.state: mean_std
+  output_normalization_modes:
+    action: mean_std
+
+  # Architecture.
+  # Vision backbone.
+  vision_backbone: resnet18
+  pretrained_backbone_weights: ResNet18_Weights.IMAGENET1K_V1
+  replace_final_stride_with_dilation: false
+  # Transformer layers.
+  pre_norm: false
+  dim_model: 512
+  n_heads: 8
+  dim_feedforward: 3200
+  feedforward_activation: relu
+  n_encoder_layers: 4
+  # Note: Although the original ACT implementation has 7 for `n_decoder_layers`, there is a bug in the code
+  # that means only the first layer is used. Here we match the original implementation by setting this to 1.
+  # See this issue https://github.com/tonyzhaozh/act/issues/25#issue-2258740521.
+  n_decoder_layers: 1
+  # VAE.
+  use_vae: true
+  latent_dim: 32
+  n_vae_encoder_layers: 4
+
+  # Inference.
+  temporal_ensemble_momentum: null
+
+  # Training and loss computation.
+  dropout: 0.1
+  kl_weight: 10.0
--- a/lerobot/configs/policy/diffusion.yaml
+++ b/lerobot/configs/policy/diffusion.yaml
@@ -7,6 +7,20 @@
 seed: 100000
 dataset_repo_id: lerobot/pusht

+override_dataset_stats:
+  # TODO(rcadene, alexander-soare): should we remove image stats as well? do we use a pretrained vision model?
+  observation.image:
+    mean: [[[0.5]], [[0.5]], [[0.5]]]  # (c,1,1)
+    std: [[[0.5]], [[0.5]], [[0.5]]]  # (c,1,1)
+  # TODO(rcadene, alexander-soare): we override state and action stats to use the same as the pretrained model
+  # from the original codebase, but we should remove these and train our own pretrained model
+  observation.state:
+    min: [13.456424, 32.938293]
+    max: [496.14618, 510.9579]
+  action:
+    min: [12.0, 25.0]
+    max: [511.0, 511.0]
+
 training:
  offline_steps: 200000
  online_steps: 0
@@ -34,20 +48,6 @@ eval:
  n_episodes: 50
  batch_size: 50

-override_dataset_stats:
-  # TODO(rcadene, alexander-soare): should we remove image stats as well? do we use a pretrained vision model?
-  observation.image:
-    mean: [[[0.5]], [[0.5]], [[0.5]]]  # (c,1,1)
-    std: [[[0.5]], [[0.5]], [[0.5]]]  # (c,1,1)
-  # TODO(rcadene, alexander-soare): we override state and action stats to use the same as the pretrained model
-  # from the original codebase, but we should remove these and train our own pretrained model
-  observation.state:
-    min: [13.456424, 32.938293]
-    max: [496.14618, 510.9579]
-  action:
-    min: [12.0, 25.0]
-    max: [511.0, 511.0]
-
 policy:
  name: diffusion

@@ -85,6 +85,7 @@ policy:
  diffusion_step_embed_dim: 128
  use_film_scale_modulation: True
  # Noise scheduler.
+  noise_scheduler_type: DDPM
  num_train_timesteps: 100
  beta_schedule: squaredcos_cap_v2
  beta_start: 0.0001
--- a/lerobot/configs/policy/diffusion_real.yaml
+++ b/lerobot/configs/policy/diffusion_real.yaml
@@ -0,0 +1,114 @@
+# @package _global_
+
+# Defaults for training for the PushT dataset as per https://github.com/real-stanford/diffusion_policy.
+# Note: We do not track EMA model weights as we discovered it does not improve the results. See
+#       https://github.com/huggingface/lerobot/pull/134 for more details.
+
+seed: 100000
+dataset_repo_id: lerobot/pusht
+
+override_dataset_stats:
+  observation.images.cam_right_wrist:
+    # stats from imagenet, since we use a pretrained vision model
+    mean: [[[0.485]], [[0.456]], [[0.406]]]  # (c,1,1)
+    std: [[[0.229]], [[0.224]], [[0.225]]]  # (c,1,1)
+  observation.images.cam_left_wrist:
+    # stats from imagenet, since we use a pretrained vision model
+    mean: [[[0.485]], [[0.456]], [[0.406]]]  # (c,1,1)
+    std: [[[0.229]], [[0.224]], [[0.225]]]  # (c,1,1)
+  observation.images.cam_high:
+    # stats from imagenet, since we use a pretrained vision model
+    mean: [[[0.485]], [[0.456]], [[0.406]]]  # (c,1,1)
+    std: [[[0.229]], [[0.224]], [[0.225]]]  # (c,1,1)
+  observation.images.cam_low:
+    # stats from imagenet, since we use a pretrained vision model
+    mean: [[[0.485]], [[0.456]], [[0.406]]]  # (c,1,1)
+    std: [[[0.229]], [[0.224]], [[0.225]]]  # (c,1,1)
+
+training:
+  offline_steps: 200000
+  online_steps: 0
+  eval_freq: -1
+  save_freq: 1000
+  log_freq: 100
+  save_model: true
+
+  batch_size: 64
+  grad_clip_norm: 10
+  lr: 1.0e-4
+  lr_scheduler: cosine
+  lr_warmup_steps: 500
+  adam_betas: [0.95, 0.999]
+  adam_eps: 1.0e-8
+  adam_weight_decay: 1.0e-6
+  online_steps_between_rollouts: 1
+
+  delta_timestamps:
+    observation.images.cam_right_wrist: "[i / ${fps} for i in range(1 - ${policy.n_obs_steps}, 1)]"
+    observation.images.cam_left_wrist: "[i / ${fps} for i in range(1 - ${policy.n_obs_steps}, 1)]"
+    observation.images.cam_high: "[i / ${fps} for i in range(1 - ${policy.n_obs_steps}, 1)]"
+    observation.images.cam_low: "[i / ${fps} for i in range(1 - ${policy.n_obs_steps}, 1)]"
+    observation.state: "[i / ${fps} for i in range(1 - ${policy.n_obs_steps}, 1)]"
+    action: "[i / ${fps} for i in range(1 - ${policy.n_obs_steps}, 1 - ${policy.n_obs_steps} + ${policy.horizon})]"
+
+eval:
+  n_episodes: 50
+  batch_size: 50
+
+policy:
+  name: diffusion
+
+  # Input / output structure.
+  n_obs_steps: 2
+  horizon: 16
+  n_action_steps: 8
+
+  input_shapes:
+    # TODO(rcadene, alexander-soare): add variables for height and width from the dataset/env?
+    observation.images.cam_right_wrist: [3, 480, 640]
+    observation.images.cam_left_wrist: [3, 480, 640]
+    observation.images.cam_high: [3, 480, 640]
+    observation.images.cam_low: [3, 480, 640]
+    observation.state: ["${env.state_dim}"]
+  output_shapes:
+    action: ["${env.action_dim}"]
+
+  # Normalization / Unnormalization
+  input_normalization_modes:
+    observation.images.cam_right_wrist: mean_std
+    observation.images.cam_left_wrist: mean_std
+    observation.images.cam_high: mean_std
+    observation.images.cam_low: mean_std
+    observation.state: mean_std
+  output_normalization_modes:
+    action: mean_std
+
+  # Architecture / modeling.
+  # Vision backbone.
+  vision_backbone: resnet18
+  crop_shape: null
+  crop_is_random: False
+  pretrained_backbone_weights: ResNet18_Weights.IMAGENET1K_V1
+  use_group_norm: False
+  spatial_softmax_num_keypoints: 32
+  # Unet.
+  down_dims: [512, 1024, 2048]
+  kernel_size: 5
+  n_groups: 8
+  diffusion_step_embed_dim: 128
+  use_film_scale_modulation: True
+  # Noise scheduler.
+  noise_scheduler_type: DDPM
+  num_train_timesteps: 100
+  beta_schedule: squaredcos_cap_v2
+  beta_start: 0.0001
+  beta_end: 0.02
+  prediction_type: epsilon # epsilon / sample
+  clip_sample: True
+  clip_sample_range: 1.0
+
+  # Inference
+  num_inference_steps: 100
+
+  # Loss computation
+  do_mask_loss_for_padding: false
--- a/lerobot/configs/policy/tdmpc.yaml
+++ b/lerobot/configs/policy/tdmpc.yaml
@@ -1,7 +1,7 @@
 # @package _global_

 seed: 1
-dataset_repo_id: lerobot/xarm_lift_medium_replay
+dataset_repo_id: lerobot/xarm_lift_medium

 training:
  offline_steps: 25000
--- a/lerobot/scripts/display_sys_info.py
+++ b/lerobot/scripts/display_sys_info.py
@@ -1,3 +1,18 @@
+#!/usr/bin/env python
+
+# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
 import platform

 import huggingface_hub
--- a/lerobot/scripts/eval.py
+++ b/lerobot/scripts/eval.py
@@ -1,3 +1,18 @@
+#!/usr/bin/env python
+
+# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
 """Evaluate a policy on an environment by running rollouts and computing metrics.

 Usage examples:
--- a/lerobot/scripts/push_dataset_to_hub.py
+++ b/lerobot/scripts/push_dataset_to_hub.py
@@ -1,3 +1,18 @@
+#!/usr/bin/env python
+
+# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
 """
 Use this script to convert your dataset into LeRobot dataset format and upload it to the Hugging Face hub,
 or store it locally. LeRobot dataset format is lightweight, fast to load from, and does not require any
@@ -10,7 +25,6 @@ python lerobot/scripts/push_dataset_to_hub.py \
 --dataset-id pusht \
 --raw-format pusht_zarr \
 --community-id lerobot \
--revision v1.2 \
 --dry-run 1 \
 --save-to-disk 1 \
 --save-tests-to-disk 0 \
@@ -21,7 +35,6 @@ python lerobot/scripts/push_dataset_to_hub.py \
 --dataset-id xarm_lift_medium \
 --raw-format xarm_pkl \
 --community-id lerobot \
--revision v1.2 \
 --dry-run 1 \
 --save-to-disk 1 \
 --save-tests-to-disk 0 \
@@ -32,7 +45,6 @@ python lerobot/scripts/push_dataset_to_hub.py \
 --dataset-id aloha_sim_insertion_scripted \
 --raw-format aloha_hdf5 \
 --community-id lerobot \
--revision v1.2 \
 --dry-run 1 \
 --save-to-disk 1 \
 --save-tests-to-disk 0 \
@@ -43,7 +55,6 @@ python lerobot/scripts/push_dataset_to_hub.py \
 --dataset-id umi_cup_in_the_wild \
 --raw-format umi_zarr \
 --community-id lerobot \
--revision v1.2 \
 --dry-run 1 \
 --save-to-disk 1 \
 --save-tests-to-disk 0 \
@@ -73,10 +84,14 @@ def get_from_raw_to_lerobot_format_fn(raw_format):
        from lerobot.common.datasets.push_dataset_to_hub.umi_zarr_format import from_raw_to_lerobot_format
    elif raw_format == "aloha_hdf5":
        from lerobot.common.datasets.push_dataset_to_hub.aloha_hdf5_format import from_raw_to_lerobot_format
+    elif raw_format == "aloha_dora":
+        from lerobot.common.datasets.push_dataset_to_hub.aloha_dora_format import from_raw_to_lerobot_format
    elif raw_format == "xarm_pkl":
        from lerobot.common.datasets.push_dataset_to_hub.xarm_pkl_format import from_raw_to_lerobot_format
    else:
-        raise ValueError(raw_format)
+        raise ValueError(
+            f"The selected {raw_format} can't be found. Did you add it to `lerobot/scripts/push_dataset_to_hub.py::get_from_raw_to_lerobot_format_fn`?"
+        )

    return from_raw_to_lerobot_format

@@ -212,8 +227,7 @@ def push_dataset_to_hub(
        test_hf_dataset = test_hf_dataset.with_format(None)
        test_hf_dataset.save_to_disk(str(tests_out_dir / "train"))

-        # copy meta data to tests directory
-        shutil.copytree(meta_data_dir, tests_meta_data_dir)
+        save_meta_data(info, stats, episode_data_index, tests_meta_data_dir)

        # copy videos of first episode to tests directory
        episode_index = 0
@@ -222,6 +236,10 @@ def push_dataset_to_hub(
            fname = f"{key}_episode_{episode_index:06d}.mp4"
            shutil.copy(videos_dir / fname, tests_videos_dir / fname)

+    if not save_to_disk and out_dir.exists():
+        # remove possible temporary files remaining in the output directory
+        shutil.rmtree(out_dir)
+

 def main():
    parser = argparse.ArgumentParser()
@@ -299,7 +317,7 @@ def main():
    parser.add_argument(
        "--num-workers",
        type=int,
-        default=16,
+        default=8,
        help="Number of processes of Dataloader for computing the dataset statistics.",
    )
    parser.add_argument(
--- a/lerobot/scripts/train.py
+++ b/lerobot/scripts/train.py
@@ -1,3 +1,18 @@
+#!/usr/bin/env python
+
+# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
 import logging
 import time
 from copy import deepcopy
@@ -8,6 +23,7 @@ import hydra
 import torch
 from datasets import concatenate_datasets
 from datasets.utils import disable_progress_bars, enable_progress_bars
+from omegaconf import DictConfig

 from lerobot.common.datasets.factory import make_dataset
 from lerobot.common.datasets.utils import cycle
@@ -72,11 +88,12 @@ def make_optimizer_and_scheduler(cfg, policy):


 def update_policy(policy, batch, optimizer, grad_clip_norm, lr_scheduler=None):
+    """Returns a dictionary of items for logging."""
    start_time = time.time()
    policy.train()
    output_dict = policy.forward(batch)
    # TODO(rcadene): policy.unnormalize_outputs(out_dict)
-    loss = output_dict["loss"]
+    loss = output_dict["loss"].mean()
    loss.backward()
    grad_norm = torch.nn.utils.clip_grad_norm_(
        policy.parameters(),
@@ -99,6 +116,7 @@ def update_policy(policy, batch, optimizer, grad_clip_norm, lr_scheduler=None):
        "grad_norm": float(grad_norm),
        "lr": optimizer.param_groups[0]["lr"],
        "update_s": time.time() - start_time,
+        **{k: v for k, v in output_dict.items() if k != "loss"},
    }

    return info
@@ -122,7 +140,7 @@ def train_notebook(out_dir=None, job_name=None, config_name="default", config_pa
    train(cfg, out_dir=out_dir, job_name=job_name)


-def log_train_info(logger, info, step, cfg, dataset, is_offline):
+def log_train_info(logger: Logger, info, step, cfg, dataset, is_offline):
    loss = info["loss"]
    grad_norm = info["grad_norm"]
    lr = info["lr"]
@@ -290,7 +308,7 @@ def add_episodes_inplace(
    sampler.num_samples = len(concat_dataset)


-def train(cfg: dict, out_dir=None, job_name=None):
+def train(cfg: DictConfig, out_dir: str | None = None, job_name: str | None = None):
    if out_dir is None:
        raise NotImplementedError()
    if job_name is None:
@@ -311,8 +329,12 @@ def train(cfg: dict, out_dir=None, job_name=None):
    logging.info("make_dataset")
    offline_dataset = make_dataset(cfg)

-    logging.info("make_env")
-    eval_env = make_env(cfg)
+    # Create environment used for evaluating checkpoints during training on simulation data.
+    # On real-world data, no need to create an environment as evaluations are done outside train.py,
+    # using the eval.py instead, with gym_dora environment and dora-rs.
+    if cfg.training.eval_freq > 0:
+        logging.info("make_env")
+        eval_env = make_env(cfg)

    logging.info("make_policy")
    policy = make_policy(hydra_cfg=cfg, dataset_stats=offline_dataset.stats)
@@ -338,7 +360,7 @@ def train(cfg: dict, out_dir=None, job_name=None):

    # Note: this helper will be used in offline and online training loops.
    def evaluate_and_checkpoint_if_needed(step):
-        if step % cfg.training.eval_freq == 0:
+        if cfg.training.eval_freq > 0 and step % cfg.training.eval_freq == 0:
            logging.info(f"Eval policy at step {step}")
            eval_info = eval_policy(
                eval_env,
@@ -368,7 +390,7 @@ def train(cfg: dict, out_dir=None, job_name=None):
    # create dataloader for offline training
    dataloader = torch.utils.data.DataLoader(
        offline_dataset,
-        num_workers=4,
+        num_workers=cfg.training.num_workers,
        batch_size=cfg.training.batch_size,
        shuffle=True,
        pin_memory=cfg.device != "cpu",
@@ -399,6 +421,13 @@ def train(cfg: dict, out_dir=None, job_name=None):

        step += 1

+    logging.info("End of offline training")
+
+    if cfg.training.online_steps == 0:
+        if cfg.training.eval_freq > 0:
+            eval_env.close()
+        return
+
    # create an env dedicated to online episodes collection from policy rollout
    online_training_env = make_env(cfg, n_envs=1)

@@ -468,9 +497,10 @@ def train(cfg: dict, out_dir=None, job_name=None):
            step += 1
            online_step += 1

+    logging.info("End of online training")
+
    eval_env.close()
    online_training_env.close()
-    logging.info("End of training")


 if __name__ == "__main__":
--- a/lerobot/scripts/visualize_dataset.py
+++ b/lerobot/scripts/visualize_dataset.py
@@ -1,3 +1,18 @@
+#!/usr/bin/env python
+
+# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
 """ Visualize data of **all** frames of any episode of a dataset of type LeRobotDataset.

 Note: The last frame of the episode doesnt always correspond to a final state.
@@ -15,47 +30,46 @@ Examples:
 - Visualize data stored on a local machine:
 ```
 local$ python lerobot/scripts/visualize_dataset.py \
-    --repo-id lerobot/pusht \
-    --episode-index 0
+    --repo-id lerobot/pusht
+
+local$ open http://localhost:9090
 ```

 - Visualize data stored on a distant machine with a local viewer:
 ```
 distant$ python lerobot/scripts/visualize_dataset.py \
+    --repo-id lerobot/pusht
+
+local$ ssh -L 9090:localhost:9090 distant  # create a ssh tunnel
+local$ open http://localhost:9090
+```
+
+- Select episodes to visualize:
+```
+python lerobot/scripts/visualize_dataset.py \
    --repo-id lerobot/pusht \
-    --episode-index 0 \
-    --save 1 \
-    --output-dir path/to/directory
-
-local$ scp distant:path/to/directory/lerobot_pusht_episode_0.rrd .
-local$ rerun lerobot_pusht_episode_0.rrd
+    --episode-indices 7 3 5 1 4
 ```
-
- Visualize data stored on a distant machine through streaming:
-(You need to forward the websocket port to the distant machine, with
-`ssh -L 9087:localhost:9087 username@remote-host`)
-```
-distant$ python lerobot/scripts/visualize_dataset.py \
-    --repo-id lerobot/pusht \
-    --episode-index 0 \
-    --mode distant \
-    --ws-port 9087
-
-local$ rerun ws://localhost:9087
-```
-
 """

 import argparse
+import http.server
 import logging
-import time
+import os
+import shutil
+import socketserver
 from pathlib import Path

-import rerun as rr
 import torch
 import tqdm
+import yaml
+from bs4 import BeautifulSoup
+from huggingface_hub import snapshot_download
+from safetensors.torch import load_file, save_file

 from lerobot.common.datasets.lerobot_dataset import LeRobotDataset
+from lerobot.common.policies.act.modeling_act import ACTPolicy
+from lerobot.common.utils.utils import init_logging


 class EpisodeSampler(torch.utils.data.Sampler):
@@ -71,33 +85,307 @@ class EpisodeSampler(torch.utils.data.Sampler):
        return len(self.frame_ids)


-def to_hwc_uint8_numpy(chw_float32_torch):
-    assert chw_float32_torch.dtype == torch.float32
-    assert chw_float32_torch.ndim == 3
-    c, h, w = chw_float32_torch.shape
-    assert c < h and c < w, f"expect channel first images, but instead {chw_float32_torch.shape}"
-    hwc_uint8_numpy = (chw_float32_torch * 255).type(torch.uint8).permute(1, 2, 0).numpy()
-    return hwc_uint8_numpy
+class NoCacheHTTPRequestHandler(http.server.SimpleHTTPRequestHandler):
+    def end_headers(self):
+        self.send_header("Cache-Control", "no-store, no-cache, must-revalidate")
+        self.send_header("Pragma", "no-cache")
+        self.send_header("Expires", "0")
+        super().end_headers()


-def visualize_dataset(
-    repo_id: str,
-    episode_index: int,
-    batch_size: int = 32,
-    num_workers: int = 0,
-    mode: str = "local",
-    web_port: int = 9090,
-    ws_port: int = 9087,
-    save: bool = False,
-    output_dir: Path | None = None,
-) -> Path | None:
-    if save:
-        assert (
-            output_dir is not None
-        ), "Set an output directory where to write .rrd files with `--output-dir path/to/directory`."
+def run_server(path, port):
+    # Change directory to serve 'index.html` as front page
+    os.chdir(path)

-    logging.info("Loading dataset")
-    dataset = LeRobotDataset(repo_id)
+    with socketserver.TCPServer(("", port), NoCacheHTTPRequestHandler) as httpd:
+        logging.info(f"Serving HTTP on 0.0.0.0 port {port} (http://0.0.0.0:{port}/) ...")
+        httpd.serve_forever()
+
+
+def create_html_page(page_title: str):
+    """Create a html page with beautiful soop with default doctype, meta, header and title."""
+    soup = BeautifulSoup("", "html.parser")
+
+    doctype = soup.new_tag("!DOCTYPE html")
+    soup.append(doctype)
+
+    html = soup.new_tag("html", lang="en")
+    soup.append(html)
+
+    head = soup.new_tag("head")
+    html.append(head)
+
+    meta_charset = soup.new_tag("meta", charset="UTF-8")
+    head.append(meta_charset)
+
+    meta_viewport = soup.new_tag(
+        "meta", attrs={"name": "viewport", "content": "width=device-width, initial-scale=1.0"}
+    )
+    head.append(meta_viewport)
+
+    title = soup.new_tag("title")
+    title.string = page_title
+    head.append(title)
+
+    body = soup.new_tag("body")
+    html.append(body)
+
+    main_div = soup.new_tag("div")
+    body.append(main_div)
+    return soup, head, body
+
+
+def write_episode_data_csv(output_dir, file_name, episode_index, dataset, inference_results=None):
+    """Write a csv file containg timeseries data of an episode (e.g. state and action).
+    This file will be loaded by Dygraph javascript to plot data in real time."""
+    from_idx = dataset.episode_data_index["from"][episode_index]
+    to_idx = dataset.episode_data_index["to"][episode_index]
+
+    has_state = "observation.state" in dataset.hf_dataset.features
+    has_action = "action" in dataset.hf_dataset.features
+    has_inference = inference_results is not None
+
+    # init header of csv with state and action names
+    header = ["timestamp"]
+    if has_state:
+        dim_state = len(dataset.hf_dataset["observation.state"][0])
+        header += [f"state_{i}" for i in range(dim_state)]
+    if has_action:
+        dim_action = len(dataset.hf_dataset["action"][0])
+        header += [f"action_{i}" for i in range(dim_action)]
+    if has_inference:
+        assert "actions" in inference_results
+        assert "loss" in inference_results
+        dim_pred_action = inference_results["actions"].shape[2]
+        header += [f"pred_action_{i}" for i in range(dim_pred_action)]
+        header += ["loss"]
+
+    columns = ["timestamp"]
+    if has_state:
+        columns += ["observation.state"]
+    if has_action:
+        columns += ["action"]
+
+    rows = []
+    data = dataset.hf_dataset.select_columns(columns)
+    for i in range(from_idx, to_idx):
+        row = [data[i]["timestamp"].item()]
+        if has_state:
+            row += data[i]["observation.state"].tolist()
+        if has_action:
+            row += data[i]["action"].tolist()
+        rows.append(row)
+
+    if has_inference:
+        num_frames = len(rows)
+        assert num_frames == inference_results["actions"].shape[0]
+        assert num_frames == inference_results["loss"].shape[0]
+        for i in range(num_frames):
+            rows[i] += inference_results["actions"][i, 0].tolist()
+            rows[i] += [inference_results["loss"][i].item()]
+
+    output_dir.mkdir(parents=True, exist_ok=True)
+    with open(output_dir / file_name, "w") as f:
+        f.write(",".join(header) + "\n")
+        for row in rows:
+            row_str = [str(col) for col in row]
+            f.write(",".join(row_str) + "\n")
+
+
+def write_episode_data_js(output_dir, file_name, ep_csv_fname, dataset):
+    """Write a javascript file containing logic to synchronize camera feeds and timeseries."""
+    s = ""
+    s += "document.addEventListener('DOMContentLoaded', function () {\n"
+    for i, key in enumerate(dataset.video_frame_keys):
+        s += f"  const video{i} = document.getElementById('video_{key}');\n"
+    s += "  const slider = document.getElementById('videoControl');\n"
+    s += "  const playButton = document.getElementById('playButton');\n"
+    s += f"  const dygraph = new Dygraph(document.getElementById('graph'), '{ep_csv_fname}', " + "{\n"
+    s += "    pixelsPerPoint: 0.01,\n"
+    s += "    legend: 'always',\n"
+    s += "    labelsDiv: document.getElementById('labels'),\n"
+    s += "    labelsSeparateLines: true,\n"
+    s += "    labelsKMB: true,\n"
+    s += "    highlightCircleSize: 1.5,\n"
+    s += "    highlightSeriesOpts: {\n"
+    s += "        strokeWidth: 1.5,\n"
+    s += "        strokeBorderWidth: 1,\n"
+    s += "        highlightCircleSize: 3\n"
+    s += "    }\n"
+    s += "  });\n"
+    s += "\n"
+    s += "  // Function to play both videos\n"
+    s += "  playButton.addEventListener('click', function () {\n"
+    for i in range(len(dataset.video_frame_keys)):
+        s += f"    video{i}.play();\n"
+    s += "    // playButton.disabled = true; // Optional: disable button after playing\n"
+    s += "  });\n"
+    s += "\n"
+    s += "  // Update the video time when the slider value changes\n"
+    s += "  slider.addEventListener('input', function () {\n"
+    s += "    const sliderValue = slider.value;\n"
+    for i in range(len(dataset.video_frame_keys)):
+        s += f"    const time{i} = (video{i}.duration * sliderValue) / 100;\n"
+    for i in range(len(dataset.video_frame_keys)):
+        s += f"    video{i}.currentTime = time{i};\n"
+    s += "  });\n"
+    s += "\n"
+    s += "  // Synchronize slider with the video's current time\n"
+    s += "  const syncSlider = (video) => {\n"
+    s += "    video.addEventListener('timeupdate', function () {\n"
+    s += "      if (video.duration) {\n"
+    s += "        const pc = (100 / video.duration) * video.currentTime;\n"
+    s += "        slider.value = pc;\n"
+    s += "        const index = Math.floor(pc * dygraph.numRows() / 100);\n"
+    s += "        dygraph.setSelection(index, undefined, true, true);\n"
+    s += "      }\n"
+    s += "    });\n"
+    s += "  };\n"
+    s += "\n"
+    for i in range(len(dataset.video_frame_keys)):
+        s += f"  syncSlider(video{i});\n"
+    s += "\n"
+    s += "});\n"
+
+    output_dir.mkdir(parents=True, exist_ok=True)
+    with open(output_dir / file_name, "w", encoding="utf-8") as f:
+        f.write(s)
+
+
+def write_episode_data_html(output_dir, file_name, js_fname, ep_index, dataset):
+    """Write an html file containg video feeds and timeseries associated to an episode."""
+    soup, head, body = create_html_page("")
+
+    css_style = soup.new_tag("style")
+    css_style.string = ""
+    css_style.string += "#labels > span.highlight {\n"
+    css_style.string += "  border: 1px solid grey;\n"
+    css_style.string += "}"
+    head.append(css_style)
+
+    # Add videos from camera feeds
+
+    videos_control_div = soup.new_tag("div")
+    body.append(videos_control_div)
+
+    videos_div = soup.new_tag("div")
+    videos_control_div.append(videos_div)
+
+    def create_video(id, src):
+        video = soup.new_tag("video", id=id, width="320", height="240", controls="")
+        source = soup.new_tag("source", src=src, type="video/mp4")
+        video.string = "Your browser does not support the video tag."
+        video.append(source)
+        return video
+
+    # get first frame of episode (hack to get video_path of the episode)
+    first_frame_idx = dataset.episode_data_index["from"][ep_index].item()
+
+    for key in dataset.video_frame_keys:
+        # Example of video_path: 'videos/observation.image_episode_000004.mp4'
+        video_path = dataset.hf_dataset.select_columns(key)[first_frame_idx][key]["path"]
+        videos_div.append(create_video(f"video_{key}", video_path))
+
+    # Add controls for videos and graph
+
+    control_div = soup.new_tag("div")
+    videos_control_div.append(control_div)
+
+    button_div = soup.new_tag("div")
+    control_div.append(button_div)
+
+    button = soup.new_tag("button", id="playButton")
+    button.string = "Play Videos"
+    button_div.append(button)
+
+    slider_div = soup.new_tag("div")
+    control_div.append(slider_div)
+
+    slider = soup.new_tag("input", type="range", id="videoControl", min="0", max="100", value="0", step="1")
+    control_div.append(slider)
+
+    # Add graph of states/actions, and its labels
+
+    graph_labels_div = soup.new_tag("div", style="display: flex;")
+    body.append(graph_labels_div)
+
+    graph_div = soup.new_tag("div", id="graph", style="flex: 1; width: 85%")
+    graph_labels_div.append(graph_div)
+
+    labels_div = soup.new_tag("div", id="labels", style="flex: 1; width: 15%")
+    graph_labels_div.append(labels_div)
+
+    # add dygraph library
+    script = soup.new_tag("script", type="text/javascript", src=js_fname)
+    body.append(script)
+
+    script_dygraph = soup.new_tag(
+        "script",
+        type="text/javascript",
+        src="https://cdn.jsdelivr.net/npm/dygraphs@2.1.0/dist/dygraph.min.js",
+    )
+    body.append(script_dygraph)
+
+    link_dygraph = soup.new_tag(
+        "link", rel="stylesheet", href="https://cdn.jsdelivr.net/npm/dygraphs@2.1.0/dist/dygraph.min.css"
+    )
+    body.append(link_dygraph)
+
+    # Write as a html file
+
+    output_dir.mkdir(parents=True, exist_ok=True)
+    with open(output_dir / file_name, "w", encoding="utf-8") as f:
+        f.write(soup.prettify())
+
+
+def write_episodes_list_html(output_dir, file_name, ep_indices, ep_html_fnames, dataset):
+    """Write an html file containing information related to the dataset and a list of links to
+    html pages of episodes."""
+    soup, head, body = create_html_page("TODO")
+
+    h3 = soup.new_tag("h3")
+    h3.string = "TODO"
+    body.append(h3)
+
+    ul_info = soup.new_tag("ul")
+    body.append(ul_info)
+
+    li_info = soup.new_tag("li")
+    li_info.string = f"Number of samples/frames: {dataset.num_samples}"
+    ul_info.append(li_info)
+
+    li_info = soup.new_tag("li")
+    li_info.string = f"Number of episodes: {dataset.num_episodes}"
+    ul_info.append(li_info)
+
+    li_info = soup.new_tag("li")
+    li_info.string = f"Frames per second: {dataset.fps}"
+    ul_info.append(li_info)
+
+    # li_info = soup.new_tag("li")
+    # li_info.string = f"Size: {format_big_number(dataset.hf_dataset.info.size_in_bytes)}B"
+    # ul_info.append(li_info)
+
+    ul = soup.new_tag("ul")
+    body.append(ul)
+
+    for ep_idx, ep_html_fname in zip(ep_indices, ep_html_fnames, strict=False):
+        li = soup.new_tag("li")
+        ul.append(li)
+
+        a = soup.new_tag("a", href=ep_html_fname)
+        a.string = f"Episode number {ep_idx}"
+
+        li.append(a)
+
+    output_dir.mkdir(parents=True, exist_ok=True)
+    with open(output_dir / file_name, "w", encoding="utf-8") as f:
+        f.write(soup.prettify())
+
+
+def run_inference(dataset, episode_index, policy, num_workers=4, batch_size=32, device="cuda"):
+    policy.eval()
+    policy.to(device)

    logging.info("Loading dataloader")
    episode_sampler = EpisodeSampler(dataset, episode_index)
@@ -108,68 +396,104 @@ def visualize_dataset(
        sampler=episode_sampler,
    )

-    logging.info("Starting Rerun")
-
-    if mode not in ["local", "distant"]:
-        raise ValueError(mode)
-
-    spawn_local_viewer = mode == "local" and not save
-    rr.init(f"{repo_id}/episode_{episode_index}", spawn=spawn_local_viewer)
-    if mode == "distant":
-        rr.serve(open_browser=False, web_port=web_port, ws_port=ws_port)
-
-    logging.info("Logging to Rerun")
-
-    if num_workers > 0:
-        # TODO(rcadene): fix data workers hanging when `rr.init` is called
-        logging.warning("If data loader is hanging, try `--num-workers 0`.")
-
+    logging.info("Running inference")
+    inference_results = {}
    for batch in tqdm.tqdm(dataloader, total=len(dataloader)):
-        # iterate over the batch
-        for i in range(len(batch["index"])):
-            rr.set_time_sequence("frame_index", batch["frame_index"][i].item())
-            rr.set_time_seconds("timestamp", batch["timestamp"][i].item())
+        batch = {k: v.to(device, non_blocking=True) for k, v in batch.items()}
+        with torch.inference_mode():
+            output_dict = policy.forward(batch)

-            # display each camera image
-            for key in dataset.camera_keys:
-                # TODO(rcadene): add `.compress()`? is it lossless?
-                rr.log(key, rr.Image(to_hwc_uint8_numpy(batch[key][i])))
+        for key in output_dict:
+            if key not in inference_results:
+                inference_results[key] = []
+            inference_results[key].append(output_dict[key].to("cpu"))

-            # display each dimension of action space (e.g. actuators command)
-            if "action" in batch:
-                for dim_idx, val in enumerate(batch["action"][i]):
-                    rr.log(f"action/{dim_idx}", rr.Scalar(val.item()))
+    for key in inference_results:
+        inference_results[key] = torch.cat(inference_results[key])

-            # display each dimension of observed state space (e.g. agent position in joint space)
-            if "observation.state" in batch:
-                for dim_idx, val in enumerate(batch["observation.state"][i]):
-                    rr.log(f"state/{dim_idx}", rr.Scalar(val.item()))
+    return inference_results

-            if "next.done" in batch:
-                rr.log("next.done", rr.Scalar(batch["next.done"][i].item()))

-            if "next.reward" in batch:
-                rr.log("next.reward", rr.Scalar(batch["next.reward"][i].item()))
+def visualize_dataset(
+    repo_id: str,
+    episode_indices: list[int] = None,
+    output_dir: Path | None = None,
+    serve: bool = True,
+    port: int = 9090,
+    force_overwrite: bool = True,
+    policy_repo_id: str | None = None,
+    policy_ckpt_path: Path | None = None,
+    batch_size: int = 32,
+    num_workers: int = 4,
+) -> Path | None:
+    init_logging()

-            if "next.success" in batch:
-                rr.log("next.success", rr.Scalar(batch["next.success"][i].item()))
+    has_policy = policy_repo_id or policy_ckpt_path

-    if mode == "local" and save:
-        # save .rrd locally
-        output_dir = Path(output_dir)
-        output_dir.mkdir(parents=True, exist_ok=True)
-        repo_id_str = repo_id.replace("/", "_")
-        rrd_path = output_dir / f"{repo_id_str}_episode_{episode_index}.rrd"
-        rr.save(rrd_path)
-        return rrd_path
+    if has_policy:
+        logging.info("Loading policy")
+        if policy_repo_id:
+            pretrained_policy_path = Path(snapshot_download(policy_repo_id))
+        elif policy_ckpt_path:
+            pretrained_policy_path = Path(policy_ckpt_path)
+        policy = ACTPolicy.from_pretrained(pretrained_policy_path)
+        with open(pretrained_policy_path / "config.yaml") as f:
+            cfg = yaml.safe_load(f)
+        delta_timestamps = cfg["training"]["delta_timestamps"]
+    else:
+        delta_timestamps = None

-    elif mode == "distant":
-        # stop the process from exiting since it is serving the websocket connection
-        try:
-            while True:
-                time.sleep(1)
-        except KeyboardInterrupt:
-            print("Ctrl-C received. Exiting.")
+    logging.info("Loading dataset")
+    dataset = LeRobotDataset(repo_id, delta_timestamps=delta_timestamps)
+
+    if not dataset.video:
+        raise NotImplementedError(f"Image datasets ({dataset.video=}) are currently not supported.")
+
+    if output_dir is None:
+        output_dir = f"outputs/visualize_dataset/{repo_id}"
+
+    output_dir = Path(output_dir)
+    if force_overwrite and output_dir.exists():
+        shutil.rmtree(output_dir)
+    output_dir.mkdir(parents=True, exist_ok=True)
+
+    # Create a simlink from the dataset video folder containg mp4 files to the output directory
+    # so that the http server can get access to the mp4 files.
+    ln_videos_dir = output_dir / "videos"
+    if not ln_videos_dir.exists():
+        ln_videos_dir.symlink_to(dataset.videos_dir.resolve())
+
+    if episode_indices is None:
+        episode_indices = list(range(dataset.num_episodes))
+
+    logging.info("Writing html")
+    ep_html_fnames = []
+    for episode_index in tqdm.tqdm(episode_indices):
+        inference_results = None
+        if has_policy:
+            inference_results_path = output_dir / f"episode_{episode_index}.safetensors"
+            if inference_results_path.exists():
+                inference_results = load_file(inference_results_path)
+            else:
+                inference_results = run_inference(dataset, episode_index, policy)
+                save_file(inference_results, inference_results_path)
+
+        # write states and actions in a csv
+        ep_csv_fname = f"episode_{episode_index}.csv"
+        write_episode_data_csv(output_dir, ep_csv_fname, episode_index, dataset, inference_results)
+
+        js_fname = f"episode_{episode_index}.js"
+        write_episode_data_js(output_dir, js_fname, ep_csv_fname, dataset)
+
+        # write a html page to view videos and timeseries
+        ep_html_fname = f"episode_{episode_index}.html"
+        write_episode_data_html(output_dir, ep_html_fname, js_fname, episode_index, dataset)
+        ep_html_fnames.append(ep_html_fname)
+
+    write_episodes_list_html(output_dir, "index.html", episode_indices, ep_html_fnames, dataset)
+
+    if serve:
+        run_server(output_dir, port)


 def main():
@@ -179,13 +503,51 @@ def main():
        "--repo-id",
        type=str,
        required=True,
-        help="Name of hugging face repositery containing a LeRobotDataset dataset (e.g. `lerobot/pusht`).",
+        help="Name of hugging face repositery containing a LeRobotDataset dataset (e.g. `lerobot/pusht` for https://huggingface.co/datasets/lerobot/pusht).",
    )
    parser.add_argument(
-        "--episode-index",
+        "--episode-indices",
        type=int,
-        required=True,
-        help="Episode to visualize.",
+        nargs="*",
+        default=None,
+        help="Episode indices to visualize (e.g. `0 1 5 6` to load episodes of index 0, 1, 5 and 6). By default loads all episodes.",
+    )
+    parser.add_argument(
+        "--output-dir",
+        type=str,
+        default=None,
+        help="Directory path to write html files and kickoff a web server. By default write them to 'outputs/visualize_dataset/REPO_ID'.",
+    )
+    parser.add_argument(
+        "--serve",
+        type=int,
+        default=1,
+        help="Launch web server.",
+    )
+    parser.add_argument(
+        "--port",
+        type=int,
+        default=9090,
+        help="Web port used by the http server.",
+    )
+    parser.add_argument(
+        "--force-overwrite",
+        type=int,
+        default=1,
+        help="Delete the output directory if it exists already.",
+    )
+
+    parser.add_argument(
+        "--policy-repo-id",
+        type=str,
+        default=None,
+        help="Name of hugging face repositery containing a pretrained policy (e.g. `lerobot/diffusion_pusht` for https://huggingface.co/lerobot/diffusion_pusht).",
+    )
+    parser.add_argument(
+        "--policy-ckpt-path",
+        type=str,
+        default=None,
+        help="Name of hugging face repositery containing a pretrained policy (e.g. `lerobot/diffusion_pusht` for https://huggingface.co/lerobot/diffusion_pusht).",
    )
    parser.add_argument(
        "--batch-size",
@@ -196,46 +558,9 @@ def main():
    parser.add_argument(
        "--num-workers",
        type=int,
-        default=0,
+        default=4,
        help="Number of processes of Dataloader for loading the data.",
    )
-    parser.add_argument(
-        "--mode",
-        type=str,
-        default="local",
-        help=(
-            "Mode of viewing between 'local' or 'distant'. "
-            "'local' requires data to be on a local machine. It spawns a viewer to visualize the data locally. "
-            "'distant' creates a server on the distant machine where the data is stored. Visualize the data by connecting to the server with `rerun ws://localhost:PORT` on the local machine."
-        ),
-    )
-    parser.add_argument(
-        "--web-port",
-        type=int,
-        default=9090,
-        help="Web port for rerun.io when `--mode distant` is set.",
-    )
-    parser.add_argument(
-        "--ws-port",
-        type=int,
-        default=9087,
-        help="Web socket port for rerun.io when `--mode distant` is set.",
-    )
-    parser.add_argument(
-        "--save",
-        type=int,
-        default=0,
-        help=(
-            "Save a .rrd file in the directory provided by `--output-dir`. "
-            "It also deactivates the spawning of a viewer. ",
-            "Visualize the data by running `rerun path/to/file.rrd` on your local machine.",
-        ),
-    )
-    parser.add_argument(
-        "--output-dir",
-        type=str,
-        help="Directory path to write a .rrd file when `--save 1` is set.",
-    )

    args = parser.parse_args()
    visualize_dataset(**vars(args))
--- a/lerobot/scripts/visualize_dataset_rerun.py
+++ b/lerobot/scripts/visualize_dataset_rerun.py
@@ -0,0 +1,263 @@
+#!/usr/bin/env python
+
+# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" Visualize data of **all** frames of any episode of a dataset of type LeRobotDataset.
+
+Note: The last frame of the episode doesnt always correspond to a final state.
+That's because our datasets are composed of transition from state to state up to
+the antepenultimate state associated to the ultimate action to arrive in the final state.
+However, there might not be a transition from a final state to another state.
+
+Note: This script aims to visualize the data used to train the neural networks.
+~What you see is what you get~. When visualizing image modality, it is often expected to observe
+lossly compression artifacts since these images have been decoded from compressed mp4 videos to
+save disk space. The compression factor applied has been tuned to not affect success rate.
+
+Examples:
+
+- Visualize data stored on a local machine:
+```
+local$ python lerobot/scripts/visualize_dataset.py \
+    --repo-id lerobot/pusht \
+    --episode-index 0
+```
+
+- Visualize data stored on a distant machine with a local viewer:
+```
+distant$ python lerobot/scripts/visualize_dataset.py \
+    --repo-id lerobot/pusht \
+    --episode-index 0 \
+    --save 1 \
+    --output-dir path/to/directory
+
+local$ scp distant:path/to/directory/lerobot_pusht_episode_0.rrd .
+local$ rerun lerobot_pusht_episode_0.rrd
+```
+
+- Visualize data stored on a distant machine through streaming:
+(You need to forward the websocket port to the distant machine, with
+`ssh -L 9087:localhost:9087 username@remote-host`)
+```
+distant$ python lerobot/scripts/visualize_dataset.py \
+    --repo-id lerobot/pusht \
+    --episode-index 0 \
+    --mode distant \
+    --ws-port 9087
+
+local$ rerun ws://localhost:9087
+```
+
+"""
+
+import argparse
+import gc
+import logging
+import time
+from pathlib import Path
+
+import rerun as rr
+import torch
+import tqdm
+
+from lerobot.common.datasets.lerobot_dataset import LeRobotDataset
+
+
+class EpisodeSampler(torch.utils.data.Sampler):
+    def __init__(self, dataset, episode_index):
+        from_idx = dataset.episode_data_index["from"][episode_index].item()
+        to_idx = dataset.episode_data_index["to"][episode_index].item()
+        self.frame_ids = range(from_idx, to_idx)
+
+    def __iter__(self):
+        return iter(self.frame_ids)
+
+    def __len__(self):
+        return len(self.frame_ids)
+
+
+def to_hwc_uint8_numpy(chw_float32_torch):
+    assert chw_float32_torch.dtype == torch.float32
+    assert chw_float32_torch.ndim == 3
+    c, h, w = chw_float32_torch.shape
+    assert c < h and c < w, f"expect channel first images, but instead {chw_float32_torch.shape}"
+    hwc_uint8_numpy = (chw_float32_torch * 255).type(torch.uint8).permute(1, 2, 0).numpy()
+    return hwc_uint8_numpy
+
+
+def visualize_dataset(
+    repo_id: str,
+    episode_index: int,
+    batch_size: int = 32,
+    num_workers: int = 0,
+    mode: str = "local",
+    web_port: int = 9090,
+    ws_port: int = 9087,
+    save: bool = False,
+    output_dir: Path | None = None,
+) -> Path | None:
+    if save:
+        assert (
+            output_dir is not None
+        ), "Set an output directory where to write .rrd files with `--output-dir path/to/directory`."
+
+    logging.info("Loading dataset")
+    dataset = LeRobotDataset(repo_id)
+
+    logging.info("Loading dataloader")
+    episode_sampler = EpisodeSampler(dataset, episode_index)
+    dataloader = torch.utils.data.DataLoader(
+        dataset,
+        num_workers=num_workers,
+        batch_size=batch_size,
+        sampler=episode_sampler,
+    )
+
+    logging.info("Starting Rerun")
+
+    if mode not in ["local", "distant"]:
+        raise ValueError(mode)
+
+    spawn_local_viewer = mode == "local" and not save
+    rr.init(f"{repo_id}/episode_{episode_index}", spawn=spawn_local_viewer)
+
+    # Manually call python garbage collector after `rr.init` to avoid hanging in a blocking flush
+    # when iterating on a dataloader with `num_workers` > 0
+    # TODO(rcadene): remove `gc.collect` when rerun version 0.16 is out, which includes a fix
+    gc.collect()
+
+    if mode == "distant":
+        rr.serve(open_browser=False, web_port=web_port, ws_port=ws_port)
+
+    logging.info("Logging to Rerun")
+
+    for batch in tqdm.tqdm(dataloader, total=len(dataloader)):
+        # iterate over the batch
+        for i in range(len(batch["index"])):
+            rr.set_time_sequence("frame_index", batch["frame_index"][i].item())
+            rr.set_time_seconds("timestamp", batch["timestamp"][i].item())
+
+            # display each camera image
+            for key in dataset.camera_keys:
+                # TODO(rcadene): add `.compress()`? is it lossless?
+                rr.log(key, rr.Image(to_hwc_uint8_numpy(batch[key][i])))
+
+            # display each dimension of action space (e.g. actuators command)
+            if "action" in batch:
+                for dim_idx, val in enumerate(batch["action"][i]):
+                    rr.log(f"action/{dim_idx}", rr.Scalar(val.item()))
+
+            # display each dimension of observed state space (e.g. agent position in joint space)
+            if "observation.state" in batch:
+                for dim_idx, val in enumerate(batch["observation.state"][i]):
+                    rr.log(f"state/{dim_idx}", rr.Scalar(val.item()))
+
+            if "next.done" in batch:
+                rr.log("next.done", rr.Scalar(batch["next.done"][i].item()))
+
+            if "next.reward" in batch:
+                rr.log("next.reward", rr.Scalar(batch["next.reward"][i].item()))
+
+            if "next.success" in batch:
+                rr.log("next.success", rr.Scalar(batch["next.success"][i].item()))
+
+    if mode == "local" and save:
+        # save .rrd locally
+        output_dir = Path(output_dir)
+        output_dir.mkdir(parents=True, exist_ok=True)
+        repo_id_str = repo_id.replace("/", "_")
+        rrd_path = output_dir / f"{repo_id_str}_episode_{episode_index}.rrd"
+        rr.save(rrd_path)
+        return rrd_path
+
+    elif mode == "distant":
+        # stop the process from exiting since it is serving the websocket connection
+        try:
+            while True:
+                time.sleep(1)
+        except KeyboardInterrupt:
+            print("Ctrl-C received. Exiting.")
+
+
+def main():
+    parser = argparse.ArgumentParser()
+
+    parser.add_argument(
+        "--repo-id",
+        type=str,
+        required=True,
+        help="Name of hugging face repositery containing a LeRobotDataset dataset (e.g. `lerobot/pusht`).",
+    )
+    parser.add_argument(
+        "--episode-index",
+        type=int,
+        required=True,
+        help="Episode to visualize.",
+    )
+    parser.add_argument(
+        "--batch-size",
+        type=int,
+        default=32,
+        help="Batch size loaded by DataLoader.",
+    )
+    parser.add_argument(
+        "--num-workers",
+        type=int,
+        default=4,
+        help="Number of processes of Dataloader for loading the data.",
+    )
+    parser.add_argument(
+        "--mode",
+        type=str,
+        default="local",
+        help=(
+            "Mode of viewing between 'local' or 'distant'. "
+            "'local' requires data to be on a local machine. It spawns a viewer to visualize the data locally. "
+            "'distant' creates a server on the distant machine where the data is stored. Visualize the data by connecting to the server with `rerun ws://localhost:PORT` on the local machine."
+        ),
+    )
+    parser.add_argument(
+        "--web-port",
+        type=int,
+        default=9090,
+        help="Web port for rerun.io when `--mode distant` is set.",
+    )
+    parser.add_argument(
+        "--ws-port",
+        type=int,
+        default=9087,
+        help="Web socket port for rerun.io when `--mode distant` is set.",
+    )
+    parser.add_argument(
+        "--save",
+        type=int,
+        default=0,
+        help=(
+            "Save a .rrd file in the directory provided by `--output-dir`. "
+            "It also deactivates the spawning of a viewer. ",
+            "Visualize the data by running `rerun path/to/file.rrd` on your local machine.",
+        ),
+    )
+    parser.add_argument(
+        "--output-dir",
+        type=str,
+        help="Directory path to write a .rrd file when `--save 1` is set.",
+    )
+
+    args = parser.parse_args()
+    visualize_dataset(**vars(args))
+
+
+if __name__ == "__main__":
+    main()
--- a/poetry.lock
+++ b/poetry.lock
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -28,40 +28,41 @@ packages = [{include = "lerobot"}]

 [tool.poetry.dependencies]
 python = ">=3.10,<3.13"
-termcolor = "^2.4.0"
-omegaconf = "^2.3.0"
-wandb = "^0.16.3"
-imageio = {extras = ["ffmpeg"], version = "^2.34.0"}
-gdown = "^5.1.0"
-hydra-core = "^1.3.2"
-einops = "^0.8.0"
-pymunk = "^6.6.0"
-zarr = "^2.17.0"
-numba = "^0.59.0"
+termcolor = ">=2.4.0"
+omegaconf = ">=2.3.0"
+wandb = ">=0.16.3"
+imageio = {extras = ["ffmpeg"], version = ">=2.34.0"}
+gdown = ">=5.1.0"
+hydra-core = ">=1.3.2"
+einops = ">=0.8.0"
+pymunk = ">=6.6.0"
+zarr = ">=2.17.0"
+numba = ">=0.59.0"
 torch = "^2.2.1"
-opencv-python = "^4.9.0.80"
+opencv-python = ">=4.9.0"
 diffusers = "^0.27.2"
-torchvision = "^0.18.0"
-h5py = "^3.10.0"
-huggingface-hub = "^0.21.4"
-robomimic = "0.2.0"
-gymnasium = "^0.29.1"
-cmake = "^3.29.0.1"
-gym-pusht = { version = "^0.1.1", optional = true}
-gym-xarm = { version = "^0.1.0", optional = true}
-gym-aloha = { version = "^0.1.0", optional = true}
-pre-commit = {version = "^3.7.0", optional = true}
-debugpy = {version = "^1.8.1", optional = true}
-pytest = {version = "^8.1.0", optional = true}
-pytest-cov = {version = "^5.0.0", optional = true}
+torchvision = ">=0.18.0"
+h5py = ">=3.10.0"
+huggingface-hub = {extras = ["hf-transfer"], version = "^0.23.0"}
+gymnasium = ">=0.29.1"
+cmake = ">=3.29.0.1"
+gym-dora = { path = "gym_dora", optional = true, develop = true}
+gym-pusht = { version = ">=0.1.3", optional = true}
+gym-xarm = { version = ">=0.1.1", optional = true}
+gym-aloha = { version = ">=0.1.1", optional = true}
+pre-commit = {version = ">=3.7.0", optional = true}
+debugpy = {version = ">=1.8.1", optional = true}
+pytest = {version = ">=8.1.0", optional = true}
+pytest-cov = {version = ">=5.0.0", optional = true}
 datasets = "^2.19.0"
-imagecodecs = { version = "^2024.1.1", optional = true }
-pyav = "^12.0.5"
-moviepy = "^1.0.3"
-rerun-sdk = "^0.15.1"
+imagecodecs = { version = ">=2024.1.1", optional = true }
+pyav = ">=12.0.5"
+moviepy = ">=1.0.3"
+rerun-sdk = ">=0.15.1"


 [tool.poetry.extras]
+dora = ["gym-dora"]
 pusht = ["gym-pusht"]
 xarm = ["gym-xarm"]
 aloha = ["gym-aloha"]
@@ -104,5 +105,5 @@ ignore-init-module-imports = true


 [build-system]
-requires = ["poetry-core>=1.5.0"]
+requires = ["poetry-core"]
 build-backend = "poetry.core.masonry.api"
--- a/tests/conftest.py
+++ b/tests/conftest.py
@@ -1,3 +1,18 @@
+#!/usr/bin/env python
+
+# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
 from .utils import DEVICE


--- a/tests/data/lerobot/aloha_mobile_cabinet/meta_data/episode_data_index.safetensors
+++ b/tests/data/lerobot/aloha_mobile_cabinet/meta_data/episode_data_index.safetensors
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:9f9347c8d9ac90ee44e6dd86f65043438168df6bbe4bab2d2b875e55ef7376ef
+size 1488
--- a/tests/data/lerobot/aloha_mobile_cabinet/meta_data/info.json
+++ b/tests/data/lerobot/aloha_mobile_cabinet/meta_data/info.json
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:cf148247bf191c7f7e8af738a7b9e147f9ffffeec0e4b9d1c4783c4e384da7eb
+size 33
--- a/tests/data/lerobot/aloha_mobile_cabinet/meta_data/stats.safetensors
+++ b/tests/data/lerobot/aloha_mobile_cabinet/meta_data/stats.safetensors
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:02fc4ea25766269f65752a60b0594c43d799b0ae528cd773bf024b064b5aa329
+size 4344
--- a/tests/data/lerobot/aloha_mobile_cabinet/train/data-00000-of-00001.arrow
+++ b/tests/data/lerobot/aloha_mobile_cabinet/train/data-00000-of-00001.arrow
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:55d7b1a06fe3e3051482752740074348bdb5fc98fb2e305b06d6203994117b27
+size 592448
--- a/tests/data/lerobot/aloha_mobile_cabinet/train/dataset_info.json
+++ b/tests/data/lerobot/aloha_mobile_cabinet/train/dataset_info.json
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:8b7fbedfdb3d536847bc6fadf2cbabb9f2b5492edf3e2c274a3e8ffb447105e8
+size 1166
--- a/tests/data/lerobot/aloha_mobile_cabinet/train/state.json
+++ b/tests/data/lerobot/aloha_mobile_cabinet/train/state.json
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:98329e4b40e9be0d63f7d36da9d86c44bbe7eeeb1b10d3ba973c923f3be70867
+size 247
--- a/tests/data/lerobot/aloha_mobile_cabinet/videos/observation.images.cam_high_episode_000000.mp4
+++ b/tests/data/lerobot/aloha_mobile_cabinet/videos/observation.images.cam_high_episode_000000.mp4
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:54e42cdfd016a0ced2ab1fe2966a8c15a2384e0dbe1a2fe87433a2d1b8209ac0
+size 5220057
--- a/tests/data/lerobot/aloha_mobile_cabinet/videos/observation.images.cam_left_wrist_episode_000000.mp4
+++ b/tests/data/lerobot/aloha_mobile_cabinet/videos/observation.images.cam_left_wrist_episode_000000.mp4
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:af1ded2a244cb47a96255b75f584a643edf6967e13bb5464b330ffdd9d7ad859
+size 5284692
--- a/tests/data/lerobot/aloha_mobile_cabinet/videos/observation.images.cam_right_wrist_episode_000000.mp4
+++ b/tests/data/lerobot/aloha_mobile_cabinet/videos/observation.images.cam_right_wrist_episode_000000.mp4
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:13d1bebabd79984fd6715971be758ef9a354495adea5e8d33f4e7904365e112b
+size 5258380
--- a/tests/data/lerobot/aloha_mobile_chair/meta_data/episode_data_index.safetensors
+++ b/tests/data/lerobot/aloha_mobile_chair/meta_data/episode_data_index.safetensors
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:f33bc6810f0b91817a42610364cb49ed1b99660f058f0f9407e6f5920d0aee02
+size 1008
--- a/tests/data/lerobot/aloha_mobile_chair/meta_data/info.json
+++ b/tests/data/lerobot/aloha_mobile_chair/meta_data/info.json
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:cf148247bf191c7f7e8af738a7b9e147f9ffffeec0e4b9d1c4783c4e384da7eb
+size 33
--- a/tests/data/lerobot/aloha_mobile_chair/meta_data/stats.safetensors
+++ b/tests/data/lerobot/aloha_mobile_chair/meta_data/stats.safetensors
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:7b58d6c89e936a781a307805ebecf0dd473fbc02d52a7094da62e54bffb9454a
+size 4344
--- a/tests/data/lerobot/aloha_mobile_chair/train/data-00000-of-00001.arrow
+++ b/tests/data/lerobot/aloha_mobile_chair/train/data-00000-of-00001.arrow
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:a08be578285cbe2d35b78f150d464ff3e10604a9865398c976983e0d711774f9
+size 788528
--- a/tests/data/lerobot/aloha_mobile_chair/train/dataset_info.json
+++ b/tests/data/lerobot/aloha_mobile_chair/train/dataset_info.json
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:8b7fbedfdb3d536847bc6fadf2cbabb9f2b5492edf3e2c274a3e8ffb447105e8
+size 1166
--- a/tests/data/lerobot/aloha_mobile_chair/train/state.json
+++ b/tests/data/lerobot/aloha_mobile_chair/train/state.json
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:34e36233477c8aa0b0840314ddace072062d4f486d06546bbd6550832c370065
+size 247
--- a/tests/data/lerobot/aloha_mobile_chair/videos/observation.images.cam_high_episode_000000.mp4
+++ b/tests/data/lerobot/aloha_mobile_chair/videos/observation.images.cam_high_episode_000000.mp4
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:66e7349a4a82ca6042a7189608d01eb1cfa38d100d039b5445ae1a9e65d824ab
+size 14470946
--- a/tests/data/lerobot/aloha_mobile_chair/videos/observation.images.cam_left_wrist_episode_000000.mp4
+++ b/tests/data/lerobot/aloha_mobile_chair/videos/observation.images.cam_left_wrist_episode_000000.mp4
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:a2146f0c10c9f2611e57e617983aa4f91ad681b4fc50d91b992b97abd684f926
+size 11662185
--- a/tests/data/lerobot/aloha_mobile_chair/videos/observation.images.cam_right_wrist_episode_000000.mp4
+++ b/tests/data/lerobot/aloha_mobile_chair/videos/observation.images.cam_right_wrist_episode_000000.mp4
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:5affbaf1c48895ba3c626e0d8cf1309e5f4ec6bbaa135313096f52a22de66c05
+size 11410342
--- a/tests/data/lerobot/aloha_mobile_elevator/meta_data/episode_data_index.safetensors
+++ b/tests/data/lerobot/aloha_mobile_elevator/meta_data/episode_data_index.safetensors
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:6c2b195ca91b88fd16422128d386d2cabd808a1862c6d127e6bf2e83e1fe819a
+size 448
--- a/tests/data/lerobot/aloha_mobile_elevator/meta_data/info.json
+++ b/tests/data/lerobot/aloha_mobile_elevator/meta_data/info.json
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:cf148247bf191c7f7e8af738a7b9e147f9ffffeec0e4b9d1c4783c4e384da7eb
+size 33
--- a/tests/data/lerobot/aloha_mobile_elevator/meta_data/stats.safetensors
+++ b/tests/data/lerobot/aloha_mobile_elevator/meta_data/stats.safetensors
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:b360b6b956d2adcb20589947c553348ef1eb6b70743c989dcbe95243d8592ce5
+size 4344
--- a/tests/data/lerobot/aloha_mobile_elevator/train/data-00000-of-00001.arrow
+++ b/tests/data/lerobot/aloha_mobile_elevator/train/data-00000-of-00001.arrow
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:3f5c3926b4d4da9271abefcdf6a8952bb1f13258a9c39fe0fd223f548dc89dcb
+size 887728
--- a/tests/data/lerobot/aloha_mobile_elevator/train/dataset_info.json
+++ b/tests/data/lerobot/aloha_mobile_elevator/train/dataset_info.json
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:8b7fbedfdb3d536847bc6fadf2cbabb9f2b5492edf3e2c274a3e8ffb447105e8
+size 1166
--- a/tests/data/lerobot/aloha_mobile_elevator/train/state.json
+++ b/tests/data/lerobot/aloha_mobile_elevator/train/state.json
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:4993b05fb026619eec5eb70db8cadaa041ba4ab92d38b4a387167ace03b1018b
+size 247
--- a/tests/data/lerobot/aloha_mobile_elevator/videos/observation.images.cam_high_episode_000000.mp4
+++ b/tests/data/lerobot/aloha_mobile_elevator/videos/observation.images.cam_high_episode_000000.mp4
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:bd25d17ef5b7500386761b5e32920879bbdcafe0e17a8a8845628525d861e644
+size 10231081
--- a/tests/data/lerobot/aloha_mobile_elevator/videos/observation.images.cam_left_wrist_episode_000000.mp4
+++ b/tests/data/lerobot/aloha_mobile_elevator/videos/observation.images.cam_left_wrist_episode_000000.mp4
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:5b557acbfeb0681c0a38e47263d945f6cd3a03461298d8b17209c81e3fd0aae8
+size 9701371
--- a/tests/data/lerobot/aloha_mobile_elevator/videos/observation.images.cam_right_wrist_episode_000000.mp4
+++ b/tests/data/lerobot/aloha_mobile_elevator/videos/observation.images.cam_right_wrist_episode_000000.mp4
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:da8f3b4f9f965da63819652b2c042d4cf7e07d14631113ea072087d56370310e
+size 10473741
--- a/tests/data/lerobot/aloha_mobile_shrimp/meta_data/episode_data_index.safetensors
+++ b/tests/data/lerobot/aloha_mobile_shrimp/meta_data/episode_data_index.safetensors
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:a053506017d8a78cfd307b2912eeafa1ac1485a280cf90913985fcc40120b5ec
+size 416
--- a/tests/data/lerobot/aloha_mobile_shrimp/meta_data/info.json
+++ b/tests/data/lerobot/aloha_mobile_shrimp/meta_data/info.json
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:cf148247bf191c7f7e8af738a7b9e147f9ffffeec0e4b9d1c4783c4e384da7eb
+size 33
--- a/tests/data/lerobot/aloha_mobile_shrimp/meta_data/stats.safetensors
+++ b/tests/data/lerobot/aloha_mobile_shrimp/meta_data/stats.safetensors
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:d6d172d1bca02face22ceb4c21ea2b054cf3463025485dce64711b6f36b31f8a
+size 4344
--- a/tests/data/lerobot/aloha_mobile_shrimp/train/data-00000-of-00001.arrow
+++ b/tests/data/lerobot/aloha_mobile_shrimp/train/data-00000-of-00001.arrow
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:7e5ce817a2c188041f57f8d4c465dab3b9c3e4e1aeb7a9fb270230d1b36df530
+size 1477064
--- a/tests/data/lerobot/aloha_mobile_shrimp/train/dataset_info.json
+++ b/tests/data/lerobot/aloha_mobile_shrimp/train/dataset_info.json
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:8b7fbedfdb3d536847bc6fadf2cbabb9f2b5492edf3e2c274a3e8ffb447105e8
+size 1166
--- a/tests/data/lerobot/aloha_mobile_shrimp/train/state.json
+++ b/tests/data/lerobot/aloha_mobile_shrimp/train/state.json
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:4eb2dc373e4ea7d474742590f9073d66a773f6ab94b9e73a8673df19f93fae6d
+size 247
--- a/tests/data/lerobot/aloha_mobile_shrimp/videos/observation.images.cam_high_episode_000000.mp4
+++ b/tests/data/lerobot/aloha_mobile_shrimp/videos/observation.images.cam_high_episode_000000.mp4
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:d2c55b146fabe78b18c8a28a7746ab56e1ee7a6918e9e3dad9bd196f97975895
+size 26158915
--- a/tests/data/lerobot/aloha_mobile_shrimp/videos/observation.images.cam_left_wrist_episode_000000.mp4
+++ b/tests/data/lerobot/aloha_mobile_shrimp/videos/observation.images.cam_left_wrist_episode_000000.mp4
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:71e1958d77f56843acf1ec48da4f04311a5836c87a0e77dbe26aa47c27c6347e
+size 18786848
--- a/tests/data/lerobot/aloha_mobile_shrimp/videos/observation.images.cam_right_wrist_episode_000000.mp4
+++ b/tests/data/lerobot/aloha_mobile_shrimp/videos/observation.images.cam_right_wrist_episode_000000.mp4
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:20780718399b5759ff9a3a79824986310524793066198e3b9a307222f11a93df
+size 17769988
--- a/tests/data/lerobot/aloha_mobile_wash_pan/meta_data/episode_data_index.safetensors
+++ b/tests/data/lerobot/aloha_mobile_wash_pan/meta_data/episode_data_index.safetensors
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:279916f7689ae46af90e92a46eba9486a71fc762e3e2679ab5441eb37126827b
+size 928
--- a/tests/data/lerobot/aloha_mobile_wash_pan/meta_data/info.json
+++ b/tests/data/lerobot/aloha_mobile_wash_pan/meta_data/info.json
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:cf148247bf191c7f7e8af738a7b9e147f9ffffeec0e4b9d1c4783c4e384da7eb
+size 33
--- a/tests/data/lerobot/aloha_mobile_wash_pan/meta_data/stats.safetensors
+++ b/tests/data/lerobot/aloha_mobile_wash_pan/meta_data/stats.safetensors
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:7a7731051b521694b52b5631470720a7f05331915f4ac4e7f8cd83f9ff459bce
+size 4344
--- a/tests/data/lerobot/aloha_mobile_wash_pan/train/data-00000-of-00001.arrow
+++ b/tests/data/lerobot/aloha_mobile_wash_pan/train/data-00000-of-00001.arrow
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:99608258e8c9fe5191f1a12edc29b47d307790104149dffb6d3046ddad6aeb1b
+size 435600
--- a/tests/data/lerobot/aloha_mobile_wash_pan/train/dataset_info.json
+++ b/tests/data/lerobot/aloha_mobile_wash_pan/train/dataset_info.json
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:8b7fbedfdb3d536847bc6fadf2cbabb9f2b5492edf3e2c274a3e8ffb447105e8
+size 1166
--- a/Show More
+++ b/Show More
Author	SHA1	Message	Date
Remi Cadene	255dbef76a	Add offline inference to visualize_dataset	2024-05-29 15:33:52 +00:00
Remi Cadene	31a25d1dba	Use training.eval_freq=-1 for real-world data training + Add diffusion_real.yaml	2024-05-29 15:30:43 +00:00
Remi Cadene	31b57de633	Refactor make_env	2024-05-29 15:26:56 +00:00
Remi Cadene	2ced211864	Move gym_dora to dora-lerobot github	2024-05-29 15:20:44 +00:00
Remi Cadene	1a8e5c1c0b	remove warning for ignored logged types (WIP)	2024-05-29 11:09:44 +00:00
Remi Cadene	63e61385fc	Add num_workers and batch_size to default.yaml	2024-05-29 11:09:16 +00:00
Remi Cadene	8c1dd0e263	Add observation queue to ACT + refactor into _queues	2024-05-29 10:21:50 +00:00
Remi Cadene	b24f2d0028	fix modeling act	2024-05-29 10:04:27 +00:00
Remi Cadene	ad0033ae12	fix diffusion	2024-05-29 09:50:44 +00:00
Alexander Soare	b47c07fbeb	cherry pick me	2024-05-28 20:57:43 +00:00
Remi Cadene	220b32441d	Act works with n_obs > 1	2024-05-28 20:55:02 +00:00
Remi Cadene	960589849f	Fix aloha_dora_format	2024-05-28 20:37:12 +00:00
haixuantao	1a6602e77c	add support for failed episode	2024-05-28 21:20:43 +02:00
Remi Cadene	743812e471	fix timestamps not aligned (TODO: add some tests)	2024-05-28 19:12:08 +00:00
Remi Cadene	c2fd06f765	WIP: Add no_state option + fix use_vae=False to ACT	2024-05-27 13:20:32 +00:00
Remi Cadene	3189ac26b8	fix symlink	2024-05-24 17:34:05 +00:00
Remi Cadene	5cb19d589f	WIP small adjustment add rerun hot fix WIP	2024-05-24 17:34:00 +00:00
Remi Cadene	495da6deaf	save_freq=10000	2024-05-24 17:33:20 +00:00
Remi Cadene	f33a06070e	WIP	2024-05-24 09:42:02 +00:00
Remi Cadene	f409bee6b1	WIP	2024-05-24 09:37:11 +00:00
Remi Cadene	c91ececc75	WIP	2024-05-24 09:06:29 +00:00
Simon Alibert	8a10e8442b	Add boilerplate code adding dora node logic Add some documentation refactor	2024-05-23 12:59:39 +00:00
Remi Cadene	10c5151fc6	remove hardcoding	2024-05-23 09:14:17 +00:00
Remi Cadene	a7d24f6dc2	works	2024-05-23 07:32:37 +00:00
Remi Cadene	642f1d0328	Add act_real_world.yaml	2024-05-22 15:15:36 +00:00
Remi Cadene	4843988d81	fix	2024-05-22 15:15:13 +00:00
Remi Cadene	772927616a	fix	2024-05-22 09:12:34 +00:00
Remi Cadene	d52c6037e8	fix	2024-05-22 09:03:43 +00:00
Remi Cadene	b0cb342795	WIP Add aloha_dora_format	2024-05-21 21:16:31 +00:00
Remi Cadene	8460ea6f83	rename + format	2024-05-20 16:37:21 +00:00
haixuantao	9a3b0b738a	Adding dora-record script	2024-05-20 16:34:21 +00:00
Remi	c4da689171	Hot fix to compute validation loss example test (#200 ) Co-authored-by: Alexander Soare <alexander.soare159@gmail.com>	2024-05-20 18:30:11 +02:00
Radek Osmulski	9b62c25f6c	Adds split_by_episodes to LeRobotDataset (#158 )	2024-05-20 14:04:04 +02:00
Remi	01eae09ba6	Fix aloha real-world datasets (#175 )	2024-05-20 13:48:09 +02:00
Alexander Soare	19dfb9144a	Update the README to reflect WandB disabled by default (#198 )	2024-05-20 09:02:24 +01:00
Alexander Soare	096149b118	Disable wandb by default (#195 )	2024-05-17 18:01:39 +01:00
Alexander Soare	5ec0af62c6	Explain why n_encoder_layers=1 (#193 )	2024-05-17 15:05:40 +01:00
Alexander Soare	625f0557ef	Act temporal ensembling (#186 )	2024-05-17 14:57:49 +01:00
Alexander Soare	4d7d41cdee	Fix act action queue (#185 )	2024-05-16 15:43:25 +01:00
Akshay Kashyap	c9069df9f1	Port SpatialSoftmax and remove Robomimic dependency (#182 ) Co-authored-by: Alexander Soare <alexander.soare159@gmail.com>	2024-05-16 15:34:10 +01:00
Alexander Soare	68c1b13406	Make policies compatible with other/multiple image keys (#149 )	2024-05-16 13:51:53 +01:00
Simon Alibert	f52f4f2cd2	Add copyrights (#157 )	2024-05-15 12:13:09 +02:00
Simon Alibert	89c6be84ca	Limit datasets major update (#176 ) Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com> Co-authored-by: Quentin Lhoest <lhoest.q@gmail.com>	2024-05-12 08:15:07 +02:00
AshisGhosh	fc5cf3d84a	Fixes issue #152 - error with creating wandb artifact (#172 ) Co-authored-by: Ashis Ghosh <ahsisghosh@live.com> Co-authored-by: Simon Alibert <75076266+aliberts@users.noreply.github.com>	2024-05-12 08:13:12 +02:00
Simon Alibert	29a196c5dd	Fix #173 - Require gym-pusht to be installed for test_examples_3_and_2 (#174 )	2024-05-12 08:08:59 +02:00
Remi	ced3de4c94	Fix hanging in visualize_dataset.py when num_workers > 0 (#165 )	2024-05-11 19:28:22 +03:00
Vincent Moens	7b47ab211b	Remove torchrl acknowledgement (#177 )	2024-05-11 14:45:51 +03:00
Alexander Soare	1249aee3ac	Enable logging all the information returned by the `forward` methods of policies (#151 )	2024-05-10 07:45:32 +01:00
Alexander Soare	b187942db4	Add context manager for seeding (#164 )	2024-05-09 17:58:39 +01:00
Alexander Soare	473345fdf6	Fix stats override in ACT config (#161 )	2024-05-09 15:16:47 +01:00
Alexander Soare	e89521dfa0	Enable tests for TD-MPC (#160 )	2024-05-09 13:42:12 +01:00
Simon Alibert	7bb5b15f4c	Remove dependencies upper bounds constraints (#145 )	2024-05-08 17:23:10 +00:00
Simon Alibert	df914aa76c	Update dev docker build (#148 )	2024-05-08 17:21:58 +00:00
Ikko Eltociear Ashimine	0ea7a8b2a3	refactor: update configuration_tdmpc.py (#153 ) Co-authored-by: Alexander Soare <alexander.soare159@gmail.com>	2024-05-08 18:13:51 +01:00
Akshay Kashyap	460df2ccea	Support for DDIMScheduler in Diffusion Policy (#146 )	2024-05-08 18:05:16 +01:00
Alexander Soare	f5de57b385	Fix SpatialSoftmax input shape (#150 )	2024-05-08 14:57:29 +01:00
Alexander Soare	47de07658c	Override pretrained model config (#147 )	2024-05-08 12:56:21 +01:00