Compare commits

...

73 Commits

Author SHA1 Message Date
fracapuano
b54be4c23a fix: optimize delta-indices construction. No point in fetching after iteration failure in same direction 2025-05-31 20:08:16 +02:00
fracapuano
3bf63e5518 fix: making sure to stream within-episode frames 2025-05-31 20:07:16 +02:00
fracapuano
617bebb617 fix: minor 2025-05-29 23:17:02 +02:00
fracapuano
3af6768f37 add: next() & prev() iterable for streaming dataset with two-sided memory buffers 2025-05-29 23:14:18 +02:00
fracapuano
90969e57ff add: support for delta timesteps on streaming mode 2025-05-29 23:11:28 +02:00
fracapuano
e405f37b95 fix: small nit, typing of method 2025-05-27 14:32:14 +02:00
fracapuano
de5457f538 add: performance (time & randomness profiler) 2025-05-27 11:56:30 +02:00
fracapuano
fccbce9ef9 add: randomized streaming dataset 2025-05-27 11:43:55 +02:00
Remi Cadene
f07887e8d1 Merge remote-tracking branch 'origin/user/rcadene/2025_04_11_dataset_v3' into user/rcadene/2025_04_11_dataset_v3 2025-05-16 17:50:14 +00:00
Remi Cadene
8d360927af WIP aggregate 2025-05-16 17:41:47 +00:00
Remi Cadene
e07cb52baa In tests: Add use_videos=False by default, Create mp4 file if True, then fix test_datasets and test_aggregate (all passing) 2025-05-12 15:37:02 +02:00
Remi Cadene
e88af0e588 Fix visualize_dataset with rerun 2025-05-08 17:24:58 +02:00
Remi Cadene
1ecaeabad0 Uploaded droid 1.0.1 2025-05-08 15:14:15 +00:00
Remi Cadene
0309a9fcbc Speedup data loading 2025-05-06 15:13:50 +00:00
Remi Cadene
588bf96559 Fix aggregate (num_frames, dataset_from_index, index) 2025-05-06 15:13:35 +00:00
Remi Cadene
e11d2e4197 Aggregate: Add concatenation 2025-05-02 13:33:57 +02:00
Remi Cadene
253c649507 Fix convert v30 with image datasets 2025-04-24 18:51:53 +02:00
Remi Cadene
71715c3914 fix hf_dataset.set_transform(hf_transform_to_torch) 2025-04-23 11:42:21 +02:00
Remi Cadene
7c005c2aa1 Merge remote-tracking branch 'origin/user/rcadene/2025_04_11_dataset_v3' into user/rcadene/2025_04_11_dataset_v3 2025-04-23 09:16:37 +00:00
Remi Cadene
d518b036d0 Faster self.meta.episodes[...]
switch back to set_transform instead of set_format

Add video_files_size_in_mb

pre-commit run --all-files
2025-04-23 09:14:02 +00:00
Remi Cadene
367d9bda7d Fix unit tests 2025-04-22 10:35:20 +02:00
Remi Cadene
601b5fdbfe Merge remote-tracking branch 'origin/user/rcadene/2025_04_11_dataset_v3' into user/rcadene/2025_04_11_dataset_v3 2025-04-22 08:19:30 +00:00
Remi Cadene
20b74ae1eb fix 2025-04-21 13:38:29 +00:00
Remi Cadene
b9b880bd8b fix get_parquet_file_size_in_mb + DEFAULT_FILE_SIZE_IN_MB=100 2025-04-21 12:59:35 +00:00
Remi Cadene
5bd9cb1e72 Merge remote-tracking branch 'origin/main' into user/rcadene/2025_04_11_dataset_v3 2025-04-21 11:03:12 +02:00
Remi Cadene
2866d0770f small fix ffmpeg encoding 2025-04-21 10:59:06 +02:00
Remi Cadene
4375a05a9f Add push to hub for convert_dataset_v21_to_v30 2025-04-21 10:08:25 +02:00
Remi Cadene
4acf99f622 pre-commit run --all-files 2025-04-21 09:34:19 +02:00
Remi Cadene
5a6ea09248 Rename tests/test_aggregate_datasets.py -> tests/datasets/test_aggregate.py 2025-04-19 19:30:28 +05:30
Remi Cadene
9c0836c8d0 Remove legacy from datasets/utils.py 2025-04-19 19:27:14 +05:30
Remi Cadene
b0cca75e5e Progress on aggregate_datasets 2025-04-19 19:11:53 +05:30
Remi Cadene
54b5c805bf Revert mistake convert_dataset_v20_to_v21.py 2025-04-17 04:47:00 +02:00
Remi Cadene
eab5543750 Merge (No verify) 2025-04-17 04:46:09 +02:00
Remi Cadene
6b6a990f4c most unit tests passing (TODO: convert datasets) 2025-04-16 21:30:58 +02:00
Remi Cadene
c2a05a1fde Fix (Now loading all frames is possible) 2025-04-14 14:47:18 +00:00
Remi Cadene
6c4d122198 fix joints 2025-04-11 15:01:03 +02:00
Remi Cadene
34c5d4ce07 Most unit tests are passing 2025-04-11 14:04:22 +02:00
Remi Cadene
c1b28f0b58 Commit before episodes episodes_stats merging 2025-04-09 15:20:15 +02:00
Remi Cadene
53ecec5fb2 WIP v21 to v30 2025-03-31 07:38:01 +00:00
Remi Cadene
65738f0a80 Improve slurm droid 2025-03-20 14:12:46 +00:00
Remi Cadene
5d184a7811 NIT 2025-03-18 16:55:08 +00:00
Remi Cadene
1a5c1ef9c7 Rename openx to droid + Improve all (not tested) 2025-03-18 16:28:09 +00:00
Remi Cadene
7866c1f7d1 Merge remote-tracking branch 'origin/main' into user/rcadene/2025_02_19_port_openx 2025-03-01 19:17:18 +00:00
Remi Cadene
3666ac9346 WIP UploadDataset 2025-03-01 19:07:22 +00:00
Remi Cadene
3daab2acbb Add upload_large_folder 2025-02-23 18:19:12 +00:00
Remi Cadene
c36d2253d0 Aggregate works 2025-02-23 18:18:46 +00:00
Remi Cadene
e2e6f6e666 Add auto_downsample_height_width 2025-02-23 18:15:39 +00:00
Remi Cadene
ff0029f84b aggregate works 2025-02-22 15:33:47 +00:00
Remi Cadene
39ad2d16d4 let's go 2025-02-22 11:12:39 +00:00
Remi Cadene
689c5efc72 optimize shard 2025-02-22 10:13:09 +00:00
Remi Cadene
eda0b996cd new dir 2025-02-21 23:56:44 +00:00
Remi Cadene
15e7a9d541 before new launch from scratch 2025-02-21 23:14:22 +00:00
Remi Cadene
52fb4143b5 workers 2025-02-21 13:08:21 +00:00
Remi Cadene
93c80b2cb1 rm brake 2025-02-20 23:24:03 +00:00
Remi Cadene
5fbbaa1bc0 fix No such file or directory error 2025-02-20 23:04:58 +00:00
Remi Cadene
71d1f5e2c9 WIP 2025-02-20 23:04:31 +00:00
Remi Cadene
b520941cd9 Merge remote-tracking branch 'origin/user/aliberts/2025_02_10_dataset_v2.1' into user/rcadene/2025_02_19_port_openx 2025-02-20 17:34:13 +00:00
Simon Alibert
64ed5258e6 Fix batch convert 2025-02-20 09:00:14 +01:00
Simon Alibert
392a8c32a7 Improve doc 2025-02-20 08:24:41 +01:00
Simon Alibert
969ef745a2 Remove dataset consolidate (#752) 2025-02-19 16:02:54 +01:00
Simon Alibert
6fe42a72db Add tag 2025-02-19 15:01:44 +01:00
Simon Alibert
2487228ea7 Use HF_HOME env variable (#753) 2025-02-19 14:49:46 +01:00
Remi Cadene
76436ca1de Merge remote-tracking branch 'tavish9_lerobot_openx/main' into user/rcadene/2025_02_19_port_openx 2025-02-19 12:58:18 +00:00
Simon Alibert
fbf2f2222a Remove local_files_only and use codebase_version instead of branches (#734) 2025-02-19 08:36:32 +01:00
Tavish
02bc4e03e0 support openx/rlds to lerobot 2025-02-18 22:25:58 +08:00
Simon Alibert
624eaf1175 Merge remote-tracking branch 'origin/main' into user/aliberts/2025_02_10_dataset_v2.1 2025-02-17 12:06:05 +01:00
Simon Alibert
aed3eb4a94 Merge remote-tracking branch 'origin/main' into user/aliberts/2025_02_10_dataset_v2.1 2025-02-15 15:56:24 +01:00
Simon Alibert
8426c64f42 Per-episode stats (#521)
Co-authored-by: Remi Cadene <re.cadene@gmail.com>
Co-authored-by: Remi <remi.cadene@huggingface.co>
2025-02-15 15:47:16 +01:00
Remi
7c2bbee613 Validate features during add_frame + Add 2D-to-5D + Add string (#720) 2025-02-14 19:59:48 +01:00
Remi
9d6886dd08 Add frame level task (#693)
Co-authored-by: Simon Alibert <75076266+aliberts@users.noreply.github.com>
2025-02-14 14:22:22 +01:00
Simon Alibert
d67ca342e9 Merge remote-tracking branch 'origin/main' into user/aliberts/2025_02_10_dataset_v2.1 2025-02-11 17:17:39 +01:00
Simon Alibert
57c9c21c39 Merge remote-tracking branch 'origin/main' into user/aliberts/2025_02_10_dataset_v2.1 2025-02-10 17:22:57 +01:00
Simon Alibert
38c14571cc Bump CODEBASE_VERSION 2025-02-10 16:39:34 +01:00
36 changed files with 4187 additions and 572 deletions

View File

@@ -198,6 +198,7 @@ Under the hood, the `LeRobotDataset` format makes use of several ways to seriali
Here are the important details and internal structure organization of a typical `LeRobotDataset` instantiated with `dataset = LeRobotDataset("lerobot/aloha_static_coffee")`. The exact features will change from dataset to dataset but not the main aspects:
```
TODO: IMPROVE
dataset attributes:
├ hf_dataset: a Hugging Face dataset (backed by Arrow/parquet). Typical features example:
│ ├ observation.images.cam_high (VideoFrame):
@@ -210,7 +211,7 @@ dataset attributes:
│ ├ timestamp (float32): timestamp in the episode
│ ├ next.done (bool): indicates the end of an episode; True for the last frame in each episode
│ └ index (int64): general index in the whole dataset
episode_data_index: contains 2 tensors with the start and end indices of each episode
meta: contains 2 tensors with the start and end indices of each episode
│ ├ from (1D int64 tensor): first frame index for each episode — shape (num episodes,) starts with 0
│ └ to: (1D int64 tensor): last frame index for each episode — shape (num episodes,)
├ stats: a dictionary of statistics (max, mean, min, std) for each feature in the dataset, for instance

View File

@@ -108,7 +108,8 @@ def save_decoded_frames(
def save_first_episode(imgs_dir: Path, dataset: LeRobotDataset) -> None:
ep_num_images = dataset.episode_data_index["to"][0].item()
episode_index = 0
ep_num_images = dataset.meta.episodes["length"][episode_index]
if imgs_dir.exists() and len(list(imgs_dir.glob("frame_*.png"))) == ep_num_images:
return
@@ -265,7 +266,8 @@ def benchmark_encoding_decoding(
overwrite=True,
)
ep_num_images = dataset.episode_data_index["to"][0].item()
episode_index = 0
ep_num_images = dataset.meta.episodes["length"][episode_index]
width, height = tuple(dataset[0][dataset.meta.camera_keys[0]].shape[-2:])
num_pixels = width * height
video_size_bytes = video_path.stat().st_size

View File

@@ -92,11 +92,11 @@ print(dataset.hf_dataset)
# LeRobot datasets also subclasses PyTorch datasets so you can do everything you know and love from working
# with the latter, like iterating through the dataset.
# The __getitem__ iterates over the frames of the dataset. Since our datasets are also structured by
# episodes, you can access the frame indices of any episode using the episode_data_index. Here, we access
# episodes, you can access the frame indices of any episode using dataset.meta.episodes. Here, we access
# frame indices associated to the first episode:
episode_index = 0
from_idx = dataset.episode_data_index["from"][episode_index].item()
to_idx = dataset.episode_data_index["to"][episode_index].item()
from_idx = dataset.meta.episodes["dataset_from_index"][episode_index]
to_idx = dataset.meta.episodes["dataset_to_index"][episode_index]
# Then we grab all the image frames from the first camera:
camera_key = dataset.meta.camera_keys[0]
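# A minimal usage sketch (assuming `torch` is imported in this example and `dataset`,
# `from_idx`, `to_idx`, `camera_key` are defined as above): stack the episode's frames
# from that camera into a single tensor.
frames = torch.stack([dataset[idx][camera_key] for idx in range(from_idx, to_idx)])
print(frames.shape)  # (num_frames_in_episode, channels, height, width)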

View File

@@ -31,7 +31,7 @@ dataset = LeRobotDataset(dataset_repo_id, episodes=[0])
# This is equivalent to `dataset = LeRobotDataset(dataset_repo_id, image_transforms=None)`
# Get the index of the first observation in the first episode
first_idx = dataset.episode_data_index["from"][0].item()
first_idx = dataset.meta.episodes["dataset_from_index"][0]
# Get the frame corresponding to the first camera
frame = dataset[first_idx][dataset.meta.camera_keys[0]]
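# Small sanity-check sketch (names as above): assuming the default image conversion done by
# LeRobotDataset, the frame is a channel-first torch tensor.
print(frame.shape)  # (channels, height, width)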

View File

@@ -0,0 +1,144 @@
# Port DROID 1.0.1 dataset to LeRobotDataset
## Download
TODO
It will take 2 TB of local disk space.
## Port on a single computer
First, install tensorflow dataset utilities to read from raw files:
```bash
pip install tensorflow
pip install tensorflow_datasets
```
Then run this script to start porting the dataset:
```bash
python examples/port_datasets/droid_rlds/port_droid.py \
--raw-dir /your/data/droid/1.0.1 \
--repo-id your_id/droid_1.0.1 \
--push-to-hub
```
It will take 400 GB of local disk space.
As usual, your LeRobotDataset will be stored in your huggingface/lerobot cache folder.
WARNING: porting the dataset locally takes about 7 days and uploading it another 3 days, so you will need to parallelize the work over multiple nodes on a slurm cluster.
NOTE: For development, run this script to start porting a shard:
```bash
python examples/port_datasets/droid_rlds/port_droid.py \
--raw-dir /your/data/droid/1.0.1 \
--repo-id your_id/droid_1.0.1 \
--num-shards 2048 \
--shard-index 0
```
## Port over SLURM
Install slurm utilities from Hugging Face:
```bash
pip install datatrove
```
### 1. Port one shard per job
Run this script to start porting shards of the dataset:
```bash
python examples/port_datasets/droid_rlds/slurm_port_shards.py \
--raw-dir /your/data/droid/1.0.1 \
--repo-id your_id/droid_1.0.1 \
--logs-dir /your/logs \
--job-name port_droid \
--partition your_partition \
--workers 2048 \
--cpus-per-task 8 \
--mem-per-cpu 1950M
```
**Note on how to set your command line arguments**
Regarding `--partition`, find yours by running:
```bash
sinfo --format="%R"
```
and select the CPU partition if you have one. No GPU needed.
Regarding `--workers`, it is the number of slurm jobs you will launch in parallel. 2048 is the maximum, since there are 2048 shards in DROID; launching that many jobs will most likely max out your cluster.
Regarding `--cpus-per-task` and `--mem-per-cpu`, the defaults request 8 CPUs and roughly 16 GB of RAM per job (8 * 1950M), which is enough to load the raw frames; the 8 CPUs also help parallelize the encoding of the frames.
Find the number of CPUs and Memory of the nodes of your partition by running:
```bash
sinfo -N -p your_partition -h -o "%N cpus=%c mem=%m"
```
**Useful commands to check progress and debug**
Check if your jobs are running:
```bash
squeue -u $USER
```
You should see a list of job IDs like `15125385_155`, where `15125385` is the run index and `155` is the worker index. The output of this worker is written in real time to `/your/logs/job_name/slurm_jobs/15125385_155.out`. For instance, you can inspect it by running `less /your/logs/job_name/slurm_jobs/15125385_155.out`.
Check the progression of your jobs by running:
```bash
jobs_status /your/logs
```
If progress is below 100% and no slurm jobs are running anymore, some of them failed. Inspect the logs by running:
```bash
failed_logs /your/logs/job_name
```
If there is an issue in the code, you can fix it in debug mode with `--slurm 0`, which runs the jobs sequentially and lets you set breakpoints:
```bash
python examples/port_datasets/droid_rlds/slurm_port_shards.py --slurm 0 ...
```
Then you can relaunch the same command, which will skip the already completed jobs:
```bash
python examples/port_datasets/droid_rlds/slurm_port_shards.py --slurm 1 ...
```
Once all jobs are completed, you will have one dataset per shard (e.g. `droid_1.0.1_world_2048_rank_1594`) saved on disk in your `/lerobot/home/dir/your_id` directory. You can find your `/lerobot/home/dir` by running:
```bash
python -c "from lerobot.common.constants import HF_LEROBOT_HOME;print(HF_LEROBOT_HOME)"
```
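To quickly check how many shards have already been ported, here is a small sketch (assuming the default cache location, the `your_id` placeholder and the `_world_2048_rank_*` naming shown above):
```python
from lerobot.common.constants import HF_LEROBOT_HOME

# Count the per-shard datasets already present in the local LeRobot cache
shard_dirs = sorted((HF_LEROBOT_HOME / "your_id").glob("droid_1.0.1_world_2048_rank_*"))
print(f"{len(shard_dirs)} / 2048 shards ported so far")
```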
### 2. Aggregate all shards
Run this script to start aggregation:
```bash
python examples/port_datasets/droid_rlds/slurm_aggregate_shards.py \
--repo-id your_id/droid_1.0.1 \
--logs-dir /your/logs \
--job-name aggr_droid \
--partition your_partition \
--workers 2048 \
--cpus-per-task 8 \
--mem-per-cpu 1950M
```
Once all jobs are completed, you will have a single aggregated dataset in your `/lerobot/home/dir/your_id/droid_1.0.1` directory.
### 3. Upload dataset
Run this script to start uploading:
```bash
python examples/port_datasets/droid_rlds/slurm_upload.py \
--repo-id your_id/droid_1.0.1 \
--logs-dir /your/logs \
--job-name upload_droid \
--partition your_partition \
--workers 50 \
--cpus-per-task 4 \
--mem-per-cpu 1950M
```

View File

@@ -0,0 +1,430 @@
#!/usr/bin/env python
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import logging
import time
from pathlib import Path
import numpy as np
import tensorflow_datasets as tfds
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset, LeRobotDatasetMetadata
from lerobot.common.utils.utils import get_elapsed_time_in_days_hours_minutes_seconds
DROID_SHARDS = 2048
DROID_FPS = 15
DROID_ROBOT_TYPE = "Franka"
# Dataset schema slightly adapted from: https://droid-dataset.github.io/droid/the-droid-dataset.html#-dataset-schema
DROID_FEATURES = {
# true on first step of the episode
"is_first": {
"dtype": "bool",
"shape": (1,),
"names": None,
},
# true on last step of the episode
"is_last": {
"dtype": "bool",
"shape": (1,),
"names": None,
},
# true on last step of the episode if it is a terminal step, True for demos
"is_terminal": {
"dtype": "bool",
"shape": (1,),
"names": None,
},
# language_instruction is also stored as "task" to follow LeRobot standard
"language_instruction": {
"dtype": "string",
"shape": (1,),
"names": None,
},
"language_instruction_2": {
"dtype": "string",
"shape": (1,),
"names": None,
},
"language_instruction_3": {
"dtype": "string",
"shape": (1,),
"names": None,
},
"observation.state.gripper_position": {
"dtype": "float32",
"shape": (1,),
"names": {
"axes": ["gripper"],
},
},
"observation.state.cartesian_position": {
"dtype": "float32",
"shape": (6,),
"names": {
"axes": ["x", "y", "z", "roll", "pitch", "yaw"],
},
},
"observation.state.joint_position": {
"dtype": "float32",
"shape": (7,),
"names": {
"axes": ["joint_0", "joint_1", "joint_2", "joint_3", "joint_4", "joint_5", "joint_6"],
},
},
# Add this new feature to follow LeRobot standard of using joint position + gripper
"observation.state": {
"dtype": "float32",
"shape": (8,),
"names": {
"axes": ["joint_0", "joint_1", "joint_2", "joint_3", "joint_4", "joint_5", "joint_6", "gripper"],
},
},
# Initially called wrist_image_left
"observation.images.wrist_left": {
"dtype": "video",
"shape": (180, 320, 3),
"names": [
"height",
"width",
"channels",
],
},
# Initially called exterior_image_1_left
"observation.images.exterior_1_left": {
"dtype": "video",
"shape": (180, 320, 3),
"names": [
"height",
"width",
"channels",
],
},
# Initially called exterior_image_2_left
"observation.images.exterior_2_left": {
"dtype": "video",
"shape": (180, 320, 3),
"names": [
"height",
"width",
"channels",
],
},
"action.gripper_position": {
"dtype": "float32",
"shape": (1,),
"names": {
"axes": ["gripper"],
},
},
"action.gripper_velocity": {
"dtype": "float32",
"shape": (1,),
"names": {
"axes": ["gripper"],
},
},
"action.cartesian_position": {
"dtype": "float32",
"shape": (6,),
"names": {
"axes": ["x", "y", "z", "roll", "pitch", "yaw"],
},
},
"action.cartesian_velocity": {
"dtype": "float32",
"shape": (6,),
"names": {
"axes": ["x", "y", "z", "roll", "pitch", "yaw"],
},
},
"action.joint_position": {
"dtype": "float32",
"shape": (7,),
"names": {
"axes": ["joint_0", "joint_1", "joint_2", "joint_3", "joint_4", "joint_5", "joint_6"],
},
},
"action.joint_velocity": {
"dtype": "float32",
"shape": (7,),
"names": {
"axes": ["joint_0", "joint_1", "joint_2", "joint_3", "joint_4", "joint_5", "joint_6"],
},
},
# This feature was called "action" in RLDS dataset and consists of [6x joint velocities, 1x gripper position]
"action.original": {
"dtype": "float32",
"shape": (7,),
"names": {
"axes": ["x", "y", "z", "roll", "pitch", "yaw", "gripper"],
},
},
# Add this new feature to follow LeRobot standard of using joint position + gripper
"action": {
"dtype": "float32",
"shape": (8,),
"names": {
"axes": ["joint_0", "joint_1", "joint_2", "joint_3", "joint_4", "joint_5", "joint_6", "gripper"],
},
},
"discount": {
"dtype": "float32",
"shape": (1,),
"names": None,
},
"reward": {
"dtype": "float32",
"shape": (1,),
"names": None,
},
# Metadata that is the same for all frames in the episode
"task_category": {
"dtype": "string",
"shape": (1,),
"names": None,
},
"building": {
"dtype": "string",
"shape": (1,),
"names": None,
},
"collector_id": {
"dtype": "string",
"shape": (1,),
"names": None,
},
"date": {
"dtype": "string",
"shape": (1,),
"names": None,
},
"camera_extrinsics.wrist_left": {
"dtype": "float32",
"shape": (6,),
"names": {
"axes": ["x", "y", "z", "roll", "pitch", "yaw"],
},
},
"camera_extrinsics.exterior_1_left": {
"dtype": "float32",
"shape": (6,),
"names": {
"axes": ["x", "y", "z", "roll", "pitch", "yaw"],
},
},
"camera_extrinsics.exterior_2_left": {
"dtype": "float32",
"shape": (6,),
"names": {
"axes": ["x", "y", "z", "roll", "pitch", "yaw"],
},
},
"is_episode_successful": {
"dtype": "bool",
"shape": (1,),
"names": None,
},
}
def is_episode_successful(tf_episode_metadata):
# Adapted from: https://github.com/droid-dataset/droid_policy_learning/blob/dd1020eb20d981f90b5ff07dc80d80d5c0cb108b/robomimic/utils/rlds_utils.py#L8
return "/success/" in tf_episode_metadata["file_path"].numpy().decode()
def generate_lerobot_frames(tf_episode):
m = tf_episode["episode_metadata"]
frame_meta = {
"task_category": m["building"].numpy().decode(),
"building": m["building"].numpy().decode(),
"collector_id": m["collector_id"].numpy().decode(),
"date": m["date"].numpy().decode(),
"camera_extrinsics.wrist_left": m["extrinsics_wrist_cam"].numpy(),
"camera_extrinsics.exterior_1_left": m["extrinsics_exterior_cam_1"].numpy(),
"camera_extrinsics.exterior_2_left": m["extrinsics_exterior_cam_2"].numpy(),
"is_episode_successful": np.array([is_episode_successful(m)]),
}
for f in tf_episode["steps"]:
# Dataset schema slightly adapted from: https://droid-dataset.github.io/droid/the-droid-dataset.html#-dataset-schema
frame = {
"is_first": np.array([f["is_first"].numpy()]),
"is_last": np.array([f["is_last"].numpy()]),
"is_terminal": np.array([f["is_terminal"].numpy()]),
"language_instruction": f["language_instruction"].numpy().decode(),
"language_instruction_2": f["language_instruction_2"].numpy().decode(),
"language_instruction_3": f["language_instruction_3"].numpy().decode(),
"observation.state.gripper_position": f["observation"]["gripper_position"].numpy(),
"observation.state.cartesian_position": f["observation"]["cartesian_position"].numpy(),
"observation.state.joint_position": f["observation"]["joint_position"].numpy(),
"observation.images.wrist_left": f["observation"]["wrist_image_left"].numpy(),
"observation.images.exterior_1_left": f["observation"]["exterior_image_1_left"].numpy(),
"observation.images.exterior_2_left": f["observation"]["exterior_image_2_left"].numpy(),
"action.gripper_position": f["action_dict"]["gripper_position"].numpy(),
"action.gripper_velocity": f["action_dict"]["gripper_velocity"].numpy(),
"action.cartesian_position": f["action_dict"]["cartesian_position"].numpy(),
"action.cartesian_velocity": f["action_dict"]["cartesian_velocity"].numpy(),
"action.joint_position": f["action_dict"]["joint_position"].numpy(),
"action.joint_velocity": f["action_dict"]["joint_velocity"].numpy(),
"discount": np.array([f["discount"].numpy()]),
"reward": np.array([f["reward"].numpy()]),
"action.original": f["action"].numpy(),
}
# language_instruction is also stored as "task" to follow LeRobot standard
frame["task"] = frame["language_instruction"]
# Add this new feature to follow LeRobot standard of using joint position + gripper
frame["observation.state"] = np.concatenate(
[frame["observation.state.joint_position"], frame["observation.state.gripper_position"]]
)
frame["action"] = np.concatenate([frame["action.joint_position"], frame["action.gripper_position"]])
# Metadata that is the same for all frames in the episode
frame.update(frame_meta)
# Cast fp64 to fp32
for key in frame:
if isinstance(frame[key], np.ndarray) and frame[key].dtype == np.float64:
frame[key] = frame[key].astype(np.float32)
yield frame
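# Each yielded frame is a flat dict of numpy arrays and strings matching DROID_FEATURES, with
# "task", "observation.state" and "action" added to follow the LeRobot convention and all
# float64 arrays cast down to float32.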
def port_droid(
raw_dir: Path,
repo_id: str,
push_to_hub: bool = False,
num_shards: int | None = None,
shard_index: int | None = None,
):
dataset_name = raw_dir.parent.name
version = raw_dir.name
data_dir = raw_dir.parent.parent
builder = tfds.builder(f"{dataset_name}/{version}", data_dir=data_dir, version="")
if num_shards is not None:
tfds_num_shards = builder.info.splits["train"].num_shards
if tfds_num_shards != DROID_SHARDS:
raise ValueError(
f"Number of shards of Droid dataset is expected to be {DROID_SHARDS} but is {tfds_num_shards}."
)
if num_shards != tfds_num_shards:
raise ValueError(
f"We only shard over the fixed number of shards provided by tensorflow dataset ({tfds_num_shards}), but {num_shards} shards provided instead."
)
if shard_index >= tfds_num_shards:
raise ValueError(
f"Shard index is greater than the num of shards ({shard_index} >= {num_shards})."
)
raw_dataset = builder.as_dataset(split=f"train[{shard_index}shard]")
else:
raw_dataset = builder.as_dataset(split="train")
lerobot_dataset = LeRobotDataset.create(
repo_id=repo_id,
robot_type=DROID_ROBOT_TYPE,
fps=DROID_FPS,
features=DROID_FEATURES,
)
start_time = time.time()
num_episodes = raw_dataset.cardinality().numpy().item()
logging.info(f"Number of episodes {num_episodes}")
for episode_index, episode in enumerate(raw_dataset):
elapsed_time = time.time() - start_time
d, h, m, s = get_elapsed_time_in_days_hours_minutes_seconds(elapsed_time)
logging.info(
f"{episode_index} / {num_episodes} episodes processed (after {d} days, {h} hours, {m} minutes, {s:.3f} seconds)"
)
for frame in generate_lerobot_frames(episode):
lerobot_dataset.add_frame(frame)
lerobot_dataset.save_episode()
logging.info("Save_episode")
if push_to_hub:
lerobot_dataset.push_to_hub(
# Add openx tag, since it belongs to the openx collection of datasets
tags=["openx"],
private=False,
)
def validate_dataset(repo_id):
"""Sanity check that ensure meta data can be loaded and all files are present."""
meta = LeRobotDatasetMetadata(repo_id)
if meta.total_episodes == 0:
raise ValueError("Number of episodes is 0.")
for ep_idx in range(meta.total_episodes):
data_path = meta.root / meta.get_data_file_path(ep_idx)
if not data_path.exists():
raise ValueError(f"Parquet file is missing in: {data_path}")
for vid_key in meta.video_keys:
vid_path = meta.root / meta.get_video_file_path(ep_idx, vid_key)
if not vid_path.exists():
raise ValueError(f"Video file is missing in: {vid_path}")
def main():
parser = argparse.ArgumentParser()
parser.add_argument(
"--raw-dir",
type=Path,
required=True,
help="Directory containing input raw datasets (e.g. `path/to/dataset` or `path/to/dataset/version).",
)
parser.add_argument(
"--repo-id",
type=str,
help="Repositery identifier on Hugging Face: a community or a user name `/` the name of the dataset, required when push-to-hub is True",
)
parser.add_argument(
"--push-to-hub",
action="store_true",
help="Upload to hub.",
)
parser.add_argument(
"--num-shards",
type=int,
default=None,
help="Number of shards. Can be either None to load the full dataset, or 2048 to load one of the 2048 tensorflow dataset files.",
)
parser.add_argument(
"--shard-index",
type=int,
default=None,
help="Index of the shard. Can be either None to load the full dataset, or in [0,2047] to load one of the 2048 tensorflow dataset files.",
)
args = parser.parse_args()
port_droid(**vars(args))
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,293 @@
#!/usr/bin/env python
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import logging
from pathlib import Path
import tqdm
from datatrove.executor import LocalPipelineExecutor
from datatrove.executor.slurm import SlurmPipelineExecutor
from datatrove.pipeline.base import PipelineStep
from examples.port_datasets.droid_rlds.port_droid import DROID_SHARDS
from lerobot.common.datasets.aggregate import validate_all_metadata
from lerobot.common.datasets.lerobot_dataset import LeRobotDatasetMetadata
from lerobot.common.datasets.utils import (
legacy_write_episode_stats,
legacy_write_task,
write_episode,
write_info,
)
from lerobot.common.utils.utils import init_logging
class AggregateDatasets(PipelineStep):
def __init__(
self,
repo_ids: list[str],
aggregated_repo_id: str,
):
super().__init__()
self.repo_ids = repo_ids
self.aggr_repo_id = aggregated_repo_id
self.create_aggr_dataset()
def create_aggr_dataset(self):
init_logging()
logging.info("Start aggregate_datasets")
all_metadata = [LeRobotDatasetMetadata(repo_id) for repo_id in self.repo_ids]
fps, robot_type, features = validate_all_metadata(all_metadata)
# Create resulting dataset folder
aggr_meta = LeRobotDatasetMetadata.create(
repo_id=self.aggr_repo_id,
fps=fps,
robot_type=robot_type,
features=features,
)
logging.info("Find all tasks")
# find all tasks, deduplicate them, create new task indices for each dataset
# indexed by dataset index
datasets_task_index_to_aggr_task_index = {}
aggr_task_index = 0
for dataset_index, meta in enumerate(tqdm.tqdm(all_metadata, desc="Find all tasks")):
task_index_to_aggr_task_index = {}
for task_index, task in meta.tasks.items():
if task not in aggr_meta.task_to_task_index:
# add the task to aggr tasks mappings
aggr_meta.tasks[aggr_task_index] = task
aggr_meta.task_to_task_index[task] = aggr_task_index
aggr_task_index += 1
# add task_index anyway
task_index_to_aggr_task_index[task_index] = aggr_meta.task_to_task_index[task]
datasets_task_index_to_aggr_task_index[dataset_index] = task_index_to_aggr_task_index
logging.info("Prepare copy data and videos")
datasets_ep_idx_to_aggr_ep_idx = {}
datasets_aggr_episode_index_shift = {}
aggr_episode_index_shift = 0
for dataset_index, meta in enumerate(tqdm.tqdm(all_metadata, desc="Prepare copy data and videos")):
ep_idx_to_aggr_ep_idx = {}
for episode_index in range(meta.total_episodes):
aggr_episode_index = episode_index + aggr_episode_index_shift
ep_idx_to_aggr_ep_idx[episode_index] = aggr_episode_index
datasets_ep_idx_to_aggr_ep_idx[dataset_index] = ep_idx_to_aggr_ep_idx
datasets_aggr_episode_index_shift[dataset_index] = aggr_episode_index_shift
# populate episodes
for episode_index, episode_dict in meta.episodes.items():
aggr_episode_index = episode_index + aggr_episode_index_shift
episode_dict["episode_index"] = aggr_episode_index
aggr_meta.episodes[aggr_episode_index] = episode_dict
# populate episodes_stats
for episode_index, episode_stats in meta.episodes_stats.items():
aggr_episode_index = episode_index + aggr_episode_index_shift
aggr_meta.episodes_stats[aggr_episode_index] = episode_stats
# populate info
aggr_meta.info["total_episodes"] += meta.total_episodes
aggr_meta.info["total_frames"] += meta.total_frames
aggr_meta.info["total_videos"] += len(aggr_meta.video_keys) * meta.total_episodes
aggr_episode_index_shift += meta.total_episodes
logging.info("Write meta data")
aggr_meta.info["total_tasks"] = len(aggr_meta.tasks)
aggr_meta.info["total_chunks"] = aggr_meta.get_episode_chunk(aggr_episode_index_shift - 1)
aggr_meta.info["splits"] = {"train": f"0:{aggr_meta.info['total_episodes']}"}
# create a new episodes jsonl with updated episode_index using write_episode
for episode_dict in tqdm.tqdm(aggr_meta.episodes.values(), desc="Write episodes"):
write_episode(episode_dict, aggr_meta.root)
# create a new episode_stats jsonl with updated episode_index using write_episode_stats
for episode_index, episode_stats in tqdm.tqdm(
aggr_meta.episodes_stats.items(), desc="Write episodes stats"
):
legacy_write_episode_stats(episode_index, episode_stats, aggr_meta.root)
# create a new task jsonl with updated episode_index using write_task
for task_index, task in tqdm.tqdm(aggr_meta.tasks.items(), desc="Write tasks"):
legacy_write_task(task_index, task, aggr_meta.root)
write_info(aggr_meta.info, aggr_meta.root)
self.datasets_task_index_to_aggr_task_index = datasets_task_index_to_aggr_task_index
self.datasets_ep_idx_to_aggr_ep_idx = datasets_ep_idx_to_aggr_ep_idx
self.datasets_aggr_episode_index_shift = datasets_aggr_episode_index_shift
logging.info("Meta data done writing!")
def run(self, data=None, rank: int = 0, world_size: int = 1):
import logging
import shutil
import pandas as pd
from lerobot.common.datasets.aggregate import get_update_episode_and_task_func
from lerobot.common.datasets.lerobot_dataset import LeRobotDatasetMetadata
from lerobot.common.utils.utils import init_logging
init_logging()
aggr_meta = LeRobotDatasetMetadata(self.aggr_repo_id)
all_metadata = [LeRobotDatasetMetadata(repo_id) for repo_id in self.repo_ids]
if world_size != len(all_metadata):
raise ValueError(f"world_size ({world_size}) is expected to match the number of datasets ({len(all_metadata)}).")
dataset_index = rank
meta = all_metadata[dataset_index]
aggr_episode_index_shift = self.datasets_aggr_episode_index_shift[dataset_index]
logging.info("Copy data")
for episode_index in range(meta.total_episodes):
aggr_episode_index = self.datasets_ep_idx_to_aggr_ep_idx[dataset_index][episode_index]
data_path = meta.root / meta.get_data_file_path(episode_index)
aggr_data_path = aggr_meta.root / aggr_meta.get_data_file_path(aggr_episode_index)
# update episode_index and task_index
df = pd.read_parquet(data_path)
update_row_func = get_update_episode_and_task_func(
aggr_episode_index_shift, self.datasets_task_index_to_aggr_task_index[dataset_index]
)
df = df.apply(update_row_func, axis=1)
aggr_data_path.parent.mkdir(parents=True, exist_ok=True)
df.to_parquet(aggr_data_path)
logging.info("Copy videos")
for episode_index in range(meta.total_episodes):
aggr_episode_index = episode_index + aggr_episode_index_shift
for vid_key in meta.video_keys:
video_path = meta.root / meta.get_video_file_path(episode_index, vid_key)
aggr_video_path = aggr_meta.root / aggr_meta.get_video_file_path(aggr_episode_index, vid_key)
aggr_video_path.parent.mkdir(parents=True, exist_ok=True)
shutil.copy(video_path, aggr_video_path)
# copy_command = f"cp {video_path} {aggr_video_path} &"
# subprocess.Popen(copy_command, shell=True)
logging.info("Done!")
def make_aggregate_executor(
repo_ids, repo_id, job_name, logs_dir, workers, partition, cpus_per_task, mem_per_cpu, slurm=True
):
kwargs = {
"pipeline": [
AggregateDatasets(repo_ids, repo_id),
],
"logging_dir": str(logs_dir / job_name),
}
if slurm:
kwargs.update(
{
"job_name": job_name,
"tasks": DROID_SHARDS,
"workers": workers,
"time": "08:00:00",
"partition": partition,
"cpus_per_task": cpus_per_task,
"sbatch_args": {"mem-per-cpu": mem_per_cpu},
}
)
executor = SlurmPipelineExecutor(**kwargs)
else:
kwargs.update(
{
"tasks": DROID_SHARDS,
"workers": 1,
}
)
executor = LocalPipelineExecutor(**kwargs)
return executor
def main():
parser = argparse.ArgumentParser()
parser.add_argument(
"--repo-id",
type=str,
help="Repositery identifier on Hugging Face: a community or a user name `/` the name of the dataset, required when push-to-hub is True.",
)
parser.add_argument(
"--logs-dir",
type=Path,
help="Path to logs directory for `datatrove`.",
)
parser.add_argument(
"--job-name",
type=str,
default="aggr_droid",
help="Job name used in slurm, and name of the directory created inside the provided logs directory.",
)
parser.add_argument(
"--slurm",
type=int,
default=1,
help="Launch over slurm. Use `--slurm 0` to launch sequentially (useful to debug).",
)
parser.add_argument(
"--workers",
type=int,
default=2048,
help="Number of slurm workers. It should be less than the maximum number of shards.",
)
parser.add_argument(
"--partition",
type=str,
help="Slurm partition. Ideally a CPU partition. No need for GPU partition.",
)
parser.add_argument(
"--cpus-per-task",
type=int,
default=8,
help="Number of cpus that each slurm worker will use.",
)
parser.add_argument(
"--mem-per-cpu",
type=str,
default="1950M",
help="Memory per cpu that each worker will use.",
)
args = parser.parse_args()
kwargs = vars(args)
kwargs["slurm"] = kwargs.pop("slurm") == 1
repo_ids = [f"{args.repo_id}_world_{DROID_SHARDS}_rank_{rank}" for rank in range(DROID_SHARDS)]
aggregate_executor = make_aggregate_executor(repo_ids, **kwargs)
aggregate_executor.run()
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,147 @@
import argparse
from pathlib import Path
from datatrove.executor import LocalPipelineExecutor
from datatrove.executor.slurm import SlurmPipelineExecutor
from datatrove.pipeline.base import PipelineStep
from examples.port_datasets.droid_rlds.port_droid import DROID_SHARDS
class PortDroidShards(PipelineStep):
def __init__(
self,
raw_dir: Path | str,
repo_id: str = None,
):
super().__init__()
self.raw_dir = Path(raw_dir)
self.repo_id = repo_id
def run(self, data=None, rank: int = 0, world_size: int = 1):
from datasets.utils.tqdm import disable_progress_bars
from examples.port_datasets.droid_rlds.port_droid import port_droid, validate_dataset
from lerobot.common.utils.utils import init_logging
init_logging()
disable_progress_bars()
shard_repo_id = f"{self.repo_id}_world_{world_size}_rank_{rank}"
try:
validate_dataset(shard_repo_id)
return
except Exception:
# The shard has not been ported yet (or is incomplete): port it below
pass
port_droid(
self.raw_dir,
shard_repo_id,
push_to_hub=False,
num_shards=world_size,
shard_index=rank,
)
validate_dataset(shard_repo_id)
def make_port_executor(
raw_dir, repo_id, job_name, logs_dir, workers, partition, cpus_per_task, mem_per_cpu, slurm=True
):
kwargs = {
"pipeline": [
PortDroidShards(raw_dir, repo_id),
],
"logging_dir": str(logs_dir / job_name),
}
if slurm:
kwargs.update(
{
"job_name": job_name,
"tasks": DROID_SHARDS,
"workers": workers,
"time": "08:00:00",
"partition": partition,
"cpus_per_task": cpus_per_task,
"sbatch_args": {"mem-per-cpu": mem_per_cpu},
}
)
executor = SlurmPipelineExecutor(**kwargs)
else:
kwargs.update(
{
"tasks": 1,
"workers": 1,
}
)
executor = LocalPipelineExecutor(**kwargs)
return executor
def main():
parser = argparse.ArgumentParser()
parser.add_argument(
"--raw-dir",
type=Path,
required=True,
help="Directory containing input raw datasets (e.g. `path/to/dataset` or `path/to/dataset/version).",
)
parser.add_argument(
"--repo-id",
type=str,
help="Repositery identifier on Hugging Face: a community or a user name `/` the name of the dataset, required when push-to-hub is True.",
)
parser.add_argument(
"--logs-dir",
type=Path,
help="Path to logs directory for `datatrove`.",
)
parser.add_argument(
"--job-name",
type=str,
default="port_droid",
help="Job name used in slurm, and name of the directory created inside the provided logs directory.",
)
parser.add_argument(
"--slurm",
type=int,
default=1,
help="Launch over slurm. Use `--slurm 0` to launch sequentially (useful to debug).",
)
parser.add_argument(
"--workers",
type=int,
default=2048,
help="Number of slurm workers. It should be less than the maximum number of shards.",
)
parser.add_argument(
"--partition",
type=str,
help="Slurm partition. Ideally a CPU partition. No need for GPU partition.",
)
parser.add_argument(
"--cpus-per-task",
type=int,
default=8,
help="Number of cpus that each slurm worker will use.",
)
parser.add_argument(
"--mem-per-cpu",
type=str,
default="1950M",
help="Memory per cpu that each worker will use.",
)
args = parser.parse_args()
kwargs = vars(args)
kwargs["slurm"] = kwargs.pop("slurm") == 1
port_executor = make_port_executor(**kwargs)
port_executor.run()
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,263 @@
import argparse
import logging
import os
from pathlib import Path
from datatrove.executor import LocalPipelineExecutor
from datatrove.executor.slurm import SlurmPipelineExecutor
from datatrove.pipeline.base import PipelineStep
from huggingface_hub import HfApi
from huggingface_hub.constants import REPOCARD_NAME
from examples.port_datasets.droid_rlds.port_droid import DROID_SHARDS
from lerobot.common.datasets.lerobot_dataset import CODEBASE_VERSION, LeRobotDatasetMetadata
from lerobot.common.datasets.utils import create_lerobot_dataset_card
from lerobot.common.utils.utils import init_logging
class UploadDataset(PipelineStep):
def __init__(
self,
repo_id: str,
branch: str | None = None,
revision: str | None = None,
tags: list | None = None,
license: str | None = "apache-2.0",
private: bool = False,
distant_repo_id: str | None = None,
**card_kwargs,
):
super().__init__()
self.repo_id = repo_id
self.distant_repo_id = self.repo_id if distant_repo_id is None else distant_repo_id
self.branch = branch
self.tags = tags
self.license = license
self.private = private
self.card_kwargs = card_kwargs
self.revision = revision if revision else CODEBASE_VERSION
if os.environ.get("HF_HUB_ENABLE_HF_TRANSFER", "0") != "1":
logging.warning(
'HF_HUB_ENABLE_HF_TRANSFER is not set to "1". Install hf_transfer and set the env '
"variable for faster uploads:\npip install hf-transfer\nexport HF_HUB_ENABLE_HF_TRANSFER=1"
)
self.create_repo()
def create_repo(self):
logging.info(f"Loading meta data from {self.repo_id}...")
meta = LeRobotDatasetMetadata(self.repo_id)
logging.info(f"Creating repo {self.distant_repo_id}...")
hub_api = HfApi()
hub_api.create_repo(
repo_id=self.distant_repo_id,
private=self.private,
repo_type="dataset",
exist_ok=True,
)
if self.branch:
hub_api.create_branch(
repo_id=self.distant_repo_id,
branch=self.branch,
revision=self.revision,
repo_type="dataset",
exist_ok=True,
)
if not hub_api.file_exists(
self.distant_repo_id, REPOCARD_NAME, repo_type="dataset", revision=self.branch
):
card = create_lerobot_dataset_card(
tags=self.tags, dataset_info=meta.info, license=self.license, **self.card_kwargs
)
card.push_to_hub(repo_id=self.distant_repo_id, repo_type="dataset", revision=self.branch)
def list_files_recursively(directory):
base_path = Path(directory)
return [str(file.relative_to(base_path)) for file in base_path.rglob("*") if file.is_file()]
logging.info(f"Listing all local files from {self.repo_id}...")
self.file_paths = list_files_recursively(meta.root)
self.file_paths = sorted(self.file_paths)
def create_chunks(self, lst, n):
from itertools import islice
it = iter(lst)
return [list(islice(it, size)) for size in [len(lst) // n + (i < len(lst) % n) for i in range(n)]]
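# For example, chunking 10 file paths across 3 workers yields sizes [4, 3, 3]:
# create_chunks(list(range(10)), 3) -> [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]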
def create_commits(self, additions):
import logging
import math
import random
import time
from huggingface_hub import create_commit
from huggingface_hub.utils import HfHubHTTPError
FILES_BETWEEN_COMMITS = 10 # noqa: N806
BASE_DELAY = 0.1 # noqa: N806
MAX_RETRIES = 12 # noqa: N806
# Split the files into smaller chunks for faster commit
# and avoiding "A commit has happened since" error
num_chunks = math.ceil(len(additions) / FILES_BETWEEN_COMMITS)
chunks = self.create_chunks(additions, num_chunks)
for chunk in chunks:
retries = 0
while True:
try:
create_commit(
self.distant_repo_id,
repo_type="dataset",
operations=chunk,
commit_message=f"DataTrove upload ({len(chunk)} files)",
revision=self.branch,
)
# TODO: every 100 chunks super_squach_commits()
logging.info("create_commit completed!")
break
except HfHubHTTPError as e:
if "A commit has happened since" in e.server_message:
if retries >= MAX_RETRIES:
logging.error(f"Failed to create commit after {MAX_RETRIES=}. Giving up.")
raise e
logging.info("Commit creation race condition issue. Waiting...")
time.sleep(BASE_DELAY * 2**retries + random.uniform(0, 2))
retries += 1
else:
raise e
def run(self, data=None, rank: int = 0, world_size: int = 1):
import logging
from datasets.utils.tqdm import disable_progress_bars
from huggingface_hub import CommitOperationAdd, preupload_lfs_files
from lerobot.common.datasets.lerobot_dataset import LeRobotDatasetMetadata
from lerobot.common.utils.utils import init_logging
init_logging()
disable_progress_bars()
chunks = self.create_chunks(self.file_paths, world_size)
file_paths = chunks[rank]
if len(file_paths) == 0:
raise ValueError(f"No files to upload for rank {rank} (world_size={world_size}).")
logging.info("Pre-uploading LFS files...")
for i, path in enumerate(file_paths):
logging.info(f"{i}: {path}")
meta = LeRobotDatasetMetadata(self.repo_id)
additions = [
CommitOperationAdd(path_in_repo=path, path_or_fileobj=meta.root / path) for path in file_paths
]
preupload_lfs_files(
repo_id=self.distant_repo_id, repo_type="dataset", additions=additions, revision=self.branch
)
logging.info("Creating commits...")
self.create_commits(additions)
logging.info("Done!")
def make_upload_executor(
repo_id, job_name, logs_dir, workers, partition, cpus_per_task, mem_per_cpu, slurm=True
):
kwargs = {
"pipeline": [
UploadDataset(repo_id),
],
"logging_dir": str(logs_dir / job_name),
}
if slurm:
kwargs.update(
{
"job_name": job_name,
"tasks": DROID_SHARDS,
"workers": workers,
"time": "08:00:00",
"partition": partition,
"cpus_per_task": cpus_per_task,
"sbatch_args": {"mem-per-cpu": mem_per_cpu},
}
)
executor = SlurmPipelineExecutor(**kwargs)
else:
kwargs.update(
{
"tasks": DROID_SHARDS,
"workers": 1,
}
)
executor = LocalPipelineExecutor(**kwargs)
return executor
def main():
parser = argparse.ArgumentParser()
parser.add_argument(
"--repo-id",
type=str,
help="Repositery identifier on Hugging Face: a community or a user name `/` the name of the dataset, required when push-to-hub is True.",
)
parser.add_argument(
"--logs-dir",
type=Path,
help="Path to logs directory for `datatrove`.",
)
parser.add_argument(
"--job-name",
type=str,
default="upload_droid",
help="Job name used in slurm, and name of the directory created inside the provided logs directory.",
)
parser.add_argument(
"--slurm",
type=int,
default=1,
help="Launch over slurm. Use `--slurm 0` to launch sequentially (useful to debug).",
)
parser.add_argument(
"--workers",
type=int,
default=50,
help="Number of slurm workers. It should be less than the maximum number of shards.",
)
parser.add_argument(
"--partition",
type=str,
help="Slurm partition. Ideally a CPU partition. No need for GPU partition.",
)
parser.add_argument(
"--cpus-per-task",
type=int,
default=8,
help="Number of cpus that each slurm worker will use.",
)
parser.add_argument(
"--mem-per-cpu",
type=str,
default="1950M",
help="Memory per cpu that each worker will use.",
)
init_logging()
args = parser.parse_args()
kwargs = vars(args)
kwargs["slurm"] = kwargs.pop("slurm") == 1
upload_executor = make_upload_executor(**kwargs)
upload_executor.run()
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,413 @@
import logging
import shutil
from pathlib import Path
import pandas as pd
import tqdm
from lerobot.common.constants import HF_LEROBOT_HOME
from lerobot.common.datasets.compute_stats import aggregate_stats
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset, LeRobotDatasetMetadata
from lerobot.common.datasets.utils import (
DEFAULT_CHUNK_SIZE,
DEFAULT_DATA_FILE_SIZE_IN_MB,
DEFAULT_DATA_PATH,
DEFAULT_EPISODES_PATH,
DEFAULT_VIDEO_FILE_SIZE_IN_MB,
DEFAULT_VIDEO_PATH,
concat_video_files,
get_parquet_file_size_in_mb,
get_video_size_in_mb,
safe_write_dataframe_to_parquet,
update_chunk_file_indices,
write_info,
write_stats,
write_tasks,
)
from lerobot.common.utils.utils import init_logging
def validate_all_metadata(all_metadata: list[LeRobotDatasetMetadata]):
# validate same fps, robot_type, features
fps = all_metadata[0].fps
robot_type = all_metadata[0].robot_type
features = all_metadata[0].features
for meta in tqdm.tqdm(all_metadata, desc="Validate all meta data"):
if fps != meta.fps:
raise ValueError(f"Same fps is expected, but got fps={meta.fps} instead of {fps}.")
if robot_type != meta.robot_type:
raise ValueError(
f"Same robot_type is expected, but got robot_type={meta.robot_type} instead of {robot_type}."
)
if features != meta.features:
raise ValueError(
f"Same features is expected, but got features={meta.features} instead of {features}."
)
return fps, robot_type, features
def update_data_df(df, src_meta, dst_meta):
def _update(row):
row["episode_index"] = row["episode_index"] + dst_meta["total_episodes"]
row["index"] = row["index"] + dst_meta["total_frames"]
task = src_meta.tasks.iloc[row["task_index"]].name
row["task_index"] = dst_meta.tasks.loc[task].task_index.item()
return row
return df.apply(_update, axis=1)
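# In short: episode_index and index are shifted by the episode/frame totals already absorbed
# into the destination dataset, and task_index is remapped through the aggregated task table
# so identical task strings share a single index.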
def update_meta_data(
df,
dst_meta,
meta_idx,
data_idx,
videos_idx,
):
def _update(row):
row["meta/episodes/chunk_index"] = row["meta/episodes/chunk_index"] + meta_idx["chunk_index"]
row["meta/episodes/file_index"] = row["meta/episodes/file_index"] + meta_idx["file_index"]
row["data/chunk_index"] = row["data/chunk_index"] + data_idx["chunk_index"]
row["data/file_index"] = row["data/file_index"] + data_idx["file_index"]
for key, video_idx in videos_idx.items():
row[f"videos/{key}/chunk_index"] = row[f"videos/{key}/chunk_index"] + video_idx["chunk_index"]
row[f"videos/{key}/file_index"] = row[f"videos/{key}/file_index"] + video_idx["file_index"]
row[f"videos/{key}/from_timestamp"] = (
row[f"videos/{key}/from_timestamp"] + video_idx["latest_duration"]
)
row[f"videos/{key}/to_timestamp"] = (
row[f"videos/{key}/to_timestamp"] + video_idx["latest_duration"]
)
row["dataset_from_index"] = row["dataset_from_index"] + dst_meta.info["total_frames"]
row["dataset_to_index"] = row["dataset_to_index"] + dst_meta.info["total_frames"]
row["episode_index"] = row["episode_index"] + dst_meta.info["total_episodes"]
return row
return df.apply(_update, axis=1)
def aggregate_datasets(repo_ids: list[str], aggr_repo_id: str, roots: list[Path] = None, aggr_root=None):
logging.info("Start aggregate_datasets")
# Load metadata
all_metadata = (
[LeRobotDatasetMetadata(repo_id) for repo_id in repo_ids]
if roots is None
else [
LeRobotDatasetMetadata(repo_id, root=root) for repo_id, root in zip(repo_ids, roots, strict=False)
]
)
fps, robot_type, features = validate_all_metadata(all_metadata)
video_keys = [key for key in features if features[key]["dtype"] == "video"]
image_keys = [key for key in features if features[key]["dtype"] == "image"]
# Initialize output dataset metadata
dst_meta = LeRobotDatasetMetadata.create(
repo_id=aggr_repo_id,
fps=fps,
robot_type=robot_type,
features=features,
root=aggr_root,
)
# Aggregate task info
logging.info("Find all tasks")
unique_tasks = pd.concat([m.tasks for m in all_metadata]).index.unique()
dst_meta.tasks = pd.DataFrame({"task_index": range(len(unique_tasks))}, index=unique_tasks)
# Track counters and indices
meta_idx = {"chunk": 0, "file": 0}
data_idx = {"chunk": 0, "file": 0}
videos_idx = {
key: {"chunk": 0, "file": 0, "latest_duration": 0, "episode_duration": 0} for key in video_keys
}
# Process each dataset
for src_meta in tqdm.tqdm(all_metadata, desc="Copy data and videos"):
videos_idx = aggregate_videos(src_meta, dst_meta, videos_idx)
data_idx = aggregate_data(src_meta, dst_meta, data_idx)
meta_idx = aggregate_metadata(src_meta, dst_meta, meta_idx, data_idx, videos_idx, video_keys)
dst_meta.info["total_episodes"] += src_meta.total_episodes
dst_meta.info["total_frames"] += src_meta.total_frames
finalize_aggregation(dst_meta, all_metadata)
logging.info("Aggregation complete.")
# -------------------------------
# Helper Functions
# -------------------------------
def aggregate_videos(src_meta, dst_meta, videos_idx):
"""
Aggregates video chunks from a dataset into the aggregated dataset folder.
"""
for key, video_idx in videos_idx.items():
# Get unique (chunk, file) combinations
unique_chunk_file_pairs = {
(chunk, file)
for chunk, file in zip(
src_meta.episodes[f"videos/{key}/chunk_index"],
src_meta.episodes[f"videos/{key}/file_index"],
strict=False,
)
}
# Current target chunk/file index
chunk_idx = video_idx["chunk_idx"]
file_idx = video_idx["file_idx"]
for src_chunk_idx, src_file_idx in unique_chunk_file_pairs:
src_path = src_meta.root / DEFAULT_VIDEO_PATH.format(
video_key=key,
chunk_index=src_chunk_idx,
file_index=src_file_idx,
)
dst_path = dst_meta.root / DEFAULT_VIDEO_PATH.format(
video_key=key,
chunk_index=chunk_idx,
file_index=file_idx,
)
if not dst_path.exists():
# First write to this destination file
dst_path.parent.mkdir(parents=True, exist_ok=True)
shutil.copy(str(src_path), str(dst_path))
continue
# Check file sizes before appending
src_size = get_video_size_in_mb(src_path)
dst_size = get_video_size_in_mb(dst_path)
if dst_size + src_size >= DEFAULT_VIDEO_FILE_SIZE_IN_MB:
# Rotate to a new chunk/file
chunk_idx, file_idx = update_chunk_file_indices(chunk_idx, file_idx, DEFAULT_CHUNK_SIZE)
dst_path = dst_meta.root / DEFAULT_VIDEO_PATH.format(
video_key=key,
chunk_index=chunk_idx,
file_index=file_idx,
)
dst_path.parent.mkdir(parents=True, exist_ok=True)
shutil.copy(str(src_path), str(dst_path))
else:
# Append to existing video file
concat_video_files(
[dst_path, src_path],
dst_meta.root,
key,
chunk_idx,
file_idx,
)
# Remember where this video key stopped so the next source dataset appends after it
video_idx["chunk"] = chunk_idx
video_idx["file"] = file_idx
return videos_idx
def aggregate_data(src_meta, dst_meta, data_idx):
unique_chunk_file_ids = {
(c, f)
for c, f in zip(
src_meta.episodes["data/chunk_index"], src_meta.episodes["data/file_index"], strict=False
)
}
for src_chunk_idx, src_file_idx in unique_chunk_file_ids:
src_path = src_meta.root / DEFAULT_DATA_PATH.format(
chunk_index=src_chunk_idx, file_index=src_file_idx
)
df = pd.read_parquet(src_path)
df = update_data_df(df, src_meta, dst_meta)
dst_path = dst_meta.root / DEFAULT_DATA_PATH.format(
chunk_index=data_idx["chunk"], file_index=data_idx["file"]
)
data_idx = write_parquet_safely(
df,
src_path,
dst_path,
data_idx,
DEFAULT_DATA_FILE_SIZE_IN_MB,
DEFAULT_CHUNK_SIZE,
DEFAULT_DATA_PATH,
)
return data_idx
def aggregate_metadata(src_meta, dst_meta, meta_idx, data_idx, videos_idx, video_keys):
chunk_file_ids = {
(c, f)
for c, f in zip(
src_meta.episodes["meta/episodes/chunk_index"],
src_meta.episodes["meta/episodes/file_index"],
strict=False,
)
}
for chunk_idx, file_idx in chunk_file_ids:
src_path = src_meta.root / DEFAULT_EPISODES_PATH.format(chunk_index=chunk_idx, file_index=file_idx)
df = pd.read_parquet(src_path)
df = update_meta_data(
df,
dst_meta,
meta_idx,
data_idx,
videos_idx,
)
# for k in video_keys:
# video_idx[k]["latest_duration"] += video_idx[k]["episode_duration"]
dst_path = dst_meta.root / DEFAULT_EPISODES_PATH.format(
chunk_index=meta_idx["chunk"], file_index=meta_idx["file"]
)
write_parquet_safely(
df,
src_path,
dst_path,
meta_idx,
DEFAULT_DATA_FILE_SIZE_IN_MB,
DEFAULT_CHUNK_SIZE,
DEFAULT_EPISODES_PATH,
)
return meta_idx
def write_parquet_safely(
df: pd.DataFrame,
src_path: Path,
dst_path: Path,
idx: dict[str, int],
max_mb: float,
chunk_size: int,
default_path: str,
):
"""
Safely appends or creates a Parquet file at dst_path based on size constraints.
Parameters:
df (pd.DataFrame): Data to write.
src_path (Path): Path to source file (used to get size).
dst_path (Path): Target path for writing.
idx (dict): Dictionary containing 'chunk' and 'file' indices.
max_mb (float): Maximum allowed file size in MB.
chunk_size (int): Maximum number of files per chunk.
default_path (str): Format string for generating a new file path.
Returns:
dict: Updated index dictionary.
"""
# If destination file doesn't exist, just write the new one
if not dst_path.exists():
dst_path.parent.mkdir(parents=True, exist_ok=True)
df.to_parquet(dst_path)
return idx
# Otherwise, check if we exceed the size limit
src_size = get_parquet_file_size_in_mb(src_path)
dst_size = get_parquet_file_size_in_mb(dst_path)
if dst_size + src_size >= max_mb:
# File is too large, move to a new one
idx["chunk"], idx["file"] = update_chunk_file_indices(idx["chunk"], idx["file"], chunk_size)
new_path = dst_path.parent / default_path.format(chunk_index=idx["chunk"], file_index=idx["file"])
new_path.parent.mkdir(parents=True, exist_ok=True)
df.to_parquet(new_path)
else:
# Append to existing file
existing_df = pd.read_parquet(dst_path)
combined_df = pd.concat([existing_df, df], ignore_index=True)
combined_df.to_parquet(dst_path)
return idx
def finalize_aggregation(aggr_meta, all_metadata):
logging.info("write tasks")
write_tasks(aggr_meta.tasks, aggr_meta.root)
logging.info("write info")
aggr_meta.info.update(
{
"total_tasks": len(aggr_meta.tasks),
"total_episodes": sum(m.total_episodes for m in all_metadata),
"total_frames": sum(m.total_frames for m in all_metadata),
"splits": {"train": f"0:{sum(m.total_episodes for m in all_metadata)}"},
}
)
write_info(aggr_meta.info, aggr_meta.root)
logging.info("write stats")
aggr_meta.stats = aggregate_stats([m.stats for m in all_metadata])
write_stats(aggr_meta.stats, aggr_meta.root)
if __name__ == "__main__":
init_logging()
num_shards = 2048
repo_id = "cadene/droid_1.0.1_v30"
aggr_repo_id = f"{repo_id}_compact_6"
tags = ["openx"]
# num_shards = 210
# repo_id = "cadene/agibot_alpha_v30"
# aggr_repo_id = f"{repo_id}"
# tags = None
# aggr_root = Path(f"/tmp/{aggr_repo_id}")
aggr_root = HF_LEROBOT_HOME / aggr_repo_id
if aggr_root.exists():
shutil.rmtree(aggr_root)
repo_ids = []
roots = []
for rank in range(num_shards):
shard_repo_id = f"{repo_id}_world_{num_shards}_rank_{rank}"
shard_root = HF_LEROBOT_HOME / shard_repo_id
try:
meta = LeRobotDatasetMetadata(shard_repo_id, root=shard_root)
if len(meta.video_keys) == 0:
continue
repo_ids.append(shard_repo_id)
roots.append(shard_root)
except Exception:
pass  # skip shards whose metadata is missing or cannot be loaded
if rank == 1:
break
aggregate_datasets(
repo_ids,
aggr_repo_id,
roots=roots,
aggr_root=aggr_root,
)
aggr_dataset = LeRobotDataset(repo_id=aggr_repo_id, root=aggr_root)
# for i in tqdm.tqdm(range(len(aggr_dataset))):
# aggr_dataset[i]
# pass
aggr_dataset.push_to_hub(tags=tags, upload_large_folder=True)

View File

@@ -47,6 +47,18 @@ If you encounter a problem, contact LeRobot maintainers on [Discord](https://dis
or open an [issue on GitHub](https://github.com/huggingface/lerobot/issues/new/choose).
"""
V30_MESSAGE = """
The dataset you requested ({repo_id}) is in {version} format.
While the current version of LeRobot is backward-compatible with it, your dataset still uses global
stats instead of per-episode stats. Update your dataset stats to the new format with this command:
```
python lerobot/common/datasets/v30/convert_dataset_v21_to_v30.py --repo-id={repo_id}
```
If you encounter a problem, contact LeRobot maintainers on [Discord](https://discord.com/invite/s3KuuzsPFb)
or open an [issue on GitHub](https://github.com/huggingface/lerobot/issues/new/choose).
"""
FUTURE_MESSAGE = """
The dataset you requested ({repo_id}) is only available in {version} format.
As we cannot ensure forward compatibility with it, please update your current version of lerobot.
@@ -58,7 +70,14 @@ class CompatibilityError(Exception): ...
class BackwardCompatibilityError(CompatibilityError):
def __init__(self, repo_id: str, version: packaging.version.Version):
message = V2_MESSAGE.format(repo_id=repo_id, version=version)
if version.major == 3:
message = V30_MESSAGE.format(repo_id=repo_id, version=version)
elif version.major == 2:
message = V2_MESSAGE.format(repo_id=repo_id, version=version)
else:
raise NotImplementedError(
"Contact the maintainer on [Discord](https://discord.com/invite/s3KuuzsPFb)."
)
super().__init__(message)
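The dispatch above keys off the parsed version's `major` attribute; `packaging` accepts the leading `v` used by `CODEBASE_VERSION`. A quick standalone check of that assumption:

```
import packaging.version

v2 = packaging.version.parse("v2.1")
v3 = packaging.version.parse("v3.0")
assert (v2.major, v3.major) == (2, 3)  # the leading "v" is accepted and stripped by packaging
```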

View File

@@ -16,16 +16,18 @@
import contextlib
import logging
import shutil
import tempfile
from pathlib import Path
from typing import Callable
import datasets
import numpy as np
import packaging.version
import pandas as pd
import PIL.Image
import torch
import torch.utils
from datasets import concatenate_datasets, load_dataset
from datasets import Dataset
from huggingface_hub import HfApi, snapshot_download
from huggingface_hub.constants import REPOCARD_NAME
from huggingface_hub.errors import RevisionNotFoundError
@@ -34,36 +36,41 @@ from lerobot.common.constants import HF_LEROBOT_HOME
from lerobot.common.datasets.compute_stats import aggregate_stats, compute_episode_stats
from lerobot.common.datasets.image_writer import AsyncImageWriter, write_image
from lerobot.common.datasets.utils import (
DEFAULT_EPISODES_PATH,
DEFAULT_FEATURES,
DEFAULT_IMAGE_PATH,
INFO_PATH,
TASKS_PATH,
append_jsonlines,
backward_compatible_episodes_stats,
check_delta_timestamps,
check_timestamps_sync,
check_version_compatibility,
concat_video_files,
create_empty_dataset_info,
create_lerobot_dataset_card,
embed_images,
flatten_dict,
get_delta_indices,
get_episode_data_index,
get_features_from_robot,
get_hf_dataset_size_in_mb,
get_hf_features_from_features,
get_parquet_file_size_in_mb,
get_parquet_num_frames,
get_safe_version,
get_video_duration_in_s,
get_video_size_in_mb,
hf_transform_to_torch,
is_valid_version,
load_episodes,
load_episodes_stats,
load_info,
load_nested_dataset,
load_stats,
load_tasks,
safe_write_dataframe_to_parquet,
update_chunk_file_indices,
validate_episode_buffer,
validate_frame,
write_episode,
write_episode_stats,
write_info,
write_json,
write_stats,
write_tasks,
)
from lerobot.common.datasets.video_utils import (
VideoFrame,
@@ -74,7 +81,7 @@ from lerobot.common.datasets.video_utils import (
)
from lerobot.common.robot_devices.robots.utils import Robot
CODEBASE_VERSION = "v2.1"
CODEBASE_VERSION = "v3.0"
class LeRobotDatasetMetadata:
@@ -98,20 +105,18 @@ class LeRobotDatasetMetadata:
self.revision = get_safe_version(self.repo_id, self.revision)
(self.root / "meta").mkdir(exist_ok=True, parents=True)
# TODO(rcadene): instead of downloading all episodes metadata files,
# download only the ones associated to the requested episodes. This would
# require adding `episodes: list[int]` as argument.
self.pull_from_repo(allow_patterns="meta/")
self.load_metadata()
def load_metadata(self):
self.info = load_info(self.root)
check_version_compatibility(self.repo_id, self._version, CODEBASE_VERSION)
self.tasks, self.task_to_task_index = load_tasks(self.root)
self.tasks = load_tasks(self.root)
self.episodes = load_episodes(self.root)
if self._version < packaging.version.parse("v2.1"):
self.stats = load_stats(self.root)
self.episodes_stats = backward_compatible_episodes_stats(self.stats, self.episodes)
else:
self.episodes_stats = load_episodes_stats(self.root)
self.stats = aggregate_stats(list(self.episodes_stats.values()))
self.stats = load_stats(self.root)
def pull_from_repo(
self,
@@ -133,17 +138,22 @@ class LeRobotDatasetMetadata:
return packaging.version.parse(self.info["codebase_version"])
def get_data_file_path(self, ep_index: int) -> Path:
ep_chunk = self.get_episode_chunk(ep_index)
fpath = self.data_path.format(episode_chunk=ep_chunk, episode_index=ep_index)
ep = self.episodes[ep_index]
chunk_idx = ep["data/chunk_index"]
file_idx = ep["data/file_index"]
fpath = self.data_path.format(chunk_index=chunk_idx, file_index=file_idx)
return Path(fpath)
def get_video_file_path(self, ep_index: int, vid_key: str) -> Path:
ep_chunk = self.get_episode_chunk(ep_index)
fpath = self.video_path.format(episode_chunk=ep_chunk, video_key=vid_key, episode_index=ep_index)
ep = self.episodes[ep_index]
chunk_idx = ep[f"videos/{vid_key}/chunk_index"]
file_idx = ep[f"videos/{vid_key}/file_index"]
fpath = self.video_path.format(video_key=vid_key, chunk_index=chunk_idx, file_index=file_idx)
return Path(fpath)
def get_episode_chunk(self, ep_index: int) -> int:
return ep_index // self.chunks_size
@property
def url_root(self) -> str:
return f"https://huggingface.co/datasets/{self.repo_id}/resolve/main"
@property
def data_path(self) -> str:
@@ -211,39 +221,108 @@ class LeRobotDatasetMetadata:
return self.info["total_tasks"]
@property
def total_chunks(self) -> int:
"""Total number of chunks (groups of episodes)."""
return self.info["total_chunks"]
def chunks_size(self) -> int:
"""Max number of files per chunk."""
return self.info["chunks_size"]
@property
def chunks_size(self) -> int:
"""Max number of episodes per chunk."""
return self.info["chunks_size"]
def data_files_size_in_mb(self) -> int:
"""Max size of data file in mega bytes."""
return self.info["data_files_size_in_mb"]
@property
def video_files_size_in_mb(self) -> int:
"""Max size of video file in mega bytes."""
return self.info["video_files_size_in_mb"]
def get_task_index(self, task: str) -> int | None:
"""
Given a task in natural language, returns its task_index if the task already exists in the dataset,
otherwise returns None.
"""
return self.task_to_task_index.get(task, None)
if task in self.tasks.index:
return int(self.tasks.loc[task].task_index)
else:
return None
def add_task(self, task: str):
def save_episode_tasks(self, tasks: list[str]):
if len(set(tasks)) != len(tasks):
raise ValueError(f"Tasks are not unique: {tasks}")
if self.tasks is None:
new_tasks = tasks
task_indices = range(len(tasks))
self.tasks = pd.DataFrame({"task_index": task_indices}, index=tasks)
else:
new_tasks = [task for task in tasks if task not in self.tasks.index]
new_task_indices = range(len(self.tasks), len(self.tasks) + len(new_tasks))
for task_idx, task in zip(new_task_indices, new_tasks, strict=False):
self.tasks.loc[task] = task_idx
if len(new_tasks) > 0:
# Update on disk
write_tasks(self.tasks, self.root)
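`tasks` is now a pandas DataFrame indexed by the task string, with a single `task_index` column; `get_task_index` and `save_episode_tasks` both build on that layout. A standalone sketch of the structure and lookups (illustrative task strings):

```
import pandas as pd

tasks = pd.DataFrame({"task_index": range(2)}, index=["pick the cube", "open the drawer"])


def get_task_index(tasks: pd.DataFrame, task: str) -> int | None:
    return int(tasks.loc[task].task_index) if task in tasks.index else None


assert get_task_index(tasks, "open the drawer") == 1
assert get_task_index(tasks, "unknown task") is None

# Appending new tasks mirrors save_episode_tasks: new rows get the next indices.
for new_task in ["close the drawer"]:
    if new_task not in tasks.index:
        tasks.loc[new_task] = len(tasks)
assert get_task_index(tasks, "close the drawer") == 2
```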
def _save_episode_metadata(self, episode_dict: dict) -> None:
"""Save episode metadata to a parquet file and update the Hugging Face dataset of episodes metadata.
This function processes episodes metadata from a dictionary, converts it into a Hugging Face dataset,
and saves it as a parquet file. It handles both the creation of new parquet files and the
updating of existing ones based on size constraints. After saving the metadata, it reloads
the Hugging Face dataset to ensure it is up-to-date.
Notes: We need to update both the parquet files and the HF dataset:
- `pandas` loads the parquet file into RAM
- `datasets` relies on memory mapping from pyarrow (no RAM copy). It either converts parquet files to a pyarrow cache on disk,
or loads them directly from the pyarrow cache.
"""
Given a task in natural language, add it to the dictionary of tasks.
"""
if task in self.task_to_task_index:
raise ValueError(f"The task '{task}' already exists and can't be added twice.")
# Convert buffer into HF Dataset
episode_dict = {key: [value] for key, value in episode_dict.items()}
ep_dataset = Dataset.from_dict(episode_dict)
ep_size_in_mb = get_hf_dataset_size_in_mb(ep_dataset)
df = pd.DataFrame(ep_dataset)
num_frames = episode_dict["length"][0]
task_index = self.info["total_tasks"]
self.task_to_task_index[task] = task_index
self.tasks[task_index] = task
self.info["total_tasks"] += 1
if self.episodes is None:
# Initialize indices and frame count for a new dataset made of the first episode data
chunk_idx, file_idx = 0, 0
df["meta/episodes/chunk_index"] = [chunk_idx]
df["meta/episodes/file_index"] = [file_idx]
df["dataset_from_index"] = [0]
df["dataset_to_index"] = [num_frames]
else:
# Retrieve information from the latest parquet file
latest_ep = self.episodes[-1]
chunk_idx = latest_ep["meta/episodes/chunk_index"]
file_idx = latest_ep["meta/episodes/file_index"]
task_dict = {
"task_index": task_index,
"task": task,
}
append_jsonlines(task_dict, self.root / TASKS_PATH)
latest_path = self.root / DEFAULT_EPISODES_PATH.format(chunk_index=chunk_idx, file_index=file_idx)
latest_size_in_mb = get_parquet_file_size_in_mb(latest_path)
if latest_size_in_mb + ep_size_in_mb >= self.data_files_size_in_mb:
# Size limit is reached, prepare new parquet file
chunk_idx, file_idx = update_chunk_file_indices(chunk_idx, file_idx, self.chunks_size)
# Update the existing pandas dataframe with new row
df["meta/episodes/chunk_index"] = [chunk_idx]
df["meta/episodes/file_index"] = [file_idx]
df["dataset_from_index"] = [latest_ep["dataset_to_index"]]
df["dataset_to_index"] = [latest_ep["dataset_to_index"] + num_frames]
if latest_size_in_mb + ep_size_in_mb < self.data_files_size_in_mb:
# Size limit wasn't reached, concatenate the latest dataframe with the new one
latest_df = pd.read_parquet(latest_path)
df = pd.concat([latest_df, df], ignore_index=True)
# Write the resulting dataframe from RAM to disk
path = self.root / DEFAULT_EPISODES_PATH.format(chunk_index=chunk_idx, file_index=file_idx)
path.parent.mkdir(parents=True, exist_ok=True)
df.to_parquet(path, index=False)
# Update the Hugging Face dataset by reloading it.
# This process should be fast because only the latest Parquet file has been modified.
# Therefore, only this file needs to be converted to PyArrow; the rest is loaded from the PyArrow memory-mapped cache.
self.episodes = load_episodes(self.root)
def save_episode(
self,
@@ -251,32 +330,28 @@ class LeRobotDatasetMetadata:
episode_length: int,
episode_tasks: list[str],
episode_stats: dict[str, dict],
episode_metadata: dict,
) -> None:
self.info["total_episodes"] += 1
self.info["total_frames"] += episode_length
chunk = self.get_episode_chunk(episode_index)
if chunk >= self.total_chunks:
self.info["total_chunks"] += 1
self.info["splits"] = {"train": f"0:{self.info['total_episodes']}"}
self.info["total_videos"] += len(self.video_keys)
if len(self.video_keys) > 0:
self.update_video_info()
write_info(self.info, self.root)
episode_dict = {
"episode_index": episode_index,
"tasks": episode_tasks,
"length": episode_length,
}
self.episodes[episode_index] = episode_dict
write_episode(episode_dict, self.root)
episode_dict.update(episode_metadata)
episode_dict.update(flatten_dict({"stats": episode_stats}))
self._save_episode_metadata(episode_dict)
self.episodes_stats[episode_index] = episode_stats
self.stats = aggregate_stats([self.stats, episode_stats]) if self.stats else episode_stats
write_episode_stats(episode_index, episode_stats, self.root)
# Update info
self.info["total_episodes"] += 1
self.info["total_frames"] += episode_length
self.info["total_tasks"] = len(self.tasks)
self.info["splits"] = {"train": f"0:{self.info['total_episodes']}"}
if len(self.video_keys) > 0:
self.update_video_info()
write_info(self.info, self.root)
self.stats = aggregate_stats([self.stats, episode_stats]) if self.stats is not None else episode_stats
write_stats(self.stats, self.root)
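`save_episode` flattens the nested stats dict into flat column names via `flatten_dict({"stats": episode_stats})`. The toy re-implementation below is illustrative only and assumes a `/` separator, consistent with column names like `videos/<key>/chunk_index` used elsewhere (the real helper lives in `lerobot.common.datasets.utils`):

```
def flatten_dict_sketch(d: dict, parent: str = "", sep: str = "/") -> dict:
    """Toy flattening helper for illustration only."""
    out = {}
    for k, v in d.items():
        key = f"{parent}{sep}{k}" if parent else k
        if isinstance(v, dict):
            out.update(flatten_dict_sketch(v, key, sep))
        else:
            out[key] = v
    return out


stats = {"observation.state": {"mean": [0.1, 0.2], "std": [1.0, 1.0]}}
flat = flatten_dict_sketch({"stats": stats})
assert "stats/observation.state/mean" in flat
```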
def update_video_info(self) -> None:
"""
@@ -341,8 +416,9 @@ class LeRobotDatasetMetadata:
features = {**features, **DEFAULT_FEATURES}
obj.tasks, obj.task_to_task_index = {}, {}
obj.episodes_stats, obj.stats, obj.episodes = {}, {}, {}
obj.tasks = None
obj.episodes = None
obj.stats = None
obj.info = create_empty_dataset_info(CODEBASE_VERSION, fps, robot_type, features, use_videos)
if len(obj.video_keys) > 0 and not use_videos:
raise ValueError()
@@ -487,29 +563,17 @@ class LeRobotDataset(torch.utils.data.Dataset):
self.meta = LeRobotDatasetMetadata(
self.repo_id, self.root, self.revision, force_cache_sync=force_cache_sync
)
if self.episodes is not None and self.meta._version >= packaging.version.parse("v2.1"):
episodes_stats = [self.meta.episodes_stats[ep_idx] for ep_idx in self.episodes]
self.stats = aggregate_stats(episodes_stats)
# Load actual data
try:
if force_cache_sync:
raise FileNotFoundError
assert all((self.root / fpath).is_file() for fpath in self.get_episodes_file_paths())
self.hf_dataset = self.load_hf_dataset()
except (AssertionError, FileNotFoundError, NotADirectoryError):
self.revision = get_safe_version(self.repo_id, self.revision)
self.download_episodes(download_videos)
self.download(download_videos)
self.hf_dataset = self.load_hf_dataset()
self.episode_data_index = get_episode_data_index(self.meta.episodes, self.episodes)
# Check timestamps
timestamps = torch.stack(self.hf_dataset["timestamp"]).numpy()
episode_indices = torch.stack(self.hf_dataset["episode_index"]).numpy()
ep_data_index_np = {k: t.numpy() for k, t in self.episode_data_index.items()}
check_timestamps_sync(timestamps, episode_indices, ep_data_index_np, self.fps, self.tolerance_s)
# Setup delta_indices
if self.delta_timestamps is not None:
check_delta_timestamps(self.delta_timestamps, self.fps, self.tolerance_s)
@@ -585,7 +649,7 @@ class LeRobotDataset(torch.utils.data.Dataset):
ignore_patterns=ignore_patterns,
)
def download_episodes(self, download_videos: bool = True) -> None:
def download(self, download_videos: bool = True) -> None:
"""Downloads the dataset from the given 'repo_id' at the provided version. If 'episodes' is given, this
will only download those episodes (selected by their episode_index). If 'episodes' is None, the whole
dataset will be downloaded. Thanks to the behavior of snapshot_download, if the files are already present
@@ -593,11 +657,10 @@ class LeRobotDataset(torch.utils.data.Dataset):
"""
# TODO(rcadene, aliberts): implement faster transfer
# https://huggingface.co/docs/huggingface_hub/en/guides/download#faster-downloads
files = None
ignore_patterns = None if download_videos else "videos/"
files = None
if self.episodes is not None:
files = self.get_episodes_file_paths()
self.pull_from_repo(allow_patterns=files, ignore_patterns=ignore_patterns)
def get_episodes_file_paths(self) -> list[Path]:
@@ -610,19 +673,13 @@ class LeRobotDataset(torch.utils.data.Dataset):
for ep_idx in episodes
]
fpaths += video_files
# multiple episodes can be stored in the same files, so we return unique paths only
fpaths = list(set(fpaths))
return fpaths
def load_hf_dataset(self) -> datasets.Dataset:
"""hf_dataset contains all the observations, states, actions, rewards, etc."""
if self.episodes is None:
path = str(self.root / "data")
hf_dataset = load_dataset("parquet", data_dir=path, split="train")
else:
files = [str(self.root / self.meta.get_data_file_path(ep_idx)) for ep_idx in self.episodes]
hf_dataset = load_dataset("parquet", data_files=files, split="train")
# TODO(aliberts): hf_dataset.set_format("torch")
hf_dataset = load_nested_dataset(self.root / "data")
hf_dataset.set_transform(hf_transform_to_torch)
return hf_dataset
@@ -630,8 +687,6 @@ class LeRobotDataset(torch.utils.data.Dataset):
features = get_hf_features_from_features(self.features)
ft_dict = {col: [] for col in features}
hf_dataset = datasets.Dataset.from_dict(ft_dict, features=features, split="train")
# TODO(aliberts): hf_dataset.set_format("torch")
hf_dataset.set_transform(hf_transform_to_torch)
return hf_dataset
@@ -663,15 +718,16 @@ class LeRobotDataset(torch.utils.data.Dataset):
return get_hf_features_from_features(self.features)
def _get_query_indices(self, idx: int, ep_idx: int) -> tuple[dict[str, list[int | bool]]]:
ep_start = self.episode_data_index["from"][ep_idx]
ep_end = self.episode_data_index["to"][ep_idx]
ep = self.meta.episodes[ep_idx]
ep_start = ep["dataset_from_index"]
ep_end = ep["dataset_to_index"]
query_indices = {
key: [max(ep_start.item(), min(ep_end.item() - 1, idx + delta)) for delta in delta_idx]
key: [max(ep_start, min(ep_end - 1, idx + delta)) for delta in delta_idx]
for key, delta_idx in self.delta_indices.items()
}
padding = { # Pad values outside of current episode range
f"{key}_is_pad": torch.BoolTensor(
[(idx + delta < ep_start.item()) | (idx + delta >= ep_end.item()) for delta in delta_idx]
[(idx + delta < ep_start) | (idx + delta >= ep_end) for delta in delta_idx]
)
for key, delta_idx in self.delta_indices.items()
}
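With per-episode `dataset_from_index` / `dataset_to_index`, the deltas above are clamped to the episode range and out-of-range offsets are flagged as padding. A small worked example of that clamping (standalone numbers, not taken from a real dataset):

```
import torch

ep_start, ep_end = 100, 150          # dataset_from_index / dataset_to_index of the episode
idx = 102                            # current frame's global index
delta_idx = [-5, -1, 0, 1]           # e.g. delta_timestamps converted to frames at the dataset fps

query_indices = [max(ep_start, min(ep_end - 1, idx + d)) for d in delta_idx]
is_pad = torch.BoolTensor([(idx + d < ep_start) | (idx + d >= ep_end) for d in delta_idx])

assert query_indices == [100, 101, 102, 103]
assert is_pad.tolist() == [True, False, False, False]
```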
@@ -685,7 +741,7 @@ class LeRobotDataset(torch.utils.data.Dataset):
query_timestamps = {}
for key in self.meta.video_keys:
if query_indices is not None and key in query_indices:
timestamps = self.hf_dataset.select(query_indices[key])["timestamp"]
timestamps = self.hf_dataset[query_indices[key]]["timestamp"]
query_timestamps[key] = torch.stack(timestamps).tolist()
else:
query_timestamps[key] = [current_ts]
@@ -694,7 +750,7 @@ class LeRobotDataset(torch.utils.data.Dataset):
def _query_hf_dataset(self, query_indices: dict[str, list[int]]) -> dict:
return {
key: torch.stack(self.hf_dataset.select(q_idx)[key])
key: torch.stack(self.hf_dataset[q_idx][key])
for key, q_idx in query_indices.items()
if key not in self.meta.video_keys
}
@@ -705,10 +761,17 @@ class LeRobotDataset(torch.utils.data.Dataset):
Segmentation Fault. This probably happens because a memory reference to the video loader is created in
the main process and a subprocess fails to access it.
"""
ep = self.meta.episodes[ep_idx]
item = {}
for vid_key, query_ts in query_timestamps.items():
# Episodes are stored sequentially in a single mp4 to reduce the number of files.
# Thus we load the start timestamp of the episode in this mp4 and
# shift the query timestamps accordingly.
from_timestamp = ep[f"videos/{vid_key}/from_timestamp"]
shifted_query_ts = [from_timestamp + ts for ts in query_ts]
video_path = self.root / self.meta.get_video_file_path(ep_idx, vid_key)
frames = decode_video_frames(video_path, query_ts, self.tolerance_s, self.video_backend)
frames = decode_video_frames(video_path, shifted_query_ts, self.tolerance_s, self.video_backend)
item[vid_key] = frames.squeeze(0)
return item
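Because several episodes now share one mp4, an episode-relative timestamp must be shifted by the episode's `from_timestamp` before decoding, as the loop above does. A small numeric illustration (made-up durations):

```
# Two episodes concatenated into the same video file for one camera:
# episode 0 covers [0.0, 10.0) s, episode 1 covers [10.0, 18.0) s of the mp4.
ep1_from_timestamp = 10.0

query_ts = [0.0, 0.5, 1.0]  # timestamps relative to the start of episode 1
shifted_query_ts = [ep1_from_timestamp + ts for ts in query_ts]
assert shifted_query_ts == [10.0, 10.5, 11.0]  # positions actually decoded from the shared mp4
```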
@@ -746,8 +809,7 @@ class LeRobotDataset(torch.utils.data.Dataset):
# Add task as a string
task_idx = item["task_index"].item()
item["task"] = self.meta.tasks[task_idx]
item["task"] = self.meta.tasks.iloc[task_idx].name
return item
def __repr__(self):
@@ -777,6 +839,9 @@ class LeRobotDataset(torch.utils.data.Dataset):
)
return self.root / fpath
def _get_image_file_dir(self, episode_index: int, image_key: str) -> Path:
return self._get_image_file_path(episode_index, image_key, frame_index=0).parent
def _save_image(self, image: torch.Tensor | np.ndarray | PIL.Image.Image, fpath: Path) -> None:
if self.image_writer is None:
if isinstance(image, torch.Tensor):
@@ -855,11 +920,8 @@ class LeRobotDataset(torch.utils.data.Dataset):
episode_buffer["index"] = np.arange(self.meta.total_frames, self.meta.total_frames + episode_length)
episode_buffer["episode_index"] = np.full((episode_length,), episode_index)
# Add new tasks to the tasks dictionary
for task in episode_tasks:
task_index = self.meta.get_task_index(task)
if task_index is None:
self.meta.add_task(task)
# Update tasks and task indices with new tasks if any
self.meta.save_episode_tasks(episode_tasks)
# Given tasks in natural language, find their corresponding task indices
episode_buffer["task_index"] = np.array([self.meta.get_task_index(task) for task in tasks])
@@ -871,51 +933,151 @@ class LeRobotDataset(torch.utils.data.Dataset):
continue
episode_buffer[key] = np.stack(episode_buffer[key])
# Wait for image writer to end, so that episode stats over images can be computed
self._wait_image_writer()
self._save_episode_table(episode_buffer, episode_index)
ep_stats = compute_episode_stats(episode_buffer, self.features)
if len(self.meta.video_keys) > 0:
video_paths = self.encode_episode_videos(episode_index)
for key in self.meta.video_keys:
episode_buffer[key] = video_paths[key]
ep_metadata = self._save_episode_data(episode_buffer)
for video_key in self.meta.video_keys:
ep_metadata.update(self._save_episode_video(video_key, episode_index))
# `meta.save_episode` be executed after encoding the videos
self.meta.save_episode(episode_index, episode_length, episode_tasks, ep_stats)
# `meta.save_episode` needs to be executed after encoding the videos
self.meta.save_episode(episode_index, episode_length, episode_tasks, ep_stats, ep_metadata)
ep_data_index = get_episode_data_index(self.meta.episodes, [episode_index])
ep_data_index_np = {k: t.numpy() for k, t in ep_data_index.items()}
check_timestamps_sync(
episode_buffer["timestamp"],
episode_buffer["episode_index"],
ep_data_index_np,
self.fps,
self.tolerance_s,
)
video_files = list(self.root.rglob("*.mp4"))
assert len(video_files) == self.num_episodes * len(self.meta.video_keys)
parquet_files = list(self.root.rglob("*.parquet"))
assert len(parquet_files) == self.num_episodes
# TODO(rcadene): remove? there is only one episode in the episode buffer, no need for ep_data_index
# ep_data_index = get_episode_data_index(self.meta.episodes, [episode_index])
# ep_data_index_np = {k: t.numpy() for k, t in ep_data_index.items()}
# check_timestamps_sync(
# episode_buffer["timestamp"],
# episode_buffer["episode_index"],
# ep_data_index_np,
# self.fps,
# self.tolerance_s,
# )
# TODO(rcadene): images are also deleted in clear_episode_buffer
# delete images
img_dir = self.root / "images"
if img_dir.is_dir():
shutil.rmtree(self.root / "images")
if not episode_data: # Reset the buffer
if not episode_data:
# Reset episode buffer
self.episode_buffer = self.create_episode_buffer()
def _save_episode_table(self, episode_buffer: dict, episode_index: int) -> None:
episode_dict = {key: episode_buffer[key] for key in self.hf_features}
ep_dataset = datasets.Dataset.from_dict(episode_dict, features=self.hf_features, split="train")
def _save_episode_data(self, episode_buffer: dict) -> dict:
"""Save episode data to a parquet file and update the Hugging Face dataset of frames data.
This function processes episodes data from a buffer, converts it into a Hugging Face dataset,
and saves it as a parquet file. It handles both the creation of new parquet files and the
updating of existing ones based on size constraints. After saving the data, it reloads
the Hugging Face dataset to ensure it is up-to-date.
Notes: We need to update both the parquet files and the HF dataset:
- `pandas` loads the parquet file into RAM
- `datasets` relies on memory mapping from pyarrow (no RAM copy). It either converts parquet files to a pyarrow cache on disk,
or loads them directly from the pyarrow cache.
"""
# Convert buffer into HF Dataset
ep_dict = {key: episode_buffer[key] for key in self.hf_features}
ep_dataset = datasets.Dataset.from_dict(ep_dict, features=self.hf_features, split="train")
ep_dataset = embed_images(ep_dataset)
self.hf_dataset = concatenate_datasets([self.hf_dataset, ep_dataset])
self.hf_dataset.set_transform(hf_transform_to_torch)
ep_data_path = self.root / self.meta.get_data_file_path(ep_index=episode_index)
ep_data_path.parent.mkdir(parents=True, exist_ok=True)
ep_dataset.to_parquet(ep_data_path)
ep_size_in_mb = get_hf_dataset_size_in_mb(ep_dataset)
ep_num_frames = len(ep_dataset)
df = pd.DataFrame(ep_dataset)
if self.meta.episodes is None:
# Initialize indices and frame count for a new dataset made of the first episode data
chunk_idx, file_idx = 0, 0
latest_num_frames = 0
else:
# Retrieve information from the latest parquet file
latest_ep = self.meta.episodes[-1]
chunk_idx = latest_ep["data/chunk_index"]
file_idx = latest_ep["data/file_index"]
latest_path = self.root / self.meta.data_path.format(chunk_index=chunk_idx, file_index=file_idx)
latest_size_in_mb = get_parquet_file_size_in_mb(latest_path)
latest_num_frames = get_parquet_num_frames(latest_path)
# Determine if a new parquet file is needed
if latest_size_in_mb + ep_size_in_mb >= self.meta.data_files_size_in_mb:
# Size limit is reached, prepare new parquet file
chunk_idx, file_idx = update_chunk_file_indices(chunk_idx, file_idx, self.meta.chunks_size)
latest_num_frames = 0
else:
# Update the existing parquet file with new rows
latest_df = pd.read_parquet(latest_path)
df = pd.concat([latest_df, df], ignore_index=True)
# Write the resulting dataframe from RAM to disk
path = self.root / self.meta.data_path.format(chunk_index=chunk_idx, file_index=file_idx)
path.parent.mkdir(parents=True, exist_ok=True)
safe_write_dataframe_to_parquet(df, path, self.meta.image_keys)
# Update the Hugging Face dataset by reloading it.
# This process should be fast because only the latest Parquet file has been modified.
# Therefore, only this file needs to be converted to PyArrow; the rest is loaded from the PyArrow memory-mapped cache.
self.hf_dataset = self.load_hf_dataset()
metadata = {
"data/chunk_index": chunk_idx,
"data/file_index": file_idx,
"dataset_from_index": latest_num_frames,
"dataset_to_index": latest_num_frames + ep_num_frames,
}
return metadata
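The returned `dataset_from_index` / `dataset_to_index` are cumulative frame offsets into the whole dataset, which is what the sampler and `_get_query_indices` consume downstream. A quick sketch of how they accumulate over consecutive episodes (made-up lengths):

```
from itertools import accumulate

episode_lengths = [230, 180, 310]
to_indices = list(accumulate(episode_lengths))   # [230, 410, 720]
from_indices = [0] + to_indices[:-1]             # [0, 230, 410]

# Episode 1 therefore owns global frames [230, 410).
assert (from_indices[1], to_indices[1]) == (230, 410)
```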
def _save_episode_video(self, video_key: str, episode_index: int):
# Encode episode frames into a temporary video
ep_path = self._encode_temporary_episode_video(video_key, episode_index)
ep_size_in_mb = get_video_size_in_mb(ep_path)
ep_duration_in_s = get_video_duration_in_s(ep_path)
if self.meta.episodes is None:
# Initialize indices for a new dataset made of the first episode data
chunk_idx, file_idx = 0, 0
latest_duration_in_s = 0
new_path = self.root / self.meta.video_path.format(
video_key=video_key, chunk_index=chunk_idx, file_index=file_idx
)
new_path.parent.mkdir(parents=True, exist_ok=True)
shutil.move(str(ep_path), str(new_path))
else:
# Retrieve information from the latest video file
latest_ep = self.meta.episodes[-1]
chunk_idx = latest_ep[f"videos/{video_key}/chunk_index"]
file_idx = latest_ep[f"videos/{video_key}/file_index"]
latest_path = self.root / self.meta.video_path.format(
video_key=video_key, chunk_index=chunk_idx, file_index=file_idx
)
latest_size_in_mb = get_video_size_in_mb(latest_path)
latest_duration_in_s = get_video_duration_in_s(latest_path)
if latest_size_in_mb + ep_size_in_mb >= self.meta.video_files_size_in_mb:
# Move temporary episode video to a new video file in the dataset
chunk_idx, file_idx = update_chunk_file_indices(chunk_idx, file_idx, self.meta.chunks_size)
new_path = self.root / self.meta.video_path.format(
video_key=video_key, chunk_index=chunk_idx, file_index=file_idx
)
new_path.parent.mkdir(parents=True, exist_ok=True)
shutil.move(str(ep_path), str(new_path))
else:
# Update latest video file
concat_video_files([latest_path, ep_path], self.root, video_key, chunk_idx, file_idx)
# Remove temporary directory
shutil.rmtree(str(ep_path.parent))
metadata = {
"episode_index": episode_index,
f"videos/{video_key}/chunk_index": chunk_idx,
f"videos/{video_key}/file_index": file_idx,
f"videos/{video_key}/from_timestamp": latest_duration_in_s,
f"videos/{video_key}/to_timestamp": latest_duration_in_s + ep_duration_in_s,
}
return metadata
def clear_episode_buffer(self) -> None:
episode_index = self.episode_buffer["episode_index"]
@@ -955,34 +1117,16 @@ class LeRobotDataset(torch.utils.data.Dataset):
if self.image_writer is not None:
self.image_writer.wait_until_done()
def encode_videos(self) -> None:
def _encode_temporary_episode_video(self, video_key: str, episode_index: int) -> dict:
"""
Use ffmpeg to convert frames stored as png into mp4 videos.
Note: `encode_video_frames` is a blocking call. Making it asynchronous shouldn't speedup encoding,
since video encoding with ffmpeg is already using multithreading.
"""
for ep_idx in range(self.meta.total_episodes):
self.encode_episode_videos(ep_idx)
def encode_episode_videos(self, episode_index: int) -> dict:
"""
Use ffmpeg to convert frames stored as png into mp4 videos.
Note: `encode_video_frames` is a blocking call. Making it asynchronous shouldn't speedup encoding,
since video encoding with ffmpeg is already using multithreading.
"""
video_paths = {}
for key in self.meta.video_keys:
video_path = self.root / self.meta.get_video_file_path(episode_index, key)
video_paths[key] = str(video_path)
if video_path.is_file():
# Skip if video is already encoded. Could be the case when resuming data recording.
continue
img_dir = self._get_image_file_path(
episode_index=episode_index, image_key=key, frame_index=0
).parent
encode_video_frames(img_dir, video_path, self.fps, overwrite=True)
return video_paths
temp_path = Path(tempfile.mkdtemp(dir=self.root)) / f"{video_key}_{episode_index:03d}.mp4"
img_dir = self._get_image_file_dir(episode_index, video_key)
encode_video_frames(img_dir, temp_path, self.fps, overwrite=True)
return temp_path
@classmethod
def create(
@@ -1027,7 +1171,6 @@ class LeRobotDataset(torch.utils.data.Dataset):
obj.image_transforms = None
obj.delta_timestamps = None
obj.delta_indices = None
obj.episode_data_index = None
obj.video_backend = video_backend if video_backend is not None else get_safe_default_codec()
return obj

View File

@@ -337,13 +337,11 @@ def compute_sampler_weights(
if len(offline_dataset) > 0:
offline_data_mask_indices = []
for start_index, end_index in zip(
offline_dataset.episode_data_index["from"],
offline_dataset.episode_data_index["to"],
offline_dataset.meta.episodes["dataset_from_index"],
offline_dataset.meta.episodes["dataset_to_index"],
strict=True,
):
offline_data_mask_indices.extend(
range(start_index.item(), end_index.item() - offline_drop_n_last_frames)
)
offline_data_mask_indices.extend(range(start_index, end_index - offline_drop_n_last_frames))
offline_data_mask = torch.zeros(len(offline_dataset), dtype=torch.bool)
offline_data_mask[torch.tensor(offline_data_mask_indices)] = True
weights.append(

View File

@@ -0,0 +1,312 @@
#!/usr/bin/env python
"""
Script to profile the StreamingLeRobotDataset iteration speed.
Run with: python -m lerobot.common.datasets.profile_streaming_dataset
"""
import argparse
import time
import numpy as np
from line_profiler import LineProfiler
from tqdm import tqdm
from lerobot.common.datasets.streaming_dataset import StreamingLeRobotDataset
def timing_stats(times):
return {
"mean": np.mean(times),
"std": np.std(times),
"min": np.min(times),
"max": np.max(times),
"median": np.median(times),
}
def measure_iteration_times(dataset, num_samples=100, num_runs=5, warmup_iters=2):
"""
Measure individual iteration times and compute statistics.
Args:
dataset: The dataset to iterate over.
num_samples (int): Number of samples to iterate through per run.
num_runs (int): Number of timing runs to perform.
warmup_iters (int): Number of warmup iterations before timing.
Returns:
dict: Statistics including mean, std, min, max times per sample.
"""
print(f"Measuring iteration times over {num_runs} runs of {num_samples} samples each...")
print(f"Using {warmup_iters} warmup iterations per run")
all_sample_times = []
run_times = []
for run in range(num_runs):
print(f"Run {run + 1}/{num_runs}...")
run_start = time.time()
# Warmup phase
print(f" Performing {warmup_iters} warmup iterations...")
warmup_iterator = iter(dataset)
warmup_times = []
try:
for _ in range(warmup_iters):
start_time = time.time()
next(warmup_iterator)
end_time = time.time()
elapsed_ms = (end_time - start_time) * 1000
warmup_times.append(elapsed_ms)
except StopIteration:
print(" Warning: Iterator exhausted during warmup")
# timing phase
iterator = iter(dataset)
sample_times = []
for _ in tqdm(range(num_samples)):
start_time = time.time()
next(iterator)
end_time = time.time()
elapsed_ms = (end_time - start_time) * 1000
sample_times.append(elapsed_ms)
run_end = time.time()
run_times.append(run_end - run_start)
all_sample_times.extend(sample_times)
# Compute statistics
sample_times_array = np.array(all_sample_times)
warmup_times_array = np.array(warmup_times)
run_times_array = np.array(run_times)
stats = {
"sample_times_ms": timing_stats(sample_times_array),
"warmup_times_ms": timing_stats(warmup_times_array),
"run_times_s": timing_stats(run_times_array),
"samples_per_second": num_samples / np.mean(run_times_array),
}
return stats
def print_timing_stats(stats):
"""Print timing statistics in a readable format."""
print("\n" + "=" * 60)
print("TIMING STATISTICS")
print("=" * 60)
warmup_stats = stats["warmup_times_ms"]
print("Warmup timing (ms):")
print(f" Mean: {warmup_stats['mean']:.2f} ± {warmup_stats['std']:.2f}")
print(f" Median: {warmup_stats['median']:.2f}")
print(f" Range: [{warmup_stats['min']:.2f}, {warmup_stats['max']:.2f}]")
sample_stats = stats["sample_times_ms"]
print("\nPer-sample timing (ms):")
print(f" Mean: {sample_stats['mean']:.2f} ± {sample_stats['std']:.2f}")
print(f" Median: {sample_stats['median']:.2f}")
print(f" Range: [{sample_stats['min']:.2f}, {sample_stats['max']:.2f}]")
run_stats = stats["run_times_s"]
print("\nPer-run timing (seconds):")
print(f" Mean: {run_stats['mean']:.2f} ± {run_stats['std']:.2f}")
print(f" Range: [{run_stats['min']:.2f}, {run_stats['max']:.2f}]")
print("\nThroughput:")
print(f" Samples/second: {stats['samples_per_second']:.2f}")
print("=" * 60)
def _time_iterations(dataset, num_samples, num_runs, warmup_iters, stats_file_path):
# Measure iteration times with statistics
timing_stats = measure_iteration_times(dataset, num_samples, num_runs, warmup_iters)
print_timing_stats(timing_stats)
# Save results to a file
with open(stats_file_path, "w") as f:
f.write("TIMING STATISTICS\n")
f.write("=" * 60 + "\n")
warmup_stats = timing_stats["warmup_times_ms"]
f.write("Warmup timing (ms):\n")
f.write(f" Mean: {warmup_stats['mean']:.2f} ± {warmup_stats['std']:.2f}\n")
f.write(f" Median: {warmup_stats['median']:.2f}\n")
f.write(f" Range: [{warmup_stats['min']:.2f}, {warmup_stats['max']:.2f}]\n\n")
sample_stats = timing_stats["sample_times_ms"]
f.write("Per-sample timing (ms):\n")
f.write(f" Mean: {sample_stats['mean']:.2f} ± {sample_stats['std']:.2f}\n")
f.write(f" Median: {sample_stats['median']:.2f}\n")
f.write(f" Range: [{sample_stats['min']:.2f}, {sample_stats['max']:.2f}]\n\n")
run_stats = timing_stats["run_times_s"]
f.write("Per-run timing (seconds):\n")
f.write(f" Mean: {run_stats['mean']:.2f} ± {run_stats['std']:.2f}\n")
f.write(f" Range: [{run_stats['min']:.2f}, {run_stats['max']:.2f}]\n\n")
throughput_stats = timing_stats["samples_per_second"]
f.write("Throughput:\n")
f.write(f" Samples/second: {throughput_stats:.2f}\n")
f.write("=" * 60 + "\n\n")
f.write("DETAILED LINE PROFILING RESULTS\n")
f.write("=" * 60 + "\n")
print(f"\nDetailed profiling results saved to {stats_file_path}")
def _profile_iteration(dataset, num_samples, stats_file_path):
# Create a line profiler instance for detailed profiling
profiler = LineProfiler()
# Add functions to profile
profiler.add_function(dataset.__iter__)
profiler.add_function(dataset.make_frame)
profiler.add_function(dataset._make_backtrackable_dataset)
profiler.add_function(dataset._get_delta_frames)
# Profile the iteration
# Define the function to profile
def iterate_dataset(ds, n):
# Iterating without warmup for line profiling
iterator = iter(ds)
start_time = time.time()
for _ in range(n):
next(iterator)
end_time = time.time()
return end_time - start_time
# Add the function to the profiler
profiler.add_function(iterate_dataset)
# Run the profiled function
profiler.runcall(iterate_dataset, dataset, num_samples)
with open(stats_file_path, "a") as f:
profiler.print_stats(stream=f)
def _analyze_randomness(dataset, num_samples, stats_file_path):
"""
Analyze the randomness of dataset iteration by checking correlation between
iteration index and frame index.
Args:
dataset: The dataset to analyze.
num_samples: Number of samples to use for analysis.
stats_file_path: Path to save the analysis results.
"""
print("\nAnalyzing randomness of dataset iteration...")
# Collect iteration index and frame index pairs
points = []
iterator = iter(dataset)
for i in tqdm(range(num_samples)):
try:
frame = next(iterator)
points.append(np.array([i, frame["frame_index"]]))
except (StopIteration, KeyError) as e:
if isinstance(e, StopIteration):
print(f" Warning: Iterator exhausted after {i} samples")
else:
print(f" Warning: frame_index not found in sample {i}")
break
# Compute correlation between iteration index and frame index
points_array = np.array(points)
correlation = np.corrcoef(points_array[:, 0], points_array[:, 1])[0, 1]
# Save results to file
with open(stats_file_path, "a") as f:
f.write("\nRANDOMNESS ANALYSIS\n")
f.write("=" * 60 + "\n")
f.write(f"Correlation between iteration index and frame index: {correlation:.4f}\n")
f.write("(Correlation close to 0 indicates more random access pattern)\n")
f.write("(Correlation close to 1 indicates sequential access pattern)\n")
f.write("=" * 60 + "\n")
print(f"Correlation between iteration index and frame index: {correlation:.4f}")
print("(Correlation close to 0 indicates more random access pattern)")
print("(Correlation close to 1 indicates sequential access pattern)")
def profile_dataset(
repo_id, num_samples=100, buffer_size=1000, max_num_shards=16, seed=42, num_runs=3, warmup_iters=10
):
"""
Profile the streaming dataset iteration speed.
Args:
repo_id (str): HuggingFace repository ID for the dataset.
num_samples (int): Number of samples to iterate through.
buffer_size (int): Buffer size for the dataset.
max_num_shards (int): Number of shards to use.
seed (int): Random seed for reproducibility.
num_runs (int): Number of timing runs to perform.
warmup_iters (int): Number of warmup iterations before timing.
"""
stats_file_path = "streaming_dataset_profile.txt"
print(f"Creating dataset from {repo_id} with buffer_size={buffer_size}, max_num_shards={max_num_shards}")
camera_key = "observation.images.cam_right_wrist"
fps = 50
delta_timestamps = {
# loads 4 images: 1 second before current frame, 500 ms before, 200 ms before, and current frame
camera_key: [-1, -0.5, -0.20, 0],
# loads 6 state vectors: 1.5 seconds before, 1 second before, ... 200 ms, 100 ms, and current frame
"observation.state": [-1.5, -1, -0.5, -0.20, -0.10, 0],
# loads 64 action vectors: current frame, 1 frame in the future, 2 frames, ... 63 frames in the future
"action": [t / fps for t in range(64)],
}
dataset = StreamingLeRobotDataset(
repo_id=repo_id,
buffer_size=buffer_size,
max_num_shards=max_num_shards,
seed=seed,
delta_timestamps=delta_timestamps,
)
_time_iterations(dataset, num_samples, num_runs, warmup_iters, stats_file_path)
_profile_iteration(dataset, num_samples, stats_file_path)
_analyze_randomness(dataset, num_samples, stats_file_path)
def main():
parser = argparse.ArgumentParser(description="Profile StreamingLeRobotDataset iteration speed")
parser.add_argument(
"--repo-id",
type=str,
default="lerobot/aloha_mobile_cabinet",
help="HuggingFace repository ID for the dataset",
)
parser.add_argument("--num-samples", type=int, default=2_000, help="Number of samples to iterate through")
parser.add_argument("--buffer-size", type=int, default=1000, help="Buffer size for the dataset")
parser.add_argument("--max-num-shards", type=int, default=1, help="Number of shards to use")
parser.add_argument("--seed", type=int, default=42, help="Random seed for reproducibility")
parser.add_argument(
"--num-runs", type=int, default=10, help="Number of timing runs to perform for statistics"
)
parser.add_argument(
"--warmup-iters", type=int, default=1, help="Number of warmup iterations before timing"
)
args = parser.parse_args()
profile_dataset(
repo_id=args.repo_id,
num_samples=args.num_samples,
buffer_size=args.buffer_size,
max_num_shards=args.max_num_shards,
seed=args.seed,
num_runs=args.num_runs,
warmup_iters=args.warmup_iters,
)
if __name__ == "__main__":
main()

View File

@@ -21,7 +21,8 @@ import torch
class EpisodeAwareSampler:
def __init__(
self,
episode_data_index: dict,
dataset_from_indices: list[int],
dataset_to_indices: list[int],
episode_indices_to_use: Union[list, None] = None,
drop_n_first_frames: int = 0,
drop_n_last_frames: int = 0,
@@ -30,7 +31,8 @@ class EpisodeAwareSampler:
"""Sampler that optionally incorporates episode boundary information.
Args:
episode_data_index: Dictionary with keys 'from' and 'to' containing the start and end indices of each episode.
dataset_from_indices: List of indices containing the start of each episode in the dataset.
dataset_to_indices: List of indices containing the end of each episode in the dataset.
episode_indices_to_use: List of episode indices to use. If None, all episodes are used.
Assumes that episodes are indexed from 0 to N-1.
drop_n_first_frames: Number of frames to drop from the start of each episode.
@@ -39,12 +41,10 @@ class EpisodeAwareSampler:
"""
indices = []
for episode_idx, (start_index, end_index) in enumerate(
zip(episode_data_index["from"], episode_data_index["to"], strict=True)
zip(dataset_from_indices, dataset_to_indices, strict=True)
):
if episode_indices_to_use is None or episode_idx in episode_indices_to_use:
indices.extend(
range(start_index.item() + drop_n_first_frames, end_index.item() - drop_n_last_frames)
)
indices.extend(range(start_index + drop_n_first_frames, end_index - drop_n_last_frames))
self.indices = indices
self.shuffle = shuffle
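With the sampler taking plain `dataset_from_index` / `dataset_to_index` lists, the boundary-aware index construction reduces to a couple of ranges. A compact worked example (made-up episode boundaries):

```
dataset_from_indices = [0, 100]      # start of each episode in the dataset
dataset_to_indices = [100, 180]      # end (exclusive) of each episode
drop_n_first_frames, drop_n_last_frames = 5, 5

indices = []
for start_index, end_index in zip(dataset_from_indices, dataset_to_indices, strict=True):
    indices.extend(range(start_index + drop_n_first_frames, end_index - drop_n_last_frames))

assert indices[:3] == [5, 6, 7]
assert indices[-1] == 174            # last usable frame of episode 1 (180 - 5 - 1)
```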

View File

@@ -0,0 +1,519 @@
import random
from pathlib import Path
from typing import Callable, Dict, Generator, Iterator, Tuple
import datasets
import numpy as np
import torch
from datasets import load_dataset
from line_profiler import profile
from lerobot.common.constants import HF_LEROBOT_HOME
from lerobot.common.datasets.lerobot_dataset import CODEBASE_VERSION, LeRobotDatasetMetadata
from lerobot.common.datasets.utils import (
Backtrackable,
LookAheadError,
LookBackError,
check_delta_timestamps,
check_version_compatibility,
get_delta_indices,
item_to_torch,
)
from lerobot.common.datasets.video_utils import (
VideoDecoderCache,
decode_video_frames_torchcodec,
get_safe_default_codec,
)
class StreamingLeRobotDataset(torch.utils.data.IterableDataset):
"""LeRobotDataset with streaming capabilities.
This class extends LeRobotDataset to add streaming functionality, allowing data to be streamed
rather than loaded entirely into memory. This is especially useful for large datasets that may
not fit in memory or when you want to quickly explore a dataset without downloading it completely.
The key mechanism is a Backtrackable iterator that maintains a bounded buffer of recent items,
allowing access to previous (and upcoming) frames for delta timestamps without loading the entire
dataset into memory.
Example:
Basic usage:
```python
from lerobot.common.datasets.streaming_dataset import StreamingLeRobotDataset
# Create a streaming dataset with delta timestamps
delta_timestamps = {
"observation.image": [-1.0, -0.5, 0.0], # 1 sec ago, 0.5 sec ago, current
"action": [0.0, 0.1, 0.2], # current, 0.1 sec future, 0.2 sec future
}
dataset = StreamingLeRobotDataset(
repo_id="your-dataset-repo-id",
delta_timestamps=delta_timestamps,
streaming=True,
buffer_size=1000,
)
# Iterate over the dataset
for i, item in enumerate(dataset):
print(f"Sample {i}: Episode {item['episode_index']} Frame {item['frame_index']}")
# item will contain stacked frames according to delta_timestamps
if i >= 10:
break
```
"""
def __init__(
self,
repo_id: str,
root: str | Path | None = None,
episodes: list[int] | None = None,
image_transforms: Callable | None = None,
delta_timestamps: dict[str, list[float]] | None = None,
tolerance_s: float = 1e-4,
revision: str | None = None,
force_cache_sync: bool = False,
video_backend: str | None = "torchcodec",
streaming: bool = True,
buffer_size: int = 1000,
max_num_shards: int = 16,
seed: int = 42,
rng: np.random.Generator | None = None,
):
"""Initialize a StreamingLeRobotDataset.
Args:
repo_id (str): Repository id used to fetch the dataset from the Hugging Face Hub.
root (Path | None, optional): Local directory to use for downloading/writing files.
episodes (list[int] | None, optional): If specified, this will only load episodes specified by
their episode_index in this list.
image_transforms (Callable | None, optional): Transform to apply to image data.
delta_timestamps (dict[str, list[float]] | None, optional): Mapping from feature key to a list of time offsets in seconds, relative to the current frame, to load alongside it.
tolerance_s (float, optional): Tolerance in seconds for timestamp matching.
revision (str, optional): Git revision id (branch name, tag, or commit hash).
force_cache_sync (bool, optional): Flag to sync and refresh local files first.
video_backend (str | None, optional): Video backend to use for decoding videos. Uses "torchcodec" by default.
streaming (bool, optional): Whether to stream the dataset or load it all. Defaults to True.
buffer_size (int, optional): Buffer size for shuffling when streaming. Defaults to 1000.
max_num_shards (int, optional): Maximum number of shards to iterate over; the dataset's native shard count is capped at this value. Defaults to 16.
seed (int, optional): Random seed for reproducibility.
rng (np.random.Generator | None, optional): Random number generator.
"""
super().__init__()
self.repo_id = repo_id
self.root = Path(root) if root else HF_LEROBOT_HOME / repo_id
self.image_transforms = image_transforms
self.episodes = episodes
self.tolerance_s = tolerance_s
self.revision = revision if revision else CODEBASE_VERSION
self.video_backend = video_backend if video_backend else get_safe_default_codec()
self.seed = seed
self.rng = rng if rng is not None else np.random.default_rng(seed)
self.streaming = streaming
self.buffer_size = buffer_size
# We cache the video decoders to avoid re-initializing them at each frame (avoiding a ~10x slowdown)
self.video_decoder_cache = VideoDecoderCache()
# Unused attributes
self.image_writer = None
self.episode_buffer = None
self.root.mkdir(exist_ok=True, parents=True)
# Load metadata
self.meta = LeRobotDatasetMetadata(
self.repo_id, self.root, self.revision, force_cache_sync=force_cache_sync
)
# Check version
check_version_compatibility(self.repo_id, self.meta._version, CODEBASE_VERSION)
if delta_timestamps is not None:
self._validate_delta_timestamp_keys(delta_timestamps) # raises ValueError if invalid
self.delta_timestamps = delta_timestamps
self.hf_dataset: datasets.IterableDataset = self.load_hf_dataset()
self.num_shards = min(self.hf_dataset.num_shards, max_num_shards)
max_backward_steps, max_forward_steps = self._get_window_steps()
self.backtrackable_dataset: Backtrackable = Backtrackable(
self.hf_dataset, history=max_backward_steps, lookahead=max_forward_steps
)
@property
def fps(self):
return self.meta.fps
@staticmethod
def _iter_random_indices(
rng: np.random.Generator, buffer_size: int, random_batch_size=1000
) -> Iterator[int]:
while True:
yield from (int(i) for i in rng.integers(0, buffer_size, size=random_batch_size))
@staticmethod
def _infinite_generator_over_elements(elements: list[int]) -> Iterator[int]:
return (random.choice(list(elements)) for _ in iter(int, 1))
def load_hf_dataset(self) -> datasets.IterableDataset:
dataset = load_dataset(self.repo_id, split="train", streaming=self.streaming)
self.streaming_from_local = False
# TODO(fracapuano): Add support for streaming from a local folder and not only from HF Hub
return dataset
def _get_window_steps(self) -> Tuple[int, int]:
"""
Returns how many steps backward (and forward) the backtrackable iterator should maintain,
based on the input delta_timestamps.
"""
max_backward_steps = 1
max_forward_steps = 1
if self.delta_timestamps is not None:
check_delta_timestamps(self.delta_timestamps, self.fps, self.tolerance_s)
self.delta_indices = get_delta_indices(self.delta_timestamps, self.fps)
# Calculate maximum backward steps needed (i.e., history size)
for delta_idx in self.delta_indices.values():
min_delta = min(delta_idx)
max_delta = max(delta_idx)
if min_delta < 0:
max_backward_steps = max(max_backward_steps, abs(min_delta))
if max_delta > 0:
max_forward_steps = max(max_forward_steps, max_delta)
return max_backward_steps, max_forward_steps
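A hedged, standalone version of this window computation for the example deltas used at the bottom of this file, assuming `get_delta_indices` converts seconds to frames by multiplying by fps and rounding:

```
fps = 50
delta_timestamps = {
    "observation.images.cam_right_wrist": [-1, -0.5, -0.20, 0],
    "observation.state": [-1.5, -1, -0.5, -0.20, -0.10, 0],
    "action": [t / fps for t in range(64)],
}

# Assumed conversion: frames = round(seconds * fps)
delta_indices = {k: [round(ts * fps) for ts in v] for k, v in delta_timestamps.items()}

max_backward_steps = max(1, max(abs(min(v)) for v in delta_indices.values()))
max_forward_steps = max(1, max(max(v) for v in delta_indices.values()))

assert (max_backward_steps, max_forward_steps) == (75, 63)  # 1.5 s of history, 63 frames of lookahead
```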
def __iter__(self) -> Iterator[Dict[str, torch.Tensor]]:
buffer_indices_generator = self._iter_random_indices(self.rng, self.buffer_size)
idx_to_backtrackable_dataset = {
idx: self._make_backtrackable_dataset(self.hf_dataset.shard(self.num_shards, index=idx))
for idx in range(self.num_shards)
}
# This buffer is populated while iterating on the dataset's shards
frames_buffer = []
try:
while available_shards := list(idx_to_backtrackable_dataset.keys()):
shard_key = next(self._infinite_generator_over_elements(available_shards))
dataset = idx_to_backtrackable_dataset[shard_key]  # selects which shard to iterate on
for frame in self.make_frame(dataset):
if len(frames_buffer) == self.buffer_size:
i = next(buffer_indices_generator)
yield frames_buffer[i]
frames_buffer[i] = frame
else:
frames_buffer.append(frame)
break  # take one frame from this shard, then sample a new shard
except (
RuntimeError,
StopIteration,
):  # NOTE: StopIteration raised inside a generator is turned into a RuntimeError since Python 3.7
del idx_to_backtrackable_dataset[shard_key]  # Remove exhausted shard, move on to another shard
# Once shards are all exhausted, shuffle the buffer and yield the remaining frames
self.rng.shuffle(frames_buffer)
yield from frames_buffer
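The loop above is a shuffle buffer: fill a fixed-size buffer, then for every incoming frame yield a random slot and replace it with the new frame, and flush whatever remains once the shards are exhausted. A minimal generic sketch of the same pattern (toy `shuffle_buffer` helper, not part of the library):

```
from typing import Iterable, Iterator, TypeVar

import numpy as np

T = TypeVar("T")


def shuffle_buffer(items: Iterable[T], buffer_size: int, seed: int = 0) -> Iterator[T]:
    rng = np.random.default_rng(seed)
    buffer: list[T] = []
    for item in items:
        if len(buffer) == buffer_size:
            i = int(rng.integers(0, buffer_size))
            yield buffer[i]
            buffer[i] = item
        else:
            buffer.append(item)
    rng.shuffle(buffer)  # flush the tail in random order
    yield from buffer


assert sorted(shuffle_buffer(range(10), buffer_size=4)) == list(range(10))
```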
def _make_backtrackable_dataset(self, dataset: datasets.IterableDataset) -> Backtrackable:
history, lookahead = self._get_window_steps()
return Backtrackable(dataset, history=history, lookahead=lookahead)
def _make_timestamps_from_indices(
self, start_ts: float, indices: dict[str, list[int]] | None = None
) -> dict[str, list[float]]:
if indices is not None:
return {
key: (start_ts + torch.tensor(indices[key]) / self.fps).tolist()
for key in self.delta_timestamps
}
else:
return dict.fromkeys(self.delta_timestamps, start_ts)
def _make_padding_camera_frame(self, camera_key: str):
"""Variable-shape padding frame for given camera keys, given in (C, H, W)"""
return torch.zeros(self.meta.info["features"][camera_key]["shape"]).permute(-1, 0, 1)
def _pad_retrieved_video_frames(
self,
video_frames: dict[str, torch.Tensor],
query_timestamps: dict[str, list[float]],
original_timestamps: dict[str, list[float]],
) -> tuple[dict[str, torch.Tensor], dict[str, torch.BoolTensor]]:
padded_video_frames = {}
padding_mask = {}
for video_key, timestamps in original_timestamps.items():
if video_key not in video_frames:
continue  # only pad video keys that have decoded frames available
frames = []
mask = []
padding_frame = self._make_padding_camera_frame(video_key)
for ts in timestamps:
if ts in query_timestamps[video_key]:
idx = query_timestamps[video_key].index(ts)
frames.append(video_frames[video_key][idx, :])
mask.append(False)
else:
frames.append(padding_frame)
mask.append(True)
padded_video_frames[video_key] = torch.stack(frames)
padding_mask[f"{video_key}.pad_masking"] = torch.BoolTensor(mask)
return padded_video_frames, padding_mask
@profile
def make_frame(self, dataset_iterator: Backtrackable) -> Generator:
"""Makes a frame starting from a dataset iterator"""
item = next(dataset_iterator)
item = item_to_torch(item)
updates = [] # list of updates to apply to the item
# Get episode index from the item
ep_idx = item["episode_index"]
# "timestamp" restarts from 0 for each episode, whereas we need a global timestep within the single .mp4 file (given by index/fps)
current_ts = item["index"] / self.fps
episode_boundaries_ts = {
key: (
self.meta.episodes[ep_idx][f"videos/{key}/from_timestamp"],
self.meta.episodes[ep_idx][f"videos/{key}/to_timestamp"],
)
for key in self.meta.video_keys
}
# Apply delta querying logic if necessary
if self.delta_indices is not None:
query_result, padding = self._get_delta_frames(dataset_iterator, item)
updates.append(query_result)
updates.append(padding)
# Load video frames, when needed
if len(self.meta.video_keys) > 0:
original_timestamps = self._make_timestamps_from_indices(current_ts, self.delta_indices)
# Some timestamps might not be available given the episode's boundaries
query_timestamps = self._get_query_timestamps(
current_ts, self.delta_indices, episode_boundaries_ts
)
video_frames = self._query_videos(query_timestamps, ep_idx)
# We always return the same number of frames. Unavailable frames are padded.
padded_video_frames, padding_mask = self._pad_retrieved_video_frames(
video_frames, query_timestamps, original_timestamps
)
updates.append(video_frames)
updates.append(padded_video_frames)
updates.append(padding_mask)
result = item.copy()
for update in updates:
result.update(update)
result["task"] = self.meta.tasks.iloc[item["task_index"]].name
yield result
def _get_query_timestamps(
self,
current_ts: float,
query_indices: dict[str, list[int]] | None = None,
episode_boundaries_ts: dict[str, tuple[float, float]] | None = None,
) -> dict[str, list[float]]:
query_timestamps = {}
keys_to_timestamps = self._make_timestamps_from_indices(current_ts, query_indices)
for key in self.meta.video_keys:
if query_indices is not None and key in query_indices:
timestamps = keys_to_timestamps[key]
# Filter out timesteps outside of episode boundaries
query_timestamps[key] = [
ts
for ts in timestamps
if episode_boundaries_ts[key][0] <= ts <= episode_boundaries_ts[key][1]
]
if len(query_timestamps[key]) == 0:
raise ValueError(f"No valid timestamps found for key {key} with {query_indices[key]}")
else:
query_timestamps[key] = [current_ts]
return query_timestamps
def _query_videos(self, query_timestamps: dict[str, list[float]], ep_idx: int) -> dict:
"""Note: When using data workers (e.g. DataLoader with num_workers>0), do not call this function
in the main process (e.g. by using a second Dataloader with num_workers=0). It will result in a
Segmentation Fault. This probably happens because a memory reference to the video loader is created in
the main process and a subprocess fails to access it.
"""
item = {}
for video_key, query_ts in query_timestamps.items():
root = self.meta.url_root if self.streaming and not self.streaming_from_local else self.root
video_path = f"{root}/{self.meta.get_video_file_path(ep_idx, video_key)}"
frames = decode_video_frames_torchcodec(
video_path, query_ts, self.tolerance_s, decoder_cache=self.video_decoder_cache
)
item[video_key] = frames
return item
def _make_padding_frame(self, key: str) -> tuple[torch.Tensor, bool]:
return torch.zeros(self.meta.info["features"][key]["shape"]), True
def _get_delta_frames(self, dataset_iterator: Backtrackable, current_item: dict):
# TODO(fracapuano): Modularize this function, refactor the code
"""Get frames with delta offsets using the backtrackable iterator.
Args:
current_item (dict): Current item from the iterator.
ep_idx (int): Episode index.
Returns:
tuple: (query_result, padding) - frames at delta offsets and padding info.
"""
current_episode_idx = current_item["episode_index"]
# Prepare results
query_result = {}
padding = {}
for key, delta_indices in self.delta_indices.items():
if key in self.meta.video_keys:
continue # visual frames are decoded separately
target_frames = []
is_pad = []
# Create a results dictionary to store frames in processing order, then reconstruct original order for stacking
delta_results = {}
# Separate and sort deltas by difficulty (easier operations first)
negative_deltas = sorted([d for d in delta_indices if d < 0], reverse=True) # [-1, -2, -3, ...]
positive_deltas = sorted([d for d in delta_indices if d > 0]) # [1, 2, 3, ...]
zero_deltas = [d for d in delta_indices if d == 0]
# Process zero deltas (current frame)
for delta in zero_deltas:
delta_results[delta] = (current_item[key], False)  # the current frame is never padded
# Process negative deltas in order of increasing difficulty
lookback_failed = False
for delta in negative_deltas:
if lookback_failed:
delta_results[delta] = self._make_padding_frame(key)
continue
try:
steps_back = abs(delta)
if dataset_iterator.can_peek_back(steps_back):
past_item = dataset_iterator.peek_back(steps_back)
past_item = item_to_torch(past_item)
if past_item["episode_index"] == current_episode_idx:
delta_results[delta] = (past_item[key], False)
else:
raise LookBackError("Retrieved frame is from different episode!")
else:
raise LookBackError("Cannot go back further than the history buffer!")
except LookBackError:
delta_results[delta] = self._make_padding_frame(key)
lookback_failed = True # All subsequent negative deltas will also fail
# Process positive deltas in order of increasing difficulty
lookahead_failed = False
for delta in positive_deltas:
if lookahead_failed:
delta_results[delta] = self._make_padding_frame(key)
continue
try:
if dataset_iterator.can_peek_ahead(delta):
future_item = dataset_iterator.peek_ahead(delta)
future_item = item_to_torch(future_item)
if future_item["episode_index"] == current_episode_idx:
delta_results[delta] = (future_item[key], False)
else:
raise LookAheadError("Retrieved frame is from different episode!")
else:
raise LookAheadError("Cannot go ahead further than the lookahead buffer!")
except LookAheadError:
delta_results[delta] = self._make_padding_frame(key)
lookahead_failed = True # All subsequent positive deltas will also fail
# Reconstruct original order for stacking
for delta in delta_indices:
frame, is_padded = delta_results[delta]
target_frames.append(frame)  # torch.stack below adds the stacking dimension
is_pad.append(is_padded)
# Stack frames and add to results
if target_frames:
query_result[key] = torch.stack(target_frames)
padding[f"{key}.pad_masking"] = torch.BoolTensor(is_pad)
return query_result, padding
def _validate_delta_timestamp_keys(self, delta_timestamps: dict[str, list[float]] | None) -> None:
"""
Validate that all keys in delta_timestamps correspond to actual features in the dataset.
Raises:
ValueError: If any delta timestamp key doesn't correspond to a dataset feature.
"""
if delta_timestamps is None:
return
# Get all available feature keys from the dataset metadata
available_features = set(self.meta.features.keys())
# Get all keys from delta_timestamps
delta_keys = set(delta_timestamps.keys())
# Find any keys that don't correspond to features
invalid_keys = delta_keys - available_features
if invalid_keys:
raise ValueError(
f"The following delta_timestamp keys do not correspond to dataset features: {invalid_keys}. "
f"Available features are: {sorted(available_features)}"
)
# Example usage
if __name__ == "__main__":
from tqdm import tqdm
repo_id = "lerobot/aloha_mobile_cabinet"
camera_key = "observation.images.cam_right_wrist"
fps = 50
delta_timestamps = {
# loads 4 images: 1 second before current frame, 500 ms before, 200 ms before, and current frame
camera_key: [-1, -0.5, -0.20, 0],
# loads 6 state vectors: 1.5 seconds before, 1 second before, ... 200 ms, 100 ms, and current frame
"observation.state": [-1.5, -1, -0.5, -0.20, -0.10, 0],
# loads 64 action vectors: current frame, 1 frame in the future, 2 frames, ... 63 frames in the future
"action": [t / fps for t in range(64)],
}
dataset = StreamingLeRobotDataset(repo_id, delta_timestamps=delta_timestamps)
for _i, _frame in tqdm(enumerate(dataset)):
pass
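# Illustrative note (shapes are an assumption, not read from the dataset): with the
# delta_timestamps above, each streamed item is expected to carry stacked tensors for
# the non-visual keys plus the boolean padding masks produced by _get_delta_frames, e.g.
#     _frame["observation.state"]              # shape [6, state_dim]
#     _frame["observation.state.pad_masking"]  # BoolTensor of shape [6]
#     _frame["action"]                         # shape [64, action_dim]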

View File

@@ -17,18 +17,24 @@ import contextlib
import importlib.resources
import json
import logging
from collections.abc import Iterator
from itertools import accumulate
import shutil
import subprocess
import tempfile
from collections import deque
from collections.abc import Iterable, Iterator
from pathlib import Path
from pprint import pformat
from types import SimpleNamespace
from typing import Any
from typing import Any, Deque, Generic, List, TypeVar
import datasets
import jsonlines
import numpy as np
import packaging.version
import pandas
import pandas as pd
import pyarrow.parquet as pq
import torch
from datasets import Dataset, concatenate_datasets
from datasets.table import embed_table_storage
from huggingface_hub import DatasetCard, DatasetCardData, HfApi
from huggingface_hub.errors import RevisionNotFoundError
@@ -42,19 +48,25 @@ from lerobot.common.datasets.backward_compatibility import (
)
from lerobot.common.robot_devices.robots.utils import Robot
from lerobot.common.utils.utils import is_valid_numpy_dtype_string
from lerobot.configs.types import DictLike, FeatureType, PolicyFeature
from lerobot.configs.types import FeatureType, PolicyFeature
DEFAULT_CHUNK_SIZE = 1000 # Max number of episodes per chunk
DEFAULT_CHUNK_SIZE = 1000 # Max number of files per chunk
DEFAULT_DATA_FILE_SIZE_IN_MB = 100 # Max size per file
DEFAULT_VIDEO_FILE_SIZE_IN_MB = 500 # Max size per file
INFO_PATH = "meta/info.json"
EPISODES_PATH = "meta/episodes.jsonl"
STATS_PATH = "meta/stats.json"
EPISODES_STATS_PATH = "meta/episodes_stats.jsonl"
TASKS_PATH = "meta/tasks.jsonl"
DEFAULT_VIDEO_PATH = "videos/chunk-{episode_chunk:03d}/{video_key}/episode_{episode_index:06d}.mp4"
DEFAULT_PARQUET_PATH = "data/chunk-{episode_chunk:03d}/episode_{episode_index:06d}.parquet"
DEFAULT_IMAGE_PATH = "images/{image_key}/episode_{episode_index:06d}/frame_{frame_index:06d}.png"
EPISODES_DIR = "meta/episodes"
DATA_DIR = "data"
VIDEO_DIR = "videos"
CHUNK_FILE_PATTERN = "chunk-{chunk_index:03d}/file-{file_index:03d}"
DEFAULT_TASKS_PATH = "meta/tasks.parquet"
DEFAULT_EPISODES_PATH = EPISODES_DIR + "/" + CHUNK_FILE_PATTERN + ".parquet"
DEFAULT_DATA_PATH = DATA_DIR + "/" + CHUNK_FILE_PATTERN + ".parquet"
DEFAULT_VIDEO_PATH = VIDEO_DIR + "/{video_key}/" + CHUNK_FILE_PATTERN + ".mp4"
DEFAULT_IMAGE_PATH = "images/{image_key}/episode-{episode_index:06d}/frame-{frame_index:06d}.png"
DATASET_CARD_TEMPLATE = """
---
@@ -74,6 +86,117 @@ DEFAULT_FEATURES = {
"task_index": {"dtype": "int64", "shape": (1,), "names": None},
}
T = TypeVar("T")
def get_parquet_file_size_in_mb(parquet_path):
metadata = pq.read_metadata(parquet_path)
total_uncompressed_size = 0
for row_group in range(metadata.num_row_groups):
rg_metadata = metadata.row_group(row_group)
for column in range(rg_metadata.num_columns):
col_metadata = rg_metadata.column(column)
total_uncompressed_size += col_metadata.total_uncompressed_size
return total_uncompressed_size / (1024**2)
def get_hf_dataset_size_in_mb(hf_ds: Dataset) -> float:
return hf_ds.data.nbytes / (1024**2)
def get_pd_dataframe_size_in_mb(df: pandas.DataFrame) -> float:
# TODO(rcadene): unused?
memory_usage_bytes = df.memory_usage(deep=True).sum()
return memory_usage_bytes / (1024**2)
def update_chunk_file_indices(chunk_idx: int, file_idx: int, chunks_size: int):
if file_idx == chunks_size - 1:
file_idx = 0
chunk_idx += 1
else:
file_idx += 1
return chunk_idx, file_idx
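# Example (illustrative): with chunks_size=1000, the file index rolls over into a new
# chunk once the last slot of the current chunk has been used:
#     update_chunk_file_indices(chunk_idx=0, file_idx=5, chunks_size=1000)    # -> (0, 6)
#     update_chunk_file_indices(chunk_idx=0, file_idx=999, chunks_size=1000)  # -> (1, 0)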
def load_nested_dataset(pq_dir: Path) -> Dataset:
"""Find parquet files in provided directory {pq_dir}/chunk-xxx/file-xxx.parquet
Convert parquet files to pyarrow memory mapped in a cache folder for efficient RAM usage
Concatenate all pyarrow references to return HF Dataset format
"""
paths = sorted(pq_dir.glob("*/*.parquet"))
if len(paths) == 0:
raise FileNotFoundError(f"Provided directory does not contain any parquet file: {pq_dir}")
# TODO(rcadene): set num_proc to accelerate conversion to pyarrow
datasets = [Dataset.from_parquet(str(path)) for path in paths]
return concatenate_datasets(datasets)
def get_parquet_num_frames(parquet_path):
metadata = pq.read_metadata(parquet_path)
return metadata.num_rows
def get_video_size_in_mb(mp4_path: Path):
file_size_bytes = mp4_path.stat().st_size
file_size_mb = file_size_bytes / (1024**2)
return file_size_mb
def concat_video_files(paths_to_cat: list[Path], root: Path, video_key: str, chunk_idx: int, file_idx: int):
# TODO(rcadene): move to video_utils.py
# TODO(rcadene): add docstring
tmp_dir = Path(tempfile.mkdtemp(dir=root))
# Create a text file with the list of files to concatenate
path_concat_video_files = tmp_dir / "concat_video_files.txt"
with open(path_concat_video_files, "w") as f:
for ep_path in paths_to_cat:
f.write(f"file '{str(ep_path)}'\n")
path_tmp_output = tmp_dir / "tmp_output.mp4"
command = [
"ffmpeg",
"-y",
"-f",
"concat",
"-safe",
"0",
"-i",
str(path_concat_video_files),
"-c",
"copy",
str(path_tmp_output),
]
subprocess.run(command, check=True)
output_path = root / DEFAULT_VIDEO_PATH.format(
video_key=video_key, chunk_index=chunk_idx, file_index=file_idx
)
output_path.parent.mkdir(parents=True, exist_ok=True)
shutil.move(str(path_tmp_output), str(output_path))
shutil.rmtree(str(tmp_dir))
def get_video_duration_in_s(mp4_file: Path):
# TODO(rcadene): move to video_utils.py
command = [
"ffprobe",
"-v",
"error",
"-show_entries",
"format=duration",
"-of",
"default=noprint_wrappers=1:nokey=1",
str(mp4_file),
]
result = subprocess.run(
command,
stdout=subprocess.PIPE,
stderr=subprocess.STDOUT,
)
return float(result.stdout)
def flatten_dict(d: dict, parent_key: str = "", sep: str = "/") -> dict:
"""Flatten a nested dictionary structure by collapsing nested keys into one key with a separator.
@@ -107,23 +230,13 @@ def unflatten_dict(d: dict, sep: str = "/") -> dict:
return outdict
def get_nested_item(obj: DictLike, flattened_key: str, sep: str = "/") -> Any:
split_keys = flattened_key.split(sep)
getter = obj[split_keys[0]]
if len(split_keys) == 1:
return getter
for key in split_keys[1:]:
getter = getter[key]
return getter
def serialize_dict(stats: dict[str, torch.Tensor | np.ndarray | dict]) -> dict:
serialized_dict = {}
for key, value in flatten_dict(stats).items():
if isinstance(value, (torch.Tensor, np.ndarray)):
serialized_dict[key] = value.tolist()
elif isinstance(value, list) and isinstance(value[0], (int, float, list)):
serialized_dict[key] = value
elif isinstance(value, np.generic):
serialized_dict[key] = value.item()
elif isinstance(value, (int, float)):
@@ -153,23 +266,6 @@ def write_json(data: dict, fpath: Path) -> None:
json.dump(data, f, indent=4, ensure_ascii=False)
def load_jsonlines(fpath: Path) -> list[Any]:
with jsonlines.open(fpath, "r") as reader:
return list(reader)
def write_jsonlines(data: dict, fpath: Path) -> None:
fpath.parent.mkdir(exist_ok=True, parents=True)
with jsonlines.open(fpath, "w") as writer:
writer.write_all(data)
def append_jsonlines(data: dict, fpath: Path) -> None:
fpath.parent.mkdir(exist_ok=True, parents=True)
with jsonlines.open(fpath, "a") as writer:
writer.write(data)
def write_info(info: dict, local_dir: Path):
write_json(info, local_dir / INFO_PATH)
@@ -198,43 +294,42 @@ def load_stats(local_dir: Path) -> dict[str, dict[str, np.ndarray]]:
return cast_stats_to_numpy(stats)
def write_task(task_index: int, task: dict, local_dir: Path):
task_dict = {
"task_index": task_index,
"task": task,
}
append_jsonlines(task_dict, local_dir / TASKS_PATH)
def write_hf_dataset(hf_dataset: Dataset, local_dir: Path):
if get_hf_dataset_size_in_mb(hf_dataset) > DEFAULT_DATA_FILE_SIZE_IN_MB:
raise NotImplementedError("Contact a maintainer.")
path = local_dir / DEFAULT_DATA_PATH.format(chunk_index=0, file_index=0)
path.parent.mkdir(parents=True, exist_ok=True)
hf_dataset.to_parquet(path)
def load_tasks(local_dir: Path) -> tuple[dict, dict]:
tasks = load_jsonlines(local_dir / TASKS_PATH)
tasks = {item["task_index"]: item["task"] for item in sorted(tasks, key=lambda x: x["task_index"])}
task_to_task_index = {task: task_index for task_index, task in tasks.items()}
return tasks, task_to_task_index
def write_tasks(tasks: pandas.DataFrame, local_dir: Path):
path = local_dir / DEFAULT_TASKS_PATH
path.parent.mkdir(parents=True, exist_ok=True)
tasks.to_parquet(path)
def write_episode(episode: dict, local_dir: Path):
append_jsonlines(episode, local_dir / EPISODES_PATH)
def load_tasks(local_dir: Path):
tasks = pd.read_parquet(local_dir / DEFAULT_TASKS_PATH)
return tasks
def load_episodes(local_dir: Path) -> dict:
episodes = load_jsonlines(local_dir / EPISODES_PATH)
return {item["episode_index"]: item for item in sorted(episodes, key=lambda x: x["episode_index"])}
def write_episodes(episodes: Dataset, local_dir: Path):
if get_hf_dataset_size_in_mb(episodes) > DEFAULT_DATA_FILE_SIZE_IN_MB:
raise NotImplementedError("Contact a maintainer.")
fpath = local_dir / DEFAULT_EPISODES_PATH.format(chunk_index=0, file_index=0)
fpath.parent.mkdir(parents=True, exist_ok=True)
episodes.to_parquet(fpath)
def write_episode_stats(episode_index: int, episode_stats: dict, local_dir: Path):
# We wrap episode_stats in a dictionary since `episode_stats["episode_index"]`
# is a dictionary of stats and not an integer.
episode_stats = {"episode_index": episode_index, "stats": serialize_dict(episode_stats)}
append_jsonlines(episode_stats, local_dir / EPISODES_STATS_PATH)
def load_episodes_stats(local_dir: Path) -> dict:
episodes_stats = load_jsonlines(local_dir / EPISODES_STATS_PATH)
return {
item["episode_index"]: cast_stats_to_numpy(item["stats"])
for item in sorted(episodes_stats, key=lambda x: x["episode_index"])
}
def load_episodes(local_dir: Path) -> datasets.Dataset:
episodes = load_nested_dataset(local_dir / EPISODES_DIR)
# Select episode features/columns containing references to episode data and videos
# (e.g. tasks, dataset_from_index, dataset_to_index, data/chunk_index, data/file_index, etc.)
# This is to speed up access to this data, instead of having to load the episode stats.
episodes = episodes.select_columns([key for key in episodes.features if not key.startswith("stats/")])
return episodes
def backward_compatible_episodes_stats(
@@ -388,6 +483,7 @@ def get_hf_features_from_features(features: dict) -> datasets.Features:
def get_features_from_robot(robot: Robot, use_videos: bool = True) -> dict:
# TODO(rcadene): add fps for each feature
camera_ft = {}
if robot.cameras:
camera_ft = {
@@ -441,31 +537,17 @@ def create_empty_dataset_info(
"total_episodes": 0,
"total_frames": 0,
"total_tasks": 0,
"total_videos": 0,
"total_chunks": 0,
"chunks_size": DEFAULT_CHUNK_SIZE,
"data_files_size_in_mb": DEFAULT_DATA_FILE_SIZE_IN_MB,
"video_files_size_in_mb": DEFAULT_VIDEO_FILE_SIZE_IN_MB,
"fps": fps,
"splits": {},
"data_path": DEFAULT_PARQUET_PATH,
"data_path": DEFAULT_DATA_PATH,
"video_path": DEFAULT_VIDEO_PATH if use_videos else None,
"features": features,
}
def get_episode_data_index(
episode_dicts: dict[dict], episodes: list[int] | None = None
) -> dict[str, torch.Tensor]:
episode_lengths = {ep_idx: ep_dict["length"] for ep_idx, ep_dict in episode_dicts.items()}
if episodes is not None:
episode_lengths = {ep_idx: episode_lengths[ep_idx] for ep_idx in episodes}
cumulative_lengths = list(accumulate(episode_lengths.values()))
return {
"from": torch.LongTensor([0] + cumulative_lengths[:-1]),
"to": torch.LongTensor(cumulative_lengths),
}
def check_timestamps_sync(
timestamps: np.ndarray,
episode_indices: np.ndarray,
@@ -811,3 +893,208 @@ def validate_episode_buffer(episode_buffer: dict, total_episodes: int, features:
f"In episode_buffer not in features: {buffer_keys - set(features)}"
f"In features not in episode_buffer: {set(features) - buffer_keys}"
)
def safe_write_dataframe_to_parquet(df: pandas.DataFrame, path: Path, image_keys: list[str]):
if len(image_keys) > 0:
# TODO(qlhoest): replace this weird syntax by `df.to_parquet(path)` only
datasets.Dataset.from_dict(df.to_dict(orient="list")).to_parquet(path)
else:
df.to_parquet(path)
def item_to_torch(item: dict) -> dict:
"""Convert all items in a dictionary to PyTorch tensors where appropriate.
This function is used to convert an item from a streaming dataset to PyTorch tensors.
Args:
item (dict): Dictionary of items from a dataset.
Returns:
dict: Dictionary with all tensor-like items converted to torch.Tensor.
"""
for key, val in item.items():
if isinstance(val, (np.ndarray, list)) and key not in ["task"]:
# Convert numpy arrays and lists to torch tensors
item[key] = torch.tensor(val)
return item
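# Example (illustrative): lists and numpy arrays become tensors, while the "task" string
# is left untouched:
#     item_to_torch({"observation.state": [1.0, 2.0], "task": "pick the cube"})
#     # -> {"observation.state": tensor([1., 2.]), "task": "pick the cube"}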
class LookBackError(Exception):
"""
Exception raised when a Backtrackable object cannot look further back in its history.
"""
pass
class LookAheadError(Exception):
"""
Exception raised when a Backtrackable object cannot look further ahead in its source.
"""
pass
class Backtrackable(Generic[T]):
"""
Wrap any iterator/iterable so you can step back up to `history` items
and look ahead up to `lookahead` items.
This is useful for streaming datasets where you need to access previous and future items
but can't load the entire dataset into memory.
Example:
-------
```python
ds = load_dataset("c4", "en", streaming=True, split="train")
rev = Backtrackable(ds, history=3, lookahead=2)
x0 = next(rev) # forward
x1 = next(rev)
x2 = next(rev)
# Look ahead
x3_peek = rev.peek_ahead(1) # next item without moving cursor
x4_peek = rev.peek_ahead(2) # two items ahead
# Look back
x1_again = rev.peek_back(1) # previous item without moving cursor
x0_again = rev.peek_back(2) # two items back
# Move backward
x1_back = rev.prev() # back one step
next(rev) # returns x2, continues forward from where we were
```
"""
__slots__ = ("_source", "_back_buf", "_ahead_buf", "_cursor", "_lookahead")
def __init__(self, iterable: Iterable[T], *, history: int = 1, lookahead: int = 0):
if history < 1:
raise ValueError("history must be ≥ 1")
if lookahead < 0:
raise ValueError("lookahead must be ≥ 0")
self._source: Iterator[T] = iter(iterable)
self._back_buf: Deque[T] = deque(maxlen=history)
self._ahead_buf: Deque[T] = deque(maxlen=lookahead) if lookahead > 0 else deque()
self._cursor: int = 0 # 0 == just after the newest item in back_buf
self._lookahead = lookahead
def __iter__(self) -> "Backtrackable[T]":
return self
def __next__(self) -> T:
# If we've stepped back, consume from back buffer first
if self._cursor < 0: # -1 means "last item", etc.
self._cursor += 1
return self._back_buf[self._cursor]
# If we have items in the ahead buffer, use them first
item = self._ahead_buf.popleft() if self._ahead_buf else next(self._source)
# Add current item to back buffer and reset cursor
self._back_buf.append(item)
self._cursor = 0
return item
def prev(self) -> T:
"""
Step one item back in history and return it.
Raises LookBackError if already at the oldest buffered item.
"""
if len(self._back_buf) + self._cursor <= 1:
raise LookBackError("At start of history")
self._cursor -= 1
return self._back_buf[self._cursor]
def peek_back(self, n: int = 1) -> T:
"""
Look `n` items back (n=1 == previous item) without moving the cursor.
"""
if n < 1 or n > len(self._back_buf) + self._cursor:
raise LookBackError("peek_back distance out of range")
return self._back_buf[self._cursor - n]
def peek_ahead(self, n: int = 1) -> T:
"""
Look `n` items ahead (n=1 == next item) without moving the cursor.
Fills the ahead buffer if necessary.
"""
if n < 1:
raise LookAheadError("peek_ahead distance must be 1 or more")
elif n > self._lookahead:
raise LookAheadError("peek_ahead distance exceeds lookahead limit")
# Fill ahead buffer if we don't have enough items
while len(self._ahead_buf) < n:
try:
item = next(self._source)
self._ahead_buf.append(item)
except StopIteration as err:
raise LookAheadError("peek_ahead: not enough items in source") from err
return self._ahead_buf[n - 1]
def history(self) -> List[T]:
"""
Return a copy of the buffered history (most recent last).
The list length ≤ `history` argument passed at construction.
"""
if self._cursor == 0:
return list(self._back_buf)
# When cursor<0, slice so the order remains chronological
return list(self._back_buf)[: self._cursor or None]
def lookahead_buffer(self) -> List[T]:
"""
Return a copy of the current lookahead buffer.
"""
return list(self._ahead_buf)
def can_peek_back(self, steps: int = 1) -> bool:
"""
Check if we can go back `steps` items without raising a LookBackError.
"""
return steps <= len(self._back_buf) + self._cursor
def can_peek_ahead(self, steps: int = 1) -> bool:
"""
Check if we can peek ahead `steps` items.
This may involve trying to fill the ahead buffer.
"""
if self._lookahead > 0 and steps > self._lookahead:
return False
# Try to fill ahead buffer to check if we can peek that far
try:
while len(self._ahead_buf) < steps:
if self._lookahead > 0 and len(self._ahead_buf) >= self._lookahead:
return False
item = next(self._source)
self._ahead_buf.append(item)
return True
except StopIteration:
return False
def reset_cursor(self) -> None:
"""
Reset cursor to the most recent position (equivalent to calling next()
until you're back to the latest item).
"""
self._cursor = 0
def clear_ahead_buffer(self) -> None:
"""
Clear the ahead buffer, discarding any pre-fetched items.
"""
self._ahead_buf.clear()

View File

@@ -121,12 +121,12 @@ from safetensors.torch import load_file
from lerobot.common.datasets.utils import (
DEFAULT_CHUNK_SIZE,
DEFAULT_PARQUET_PATH,
DEFAULT_DATA_PATH,
DEFAULT_VIDEO_PATH,
EPISODES_PATH,
INFO_PATH,
LEGACY_EPISODES_PATH,
LEGACY_TASKS_PATH,
STATS_PATH,
TASKS_PATH,
create_branch,
create_lerobot_dataset_card,
flatten_dict,
@@ -291,14 +291,12 @@ def split_parquet_by_episodes(
for ep_chunk in range(total_chunks):
ep_chunk_start = DEFAULT_CHUNK_SIZE * ep_chunk
ep_chunk_end = min(DEFAULT_CHUNK_SIZE * (ep_chunk + 1), total_episodes)
chunk_dir = "/".join(DEFAULT_PARQUET_PATH.split("/")[:-1]).format(episode_chunk=ep_chunk)
chunk_dir = "/".join(DEFAULT_DATA_PATH.split("/")[:-1]).format(episode_chunk=ep_chunk)
(output_dir / chunk_dir).mkdir(parents=True, exist_ok=True)
for ep_idx in range(ep_chunk_start, ep_chunk_end):
ep_table = table.filter(pc.equal(table["episode_index"], ep_idx))
episode_lengths.insert(ep_idx, len(ep_table))
output_file = output_dir / DEFAULT_PARQUET_PATH.format(
episode_chunk=ep_chunk, episode_index=ep_idx
)
output_file = output_dir / DEFAULT_DATA_PATH.format(episode_chunk=ep_chunk, episode_index=ep_idx)
pq.write_table(ep_table, output_file)
return episode_lengths
@@ -496,7 +494,7 @@ def convert_dataset(
assert set(tasks) == {task for ep_tasks in tasks_by_episodes.values() for task in ep_tasks}
tasks = [{"task_index": task_idx, "task": task} for task_idx, task in enumerate(tasks)]
write_jsonlines(tasks, v20_dir / TASKS_PATH)
write_jsonlines(tasks, v20_dir / LEGACY_TASKS_PATH)
features["task_index"] = {
"dtype": "int64",
"shape": (1,),
@@ -546,7 +544,7 @@ def convert_dataset(
{"episode_index": ep_idx, "tasks": tasks_by_episodes[ep_idx], "length": episode_lengths[ep_idx]}
for ep_idx in episode_indices
]
write_jsonlines(episodes, v20_dir / EPISODES_PATH)
write_jsonlines(episodes, v20_dir / LEGACY_EPISODES_PATH)
# Assemble metadata v2.0
metadata_v2_0 = {
@@ -560,7 +558,7 @@ def convert_dataset(
"chunks_size": DEFAULT_CHUNK_SIZE,
"fps": metadata_v1["fps"],
"splits": {"train": f"0:{total_episodes}"},
"data_path": DEFAULT_PARQUET_PATH,
"data_path": DEFAULT_DATA_PATH,
"video_path": DEFAULT_VIDEO_PATH if video_keys else None,
"features": features,
}

View File

@@ -37,7 +37,7 @@ import logging
from huggingface_hub import HfApi
from lerobot.common.datasets.lerobot_dataset import CODEBASE_VERSION, LeRobotDataset
from lerobot.common.datasets.utils import EPISODES_STATS_PATH, STATS_PATH, load_stats, write_info
from lerobot.common.datasets.utils import LEGACY_EPISODES_STATS_PATH, STATS_PATH, load_stats, write_info
from lerobot.common.datasets.v21.convert_stats import check_aggregate_stats, convert_stats
V20 = "v2.0"
@@ -61,8 +61,8 @@ def convert_dataset(
with SuppressWarnings():
dataset = LeRobotDataset(repo_id, revision=V20, force_cache_sync=True)
if (dataset.root / EPISODES_STATS_PATH).is_file():
(dataset.root / EPISODES_STATS_PATH).unlink()
if (dataset.root / LEGACY_EPISODES_STATS_PATH).is_file():
(dataset.root / LEGACY_EPISODES_STATS_PATH).unlink()
convert_stats(dataset, num_workers=num_workers)
ref_stats = load_stats(dataset.root)

View File

@@ -19,7 +19,7 @@ from tqdm import tqdm
from lerobot.common.datasets.compute_stats import aggregate_stats, get_feature_stats, sample_indices
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset
from lerobot.common.datasets.utils import write_episode_stats
from lerobot.common.datasets.utils import legacy_write_episode_stats
def sample_episode_video_frames(dataset: LeRobotDataset, episode_index: int, ft_key: str) -> np.ndarray:
@@ -72,7 +72,7 @@ def convert_stats(dataset: LeRobotDataset, num_workers: int = 0):
convert_episode_stats(dataset, ep_idx)
for ep_idx in tqdm(range(total_episodes)):
write_episode_stats(ep_idx, dataset.meta.episodes_stats[ep_idx], dataset.root)
legacy_write_episode_stats(ep_idx, dataset.meta.episodes_stats[ep_idx], dataset.root)
def check_aggregate_stats(

View File

@@ -0,0 +1,452 @@
"""
This script will help you convert any LeRobot dataset already pushed to the hub from codebase version 2.1 to
3.0. It will:
- Consolidate the per-episode parquet and mp4 files into larger chunked files (data/chunk-xxx/file-xxx.parquet, videos/CAMERA/chunk-xxx/file-xxx.mp4).
- Convert `meta/tasks.jsonl`, `meta/episodes.jsonl` and `meta/episodes_stats.jsonl` into parquet metadata files.
- Update the codebase version and path templates in `meta/info.json`.
- Remove the legacy per-episode files from the hub and tag the new version with the new codebase version.
Usage:
```bash
python lerobot/common/datasets/v30/convert_dataset_v21_to_v30.py \
--repo-id=lerobot/pusht
```
"""
import argparse
import shutil
from pathlib import Path
from typing import Any
import jsonlines
import pandas as pd
import pyarrow as pa
import tqdm
from datasets import Dataset, Features, Image
from huggingface_hub import HfApi, snapshot_download
from requests import HTTPError
from lerobot.common.constants import HF_LEROBOT_HOME
from lerobot.common.datasets.compute_stats import aggregate_stats
from lerobot.common.datasets.lerobot_dataset import CODEBASE_VERSION, LeRobotDataset
from lerobot.common.datasets.utils import (
DEFAULT_CHUNK_SIZE,
DEFAULT_DATA_FILE_SIZE_IN_MB,
DEFAULT_DATA_PATH,
DEFAULT_VIDEO_FILE_SIZE_IN_MB,
DEFAULT_VIDEO_PATH,
cast_stats_to_numpy,
concat_video_files,
flatten_dict,
get_parquet_file_size_in_mb,
get_parquet_num_frames,
get_video_duration_in_s,
get_video_size_in_mb,
load_info,
update_chunk_file_indices,
write_episodes,
write_info,
write_stats,
write_tasks,
)
LEGACY_EPISODES_PATH = "meta/episodes.jsonl"
LEGACY_EPISODES_STATS_PATH = "meta/episodes_stats.jsonl"
LEGACY_TASKS_PATH = "meta/tasks.jsonl"
LEGACY_DEFAULT_VIDEO_PATH = "videos/chunk-{episode_chunk:03d}/{video_key}/episode_{episode_index:06d}.mp4"
LEGACY_DEFAULT_PARQUET_PATH = "data/chunk-{episode_chunk:03d}/episode_{episode_index:06d}.parquet"
V21 = "v2.1"
"""
-------------------------
OLD
data/chunk-000/episode_000000.parquet
NEW
data/chunk-000/file-000.parquet
-------------------------
OLD
videos/chunk-000/CAMERA/episode_000000.mp4
NEW
videos/CAMERA/chunk-000/file-000.mp4
-------------------------
OLD
episodes.jsonl
{"episode_index": 1, "tasks": ["Put the blue block in the green bowl"], "length": 266}
NEW
meta/episodes/chunk-000/file-000.parquet
episode_index | data/chunk_index | data/file_index | videos/CAMERA/chunk_index | videos/CAMERA/file_index | dataset_from_index | dataset_to_index | tasks | length | stats/...
-------------------------
OLD
tasks.jsonl
{"task_index": 1, "task": "Put the blue block in the green bowl"}
NEW
meta/tasks.parquet
task_index | task
-------------------------
OLD
episodes_stats.jsonl
NEW
stats are stored as flattened stats/... columns inside meta/episodes/chunk-000/file-000.parquet
(e.g. stats/FEATURE/mean | stats/FEATURE/std | stats/FEATURE/min | stats/FEATURE/max)
-------------------------
UPDATE
meta/info.json
-------------------------
"""
def load_jsonlines(fpath: Path) -> list[Any]:
with jsonlines.open(fpath, "r") as reader:
return list(reader)
def legacy_load_episodes(local_dir: Path) -> dict:
episodes = load_jsonlines(local_dir / LEGACY_EPISODES_PATH)
return {item["episode_index"]: item for item in sorted(episodes, key=lambda x: x["episode_index"])}
def legacy_load_episodes_stats(local_dir: Path) -> dict:
episodes_stats = load_jsonlines(local_dir / LEGACY_EPISODES_STATS_PATH)
return {
item["episode_index"]: cast_stats_to_numpy(item["stats"])
for item in sorted(episodes_stats, key=lambda x: x["episode_index"])
}
def legacy_load_tasks(local_dir: Path) -> tuple[dict, dict]:
tasks = load_jsonlines(local_dir / LEGACY_TASKS_PATH)
tasks = {item["task_index"]: item["task"] for item in sorted(tasks, key=lambda x: x["task_index"])}
task_to_task_index = {task: task_index for task_index, task in tasks.items()}
return tasks, task_to_task_index
def convert_tasks(root, new_root):
tasks, _ = legacy_load_tasks(root)
task_indices = tasks.keys()
task_strings = tasks.values()
df_tasks = pd.DataFrame({"task_index": task_indices}, index=task_strings)
write_tasks(df_tasks, new_root)
def concat_data_files(paths_to_cat, new_root, chunk_idx, file_idx, image_keys):
# TODO(rcadene): to save RAM use Dataset.from_parquet(file) and concatenate_datasets
dataframes = [pd.read_parquet(file) for file in paths_to_cat]
# Concatenate all DataFrames along rows
concatenated_df = pd.concat(dataframes, ignore_index=True)
path = new_root / DEFAULT_DATA_PATH.format(chunk_index=chunk_idx, file_index=file_idx)
path.parent.mkdir(parents=True, exist_ok=True)
if len(image_keys) > 0:
schema = pa.Schema.from_pandas(concatenated_df)
features = Features.from_arrow_schema(schema)
for key in image_keys:
features[key] = Image()
schema = features.arrow_schema
else:
schema = None
concatenated_df.to_parquet(path, index=False, schema=schema)
def convert_data(root, new_root):
data_dir = root / "data"
ep_paths = sorted(data_dir.glob("*/*.parquet"))
image_keys = get_image_keys(root)
ep_idx = 0
chunk_idx = 0
file_idx = 0
size_in_mb = 0
num_frames = 0
paths_to_cat = []
episodes_metadata = []
for ep_path in ep_paths:
ep_size_in_mb = get_parquet_file_size_in_mb(ep_path)
ep_num_frames = get_parquet_num_frames(ep_path)
ep_metadata = {
"episode_index": ep_idx,
"data/chunk_index": chunk_idx,
"data/file_index": file_idx,
"dataset_from_index": num_frames,
"dataset_to_index": num_frames + ep_num_frames,
}
size_in_mb += ep_size_in_mb
num_frames += ep_num_frames
episodes_metadata.append(ep_metadata)
ep_idx += 1
if size_in_mb < DEFAULT_DATA_FILE_SIZE_IN_MB:
paths_to_cat.append(ep_path)
continue
concat_data_files(paths_to_cat, new_root, chunk_idx, file_idx, image_keys)
# Reset for the next file
size_in_mb = ep_size_in_mb
num_frames = ep_num_frames
paths_to_cat = [ep_path]
chunk_idx, file_idx = update_chunk_file_indices(chunk_idx, file_idx, DEFAULT_CHUNK_SIZE)
# Write remaining data if any
if paths_to_cat:
concat_data_files(paths_to_cat, new_root, chunk_idx, file_idx, image_keys)
return episodes_metadata
def get_video_keys(root):
info = load_info(root)
features = info["features"]
video_keys = [key for key, ft in features.items() if ft["dtype"] == "video"]
return video_keys
def get_image_keys(root):
info = load_info(root)
features = info["features"]
image_keys = [key for key, ft in features.items() if ft["dtype"] == "image"]
return image_keys
def convert_videos(root: Path, new_root: Path):
video_keys = get_video_keys(root)
if len(video_keys) == 0:
return None
video_keys = sorted(video_keys)
eps_metadata_per_cam = []
for camera in video_keys:
eps_metadata = convert_videos_of_camera(root, new_root, camera)
eps_metadata_per_cam.append(eps_metadata)
num_eps_per_cam = [len(eps_cam_map) for eps_cam_map in eps_metadata_per_cam]
if len(set(num_eps_per_cam)) != 1:
raise ValueError(f"All cams dont have same number of episodes ({num_eps_per_cam}).")
episodes_metadata = []
num_cameras = len(video_keys)
num_episodes = num_eps_per_cam[0]
for ep_idx in range(num_episodes):
# Sanity check
ep_ids = [eps_metadata_per_cam[cam_idx][ep_idx]["episode_index"] for cam_idx in range(num_cameras)]
ep_ids += [ep_idx]
if len(set(ep_ids)) != 1:
raise ValueError(f"All episode indices need to match ({ep_ids}).")
ep_dict = {}
for cam_idx in range(num_cameras):
ep_dict.update(eps_metadata_per_cam[cam_idx][ep_idx])
episodes_metadata.append(ep_dict)
return episodes_metadata
def convert_videos_of_camera(root: Path, new_root: Path, video_key):
# Access old paths to mp4
videos_dir = root / "videos"
ep_paths = sorted(videos_dir.glob(f"*/{video_key}/*.mp4"))
ep_idx = 0
chunk_idx = 0
file_idx = 0
size_in_mb = 0
duration_in_s = 0.0
paths_to_cat = []
episodes_metadata = []
for ep_path in tqdm.tqdm(ep_paths, desc=f"convert videos of {video_key}"):
ep_size_in_mb = get_video_size_in_mb(ep_path)
ep_duration_in_s = get_video_duration_in_s(ep_path)
ep_metadata = {
"episode_index": ep_idx,
f"videos/{video_key}/chunk_index": chunk_idx,
f"videos/{video_key}/file_index": file_idx,
f"videos/{video_key}/from_timestamp": duration_in_s,
f"videos/{video_key}/to_timestamp": duration_in_s + ep_duration_in_s,
}
size_in_mb += ep_size_in_mb
duration_in_s += ep_duration_in_s
episodes_metadata.append(ep_metadata)
ep_idx += 1
if size_in_mb < DEFAULT_VIDEO_FILE_SIZE_IN_MB:
paths_to_cat.append(ep_path)
continue
concat_video_files(paths_to_cat, new_root, video_key, chunk_idx, file_idx)
# Reset for the next file
size_in_mb = ep_size_in_mb
duration_in_s = ep_duration_in_s
paths_to_cat = [ep_path]
chunk_idx, file_idx = update_chunk_file_indices(chunk_idx, file_idx, DEFAULT_CHUNK_SIZE)
# Write remaining videos if any
if paths_to_cat:
concat_video_files(paths_to_cat, new_root, video_key, chunk_idx, file_idx)
return episodes_metadata
def generate_episode_metadata_dict(
episodes_legacy_metadata, episodes_metadata, episodes_stats, episodes_videos=None
):
num_episodes = len(episodes_metadata)
episodes_legacy_metadata_vals = list(episodes_legacy_metadata.values())
episodes_stats_vals = list(episodes_stats.values())
episodes_stats_keys = list(episodes_stats.keys())
for i in range(num_episodes):
ep_legacy_metadata = episodes_legacy_metadata_vals[i]
ep_metadata = episodes_metadata[i]
ep_stats = episodes_stats_vals[i]
ep_ids_set = {
ep_legacy_metadata["episode_index"],
ep_metadata["episode_index"],
episodes_stats_keys[i],
}
if episodes_videos is None:
ep_video = {}
else:
ep_video = episodes_videos[i]
ep_ids_set.add(ep_video["episode_index"])
if len(ep_ids_set) != 1:
raise ValueError(f"Number of episodes is not the same ({ep_ids_set}).")
ep_dict = {**ep_metadata, **ep_video, **ep_legacy_metadata, **flatten_dict({"stats": ep_stats})}
ep_dict["meta/episodes/chunk_index"] = 0
ep_dict["meta/episodes/file_index"] = 0
yield ep_dict
def convert_episodes_metadata(root, new_root, episodes_metadata, episodes_video_metadata=None):
episodes_legacy_metadata = legacy_load_episodes(root)
episodes_stats = legacy_load_episodes_stats(root)
num_eps_set = {len(episodes_legacy_metadata), len(episodes_metadata)}
if episodes_video_metadata is not None:
num_eps_set.add(len(episodes_video_metadata))
if len(num_eps_set) != 1:
raise ValueError(f"Number of episodes is not the same ({num_eps_set}).")
ds_episodes = Dataset.from_generator(
lambda: generate_episode_metadata_dict(
episodes_legacy_metadata, episodes_metadata, episodes_stats, episodes_video_metadata
)
)
write_episodes(ds_episodes, new_root)
stats = aggregate_stats(list(episodes_stats.values()))
write_stats(stats, new_root)
def convert_info(root, new_root):
info = load_info(root)
info["codebase_version"] = "v3.0"
del info["total_chunks"]
del info["total_videos"]
info["data_files_size_in_mb"] = DEFAULT_DATA_FILE_SIZE_IN_MB
info["video_files_size_in_mb"] = DEFAULT_VIDEO_FILE_SIZE_IN_MB
info["data_path"] = DEFAULT_DATA_PATH
info["video_path"] = DEFAULT_VIDEO_PATH
info["fps"] = float(info["fps"])
for key in info["features"]:
if info["features"][key]["dtype"] == "video":
# already has fps in video_info
continue
info["features"][key]["fps"] = info["fps"]
write_info(info, new_root)
def convert_dataset(
repo_id: str,
branch: str | None = None,
num_workers: int = 4,
):
root = HF_LEROBOT_HOME / repo_id
old_root = HF_LEROBOT_HOME / f"{repo_id}_old"
new_root = HF_LEROBOT_HOME / f"{repo_id}_v30"
if old_root.is_dir() and root.is_dir():
shutil.rmtree(str(root))
shutil.move(str(old_root), str(root))
if new_root.is_dir():
shutil.rmtree(new_root)
snapshot_download(
repo_id,
repo_type="dataset",
revision=V21,
local_dir=root,
)
convert_info(root, new_root)
convert_tasks(root, new_root)
episodes_metadata = convert_data(root, new_root)
episodes_videos_metadata = convert_videos(root, new_root)
convert_episodes_metadata(root, new_root, episodes_metadata, episodes_videos_metadata)
shutil.move(str(root), str(old_root))
shutil.move(str(new_root), str(root))
hub_api = HfApi()
try:
hub_api.delete_tag(repo_id, tag=CODEBASE_VERSION, repo_type="dataset")
except HTTPError as e:
print(f"tag={CODEBASE_VERSION} probably doesn't exist. Skipping exception ({e})")
hub_api.delete_files(
delete_patterns=["data/chunk*/episode_*", "meta/*.jsonl", "videos/chunk*"],
repo_id=repo_id,
revision=branch,
repo_type="dataset",
)
hub_api.create_tag(repo_id, tag=CODEBASE_VERSION, revision=branch, repo_type="dataset")
LeRobotDataset(repo_id).push_to_hub()
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument(
"--repo-id",
type=str,
required=True,
help="Repository identifier on Hugging Face: a community or a user name `/` the name of the dataset "
"(e.g. `lerobot/pusht`, `cadene/aloha_sim_insertion_human`).",
)
parser.add_argument(
"--branch",
type=str,
default=None,
help="Repo branch to push your dataset. Defaults to the main branch.",
)
parser.add_argument(
"--num-workers",
type=int,
default=4,
help="Number of workers for parallelizing stats compute. Defaults to 4.",
)
args = parser.parse_args()
convert_dataset(**vars(args))

View File

@@ -21,12 +21,15 @@ import warnings
from collections import OrderedDict
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, ClassVar
from threading import Lock
from typing import Any, ClassVar, Dict, Literal, Optional
import fsspec
import pyarrow as pa
import torch
import torchvision
from datasets.features.features import register_feature
from line_profiler import profile
from PIL import Image
@@ -74,7 +77,7 @@ def decode_video_frames_torchvision(
video_path: Path | str,
timestamps: list[float],
tolerance_s: float,
backend: str = "pyav",
backend: Literal["pyav", "video_reader"] = "pyav",
log_loaded_timestamps: bool = False,
) -> torch.Tensor:
"""Loads frames associated to the requested timestamps of a video
@@ -155,6 +158,7 @@ def decode_video_frames_torchvision(
)
# get closest frames to the query timestamps
# TODO(rcadene): remove torch.stack
closest_frames = torch.stack([loaded_frames[idx] for idx in argmin_])
closest_ts = loaded_ts[argmin_]
@@ -168,15 +172,62 @@ def decode_video_frames_torchvision(
return closest_frames
class VideoDecoderCache:
"""Thread-safe cache for video decoders to avoid expensive re-initialization."""
def __init__(self):
self._cache: Dict[str, Any] = {}
self._lock = Lock()
def get_decoder(self, video_path: str):
"""Get a cached decoder or create a new one."""
if importlib.util.find_spec("torchcodec"):
from torchcodec.decoders import VideoDecoder
else:
raise ImportError("torchcodec is required but not available.")
video_path = str(video_path)
with self._lock:
if video_path not in self._cache:
file_handle = fsspec.open(video_path, client_kwargs={"trust_env": True}).__enter__()
decoder = VideoDecoder(file_handle, seek_mode="approximate")
self._cache[video_path] = decoder
return self._cache[video_path]
def clear(self):
"""Clear the cache."""
with self._lock:
self._cache.clear()
def size(self) -> int:
"""Return the number of cached decoders."""
with self._lock:
return len(self._cache)
# Global instance
_default_decoder_cache = VideoDecoderCache()
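# Usage sketch (illustrative): a dedicated cache can be shared across calls so that each
# video file is opened and its decoder initialized only once; passing decoder_cache=None
# to decode_video_frames_torchcodec below falls back to _default_decoder_cache.
#     cache = VideoDecoderCache()
#     frames = decode_video_frames_torchcodec(
#         "episode.mp4", timestamps=[0.0, 0.5], tolerance_s=1e-4, decoder_cache=cache
#     )
#     cache.clear()  # drop cached decoders (and their open file handles) when done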
@profile
def decode_video_frames_torchcodec(
video_path: Path | str,
timestamps: list[float],
tolerance_s: float,
device: str = "cpu",
log_loaded_timestamps: bool = False,
decoder_cache: Optional[VideoDecoderCache] = None,
) -> torch.Tensor:
"""Loads frames associated with the requested timestamps of a video using torchcodec.
Args:
video_path: Path to the video file.
timestamps: List of timestamps to extract frames.
tolerance_s: Allowed deviation in seconds for frame retrieval.
log_loaded_timestamps: Whether to log loaded timestamps.
decoder_cache: Optional decoder cache instance. Uses default if None.
Note: Setting device="cuda" outside the main process, e.g. in data loader workers, will lead to CUDA initialization errors.
Note: Video benefits from inter-frame compression. Instead of storing every frame individually,
@@ -185,23 +236,20 @@ def decode_video_frames_torchcodec(
and all subsequent frames until reaching the requested frame. The number of key frames in a video
can be adjusted during encoding to take into account decoding time and video size in bytes.
"""
if decoder_cache is None:
decoder_cache = _default_decoder_cache
if importlib.util.find_spec("torchcodec"):
from torchcodec.decoders import VideoDecoder
else:
raise ImportError("torchcodec is required but not available.")
# Use cached decoder instead of creating new one each time
decoder = decoder_cache.get_decoder(str(video_path))
# initialize video decoder
decoder = VideoDecoder(video_path, device=device, seek_mode="approximate")
loaded_frames = []
loaded_ts = []
loaded_frames = []
# get metadata for frame information
metadata = decoder.metadata
average_fps = metadata.average_fps
# convert timestamps to frame indices
frame_indices = [round(ts * average_fps) for ts in timestamps]
# retrieve frames based on indices
frames_batch = decoder.get_frames_at(indices=frame_indices)
@@ -252,7 +300,7 @@ def encode_video_frames(
g: int | None = 2,
crf: int | None = 30,
fast_decode: int = 0,
log_level: str | None = "error",
log_level: str | None = "quiet",
overwrite: bool = False,
) -> None:
"""More info on ffmpeg arguments tuning on `benchmark/video/README.md`"""
@@ -264,7 +312,7 @@ def encode_video_frames(
[
("-f", "image2"),
("-r", str(fps)),
("-i", str(imgs_dir / "frame_%06d.png")),
("-i", str(imgs_dir / "frame-%06d.png")),
("-vcodec", vcodec),
("-pix_fmt", pix_fmt),
]

View File

@@ -228,3 +228,13 @@ def is_valid_numpy_dtype_string(dtype_str: str) -> bool:
except TypeError:
# If a TypeError is raised, the string is not a valid dtype
return False
def get_elapsed_time_in_days_hours_minutes_seconds(elapsed_time_s: float):
days = int(elapsed_time_s // (24 * 3600))
elapsed_time_s %= 24 * 3600
hours = int(elapsed_time_s // 3600)
elapsed_time_s %= 3600
minutes = int(elapsed_time_s // 60)
seconds = elapsed_time_s % 60
return days, hours, minutes, seconds
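# Example: 90061.5 seconds is 1 day, 1 hour, 1 minute and 1.5 seconds:
#     get_elapsed_time_in_days_hours_minutes_seconds(90061.5)  # -> (1, 1, 1, 1.5)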

View File

@@ -166,7 +166,8 @@ def train(cfg: TrainPipelineConfig):
if hasattr(cfg.policy, "drop_n_last_frames"):
shuffle = False
sampler = EpisodeAwareSampler(
dataset.episode_data_index,
dataset.meta.episodes["dataset_from_index"],
dataset.meta.episodes["dataset_to_index"],
drop_n_last_frames=cfg.policy.drop_n_last_frames,
shuffle=True,
)

View File

@@ -79,8 +79,8 @@ from lerobot.common.datasets.lerobot_dataset import LeRobotDataset
class EpisodeSampler(torch.utils.data.Sampler):
def __init__(self, dataset: LeRobotDataset, episode_index: int):
from_idx = dataset.episode_data_index["from"][episode_index].item()
to_idx = dataset.episode_data_index["to"][episode_index].item()
from_idx = dataset.meta.episodes["dataset_from_index"][episode_index]
to_idx = dataset.meta.episodes["dataset_to_index"][episode_index]
self.frame_ids = range(from_idx, to_idx)
def __iter__(self) -> Iterator:
@@ -283,7 +283,7 @@ def main():
tolerance_s = kwargs.pop("tolerance_s")
logging.info("Loading dataset")
dataset = LeRobotDataset(repo_id, root=root, tolerance_s=tolerance_s)
dataset = LeRobotDataset(repo_id, episodes=[args.episode_index], root=root, tolerance_s=tolerance_s)
visualize_dataset(dataset, **vars(args))

View File

@@ -271,8 +271,8 @@ def get_episode_data(dataset: LeRobotDataset | IterableNamespace, episode_index)
selected_columns.insert(0, "timestamp")
if isinstance(dataset, LeRobotDataset):
from_idx = dataset.episode_data_index["from"][episode_index]
to_idx = dataset.episode_data_index["to"][episode_index]
from_idx = dataset.meta.episodes["dataset_from_index"][episode_index]
to_idx = dataset.meta.episodes["dataset_to_index"][episode_index]
data = (
dataset.hf_dataset.select(range(from_idx, to_idx))
.select_columns(selected_columns)
@@ -308,7 +308,7 @@ def get_episode_data(dataset: LeRobotDataset | IterableNamespace, episode_index)
def get_episode_video_paths(dataset: LeRobotDataset, ep_index: int) -> list[str]:
# get first frame of episode (hack to get video_path of the episode)
first_frame_idx = dataset.episode_data_index["from"][ep_index].item()
first_frame_idx = dataset.meta.episodes["dataset_from_index"][ep_index]
return [
dataset.hf_dataset.select_columns(key)[first_frame_idx][key]["path"]
for key in dataset.meta.video_keys
@@ -321,7 +321,7 @@ def get_episode_language_instruction(dataset: LeRobotDataset, ep_index: int) ->
return None
# get first frame index
first_frame_idx = dataset.episode_data_index["from"][ep_index].item()
first_frame_idx = dataset.meta.episodes["dataset_from_index"][ep_index]
language_instruction = dataset.hf_dataset[first_frame_idx]["language_instruction"]
# TODO (michel-aractingi) hack to get the sentence, some strings in openx are badly stored

View File

@@ -47,17 +47,23 @@ def save_dataset_to_safetensors(output_dir, repo_id="lerobot/pusht"):
)
# save 2 first frames of first episode
i = dataset.episode_data_index["from"][0].item()
i = dataset.meta.episodes["dataset_from_index"][0].item()
save_file(dataset[i], repo_dir / f"frame_{i}.safetensors")
save_file(dataset[i + 1], repo_dir / f"frame_{i + 1}.safetensors")
# save 2 frames at the middle of first episode
i = int((dataset.episode_data_index["to"][0].item() - dataset.episode_data_index["from"][0].item()) / 2)
i = int(
(
dataset.meta.episodes["dataset_to_index"][0].item()
- dataset.meta.episodes["dataset_from_index"][0].item()
)
/ 2
)
save_file(dataset[i], repo_dir / f"frame_{i}.safetensors")
save_file(dataset[i + 1], repo_dir / f"frame_{i + 1}.safetensors")
# save 2 last frames of first episode
i = dataset.episode_data_index["to"][0].item()
i = dataset.meta.episodes["dataset_to_index"][0].item()
save_file(dataset[i - 2], repo_dir / f"frame_{i - 2}.safetensors")
save_file(dataset[i - 1], repo_dir / f"frame_{i - 1}.safetensors")
@@ -65,17 +71,17 @@ def save_dataset_to_safetensors(output_dir, repo_id="lerobot/pusht"):
# We currently cant because our test dataset only contains the first episode
# # save 2 first frames of second episode
# i = dataset.episode_data_index["from"][1].item()
# i = dataset.meta.episodes["dataset_from_index"][1].item()
# save_file(dataset[i], repo_dir / f"frame_{i}.safetensors")
# save_file(dataset[i + 1], repo_dir / f"frame_{i+1}.safetensors")
# # save 2 last frames of second episode
# i = dataset.episode_data_index["to"][1].item()
# i = dataset.meta.episodes["dataset_to_index"][1].item()
# save_file(dataset[i - 2], repo_dir / f"frame_{i-2}.safetensors")
# save_file(dataset[i - 1], repo_dir / f"frame_{i-1}.safetensors")
# # save 2 last frames of last episode
# i = dataset.episode_data_index["to"][-1].item()
# i = dataset.meta.episodes["dataset_to_index"][-1].item()
# save_file(dataset[i - 2], repo_dir / f"frame_{i-2}.safetensors")
# save_file(dataset[i - 1], repo_dir / f"frame_{i-1}.safetensors")

View File

@@ -0,0 +1,29 @@
from lerobot.common.datasets.aggregate import aggregate_datasets
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset
from tests.fixtures.constants import DUMMY_REPO_ID
def test_aggregate_datasets(tmp_path, lerobot_dataset_factory):
ds_0 = lerobot_dataset_factory(
root=tmp_path / "test_0",
repo_id=f"{DUMMY_REPO_ID}_0",
total_episodes=10,
total_frames=400,
)
ds_1 = lerobot_dataset_factory(
root=tmp_path / "test_1",
repo_id=f"{DUMMY_REPO_ID}_1",
total_episodes=10,
total_frames=400,
)
aggregate_datasets(
repo_ids=[ds_0.repo_id, ds_1.repo_id],
roots=[ds_0.root, ds_1.root],
aggr_repo_id=f"{DUMMY_REPO_ID}_aggr",
aggr_root=tmp_path / "test_aggr",
)
aggr_ds = LeRobotDataset(f"{DUMMY_REPO_ID}_aggr", root=tmp_path / "test_aggr")
for _ in aggr_ds:
pass

View File

@@ -13,10 +13,8 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import json
import logging
import re
from copy import deepcopy
from itertools import chain
from pathlib import Path
@@ -36,8 +34,6 @@ from lerobot.common.datasets.lerobot_dataset import (
)
from lerobot.common.datasets.utils import (
create_branch,
flatten_dict,
unflatten_dict,
)
from lerobot.common.envs.factory import make_env_config
from lerobot.common.policies.factory import make_policy_config
@@ -75,7 +71,7 @@ def test_same_attributes_defined(tmp_path, lerobot_dataset_factory):
dataset_create = LeRobotDataset.create(repo_id=DUMMY_REPO_ID, fps=30, robot=robot, root=root_create)
root_init = tmp_path / "init"
dataset_init = lerobot_dataset_factory(root=root_init)
dataset_init = lerobot_dataset_factory(root=root_init, total_episodes=1, total_frames=1)
init_attr = set(vars(dataset_init).keys())
create_attr = set(vars(dataset_create).keys())
@@ -100,6 +96,25 @@ def test_dataset_initialization(tmp_path, lerobot_dataset_factory):
assert dataset.num_frames == len(dataset)
# TODO(rcadene, aliberts): do not run LeRobotDataset.create, instead refactor LeRobotDatasetMetadata.create
# and test the small resulting function that validates the features
def test_dataset_feature_with_forward_slash_raises_error():
# make sure dir does not exist
from lerobot.common.constants import HF_LEROBOT_HOME
dataset_dir = HF_LEROBOT_HOME / "lerobot/test/with/slash"
# make sure does not exist
if dataset_dir.exists():
dataset_dir.rmdir()
with pytest.raises(ValueError):
LeRobotDataset.create(
repo_id="lerobot/test/with/slash",
fps=30,
features={"a/b": {"dtype": "float32", "shape": 2, "names": None}},
)
def test_add_frame_missing_task(tmp_path, empty_lerobot_dataset_factory):
features = {"state": {"dtype": "float32", "shape": (1,), "names": None}}
dataset = empty_lerobot_dataset_factory(root=tmp_path / "test", features=features)
@@ -329,6 +344,13 @@ def test_image_array_to_pil_image_wrong_range_float_0_255():
# - [ ] test push_to_hub
# - [ ] test smaller methods
# TODO(rcadene):
# - [ ] fix code so that old test_factory + backward pass
# - [ ] write new unit tests to test save_episode + getitem
# - [ ] save_episode : case where new dataset, concatenate same file, write new file (meta/episodes, data, videos)
# - [ ]
# - [ ] remove old tests
@pytest.mark.parametrize(
"env_name, repo_id, policy_name",
@@ -436,30 +458,6 @@ def test_multidataset_frames():
assert torch.equal(sub_dataset_item[k], dataset_item[k])
# TODO(aliberts): Move to more appropriate location
def test_flatten_unflatten_dict():
d = {
"obs": {
"min": 0,
"max": 1,
"mean": 2,
"std": 3,
},
"action": {
"min": 4,
"max": 5,
"mean": 6,
"std": 7,
},
}
original_d = deepcopy(d)
d = unflatten_dict(flatten_dict(d))
# test equality between nested dicts
assert json.dumps(original_d, sort_keys=True) == json.dumps(d, sort_keys=True), f"{original_d} != {d}"
@pytest.mark.parametrize(
"repo_id",
[
@@ -506,17 +504,23 @@ def test_backward_compatibility(repo_id):
)
# test2 first frames of first episode
i = dataset.episode_data_index["from"][0].item()
i = dataset.meta.episodes["dataset_from_index"][0].item()
load_and_compare(i)
load_and_compare(i + 1)
# test 2 frames at the middle of first episode
i = int((dataset.episode_data_index["to"][0].item() - dataset.episode_data_index["from"][0].item()) / 2)
i = int(
(
dataset.meta.episodes["dataset_to_index"][0].item()
- dataset.meta.episodes["dataset_from_index"][0].item()
)
/ 2
)
load_and_compare(i)
load_and_compare(i + 1)
# test 2 last frames of first episode
i = dataset.episode_data_index["to"][0].item()
i = dataset.meta.episodes["dataset_to_index"][0].item()
load_and_compare(i - 2)
load_and_compare(i - 1)
@@ -524,17 +528,17 @@ def test_backward_compatibility(repo_id):
# We currently cant because our test dataset only contains the first episode
# # test 2 first frames of second episode
# i = dataset.episode_data_index["from"][1].item()
# i = dataset.meta.episodes["dataset_from_index"][1].item()
# load_and_compare(i)
# load_and_compare(i + 1)
# # test 2 last frames of second episode
# i = dataset.episode_data_index["to"][1].item()
# i = dataset.meta.episodes["dataset_to_index"][1].item()
# load_and_compare(i - 2)
# load_and_compare(i - 1)
# # test 2 last frames of last episode
# i = dataset.episode_data_index["to"][-1].item()
# i = dataset.meta.episodes["dataset_to_index"][-1].item()
# load_and_compare(i - 2)
# load_and_compare(i - 1)
@@ -563,20 +567,3 @@ def test_create_branch():
# Clean
api.delete_repo(repo_id, repo_type=repo_type)
def test_dataset_feature_with_forward_slash_raises_error():
# make sure dir does not exist
from lerobot.common.constants import HF_LEROBOT_HOME
dataset_dir = HF_LEROBOT_HOME / "lerobot/test/with/slash"
# make sure does not exist
if dataset_dir.exists():
dataset_dir.rmdir()
with pytest.raises(ValueError):
LeRobotDataset.create(
repo_id="lerobot/test/with/slash",
fps=30,
features={"a/b": {"dtype": "float32", "shape": 2, "names": None}},
)

View File

@@ -32,7 +32,7 @@ def test_drop_n_first_frames():
)
dataset.set_transform(hf_transform_to_torch)
episode_data_index = calculate_episode_data_index(dataset)
sampler = EpisodeAwareSampler(episode_data_index, drop_n_first_frames=1)
sampler = EpisodeAwareSampler(episode_data_index["from"], episode_data_index["to"], drop_n_first_frames=1)
assert sampler.indices == [1, 4, 5]
assert len(sampler) == 3
assert list(sampler) == [1, 4, 5]
@@ -48,7 +48,7 @@ def test_drop_n_last_frames():
)
dataset.set_transform(hf_transform_to_torch)
episode_data_index = calculate_episode_data_index(dataset)
sampler = EpisodeAwareSampler(episode_data_index, drop_n_last_frames=1)
sampler = EpisodeAwareSampler(episode_data_index["from"], episode_data_index["to"], drop_n_last_frames=1)
assert sampler.indices == [0, 3, 4]
assert len(sampler) == 3
assert list(sampler) == [0, 3, 4]
@@ -64,7 +64,9 @@ def test_episode_indices_to_use():
)
dataset.set_transform(hf_transform_to_torch)
episode_data_index = calculate_episode_data_index(dataset)
sampler = EpisodeAwareSampler(episode_data_index, episode_indices_to_use=[0, 2])
sampler = EpisodeAwareSampler(
episode_data_index["from"], episode_data_index["to"], episode_indices_to_use=[0, 2]
)
assert sampler.indices == [0, 1, 3, 4, 5]
assert len(sampler) == 5
assert list(sampler) == [0, 1, 3, 4, 5]
@@ -80,11 +82,11 @@ def test_shuffle():
)
dataset.set_transform(hf_transform_to_torch)
episode_data_index = calculate_episode_data_index(dataset)
sampler = EpisodeAwareSampler(episode_data_index, shuffle=False)
sampler = EpisodeAwareSampler(episode_data_index["from"], episode_data_index["to"], shuffle=False)
assert sampler.indices == [0, 1, 2, 3, 4, 5]
assert len(sampler) == 6
assert list(sampler) == [0, 1, 2, 3, 4, 5]
sampler = EpisodeAwareSampler(episode_data_index, shuffle=True)
sampler = EpisodeAwareSampler(episode_data_index["from"], episode_data_index["to"], shuffle=True)
assert sampler.indices == [0, 1, 2, 3, 4, 5]
assert len(sampler) == 6
assert set(sampler) == {0, 1, 2, 3, 4, 5}

View File

@@ -14,12 +14,20 @@
# See the License for the specific language governing permissions and
# limitations under the License.
import json
from copy import deepcopy
import torch
from datasets import Dataset
from huggingface_hub import DatasetCard
from lerobot.common.datasets.push_dataset_to_hub.utils import calculate_episode_data_index
from lerobot.common.datasets.utils import create_lerobot_dataset_card, hf_transform_to_torch
from lerobot.common.datasets.utils import (
create_lerobot_dataset_card,
flatten_dict,
hf_transform_to_torch,
unflatten_dict,
)
def test_default_parameters():
@@ -53,3 +61,26 @@ def test_calculate_episode_data_index():
episode_data_index = calculate_episode_data_index(dataset)
assert torch.equal(episode_data_index["from"], torch.tensor([0, 2, 3]))
assert torch.equal(episode_data_index["to"], torch.tensor([2, 3, 6]))
def test_flatten_unflatten_dict():
d = {
"obs": {
"min": 0,
"max": 1,
"mean": 2,
"std": 3,
},
"action": {
"min": 4,
"max": 5,
"mean": 6,
"std": 7,
},
}
original_d = deepcopy(d)
d = unflatten_dict(flatten_dict(d))
# test equality between nested dicts
assert json.dumps(original_d, sort_keys=True) == json.dumps(d, sort_keys=True), f"{original_d} != {d}"

View File

@@ -29,8 +29,8 @@ DUMMY_MOTOR_FEATURES = {
},
}
DUMMY_CAMERA_FEATURES = {
"laptop": {"shape": (480, 640, 3), "names": ["height", "width", "channels"], "info": None},
"phone": {"shape": (480, 640, 3), "names": ["height", "width", "channels"], "info": None},
"laptop": {"shape": (64, 96, 3), "names": ["height", "width", "channels"], "info": None},
"phone": {"shape": (64, 96, 3), "names": ["height", "width", "channels"], "info": None},
}
DEFAULT_FPS = 30
DUMMY_VIDEO_INFO = {

View File

@@ -12,6 +12,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.
import random
import shutil
from functools import partial
from pathlib import Path
from typing import Protocol
@@ -19,19 +20,25 @@ from unittest.mock import patch
import datasets
import numpy as np
import pandas as pd
import PIL.Image
import pytest
import torch
from datasets import Dataset
from lerobot.common.datasets.lerobot_dataset import CODEBASE_VERSION, LeRobotDataset, LeRobotDatasetMetadata
from lerobot.common.datasets.utils import (
DEFAULT_CHUNK_SIZE,
DEFAULT_DATA_FILE_SIZE_IN_MB,
DEFAULT_DATA_PATH,
DEFAULT_FEATURES,
DEFAULT_PARQUET_PATH,
DEFAULT_VIDEO_FILE_SIZE_IN_MB,
DEFAULT_VIDEO_PATH,
flatten_dict,
get_hf_features_from_features,
hf_transform_to_torch,
)
from lerobot.common.datasets.video_utils import encode_video_frames
from tests.fixtures.constants import (
DEFAULT_FPS,
DUMMY_CAMERA_FEATURES,
@@ -46,10 +53,10 @@ class LeRobotDatasetFactory(Protocol):
def __call__(self, *args, **kwargs) -> LeRobotDataset: ...
def get_task_index(task_dicts: dict, task: str) -> int:
tasks = {d["task_index"]: d["task"] for d in task_dicts.values()}
task_to_task_index = {task: task_idx for task_idx, task in tasks.items()}
return task_to_task_index[task]
def get_task_index(tasks: datasets.Dataset, task: str) -> int:
# TODO(rcadene): a bit complicated no? ^^
task_idx = tasks.loc[task].task_index.item()
return task_idx
@pytest.fixture(scope="session")
@@ -90,7 +97,7 @@ def features_factory():
def _create_features(
motor_features: dict = DUMMY_MOTOR_FEATURES,
camera_features: dict = DUMMY_CAMERA_FEATURES,
use_videos: bool = True,
use_videos: bool = False,
) -> dict:
if use_videos:
camera_ft = {
@@ -117,13 +124,14 @@ def info_factory(features_factory):
total_frames: int = 0,
total_tasks: int = 0,
total_videos: int = 0,
total_chunks: int = 0,
chunks_size: int = DEFAULT_CHUNK_SIZE,
data_path: str = DEFAULT_PARQUET_PATH,
data_files_size_in_mb: float = DEFAULT_DATA_FILE_SIZE_IN_MB,
video_files_size_in_mb: float = DEFAULT_VIDEO_FILE_SIZE_IN_MB,
data_path: str = DEFAULT_DATA_PATH,
video_path: str = DEFAULT_VIDEO_PATH,
motor_features: dict = DUMMY_MOTOR_FEATURES,
camera_features: dict = DUMMY_CAMERA_FEATURES,
use_videos: bool = True,
use_videos: bool = False,
) -> dict:
features = features_factory(motor_features, camera_features, use_videos)
return {
@@ -133,8 +141,9 @@ def info_factory(features_factory):
"total_frames": total_frames,
"total_tasks": total_tasks,
"total_videos": total_videos,
"total_chunks": total_chunks,
"chunks_size": chunks_size,
"data_files_size_in_mb": data_files_size_in_mb,
"video_files_size_in_mb": video_files_size_in_mb,
"fps": fps,
"splits": {},
"data_path": data_path,
@@ -175,41 +184,45 @@ def stats_factory():
return _create_stats
@pytest.fixture(scope="session")
def episodes_stats_factory(stats_factory):
def _create_episodes_stats(
features: dict[str],
total_episodes: int = 3,
) -> dict:
episodes_stats = {}
for episode_index in range(total_episodes):
episodes_stats[episode_index] = {
"episode_index": episode_index,
"stats": stats_factory(features),
}
return episodes_stats
# @pytest.fixture(scope="session")
# def episodes_stats_factory(stats_factory):
# def _create_episodes_stats(
# features: dict[str],
# total_episodes: int = 3,
# ) -> dict:
return _create_episodes_stats
# def _generator(total_episodes):
# for ep_idx in range(total_episodes):
# flat_ep_stats = flatten_dict(stats_factory(features))
# flat_ep_stats["episode_index"] = ep_idx
# yield flat_ep_stats
# # Simpler to rely on generator instead of from_dict
# return Dataset.from_generator(lambda: _generator(total_episodes))
# return _create_episodes_stats
@pytest.fixture(scope="session")
def tasks_factory():
def _create_tasks(total_tasks: int = 3) -> int:
tasks = {}
for task_index in range(total_tasks):
task_dict = {"task_index": task_index, "task": f"Perform action {task_index}."}
tasks[task_index] = task_dict
return tasks
def _create_tasks(total_tasks: int = 3) -> pd.DataFrame:
ids = list(range(total_tasks))
tasks = [f"Perform action {i}." for i in ids]
df = pd.DataFrame({"task_index": ids}, index=tasks)
return df
return _create_tasks
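Tasks are now a pandas DataFrame indexed by the task string, which is what the .loc lookup in the rewritten get_task_index above relies on. A self-contained sketch of that pattern (values made up):

import pandas as pd

# Same construction as the updated tasks_factory: task strings as the index,
# task_index as the only column.
ids = list(range(3))
tasks = pd.DataFrame({"task_index": ids}, index=[f"Perform action {i}." for i in ids])

# Reverse lookup, task string -> task_index, as in the new get_task_index.
assert tasks.loc["Perform action 1."].task_index.item() == 1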
@pytest.fixture(scope="session")
def episodes_factory(tasks_factory):
def episodes_factory(tasks_factory, stats_factory):
def _create_episodes(
features: dict[str],
fps: int = DEFAULT_FPS,
total_episodes: int = 3,
total_frames: int = 400,
tasks: dict | None = None,
video_keys: list[str] | None = None,
tasks: pd.DataFrame | None = None,
multi_task: bool = False,
):
if total_episodes <= 0 or total_frames <= 0:
@@ -217,66 +230,139 @@ def episodes_factory(tasks_factory):
if total_frames < total_episodes:
raise ValueError("total_length must be greater than or equal to num_episodes.")
if not tasks:
if tasks is None:
min_tasks = 2 if multi_task else 1
total_tasks = random.randint(min_tasks, total_episodes)
tasks = tasks_factory(total_tasks)
if total_episodes < len(tasks) and not multi_task:
num_tasks_available = len(tasks)
if total_episodes < num_tasks_available and not multi_task:
raise ValueError("The number of tasks should be less than the number of episodes.")
# Generate random lengths that sum up to total_length
lengths = np.random.multinomial(total_frames, [1 / total_episodes] * total_episodes).tolist()
tasks_list = [task_dict["task"] for task_dict in tasks.values()]
num_tasks_available = len(tasks_list)
# Create empty dictionaries with all keys
d = {
"episode_index": [],
"meta/episodes/chunk_index": [],
"meta/episodes/file_index": [],
"data/chunk_index": [],
"data/file_index": [],
"dataset_from_index": [],
"dataset_to_index": [],
"tasks": [],
"length": [],
}
if video_keys is not None:
for video_key in video_keys:
d[f"videos/{video_key}/chunk_index"] = []
d[f"videos/{video_key}/file_index"] = []
d[f"videos/{video_key}/from_timestamp"] = []
d[f"videos/{video_key}/to_timestamp"] = []
episodes = {}
remaining_tasks = tasks_list.copy()
for stats_key in flatten_dict({"stats": stats_factory(features)}):
d[stats_key] = []
num_frames = 0
remaining_tasks = list(tasks.index)
for ep_idx in range(total_episodes):
num_tasks_in_episode = random.randint(1, min(3, num_tasks_available)) if multi_task else 1
tasks_to_sample = remaining_tasks if remaining_tasks else tasks_list
tasks_to_sample = remaining_tasks if len(remaining_tasks) > 0 else list(tasks.index)
episode_tasks = random.sample(tasks_to_sample, min(num_tasks_in_episode, len(tasks_to_sample)))
if remaining_tasks:
for task in episode_tasks:
remaining_tasks.remove(task)
episodes[ep_idx] = {
"episode_index": ep_idx,
"tasks": episode_tasks,
"length": lengths[ep_idx],
}
d["episode_index"].append(ep_idx)
# TODO(rcadene): remove heuristic of only one file
d["meta/episodes/chunk_index"].append(0)
d["meta/episodes/file_index"].append(0)
d["data/chunk_index"].append(0)
d["data/file_index"].append(0)
d["dataset_from_index"].append(num_frames)
d["dataset_to_index"].append(num_frames + lengths[ep_idx])
d["tasks"].append(episode_tasks)
d["length"].append(lengths[ep_idx])
return episodes
if video_keys is not None:
for video_key in video_keys:
d[f"videos/{video_key}/chunk_index"].append(0)
d[f"videos/{video_key}/file_index"].append(0)
d[f"videos/{video_key}/from_timestamp"].append(num_frames / fps)
d[f"videos/{video_key}/to_timestamp"].append((num_frames + lengths[ep_idx]) / fps)
# Add stats columns like "stats/action/max"
for stats_key, stats in flatten_dict({"stats": stats_factory(features)}).items():
d[stats_key].append(stats)
num_frames += lengths[ep_idx]
return Dataset.from_dict(d)
return _create_episodes
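Episode metadata is likewise no longer a dict of per-episode dicts but a flat datasets.Dataset, one row per episode, with chunk/file bookkeeping, video timestamps and flattened stats stored as plain columns. A stripped-down sketch of that layout (made-up values; the meta/episodes/* and stats/* columns produced by the factory above are omitted for brevity):

from datasets import Dataset

# Two episodes of 3 and 2 frames, all data in chunk 0 / file 0, one video key "laptop".
episodes = Dataset.from_dict({
    "episode_index": [0, 1],
    "data/chunk_index": [0, 0],
    "data/file_index": [0, 0],
    "dataset_from_index": [0, 3],  # cumulative start frame of each episode
    "dataset_to_index": [3, 5],    # cumulative end frame (exclusive)
    "tasks": [["Perform action 0."], ["Perform action 1."]],
    "length": [3, 2],
    "videos/laptop/chunk_index": [0, 0],
    "videos/laptop/file_index": [0, 0],
    "videos/laptop/from_timestamp": [0 / 30, 3 / 30],  # start frame / fps
    "videos/laptop/to_timestamp": [3 / 30, 5 / 30],    # end frame / fps
})
assert episodes[1]["dataset_from_index"] == 3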
@pytest.fixture(scope="session")
def create_videos(info_factory, img_array_factory):
def _create_video_directory(
root: Path,
info: dict | None = None,
total_episodes: int = 3,
total_frames: int = 150,
total_tasks: int = 1,
):
if info is None:
info = info_factory(
total_episodes=total_episodes, total_frames=total_frames, total_tasks=total_tasks
)
video_feats = {key: feats for key, feats in info["features"].items() if feats["dtype"] == "video"}
for key, ft in video_feats.items():
# create and save images
tmp_dir = root / "tmp_images"
tmp_dir.mkdir(parents=True, exist_ok=True)
for frame_index in range(info["total_frames"]):
img = img_array_factory(height=ft["shape"][1], width=ft["shape"][0])
pil_img = PIL.Image.fromarray(img)
path = tmp_dir / f"frame-{frame_index:06d}.png"
pil_img.save(path)
video_path = root / DEFAULT_VIDEO_PATH.format(video_key=key, chunk_index=0, file_index=0)
encode_video_frames(tmp_dir, video_path, fps=ft["video.fps"])
shutil.rmtree(tmp_dir)
return _create_video_directory
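A condensed, standalone version of what this fixture does for one video key, assuming ffmpeg is available; the frame-naming pattern and the encode_video_frames call mirror the fixture above, while the output path and video key here are purely illustrative (the fixture derives them from DEFAULT_VIDEO_PATH and the dataset features):

import shutil
from pathlib import Path

import numpy as np
import PIL.Image

from lerobot.common.datasets.video_utils import encode_video_frames

def write_dummy_video(root: Path, num_frames: int = 30, fps: int = 30) -> Path:
    # Dump random frames as PNGs, then pack them into an mp4 via ffmpeg.
    tmp_dir = root / "tmp_images"
    tmp_dir.mkdir(parents=True, exist_ok=True)
    for frame_index in range(num_frames):
        img = np.random.randint(0, 256, size=(64, 96, 3), dtype=np.uint8)
        PIL.Image.fromarray(img).save(tmp_dir / f"frame-{frame_index:06d}.png")
    video_path = root / "videos" / "laptop" / "chunk-000" / "file-000.mp4"
    video_path.parent.mkdir(parents=True, exist_ok=True)
    encode_video_frames(tmp_dir, video_path, fps=fps)
    shutil.rmtree(tmp_dir)
    return video_path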
@pytest.fixture(scope="session")
def hf_dataset_factory(features_factory, tasks_factory, episodes_factory, img_array_factory):
def _create_hf_dataset(
features: dict | None = None,
tasks: list[dict] | None = None,
episodes: list[dict] | None = None,
tasks: pd.DataFrame | None = None,
episodes: datasets.Dataset | None = None,
fps: int = DEFAULT_FPS,
) -> datasets.Dataset:
if not tasks:
if tasks is None:
tasks = tasks_factory()
if not episodes:
episodes = episodes_factory()
if not features:
if features is None:
features = features_factory()
if episodes is None:
episodes = episodes_factory(features, fps)
timestamp_col = np.array([], dtype=np.float32)
frame_index_col = np.array([], dtype=np.int64)
episode_index_col = np.array([], dtype=np.int64)
task_index = np.array([], dtype=np.int64)
for ep_dict in episodes.values():
for ep_dict in episodes:
timestamp_col = np.concatenate((timestamp_col, np.arange(ep_dict["length"]) / fps))
frame_index_col = np.concatenate((frame_index_col, np.arange(ep_dict["length"], dtype=int)))
episode_index_col = np.concatenate(
(episode_index_col, np.full(ep_dict["length"], ep_dict["episode_index"], dtype=int))
)
# Slightly incorrect, but for simplicity, we assign to all frames the first task defined in the episode metadata.
# TODO(rcadene): assign the tasks of the episode per chunks of frames
ep_task_index = get_task_index(tasks, ep_dict["tasks"][0])
task_index = np.concatenate((task_index, np.full(ep_dict["length"], ep_task_index, dtype=int)))
@@ -286,7 +372,7 @@ def hf_dataset_factory(features_factory, tasks_factory, episodes_factory, img_ar
for key, ft in features.items():
if ft["dtype"] == "image":
robot_cols[key] = [
img_array_factory(height=ft["shapes"][1], width=ft["shapes"][0])
img_array_factory(height=ft["shape"][1], width=ft["shape"][0])
for _ in range(len(index_col))
]
elif ft["shape"][0] > 1 and ft["dtype"] != "video":
@@ -314,7 +400,6 @@ def hf_dataset_factory(features_factory, tasks_factory, episodes_factory, img_ar
def lerobot_dataset_metadata_factory(
info_factory,
stats_factory,
episodes_stats_factory,
tasks_factory,
episodes_factory,
mock_snapshot_download_factory,
@@ -324,29 +409,29 @@ def lerobot_dataset_metadata_factory(
repo_id: str = DUMMY_REPO_ID,
info: dict | None = None,
stats: dict | None = None,
episodes_stats: list[dict] | None = None,
tasks: list[dict] | None = None,
episodes: list[dict] | None = None,
tasks: pd.DataFrame | None = None,
episodes: datasets.Dataset | None = None,
) -> LeRobotDatasetMetadata:
if not info:
if info is None:
info = info_factory()
if not stats:
if stats is None:
stats = stats_factory(features=info["features"])
if not episodes_stats:
episodes_stats = episodes_stats_factory(
features=info["features"], total_episodes=info["total_episodes"]
)
if not tasks:
if tasks is None:
tasks = tasks_factory(total_tasks=info["total_tasks"])
if not episodes:
if episodes is None:
video_keys = [key for key, ft in info["features"].items() if ft["dtype"] == "video"]
episodes = episodes_factory(
total_episodes=info["total_episodes"], total_frames=info["total_frames"], tasks=tasks
features=info["features"],
fps=info["fps"],
total_episodes=info["total_episodes"],
total_frames=info["total_frames"],
video_keys=video_keys,
tasks=tasks,
)
mock_snapshot_download = mock_snapshot_download_factory(
info=info,
stats=stats,
episodes_stats=episodes_stats,
tasks=tasks,
episodes=episodes,
)
@@ -368,7 +453,6 @@ def lerobot_dataset_metadata_factory(
def lerobot_dataset_factory(
info_factory,
stats_factory,
episodes_stats_factory,
tasks_factory,
episodes_factory,
hf_dataset_factory,
@@ -384,38 +468,38 @@ def lerobot_dataset_factory(
multi_task: bool = False,
info: dict | None = None,
stats: dict | None = None,
episodes_stats: list[dict] | None = None,
tasks: list[dict] | None = None,
episode_dicts: list[dict] | None = None,
tasks: pd.DataFrame | None = None,
episodes_metadata: datasets.Dataset | None = None,
hf_dataset: datasets.Dataset | None = None,
**kwargs,
) -> LeRobotDataset:
if not info:
# Instantiate objects
if info is None:
info = info_factory(
total_episodes=total_episodes, total_frames=total_frames, total_tasks=total_tasks
)
if not stats:
if stats is None:
stats = stats_factory(features=info["features"])
if not episodes_stats:
episodes_stats = episodes_stats_factory(features=info["features"], total_episodes=total_episodes)
if not tasks:
if tasks is None:
tasks = tasks_factory(total_tasks=info["total_tasks"])
if not episode_dicts:
episode_dicts = episodes_factory(
if episodes_metadata is None:
episodes_metadata = episodes_factory(
features=info["features"],
fps=info["fps"],
total_episodes=info["total_episodes"],
total_frames=info["total_frames"],
tasks=tasks,
multi_task=multi_task,
)
if not hf_dataset:
hf_dataset = hf_dataset_factory(tasks=tasks, episodes=episode_dicts, fps=info["fps"])
hf_dataset = hf_dataset_factory(tasks=tasks, episodes=episodes_metadata, fps=info["fps"])
# Write data on disk
mock_snapshot_download = mock_snapshot_download_factory(
info=info,
stats=stats,
episodes_stats=episodes_stats,
tasks=tasks,
episodes=episode_dicts,
episodes=episodes_metadata,
hf_dataset=hf_dataset,
)
mock_metadata = lerobot_dataset_metadata_factory(
@@ -423,9 +507,8 @@ def lerobot_dataset_factory(
repo_id=repo_id,
info=info,
stats=stats,
episodes_stats=episodes_stats,
tasks=tasks,
episodes=episode_dicts,
episodes=episodes_metadata,
)
with (
patch("lerobot.common.datasets.lerobot_dataset.LeRobotDatasetMetadata") as mock_metadata_patch,

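Taken together, the factory changes mean a test can now thread a features dict, a tasks DataFrame and an episodes Dataset through the fixtures explicitly. A hypothetical test body showing that flow (fixture names as defined in this file, call signatures as shown in the hunks above):

def test_custom_synthetic_dataset(
    features_factory, tasks_factory, episodes_factory, hf_dataset_factory
):
    # Build every piece explicitly instead of relying on the factories' defaults.
    features = features_factory(use_videos=False)
    tasks = tasks_factory(total_tasks=2)
    episodes = episodes_factory(
        features=features,
        fps=30,
        total_episodes=3,
        total_frames=90,
        tasks=tasks,
    )
    hf_dataset = hf_dataset_factory(features=features, tasks=tasks, episodes=episodes, fps=30)
    # One row per frame, so the synthetic dataset covers exactly total_frames.
    assert len(hf_dataset) == 90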

@@ -11,92 +11,72 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import json
from pathlib import Path
import datasets
import jsonlines
import pandas as pd
import pyarrow.compute as pc
import pyarrow.parquet as pq
import pytest
from lerobot.common.datasets.utils import (
EPISODES_PATH,
EPISODES_STATS_PATH,
INFO_PATH,
STATS_PATH,
TASKS_PATH,
write_episodes,
write_hf_dataset,
write_info,
write_stats,
write_tasks,
)
@pytest.fixture(scope="session")
def info_path(info_factory):
def _create_info_json_file(dir: Path, info: dict | None = None) -> Path:
if not info:
def create_info(info_factory):
def _create_info(dir: Path, info: dict | None = None):
if info is None:
info = info_factory()
fpath = dir / INFO_PATH
fpath.parent.mkdir(parents=True, exist_ok=True)
with open(fpath, "w") as f:
json.dump(info, f, indent=4, ensure_ascii=False)
return fpath
write_info(info, dir)
return _create_info_json_file
return _create_info
@pytest.fixture(scope="session")
def stats_path(stats_factory):
def _create_stats_json_file(dir: Path, stats: dict | None = None) -> Path:
if not stats:
def create_stats(stats_factory):
def _create_stats(dir: Path, stats: dict | None = None):
if stats is None:
stats = stats_factory()
fpath = dir / STATS_PATH
fpath.parent.mkdir(parents=True, exist_ok=True)
with open(fpath, "w") as f:
json.dump(stats, f, indent=4, ensure_ascii=False)
return fpath
write_stats(stats, dir)
return _create_stats_json_file
return _create_stats
@pytest.fixture(scope="session")
def episodes_stats_path(episodes_stats_factory):
def _create_episodes_stats_jsonl_file(dir: Path, episodes_stats: list[dict] | None = None) -> Path:
if not episodes_stats:
episodes_stats = episodes_stats_factory()
fpath = dir / EPISODES_STATS_PATH
fpath.parent.mkdir(parents=True, exist_ok=True)
with jsonlines.open(fpath, "w") as writer:
writer.write_all(episodes_stats.values())
return fpath
return _create_episodes_stats_jsonl_file
@pytest.fixture(scope="session")
def tasks_path(tasks_factory):
def _create_tasks_jsonl_file(dir: Path, tasks: list | None = None) -> Path:
if not tasks:
def create_tasks(tasks_factory):
def _create_tasks(dir: Path, tasks: pd.DataFrame | None = None):
if tasks is None:
tasks = tasks_factory()
fpath = dir / TASKS_PATH
fpath.parent.mkdir(parents=True, exist_ok=True)
with jsonlines.open(fpath, "w") as writer:
writer.write_all(tasks.values())
return fpath
write_tasks(tasks, dir)
return _create_tasks_jsonl_file
return _create_tasks
@pytest.fixture(scope="session")
def episode_path(episodes_factory):
def _create_episodes_jsonl_file(dir: Path, episodes: list | None = None) -> Path:
if not episodes:
def create_episodes(episodes_factory):
def _create_episodes(dir: Path, episodes: datasets.Dataset | None = None):
if episodes is None:
# TODO(rcadene): add features, fps as arguments
episodes = episodes_factory()
fpath = dir / EPISODES_PATH
fpath.parent.mkdir(parents=True, exist_ok=True)
with jsonlines.open(fpath, "w") as writer:
writer.write_all(episodes.values())
return fpath
write_episodes(episodes, dir)
return _create_episodes_jsonl_file
return _create_episodes
@pytest.fixture(scope="session")
def create_hf_dataset(hf_dataset_factory):
def _create_hf_dataset(dir: Path, hf_dataset: datasets.Dataset | None = None):
if hf_dataset is None:
hf_dataset = hf_dataset_factory()
write_hf_dataset(hf_dataset, dir)
return _create_hf_dataset
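These fixtures now delegate to the library's write_* helpers instead of hand-writing json/jsonl files, so materializing a metadata layout in a test is just a sequence of calls against a temporary root. A short sketch (create_episodes is left out because, per the TODO above, its default call still needs features/fps wired through):

def test_materialize_metadata(tmp_path, create_info, create_stats, create_tasks):
    # Each call writes its artifact under tmp_path via the corresponding write_* helper.
    create_info(tmp_path)   # info.json under meta/
    create_stats(tmp_path)  # aggregated stats under meta/
    create_tasks(tmp_path)  # tasks table under meta/
    # INFO_PATH resolves to meta/info.json (see the hub.py changes below).
    assert (tmp_path / "meta" / "info.json").exists()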
@pytest.fixture(scope="session")
@@ -104,7 +84,8 @@ def single_episode_parquet_path(hf_dataset_factory, info_factory):
def _create_single_episode_parquet(
dir: Path, ep_idx: int = 0, hf_dataset: datasets.Dataset | None = None, info: dict | None = None
) -> Path:
if not info:
raise NotImplementedError()
if info is None:
info = info_factory()
if hf_dataset is None:
hf_dataset = hf_dataset_factory()
@@ -127,7 +108,8 @@ def multi_episode_parquet_path(hf_dataset_factory, info_factory):
def _create_multi_episode_parquet(
dir: Path, hf_dataset: datasets.Dataset | None = None, info: dict | None = None
) -> Path:
if not info:
raise NotImplementedError()
if info is None:
info = info_factory()
if hf_dataset is None:
hf_dataset = hf_dataset_factory()

tests/fixtures/hub.py

@@ -14,15 +14,17 @@
from pathlib import Path
import datasets
import pandas as pd
import pytest
from huggingface_hub.utils import filter_repo_objects
from lerobot.common.datasets.utils import (
EPISODES_PATH,
EPISODES_STATS_PATH,
DEFAULT_DATA_PATH,
DEFAULT_EPISODES_PATH,
DEFAULT_TASKS_PATH,
DEFAULT_VIDEO_PATH,
INFO_PATH,
STATS_PATH,
TASKS_PATH,
)
from tests.fixtures.constants import LEROBOT_TEST_DIR
@@ -30,17 +32,16 @@ from tests.fixtures.constants import LEROBOT_TEST_DIR
@pytest.fixture(scope="session")
def mock_snapshot_download_factory(
info_factory,
info_path,
create_info,
stats_factory,
stats_path,
episodes_stats_factory,
episodes_stats_path,
create_stats,
tasks_factory,
tasks_path,
create_tasks,
episodes_factory,
episode_path,
single_episode_parquet_path,
create_episodes,
hf_dataset_factory,
create_hf_dataset,
create_videos,
):
"""
This factory allows patching snapshot_download so that, when called, it creates the expected files rather
@@ -50,82 +51,91 @@ def mock_snapshot_download_factory(
def _mock_snapshot_download_func(
info: dict | None = None,
stats: dict | None = None,
episodes_stats: list[dict] | None = None,
tasks: list[dict] | None = None,
episodes: list[dict] | None = None,
tasks: pd.DataFrame | None = None,
episodes: datasets.Dataset | None = None,
hf_dataset: datasets.Dataset | None = None,
):
if not info:
if info is None:
info = info_factory()
if not stats:
if stats is None:
stats = stats_factory(features=info["features"])
if not episodes_stats:
episodes_stats = episodes_stats_factory(
features=info["features"], total_episodes=info["total_episodes"]
)
if not tasks:
if tasks is None:
tasks = tasks_factory(total_tasks=info["total_tasks"])
if not episodes:
if episodes is None:
episodes = episodes_factory(
total_episodes=info["total_episodes"], total_frames=info["total_frames"], tasks=tasks
features=info["features"],
fps=info["fps"],
total_episodes=info["total_episodes"],
total_frames=info["total_frames"],
tasks=tasks,
)
if not hf_dataset:
if hf_dataset is None:
hf_dataset = hf_dataset_factory(tasks=tasks, episodes=episodes, fps=info["fps"])
def _extract_episode_index_from_path(fpath: str) -> int:
path = Path(fpath)
if path.suffix == ".parquet" and path.stem.startswith("episode_"):
episode_index = int(path.stem[len("episode_") :]) # 'episode_000000' -> 0
return episode_index
else:
return None
def _mock_snapshot_download(
repo_id: str,
repo_id: str, # TODO(rcadene): repo_id should be used no?
local_dir: str | Path | None = None,
allow_patterns: str | list[str] | None = None,
ignore_patterns: str | list[str] | None = None,
*args,
**kwargs,
) -> str:
if not local_dir:
if local_dir is None:
local_dir = LEROBOT_TEST_DIR
# List all possible files
all_files = []
meta_files = [INFO_PATH, STATS_PATH, EPISODES_STATS_PATH, TASKS_PATH, EPISODES_PATH]
all_files.extend(meta_files)
all_files = [
INFO_PATH,
STATS_PATH,
# TODO(rcadene): remove naive chunk 0 file 0 ?
DEFAULT_TASKS_PATH.format(chunk_index=0, file_index=0),
DEFAULT_EPISODES_PATH.format(chunk_index=0, file_index=0),
DEFAULT_DATA_PATH.format(chunk_index=0, file_index=0),
]
data_files = []
for episode_dict in episodes.values():
ep_idx = episode_dict["episode_index"]
ep_chunk = ep_idx // info["chunks_size"]
data_path = info["data_path"].format(episode_chunk=ep_chunk, episode_index=ep_idx)
data_files.append(data_path)
all_files.extend(data_files)
video_keys = [key for key, feats in info["features"].items() if feats["dtype"] == "video"]
for key in video_keys:
all_files.append(DEFAULT_VIDEO_PATH.format(video_key=key, chunk_index=0, file_index=0))
allowed_files = filter_repo_objects(
all_files, allow_patterns=allow_patterns, ignore_patterns=ignore_patterns
)
# Create allowed files
request_info = False
request_tasks = False
request_episodes = False
request_stats = False
request_data = False
request_videos = False
for rel_path in allowed_files:
if rel_path.startswith("data/"):
episode_index = _extract_episode_index_from_path(rel_path)
if episode_index is not None:
_ = single_episode_parquet_path(local_dir, episode_index, hf_dataset, info)
if rel_path == INFO_PATH:
_ = info_path(local_dir, info)
elif rel_path == STATS_PATH:
_ = stats_path(local_dir, stats)
elif rel_path == EPISODES_STATS_PATH:
_ = episodes_stats_path(local_dir, episodes_stats)
elif rel_path == TASKS_PATH:
_ = tasks_path(local_dir, tasks)
elif rel_path == EPISODES_PATH:
_ = episode_path(local_dir, episodes)
if rel_path.startswith("meta/info.json"):
request_info = True
elif rel_path.startswith("meta/stats"):
request_stats = True
elif rel_path.startswith("meta/tasks"):
request_tasks = True
elif rel_path.startswith("meta/episodes"):
request_episodes = True
elif rel_path.startswith("data/"):
request_data = True
elif rel_path.startswith("videos/"):
request_videos = True
else:
pass
raise ValueError(f"{rel_path} not supported.")
if request_info:
create_info(local_dir, info)
if request_stats:
create_stats(local_dir, stats)
if request_tasks:
create_tasks(local_dir, tasks)
if request_episodes:
create_episodes(local_dir, episodes)
if request_data:
create_hf_dataset(local_dir, hf_dataset)
if request_videos:
create_videos(root=local_dir, info=info)
return str(local_dir)
return _mock_snapshot_download
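The mock no longer maps individual episode_XXXXXX.parquet files to episodes; it simply checks which artifact groups (info, stats, tasks, episodes, data, videos) survive the allow/ignore patterns and materializes each group once. The pattern filtering itself is plain huggingface_hub machinery and can be exercised in isolation (the file paths below are illustrative, not the exact DEFAULT_*_PATH templates):

from huggingface_hub.utils import filter_repo_objects

all_files = [
    "meta/info.json",
    "meta/stats.json",
    "meta/tasks/chunk-000/file-000.parquet",
    "meta/episodes/chunk-000/file-000.parquet",
    "data/chunk-000/file-000.parquet",
    "videos/laptop/chunk-000/file-000.mp4",
]

# A metadata-only request keeps only the meta/ entries.
allowed = list(filter_repo_objects(all_files, allow_patterns=["meta/*"]))
assert all(path.startswith("meta/") for path in allowed)

# No patterns at all keeps everything.
assert list(filter_repo_objects(all_files)) == all_files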


@@ -68,7 +68,11 @@ def dummy_dataset_metadata(lerobot_dataset_metadata_factory, info_factory, tmp_p
},
}
info = info_factory(
total_episodes=1, total_frames=1, camera_features=camera_features, motor_features=motor_features
total_episodes=1,
total_frames=1,
total_tasks=1,
camera_features=camera_features,
motor_features=motor_features,
)
ds_meta = lerobot_dataset_metadata_factory(root=tmp_path / "init", info=info)
return ds_meta
@@ -137,6 +141,7 @@ def test_policy(ds_repo_id, env_name, env_kwargs, policy_name, policy_kwargs):
Note: We test various combinations of policy and dataset. The combinations are by no means exhaustive,
and for now we add tests as we see fit.
"""
policy_kwargs["device"] = DEVICE
train_cfg = TrainPipelineConfig(
# TODO(rcadene, aliberts): remove dataset download