From 55e752f0c2e7fab0d989c5ff999fbe3b6d8872ab Mon Sep 17 00:00:00 2001 From: Jade Choghari Date: Tue, 16 Sep 2025 17:45:38 +0200 Subject: [PATCH] docs(dataset): add dataset v3 documentation (#1956) * add v3 doc * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix * update changes * iterate on review * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add changes * create dataset section * Update docs/source/lerobot-dataset-v3.mdx Signed-off-by: Francesco Capuano <74058581+fracapuano@users.noreply.github.com> * Update docs/source/lerobot-dataset-v3.mdx Signed-off-by: Francesco Capuano <74058581+fracapuano@users.noreply.github.com> * Update docs/source/lerobot-dataset-v3.mdx Signed-off-by: Francesco Capuano <74058581+fracapuano@users.noreply.github.com> --------- Signed-off-by: Francesco Capuano <74058581+fracapuano@users.noreply.github.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Michel Aractingi Co-authored-by: Francesco Capuano <74058581+fracapuano@users.noreply.github.com> --- docs/source/_toctree.yml | 6 +- docs/source/lerobot-dataset-v3.mdx | 169 +++++++++++++++++++++++++++++ 2 files changed, 174 insertions(+), 1 deletion(-) create mode 100644 docs/source/lerobot-dataset-v3.mdx diff --git a/docs/source/_toctree.yml b/docs/source/_toctree.yml index 5f5a509c7..9f5de8230 100644 --- a/docs/source/_toctree.yml +++ b/docs/source/_toctree.yml @@ -19,9 +19,13 @@ title: Train RL in Simulation - local: async title: Use Async Inference + title: "Tutorials" +- sections: + - local: lerobot-dataset-v3 + title: Using LeRobotDataset - local: porting_datasets_v3 title: Porting Large Datasets - title: "Tutorials" + title: "Datasets" - sections: - local: smolvla title: Finetune SmolVLA diff --git a/docs/source/lerobot-dataset-v3.mdx b/docs/source/lerobot-dataset-v3.mdx new file mode 100644 index 
000000000..4f33d9a25 --- /dev/null +++ b/docs/source/lerobot-dataset-v3.mdx @@ -0,0 +1,169 @@

# LeRobotDataset v3.0

`LeRobotDataset v3.0` is a standardized format for robot learning data. It provides unified access to multi-modal time-series data (sensorimotor signals and multi‑camera video), as well as rich metadata for indexing, search, and visualization on the Hugging Face Hub.

This guide shows you how to:

- Understand the v3.0 design and directory layout
- Record a dataset and push it to the Hub
- Load datasets for training with `LeRobotDataset`
- Stream datasets without downloading using `StreamingLeRobotDataset`
- Migrate existing `v2.1` datasets to `v3.0`

## What’s new in `v3`

- **File-based storage**: Many episodes per Parquet/MP4 file (v2 used one file per episode).
- **Relational metadata**: Episode boundaries and lookups are resolved through metadata, not filenames.
- **Hub-native streaming**: Consume datasets directly from the Hub with `StreamingLeRobotDataset`.
- **Lower file-system pressure**: Fewer, larger files ⇒ faster initialization and fewer issues at scale.
- **Unified organization**: Clean directory layout with consistent path templates across data and videos.

## Installation

`LeRobotDataset v3.0` will be included in `lerobot >= 0.4.0`.

Until that stable release, you can use the main branch by following the [build from source instructions](./installation#from-source).
## Record a dataset

Run the command below to record a dataset with the SO-101 and push it to the Hub:

```bash
lerobot-record \
  --robot.type=so101_follower \
  --robot.port=/dev/tty.usbmodem585A0076841 \
  --robot.id=my_awesome_follower_arm \
  --robot.cameras="{ front: {type: opencv, index_or_path: 0, width: 1920, height: 1080, fps: 30}}" \
  --teleop.type=so101_leader \
  --teleop.port=/dev/tty.usbmodem58760431551 \
  --teleop.id=my_awesome_leader_arm \
  --display_data=true \
  --dataset.repo_id=${HF_USER}/record-test \
  --dataset.num_episodes=5 \
  --dataset.single_task="Grab the black cube"
```

See the [recording guide](./il_robots#record-a-dataset) for more details.

## Format design

A core v3 principle is **decoupling storage from the user API**: data is stored efficiently (few large files), while the public API exposes intuitive episode-level access.

`v3` has three pillars:

1. **Tabular data**: Low‑dimensional, high‑frequency signals (states, actions, timestamps) stored in **Apache Parquet**. Access is memory‑mapped or streamed via the `datasets` stack.
2. **Visual data**: Camera frames concatenated and encoded into **MP4**. Frames from the same episode are grouped; videos are sharded per camera for practical sizes.
3. **Metadata**: JSON/Parquet records describing the schema (feature names, dtypes, shapes), frame rates, normalization stats, and **episode segmentation** (start/end offsets into shared Parquet/MP4 files).

> To scale to millions of episodes, tabular rows and video frames from multiple episodes are **concatenated** into larger files. Episode‑specific views are reconstructed **via metadata**, not file boundaries.
*Figure: LeRobotDataset v3 diagram, from episode‑based to file‑based datasets.*
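The reconstruction of episode views from metadata offsets can be illustrated with a small, self-contained sketch. The field names and in-memory lists here are hypothetical stand-ins for the Parquet shards and `meta/episodes/` records the real format uses:

```python
# Sketch: episodes share one storage file; per-episode views come from offsets.
# Rows from three episodes concatenated into a single "file" (here, a list).
file_rows = [f"ep{e}_frame{i}" for e, n in enumerate([4, 2, 3]) for i in range(n)]

# Metadata records where each episode starts/ends inside the shared file.
episodes_meta = [
    {"episode_index": 0, "from": 0, "to": 4},
    {"episode_index": 1, "from": 4, "to": 6},
    {"episode_index": 2, "from": 6, "to": 9},
]

def episode_view(episode_index):
    """Slice the shared file using metadata offsets, not file boundaries."""
    meta = episodes_meta[episode_index]
    return file_rows[meta["from"]:meta["to"]]

print(episode_view(1))  # ['ep1_frame0', 'ep1_frame1']
```

Because episode boundaries live in metadata, adding or reading an episode never requires opening one file per episode, which is what keeps initialization fast at scale.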
+ +### Directory layout (simplified) + +- **`meta/info.json`**: canonical schema (features, shapes/dtypes), FPS, codebase version, and **path templates** to locate data/video shards. +- **`meta/stats.json`**: global feature statistics (mean/std/min/max) used for normalization; exposed as `dataset.meta.stats`. +- **`meta/tasks.jsonl`**: natural‑language task descriptions mapped to integer IDs for task‑conditioned policies. +- **`meta/episodes/`**: per‑episode records (lengths, tasks, offsets) stored as **chunked Parquet** for scalability. +- **`data/`**: frame‑by‑frame **Parquet** shards; each file typically contains **many episodes**. +- **`videos/`**: **MP4** shards per camera; each file typically contains **many episodes**. + +## Load a dataset for training + +`LeRobotDataset` returns Python dictionaries of PyTorch tensors and integrates with `torch.utils.data.DataLoader`. Here is a code example showing its use: + +```python +import torch +from lerobot.datasets.lerobot_dataset import LeRobotDataset + +repo_id = "yaak-ai/L2D-v3" + +# 1) Load from the Hub (cached locally) +dataset = LeRobotDataset(repo_id) + +# 2) Random access by index +sample = dataset[100] +print(sample) +# { +# 'observation.state': tensor([...]), +# 'action': tensor([...]), +# 'observation.images.front_left': tensor([C, H, W]), +# 'timestamp': tensor(1.234), +# ... 
+# } + +# 3) Temporal windows via delta_timestamps (seconds relative to t) +delta_timestamps = { + "observation.images.front_left": [-0.2, -0.1, 0.0] # 0.2s and 0.1s before current frame +} + +dataset = LeRobotDataset(repo_id, delta_timestamps=delta_timestamps) + +# Accessing an index now returns a stack for the specified key(s) +sample = dataset[100] +print(sample["observation.images.front_left"].shape) # [T, C, H, W], where T=3 + +# 4) Wrap with a DataLoader for training +batch_size = 16 +data_loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size) + +device = "cuda" if torch.cuda.is_available() else "cpu" +for batch in data_loader: + observations = batch["observation.state"].to(device) + actions = batch["action"].to(device) + images = batch["observation.images.front_left"].to(device) + # model.forward(batch) +``` + +## Stream a dataset (no downloads) + +Use `StreamingLeRobotDataset` to iterate directly from the Hub without local copies. This allows to stream large datasets without the need to downloading them onto disk or loading them onto memory, and is a key feature of the new dataset format. + +```python +from lerobot.datasets.streaming_dataset import StreamingLeRobotDataset + +repo_id = "yaak-ai/L2D-v3" +dataset = StreamingLeRobotDataset(repo_id) # streams directly from the Hub +``` + +
*Figure: StreamingLeRobotDataset, streaming directly from the Hub for on‑the‑fly training.*
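The `delta_timestamps` windows shown earlier are resolved into frame indices using the dataset's fps. The sketch below illustrates the assumed mapping in simplified form; the real loader additionally validates the requested offsets against the stored timestamps within a tolerance:

```python
# Sketch: map delta_timestamps (seconds) to frame indices at a given fps.
fps = 30

def window_indices(query_index, deltas_s, num_frames):
    """Convert time offsets to frame indices around query_index, clamped to bounds."""
    indices = []
    for dt in deltas_s:
        idx = query_index + round(dt * fps)
        indices.append(min(max(idx, 0), num_frames - 1))
    return indices

# [-0.2, -0.1, 0.0] at 30 fps -> 6 and 3 frames back, plus the current frame.
print(window_indices(100, [-0.2, -0.1, 0.0], num_frames=500))  # [94, 97, 100]
```

Clamping at episode boundaries is one possible policy for offsets that fall outside the recording; it is shown here only to keep the sketch total.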
## Migrate `v2.1` → `v3.0`

A converter aggregates per‑episode files into larger shards and writes episode offsets/metadata. Convert your dataset using the commands below:

```bash
# Pre-release build with v3 support:
pip install "https://github.com/huggingface/lerobot/archive/33cad37054c2b594ceba57463e8f11ee374fa93c.zip"

# Convert an existing v2.1 dataset hosted on the Hub:
python -m lerobot.datasets.v30.convert_dataset_v21_to_v30 --repo-id=<user/dataset>
```

**What it does**

- Aggregates Parquet files: `episode-0000.parquet`, `episode-0001.parquet`, … → **`file-0000.parquet`**, …
- Aggregates MP4 files: `episode-0000.mp4`, `episode-0001.mp4`, … → **`file-0000.mp4`**, …
- Updates `meta/episodes/*` (chunked Parquet) with per‑episode lengths, tasks, and byte/frame offsets.
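The aggregation step can be sketched in pure Python. This is illustrative only: the real converter operates on Parquet and MP4 shards and writes richer metadata, whereas here per-episode "files" are plain lists and offsets are row counts:

```python
# Sketch: merge per-episode "files" into one shard while recording offsets.
episode_files = {
    "episode-0000": ["a0", "a1", "a2"],
    "episode-0001": ["b0", "b1"],
}

aggregated = []     # contents of the new file-0000
episodes_meta = []  # per-episode offsets, as written to meta/episodes/

for episode_index, name in enumerate(sorted(episode_files)):
    start = len(aggregated)
    aggregated.extend(episode_files[name])
    episodes_meta.append(
        {"episode_index": episode_index, "from": start, "to": len(aggregated)}
    )

print(len(aggregated))   # 5
print(episodes_meta[1])  # {'episode_index': 1, 'from': 3, 'to': 5}
```

After aggregation, the original episode views remain recoverable by slicing the shard with the recorded `from`/`to` offsets, which is why the conversion is lossless with respect to episode boundaries.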