lerobot/docs/source/lerobot-dataset-v3.mdx

# LeRobotDataset v3.0

`LeRobotDataset v3.0` is a standardized format for robot learning data. It provides unified access to multi-modal time-series data, sensorimotor signals and multi‑camera video, as well as rich metadata for indexing, search, and visualization on the Hugging Face Hub.

This docs will guide you to:

- Understand the v3.0 design and directory layout
- Record a dataset and push it to the Hub
- Load datasets for training with `LeRobotDataset`
- Stream datasets without downloading using `StreamingLeRobotDataset`
- Migrate existing `v2.1` datasets to `v3.0`

## What’s new in `v3`

- **File-based storage**: Many episodes per Parquet/MP4 file (v2 used one file per episode).
- **Relational metadata**: Episode boundaries and lookups are resolved through metadata, not filenames.
- **Hub-native streaming**: Consume datasets directly from the Hub with `StreamingLeRobotDataset`.
- **Lower file-system pressure**: Fewer, larger files ⇒ faster initialization and fewer issues at scale.
- **Unified organization**: Clean directory layout with consistent path templates across data and videos.

## Installation

`LeRobotDataset v3.0` will be included in `lerobot >= 0.4.0`.

Until that stable release, you can use the main branch by following the [build from source instructions](./installation#from-source).

## Record a dataset

Run the command below to record a dataset with the SO-101 and push to the Hub:

```bash
lerobot-record \
  --robot.type=so101_follower \
  --robot.port=/dev/tty.usbmodem585A0076841 \
  --robot.id=my_awesome_follower_arm \
  --robot.cameras="{ front: {type: opencv, index_or_path: 0, width: 1920, height: 1080, fps: 30}}" \
  --teleop.type=so101_leader \
  --teleop.port=/dev/tty.usbmodem58760431551 \
  --teleop.id=my_awesome_leader_arm \
  --display_data=true \
  --dataset.repo_id=${HF_USER}/record-test \
  --dataset.num_episodes=5 \
  --dataset.single_task="Grab the black cube"
```

See the [recording guide](./il_robots#record-a-dataset) for more details.

## Format design

A core v3 principle is **decoupling storage from the user API**: data is stored efficiently (few large files), while the public API exposes intuitive episode-level access.

`v3` has three pillars:

1. **Tabular data**: Low‑dimensional, high‑frequency signals (states, actions, timestamps) stored in **Apache Parquet**. Access is memory‑mapped or streamed via the `datasets` stack.
2. **Visual data**: Camera frames concatenated and encoded into **MP4**. Frames from the same episode are grouped; videos are sharded per camera for practical sizes.
3. **Metadata**: JSON/Parquet records describing schema (feature names, dtypes, shapes), frame rates, normalization stats, and **episode segmentation** (start/end offsets into shared Parquet/MP4 files).

> To scale to millions of episodes, tabular rows and video frames from multiple episodes are **concatenated** into larger files. Episode‑specific views are reconstructed **via metadata**, not file boundaries.

<div style="display:flex; justify-content:center; gap:12px; flex-wrap:wrap;">
  <figure style="margin:0; text-align:center;">
    <img
      src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/lerobotdataset-v3/asset1datasetv3.png"
      alt="LeRobotDataset v3 diagram"
      width="220"
    />
    <figcaption style="font-size:0.9em; color:#666;">
      From episode‑based to file‑based datasets
    </figcaption>
  </figure>
</div>

### Directory layout (simplified)

- **`meta/info.json`**: canonical schema (features, shapes/dtypes), FPS, codebase version, and **path templates** to locate data/video shards.
- **`meta/stats.json`**: global feature statistics (mean/std/min/max) used for normalization; exposed as `dataset.meta.stats`.
- **`meta/tasks.jsonl`**: natural‑language task descriptions mapped to integer IDs for task‑conditioned policies.
- **`meta/episodes/`**: per‑episode records (lengths, tasks, offsets) stored as **chunked Parquet** for scalability.
- **`data/`**: frame‑by‑frame **Parquet** shards; each file typically contains **many episodes**.
- **`videos/`**: **MP4** shards per camera; each file typically contains **many episodes**.

## Load a dataset for training

`LeRobotDataset` returns Python dictionaries of PyTorch tensors and integrates with `torch.utils.data.DataLoader`. Here is a code example showing its use:

```python
import torch
from lerobot.datasets.lerobot_dataset import LeRobotDataset

repo_id = "yaak-ai/L2D-v3"

# 1) Load from the Hub (cached locally)
dataset = LeRobotDataset(repo_id)

# 2) Random access by index
sample = dataset[100]
print(sample)
# {
#   'observation.state': tensor([...]),
#   'action': tensor([...]),
#   'observation.images.front_left': tensor([C, H, W]),
#   'timestamp': tensor(1.234),
#   ...
# }

# 3) Temporal windows via delta_timestamps (seconds relative to t)
delta_timestamps = {
    "observation.images.front_left": [-0.2, -0.1, 0.0]  # 0.2s and 0.1s before current frame
}

dataset = LeRobotDataset(repo_id, delta_timestamps=delta_timestamps)

# Accessing an index now returns a stack for the specified key(s)
sample = dataset[100]
print(sample["observation.images.front_left"].shape)  # [T, C, H, W], where T=3

# 4) Wrap with a DataLoader for training
batch_size = 16
data_loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size)

device = "cuda" if torch.cuda.is_available() else "cpu"
for batch in data_loader:
    observations = batch["observation.state"].to(device)
    actions = batch["action"].to(device)
    images = batch["observation.images.front_left"].to(device)
    # model.forward(batch)
```

## Stream a dataset (no downloads)

Use `StreamingLeRobotDataset` to iterate directly from the Hub without local copies. This allows to stream large datasets without the need to downloading them onto disk or loading them onto memory, and is a key feature of the new dataset format.

```python
from lerobot.datasets.streaming_dataset import StreamingLeRobotDataset

repo_id = "yaak-ai/L2D-v3"
dataset = StreamingLeRobotDataset(repo_id)  # streams directly from the Hub
```

<div style="display:flex; justify-content:center; gap:12px; flex-wrap:wrap;">
  <figure style="margin:0; text-align:center;">
    <img
      src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/lerobotdataset-v3/streaming-lerobot.png"
      alt="StreamingLeRobotDataset"
      width="520"
    />
    <figcaption style="font-size:0.9em; color:#666;">
      Stream directly from the Hub for on‑the‑fly training.
    </figcaption>
  </figure>
</div>

## Migrate `v2.1` → `v3.0`

A converter aggregates per‑episode files into larger shards and writes episode offsets/metadata. Convert your dataset using the instructions below.

```bash
# Pre-release build with v3 support:
pip install "https://github.com/huggingface/lerobot/archive/33cad37054c2b594ceba57463e8f11ee374fa93c.zip"

# Convert an existing v2.1 dataset hosted on the Hub:
python -m lerobot.datasets.v30.convert_dataset_v21_to_v30 --repo-id=<HF_USER/DATASET_ID>
```

**What it does**

- Aggregates parquet files: `episode-0000.parquet`, `episode-0001.parquet`, … → **`file-0000.parquet`**, …
- Aggregates mp4 files: `episode-0000.mp4`, `episode-0001.mp4`, … → **`file-0000.mp4`**, …
- Updates `meta/episodes/*` (chunked Parquet) with per‑episode lengths, tasks, and byte/frame offsets.