# LeRobotDataset v3.0

`LeRobotDataset v3.0` is a standardized format for robot learning data. It provides unified access to multi-modal time-series data, sensorimotor signals and multi-camera video, as well as rich metadata for indexing, search, and visualization on the Hugging Face Hub.

This guide will help you:

- Understand the v3.0 design and directory layout
- Record a dataset and push it to the Hub
- Load datasets for training with `LeRobotDataset`
- Stream datasets without downloading using `StreamingLeRobotDataset`
- Migrate existing `v2.1` datasets to `v3.0`

## What’s new in `v3`

- **File-based storage**: Many episodes per Parquet/MP4 file (v2 used one file per episode).
- **Relational metadata**: Episode boundaries and lookups are resolved through metadata, not filenames.
- **Hub-native streaming**: Consume datasets directly from the Hub with `StreamingLeRobotDataset`.
- **Lower file-system pressure**: Fewer, larger files ⇒ faster initialization and fewer issues at scale.
- **Unified organization**: Clean directory layout with consistent path templates across data and videos.

## Installation

`LeRobotDataset v3.0` will be included in `lerobot >= 0.4.0`. Until that stable release, you can use the main branch by following the [build from source instructions](./installation#from-source).

## Record a dataset

Run the command below to record a dataset with the SO-101 and push it to the Hub:

```bash
lerobot-record \
    --robot.type=so101_follower \
    --robot.port=/dev/tty.usbmodem585A0076841 \
    --robot.id=my_awesome_follower_arm \
    --robot.cameras="{ front: {type: opencv, index_or_path: 0, width: 1920, height: 1080, fps: 30}}" \
    --teleop.type=so101_leader \
    --teleop.port=/dev/tty.usbmodem58760431551 \
    --teleop.id=my_awesome_leader_arm \
    --display_data=true \
    --dataset.repo_id=${HF_USER}/record-test \
    --dataset.num_episodes=5 \
    --dataset.single_task="Grab the black cube"
```

See the [recording guide](./il_robots#record-a-dataset) for more details.

## Format design

A core v3 principle is **decoupling storage from the user API**: data is stored efficiently (few large files), while the public API exposes intuitive episode-level access.

`v3` has three pillars:

1. **Tabular data**: Low-dimensional, high-frequency signals (states, actions, timestamps) stored in **Apache Parquet**. Access is memory-mapped or streamed via the `datasets` stack.
2. **Visual data**: Camera frames concatenated and encoded into **MP4**. Frames from the same episode are grouped; videos are sharded per camera to keep file sizes practical.
3. **Metadata**: JSON/Parquet records describing the schema (feature names, dtypes, shapes), frame rates, normalization stats, and **episode segmentation** (start/end offsets into shared Parquet/MP4 files).

> To scale to millions of episodes, tabular rows and video frames from multiple episodes are **concatenated** into larger files. Episode-specific views are reconstructed **via metadata**, not file boundaries.
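To make this concrete, here is a minimal sketch that opens one tabular shard from a locally cached v3 dataset and shows rows from several episodes sharing the same file. The shard path is illustrative; real chunk/file names are resolved through the path templates in `meta/info.json`, and the `episode_index` column is assumed to follow the standard frame-level schema.

```python
import pandas as pd

# Illustrative path to one tabular shard of a locally cached v3
# dataset; actual names come from the meta/info.json path templates.
shard_path = "data/chunk-000/file-0000.parquet"

df = pd.read_parquet(shard_path)

# One file-based shard holds frames from many episodes; episode
# boundaries live in metadata, not in file names.
print(df["episode_index"].unique())
print(f"{len(df)} frames in this shard")
```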
*Figure: LeRobotDataset v3 diagram, from episode-based to file-based datasets.*
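You can see this organization on a real dataset by listing its repository files directly from the Hub. A short sketch using `huggingface_hub` (the repo id is the public example used in the loading section below):

```python
from huggingface_hub import list_repo_files

# List all files in a v3 dataset repository on the Hub;
# repo_type="dataset" is required for dataset repos.
files = list_repo_files("yaak-ai/L2D-v3", repo_type="dataset")

# Show a few entries from each top-level directory.
for prefix in ("meta/", "data/", "videos/"):
    for path in [f for f in files if f.startswith(prefix)][:3]:
        print(path)
```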
### Directory layout (simplified)

- **`meta/info.json`**: canonical schema (features, shapes/dtypes), FPS, codebase version, and **path templates** to locate data/video shards.
- **`meta/stats.json`**: global feature statistics (mean/std/min/max) used for normalization; exposed as `dataset.meta.stats`.
- **`meta/tasks.jsonl`**: natural-language task descriptions mapped to integer IDs for task-conditioned policies.
- **`meta/episodes/`**: per-episode records (lengths, tasks, offsets) stored as **chunked Parquet** for scalability.
- **`data/`**: frame-by-frame **Parquet** shards; each file typically contains **many episodes**.
- **`videos/`**: **MP4** shards per camera; each file typically contains **many episodes**.

## Load a dataset for training

`LeRobotDataset` returns Python dictionaries of PyTorch tensors and integrates with `torch.utils.data.DataLoader`. Here is a code example showing its use:

```python
import torch

from lerobot.datasets.lerobot_dataset import LeRobotDataset

repo_id = "yaak-ai/L2D-v3"

# 1) Load from the Hub (cached locally)
dataset = LeRobotDataset(repo_id)

# 2) Random access by index
sample = dataset[100]
print(sample)
# {
#     'observation.state': tensor([...]),
#     'action': tensor([...]),
#     'observation.images.front_left': tensor([C, H, W]),
#     'timestamp': tensor(1.234),
#     ...
# }

# 3) Temporal windows via delta_timestamps (seconds relative to t)
delta_timestamps = {
    # 0.2s before, 0.1s before, and the current frame
    "observation.images.front_left": [-0.2, -0.1, 0.0]
}
dataset = LeRobotDataset(repo_id, delta_timestamps=delta_timestamps)

# Accessing an index now returns a stack for the specified key(s)
sample = dataset[100]
print(sample["observation.images.front_left"].shape)  # [T, C, H, W], where T=3

# 4) Wrap with a DataLoader for training
batch_size = 16
data_loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size)

device = "cuda" if torch.cuda.is_available() else "cpu"

for batch in data_loader:
    observations = batch["observation.state"].to(device)
    actions = batch["action"].to(device)
    images = batch["observation.images.front_left"].to(device)
    # model.forward(batch)
```

## Stream a dataset (no downloads)

Use `StreamingLeRobotDataset` to iterate directly from the Hub without local copies. This lets you stream large datasets without downloading them to disk or loading them fully into memory, and is a key feature of the new dataset format.

```python
from lerobot.datasets.streaming_dataset import StreamingLeRobotDataset

repo_id = "yaak-ai/L2D-v3"

dataset = StreamingLeRobotDataset(repo_id)  # streams directly from the Hub
```
*Figure: StreamingLeRobotDataset, streaming directly from the Hub for on-the-fly training.*
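For a quick sanity check you can pull a few frames from the stream. A minimal sketch, assuming the streaming dataset is iterable (like a PyTorch `IterableDataset`) and yields the same per-frame dictionaries as `LeRobotDataset`:

```python
from itertools import islice

from lerobot.datasets.streaming_dataset import StreamingLeRobotDataset

dataset = StreamingLeRobotDataset("yaak-ai/L2D-v3")

# Grab the first few frames from the stream without materializing
# the dataset on disk; keys mirror the map-style example above.
for sample in islice(iter(dataset), 3):
    print(sample["timestamp"], sample["action"].shape)
```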
## Migrate `v2.1` → `v3.0`

A converter aggregates per-episode files into larger shards and writes episode offsets/metadata. Convert your dataset using the instructions below.

```bash
# Pre-release build with v3 support:
pip install "https://github.com/huggingface/lerobot/archive/33cad37054c2b594ceba57463e8f11ee374fa93c.zip"

# Convert an existing v2.1 dataset hosted on the Hub (use your own repo id):
python -m lerobot.datasets.v30.convert_dataset_v21_to_v30 --repo-id=${HF_USER}/your-dataset
```

**What it does**

- Aggregates Parquet files: `episode-0000.parquet`, `episode-0001.parquet`, … → **`file-0000.parquet`**, …
- Aggregates MP4 files: `episode-0000.mp4`, `episode-0001.mp4`, … → **`file-0000.mp4`**, …
- Updates `meta/episodes/*` (chunked Parquet) with per-episode lengths, tasks, and byte/frame offsets.
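After converting, it is worth sanity-checking that the aggregated shards load and that the episode metadata survived the migration. A minimal sketch (the repo id is a placeholder for your converted dataset; `num_episodes` and `num_frames` are assumed to match the current `LeRobotDataset` API):

```python
from lerobot.datasets.lerobot_dataset import LeRobotDataset

# Placeholder repo id; substitute your converted dataset.
dataset = LeRobotDataset("your-user/your-dataset")

# Episode segmentation is now resolved through metadata.
print(f"{dataset.num_episodes} episodes, {dataset.num_frames} frames")
print(list(dataset.meta.stats))  # per-feature normalization statistics

# Spot-check one frame.
print(dataset[0].keys())
```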