# Using Dataset Tools

This guide covers the dataset tools available in LeRobot for modifying and editing existing datasets.

## Overview

LeRobot provides several utilities for manipulating datasets:

1. **Delete Episodes** - Remove specific episodes from a dataset
2. **Split Dataset** - Divide a dataset into multiple smaller datasets
3. **Merge Datasets** - Combine multiple datasets into one. The datasets must have identical features, and episodes are concatenated in the order specified in `repo_ids`
4. **Add Features** - Add new features to a dataset
5. **Remove Features** - Remove features from a dataset

The core implementation is in `lerobot.datasets.dataset_tools`.
An example script detailing how to use the tools API is available in `examples/dataset/use_dataset_tools.py`.

## Command-Line Tool: lerobot-edit-dataset

`lerobot-edit-dataset` is a command-line script for editing datasets. It can be used to delete episodes, split datasets, merge datasets, and remove features.

Run `lerobot-edit-dataset --help` for more information on the configuration of each operation.

### Usage Examples

#### Delete Episodes

Remove specific episodes from a dataset. This is useful for filtering out undesired data.

```bash
# Delete episodes 0, 2, and 5 (modifies original dataset)
lerobot-edit-dataset \
    --repo_id lerobot/pusht \
    --operation.type delete_episodes \
    --operation.episode_indices "[0, 2, 5]"

# Delete episodes and save to a new dataset (preserves original dataset)
lerobot-edit-dataset \
    --repo_id lerobot/pusht \
    --new_repo_id lerobot/pusht_after_deletion \
    --operation.type delete_episodes \
    --operation.episode_indices "[0, 2, 5]"
```

#### Split Dataset

Divide a dataset into multiple subsets.

```bash
# Split by fractions (e.g. 80% train, 10% test, 10% val)
lerobot-edit-dataset \
    --repo_id lerobot/pusht \
    --operation.type split \
    --operation.splits '{"train": 0.8, "test": 0.1, "val": 0.1}'

# Split by specific episode indices
lerobot-edit-dataset \
    --repo_id lerobot/pusht \
    --operation.type split \
    --operation.splits '{"task1": [0, 1, 2, 3], "task2": [4, 5]}'
```

Split names are not constrained; you can choose any names. Each resulting dataset is saved under the original repo id with the split name appended, e.g. `lerobot/pusht_train`, `lerobot/pusht_task1`, `lerobot/pusht_task2`.
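
To make the fraction-based split and the naming rule concrete, here is a minimal sketch in plain Python. It is not the LeRobot implementation: `plan_fraction_split` is a hypothetical helper, and the contiguous in-order assignment, the floor rounding, and the two validation checks (fractions summing to at most 1.0, no empty split) are assumptions made for illustration.

```python
def plan_fraction_split(repo_id, num_episodes, fractions):
    """Map each output repo id to the episode indices it would receive.

    Hypothetical helper: assumes episodes are assigned contiguously, in the
    order the splits are listed, with each split size rounded down.
    """
    total = sum(fractions.values())
    if total > 1.0 + 1e-9:  # small tolerance for floating-point sums
        raise ValueError(f"split fractions sum to {total:.2f}, which exceeds 1.0")

    plan = {}
    start = 0
    for name, fraction in fractions.items():
        # Floor with a small tolerance so 10 * 0.1 counts as 1 episode.
        count = int(num_episodes * fraction + 1e-9)
        if count == 0:
            raise ValueError(f"split '{name}' would contain zero episodes")
        # Output name follows the rule above: repo id + "_" + split name.
        plan[f"{repo_id}_{name}"] = list(range(start, start + count))
        start += count
    return plan


plan = plan_fraction_split(
    "lerobot/pusht", 10, {"train": 0.8, "test": 0.1, "val": 0.1}
)
for output_repo_id, episodes in plan.items():
    print(output_repo_id, episodes)
```

With 10 episodes, this sketch assigns episodes 0 to 7 to `lerobot/pusht_train`, episode 8 to `lerobot/pusht_test`, and episode 9 to `lerobot/pusht_val`. The actual tool may select episodes differently, so treat this only as a way to reason about split sizes before running the command.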

#### Merge Datasets

Combine multiple datasets into a single dataset.

```bash
# Merge train and validation splits back into one dataset
lerobot-edit-dataset \
    --repo_id lerobot/pusht_merged \
    --operation.type merge \
    --operation.repo_ids "['lerobot/pusht_train', 'lerobot/pusht_val']"
```
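
Because merging requires the datasets to have identical features, it can help to compare feature specifications before running the command. The sketch below shows one way to do that in plain Python. The helper `find_feature_mismatches` and the dictionary layout (feature name mapped to `dtype` and `shape`) are assumptions for illustration, not the format LeRobot uses internally.

```python
def find_feature_mismatches(features_by_repo):
    """Return the repo ids whose features differ from the first dataset's.

    Hypothetical helper: `features_by_repo` maps a repo id to a dict that
    describes its features. Insertion order decides which one is the reference.
    """
    repo_ids = list(features_by_repo)
    reference = features_by_repo[repo_ids[0]]
    return [rid for rid in repo_ids[1:] if features_by_repo[rid] != reference]


# Illustrative feature specs only; real datasets carry more metadata.
features_by_repo = {
    "lerobot/pusht_train": {
        "action": {"dtype": "float32", "shape": (2,)},
        "observation.state": {"dtype": "float32", "shape": (2,)},
    },
    "lerobot/pusht_val": {
        "action": {"dtype": "float32", "shape": (2,)},
        "observation.state": {"dtype": "float32", "shape": (2,)},
    },
}

mismatched = find_feature_mismatches(features_by_repo)
if mismatched:
    print("Cannot merge, features differ in:", mismatched)
else:
    print("Features match, merge order:", list(features_by_repo))
```

The printed merge order mirrors the order of `repo_ids`, which is also the order in which episodes are concatenated.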

#### Remove Features

Remove features from a dataset.

```bash
# Remove a camera feature
lerobot-edit-dataset \
    --repo_id lerobot/pusht \
    --operation.type remove_feature \
    --operation.feature_names "['observation.images.top']"
```

### Push to Hub

Add the `--push_to_hub` flag to any command to automatically upload the resulting dataset to the Hugging Face Hub:

```bash
lerobot-edit-dataset \
    --repo_id lerobot/pusht \
    --new_repo_id lerobot/pusht_after_deletion \
    --operation.type delete_episodes \
    --operation.episode_indices "[0, 2, 5]" \
    --push_to_hub
```
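
When these commands are launched from a Python script, building the argument list programmatically avoids shell-quoting mistakes in the list-valued options. The following is a minimal sketch that uses only the flags shown in this guide; `build_delete_command` is a hypothetical helper, not part of LeRobot.

```python
import json
import shlex


def build_delete_command(repo_id, episode_indices, new_repo_id=None, push_to_hub=False):
    """Assemble the argument list for a delete_episodes run (hypothetical helper)."""
    argv = ["lerobot-edit-dataset", "--repo_id", repo_id]
    if new_repo_id is not None:
        # Saving under a new repo id preserves the original dataset.
        argv += ["--new_repo_id", new_repo_id]
    argv += [
        "--operation.type", "delete_episodes",
        # json.dumps renders [0, 2, 5] as the string "[0, 2, 5]".
        "--operation.episode_indices", json.dumps(list(episode_indices)),
    ]
    if push_to_hub:
        argv.append("--push_to_hub")
    return argv


command = build_delete_command(
    "lerobot/pusht",
    [0, 2, 5],
    new_repo_id="lerobot/pusht_after_deletion",
    push_to_hub=True,
)
# shlex.join produces a copy-pasteable, correctly quoted shell command.
print(shlex.join(command))
```

The list can be passed directly to `subprocess.run(command, check=True)`, which skips the shell and therefore needs no extra quoting.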

Adding features is also supported by the dataset tools, but it is not yet covered by `lerobot-edit-dataset`.