lerobot/lerobot/common/datasets/push_dataset_to_hub/CODEBASE_VERSION.md
2024-07-22 20:08:59 +02:00

Using / Updating CODEBASE_VERSION (for maintainers)

Because the datasets we push to the hub are decoupled from the evolution of this repo, we use a CODEBASE_VERSION variable (defined in lerobot/common/datasets/lerobot_dataset.py) to ensure that the datasets remain compatible with our code.

For instance, lerobot/pusht has many versions to maintain backward compatibility between LeRobot codebase versions:

Starting with v1.6, every dataset pushed to the hub or saved locally also has this version number in its info.json metadata.
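To check which codebase version a locally saved dataset was written with, the metadata file can be read directly. The following is a minimal sketch only: the helper name `read_codebase_version` and the key name `codebase_version` are assumptions made for illustration and are not confirmed by this document, so adjust the key to match the actual contents of your info.json.

```python
import json
from pathlib import Path


def read_codebase_version(info_path, key="codebase_version"):
    """Return the version string stored in an info.json file, or None if absent.

    `key` is a hypothetical field name; check your info.json for the real one.
    """
    with Path(info_path).open(encoding="utf-8") as f:
        info = json.load(f)
    # dict.get returns None when the key is missing (e.g. a dataset older than v1.6)
    return info.get(key)
```

Returning None rather than raising lets a caller distinguish datasets saved before the version was recorded from ones that carry it.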

Uploading a new dataset

If you are pushing a new dataset, you don't need to worry about any of the instructions below, or about compatibility with previous codebase versions. The push_dataset_to_hub.py script will automatically tag your dataset with the current CODEBASE_VERSION.

Updating an existing dataset

If you want to update an existing dataset, you need to change the CODEBASE_VERSION in lerobot_dataset.py before running push_dataset_to_hub.py. This is especially important if you introduce a breaking change, intentionally or not (i.e. something that is not backward compatible, such as modifying the reward functions used or deleting some frames at the end of an episode). That way, people running a previous version of the codebase won't be affected by your change, and backward compatibility is maintained.

However, you will need to update ALL the other datasets so that each one has the new CODEBASE_VERSION as a branch in its Hugging Face dataset repository. Don't worry, there is an easy way that doesn't require running push_dataset_to_hub.py. You can simply "branch out" from the main branch of each HF dataset repo by running the script below, which is the equivalent of a git checkout -b (so no copy or upload is needed):

from huggingface_hub import HfApi

from lerobot import available_datasets
from lerobot.common.datasets.lerobot_dataset import CODEBASE_VERSION

api = HfApi()

for repo_id in available_datasets:
    # List the branches that already exist on the dataset repo
    refs = api.list_repo_refs(repo_id, repo_type="dataset")
    branches = [b.name for b in refs.branches]
    if CODEBASE_VERSION in branches:
        print(f"{repo_id} already @{CODEBASE_VERSION}, skipping.")
        continue

    # Create a branch named after the new version by branching out from "main",
    # which is expected to be the preceding version
    api.create_branch(repo_id, repo_type="dataset", branch=CODEBASE_VERSION, revision="main")
    print(f"{repo_id} successfully updated @{CODEBASE_VERSION}")
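Before creating any branches, it can be useful to preview which repositories would be affected. The sketch below is a read-only variant of the loop above; the helper name `repos_missing_branch` is hypothetical, and it assumes only that `api` exposes `list_repo_refs(repo_id, repo_type=...)` returning an object whose `branches` entries have a `name` attribute, as `huggingface_hub.HfApi` does.

```python
def repos_missing_branch(api, repo_ids, version):
    """Return the repo ids that have no branch named `version`.

    `api` is any object exposing `list_repo_refs(repo_id, repo_type=...)`,
    such as `huggingface_hub.HfApi()`. Nothing is created or modified.
    """
    missing = []
    for repo_id in repo_ids:
        refs = api.list_repo_refs(repo_id, repo_type="dataset")
        branch_names = {b.name for b in refs.branches}
        if version not in branch_names:
            missing.append(repo_id)
    return missing
```

Calling `repos_missing_branch(HfApi(), available_datasets, CODEBASE_VERSION)` should list the datasets that still need the new branch; an empty list means every dataset already has it. Passing `api` in as a parameter keeps the helper easy to exercise with a stub object, without network access.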