multi-node openpi commit

This commit is contained in:
Leon998
2026-03-17 23:05:23 +08:00
parent 28833f0c0f
commit 7411e0e004
156 changed files with 33951 additions and 1 deletions


@@ -0,0 +1,84 @@
# DROID Policies in openpi
We offer instructions for:
- [Running inference for our best $\pi_{0.5}$-DROID policy](./README.md#running-droid-inference)
- [Running inference for other pre-trained DROID policies ($\pi_0$, $\pi_0$-FAST, ...)](./README.md#running-other-policies)
- [Pre-training *generalist* policies on the *full* DROID dataset](./README_train.md#training-on-droid)
- [Fine-tuning expert $\pi_{0.5}$ on your custom DROID dataset](./README_train.md#fine-tuning-on-custom-droid-datasets)
## Running DROID Inference
This example shows how to run the fine-tuned $\pi_{0.5}$-DROID model on the [DROID robot platform](https://github.com/droid-dataset/droid). Based on the [public RoboArena benchmark](https://robo-arena.github.io/leaderboard), this is currently our strongest generalist DROID policy.
### Step 1: Start a policy server
Since the DROID control laptop does not have a powerful GPU, we will start a remote policy server on a different machine with a more powerful GPU and then query it from the DROID control laptop during inference.
1. On a machine with a powerful GPU (e.g., an NVIDIA RTX 4090), clone and install the `openpi` repository following the instructions in the [README](https://github.com/Physical-Intelligence/openpi).
2. Start the OpenPI server via the following command:
```bash
uv run scripts/serve_policy.py policy:checkpoint --policy.config=pi05_droid --policy.dir=gs://openpi-assets/checkpoints/pi05_droid
```
You can also run the equivalent command below:
```bash
uv run scripts/serve_policy.py --env=DROID
```
### Step 2: Run the DROID robot
1. Make sure you have the most recent version of the DROID package installed on both the DROID control laptop and the NUC.
2. On the control laptop, activate your DROID conda environment.
3. Clone the openpi repo and install the openpi client, which we will use to connect to the policy server (this has very few dependencies and should be very fast to install): with the DROID conda environment activated, run `cd $OPENPI_ROOT/packages/openpi-client && pip install -e .`.
4. Install `tyro`, which we will use for command line parsing: `pip install tyro`.
5. Copy the `main.py` file from this directory to the `$DROID_ROOT/scripts` directory.
6. Replace the camera IDs in the `main.py` file with the IDs of your cameras (you can find the camera IDs by running `ZED_Explorer` in the command line, which will open a tool that shows you all connected cameras and their IDs -- you can also use it to make sure that the cameras are well-positioned to see the scene you want the robot to interact with).
7. Run the `main.py` file, making sure to point the remote host and port arguments to the policy server. (To verify that the server machine is reachable from the DROID laptop, you can run `ping <server_ip>` from the laptop.) Also specify which external camera to feed to the policy (we only input one external camera), choosing from `["left", "right"]`:
```bash
python3 scripts/main.py --remote_host=<server_ip> --remote_port=<server_port> --external_camera="left"
```
The script will ask you to enter a free-form language instruction for the robot to follow. Make sure to point the cameras at the scene you want the robot to interact with. You _do not_ need to carefully control camera angle, object positions, etc. The policy is fairly robust in our experience. Happy prompting!
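For intuition, the open-loop chunking that `main.py` performs (execute the first few actions of each predicted chunk, then re-query the server) can be sketched as below. This is a simplified, hypothetical outline, not the actual script: the real `main.py` also handles camera capture, image resizing, and control-frequency timing, and the `WebsocketClientPolicy` usage shown in the comment assumes the `openpi-client` package installed above.

```python
import numpy as np

# 8 actions per chunk = ~0.5 s of execution at DROID's 15 Hz control frequency.
OPEN_LOOP_HORIZON = 8


def actions_to_execute(action_chunk: np.ndarray, horizon: int = OPEN_LOOP_HORIZON) -> np.ndarray:
    """Slice off the prefix of a predicted action chunk that is executed open-loop
    before the policy server is queried again."""
    return action_chunk[:horizon]


# Hypothetical query loop (requires a running policy server, see Step 1):
#
#   from openpi_client import websocket_client_policy
#   client = websocket_client_policy.WebsocketClientPolicy(host="<server_ip>", port=8000)
#   result = client.infer(observation)  # observation: dict of camera images + robot state
#   for action in actions_to_execute(np.asarray(result["actions"])):
#       env.step(action)  # execute at 15 Hz, then re-query for the next chunk
```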
## Troubleshooting
| Issue | Solution |
|-------|----------|
| Cannot reach policy server | Make sure the server is running and the IP and port are correct. You can check that the server machine is reachable by running `ping <server_ip>` from the DROID laptop. |
| Cannot find cameras | Make sure the camera IDs are correct and that the cameras are connected to the DROID laptop. Sometimes replugging the cameras can help. You can check all connected cameras by running `ZED_Explorer` in the command line. |
| Policy inference is slow / inconsistent | Try using a wired internet connection for the DROID laptop to reduce latency (0.5 - 1 sec latency per chunk is normal). |
| Policy does not perform the task well | In our experiments, the policy could perform simple tabletop manipulation tasks (pick-and-place) across a wide range of environments, camera positions, and lighting conditions. If the policy does not perform the task well, try modifying the scene or object placement to make the task easier. Also make sure that the external camera view you are passing to the policy can see all relevant objects in the scene (the policy is only conditioned on a single external camera plus the wrist camera); you can use `ZED_Explorer` to check the camera view. Finally, the policy is far from perfect and will fail on more complex manipulation tasks, but it usually makes a decent effort. :) |
## Running Other Policies
We provide configs for running the baseline DROID policies from the [RoboArena](https://robo-arena.github.io/) paper. Simply run the commands below to start inference servers for the respective policies. Then follow the instructions above to run evaluation on the DROID robot.
```bash
# Trained from pi0-FAST, using FAST tokenizer
uv run scripts/serve_policy.py policy:checkpoint --policy.config=pi0_fast_droid --policy.dir=gs://openpi-assets/checkpoints/pi0_fast_droid
# Trained from pi0, using flow matching
uv run scripts/serve_policy.py policy:checkpoint --policy.config=pi0_droid --policy.dir=gs://openpi-assets/checkpoints/pi0_droid
# Trained from PaliGemma, using RT-2 / OpenVLA style binning tokenizer.
uv run scripts/serve_policy.py policy:checkpoint --policy.config=paligemma_binning_droid --policy.dir=gs://openpi-assets/checkpoints/roboarena/paligemma_binning_droid
# Trained from PaliGemma, using FAST tokenizer (using universal FAST+ tokenizer).
uv run scripts/serve_policy.py policy:checkpoint --policy.config=paligemma_fast_droid --policy.dir=gs://openpi-assets/checkpoints/roboarena/paligemma_fast_droid
# Trained from PaliGemma, using FAST tokenizer (tokenizer trained on DROID dataset).
uv run scripts/serve_policy.py policy:checkpoint --policy.config=paligemma_fast_specialist_droid --policy.dir=gs://openpi-assets/checkpoints/roboarena/paligemma_fast_specialist_droid
# Trained from PaliGemma, using FSQ tokenizer.
uv run scripts/serve_policy.py policy:checkpoint --policy.config=paligemma_vq_droid --policy.dir=gs://openpi-assets/checkpoints/roboarena/paligemma_vq_droid
# pi0-style diffusion / flow VLA, trained on DROID from PaliGemma.
uv run scripts/serve_policy.py policy:checkpoint --policy.config=paligemma_diffusion_droid --policy.dir=gs://openpi-assets/checkpoints/roboarena/paligemma_diffusion_droid
```
You can find the inference configs in [roboarena_config.py](../../src/openpi/training/misc/roboarena_config.py).


@@ -0,0 +1,106 @@
# Training on DROID
Here we describe how to fine-tune the pi0.5 model on the *full* DROID dataset. This is an approximate open-source reproduction of the pi05-DROID training pipeline (with small differences in data loading and the action space used). For a tutorial on fine-tuning your model with a smaller, custom dataset collected on the DROID platform, see below.
In contrast to the rest of openpi, which uses LeRobot for data loading, full DROID training uses RLDS as the data format (at the moment, LeRobot isn't scalable enough for larger datasets like DROID, though the LeRobot team is working on improving this). Below, we provide instructions for updating your openpi environment for RLDS data loading and for downloading the DROID dataset.
## Install
We need a few additional dependencies for RLDS data loading. Run:
```bash
uv sync --group rlds
```
## Download DROID dataset
You can download the DROID dataset with the following command (after installing the `gsutil` google cloud CLI):
```bash
gsutil -m cp -r gs://gresearch/robotics/droid/1.0.1 <your_download_path>/droid/1.0.1
```
Note that downloading version 1.0.1 is important (not v1.0.0): it contains the complete set of language annotations (~75k episodes) while v1.0.0 only has annotations for 30k episodes. If for some reason you would like to use another version, modify the line `version="1.0.1"` in the `DroidRldsDataset` object [here](src/openpi/training/droid_rlds_dataset.py).
You will need 1.8TB of disk storage to download the DROID RLDS dataset.
## Run
First, change the `rlds_data_dir` path in your `TrainConfig` to the directory that you downloaded the `droid` dataset into (see [src/openpi/training/config.py](src/openpi/training/config.py)).
Then, compute normalization statistics (this will take ~10 minutes):
```bash
uv run --group rlds scripts/compute_norm_stats.py --config-name pi05_full_droid_finetune --max-frames 10_000_000
```
Run training:
```bash
XLA_PYTHON_CLIENT_MEM_FRACTION=0.9 uv run --group rlds scripts/train.py pi05_full_droid_finetune --exp-name=my_experiment --overwrite
```
**Note**: The original pi0.5-DROID model was trained with joint velocity actions.
Joint velocity actions are not compatible with simulated evaluation environments (much harder to simulate).
Thus, we do not recommend training with joint velocity actions and instead use joint position actions here.
## Compute Requirements
Our DROID training config requires approximately 2 days on 8x H100 GPUs for convergence (100k iterations, batch size 256, approx. 1 epoch).
If you start from PaliGemma instead of pi0 initialization, plan for ~5 days on 8x H100s (240k iterations, i.e., 3 epochs).
We have experimented with LoRA for cheaper finetuning, but haven't found the policies to perform well so far.
## Data Filtering
Like any diverse real-robot dataset, the DROID dataset isn't perfectly "clean", and we have found data filtering to significantly improve policy performance. Concretely, the DROID dataset contains many *idle* timesteps in which the robot does not move (largely an artifact of the VR teleoperation interface used during data collection). Appropriate filtering of these idle transitions can improve policy performance.
By default, our openpi training recipe implements the same idle filter used to train all pi-DROID models. We implement it by pre-computing which dataset indices to sample during training. You can check [compute_droid_nonidle_ranges.py](examples/droid/compute_droid_nonidle_ranges.py) for how we compute these indices. Roughly speaking, we filter out any time steps for which the next chunk of actions would be largely idle. During training, our code automatically pulls our pre-computed list of indices from cloud storage and applies it. If you want to modify the idle filter or create your own custom sampling logic, you can modify our script to generate a new index list and provide it via the `filter_dict_path="<path_to_filter_dict>"` argument in [src/openpi/training/config.py](src/openpi/training/config.py).
**Note**: our list of filtering indices is only valid for the `droid/1.0.1` dataset mentioned in the download section above, and will not provide valid filtering for any other version of the DROID dataset, so make sure you download the dataset above! If you have a custom DROID version, you can rerun the [compute_droid_nonidle_ranges.py](examples/droid/compute_droid_nonidle_ranges.py) script to generate a new list of sampling indices.
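To make the filtering logic concrete, here is a small self-contained sketch of the same idea (toy thresholds and a simplified helper of our own naming; the production values and exact bookkeeping live in `compute_droid_nonidle_ranges.py`):

```python
import numpy as np


def keep_ranges(joint_velocities: np.ndarray,
                min_idle_len: int = 7,
                min_non_idle_len: int = 16,
                trim_tail: int = 10) -> list[tuple[int, int]]:
    """Toy idle filter: return (start, end) frame ranges to sample during training.

    A frame is "idle" if the joint velocities barely change from the previous frame.
    Idle runs of >= min_idle_len frames are dropped; surviving non-idle runs shorter
    than min_non_idle_len are dropped; trim_tail frames are cut from each range's end
    (their action chunks would still contain many idle actions).
    """
    deltas = np.abs(np.diff(joint_velocities, axis=0))
    is_idle = np.concatenate([[False], np.all(deltas < 1e-3, axis=1)])

    def runs(mask: np.ndarray):
        # Pad with False so runs touching the boundaries still produce transitions.
        diff = np.diff(np.concatenate([[False], mask, [False]]).astype(int))
        return np.flatnonzero(diff == 1), np.flatnonzero(diff == -1)

    keep = np.ones(len(joint_velocities), dtype=bool)
    for s, e in zip(*runs(is_idle)):
        if e - s >= min_idle_len:
            keep[s:e] = False  # drop long idle segments

    return [(int(s), int(e) - trim_tail)
            for s, e in zip(*runs(keep)) if e - s >= min_non_idle_len]
```

For example, an episode with 30 moving frames, 10 idle frames, and 20 more moving frames yields the ranges `[(0, 20), (40, 50)]` after tail-trimming.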
## RoboArena
Consider submitting your DROID policies to the [RoboArena benchmark](https://robo-arena.github.io/), which allows you to evaluate your policies on diverse tasks & scenes, **in the real world**! :)
If you have questions about RoboArena, please email [karl.pertsch@gmail.com](mailto:karl.pertsch@gmail.com).
# Fine-Tuning on Custom DROID Datasets
Here we describe how to fine-tune a model on a custom (smaller) dataset collected on the DROID platform. As with other datasets, we will first convert the custom DROID dataset to LeRobot format and then fine-tune a model (pi05-droid) on it.
Note: We use LeRobot here since we assume the custom DROID fine-tuning dataset is relatively small (tens of hours or less). For larger datasets (like the full DROID dataset) we recommend using RLDS for its better efficiency (see the example above).
## Step 1: Converting your custom DROID dataset to LeRobot
We will use a small subset of the real DROID dataset for this example. This is a subset of just 30 demonstrations -- we assume that you will use your own dataset instead, but here is the command to download our subset (1.6GB):
```bash
gsutil -m cp -r gs://gresearch/robotics/droid_raw/1.0.1/IRIS/success/2023-12-04 <your_target_path>
```
We will also download the language annotations for the DROID dataset so we can pair our demonstrations with language instructions. Again, for your own data you can manually enter your language instructions and don't need to download our annotations. To download the DROID language annotations (12MB), run:
```bash
gsutil -m cp -r gs://gresearch/robotics/droid_raw/1.0.1/aggregated-annotations-030724.json <your_target_dir>
```
For your own dataset, make sure that each episode's directory contains a folder called `recordings/MP4` -- if not, you need to first run the MP4 video extraction (from SVO files) using the script [here](https://github.com/droid-dataset/droid/blob/main/scripts/convert/svo_to_mp4.py).
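As a quick sanity check before converting, you can verify that each episode directory has both a `trajectory.h5` file and a `recordings/MP4` folder. The helper below is purely illustrative (the function name is ours; it just mirrors the layout the conversion script expects):

```python
from pathlib import Path


def find_convertible_episodes(data_dir: str) -> list[Path]:
    """Return episode directories containing both trajectory.h5 and recordings/MP4.

    Episodes without recordings/MP4 still need SVO -> MP4 extraction (svo_to_mp4.py).
    """
    episodes = []
    for traj in sorted(Path(data_dir).glob("**/trajectory.h5")):
        if (traj.parent / "recordings" / "MP4").is_dir():
            episodes.append(traj.parent)
        else:
            print(f"Missing recordings/MP4 in {traj.parent} -- run svo_to_mp4.py first")
    return episodes
```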
Now, we will use the `convert_droid_data_to_lerobot.py` script to create a LeRobot version of this dataset (takes <5 min for the 30 demonstrations):
```bash
uv run examples/droid/convert_droid_data_to_lerobot.py --data_dir <your_target_path>
```
## Step 2: Run fine-tuning with your custom dataset
Now we can run fine-tuning with our converted custom dataset. We provide an example config for fine-tuning `pi05_droid` on the custom dataset we created.
You can easily modify the config in `config.py` (search for `pi05_droid_finetune`) to use other base models or your own custom DROID dataset.
To launch training:
```bash
uv run scripts/train.py pi05_droid_finetune --exp-name=my_experiment --overwrite
```
Once trained, you can follow the instructions in [`examples/droid/README.md`](examples/droid/README.md) to serve the policy and run it on the robot.


@@ -0,0 +1,103 @@
"""
Iterates through the DROID dataset and creates a json mapping from episode unique IDs to ranges of time steps
that should be sampled during training (all others are filtered out).
Filtering logic:
We look for ranges of consecutive steps that contain at most min_idle_len consecutive idle frames
(defaults to 7 -- since most DROID action-chunking policies execute the first 8 actions of each generated chunk,
filtering this way means the policy will not get stuck outputting stationary actions). Additionally, we only keep
non-idle ranges of length at least min_non_idle_len (defaults to 16 frames = ~1 second), while also removing the last
filter_last_n_in_ranges frames from the end of each range (as those correspond to action chunks with many idle actions).
This leaves us with trajectory segments consisting of contiguous, significant movement. Training on this filtered set
yields policies that output fewer stationary actions (i.e., get "stuck" in states less).
"""
import json
import os
from pathlib import Path
import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds
from tqdm import tqdm
os.environ["CUDA_VISIBLE_DEVICES"] = "" # Set to the GPU you want to use, or leave empty for CPU
builder = tfds.builder_from_directory(
# path to the `droid` directory (not its parent)
builder_dir="<path_to_droid_dataset_tfds_files>",
)
ds = builder.as_dataset(split="train", shuffle_files=False)
ds = ds.apply(tf.data.experimental.ignore_errors())  # skip corrupted episodes instead of raising
keep_ranges_path = "<path_to_where_to_save_the_json>"
min_idle_len = 7 # If more than this number of consecutive idle frames, filter all of them out
min_non_idle_len = 16 # If fewer than this number of consecutive non-idle frames, filter all of them out
filter_last_n_in_ranges = 10 # When using a filter dict, remove this many frames from the end of each range
keep_ranges_map = {}
if Path(keep_ranges_path).exists():
with Path(keep_ranges_path).open("r") as f:
keep_ranges_map = json.load(f)
print(f"Resuming from {len(keep_ranges_map)} episodes already processed")
for ep_idx, ep in enumerate(tqdm(ds)):
recording_folderpath = ep["episode_metadata"]["recording_folderpath"].numpy().decode()
file_path = ep["episode_metadata"]["file_path"].numpy().decode()
key = f"{recording_folderpath}--{file_path}"
if key in keep_ranges_map:
continue
joint_velocities = [step["action_dict"]["joint_velocity"].numpy() for step in ep["steps"]]
joint_velocities = np.array(joint_velocities)
is_idle_array = np.hstack(
[np.array([False]), np.all(np.abs(joint_velocities[1:] - joint_velocities[:-1]) < 1e-3, axis=1)]
)
# Find what steps go from idle to non-idle and vice-versa
is_idle_padded = np.concatenate(
[[False], is_idle_array, [False]]
) # Pad with False so idle segments touching either boundary still produce start/end transitions
is_idle_diff = np.diff(is_idle_padded.astype(int))
is_idle_true_starts = np.where(is_idle_diff == 1)[0] # +1 transitions --> going from idle to non-idle
is_idle_true_ends = np.where(is_idle_diff == -1)[0] # -1 transitions --> going from non-idle to idle
# Find which steps correspond to idle segments of length at least min_idle_len
true_segment_masks = (is_idle_true_ends - is_idle_true_starts) >= min_idle_len
is_idle_true_starts = is_idle_true_starts[true_segment_masks]
is_idle_true_ends = is_idle_true_ends[true_segment_masks]
keep_mask = np.ones(len(joint_velocities), dtype=bool)
for start, end in zip(is_idle_true_starts, is_idle_true_ends, strict=True):
keep_mask[start:end] = False
# Get all non-idle ranges of at least 16
# Same logic as above, but for keep_mask, allowing us to filter out contiguous ranges of length < min_non_idle_len
keep_padded = np.concatenate([[False], keep_mask, [False]])
keep_diff = np.diff(keep_padded.astype(int))
keep_true_starts = np.where(keep_diff == 1)[0] # +1 transitions --> going from filter out to keep
keep_true_ends = np.where(keep_diff == -1)[0] # -1 transitions --> going from keep to filter out
# Find which steps correspond to non-idle segments of length at least min_non_idle_len
true_segment_masks = (keep_true_ends - keep_true_starts) >= min_non_idle_len
keep_true_starts = keep_true_starts[true_segment_masks]
keep_true_ends = keep_true_ends[true_segment_masks]
# Add mapping from episode unique ID key to list of non-idle ranges to keep
keep_ranges_map[key] = []
for start, end in zip(keep_true_starts, keep_true_ends, strict=True):
keep_ranges_map[key].append((int(start), int(end) - filter_last_n_in_ranges))
if ep_idx % 1000 == 0:
with Path(keep_ranges_path).open("w") as f:
json.dump(keep_ranges_map, f)
print("Done!")
with Path(keep_ranges_path).open("w") as f:
json.dump(keep_ranges_map, f)


@@ -0,0 +1,477 @@
"""
Minimal example script for converting a dataset collected on the DROID platform to LeRobot format.
Usage:
uv run examples/droid/convert_droid_data_to_lerobot.py --data_dir /path/to/your/data
If you want to push your dataset to the Hugging Face Hub, you can use the following command:
uv run examples/droid/convert_droid_data_to_lerobot.py --data_dir /path/to/your/data --push_to_hub
The resulting dataset will get saved to the $HF_LEROBOT_HOME directory.
"""
from collections import defaultdict
import copy
import glob
import json
from pathlib import Path
import shutil
import cv2
import h5py
from lerobot.common.datasets.lerobot_dataset import HF_LEROBOT_HOME
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset
import numpy as np
from PIL import Image
from tqdm import tqdm
import tyro
REPO_NAME = "your_hf_username/my_droid_dataset" # Name of the output dataset, also used for the Hugging Face Hub
def resize_image(image, size):
image = Image.fromarray(image)
return np.array(image.resize(size, resample=Image.BICUBIC))
def main(data_dir: str, *, push_to_hub: bool = False):
# Clean up any existing dataset in the output directory
output_path = HF_LEROBOT_HOME / REPO_NAME
if output_path.exists():
shutil.rmtree(output_path)
data_dir = Path(data_dir)
# Create LeRobot dataset, define features to store
# We will follow the DROID data naming conventions here.
# LeRobot assumes that dtype of image data is `image`
dataset = LeRobotDataset.create(
repo_id=REPO_NAME,
robot_type="panda",
fps=15, # DROID data is typically recorded at 15fps
features={
# We call this "left" since we will only use the left stereo camera (following DROID RLDS convention)
"exterior_image_1_left": {
"dtype": "image",
"shape": (180, 320, 3), # This is the resolution used in the DROID RLDS dataset
"names": ["height", "width", "channel"],
},
"exterior_image_2_left": {
"dtype": "image",
"shape": (180, 320, 3),
"names": ["height", "width", "channel"],
},
"wrist_image_left": {
"dtype": "image",
"shape": (180, 320, 3),
"names": ["height", "width", "channel"],
},
"joint_position": {
"dtype": "float32",
"shape": (7,),
"names": ["joint_position"],
},
"gripper_position": {
"dtype": "float32",
"shape": (1,),
"names": ["gripper_position"],
},
"actions": {
"dtype": "float32",
"shape": (8,), # We will use joint *velocity* actions here (7D) + gripper position (1D)
"names": ["actions"],
},
},
image_writer_threads=10,
image_writer_processes=5,
)
# Load language annotations
# Note: we load the DROID language annotations for this example, but you can manually define them for your own data
with (data_dir / "aggregated-annotations-030724.json").open() as f:
language_annotations = json.load(f)
# Loop over raw DROID fine-tuning datasets and write episodes to the LeRobot dataset
# We assume the following directory structure:
# RAW_DROID_PATH/
# - <...>/
# - recordings/
# - MP4/
# - <camera_id>.mp4 # single-view video of left stereo pair camera
# - trajectory.h5
# - <...>/
episode_paths = list(data_dir.glob("**/trajectory.h5"))
print(f"Found {len(episode_paths)} episodes for conversion")
# We will loop over each dataset_name and write episodes to the LeRobot dataset
for episode_path in tqdm(episode_paths, desc="Converting episodes"):
# Load raw data
recording_folderpath = episode_path.parent / "recordings" / "MP4"
trajectory = load_trajectory(str(episode_path), recording_folderpath=str(recording_folderpath))
# To load the language instruction, we need to parse out the episode_id from the metadata file
# Again, you can modify this step for your own data, to load your own language instructions
metadata_filepath = next(iter(episode_path.parent.glob("metadata_*.json")))
episode_id = metadata_filepath.name.split(".")[0].split("_")[-1]
language_instruction = language_annotations.get(episode_id, {"language_instruction1": "Do something"})[
"language_instruction1"
]
print(f"Converting episode with language instruction: {language_instruction}")
# Write to LeRobot dataset
for step in trajectory:
camera_type_dict = step["observation"]["camera_type"]
wrist_ids = [k for k, v in camera_type_dict.items() if v == 0]
exterior_ids = [k for k, v in camera_type_dict.items() if v != 0]
dataset.add_frame(
{
# Note: need to flip BGR --> RGB for loaded images
"exterior_image_1_left": resize_image(
step["observation"]["image"][exterior_ids[0]][..., ::-1], (320, 180)
),
"exterior_image_2_left": resize_image(
step["observation"]["image"][exterior_ids[1]][..., ::-1], (320, 180)
),
"wrist_image_left": resize_image(step["observation"]["image"][wrist_ids[0]][..., ::-1], (320, 180)),
"joint_position": np.asarray(
step["observation"]["robot_state"]["joint_positions"], dtype=np.float32
),
"gripper_position": np.asarray(
step["observation"]["robot_state"]["gripper_position"][None], dtype=np.float32
),
# Important: we use joint velocity actions here since pi05-droid was pre-trained on joint velocity actions
"actions": np.concatenate(
[step["action"]["joint_velocity"], step["action"]["gripper_position"][None]], dtype=np.float32
),
"task": language_instruction,
}
)
dataset.save_episode()
# Optionally push to the Hugging Face Hub
if push_to_hub:
dataset.push_to_hub(
tags=["droid", "panda"],
private=False,
push_videos=True,
license="apache-2.0",
)
##########################################################################################################
################ The rest of this file are functions to parse the raw DROID data #########################
################ You don't need to worry about understanding this part #########################
################ It was copied from here: https://github.com/JonathanYang0127/r2d2_rlds_dataset_builder/blob/parallel_convert/r2_d2/r2_d2.py
##########################################################################################################
camera_type_dict = {
"hand_camera_id": 0,
"varied_camera_1_id": 1,
"varied_camera_2_id": 1,
}
camera_type_to_string_dict = {
0: "hand_camera",
1: "varied_camera",
2: "fixed_camera",
}
def get_camera_type(cam_id):
if cam_id not in camera_type_dict:
return None
type_int = camera_type_dict[cam_id]
return camera_type_to_string_dict[type_int]
class MP4Reader:
def __init__(self, filepath, serial_number):
# Save Parameters #
self.serial_number = serial_number
self._index = 0
# Open Video Reader #
self._mp4_reader = cv2.VideoCapture(filepath)
if not self._mp4_reader.isOpened():
raise RuntimeError("Corrupted MP4 File")
def set_reading_parameters(
self,
image=True, # noqa: FBT002
concatenate_images=False, # noqa: FBT002
resolution=(0, 0),
resize_func=None,
):
# Save Parameters #
self.image = image
self.concatenate_images = concatenate_images
self.resolution = resolution
self.resize_func = cv2.resize
self.skip_reading = not image
if self.skip_reading:
return
def get_frame_resolution(self):
width = self._mp4_reader.get(cv2.CAP_PROP_FRAME_WIDTH)
height = self._mp4_reader.get(cv2.CAP_PROP_FRAME_HEIGHT)
return (width, height)
def get_frame_count(self):
if self.skip_reading:
return 0
return int(self._mp4_reader.get(cv2.CAP_PROP_FRAME_COUNT))
def set_frame_index(self, index):
if self.skip_reading:
return
if index < self._index:
self._mp4_reader.set(cv2.CAP_PROP_POS_FRAMES, index - 1)
self._index = index
while self._index < index:
self.read_camera(ignore_data=True)
def _process_frame(self, frame):
frame = copy.deepcopy(frame)
if self.resolution == (0, 0):
return frame
return self.resize_func(frame, self.resolution)
def read_camera(self, ignore_data=False, correct_timestamp=None): # noqa: FBT002
# Skip if Read Unnecessary #
if self.skip_reading:
return {}
# Read Camera #
success, frame = self._mp4_reader.read()
self._index += 1
if not success:
return None
if ignore_data:
return None
# Return Data #
data_dict = {}
if self.concatenate_images or "stereo" not in self.serial_number:
data_dict["image"] = {self.serial_number: self._process_frame(frame)}
else:
single_width = frame.shape[1] // 2
data_dict["image"] = {
self.serial_number + "_left": self._process_frame(frame[:, :single_width, :]),
self.serial_number + "_right": self._process_frame(frame[:, single_width:, :]),
}
return data_dict
def disable_camera(self):
if hasattr(self, "_mp4_reader"):
self._mp4_reader.release()
class RecordedMultiCameraWrapper:
def __init__(self, recording_folderpath, camera_kwargs={}): # noqa: B006
# Save Camera Info #
self.camera_kwargs = camera_kwargs
# Open Camera Readers #
mp4_filepaths = glob.glob(recording_folderpath + "/*.mp4")
all_filepaths = mp4_filepaths
self.camera_dict = {}
for f in all_filepaths:
serial_number = f.split("/")[-1][:-4]
cam_type = get_camera_type(serial_number)
if f.endswith(".mp4"):
Reader = MP4Reader # noqa: N806
else:
raise ValueError
self.camera_dict[serial_number] = Reader(f, serial_number)
def read_cameras(self, index=None, camera_type_dict={}, timestamp_dict={}): # noqa: B006
full_obs_dict = defaultdict(dict)
# Read Cameras (shuffle disabled for deterministic conversion) #
all_cam_ids = list(self.camera_dict.keys())
# random.shuffle(all_cam_ids)
for cam_id in all_cam_ids:
if "stereo" in cam_id:
continue
try:
cam_type = camera_type_dict[cam_id]
except KeyError:
print(f"{self.camera_dict} -- {camera_type_dict}")
raise ValueError(f"Camera type {cam_id} not found in camera_type_dict") # noqa: B904
curr_cam_kwargs = self.camera_kwargs.get(cam_type, {})
self.camera_dict[cam_id].set_reading_parameters(**curr_cam_kwargs)
timestamp = timestamp_dict.get(cam_id + "_frame_received", None)
if index is not None:
self.camera_dict[cam_id].set_frame_index(index)
data_dict = self.camera_dict[cam_id].read_camera(correct_timestamp=timestamp)
# Process Returned Data #
if data_dict is None:
return None
for key in data_dict:
full_obs_dict[key].update(data_dict[key])
return full_obs_dict
def get_hdf5_length(hdf5_file, keys_to_ignore=[]): # noqa: B006
length = None
for key in hdf5_file:
if key in keys_to_ignore:
continue
curr_data = hdf5_file[key]
if isinstance(curr_data, h5py.Group):
curr_length = get_hdf5_length(curr_data, keys_to_ignore=keys_to_ignore)
elif isinstance(curr_data, h5py.Dataset):
curr_length = len(curr_data)
else:
raise ValueError
if length is None:
length = curr_length
assert curr_length == length
return length
def load_hdf5_to_dict(hdf5_file, index, keys_to_ignore=[]): # noqa: B006
data_dict = {}
for key in hdf5_file:
if key in keys_to_ignore:
continue
curr_data = hdf5_file[key]
if isinstance(curr_data, h5py.Group):
data_dict[key] = load_hdf5_to_dict(curr_data, index, keys_to_ignore=keys_to_ignore)
elif isinstance(curr_data, h5py.Dataset):
data_dict[key] = curr_data[index]
else:
raise ValueError
return data_dict
class TrajectoryReader:
def __init__(self, filepath, read_images=True): # noqa: FBT002
self._hdf5_file = h5py.File(filepath, "r")
is_video_folder = "observations/videos" in self._hdf5_file
self._read_images = read_images and is_video_folder
self._length = get_hdf5_length(self._hdf5_file)
self._video_readers = {}
self._index = 0
def length(self):
return self._length
def read_timestep(self, index=None, keys_to_ignore=[]): # noqa: B006
# Make Sure We Read Within Range #
if index is None:
index = self._index
else:
assert not self._read_images
self._index = index
assert index < self._length
# Load Low Dimensional Data #
keys_to_ignore = [*keys_to_ignore.copy(), "videos"]
timestep = load_hdf5_to_dict(self._hdf5_file, self._index, keys_to_ignore=keys_to_ignore)
# Increment Read Index #
self._index += 1
# Return Timestep #
return timestep
def close(self):
self._hdf5_file.close()
def load_trajectory(
filepath=None,
read_cameras=True, # noqa: FBT002
recording_folderpath=None,
camera_kwargs={}, # noqa: B006
remove_skipped_steps=False, # noqa: FBT002
num_samples_per_traj=None,
num_samples_per_traj_coeff=1.5,
):
read_recording_folderpath = read_cameras and (recording_folderpath is not None)
traj_reader = TrajectoryReader(filepath)
if read_recording_folderpath:
camera_reader = RecordedMultiCameraWrapper(recording_folderpath, camera_kwargs)
horizon = traj_reader.length()
timestep_list = []
# Choose Timesteps To Save #
if num_samples_per_traj:
num_to_save = num_samples_per_traj
if remove_skipped_steps:
num_to_save = int(num_to_save * num_samples_per_traj_coeff)
max_size = min(num_to_save, horizon)
indices_to_save = np.sort(np.random.choice(horizon, size=max_size, replace=False))
else:
indices_to_save = np.arange(horizon)
# Iterate Over Trajectory #
for i in indices_to_save:
# Get HDF5 Data #
timestep = traj_reader.read_timestep(index=i)
# If Applicable, Get Recorded Data #
if read_recording_folderpath:
timestamp_dict = timestep["observation"]["timestamp"]["cameras"]
camera_type_dict = {
k: camera_type_to_string_dict[v] for k, v in timestep["observation"]["camera_type"].items()
}
camera_obs = camera_reader.read_cameras(
index=i, camera_type_dict=camera_type_dict, timestamp_dict=timestamp_dict
)
camera_failed = camera_obs is None
# Add Data To Timestep If Successful #
if camera_failed:
break
timestep["observation"].update(camera_obs)
# Filter Steps #
step_skipped = not timestep["observation"]["controller_info"].get("movement_enabled", True)
delete_skipped_step = step_skipped and remove_skipped_steps
# Save Filtered Timesteps #
        if not delete_skipped_step:
            timestep_list.append(timestep)
# Remove Extra Transitions #
timestep_list = np.array(timestep_list)
if (num_samples_per_traj is not None) and (len(timestep_list) > num_samples_per_traj):
ind_to_keep = np.random.choice(len(timestep_list), size=num_samples_per_traj, replace=False)
timestep_list = timestep_list[ind_to_keep]
# Close Readers #
traj_reader.close()
# Return Data #
return timestep_list
if __name__ == "__main__":
tyro.cli(main)
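The oversample-then-trim logic in `load_trajectory` (draw `num_samples_per_traj * coeff` indices up front so that enough survive the skipped-step filter, then trim back down to the requested count) can be sketched in isolation. `subsample_indices` and `keep_mask` are hypothetical names for this sketch; `keep_mask` stands in for the `movement_enabled` filter:

```python
import numpy as np


def subsample_indices(horizon, num_samples, coeff=1.5, keep_mask=None):
    """Oversample sorted indices by `coeff`, filter, then trim to `num_samples`."""
    # Draw more indices than needed so enough survive filtering.
    max_size = min(int(num_samples * coeff), horizon)
    indices = np.sort(np.random.choice(horizon, size=max_size, replace=False))
    # Stand-in for the movement_enabled filter applied per timestep above.
    if keep_mask is not None:
        indices = indices[keep_mask[indices]]
    # Trim back down to the requested number of samples.
    if len(indices) > num_samples:
        indices = np.sort(np.random.choice(indices, size=num_samples, replace=False))
    return indices


idx = subsample_indices(horizon=100, num_samples=20)
assert len(idx) == 20
assert np.all(np.diff(idx) > 0)  # unique and strictly increasing
```

Oversampling by `coeff` before filtering is what keeps the final trajectory close to `num_samples_per_traj` steps even when some steps are skipped.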


@@ -0,0 +1,246 @@
# ruff: noqa
import contextlib
import dataclasses
import datetime
import faulthandler
import os
import signal
import time
from moviepy.editor import ImageSequenceClip  # requires moviepy<2.0; moviepy 2.x moved this to `from moviepy import ImageSequenceClip`
import numpy as np
from openpi_client import image_tools
from openpi_client import websocket_client_policy
import pandas as pd
from PIL import Image
from droid.robot_env import RobotEnv
import tqdm
import tyro
faulthandler.enable()
# DROID data collection frequency -- we slow down execution to match this frequency
DROID_CONTROL_FREQUENCY = 15
@dataclasses.dataclass
class Args:
# Hardware parameters
left_camera_id: str = "<your_camera_id>" # e.g., "24259877"
right_camera_id: str = "<your_camera_id>" # e.g., "24514023"
wrist_camera_id: str = "<your_camera_id>" # e.g., "13062452"
# Policy parameters
external_camera: str | None = (
None # which external camera should be fed to the policy, choose from ["left", "right"]
)
# Rollout parameters
max_timesteps: int = 600
# How many actions to execute from a predicted action chunk before querying policy server again
# 8 is usually a good default (equals 0.5 seconds of action execution).
open_loop_horizon: int = 8
# Remote server parameters
remote_host: str = "0.0.0.0" # point this to the IP address of the policy server, e.g., "192.168.1.100"
remote_port: int = (
8000 # point this to the port of the policy server, default server port for openpi servers is 8000
)
# We are using Ctrl+C to optionally terminate rollouts early -- however, if we press Ctrl+C while the policy server is
# waiting for a new action chunk, it will raise an exception and the server connection dies.
# This context manager temporarily prevents Ctrl+C and delays it until after the server call is complete.
@contextlib.contextmanager
def prevent_keyboard_interrupt():
"""Temporarily prevent keyboard interrupts by delaying them until after the protected code."""
interrupted = False
original_handler = signal.getsignal(signal.SIGINT)
def handler(signum, frame):
nonlocal interrupted
interrupted = True
signal.signal(signal.SIGINT, handler)
try:
yield
finally:
signal.signal(signal.SIGINT, original_handler)
if interrupted:
raise KeyboardInterrupt
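The deferred-interrupt pattern above can be exercised standalone (Unix-only sketch: `os.kill` delivers a real SIGINT to the current process, which CPython routes to the handler installed in the main thread):

```python
import contextlib
import os
import signal


@contextlib.contextmanager
def prevent_keyboard_interrupt():
    """Delay Ctrl+C until the protected block has finished."""
    interrupted = False
    original_handler = signal.getsignal(signal.SIGINT)

    def handler(signum, frame):
        nonlocal interrupted
        interrupted = True

    signal.signal(signal.SIGINT, handler)
    try:
        yield
    finally:
        signal.signal(signal.SIGINT, original_handler)
        if interrupted:
            raise KeyboardInterrupt


completed = False
try:
    with prevent_keyboard_interrupt():
        os.kill(os.getpid(), signal.SIGINT)  # simulate Ctrl+C mid-call
        completed = True  # still reached: the interrupt was deferred
except KeyboardInterrupt:
    pass
assert completed  # the protected block ran to completion
```

The key point is that the custom handler only records the interrupt; the `KeyboardInterrupt` is re-raised in `finally`, after the original handler has been restored.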
def main(args: Args):
# Make sure external camera is specified by user -- we only use one external camera for the policy
assert (
args.external_camera is not None and args.external_camera in ["left", "right"]
), f"Please specify an external camera to use for the policy, choose from ['left', 'right'], but got {args.external_camera}"
# Initialize the Panda environment. Using joint velocity action space and gripper position action space is very important.
env = RobotEnv(action_space="joint_velocity", gripper_action_space="position")
print("Created the droid env!")
# Connect to the policy server
policy_client = websocket_client_policy.WebsocketClientPolicy(args.remote_host, args.remote_port)
df = pd.DataFrame(columns=["success", "duration", "video_filename"])
while True:
instruction = input("Enter instruction: ")
# Rollout parameters
actions_from_chunk_completed = 0
pred_action_chunk = None
# Prepare to save video of rollout
timestamp = datetime.datetime.now().strftime("%Y_%m_%d_%H:%M:%S")
video = []
bar = tqdm.tqdm(range(args.max_timesteps))
print("Running rollout... press Ctrl+C to stop early.")
for t_step in bar:
start_time = time.time()
try:
# Get the current observation
curr_obs = _extract_observation(
args,
env.get_observation(),
# Save the first observation to disk
save_to_disk=t_step == 0,
)
video.append(curr_obs[f"{args.external_camera}_image"])
# Send websocket request to policy server if it's time to predict a new chunk
if actions_from_chunk_completed == 0 or actions_from_chunk_completed >= args.open_loop_horizon:
actions_from_chunk_completed = 0
# We resize images on the robot laptop to minimize the amount of data sent to the policy server
# and improve latency.
request_data = {
"observation/exterior_image_1_left": image_tools.resize_with_pad(
curr_obs[f"{args.external_camera}_image"], 224, 224
),
"observation/wrist_image_left": image_tools.resize_with_pad(curr_obs["wrist_image"], 224, 224),
"observation/joint_position": curr_obs["joint_position"],
"observation/gripper_position": curr_obs["gripper_position"],
"prompt": instruction,
}
# Wrap the server call in a context manager to prevent Ctrl+C from interrupting it
# Ctrl+C will be handled after the server call is complete
with prevent_keyboard_interrupt():
# this returns action chunk [10, 8] of 10 joint velocity actions (7) + gripper position (1)
pred_action_chunk = policy_client.infer(request_data)["actions"]
assert pred_action_chunk.shape == (10, 8)
# Select current action to execute from chunk
action = pred_action_chunk[actions_from_chunk_completed]
actions_from_chunk_completed += 1
# Binarize gripper action
if action[-1].item() > 0.5:
# action[-1] = 1.0
action = np.concatenate([action[:-1], np.ones((1,))])
else:
# action[-1] = 0.0
action = np.concatenate([action[:-1], np.zeros((1,))])
# clip all dimensions of action to [-1, 1]
action = np.clip(action, -1, 1)
env.step(action)
# Sleep to match DROID data collection frequency
elapsed_time = time.time() - start_time
if elapsed_time < 1 / DROID_CONTROL_FREQUENCY:
time.sleep(1 / DROID_CONTROL_FREQUENCY - elapsed_time)
except KeyboardInterrupt:
break
video = np.stack(video)
save_filename = "video_" + timestamp
ImageSequenceClip(list(video), fps=10).write_videofile(save_filename + ".mp4", codec="libx264")
        success: str | float | None = None
        while not isinstance(success, float):
            success = input(
                "Did the rollout succeed? (enter y for 100%, n for 0%, or a numeric value 0-100 based on the evaluation spec): "
            )
            if success == "y":
                success = 1.0
            elif success == "n":
                success = 0.0
            else:
                try:
                    success = float(success) / 100
                except ValueError:
                    print(f"Invalid input: {success!r} -- enter y, n, or a number 0-100.")
                    success = None
                    continue
                if not (0 <= success <= 1):
                    print(f"Success must be a number in [0, 100] but got: {success * 100}")
                    success = None
        # `DataFrame.append` was removed in pandas 2.0 -- build a one-row frame and concat instead.
        df = pd.concat(
            [df, pd.DataFrame([{"success": success, "duration": t_step, "video_filename": save_filename}])],
            ignore_index=True,
        )
if input("Do one more eval? (enter y or n) ").lower() != "y":
break
env.reset()
os.makedirs("results", exist_ok=True)
timestamp = datetime.datetime.now().strftime("%I:%M%p_%B_%d_%Y")
csv_filename = os.path.join("results", f"eval_{timestamp}.csv")
df.to_csv(csv_filename)
print(f"Results saved to {csv_filename}")
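The open-loop chunking scheme in `main` (query the server once, execute `open_loop_horizon` actions from the returned 10-step chunk, then re-query) determines how many server round-trips a rollout costs. A standalone count of that logic; `FakePolicy` is a hypothetical stand-in for the websocket client, and its `infer()` takes no observation for simplicity:

```python
class FakePolicy:
    """Hypothetical stand-in for the policy server client: returns 10-step chunks."""

    def __init__(self):
        self.num_infer_calls = 0

    def infer(self):
        self.num_infer_calls += 1
        return [[0.0] * 8 for _ in range(10)]  # 10 actions, 8 dims each


def rollout(policy, max_timesteps=600, open_loop_horizon=8):
    actions_from_chunk_completed = 0
    chunk = None
    for _ in range(max_timesteps):
        # Re-query once the first `open_loop_horizon` actions are used up.
        if actions_from_chunk_completed == 0 or actions_from_chunk_completed >= open_loop_horizon:
            actions_from_chunk_completed = 0
            chunk = policy.infer()
        action = chunk[actions_from_chunk_completed]
        actions_from_chunk_completed += 1


policy = FakePolicy()
rollout(policy)
assert policy.num_infer_calls == 600 // 8  # 75 server round-trips for 600 steps
```

A larger `open_loop_horizon` means fewer round-trips (lower latency overhead) but longer stretches of open-loop execution between observations.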
def _extract_observation(args: Args, obs_dict, *, save_to_disk=False):
image_observations = obs_dict["image"]
left_image, right_image, wrist_image = None, None, None
for key in image_observations:
# Note the "left" below refers to the left camera in the stereo pair.
# The model is only trained on left stereo cams, so we only feed those.
if args.left_camera_id in key and "left" in key:
left_image = image_observations[key]
elif args.right_camera_id in key and "left" in key:
right_image = image_observations[key]
elif args.wrist_camera_id in key and "left" in key:
wrist_image = image_observations[key]
    # Fail loudly if any camera ID was not matched, rather than crashing on a None slice below.
    for name, image in [("left", left_image), ("right", right_image), ("wrist", wrist_image)]:
        assert image is not None, f"Could not find {name} camera stream -- check the camera IDs passed in Args."
# Drop the alpha dimension
left_image = left_image[..., :3]
right_image = right_image[..., :3]
wrist_image = wrist_image[..., :3]
# Convert to RGB
left_image = left_image[..., ::-1]
right_image = right_image[..., ::-1]
wrist_image = wrist_image[..., ::-1]
# In addition to image observations, also capture the proprioceptive state
robot_state = obs_dict["robot_state"]
cartesian_position = np.array(robot_state["cartesian_position"])
joint_position = np.array(robot_state["joint_positions"])
gripper_position = np.array([robot_state["gripper_position"]])
# Save the images to disk so that they can be viewed live while the robot is running
# Create one combined image to make live viewing easy
if save_to_disk:
combined_image = np.concatenate([left_image, wrist_image, right_image], axis=1)
combined_image = Image.fromarray(combined_image)
combined_image.save("robot_camera_views.png")
return {
"left_image": left_image,
"right_image": right_image,
"wrist_image": wrist_image,
"cartesian_position": cartesian_position,
"joint_position": joint_position,
"gripper_position": gripper_position,
}
if __name__ == "__main__":
args: Args = tyro.cli(Args)
main(args)
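The per-camera image handling in `_extract_observation` (drop the alpha channel, then flip the channel order to RGB) amounts to two slices. A minimal sketch on a dummy frame; the BGRA layout is an assumption of this sketch, matching the slicing in the function above:

```python
import numpy as np

# Dummy 2x2 BGRA frame, channel order [B, G, R, A].
bgra = np.zeros((2, 2, 4), dtype=np.uint8)
bgra[..., 0] = 255  # blue channel
bgra[..., 3] = 255  # alpha channel

rgb = bgra[..., :3][..., ::-1]  # drop alpha, then reverse channel order -> [R, G, B]

assert rgb.shape == (2, 2, 3)
assert rgb[0, 0, 2] == 255  # blue ended up in the last (B) slot of RGB
assert rgb[0, 0, 0] == 0    # red slot is empty
```

Both operations are views in NumPy, so no pixel data is copied until the arrays are next written to or serialized.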