Compare commits

...

10 Commits

Author SHA1 Message Date
Michel Aractingi
d5fb8e9802 updated params 2024-09-02 06:34:24 +00:00
Michel Aractingi
53d67bb5b7 First commit of tdmpc2 taken from NHansen code 2024-08-29 12:48:01 +00:00
Michel Aractingi
eb4c505cff Support for converting OpenX datasets from RLDS format to LeRobotDataset (#354)
Signed-off-by: youliangtan <tan_you_liang@hotmail.com>
Co-authored-by: Simon Alibert <alibert.sim@gmail.com>
Co-authored-by: youliangtan <tan_you_liang@hotmail.com>
Co-authored-by: Remi <re.cadene@gmail.com>
2024-08-27 09:07:00 +02:00
Mishig
aad59e6b6b Fix videos in visualize_dataset are not in sync (#382) 2024-08-26 17:38:48 +02:00
Alexander Soare
9ce98bb93c Add safety limits on relative action target (#373) 2024-08-26 14:30:18 +01:00
Alexander Soare
97086cdcdf Make gripper_open_degree a config param (#379) 2024-08-26 12:28:16 +01:00
Alexander Soare
9c7649f140 Make sure init_hydra_config does not require any keys (#376) 2024-08-23 12:27:08 +01:00
Zhuoheng Li
a2592a5563 Provide more information to the user (#358)
Co-authored-by: Alexander Soare <alexander.soare159@gmail.com>
Co-authored-by: Remi <re.cadene@gmail.com>
2024-08-23 11:00:35 +01:00
ellacroix
b5ad79a7d3 Fix typo in tutorial (#371) 2024-08-21 14:14:01 +02:00
Remi
996468bcce Update README.md 2024-08-20 16:45:57 +02:00
29 changed files with 3805 additions and 47 deletions

View File

@@ -31,12 +31,13 @@
<p>We just dropped an in-depth tutorial on how to build your own robot!</p>
<p>Teach it new skills by showing it a few moves with just a laptop.</p>
<p>Then watch your homemade robot act autonomously 🤯</p>
<p>For more info, see [our thread on X](https://x.com/RemiCadene/status/1825455895561859185) or [our tutorial page](https://github.com/huggingface/lerobot/blob/main/examples/7_get_started_with_real_robot.md).</p>
<p>For more info, see <a href="https://x.com/RemiCadene/status/1825455895561859185">our thread on X</a> or <a href="https://github.com/huggingface/lerobot/blob/main/examples/7_get_started_with_real_robot.md">our tutorial page</a>.</p>
</div>
<br/>
<h3 align="center">
<p>State-of-the-art AI for real-world robotics</p>
<p>LeRobot: State-of-the-art AI for real-world robotics</p>
</h3>
---
@@ -266,13 +267,20 @@ checkpoints
│ └── training_state.pth # optimizer/scheduler/rng state and training step
```
To resume training from a checkpoint, you can add these to the `train.py` python command:
```bash
hydra.run.dir=your/original/experiment/dir resume=true
```
It will load the pretrained model, optimizer and scheduler states for training. For more information please see our tutorial on training resumption [here](https://github.com/huggingface/lerobot/blob/main/examples/5_resume_training.md).
To use wandb for logging training and evaluation curves, make sure you've run `wandb login` as a one-time setup step. Then, when running the training command above, enable WandB in the configuration by adding:
```bash
wandb.enable=true
```
A link to the wandb logs for the run will also show up in yellow in your terminal. Here is an example of what they look like in your browser:
A link to the wandb logs for the run will also show up in yellow in your terminal. Here is an example of what they look like in your browser. Please also check [here](https://github.com/huggingface/lerobot/blob/main/examples/4_train_policy_with_script.md#typical-logs-and-metrics) for an explanation of some commonly used metrics in the logs.
![](media/wandb.png)

View File

@@ -170,6 +170,36 @@ python lerobot/scripts/train.py --config-dir outputs/train/my_experiment/checkpo
Note that you may still use the regular syntax for config parameter overrides (eg: by adding `training.offline_steps=200000`).
## Typical logs and metrics
When you start the training process, you will first see your full configuration printed in the terminal. You can check it to make sure you configured everything correctly and that your config is not overridden by other files. The final configuration will also be saved with the checkpoint.
After that, you will see training logs like this one:
```
INFO 2024-08-14 13:35:12 ts/train.py:192 step:0 smpl:64 ep:1 epch:0.00 loss:1.112 grdn:15.387 lr:2.0e-07 updt_s:1.738 data_s:4.774
```
or evaluation logs like:
```
INFO 2024-08-14 13:38:45 ts/train.py:226 step:100 smpl:6K ep:52 epch:0.25 ∑rwrd:20.693 success:0.0% eval_s:120.266
```
These logs will also be saved in wandb if `wandb.enable` is set to `true`. Here are the meanings of some abbreviations:
- `smpl`: number of samples seen during training.
- `ep`: number of episodes seen during training. An episode contains multiple samples in a complete manipulation task.
- `epch`: number of times all unique samples have been seen (epoch).
- `grdn`: gradient norm.
- `∑rwrd`: sum of rewards within each evaluation episode, averaged over episodes.
- `success`: average success rate of eval episodes. Reward and success usually differ, except in the sparse-reward setting where reward=1 only when the task is completed successfully.
- `eval_s`: time to evaluate the policy in the environment, in seconds.
- `updt_s`: time to update the network parameters, in seconds.
- `data_s`: time to load a batch of data, in seconds.
Some metrics are useful for initial performance profiling. For example, if GPU utilization (as reported by the `nvidia-smi` command) is low while `data_s` is sometimes too high, you may need to adjust the batch size or the number of dataloading workers to accelerate dataloading. We also recommend the [pytorch profiler](https://github.com/huggingface/lerobot?tab=readme-ov-file#improve-your-code-with-profiling) for detailed performance probing, as in the sketch below.
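As a rough illustration, here is a minimal, self-contained profiling sketch. It is not taken from LeRobot: the toy `TensorDataset` and `nn.Linear` stand in for a real dataset and policy, and it only shows how `torch.profiler` can separate dataloading time from update time, mirroring the `data_s`/`updt_s` split above.
```python
import torch
from torch import nn
from torch.profiler import ProfilerActivity, profile, record_function
from torch.utils.data import DataLoader, TensorDataset

# Toy data and model, standing in for a real dataset and policy (illustrative only).
dataset = TensorDataset(torch.randn(256, 32), torch.randn(256, 8))
dataloader = DataLoader(dataset, batch_size=64, num_workers=0)
model = nn.Linear(32, 8)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

with profile(activities=[ProfilerActivity.CPU]) as prof:
    for x, y in dataloader:  # time spent between steps corresponds to dataloading (data_s)
        with record_function("update"):  # corresponds to the parameter update (updt_s)
            loss = nn.functional.mse_loss(model(x), y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```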
---
So far we've seen how to train Diffusion Policy for PushT and ACT for ALOHA. Now, what if we want to train ACT for PushT? Well, there are aspects of the ACT configuration that are specific to the ALOHA environments, and these happen to be incompatible with PushT. Therefore, trying to run the following will almost certainly raise an exception of sorts (eg: feature dimension mismatch):

View File

@@ -752,7 +752,7 @@ Before trying `record`, if you want to push your dataset to the hub, make sure y
```bash
huggingface-cli login --token ${HUGGINGFACE_TOKEN} --add-to-git-credential
```
Also, store your Hugging Face repositery name in a variable (e.g. `cadene` or `lerobot`). For instance, run this to use your Hugging Face user name as repositery:
Also, store your Hugging Face repository name in a variable (e.g. `cadene` or `lerobot`). For instance, run this to use your Hugging Face user name as repository:
```bash
HF_USER=$(huggingface-cli whoami | head -n 1)
echo $HF_USER

View File

@@ -129,6 +129,53 @@ available_real_world_datasets = [
"lerobot/unitreeh1_rearrange_objects",
"lerobot/unitreeh1_two_robot_greeting",
"lerobot/unitreeh1_warehouse",
"lerobot/nyu_rot_dataset",
"lerobot/utokyo_saytap",
"lerobot/imperialcollege_sawyer_wrist_cam",
"lerobot/utokyo_xarm_bimanual",
"lerobot/tokyo_u_lsmo",
"lerobot/utokyo_pr2_opening_fridge",
"lerobot/cmu_franka_exploration_dataset",
"lerobot/cmu_stretch",
"lerobot/asu_table_top",
"lerobot/utokyo_pr2_tabletop_manipulation",
"lerobot/utokyo_xarm_pick_and_place",
"lerobot/ucsd_kitchen_dataset",
"lerobot/austin_buds_dataset",
"lerobot/dlr_sara_grid_clamp",
"lerobot/conq_hose_manipulation",
"lerobot/columbia_cairlab_pusht_real",
"lerobot/dlr_sara_pour",
"lerobot/dlr_edan_shared_control",
"lerobot/ucsd_pick_and_place_dataset",
"lerobot/berkeley_cable_routing",
"lerobot/nyu_franka_play_dataset",
"lerobot/austin_sirius_dataset",
"lerobot/cmu_play_fusion",
"lerobot/berkeley_gnm_sac_son",
"lerobot/nyu_door_opening_surprising_effectiveness",
"lerobot/berkeley_fanuc_manipulation",
"lerobot/jaco_play",
"lerobot/viola",
"lerobot/kaist_nonprehensile",
"lerobot/berkeley_mvp",
"lerobot/uiuc_d3field",
"lerobot/berkeley_gnm_recon",
"lerobot/austin_sailor_dataset",
"lerobot/utaustin_mutex",
"lerobot/roboturk",
"lerobot/stanford_hydra_dataset",
"lerobot/berkeley_autolab_ur5",
"lerobot/stanford_robocook",
"lerobot/toto",
"lerobot/fmb",
"lerobot/droid_100",
"lerobot/berkeley_rpt",
"lerobot/stanford_kuka_multimodal_dataset",
"lerobot/iamlab_cmu_pickup_insert",
"lerobot/taco_play",
"lerobot/berkeley_gnm_cory_hall",
"lerobot/usc_cloth_sim",
]
available_datasets = list(
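For reference, here is a hedged sketch of loading one of the newly listed OpenX-derived datasets. It assumes `LeRobotDataset` and the attributes shown below behave as in the library's own dataset examples at this point in the codebase; treat the attribute names as assumptions.
```python
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

# Any repo id from the list above should work the same way (assumption).
dataset = LeRobotDataset("lerobot/nyu_rot_dataset")
print(dataset.num_episodes, dataset.num_samples)  # assumed convenience properties
item = dataset[0]  # dict of tensors: camera images, state, action, timestamps, ...
```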

View File

@@ -40,6 +40,10 @@ def get_stats_einops_patterns(dataset, num_workers=0):
stats_patterns = {}
for key, feats_type in dataset.features.items():
# NOTE: skip language_instruction embedding in stats computation
if key == "language_instruction":
continue
# sanity check that tensors are not float64
assert batch[key].dtype != torch.float64

View File

@@ -60,8 +60,8 @@ AVAILABLE_RAW_REPO_IDS = {
"lerobot-raw/aloha_static_vinh_cup_left_raw": "aloha_hdf5",
"lerobot-raw/aloha_static_vinh_cup_raw": "aloha_hdf5",
"lerobot-raw/aloha_static_ziploc_slide_raw": "aloha_hdf5",
"lerobot-raw/pusht_raw": "pusht_zarr",
"lerobot-raw/umi_cup_in_the_wild_raw": "umi_zarr",
"lerobot-raw/pusht_raw": "pusht_zarr",
"lerobot-raw/unitreeh1_fold_clothes_raw": "aloha_hdf5",
"lerobot-raw/unitreeh1_rearrange_objects_raw": "aloha_hdf5",
"lerobot-raw/unitreeh1_two_robot_greeting_raw": "aloha_hdf5",
@@ -70,6 +70,74 @@ AVAILABLE_RAW_REPO_IDS = {
"lerobot-raw/xarm_lift_medium_replay_raw": "xarm_pkl",
"lerobot-raw/xarm_push_medium_raw": "xarm_pkl",
"lerobot-raw/xarm_push_medium_replay_raw": "xarm_pkl",
"lerobot-raw/fractal20220817_data_raw": "openx_rlds.fractal20220817_data",
"lerobot-raw/kuka_raw": "openx_rlds.kuka",
"lerobot-raw/bridge_openx_raw": "openx_rlds.bridge_openx",
"lerobot-raw/taco_play_raw": "openx_rlds.taco_play",
"lerobot-raw/jaco_play_raw": "openx_rlds.jaco_play",
"lerobot-raw/berkeley_cable_routing_raw": "openx_rlds.berkeley_cable_routing",
"lerobot-raw/roboturk_raw": "openx_rlds.roboturk",
"lerobot-raw/nyu_door_opening_surprising_effectiveness_raw": "openx_rlds.nyu_door_opening_surprising_effectiveness",
"lerobot-raw/viola_raw": "openx_rlds.viola",
"lerobot-raw/berkeley_autolab_ur5_raw": "openx_rlds.berkeley_autolab_ur5",
"lerobot-raw/toto_raw": "openx_rlds.toto",
"lerobot-raw/language_table_raw": "openx_rlds.language_table",
"lerobot-raw/columbia_cairlab_pusht_real_raw": "openx_rlds.columbia_cairlab_pusht_real",
"lerobot-raw/stanford_kuka_multimodal_dataset_raw": "openx_rlds.stanford_kuka_multimodal_dataset",
"lerobot-raw/nyu_rot_dataset_raw": "openx_rlds.nyu_rot_dataset",
"lerobot-raw/io_ai_tech_raw": "openx_rlds.io_ai_tech",
"lerobot-raw/stanford_hydra_dataset_raw": "openx_rlds.stanford_hydra_dataset",
"lerobot-raw/austin_buds_dataset_raw": "openx_rlds.austin_buds_dataset",
"lerobot-raw/nyu_franka_play_dataset_raw": "openx_rlds.nyu_franka_play_dataset",
"lerobot-raw/maniskill_dataset_raw": "openx_rlds.maniskill_dataset",
"lerobot-raw/furniture_bench_dataset_raw": "openx_rlds.furniture_bench_dataset",
"lerobot-raw/cmu_franka_exploration_dataset_raw": "openx_rlds.cmu_franka_exploration_dataset",
"lerobot-raw/ucsd_kitchen_dataset_raw": "openx_rlds.ucsd_kitchen_dataset",
"lerobot-raw/ucsd_pick_and_place_dataset_raw": "openx_rlds.ucsd_pick_and_place_dataset",
"lerobot-raw/spoc_raw": "openx_rlds.spoc",
"lerobot-raw/austin_sailor_dataset_raw": "openx_rlds.austin_sailor_dataset",
"lerobot-raw/austin_sirius_dataset_raw": "openx_rlds.austin_sirius_dataset",
"lerobot-raw/bc_z_raw": "openx_rlds.bc_z",
"lerobot-raw/utokyo_pr2_opening_fridge_raw": "openx_rlds.utokyo_pr2_opening_fridge",
"lerobot-raw/utokyo_pr2_tabletop_manipulation_raw": "openx_rlds.utokyo_pr2_tabletop_manipulation",
"lerobot-raw/utokyo_xarm_pick_and_place_raw": "openx_rlds.utokyo_xarm_pick_and_place",
"lerobot-raw/utokyo_xarm_bimanual_raw": "openx_rlds.utokyo_xarm_bimanual",
"lerobot-raw/utokyo_saytap_raw": "openx_rlds.utokyo_saytap",
"lerobot-raw/robo_net_raw": "openx_rlds.robo_net",
"lerobot-raw/robo_set_raw": "openx_rlds.robo_set",
"lerobot-raw/berkeley_mvp_raw": "openx_rlds.berkeley_mvp",
"lerobot-raw/berkeley_rpt_raw": "openx_rlds.berkeley_rpt",
"lerobot-raw/kaist_nonprehensile_raw": "openx_rlds.kaist_nonprehensile",
"lerobot-raw/stanford_mask_vit_raw": "openx_rlds.stanford_mask_vit",
"lerobot-raw/tokyo_u_lsmo_raw": "openx_rlds.tokyo_u_lsmo",
"lerobot-raw/dlr_sara_pour_raw": "openx_rlds.dlr_sara_pour",
"lerobot-raw/dlr_sara_grid_clamp_raw": "openx_rlds.dlr_sara_grid_clamp",
"lerobot-raw/dlr_edan_shared_control_raw": "openx_rlds.dlr_edan_shared_control",
"lerobot-raw/asu_table_top_raw": "openx_rlds.asu_table_top",
"lerobot-raw/stanford_robocook_raw": "openx_rlds.stanford_robocook",
"lerobot-raw/imperialcollege_sawyer_wrist_cam_raw": "openx_rlds.imperialcollege_sawyer_wrist_cam",
"lerobot-raw/iamlab_cmu_pickup_insert_raw": "openx_rlds.iamlab_cmu_pickup_insert",
"lerobot-raw/uiuc_d3field_raw": "openx_rlds.uiuc_d3field",
"lerobot-raw/utaustin_mutex_raw": "openx_rlds.utaustin_mutex",
"lerobot-raw/berkeley_fanuc_manipulation_raw": "openx_rlds.berkeley_fanuc_manipulation",
"lerobot-raw/cmu_playing_with_food_raw": "openx_rlds.cmu_playing_with_food",
"lerobot-raw/cmu_play_fusion_raw": "openx_rlds.cmu_play_fusion",
"lerobot-raw/cmu_stretch_raw": "openx_rlds.cmu_stretch",
"lerobot-raw/berkeley_gnm_recon_raw": "openx_rlds.berkeley_gnm_recon",
"lerobot-raw/berkeley_gnm_cory_hall_raw": "openx_rlds.berkeley_gnm_cory_hall",
"lerobot-raw/berkeley_gnm_sac_son_raw": "openx_rlds.berkeley_gnm_sac_son",
"lerobot-raw/droid_raw": "openx_rlds.droid",
"lerobot-raw/droid_100_raw": "openx_rlds.droid100",
"lerobot-raw/fmb_raw": "openx_rlds.fmb",
"lerobot-raw/dobbe_raw": "openx_rlds.dobbe",
"lerobot-raw/usc_cloth_sim_raw": "openx_rlds.usc_cloth_sim",
"lerobot-raw/plex_robosuite_raw": "openx_rlds.plex_robosuite",
"lerobot-raw/conq_hose_manipulation_raw": "openx_rlds.conq_hose_manipulation",
"lerobot-raw/vima_raw": "openx_rlds.vima",
"lerobot-raw/robot_vqa_raw": "openx_rlds.robot_vqa",
"lerobot-raw/mimic_play_raw": "openx_rlds.mimic_play",
"lerobot-raw/tidybot_raw": "openx_rlds.tidybot",
"lerobot-raw/eth_agent_affordances_raw": "openx_rlds.eth_agent_affordances",
}
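A brief, hedged sketch of how these new entries can be read: the value names the raw format, and for the OpenX entries the part after the dot appears to match the dataset key used in the OpenX config and transform registries added in this comparison (an interpretation, not something this hunk states explicitly).
```python
raw_format = AVAILABLE_RAW_REPO_IDS["lerobot-raw/taco_play_raw"]  # "openx_rlds.taco_play"
fmt, _, openx_name = raw_format.partition(".")
print(fmt, openx_name)  # openx_rlds taco_play
```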
@@ -110,7 +178,7 @@ def download_all_raw_datasets(data_dir: Path | None = None):
def main():
parser = argparse.ArgumentParser(
description=f"""A script to download raw datasets from Hugging Face hub to a local directory. Here is a
non exhaustive list of available repositories to use in `--repo-id`: {AVAILABLE_RAW_REPO_IDS}""",
non exhaustive list of available repositories to use in `--repo-id`: {list(AVAILABLE_RAW_REPO_IDS.keys())}""",
)
parser.add_argument(

View File

@@ -0,0 +1,640 @@
OPENX_DATASET_CONFIGS:
fractal20220817_data:
image_obs_keys:
- image
depth_obs_keys:
- null
state_obs_keys:
- base_pose_tool_reached
- gripper_closed
fps: 3
kuka:
image_obs_keys:
- image
depth_obs_keys:
- null
state_obs_keys:
- clip_function_input/base_pose_tool_reached
- gripper_closed
fps: 10
bridge_openx:
image_obs_keys:
- image
depth_obs_keys:
- null
state_obs_keys:
- EEF_state
- gripper_state
fps: 5
taco_play:
image_obs_keys:
- rgb_static
- rgb_gripper
depth_obs_keys:
- depth_static
- depth_gripper
state_obs_keys:
- state_eef
- state_gripper
fps: 15
jaco_play:
image_obs_keys:
- image
- image_wrist
depth_obs_keys:
- null
state_obs_keys:
- state_eef
- state_gripper
fps: 10
berkeley_cable_routing:
image_obs_keys:
- image
- top_image
- wrist45_image
- wrist225_image
depth_obs_keys:
- null
state_obs_keys:
- robot_state
fps: 10
roboturk:
image_obs_keys:
- front_rgb
depth_obs_keys:
- null
state_obs_keys:
- null
fps: 10
nyu_door_opening_surprising_effectiveness:
image_obs_keys:
- image
depth_obs_keys:
- null
state_obs_keys:
- null
fps: 3
viola:
image_obs_keys:
- agentview_rgb
- eye_in_hand_rgb
depth_obs_keys:
- null
state_obs_keys:
- joint_states
- gripper_states
fps: 20
berkeley_autolab_ur5:
image_obs_keys:
- image
- hand_image
depth_obs_keys:
- image_with_depth
state_obs_keys:
- state
fps: 5
toto:
image_obs_keys:
- image
depth_obs_keys:
- null
state_obs_keys:
- state
fps: 30
language_table:
image_obs_keys:
- rgb
depth_obs_keys:
- null
state_obs_keys:
- effector_translation
fps: 10
columbia_cairlab_pusht_real:
image_obs_keys:
- image
- wrist_image
depth_obs_keys:
- null
state_obs_keys:
- robot_state
fps: 10
stanford_kuka_multimodal_dataset_converted_externally_to_rlds:
image_obs_keys:
- image
depth_obs_keys:
- depth_image
state_obs_keys:
- ee_position
- ee_orientation
fps: 20
nyu_rot_dataset_converted_externally_to_rlds:
image_obs_keys:
- image
depth_obs_keys:
- null
state_obs_keys:
- eef_state
- gripper_state
fps: 3
io_ai_tech:
image_obs_keys:
- image
- image_fisheye
- image_left_side
- image_right_side
depth_obs_keys:
- null
state_obs_keys:
- state
fps: 3
stanford_hydra_dataset_converted_externally_to_rlds:
image_obs_keys:
- image
- wrist_image
depth_obs_keys:
- null
state_obs_keys:
- eef_state
- gripper_state
fps: 10
austin_buds_dataset_converted_externally_to_rlds:
image_obs_keys:
- image
- wrist_image
depth_obs_keys:
- null
state_obs_keys:
- state
fps: 20
nyu_franka_play_dataset_converted_externally_to_rlds:
image_obs_keys:
- image
- image_additional_view
depth_obs_keys:
- depth
- depth_additional_view
state_obs_keys:
- eef_state
fps: 3
maniskill_dataset_converted_externally_to_rlds:
image_obs_keys:
- image
- wrist_image
depth_obs_keys:
- depth
- wrist_depth
state_obs_keys:
- tcp_pose
- gripper_state
fps: 20
furniture_bench_dataset_converted_externally_to_rlds:
image_obs_keys:
- image
- wrist_image
depth_obs_keys:
- null
state_obs_keys:
- state
fps: 10
cmu_franka_exploration_dataset_converted_externally_to_rlds:
image_obs_keys:
- highres_image
depth_obs_keys:
- null
state_obs_keys:
- null
fps: 10
ucsd_kitchen_dataset_converted_externally_to_rlds:
image_obs_keys:
- image
depth_obs_keys:
- null
state_obs_keys:
- joint_state
fps: 2
ucsd_pick_and_place_dataset_converted_externally_to_rlds:
image_obs_keys:
- image
depth_obs_keys:
- null
state_obs_keys:
- eef_state
- gripper_state
fps: 3
spoc:
image_obs_keys:
- image
- image_manipulation
depth_obs_keys:
- null
state_obs_keys:
- null
fps: 3
austin_sailor_dataset_converted_externally_to_rlds:
image_obs_keys:
- image
- wrist_image
depth_obs_keys:
- null
state_obs_keys:
- state
fps: 20
austin_sirius_dataset_converted_externally_to_rlds:
image_obs_keys:
- image
- wrist_image
depth_obs_keys:
- null
state_obs_keys:
- state
fps: 20
bc_z:
image_obs_keys:
- image
depth_obs_keys:
- null
state_obs_keys:
- present/xyz
- present/axis_angle
- present/sensed_close
fps: 10
utokyo_pr2_opening_fridge_converted_externally_to_rlds:
image_obs_keys:
- image
depth_obs_keys:
- null
state_obs_keys:
- eef_state
- gripper_state
fps: 10
utokyo_pr2_tabletop_manipulation_converted_externally_to_rlds:
image_obs_keys:
- image
depth_obs_keys:
- null
state_obs_keys:
- eef_state
- gripper_state
fps: 10
utokyo_xarm_pick_and_place_converted_externally_to_rlds:
image_obs_keys:
- image
- image2
- hand_image
depth_obs_keys:
- null
state_obs_keys:
- end_effector_pose
fps: 10
utokyo_xarm_bimanual_converted_externally_to_rlds:
image_obs_keys:
- image
depth_obs_keys:
- null
state_obs_keys:
- pose_r
fps: 10
robo_net:
image_obs_keys:
- image
- image1
depth_obs_keys:
- null
state_obs_keys:
- eef_state
- gripper_state
fps: 1
robo_set:
image_obs_keys:
- image_left
- image_right
- image_wrist
depth_obs_keys:
- null
state_obs_keys:
- state
- state_velocity
fps: 5
berkeley_mvp_converted_externally_to_rlds:
image_obs_keys:
- hand_image
depth_obs_keys:
- null
state_obs_keys:
- gripper
- pose
- joint_pos
fps: 5
berkeley_rpt_converted_externally_to_rlds:
image_obs_keys:
- hand_image
depth_obs_keys:
- null
state_obs_keys:
- joint_pos
- gripper
fps: 30
kaist_nonprehensile_converted_externally_to_rlds:
image_obs_keys:
- image
depth_obs_keys:
- null
state_obs_keys:
- state
fps: 10
stanford_mask_vit_converted_externally_to_rlds:
image_obs_keys:
- image
depth_obs_keys:
- null
state_obs_keys:
- eef_state
- gripper_state
tokyo_u_lsmo_converted_externally_to_rlds:
image_obs_keys:
- image
depth_obs_keys:
- null
state_obs_keys:
- eef_state
- gripper_state
fps: 10
dlr_sara_pour_converted_externally_to_rlds:
image_obs_keys:
- image
depth_obs_keys:
- null
state_obs_keys:
- state
fps: 10
dlr_sara_grid_clamp_converted_externally_to_rlds:
image_obs_keys:
- image
depth_obs_keys:
- null
state_obs_keys:
- state
fps: 10
dlr_edan_shared_control_converted_externally_to_rlds:
image_obs_keys:
- image
depth_obs_keys:
- null
state_obs_keys:
- state
fps: 5
asu_table_top_converted_externally_to_rlds:
image_obs_keys:
- image
depth_obs_keys:
- null
state_obs_keys:
- eef_state
- gripper_state
fps: 12.5
stanford_robocook_converted_externally_to_rlds:
image_obs_keys:
- image_1
- image_2
depth_obs_keys:
- depth_1
- depth_2
state_obs_keys:
- eef_state
- gripper_state
fps: 5
imperialcollege_sawyer_wrist_cam:
image_obs_keys:
- image
- wrist_image
depth_obs_keys:
- null
state_obs_keys:
- state
fps: 10
iamlab_cmu_pickup_insert_converted_externally_to_rlds:
image_obs_keys:
- image
- wrist_image
depth_obs_keys:
- null
state_obs_keys:
- joint_state
- gripper_state
fps: 20
uiuc_d3field:
image_obs_keys:
- image_1
- image_2
depth_obs_keys:
- depth_1
- depth_2
state_obs_keys:
- null
fps: 1
utaustin_mutex:
image_obs_keys:
- image
- wrist_image
depth_obs_keys:
- null
state_obs_keys:
- state
fps: 20
berkeley_fanuc_manipulation:
image_obs_keys:
- image
- wrist_image
depth_obs_keys:
- null
state_obs_keys:
- joint_state
- gripper_state
fps: 10
cmu_playing_with_food:
image_obs_keys:
- image
- finger_vision_1
depth_obs_keys:
- null
state_obs_keys:
- state
fps: 10
cmu_play_fusion:
image_obs_keys:
- image
depth_obs_keys:
- null
state_obs_keys:
- state
fps: 5
cmu_stretch:
image_obs_keys:
- image
depth_obs_keys:
- null
state_obs_keys:
- eef_state
- gripper_state
fps: 10
berkeley_gnm_recon:
image_obs_keys:
- image
depth_obs_keys:
- null
state_obs_keys:
- state
- position
- yaw
fps: 3
berkeley_gnm_cory_hall:
image_obs_keys:
- image
depth_obs_keys:
- null
state_obs_keys:
- state
- position
- yaw
fps: 5
berkeley_gnm_sac_son:
image_obs_keys:
- image
depth_obs_keys:
- null
state_obs_keys:
- state
- position
- yaw
fps: 10
droid:
image_obs_keys:
- exterior_image_1_left
- exterior_image_2_left
- wrist_image_left
depth_obs_keys:
- null
state_obs_keys:
- proprio
fps: 15
droid_100:
image_obs_keys:
- exterior_image_1_left
- exterior_image_2_left
- wrist_image_left
depth_obs_keys:
- null
state_obs_keys:
- proprio
fps: 15
fmb:
image_obs_keys:
- image_side_1
- image_side_2
- image_wrist_1
- image_wrist_2
depth_obs_keys:
- image_side_1_depth
- image_side_2_depth
- image_wrist_1_depth
- image_wrist_2_depth
state_obs_keys:
- proprio
fps: 10
dobbe:
image_obs_keys:
- wrist_image
depth_obs_keys:
- null
state_obs_keys:
- proprio
fps: 3.75
usc_cloth_sim_converted_externally_to_rlds:
image_obs_keys:
- image
depth_obs_keys:
- null
state_obs_keys:
- null
fps: 10
plex_robosuite:
image_obs_keys:
- image
- wrist_image
depth_obs_keys:
- null
state_obs_keys:
- state
fps: 20
conq_hose_manipulation:
image_obs_keys:
- frontleft_fisheye_image
- frontright_fisheye_image
- hand_color_image
depth_obs_keys:
- null
state_obs_keys:
- state
fps: 30
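A hedged sketch of consuming one entry of this new config file; the file path below is an assumption based on the `push_dataset_to_hub/openx` import paths used elsewhere in this comparison.
```python
import yaml

# Assumed location of the YAML shown above.
path = "lerobot/common/datasets/push_dataset_to_hub/openx/configs.yaml"
with open(path) as f:
    cfg = yaml.safe_load(f)["OPENX_DATASET_CONFIGS"]

taco = cfg["taco_play"]
print(taco["image_obs_keys"])  # ['rgb_static', 'rgb_gripper']
print(taco["state_obs_keys"])  # ['state_eef', 'state_gripper']
print(taco["fps"])             # 15
```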

View File

@@ -0,0 +1,106 @@
#!/usr/bin/env python
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
NOTE(YL): Adapted from:
Octo: https://github.com/octo-models/octo/blob/main/octo/data/utils/data_utils.py
data_utils.py
Additional utils for data processing.
"""
from typing import Any, Dict, List
import tensorflow as tf
def binarize_gripper_actions(actions: tf.Tensor) -> tf.Tensor:
"""
Converts gripper actions from continuous to binary values (0 and 1).
We exploit the fact that most of the time, the gripper is fully open (near 1.0) or fully closed (near 0.0). As it
transitions between the two, it sometimes passes through a few intermediate values. We relabel those intermediate
values based on the state that is reached _after_ those intermediate values.
In the edge case that the trajectory ends with an intermediate value, we give up on binarizing and relabel that
chunk of intermediate values as the last action in the trajectory.
The `scan_fn` implements the following logic:
new_actions = np.empty_like(actions)
carry = actions[-1]
for i in reversed(range(actions.shape[0])):
if in_between_mask[i]:
carry = carry
else:
carry = float(open_mask[i])
new_actions[i] = carry
"""
open_mask, closed_mask = actions > 0.95, actions < 0.05
in_between_mask = tf.logical_not(tf.logical_or(open_mask, closed_mask))
is_open_float = tf.cast(open_mask, tf.float32)
def scan_fn(carry, i):
return tf.cond(in_between_mask[i], lambda: tf.cast(carry, tf.float32), lambda: is_open_float[i])
return tf.scan(scan_fn, tf.range(tf.shape(actions)[0]), actions[-1], reverse=True)
def invert_gripper_actions(actions: tf.Tensor) -> tf.Tensor:
return 1 - actions
def rel2abs_gripper_actions(actions: tf.Tensor) -> tf.Tensor:
"""
Converts relative gripper actions (+1 for closing, -1 for opening) to absolute actions (0 = closed; 1 = open).
Assumes that the first relative gripper is not redundant (i.e. close when already closed)!
"""
# Note =>> -1 for closing, 1 for opening, 0 for no change
opening_mask, closing_mask = actions < -0.1, actions > 0.1
thresholded_actions = tf.where(opening_mask, 1, tf.where(closing_mask, -1, 0))
def scan_fn(carry, i):
return tf.cond(thresholded_actions[i] == 0, lambda: carry, lambda: thresholded_actions[i])
# If no relative grasp, assumes open for whole trajectory
start = -1 * thresholded_actions[tf.argmax(thresholded_actions != 0, axis=0)]
start = tf.cond(start == 0, lambda: 1, lambda: start)
# Note =>> -1 for closed, 1 for open
new_actions = tf.scan(scan_fn, tf.range(tf.shape(actions)[0]), start)
new_actions = tf.cast(new_actions, tf.float32) / 2 + 0.5
return new_actions
# === Bridge-V2 =>> Dataset-Specific Transform ===
def relabel_bridge_actions(traj: Dict[str, Any]) -> Dict[str, Any]:
"""Relabels actions to use reached proprioceptive state; discards last timestep (no-action)."""
movement_actions = traj["observation"]["state"][1:, :6] - traj["observation"]["state"][:-1, :6]
traj_truncated = tf.nest.map_structure(lambda x: x[:-1], traj)
traj_truncated["action"] = tf.concat([movement_actions, traj["action"][:-1, -1:]], axis=1)
return traj_truncated
# === RLDS Dataset Initialization Utilities ===
def pprint_data_mixture(dataset_kwargs_list: List[Dict[str, Any]], dataset_weights: List[int]) -> None:
print("\n######################################################################################")
print(f"# Loading the following {len(dataset_kwargs_list)} datasets (incl. sampling weight):{'': >24} #")
for dataset_kwargs, weight in zip(dataset_kwargs_list, dataset_weights, strict=False):
pad = 80 - len(dataset_kwargs["name"])
print(f"# {dataset_kwargs['name']}: {weight:=>{pad}f} #")
print("######################################################################################\n")

View File

@@ -0,0 +1,200 @@
#!/usr/bin/env python
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
NOTE(YL): Adapted from:
OpenVLA: https://github.com/openvla/openvla
Episode transforms for DROID dataset.
"""
from typing import Any, Dict
import tensorflow as tf
import tensorflow_graphics.geometry.transformation as tfg
def rmat_to_euler(rot_mat):
return tfg.euler.from_rotation_matrix(rot_mat)
def euler_to_rmat(euler):
return tfg.rotation_matrix_3d.from_euler(euler)
def invert_rmat(rot_mat):
return tfg.rotation_matrix_3d.inverse(rot_mat)
def rotmat_to_rot6d(mat):
"""
Converts rotation matrix to R6 rotation representation (first two rows in rotation matrix).
Args:
mat: rotation matrix
Returns: 6d vector (first two rows of rotation matrix)
"""
r6 = mat[..., :2, :]
r6_0, r6_1 = r6[..., 0, :], r6[..., 1, :]
r6_flat = tf.concat([r6_0, r6_1], axis=-1)
return r6_flat
def velocity_act_to_wrist_frame(velocity, wrist_in_robot_frame):
"""
Translates velocity actions (translation + rotation) from base frame of the robot to wrist frame.
Args:
velocity: 6d velocity action (3 x translation, 3 x rotation)
wrist_in_robot_frame: 6d pose of the end-effector in robot base frame
Returns: 9d velocity action in robot wrist frame (3 x translation, 6 x rotation as R6)
"""
r_frame = euler_to_rmat(wrist_in_robot_frame[:, 3:6])
r_frame_inv = invert_rmat(r_frame)
# world to wrist: dT_pi = R^-1 dT_rbt
vel_t = (r_frame_inv @ velocity[:, :3][..., None])[..., 0]
# world to wrist: dR_pi = R^-1 dR_rbt R
dr_ = euler_to_rmat(velocity[:, 3:6])
dr_ = r_frame_inv @ (dr_ @ r_frame)
dr_r6 = rotmat_to_rot6d(dr_)
return tf.concat([vel_t, dr_r6], axis=-1)
def rand_swap_exterior_images(img1, img2):
"""
Randomly swaps the two exterior images (for training with single exterior input).
"""
return tf.cond(tf.random.uniform(shape=[]) > 0.5, lambda: (img1, img2), lambda: (img2, img1))
def droid_baseact_transform(trajectory: Dict[str, Any]) -> Dict[str, Any]:
"""
DROID dataset transformation for actions expressed in *base* frame of the robot.
"""
dt = trajectory["action_dict"]["cartesian_velocity"][:, :3]
dr_ = trajectory["action_dict"]["cartesian_velocity"][:, 3:6]
trajectory["action"] = tf.concat(
(
dt,
dr_,
1 - trajectory["action_dict"]["gripper_position"],
),
axis=-1,
)
trajectory["observation"]["exterior_image_1_left"], trajectory["observation"]["exterior_image_2_left"] = (
rand_swap_exterior_images(
trajectory["observation"]["exterior_image_1_left"],
trajectory["observation"]["exterior_image_2_left"],
)
)
trajectory["observation"]["proprio"] = tf.concat(
(
trajectory["observation"]["cartesian_position"],
trajectory["observation"]["gripper_position"],
),
axis=-1,
)
return trajectory
def droid_wristact_transform(trajectory: Dict[str, Any]) -> Dict[str, Any]:
"""
DROID dataset transformation for actions expressed in *wrist* frame of the robot.
"""
wrist_act = velocity_act_to_wrist_frame(
trajectory["action_dict"]["cartesian_velocity"], trajectory["observation"]["cartesian_position"]
)
trajectory["action"] = tf.concat(
(
wrist_act,
trajectory["action_dict"]["gripper_position"],
),
axis=-1,
)
trajectory["observation"]["exterior_image_1_left"], trajectory["observation"]["exterior_image_2_left"] = (
rand_swap_exterior_images(
trajectory["observation"]["exterior_image_1_left"],
trajectory["observation"]["exterior_image_2_left"],
)
)
trajectory["observation"]["proprio"] = tf.concat(
(
trajectory["observation"]["cartesian_position"],
trajectory["observation"]["gripper_position"],
),
axis=-1,
)
return trajectory
def droid_finetuning_transform(trajectory: Dict[str, Any]) -> Dict[str, Any]:
"""
DROID dataset transformation for actions expressed in *base* frame of the robot.
"""
dt = trajectory["action_dict"]["cartesian_velocity"][:, :3]
dr_ = trajectory["action_dict"]["cartesian_velocity"][:, 3:6]
trajectory["action"] = tf.concat(
(
dt,
dr_,
1 - trajectory["action_dict"]["gripper_position"],
),
axis=-1,
)
trajectory["observation"]["proprio"] = tf.concat(
(
trajectory["observation"]["cartesian_position"],
trajectory["observation"]["gripper_position"],
),
axis=-1,
)
return trajectory
def zero_action_filter(traj: Dict) -> bool:
"""
Filters transitions whose actions are all-0 (only relative actions, no gripper action).
Note: this filter is applied *after* action normalization, so need to compare to "normalized 0".
"""
droid_q01 = tf.convert_to_tensor(
[
-0.7776297926902771,
-0.5803514122962952,
-0.5795090794563293,
-0.6464047729969025,
-0.7041108310222626,
-0.8895104378461838,
]
)
droid_q99 = tf.convert_to_tensor(
[
0.7597932070493698,
0.5726242214441299,
0.7351000607013702,
0.6705610305070877,
0.6464948207139969,
0.8897542208433151,
]
)
droid_norm_0_act = (
2 * (tf.zeros_like(traj["action"][:, :6]) - droid_q01) / (droid_q99 - droid_q01 + 1e-8) - 1
)
return tf.reduce_any(tf.math.abs(traj["action"][:, :6] - droid_norm_0_act) > 1e-5)
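A hedged sanity check for `rotmat_to_rot6d` defined above. The import path again mirrors the `droid_utils` module referenced in `transforms.py`, and importing it requires `tensorflow_graphics`, which the module itself depends on. The identity rotation should map to its first two rows, i.e. `[1, 0, 0, 0, 1, 0]`.
```python
import tensorflow as tf

from lerobot.common.datasets.push_dataset_to_hub.openx.droid_utils import rotmat_to_rot6d

rot = tf.eye(3)[None]      # batch of one identity rotation matrix, shape (1, 3, 3)
r6 = rotmat_to_rot6d(rot)  # shape (1, 6)
print(r6.numpy())          # expected: [[1. 0. 0. 0. 1. 0.]]
```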

View File

@@ -0,0 +1,859 @@
#!/usr/bin/env python
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
NOTE(YL): Adapted from:
OpenVLA: https://github.com/openvla/openvla
Octo: https://github.com/octo-models/octo
transforms.py
Defines a registry of per-dataset standardization transforms for each dataset in Open-X Embodiment.
Transforms adopt the following structure:
Input: Dictionary of *batched* features (i.e., has leading time dimension)
Output: Dictionary `step` =>> {
"observation": {
<image_keys, depth_image_keys>
State (in chosen state representation)
},
"action": Action (in chosen action representation),
"language_instruction": str
}
"""
from typing import Any, Dict
import tensorflow as tf
from lerobot.common.datasets.push_dataset_to_hub.openx.data_utils import (
binarize_gripper_actions,
invert_gripper_actions,
rel2abs_gripper_actions,
relabel_bridge_actions,
)
def droid_baseact_transform_fn():
from lerobot.common.datasets.push_dataset_to_hub.openx.droid_utils import droid_baseact_transform
return droid_baseact_transform
def bridge_openx_dataset_transform(trajectory: Dict[str, Any]) -> Dict[str, Any]:
"""
Applies to version of Bridge V2 in Open X-Embodiment mixture.
Note =>> In original Bridge V2 dataset, the first timestep has an all-zero action, so we remove it!
"""
for key in trajectory:
if key == "traj_metadata":
continue
elif key in ["observation", "action"]:
for key2 in trajectory[key]:
trajectory[key][key2] = trajectory[key][key2][1:]
else:
trajectory[key] = trajectory[key][1:]
trajectory["action"] = tf.concat(
(
trajectory["action"]["world_vector"],
trajectory["action"]["rotation_delta"],
tf.cast(trajectory["action"]["open_gripper"][:, None], tf.float32),
),
axis=-1,
)
trajectory["language_instruction"] = trajectory["observation"]["natural_language_instruction"]
trajectory = relabel_bridge_actions(trajectory)
trajectory["observation"]["EEF_state"] = trajectory["observation"]["state"][:, :6]
trajectory["observation"]["gripper_state"] = trajectory["observation"]["state"][:, -1:]
return trajectory
def bridge_orig_dataset_transform(trajectory: Dict[str, Any]) -> Dict[str, Any]:
"""
Applies to original version of Bridge V2 from the official project website.
Note =>> In original Bridge V2 dataset, the first timestep has an all-zero action, so we remove it!
"""
for key in trajectory:
if key == "traj_metadata":
continue
elif key == "observation":
for key2 in trajectory[key]:
trajectory[key][key2] = trajectory[key][key2][1:]
else:
trajectory[key] = trajectory[key][1:]
trajectory["action"] = tf.concat(
[
trajectory["action"][:, :6],
binarize_gripper_actions(trajectory["action"][:, -1])[:, None],
],
axis=1,
)
trajectory = relabel_bridge_actions(trajectory)
trajectory["observation"]["EEF_state"] = trajectory["observation"]["state"][:, :6]
trajectory["observation"]["gripper_state"] = trajectory["observation"]["state"][:, -1:]
return trajectory
def ppgm_dataset_transform(trajectory: Dict[str, Any]) -> Dict[str, Any]:
trajectory["action"] = tf.concat(
[
trajectory["action"][:, :6],
binarize_gripper_actions(trajectory["action"][:, -1])[:, None],
],
axis=1,
)
trajectory["observation"]["EEF_state"] = trajectory["observation"]["cartesian_position"][:, :6]
trajectory["observation"]["gripper_state"] = trajectory["observation"]["gripper_position"][:, -1:]
return trajectory
def rt1_dataset_transform(trajectory: Dict[str, Any]) -> Dict[str, Any]:
# make gripper action absolute action, +1 = open, 0 = close
gripper_action = trajectory["action"]["gripper_closedness_action"][:, 0]
gripper_action = rel2abs_gripper_actions(gripper_action)
trajectory["action"] = tf.concat(
(
trajectory["action"]["world_vector"],
trajectory["action"]["rotation_delta"],
gripper_action[:, None],
),
axis=-1,
)
trajectory["language_instruction"] = trajectory["observation"]["natural_language_instruction"]
return trajectory
def kuka_dataset_transform(trajectory: Dict[str, Any]) -> Dict[str, Any]:
# make gripper action absolute action, +1 = open, 0 = close
gripper_action = trajectory["action"]["gripper_closedness_action"][:, 0]
gripper_action = rel2abs_gripper_actions(gripper_action)
trajectory["action"] = tf.concat(
(
trajectory["action"]["world_vector"],
trajectory["action"]["rotation_delta"],
gripper_action[:, None],
),
axis=-1,
)
# decode compressed state
eef_value = tf.io.decode_compressed(
trajectory["observation"]["clip_function_input/base_pose_tool_reached"],
compression_type="ZLIB",
)
eef_value = tf.io.decode_raw(eef_value, tf.float32)
trajectory["observation"]["clip_function_input/base_pose_tool_reached"] = tf.reshape(eef_value, (-1, 7))
gripper_value = tf.io.decode_compressed(
trajectory["observation"]["gripper_closed"], compression_type="ZLIB"
)
gripper_value = tf.io.decode_raw(gripper_value, tf.float32)
trajectory["observation"]["gripper_closed"] = tf.reshape(gripper_value, (-1, 1))
trajectory["language_instruction"] = trajectory["observation"]["natural_language_instruction"]
return trajectory
def taco_play_dataset_transform(trajectory: Dict[str, Any]) -> Dict[str, Any]:
trajectory["observation"]["state_eef"] = trajectory["observation"]["robot_obs"][:, :6]
trajectory["observation"]["state_gripper"] = trajectory["observation"]["robot_obs"][:, 7:8]
trajectory["action"] = trajectory["action"]["rel_actions_world"]
# invert gripper action + clip, +1 = open, 0 = close
trajectory["action"] = tf.concat(
(
trajectory["action"][:, :6],
tf.clip_by_value(trajectory["action"][:, -1:], 0, 1),
),
axis=-1,
)
trajectory["language_instruction"] = trajectory["observation"]["natural_language_instruction"]
return trajectory
def jaco_play_dataset_transform(trajectory: Dict[str, Any]) -> Dict[str, Any]:
trajectory["observation"]["state_eef"] = trajectory["observation"]["end_effector_cartesian_pos"][:, :6]
trajectory["observation"]["state_gripper"] = trajectory["observation"]["end_effector_cartesian_pos"][
:, -1:
]
# make gripper action absolute action, +1 = open, 0 = close
gripper_action = trajectory["action"]["gripper_closedness_action"][:, 0]
gripper_action = rel2abs_gripper_actions(gripper_action)
trajectory["action"] = tf.concat(
(
trajectory["action"]["world_vector"],
tf.zeros_like(trajectory["action"]["world_vector"]),
gripper_action[:, None],
),
axis=-1,
)
trajectory["language_instruction"] = trajectory["observation"]["natural_language_instruction"]
return trajectory
def berkeley_cable_routing_dataset_transform(trajectory: Dict[str, Any]) -> Dict[str, Any]:
trajectory["action"] = tf.concat(
(
trajectory["action"]["world_vector"],
trajectory["action"]["rotation_delta"],
tf.zeros_like(trajectory["action"]["world_vector"][:, :1]),
),
axis=-1,
)
trajectory["language_instruction"] = trajectory["observation"]["natural_language_instruction"]
return trajectory
def roboturk_dataset_transform(trajectory: Dict[str, Any]) -> Dict[str, Any]:
# invert absolute gripper action, +1 = open, 0 = close
gripper_action = invert_gripper_actions(
tf.clip_by_value(trajectory["action"]["gripper_closedness_action"], 0, 1)
)
trajectory["action"] = tf.concat(
(
trajectory["action"]["world_vector"],
trajectory["action"]["rotation_delta"],
gripper_action,
),
axis=-1,
)
trajectory["language_instruction"] = trajectory["observation"]["natural_language_instruction"]
trajectory["language_embedding"] = trajectory["observation"]["natural_language_embedding"]
return trajectory
def nyu_door_opening_dataset_transform(trajectory: Dict[str, Any]) -> Dict[str, Any]:
# make gripper action absolute action, +1 = open, 0 = close
gripper_action = trajectory["action"]["gripper_closedness_action"][:, 0]
gripper_action = rel2abs_gripper_actions(gripper_action)
trajectory["action"] = tf.concat(
(
trajectory["action"]["world_vector"],
trajectory["action"]["rotation_delta"],
gripper_action[:, None],
),
axis=-1,
)
trajectory["language_instruction"] = trajectory["observation"]["natural_language_instruction"]
return trajectory
def viola_dataset_transform(trajectory: Dict[str, Any]) -> Dict[str, Any]:
# make gripper action, +1 = open, 0 = close
gripper_action = trajectory["action"]["gripper_closedness_action"][:, None]
gripper_action = tf.clip_by_value(gripper_action, 0, 1)
gripper_action = invert_gripper_actions(gripper_action)
trajectory["action"] = tf.concat(
(
trajectory["action"]["world_vector"],
trajectory["action"]["rotation_delta"],
gripper_action,
),
axis=-1,
)
trajectory["language_instruction"] = trajectory["observation"]["natural_language_instruction"]
return trajectory
def berkeley_autolab_ur5_dataset_transform(trajectory: Dict[str, Any]) -> Dict[str, Any]:
trajectory["observation"]["state"] = trajectory["observation"]["robot_state"][:, 6:14]
# make gripper action absolute action, +1 = open, 0 = close
gripper_action = trajectory["action"]["gripper_closedness_action"]
gripper_action = rel2abs_gripper_actions(gripper_action)
trajectory["action"] = tf.concat(
(
trajectory["action"]["world_vector"],
trajectory["action"]["rotation_delta"],
gripper_action[:, None],
),
axis=-1,
)
trajectory["language_instruction"] = trajectory["observation"]["natural_language_instruction"]
return trajectory
def toto_dataset_transform(trajectory: Dict[str, Any]) -> Dict[str, Any]:
trajectory["action"] = tf.concat(
(
trajectory["action"]["world_vector"],
trajectory["action"]["rotation_delta"],
tf.cast(trajectory["action"]["open_gripper"][:, None], tf.float32),
),
axis=-1,
)
trajectory["language_instruction"] = trajectory["observation"]["natural_language_instruction"]
return trajectory
def language_table_dataset_transform(trajectory: Dict[str, Any]) -> Dict[str, Any]:
# default to "open" gripper
trajectory["action"] = tf.concat(
(
trajectory["action"],
tf.zeros_like(trajectory["action"]),
tf.zeros_like(trajectory["action"]),
tf.ones_like(trajectory["action"][:, :1]),
),
axis=-1,
)
# decode language instruction
instruction_bytes = trajectory["observation"]["instruction"]
instruction_encoded = tf.strings.unicode_encode(instruction_bytes, output_encoding="UTF-8")
# Remove trailing padding --> convert RaggedTensor to regular Tensor.
trajectory["language_instruction"] = tf.strings.split(instruction_encoded, "\x00")[:, :1].to_tensor()[
:, 0
]
return trajectory
def pusht_dataset_transform(trajectory: Dict[str, Any]) -> Dict[str, Any]:
trajectory["action"] = tf.concat(
(
trajectory["action"]["world_vector"],
trajectory["action"]["rotation_delta"],
trajectory["action"]["gripper_closedness_action"][:, None],
),
axis=-1,
)
trajectory["language_instruction"] = trajectory["observation"]["natural_language_instruction"]
return trajectory
def stanford_kuka_multimodal_dataset_transform(trajectory: Dict[str, Any]) -> Dict[str, Any]:
trajectory["observation"]["depth_image"] = trajectory["observation"]["depth_image"][..., 0]
trajectory["action"] = tf.concat(
(
trajectory["action"][:, :3],
tf.zeros_like(trajectory["action"][:, :3]),
trajectory["action"][:, -1:],
),
axis=-1,
)
return trajectory
def nyu_rot_dataset_transform(trajectory: Dict[str, Any]) -> Dict[str, Any]:
trajectory["observation"]["eef_state"] = trajectory["observation"]["state"][..., :6]
trajectory["observation"]["gripper_state"] = trajectory["observation"]["state"][..., -1:]
trajectory["action"] = trajectory["action"][..., :7]
return trajectory
def stanford_hydra_dataset_transform(trajectory: Dict[str, Any]) -> Dict[str, Any]:
# invert gripper action, +1 = open, 0 = close
trajectory["action"] = tf.concat(
(
trajectory["action"][:, :6],
invert_gripper_actions(trajectory["action"][:, -1:]),
),
axis=-1,
)
trajectory["observation"]["eef_state"] = tf.concat(
(
trajectory["observation"]["state"][:, :3],
trajectory["observation"]["state"][:, 7:10],
),
axis=-1,
)
trajectory["observation"]["gripper_state"] = trajectory["observation"]["state"][:, -3:-2]
return trajectory
def austin_buds_dataset_transform(trajectory: Dict[str, Any]) -> Dict[str, Any]:
# invert gripper action + clip, +1 = open, 0 = close
trajectory["action"] = tf.concat(
(
trajectory["action"][:, :6],
invert_gripper_actions(tf.clip_by_value(trajectory["action"][:, -1:], 0, 1)),
),
axis=-1,
)
trajectory["observation"]["state"] = trajectory["observation"]["state"][:, :8]
return trajectory
def nyu_franka_play_dataset_transform(trajectory: Dict[str, Any]) -> Dict[str, Any]:
trajectory["observation"]["depth"] = tf.cast(trajectory["observation"]["depth"][..., 0], tf.float32)
trajectory["observation"]["depth_additional_view"] = tf.cast(
trajectory["observation"]["depth_additional_view"][..., 0], tf.float32
)
trajectory["observation"]["eef_state"] = trajectory["observation"]["state"][:, -6:]
# clip gripper action, +1 = open, 0 = close
trajectory["action"] = tf.concat(
(
trajectory["action"][:, -8:-2],
tf.clip_by_value(trajectory["action"][:, -2:-1], 0, 1),
),
axis=-1,
)
return trajectory
def maniskill_dataset_transform(trajectory: Dict[str, Any]) -> Dict[str, Any]:
trajectory["observation"]["gripper_state"] = trajectory["observation"]["state"][..., 7:8]
return trajectory
def furniture_bench_dataset_transform(trajectory: Dict[str, Any]) -> Dict[str, Any]:
import tensorflow_graphics.geometry.transformation as tft
trajectory["observation"]["state"] = tf.concat(
(
trajectory["observation"]["state"][:, :7],
trajectory["observation"]["state"][:, -1:],
),
axis=-1,
)
# invert gripper action + clip, +1 = open, 0 = close
trajectory["action"] = tf.concat(
(
trajectory["action"][:, :3],
tft.euler.from_quaternion(trajectory["action"][:, 3:7]),
invert_gripper_actions(tf.clip_by_value(trajectory["action"][:, -1:], 0, 1)),
),
axis=-1,
)
return trajectory
def cmu_franka_exploration_dataset_transform(trajectory: Dict[str, Any]) -> Dict[str, Any]:
trajectory["action"] = trajectory["action"][..., :-1]
return trajectory
def ucsd_kitchen_dataset_transform(trajectory: Dict[str, Any]) -> Dict[str, Any]:
trajectory["observation"]["joint_state"] = trajectory["observation"]["state"][:, :7]
trajectory["action"] = trajectory["action"][..., :-1]
return trajectory
def ucsd_pick_place_dataset_transform(trajectory: Dict[str, Any]) -> Dict[str, Any]:
trajectory["observation"]["eef_state"] = trajectory["observation"]["state"][:, :6]
trajectory["observation"]["gripper_state"] = trajectory["observation"]["state"][:, -1:]
trajectory["action"] = tf.concat(
(
trajectory["action"][:, :3],
tf.zeros_like(trajectory["action"][:, :3]),
trajectory["action"][:, -1:],
),
axis=-1,
)
return trajectory
def austin_sailor_dataset_transform(trajectory: Dict[str, Any]) -> Dict[str, Any]:
# invert gripper action + clip, +1 = open, 0 = close
trajectory["action"] = tf.concat(
(
trajectory["action"][:, :6],
invert_gripper_actions(tf.clip_by_value(trajectory["action"][:, -1:], 0, 1)),
),
axis=-1,
)
return trajectory
def austin_sirius_dataset_transform(trajectory: Dict[str, Any]) -> Dict[str, Any]:
# invert gripper action + clip, +1 = open, 0 = close
trajectory["action"] = tf.concat(
(
trajectory["action"][:, :6],
invert_gripper_actions(tf.clip_by_value(trajectory["action"][:, -1:], 0, 1)),
),
axis=-1,
)
return trajectory
def bc_z_dataset_transform(trajectory: Dict[str, Any]) -> Dict[str, Any]:
trajectory["action"] = tf.concat(
(
trajectory["action"]["future/xyz_residual"][:, :3],
trajectory["action"]["future/axis_angle_residual"][:, :3],
invert_gripper_actions(tf.cast(trajectory["action"]["future/target_close"][:, :1], tf.float32)),
),
axis=-1,
)
trajectory["language_instruction"] = trajectory["observation"]["natural_language_instruction"]
return trajectory
def tokyo_pr2_opening_fridge_dataset_transform(trajectory: Dict[str, Any]) -> Dict[str, Any]:
trajectory["observation"]["eef_state"] = trajectory["observation"]["state"][:, :6]
trajectory["observation"]["gripper_state"] = trajectory["observation"]["state"][:, -1:]
trajectory["action"] = trajectory["action"][..., :-1]
return trajectory
def tokyo_pr2_tabletop_manipulation_dataset_transform(trajectory: Dict[str, Any]) -> Dict[str, Any]:
trajectory["observation"]["eef_state"] = trajectory["observation"]["state"][:, :6]
trajectory["observation"]["gripper_state"] = trajectory["observation"]["state"][:, -1:]
trajectory["action"] = trajectory["action"][..., :-1]
return trajectory
def utokyo_xarm_bimanual_dataset_transform(trajectory: Dict[str, Any]) -> Dict[str, Any]:
trajectory["action"] = trajectory["action"][..., -7:]
return trajectory
def robo_net_dataset_transform(trajectory: Dict[str, Any]) -> Dict[str, Any]:
trajectory["observation"]["eef_state"] = tf.concat(
(
trajectory["observation"]["state"][:, :4],
tf.zeros_like(trajectory["observation"]["state"][:, :2]),
),
axis=-1,
)
trajectory["observation"]["gripper_state"] = trajectory["observation"]["state"][:, -1:]
trajectory["action"] = tf.concat(
(
trajectory["action"][:, :4],
tf.zeros_like(trajectory["action"][:, :2]),
trajectory["action"][:, -1:],
),
axis=-1,
)
return trajectory
def berkeley_mvp_dataset_transform(trajectory: Dict[str, Any]) -> Dict[str, Any]:
"""
trajectory["observation"]["state"] = tf.concat((
tf.cast(trajectory["observation"]["gripper"][:, None], tf.float32),
trajectory["observation"]["pose"],
trajectory["observation"]["joint_pos"],),
axis=-1,)
"""
trajectory["observation"]["gripper"] = tf.cast(trajectory["observation"]["gripper"][:, None], tf.float32)
return trajectory
def berkeley_rpt_dataset_transform(trajectory: Dict[str, Any]) -> Dict[str, Any]:
trajectory["observation"]["gripper"] = tf.cast(trajectory["observation"]["gripper"][:, None], tf.float32)
return trajectory
def kaist_nonprehensible_dataset_transform(trajectory: Dict[str, Any]) -> Dict[str, Any]:
trajectory["observation"]["state"] = trajectory["observation"]["state"][:, -7:]
trajectory["action"] = tf.concat(
(
trajectory["action"][:, :6],
tf.zeros_like(trajectory["action"][:, :1]),
),
axis=-1,
)
return trajectory
def stanford_mask_vit_dataset_transform(trajectory: Dict[str, Any]) -> Dict[str, Any]:
trajectory["observation"]["eef_state"] = tf.concat(
(
trajectory["observation"]["end_effector_pose"][:, :4],
tf.zeros_like(trajectory["observation"]["end_effector_pose"][:, :2]),
),
axis=-1,
)
trajectory["observation"]["gripper_state"] = trajectory["observation"]["end_effector_pose"][:, -1:]
trajectory["action"] = tf.concat(
(
trajectory["action"][:, :4],
tf.zeros_like(trajectory["action"][:, :2]),
trajectory["action"][:, -1:],
),
axis=-1,
)
return trajectory
def tokyo_lsmo_dataset_transform(trajectory: Dict[str, Any]) -> Dict[str, Any]:
trajectory["observation"]["eef_state"] = trajectory["observation"]["state"][:, :6]
trajectory["observation"]["gripper_state"] = trajectory["observation"]["state"][:, -1:]
return trajectory
def dlr_sara_grid_clamp_dataset_transform(trajectory: Dict[str, Any]) -> Dict[str, Any]:
trajectory["observation"]["state"] = trajectory["observation"]["state"][:, :6]
return trajectory
def dlr_edan_shared_control_dataset_transform(trajectory: Dict[str, Any]) -> Dict[str, Any]:
# invert gripper action, +1 = open, 0 = close
trajectory["action"] = tf.concat(
(
trajectory["action"][:, :6],
invert_gripper_actions(trajectory["action"][:, -1:]),
),
axis=-1,
)
return trajectory
def asu_table_top_dataset_transform(trajectory: Dict[str, Any]) -> Dict[str, Any]:
trajectory["observation"]["eef_state"] = trajectory["ground_truth_states"]["EE"]
trajectory["observation"]["gripper_state"] = trajectory["observation"]["state"][:, -1:]
return trajectory
def robocook_dataset_transform(trajectory: Dict[str, Any]) -> Dict[str, Any]:
trajectory["observation"]["eef_state"] = trajectory["observation"]["state"][:, :6]
trajectory["observation"]["gripper_state"] = trajectory["observation"]["state"][:, -1:]
return trajectory
def imperial_wristcam_dataset_transform(trajectory: Dict[str, Any]) -> Dict[str, Any]:
trajectory["action"] = trajectory["action"][..., :-1]
return trajectory
def iamlab_pick_insert_dataset_transform(trajectory: Dict[str, Any]) -> Dict[str, Any]:
import tensorflow_graphics.geometry.transformation as tft
trajectory["observation"]["joint_state"] = trajectory["observation"]["state"][:, :7]
trajectory["observation"]["gripper_state"] = trajectory["observation"]["state"][:, 7:8]
trajectory["action"] = tf.concat(
(
trajectory["action"][:, :3],
tft.euler.from_quaternion(trajectory["action"][:, 3:7]),
trajectory["action"][:, 7:8],
),
axis=-1,
)
return trajectory
def uiuc_d3field_dataset_transform(trajectory: Dict[str, Any]) -> Dict[str, Any]:
trajectory["action"] = tf.concat(
(
trajectory["action"],
tf.zeros_like(trajectory["action"]),
tf.zeros_like(trajectory["action"][:, :1]),
),
axis=-1,
)
return trajectory
def utaustin_mutex_dataset_transform(trajectory: Dict[str, Any]) -> Dict[str, Any]:
trajectory["observation"]["state"] = trajectory["observation"]["state"][:, :8]
# invert gripper action + clip, +1 = open, 0 = close
trajectory["action"] = tf.concat(
(
trajectory["action"][:, :6],
invert_gripper_actions(tf.clip_by_value(trajectory["action"][:, -1:], 0, 1)),
),
axis=-1,
)
return trajectory
def berkeley_fanuc_dataset_transform(trajectory: Dict[str, Any]) -> Dict[str, Any]:
trajectory["observation"]["joint_state"] = trajectory["observation"]["state"][:, :6]
trajectory["observation"]["gripper_state"] = trajectory["observation"]["state"][:, 6:7]
# dataset does not store gripper actions, so use gripper state info, invert so +1 = open, 0 = close
trajectory["action"] = tf.concat(
(
trajectory["action"],
invert_gripper_actions(trajectory["observation"]["gripper_state"]),
),
axis=-1,
)
return trajectory
def cmu_playing_with_food_dataset_transform(trajectory: Dict[str, Any]) -> Dict[str, Any]:
import tensorflow_graphics.geometry.transformation as tft
trajectory["action"] = tf.concat(
(
trajectory["action"][:, :3],
tft.euler.from_quaternion(trajectory["action"][:, 3:7]),
trajectory["action"][:, -1:],
),
axis=-1,
)
return trajectory
def playfusion_dataset_transform(trajectory: Dict[str, Any]) -> Dict[str, Any]:
trajectory["action"] = tf.concat(
(
trajectory["action"][:, :3],
trajectory["action"][:, -4:],
),
axis=-1,
)
return trajectory
def cmu_stretch_dataset_transform(trajectory: Dict[str, Any]) -> Dict[str, Any]:
trajectory["observation"]["eef_state"] = tf.concat(
(
trajectory["observation"]["state"][:, :3],
tf.zeros_like(trajectory["observation"]["state"][:, :3]),
),
axis=-1,
)
trajectory["observation"]["gripper_state"] = trajectory["observation"]["state"][:, -1:]
trajectory["action"] = trajectory["action"][..., :-1]
return trajectory
def gnm_dataset_transform(trajectory: Dict[str, Any]) -> Dict[str, Any]:
trajectory["observation"]["state"] = tf.concat(
(
trajectory["observation"]["position"],
tf.zeros_like(trajectory["observation"]["state"][:, :3]),
trajectory["observation"]["yaw"],
),
axis=-1,
)
trajectory["action"] = tf.concat(
(
trajectory["action"],
tf.zeros_like(trajectory["action"]),
tf.zeros_like(trajectory["action"]),
tf.zeros_like(trajectory["action"][:, :1]),
),
axis=-1,
)
return trajectory
def fmb_transform(trajectory: Dict[str, Any]) -> Dict[str, Any]:
# every input feature is batched, ie has leading batch dimension
trajectory["observation"]["proprio"] = tf.concat(
(
trajectory["observation"]["eef_pose"],
trajectory["observation"]["state_gripper_pose"][..., None],
),
axis=-1,
)
return trajectory
def dobbe_dataset_transform(trajectory: Dict[str, Any]) -> Dict[str, Any]:
# every input feature is batched, ie has leading batch dimension
trajectory["observation"]["proprio"] = trajectory["observation"]["state"]
return trajectory
def robo_set_dataset_transform(trajectory: Dict[str, Any]) -> Dict[str, Any]:
# gripper action is in -1...1 --> clip to 0...1, flip
gripper_action = trajectory["action"][:, -1:]
gripper_action = invert_gripper_actions(tf.clip_by_value(gripper_action, 0, 1))
trajectory["action"] = tf.concat(
(
trajectory["action"][:, :7],
gripper_action,
),
axis=-1,
)
return trajectory
def identity_transform(trajectory: Dict[str, Any]) -> Dict[str, Any]:
return trajectory
# === Registry ===
OPENX_STANDARDIZATION_TRANSFORMS = {
"bridge_openx": bridge_openx_dataset_transform,
"bridge_orig": bridge_orig_dataset_transform,
"bridge_dataset": bridge_orig_dataset_transform,
"ppgm": ppgm_dataset_transform,
"ppgm_static": ppgm_dataset_transform,
"ppgm_wrist": ppgm_dataset_transform,
"fractal20220817_data": rt1_dataset_transform,
"kuka": kuka_dataset_transform,
"taco_play": taco_play_dataset_transform,
"jaco_play": jaco_play_dataset_transform,
"berkeley_cable_routing": berkeley_cable_routing_dataset_transform,
"roboturk": roboturk_dataset_transform,
"nyu_door_opening_surprising_effectiveness": nyu_door_opening_dataset_transform,
"viola": viola_dataset_transform,
"berkeley_autolab_ur5": berkeley_autolab_ur5_dataset_transform,
"toto": toto_dataset_transform,
"language_table": language_table_dataset_transform,
"columbia_cairlab_pusht_real": pusht_dataset_transform,
"stanford_kuka_multimodal_dataset_converted_externally_to_rlds": stanford_kuka_multimodal_dataset_transform,
"nyu_rot_dataset_converted_externally_to_rlds": nyu_rot_dataset_transform,
"stanford_hydra_dataset_converted_externally_to_rlds": stanford_hydra_dataset_transform,
"austin_buds_dataset_converted_externally_to_rlds": austin_buds_dataset_transform,
"nyu_franka_play_dataset_converted_externally_to_rlds": nyu_franka_play_dataset_transform,
"maniskill_dataset_converted_externally_to_rlds": maniskill_dataset_transform,
"furniture_bench_dataset_converted_externally_to_rlds": furniture_bench_dataset_transform,
"cmu_franka_exploration_dataset_converted_externally_to_rlds": cmu_franka_exploration_dataset_transform,
"ucsd_kitchen_dataset_converted_externally_to_rlds": ucsd_kitchen_dataset_transform,
"ucsd_pick_and_place_dataset_converted_externally_to_rlds": ucsd_pick_place_dataset_transform,
"austin_sailor_dataset_converted_externally_to_rlds": austin_sailor_dataset_transform,
"austin_sirius_dataset_converted_externally_to_rlds": austin_sirius_dataset_transform,
"bc_z": bc_z_dataset_transform,
"utokyo_pr2_opening_fridge_converted_externally_to_rlds": tokyo_pr2_opening_fridge_dataset_transform,
"utokyo_pr2_tabletop_manipulation_converted_externally_to_rlds": tokyo_pr2_tabletop_manipulation_dataset_transform,
"utokyo_xarm_pick_and_place_converted_externally_to_rlds": identity_transform,
"utokyo_xarm_bimanual_converted_externally_to_rlds": utokyo_xarm_bimanual_dataset_transform,
"robo_net": robo_net_dataset_transform,
"berkeley_mvp_converted_externally_to_rlds": berkeley_mvp_dataset_transform,
"berkeley_rpt_converted_externally_to_rlds": berkeley_rpt_dataset_transform,
"kaist_nonprehensile_converted_externally_to_rlds": kaist_nonprehensible_dataset_transform,
"stanford_mask_vit_converted_externally_to_rlds": stanford_mask_vit_dataset_transform,
"tokyo_u_lsmo_converted_externally_to_rlds": tokyo_lsmo_dataset_transform,
"dlr_sara_pour_converted_externally_to_rlds": identity_transform,
"dlr_sara_grid_clamp_converted_externally_to_rlds": dlr_sara_grid_clamp_dataset_transform,
"dlr_edan_shared_control_converted_externally_to_rlds": dlr_edan_shared_control_dataset_transform,
"asu_table_top_converted_externally_to_rlds": asu_table_top_dataset_transform,
"stanford_robocook_converted_externally_to_rlds": robocook_dataset_transform,
"imperialcollege_sawyer_wrist_cam": imperial_wristcam_dataset_transform,
"iamlab_cmu_pickup_insert_converted_externally_to_rlds": iamlab_pick_insert_dataset_transform,
"uiuc_d3field": uiuc_d3field_dataset_transform,
"utaustin_mutex": utaustin_mutex_dataset_transform,
"berkeley_fanuc_manipulation": berkeley_fanuc_dataset_transform,
"cmu_playing_with_food": cmu_playing_with_food_dataset_transform,
"cmu_play_fusion": playfusion_dataset_transform,
"cmu_stretch": cmu_stretch_dataset_transform,
"berkeley_gnm_recon": gnm_dataset_transform,
"berkeley_gnm_cory_hall": gnm_dataset_transform,
"berkeley_gnm_sac_son": gnm_dataset_transform,
"droid": droid_baseact_transform_fn(),
"droid_100": droid_baseact_transform_fn(), # first 100 episodes of droid
"fmb": fmb_transform,
"dobbe": dobbe_dataset_transform,
"robo_set": robo_set_dataset_transform,
"usc_cloth_sim_converted_externally_to_rlds": identity_transform,
"plex_robosuite": identity_transform,
"conq_hose_manipulation": identity_transform,
"io_ai_tech": identity_transform,
"spoc": identity_transform,
}
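For reference, here is a minimal sketch (not part of the diff) of how the registry above can be used to standardize a raw RLDS trajectory dict. The dataset name and tensor contents are hypothetical, and the helpers (e.g. `invert_gripper_actions`) are the ones defined earlier in this module:

```python
import tensorflow as tf

# Hypothetical usage sketch: pick a transform from the registry and apply it to a
# batched trajectory dict shaped like the ones the functions above expect.
dataset_name = "utaustin_mutex"
transform_fn = OPENX_STANDARDIZATION_TRANSFORMS[dataset_name]

trajectory = {
    "observation": {"state": tf.zeros((10, 12), dtype=tf.float32)},
    "action": tf.zeros((10, 7), dtype=tf.float32),
}
trajectory = transform_fn(trajectory)
print(trajectory["action"].shape)  # (10, 7): 6 arm DoF + clipped/inverted gripper channel
```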

View File

@@ -0,0 +1,359 @@
#!/usr/bin/env python
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
For https://github.com/google-deepmind/open_x_embodiment (OPENX) datasets.
Example:
python lerobot/scripts/push_dataset_to_hub.py \
--raw-dir /hdd/tensorflow_datasets/bridge_dataset/1.0.0/ \
--repo-id youliangtan/sampled_bridge_data_v2 \
--raw-format openx_rlds.bridge_orig \
--episodes 3 4 5 8 9
Exact dataset fps defined in openx/configs.yaml, obtained from:
https://docs.google.com/spreadsheets/d/1rPBD77tk60AEIGZrGSODwyyzs5FgCU9Uz3h-3_t2A9g/edit?gid=0#gid=0&range=R:R
"""
import shutil
from pathlib import Path
import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds
import torch
import tqdm
import yaml
from datasets import Dataset, Features, Image, Sequence, Value
from PIL import Image as PILImage
from lerobot.common.datasets.lerobot_dataset import CODEBASE_VERSION
from lerobot.common.datasets.push_dataset_to_hub.openx.transforms import OPENX_STANDARDIZATION_TRANSFORMS
from lerobot.common.datasets.push_dataset_to_hub.utils import (
concatenate_episodes,
get_default_encoding,
save_images_concurrently,
)
from lerobot.common.datasets.utils import (
calculate_episode_data_index,
hf_transform_to_torch,
)
from lerobot.common.datasets.video_utils import VideoFrame, encode_video_frames
with open("lerobot/common/datasets/push_dataset_to_hub/openx/configs.yaml", "r") as f:
_openx_list = yaml.safe_load(f)
OPENX_DATASET_CONFIGS = _openx_list["OPENX_DATASET_CONFIGS"]
np.set_printoptions(precision=2)
def tf_to_torch(data):
return torch.from_numpy(data.numpy())
def tf_img_convert(img):
if img.dtype == tf.string:
img = tf.io.decode_image(img, expand_animations=False, dtype=tf.uint8)
elif img.dtype != tf.uint8:
raise ValueError(f"Unsupported image dtype: found with dtype {img.dtype}")
return img.numpy()
def _broadcast_metadata_rlds(i: tf.Tensor, traj: dict) -> dict:
"""
In the RLDS format, each trajectory has some top-level metadata that is explicitly separated out, and a "steps"
entry. This function moves the "steps" entry to the top level, broadcasting any metadata to the length of the
trajectory. This function also adds the extra metadata fields `_len`, `_traj_index`, and `_frame_index`.
NOTE: adapted from DLimp library https://github.com/kvablack/dlimp/
"""
steps = traj.pop("steps")
traj_len = tf.shape(tf.nest.flatten(steps)[0])[0]
# broadcast metadata to the length of the trajectory
metadata = tf.nest.map_structure(lambda x: tf.repeat(x, traj_len), traj)
# put steps back in
assert "traj_metadata" not in steps
traj = {**steps, "traj_metadata": metadata}
assert "_len" not in traj
assert "_traj_index" not in traj
assert "_frame_index" not in traj
traj["_len"] = tf.repeat(traj_len, traj_len)
traj["_traj_index"] = tf.repeat(i, traj_len)
traj["_frame_index"] = tf.range(traj_len)
return traj
def load_from_raw(
raw_dir: Path,
videos_dir: Path,
fps: int,
video: bool,
episodes: list[int] | None = None,
encoding: dict | None = None,
openx_dataset_name: str | None = None,
):
"""
Args:
raw_dir (Path): _description_
videos_dir (Path): _description_
fps (int): _description_
video (bool): _description_
episodes (list[int] | None, optional): _description_. Defaults to None.
"""
ds_builder = tfds.builder_from_directory(str(raw_dir))
dataset = ds_builder.as_dataset(
split="all",
decoders={"steps": tfds.decode.SkipDecoding()},
)
dataset_info = ds_builder.info
print("dataset_info: ", dataset_info)
ds_length = len(dataset)
dataset = dataset.take(ds_length)
# "flatten" the dataset as such we can apply trajectory level map() easily
# each [obs][key] has a shape of (frame_size, ...)
dataset = dataset.enumerate().map(_broadcast_metadata_rlds)
# Apply the standardization transform if the dataset name is provided.
# If it is not provided (e.g. when converting a generic RLDS-formatted dataset),
# search for 'image' keys in the observations instead.
if openx_dataset_name is not None:
print(" - applying standardization transform for dataset: ", openx_dataset_name)
assert openx_dataset_name in OPENX_STANDARDIZATION_TRANSFORMS
transform_fn = OPENX_STANDARDIZATION_TRANSFORMS[openx_dataset_name]
dataset = dataset.map(transform_fn)
image_keys = OPENX_DATASET_CONFIGS[openx_dataset_name]["image_obs_keys"]
else:
obs_keys = dataset_info.features["steps"]["observation"].keys()
image_keys = [key for key in obs_keys if "image" in key]
lang_key = "language_instruction" if "language_instruction" in dataset.element_spec else None
print(" - image_keys: ", image_keys)
print(" - lang_key: ", lang_key)
it = iter(dataset)
ep_dicts = []
# Init temp path to save ep_dicts in case of crash
tmp_ep_dicts_dir = videos_dir.parent.joinpath("ep_dicts")
tmp_ep_dicts_dir.mkdir(parents=True, exist_ok=True)
# check if ep_dicts have already been saved by a previous (interrupted) run
starting_ep_idx = 0
saved_ep_dicts = [ep.__str__() for ep in tmp_ep_dicts_dir.iterdir()]
if len(saved_ep_dicts) > 0:
saved_ep_dicts.sort()
# get last ep_idx number
starting_ep_idx = int(saved_ep_dicts[-1][-13:-3]) + 1
for i in range(starting_ep_idx):
episode = next(it)
ep_dicts.append(torch.load(saved_ep_dicts[i]))
# if the user specified episodes, sort them so the loop below can skip the rest
if episodes is not None:
if ds_length == 0:
raise ValueError("No episodes found.")
# convert episodes index to sorted list
episodes = sorted(episodes)
for ep_idx in tqdm.tqdm(range(starting_ep_idx, ds_length)):
episode = next(it)
# if user specified episodes, skip the ones not in the list
if episodes is not None:
if len(episodes) == 0:
break
if ep_idx == episodes[0]:
# process this episode
print(" selecting episode idx: ", ep_idx)
episodes.pop(0)
else:
continue # skip
num_frames = episode["action"].shape[0]
###########################################################
# Handle the episodic data
# last step of demonstration is considered done
done = torch.zeros(num_frames, dtype=torch.bool)
done[-1] = True
ep_dict = {}
langs = [] # TODO: might be located in "observation"
image_array_dict = {key: [] for key in image_keys}
# We will create the state observation tensor by stacking the state
# obs keys defined in openx/configs.yaml
if openx_dataset_name is not None:
state_obs_keys = OPENX_DATASET_CONFIGS[openx_dataset_name]["state_obs_keys"]
# stack the state observations; if a key is missing, pad with zeros
states = []
for key in state_obs_keys:
if key in episode["observation"]:
states.append(tf_to_torch(episode["observation"][key]))
else:
states.append(torch.zeros(num_frames, 1)) # pad with zeros
states = torch.cat(states, dim=1)
# assert states.shape == (num_frames, 8), f"states shape: {states.shape}"
else:
states = tf_to_torch(episode["observation"]["state"])
actions = tf_to_torch(episode["action"])
rewards = tf_to_torch(episode["reward"]).float()
# If lang_key is present, convert the entire tensor at once
if lang_key is not None:
langs = [str(x) for x in episode[lang_key]]
for im_key in image_keys:
imgs = episode["observation"][im_key]
image_array_dict[im_key] = [tf_img_convert(img) for img in imgs]
# simple assertions
for item in [states, actions, rewards, done]:
assert len(item) == num_frames
###########################################################
# loop through all cameras
for im_key in image_keys:
img_key = f"observation.images.{im_key}"
imgs_array = image_array_dict[im_key]
imgs_array = np.array(imgs_array)
if video:
# save png images in temporary directory
tmp_imgs_dir = videos_dir / "tmp_images"
save_images_concurrently(imgs_array, tmp_imgs_dir)
# encode images to a mp4 video
fname = f"{img_key}_episode_{ep_idx:06d}.mp4"
video_path = videos_dir / fname
encode_video_frames(tmp_imgs_dir, video_path, fps, **(encoding or {}))
# clean temporary images directory
shutil.rmtree(tmp_imgs_dir)
# store the reference to the video frame
ep_dict[img_key] = [
{"path": f"videos/{fname}", "timestamp": i / fps} for i in range(num_frames)
]
else:
ep_dict[img_key] = [PILImage.fromarray(x) for x in imgs_array]
if lang_key is not None:
ep_dict["language_instruction"] = langs
ep_dict["observation.state"] = states
ep_dict["action"] = actions
ep_dict["timestamp"] = torch.arange(0, num_frames, 1) / fps
ep_dict["episode_index"] = torch.tensor([ep_idx] * num_frames)
ep_dict["frame_index"] = torch.arange(0, num_frames, 1)
ep_dict["next.reward"] = rewards
ep_dict["next.done"] = done
path_ep_dict = tmp_ep_dicts_dir.joinpath(
"ep_dict_" + "0" * (10 - len(str(ep_idx))) + str(ep_idx) + ".pt"
)
torch.save(ep_dict, path_ep_dict)
ep_dicts.append(ep_dict)
data_dict = concatenate_episodes(ep_dicts)
total_frames = data_dict["frame_index"].shape[0]
data_dict["index"] = torch.arange(0, total_frames, 1)
return data_dict
def to_hf_dataset(data_dict, video) -> Dataset:
features = {}
keys = [key for key in data_dict if "observation.images." in key]
for key in keys:
if video:
features[key] = VideoFrame()
else:
features[key] = Image()
features["observation.state"] = Sequence(
length=data_dict["observation.state"].shape[1], feature=Value(dtype="float32", id=None)
)
if "observation.velocity" in data_dict:
features["observation.velocity"] = Sequence(
length=data_dict["observation.velocity"].shape[1], feature=Value(dtype="float32", id=None)
)
if "observation.effort" in data_dict:
features["observation.effort"] = Sequence(
length=data_dict["observation.effort"].shape[1], feature=Value(dtype="float32", id=None)
)
if "language_instruction" in data_dict:
features["language_instruction"] = Value(dtype="string", id=None)
features["action"] = Sequence(
length=data_dict["action"].shape[1], feature=Value(dtype="float32", id=None)
)
features["episode_index"] = Value(dtype="int64", id=None)
features["frame_index"] = Value(dtype="int64", id=None)
features["timestamp"] = Value(dtype="float32", id=None)
features["next.reward"] = Value(dtype="float32", id=None)
features["next.done"] = Value(dtype="bool", id=None)
features["index"] = Value(dtype="int64", id=None)
hf_dataset = Dataset.from_dict(data_dict, features=Features(features))
hf_dataset.set_transform(hf_transform_to_torch)
return hf_dataset
def from_raw_to_lerobot_format(
raw_dir: Path,
videos_dir: Path,
fps: int | None = None,
video: bool = True,
episodes: list[int] | None = None,
encoding: dict | None = None,
openx_dataset_name: str | None = None,
):
"""This is a test impl for rlds conversion"""
if openx_dataset_name is None:
# set a default rlds frame rate if the dataset is not from openx
fps = 30
elif "fps" not in OPENX_DATASET_CONFIGS[openx_dataset_name]:
raise ValueError(
"fps for this dataset is not specified in openx/configs.py yet," "means it is not yet tested"
)
fps = OPENX_DATASET_CONFIGS[openx_dataset_name]["fps"]
data_dict = load_from_raw(raw_dir, videos_dir, fps, video, episodes, encoding, openx_dataset_name)
hf_dataset = to_hf_dataset(data_dict, video)
episode_data_index = calculate_episode_data_index(hf_dataset)
info = {
"codebase_version": CODEBASE_VERSION,
"fps": fps,
"video": video,
}
if video:
info["encoding"] = get_default_encoding()
return hf_dataset, episode_data_index, info
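As a rough illustration, `from_raw_to_lerobot_format` could also be called directly from Python. The paths below are hypothetical; in practice this module is driven by `push_dataset_to_hub.py` as shown in the docstring at the top of the file:

```python
from pathlib import Path

# Hypothetical paths: point these at a downloaded RLDS dataset and an output directory.
raw_dir = Path("/data/tensorflow_datasets/bridge_dataset/1.0.0")
videos_dir = Path("/data/lerobot_conversion/videos")

hf_dataset, episode_data_index, info = from_raw_to_lerobot_format(
    raw_dir=raw_dir,
    videos_dir=videos_dir,
    video=True,
    episodes=[3, 4, 5],                 # convert only a few episodes
    openx_dataset_name="bridge_orig",   # selects the standardization transform and fps
)
print(info)  # e.g. {"codebase_version": ..., "fps": ..., "video": True, "encoding": {...}}
```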

View File

@@ -80,6 +80,11 @@ def hf_transform_to_torch(items_dict: dict[torch.Tensor | None]):
if isinstance(first_item, PILImage.Image):
to_tensor = transforms.ToTensor()
items_dict[key] = [to_tensor(img) for img in items_dict[key]]
elif isinstance(first_item, str):
# TODO (michel-aractingi): add str2embedding via language tokenizer
# For now we leave this part up to the user to choose how to address
# language conditioned tasks
pass
elif isinstance(first_item, dict) and "path" in first_item and "timestamp" in first_item:
# video frame will be processed downstream
pass
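Since language strings are deliberately passed through untouched, handling them is left to downstream code. A minimal sketch of one possible choice (a Hugging Face tokenizer; the model name is only an example, not something LeRobot prescribes):

```python
from transformers import AutoTokenizer

# Hypothetical tokenizer choice; any text encoder could be substituted here.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize_instructions(batch: dict) -> dict:
    # `language_instruction` is kept as a list of strings by hf_transform_to_torch above.
    tokens = tokenizer(batch["language_instruction"], padding=True, return_tensors="pt")
    batch["language_tokens"] = tokens["input_ids"]
    return batch
```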

View File

@@ -51,6 +51,11 @@ def get_policy_and_config_classes(name: str) -> tuple[Policy, object]:
from lerobot.common.policies.tdmpc.modeling_tdmpc import TDMPCPolicy
return TDMPCPolicy, TDMPCConfig
elif name == "tdmpc2":
from lerobot.common.policies.tdmpc2.configuration_tdmpc2 import TDMPC2Config
from lerobot.common.policies.tdmpc2.modeling_tdmpc2 import TDMPC2Policy
return TDMPC2Policy, TDMPC2Config
elif name == "diffusion":
from lerobot.common.policies.diffusion.configuration_diffusion import DiffusionConfig
from lerobot.common.policies.diffusion.modeling_diffusion import DiffusionPolicy
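With this registration in place, the new policy can be resolved by name through the factory. A minimal sketch using the default config (no dataset stats, so the resulting policy could not be run for inference as-is):

```python
# Sketch: resolve the tdmpc2 policy and config classes by name.
policy_cls, config_cls = get_policy_and_config_classes("tdmpc2")
config = config_cls()        # TDMPC2Config with default input/output shapes
policy = policy_cls(config)  # TDMPC2Policy; dataset_stats would normally be passed as well
print(policy.name)           # "tdmpc2"
```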

View File

@@ -0,0 +1,217 @@
#!/usr/bin/env python
# Copyright 2024 Nicklas Hansen, Xiaolong Wang, Hao Su,
# and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from dataclasses import dataclass, field
@dataclass
class TDMPC2Config:
"""Configuration class for TDMPC2Policy.
Defaults are configured for training with xarm_lift_medium_replay providing proprioceptive and single
camera observations.
The parameters you will most likely need to change are the ones which depend on the environment / sensors.
Those are: `input_shapes`, `output_shapes`, and perhaps `max_random_shift_ratio`.
Args:
n_action_repeats: The number of times to repeat the action returned by the planning. (hint: Google
action repeats in Q-learning or ask your favorite chatbot)
horizon: Horizon for model predictive control.
n_action_steps: Number of action steps to take from the plan given by model predictive control. This
is an alternative to using action repeats. If this is set to more than 1, then we require
`n_action_repeats == 1`, `use_mpc == True` and `n_action_steps <= horizon`. Note that this
approach of using multiple steps from the plan is not in the original implementation.
input_shapes: A dictionary defining the shapes of the input data for the policy. The key represents
the input data name, and the value is a list indicating the dimensions of the corresponding data.
For example, "observation.image" refers to an input from a camera with dimensions [3, 96, 96],
indicating it has three color channels and 96x96 resolution. Importantly, `input_shapes` doesn't
include batch dimension or temporal dimension.
output_shapes: A dictionary defining the shapes of the output data for the policy. The key represents
the output data name, and the value is a list indicating the dimensions of the corresponding data.
For example, "action" refers to an output shape of [14], indicating 14-dimensional actions.
Importantly, `output_shapes` doesn't include batch dimension or temporal dimension.
input_normalization_modes: A dictionary where the key represents the modality (e.g. "observation.state"),
and the value specifies the normalization mode to apply. The two available modes are "mean_std",
which subtracts the mean and divides by the standard deviation, and "min_max", which rescales to a
[-1, 1] range. Note that this defaults to None, meaning inputs are not normalized, in order to
match the original implementation.
output_normalization_modes: Similar dictionary as `input_normalization_modes`, but used to unnormalize to
the original scale. Note that this is also used for normalizing the training targets. NOTE: Clipping
to [-1, +1] is used during MPPI/CEM. Therefore, it is recommended that you stick with the "min_max"
normalization mode here.
image_encoder_hidden_dim: Number of channels for the convolutional layers used for image encoding.
state_encoder_hidden_dim: Hidden dimension for MLP used for state vector encoding.
latent_dim: Observation's latent embedding dimension.
q_ensemble_size: Number of Q function estimators to use in an ensemble for uncertainty estimation.
mlp_dim: Hidden dimension of MLPs used for modelling the dynamics encoder, reward function, policy
(π), Q ensemble, and V.
discount: Discount factor (γ) to use for the reinforcement learning formalism.
use_mpc: Whether to use model predictive control. The alternative is to just sample the policy model
(π) for each step.
cem_iterations: Number of iterations for the MPPI/CEM loop in MPC.
max_std: Maximum standard deviation for actions sampled from the gaussian PDF in CEM.
min_std: Minimum standard deviation for noise applied to actions sampled from the policy model (π).
Doubles up as the minimum standard deviation for actions sampled from the gaussian PDF in CEM.
n_gaussian_samples: Number of samples to draw from the gaussian distribution every CEM iteration. Must
be non-zero.
n_pi_samples: Number of samples to draw from the policy / world model rollout every CEM iteration. Can
be zero.
uncertainty_regularizer_coeff: Coefficient for the uncertainty regularization used when estimating
trajectory values (this is the λ coefficient in eqn 4 of FOWM).
n_elites: The number of elite samples to use for updating the gaussian parameters every CEM iteration.
elite_weighting_temperature: The temperature to use for softmax weighting (by trajectory value) of the
elites, when updating the gaussian parameters for CEM.
gaussian_mean_momentum: Momentum (α) used for EMA updates of the mean parameter μ of the gaussian
parameters optimized in CEM. Updates are calculated as μ⁻ ← αμ⁻ + (1-α)μ.
max_random_shift_ratio: Maximum random shift (as a proportion of the image size) to apply to the
image(s) (in units of pixels) for training-time augmentation. If set to 0, no such augmentation
is applied. Note that the input images are assumed to be square for this augmentation.
reward_coeff: Loss weighting coefficient for the reward regression loss.
expectile_weight: Weighting (τ) used in expectile regression for the state value function (V).
v_pred < v_target is weighted by τ and v_pred >= v_target is weighted by (1-τ). τ is expected to
be in [0, 1]. Setting τ closer to 1 results in a more "optimistic" V. This is sensible to do
because v_target is obtained by evaluating the learned state-action value functions (Q) with
in-sample actions that may not be always optimal.
value_coeff: Loss weighting coefficient for both the state-action value (Q) TD loss, and the state
value (V) expectile regression loss.
consistency_coeff: Loss weighting coefficient for the consistency loss.
advantage_scaling: A factor by which the advantages are scaled prior to exponentiation for advantage
weighted regression of the policy (π) estimator parameters. Note that the exponentiated advantages
are clamped at 100.0.
pi_coeff: Loss weighting coefficient for the action regression loss.
temporal_decay_coeff: Exponential decay coefficient for decaying the loss coefficient for future time-
steps. Hint: each loss computation involves `horizon` steps worth of actions starting from the
current time step.
target_model_momentum: Momentum (α) used for EMA updates of the target models. Updates are calculated
as ϕ ← αϕ + (1-α)θ where ϕ are the parameters of the target model and θ are the parameters of the
model being trained.
"""
# Discrete regression (two-hot) bins and loss shaping.
num_bins: int = 101
vmin: float = -10.0
vmax: float = +10.0
rho: float = 0.5
tau: float = 0.01
simnorm_dim: int = 8
# Input / output structure.
n_action_repeats: int = 2
horizon: int = 5
n_action_steps: int = 1
input_shapes: dict[str, list[int]] = field(
default_factory=lambda: {
"observation.image": [3, 64, 64],
"observation.state": [4],
}
)
output_shapes: dict[str, list[int]] = field(
default_factory=lambda: {
"action": [4],
}
)
# Normalization / Unnormalization
input_normalization_modes: dict[str, str] | None = None
output_normalization_modes: dict[str, str] = field(
default_factory=lambda: {"action": "min_max"},
)
# Architecture / modeling.
# Neural networks.
image_encoder_hidden_dim: int = 32
state_encoder_hidden_dim: int = 256
latent_dim: int = 8 #50
q_ensemble_size: int = 5
mlp_dim: int = 512
# Reinforcement learning.
discount: float = 0.9
lr: float = 3e-4
enc_lr_scale: float = 0.3
num_q: int = 5
dropout: float = 0.01
num_channels: int = 32
num_enc_layers: int = 2
enc_dim: int = 256
# Inference.
use_mpc: bool = True
cem_iterations: int = 6
max_std: float = 2.0
min_std: float = 0.05
n_gaussian_samples: int = 512
n_pi_samples: int = 51
uncertainty_regularizer_coeff: float = 1.0
n_elites: int = 50
elite_weighting_temperature: float = 0.5
gaussian_mean_momentum: float = 0.1
# Training and loss computation.
grad_clip_norm: float = 20
max_random_shift_ratio: float = 0.0476
# Loss coefficients.
consistency_coef: float = 20
entropy_coef: float = 1e-4
reward_coef: float = 0.1
expectile_weight: float = 0.9
value_coef: float = 0.1
consistency_coeff: float = 20.0
advantage_scaling: float = 3.0
pi_coeff: float = 0.5
temporal_decay_coeff: float = 0.5
# Target model.
target_model_momentum: float = 0.995
def __post_init__(self):
"""Input validation (not exhaustive)."""
# There should only be one image key.
image_keys = {k for k in self.input_shapes if k.startswith("observation.image")}
if len(image_keys) > 1:
raise ValueError(
f"{self.__class__.__name__} handles at most one image for now. Got image keys {image_keys}."
)
if len(image_keys) > 0:
image_key = next(iter(image_keys))
if self.input_shapes[image_key][-2] != self.input_shapes[image_key][-1]:
# TODO(alexander-soare): This limitation is solely because of code in the random shift
# augmentation. It should be able to be removed.
raise ValueError(
f"Only square images are handled now. Got image shape {self.input_shapes[image_key]}."
)
if self.n_gaussian_samples <= 0:
raise ValueError(
f"The number of guassian samples for CEM should be non-zero. Got `{self.n_gaussian_samples=}`"
)
if self.output_normalization_modes != {"action": "min_max"}:
raise ValueError(
"TD-MPC assumes the action space dimensions to all be in [-1, 1]. Therefore it is strongly "
f"advised that you stick with the default. See {self.__class__.__name__} docstring for more "
"information."
)
if self.n_action_steps > 1:
if self.n_action_repeats != 1:
raise ValueError(
"If `n_action_steps > 1`, `n_action_repeats` must be left to its default value of 1."
)
if not self.use_mpc:
raise ValueError("If `n_action_steps > 1`, `use_mpc` must be set to `True`.")
if self.n_action_steps > self.horizon:
raise ValueError("`n_action_steps` must be less than or equal to `horizon`.")

View File

@@ -0,0 +1,727 @@
#!/usr/bin/env python
# Copyright 2024 Nicklas Hansen, Xiaolong Wang, Hao Su,
# and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Implementation of Finetuning Offline World Models in the Real World.
The comments in this code may sometimes refer to these references:
TD-MPC paper: Temporal Difference Learning for Model Predictive Control (https://arxiv.org/abs/2203.04955)
FOWM paper: Finetuning Offline World Models in the Real World (https://arxiv.org/abs/2310.16029)
"""
# ruff: noqa: N806
import logging
from collections import deque
from copy import deepcopy
from functools import partial
from typing import Callable
import einops
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F # noqa: N812
from huggingface_hub import PyTorchModelHubMixin
from torch import Tensor
import lerobot.common.policies.tdmpc2.tdmpc2_utils as utils
from lerobot.common.policies.normalize import Normalize, Unnormalize
from lerobot.common.policies.tdmpc2.configuration_tdmpc2 import TDMPC2Config
from lerobot.common.policies.utils import get_device_from_parameters, populate_queues
class TDMPC2Policy(nn.Module, PyTorchModelHubMixin):
"""Implementation of TD-MPC2 learning + inference.
Please note several warnings for this policy.
- We have NOT checked that training on LeRobot reproduces SOTA results. This is a TODO.
"""
name = "tdmpc2"
def __init__(
self, config: TDMPC2Config | None = None, dataset_stats: dict[str, dict[str, Tensor]] | None = None
):
"""
Args:
config: Policy configuration class instance or None, in which case the default instantiation of
the configuration class is used.
dataset_stats: Dataset statistics to be used for normalization. If not passed here, it is expected
that they will be passed with a call to `load_state_dict` before the policy is used.
"""
super().__init__()
logging.warning(
"""
Please note several warnings for this policy.
- We have NOT checked that training on LeRobot reproduces SOTA results. This is a TODO.
"""
)
if config is None:
config = TDMPC2Config()
self.config = config
self.model = TDMPC2TOLD(config)
if config.input_normalization_modes is not None:
self.normalize_inputs = Normalize(
config.input_shapes, config.input_normalization_modes, dataset_stats
)
else:
self.normalize_inputs = nn.Identity()
self.normalize_targets = Normalize(
config.output_shapes, config.output_normalization_modes, dataset_stats
)
self.unnormalize_outputs = Unnormalize(
config.output_shapes, config.output_normalization_modes, dataset_stats
)
image_keys = [k for k in config.input_shapes if k.startswith("observation.image")]
# Note: This check is covered in the post-init of the config, but we keep a sanity check here just in case.
self._use_image = False
self._use_env_state = False
if len(image_keys) > 0:
assert len(image_keys) == 1
self._use_image = True
self.input_image_key = image_keys[0]
if "observation.environment_state" in config.input_shapes:
self._use_env_state = True
self.scale = utils.RunningScale(self.config)
self.queue_keys = None
self.reset()
def reset(self):
"""
Clear observation and action queues. Clear previous means for warm starting of MPPI/CEM. Should be
called on `env.reset()`
"""
self._queues = {
"observation.state": deque(maxlen=1),
"action": deque(maxlen=max(self.config.n_action_steps, self.config.n_action_repeats)),
}
if self._use_image:
self._queues["observation.image"] = deque(maxlen=1)
if self._use_env_state:
self._queues["observation.environment_state"] = deque(maxlen=1)
# Previous mean obtained from the cross-entropy method (CEM) used during MPC. It is used to warm start
# CEM for the next step.
self._prev_mean: torch.Tensor | None = None
@torch.no_grad()
def select_action(self, batch: dict[str, Tensor]) -> Tensor:
"""Select a single action given environment observations."""
batch = self.normalize_inputs(batch)
if self._use_image:
batch = dict(batch) # shallow copy so that adding a key doesn't modify the original
batch["observation.image"] = batch[self.input_image_key]
self._queues = populate_queues(self._queues, batch)
if self.queue_keys is None:
self.queue_keys = [k for k in batch if k in self._queues]
# When the action queue is depleted, populate it again by querying the policy.
if len(self._queues["action"]) == 0:
batch = {key: torch.stack(list(self._queues[key]), dim=1) for key in self.queue_keys}
# Remove the time dimension as it is not handled yet.
for key in batch:
assert batch[key].shape[1] == 1
batch[key] = batch[key][:, 0]
# NOTE: Order of observations matters here.
encode_keys = []
if self._use_image:
encode_keys.append("observation.image")
if self._use_env_state:
encode_keys.append("observation.environment_state")
encode_keys.append("observation.state")
z = self.model.encode({k: batch[k] for k in encode_keys})
if self.config.use_mpc: # noqa: SIM108
actions = self.plan(z) # (horizon, batch, action_dim)
else:
# Plan with the policy (π) alone. This always returns one action so unsqueeze to get a
# sequence dimension like in the MPC branch.
actions = self.model.pi(z).unsqueeze(0)
actions = torch.clamp(actions, -1, +1)
actions = self.unnormalize_outputs({"action": actions})["action"]
if self.config.n_action_repeats > 1:
for _ in range(self.config.n_action_repeats):
self._queues["action"].append(actions[0])
else:
# Action queue is (n_action_steps, batch_size, action_dim), so we transpose the action.
self._queues["action"].extend(
actions[: self.config.n_action_steps]
)  # TODO: check whether TD-MPC2 should make use of `n_action_steps` here
action = self._queues["action"].popleft()
return action
@torch.no_grad()
def plan(self, z: Tensor) -> Tensor:
"""Plan sequence of actions using TD-MPC inference.
Args:
z: (batch, latent_dim,) tensor for the initial state.
Returns:
(horizon, batch, action_dim,) tensor for the planned trajectory of actions.
"""
device = get_device_from_parameters(self)
batch_size = z.shape[0]
# Sample Nπ trajectories from the policy.
pi_actions = torch.empty(
self.config.horizon,
self.config.n_pi_samples,
batch_size,
self.config.output_shapes["action"][0],
device=device,
)
if self.config.n_pi_samples > 0:
_z = einops.repeat(z, "b d -> n b d", n=self.config.n_pi_samples)
for t in range(self.config.horizon):
# Note: Adding a small amount of noise here doesn't hurt during inference and may even be
# helpful for CEM.
pi_actions[t] = self.model.pi_action(_z)
_z = self.model.latent_dynamics(_z, pi_actions[t])
# In the CEM loop we will need this for a call to estimate_value with the gaussian sampled
# trajectories.
z = einops.repeat(z, "b d -> n b d", n=self.config.n_gaussian_samples + self.config.n_pi_samples)
# Model Predictive Path Integral (MPPI) with the cross-entropy method (CEM) as the optimization
# algorithm.
# The initial mean and standard deviation for the cross-entropy method (CEM).
mean = torch.zeros(
self.config.horizon, batch_size, self.config.output_shapes["action"][0], device=device
)
# Maybe warm start CEM with the mean from the previous step.
if self._prev_mean is not None:
mean[:-1] = self._prev_mean[1:]
std = self.config.max_std * torch.ones_like(mean)
for _ in range(self.config.cem_iterations):
# Randomly sample action trajectories for the gaussian distribution.
std_normal_noise = torch.randn(
self.config.horizon,
self.config.n_gaussian_samples,
batch_size,
self.config.output_shapes["action"][0],
device=std.device,
)
gaussian_actions = torch.clamp(mean.unsqueeze(1) + std.unsqueeze(1) * std_normal_noise, -1, 1)
# Compute elite actions.
actions = torch.cat([gaussian_actions, pi_actions], dim=1)
value = self.estimate_value(z, actions).nan_to_num_(0).squeeze(-1)
elite_idxs = torch.topk(value, self.config.n_elites, dim=0).indices # (n_elites, batch)
elite_value = value.take_along_dim(elite_idxs, dim=0) # (n_elites, batch)
# (horizon, n_elites, batch, action_dim)
elite_actions = actions.take_along_dim(einops.rearrange(elite_idxs, "n b -> 1 n b 1"), dim=1)
# Update gaussian PDF parameters to be the (weighted) mean and standard deviation of the elites.
max_value = elite_value.max(0, keepdim=True)[0] # (1, batch)
# The weighting is a softmax over trajectory values. Note that this is not the same as the usage
# of Ω in eqn 4 of the TD-MPC paper. Instead it is the normalized version of it: s = Ω/ΣΩ. This
# makes the equations: μ = Σ(s⋅Γ), σ = Σ(s⋅(Γ-μ)²).
score = torch.exp(self.config.elite_weighting_temperature * (elite_value - max_value))
score /= score.sum(axis=0, keepdim=True)
# (horizon, batch, action_dim)
_mean = torch.sum(einops.rearrange(score, "n b -> n b 1") * elite_actions, dim=1)
_std = torch.sqrt(
torch.sum(
einops.rearrange(score, "n b -> n b 1")
* (elite_actions - einops.rearrange(_mean, "h b d -> h 1 b d")) ** 2,
dim=1,
)
)
# Update mean with an exponential moving average, and std with a direct replacement.
mean = (
self.config.gaussian_mean_momentum * mean + (1 - self.config.gaussian_mean_momentum) * _mean
)
std = _std.clamp_(self.config.min_std, self.config.max_std)
# Keep track of the mean for warm-starting subsequent steps.
self._prev_mean = mean
# Randomly select one of the elite actions from the last iteration of MPPI/CEM using the softmax
# scores from the last iteration.
actions = elite_actions[:, torch.multinomial(score.T, 1).squeeze(), torch.arange(batch_size)]
return actions
@torch.no_grad()
def estimate_value(self, z, actions):
"""Estimate value of a trajectory starting at latent state z and executing given actions."""
G, discount = 0, 1
for t in range(self.config.horizon):
reward = utils.two_hot_inv(self.model._reward(torch.cat([z, actions[t]], dim=-1)), self.config)
z = self.model.next(z, actions[t])
G += discount * reward
discount *= self.config.discount
return G + discount * self.model.Qs(z, self.model.pi(z)[1], return_type="avg")
def forward(self, batch: dict[str, Tensor]) -> dict[str, Tensor | float]:
"""Run the batch through the model and compute the loss.
Returns a dictionary with loss as a tensor, and other information as native floats.
"""
device = get_device_from_parameters(self)
batch = self.normalize_inputs(batch)
if self._use_image:
batch = dict(batch) # shallow copy so that adding a key doesn't modify the original
batch["observation.image"] = batch[self.input_image_key]
batch = self.normalize_targets(batch)
# (b, t) -> (t, b)
for key in batch:
if batch[key].ndim > 1:
batch[key] = batch[key].transpose(1, 0)
action = batch["action"] # (t, b, action_dim)
reward = batch["next.reward"] # (t, b)
reward = reward.unsqueeze(-1) # (t, b, 1)
observations = {k: v for k, v in batch.items() if k.startswith("observation.")}
# Apply random image augmentations.
if self._use_image and self.config.max_random_shift_ratio > 0:
observations["observation.image"] = flatten_forward_unflatten(
partial(random_shifts_aug, max_random_shift_ratio=self.config.max_random_shift_ratio),
observations["observation.image"],
)
# Get the current observation for predicting trajectories, and all future observations for use in
# the latent consistency loss and TD loss.
current_observation, next_observations = {}, {}
for k in observations:
current_observation[k] = observations[k][0]
next_observations[k] = observations[k][1:]
horizon, batch_size = next_observations[
"observation.image" if self._use_image else "observation.environment_state"
].shape[:2]
# Compute targets
with torch.no_grad():
next_z = self.model.encode(next_observations)
curr_z = self.model.encode(current_observation).unsqueeze(
0
) # TODO: not necessary to do the whole thing
# Get the TD targets (named `_td_target` in the original code).
pi = self.model.pi(next_z)[1]
discount = self.config.discount
td_targets = reward + discount * self.model.Qs(next_z, pi, return_type="min", target=True)
#self.model.train()
# Latent rollout
zs = torch.empty(self.config.horizon + 1, batch_size, self.config.latent_dim, device=device)
zs[0] = z = curr_z[0]
consistency_loss = 0
for t in range(self.config.horizon):
x = torch.cat([z, action[t]], dim=-1)
z = self.model._dynamics(x)
consistency_loss += F.mse_loss(z, next_z[t]) * self.config.rho**t
zs[t + 1] = z
# Predictions
_zs = zs[:-1]
qs = self.model.Qs(_zs, action, return_type="all")
reward_preds = self.model._reward(torch.cat([_zs, action], dim=-1))
# Compute losses
reward_loss, value_loss = 0, 0
for t in range(self.config.horizon):
reward_loss += utils.soft_ce(reward_preds[t], reward[t], self.config).mean() * self.config.rho**t
for q in range(self.config.num_q):
value_loss += utils.soft_ce(qs[q][t], td_targets[t], self.config).mean() * self.config.rho**t
consistency_loss *= 1 / self.config.horizon
reward_loss *= 1 / self.config.horizon
value_loss *= 1 / (self.config.horizon * self.config.num_q)
# NOTE: deviation from the original Nicklas Hansen implementation. The original code combines
# consistency_coef * consistency_loss + reward_coef * reward_loss + value_coef * value_loss into a
# total loss and immediately calls backward(), clips gradients and steps the optimizer here.
# In LeRobot, zero_grad() and the optimizer step are performed in train.py, so we only assemble
# the combined loss further below.
# Update the policy (π) using the sequence of latent states.
zs_for_pi = zs.detach()
self.model.track_q_grad(False)
_, pis, log_pis, _ = self.model.pi(zs_for_pi)
qs = self.model.Qs(zs_for_pi, pis, return_type="avg")
self.scale.update(qs[0])
qs = self.scale(qs)
# Loss is a weighted sum of Q-values
rho = torch.pow(self.config.rho, torch.arange(len(qs), device=device))
pi_loss = ((self.config.entropy_coef * log_pis - qs).mean(dim=(1, 2)) * rho).mean()
# NOTE: in the original implementation, pi_loss.backward(), gradient clipping and pi_optim.step()
# are performed here; in LeRobot the backward pass and optimizer step happen in train.py.
self.model.track_q_grad(True)  # re-enable Q gradients (they were frozen above for the π update only)
loss = (
self.config.consistency_coef * consistency_loss
+ self.config.reward_coef * reward_loss
+ self.config.value_coef * value_loss
+ self.config.pi_coeff * pi_loss
)
# NOTE: the target Q-networks are soft-updated with Polyak averaging in `update()` below, which is
# expected to be called by the training loop after each optimizer step.
# Return training statistics
self.model.eval()
info = {
"loss": loss,
"consistency_loss": consistency_loss.mean().item(),
"reward_loss": reward_loss.mean().item(),
"value_loss": value_loss.mean().item(),
"pi_loss": pi_loss.item(),
"pi_scale": self.scale.value,
}
# Undo (b, t) -> (t, b).
for key in batch:
if batch[key].ndim > 1:
batch[key] = batch[key].transpose(1, 0)
return info
def update(self):
"""Soft-update target Q-networks using Polyak averaging."""
with torch.no_grad():
for p, p_target in zip(
self.model._Qs.parameters(), self.model._target_Qs.parameters(), strict=False
):
p_target.data.lerp_(p.data, self.config.tau)
class TDMPC2TOLD(nn.Module):
"""Task-Oriented Latent Dynamics (TOLD) model used in TD-MPC2."""
def __init__(self, config: TDMPC2Config):
super().__init__()
self.config = config
self.config.bin_size = (config.vmax - config.vmin) / (
config.num_bins - 1
) # Bin size for discrete regression
action_dim = config.output_shapes["action"][0]
self._encoder = TDMPC2ObservationEncoder(config)
self._dynamics = utils.mlp(
config.latent_dim + action_dim,
2 * [config.mlp_dim],
config.latent_dim,
act=utils.SimNorm(config),
)
self._reward = utils.mlp(
config.latent_dim + action_dim, 2 * [config.mlp_dim], max(config.num_bins, 1)
)
self._pi = utils.mlp(config.latent_dim, 2 * [config.mlp_dim], 2 * action_dim)
self._Qs = utils.Ensemble(
[
utils.mlp(
config.latent_dim + action_dim,
2 * [config.mlp_dim],
max(config.num_bins, 1),
dropout=config.dropout,
)
for _ in range(config.num_q)
]
)
self.apply(self.weight_init)
for p in [self._reward[-1].weight, self._Qs.params[-2]]:
p.data.fill_(0)
self._target_Qs = deepcopy(self._Qs).requires_grad_(False)
log_std_min, log_std_max = -10, 2 # TODO: add to config
self.log_std_min = torch.tensor(log_std_min)
self.log_std_dif = torch.tensor(log_std_max) - self.log_std_min
def track_q_grad(self, mode=True):
"""
Enables/disables gradient tracking of Q-networks.
Avoids unnecessary computation during policy optimization.
This method also enables/disables gradients for task embeddings.
"""
for p in self._Qs.parameters():
p.requires_grad_(mode)
def weight_init(self, m): # lifted from Nicklas' code
"""Custom weight initialization for TD-MPC2."""
if isinstance(m, nn.Linear):
nn.init.trunc_normal_(m.weight, std=0.02)
if m.bias is not None:
nn.init.constant_(m.bias, 0)
elif isinstance(m, nn.Embedding):
nn.init.uniform_(m.weight, -0.02, 0.02)
elif isinstance(m, nn.ParameterList):
for i, p in enumerate(m):
if p.dim() == 3: # Linear
nn.init.trunc_normal_(p, std=0.02) # Weight
nn.init.constant_(m[i + 1], 0) # Bias
def encode(self, obs: dict[str, Tensor]) -> Tensor:
"""Encodes an observation into its latent representation."""
return self._encoder(obs)
def latent_dynamics_and_reward(self, z: Tensor, a: Tensor) -> tuple[Tensor, Tensor]:
"""Predict the next state's latent representation and the reward given a current latent and action.
Args:
z: (*, latent_dim) tensor for the current state's latent representation.
a: (*, action_dim) tensor for the action to be applied.
Returns:
A tuple containing:
- (*, latent_dim) tensor for the next state's latent representation.
- (*,) tensor for the estimated reward.
"""
x = torch.cat([z, a], dim=-1)
r = self._reward(x)
r = utils.two_hot_inv(r, self.config).squeeze(-1)
return self._dynamics(x), r
def latent_dynamics(self, z: Tensor, a: Tensor) -> Tensor:
"""Predict the next state's latent representation given a current latent and action.
Args:
z: (*, latent_dim) tensor for the current state's latent representation.
a: (*, action_dim) tensor for the action to be applied.
Returns:
(*, latent_dim) tensor for the next state's latent representation.
"""
x = torch.cat([z, a], dim=-1)
return self._dynamics(x)
def next(self, z: Tensor, a: Tensor) -> Tensor:
return self.latent_dynamics(z, a) # just a wrapper
def pi(self, z): # lifted from Nicklas' code
"""
Samples an action from the policy prior.
The policy prior is a Gaussian distribution with
mean and (log) std predicted by a neural network.
"""
# Gaussian policy prior
mu, log_std = self._pi(z).chunk(2, dim=-1)
log_std = utils.log_std_fn(log_std, self.log_std_min, self.log_std_dif)
eps = torch.randn_like(mu)
# No masking
action_dims = None
log_pi = utils.gaussian_logprob(eps, log_std, size=action_dims)
pi = mu + eps * log_std.exp()
mu, pi, log_pi = utils.squash(mu, pi, log_pi)
return mu, pi, log_pi, log_std
def pi_action(self, z):
return self.pi(z)[1] # just return the action
def Qs(self, z: Tensor, a: Tensor, return_type: str = "min", target: bool = False) -> Tensor: # noqa: N802
"""Predict state-action value for all of the learned Q functions.
Args:
z: (*, latent_dim) tensor for the current state's latent representation.
a: (*, action_dim) tensor for the action to be applied.
return_type can be one of [`min`, `avg`, `all`]:
- `min`: return the minimum of two randomly subsampled Q-values.
- `avg`: return the average of two randomly subsampled Q-values.
- `all`: return all Q-values.
target: Set to true to use the target Q functions.
Returns:
(q_ensemble, *) tensor for the value predictions of each learned Q function in the ensemble OR
(*,) tensor if `return_type` is "min" or "avg".
"""
assert return_type in {"min", "avg", "all"}
z = torch.cat([z, a], dim=-1)
out = (self._target_Qs if target else self._Qs)(z)
if return_type == "all":
return out
Q1, Q2 = out[np.random.choice(self.config.num_q, 2, replace=False)]
Q1, Q2 = utils.two_hot_inv(Q1, self.config), utils.two_hot_inv(Q2, self.config)
return torch.min(Q1, Q2) if return_type == "min" else (Q1 + Q2) / 2
class TDMPC2ObservationEncoder(nn.Module):
"""Encode image and/or state vector observations."""
def __init__(self, config: TDMPC2Config):
"""
Creates encoders for pixel and/or state modalities.
TODO(alexander-soare): The original work allows for multiple images by concatenating them along the
channel dimension. Re-implement this capability.
"""
super().__init__()
self.config = config
for k in config.input_shapes:
if "observation.environment_state" in k:
obs_dim = config.input_shapes["observation.environment_state"][0]
self.env_state_enc_layers = utils.mlp(
obs_dim,
max(config.num_enc_layers - 1, 1) * [config.enc_dim],
config.latent_dim,
act=utils.SimNorm(config),
)
elif "observation.state" in k:
obs_dim = config.input_shapes["observation.state"][0]
self.state_enc_layers = utils.mlp(
obs_dim,
max(config.num_enc_layers - 1, 1) * [config.enc_dim],
config.latent_dim,
act=utils.SimNorm(config),
)
elif "observation.image" in k:
obs_shape = config.input_shapes["observation.image"]
self.image_enc_layers = utils.conv(obs_shape, config.num_channels, act=utils.SimNorm(config))
dummy_batch = torch.zeros(1, *config.input_shapes["observation.image"])
with torch.no_grad():
out_shape = self.image_enc_layers(dummy_batch).shape[1]
self.image_enc_layers.extend(
utils.mlp(
out_shape,
max(config.num_enc_layers - 1, 1) * [config.enc_dim],
config.latent_dim,
act=utils.SimNorm(config),
))
def forward(self, obs_dict: dict[str, Tensor]) -> Tensor:
"""Encode the image and/or state vector.
Each modality is encoded into a feature vector of size (latent_dim,) and then a uniform mean is taken
over all features.
"""
feat = []
# NOTE: Order of observations matters here.
if "observation.image" in self.config.input_shapes:
feat.append(flatten_forward_unflatten(self.image_enc_layers, obs_dict["observation.image"]))
if "observation.environment_state" in self.config.input_shapes:
feat.append(self.env_state_enc_layers(obs_dict["observation.environment_state"]))
if "observation.state" in self.config.input_shapes:
feat.append(self.state_enc_layers(obs_dict["observation.state"]))
return torch.stack(feat, dim=0).mean(0)
def random_shifts_aug(x: Tensor, max_random_shift_ratio: float) -> Tensor:
"""Randomly shifts images horizontally and vertically.
Adapted from https://github.com/facebookresearch/drqv2
"""
b, _, h, w = x.size()
assert h == w, "non-square images not handled yet"
pad = int(round(max_random_shift_ratio * h))
x = F.pad(x, tuple([pad] * 4), "replicate")
eps = 1.0 / (h + 2 * pad)
arange = torch.linspace(
-1.0 + eps,
1.0 - eps,
h + 2 * pad,
device=x.device,
dtype=torch.float32,
)[:h]
arange = einops.repeat(arange, "w -> h w 1", h=h)
base_grid = torch.cat([arange, arange.transpose(1, 0)], dim=2)
base_grid = einops.repeat(base_grid, "h w c -> b h w c", b=b)
# A random shift in units of pixels and within the boundaries of the padding.
shift = torch.randint(
0,
2 * pad + 1,
size=(b, 1, 1, 2),
device=x.device,
dtype=torch.float32,
)
shift *= 2.0 / (h + 2 * pad)
grid = base_grid + shift
return F.grid_sample(x, grid, padding_mode="zeros", align_corners=False)
def update_ema_parameters(ema_net: nn.Module, net: nn.Module, alpha: float):
"""Update EMA parameters in place with ema_param <- alpha * ema_param + (1 - alpha) * param."""
for ema_module, module in zip(ema_net.modules(), net.modules(), strict=True):
for (n_p_ema, p_ema), (n_p, p) in zip(
ema_module.named_parameters(recurse=False), module.named_parameters(recurse=False), strict=True
):
assert n_p_ema == n_p, "Parameter names don't match for EMA model update"
if isinstance(p, dict):
raise RuntimeError("Dict parameter not supported")
if isinstance(module, nn.modules.batchnorm._BatchNorm) or not p.requires_grad:
# Copy BatchNorm parameters, and non-trainable parameters directly.
p_ema.copy_(p.to(dtype=p_ema.dtype).data)
with torch.no_grad():
p_ema.mul_(alpha)
p_ema.add_(p.to(dtype=p_ema.dtype).data, alpha=1 - alpha)
def flatten_forward_unflatten(fn: Callable[[Tensor], Tensor], image_tensor: Tensor) -> Tensor:
"""Helper to temporarily flatten extra dims at the start of the image tensor.
Args:
fn: Callable that the image tensor will be passed to. It should accept (B, C, H, W) and return
(B, *), where * is any number of dimensions.
image_tensor: An image tensor of shape (**, C, H, W), where ** is any number of dimensions, generally
different from *.
Returns:
A return value from the callable reshaped to (**, *).
"""
if image_tensor.ndim == 4:
return fn(image_tensor)
start_dims = image_tensor.shape[:-3]
inp = torch.flatten(image_tensor, end_dim=-4)
flat_out = fn(inp)
return torch.reshape(flat_out, (*start_dims, *flat_out.shape[1:]))
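As a quick sanity check of the augmentation controlled by `max_random_shift_ratio`, the helpers above can be exercised on dummy tensors (shapes are illustrative):

```python
import torch

# Dummy batch of 8 RGB images at the default 64x64 resolution.
imgs = torch.rand(8, 3, 64, 64)
shifted = random_shifts_aug(imgs, max_random_shift_ratio=0.0476)
assert shifted.shape == imgs.shape  # the augmentation preserves the image shape

# flatten_forward_unflatten handles extra leading dims, e.g. (horizon, batch, C, H, W).
stacked = torch.rand(5, 8, 3, 64, 64)
out = flatten_forward_unflatten(
    lambda x: random_shifts_aug(x, max_random_shift_ratio=0.0476), stacked
)
assert out.shape == stacked.shape
```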

View File

@@ -0,0 +1,305 @@
import torch
import torch.nn as nn
import torch.nn.functional as F # noqa: N812
from functorch import combine_state_for_ensemble
# Lifted directly from https://github.com/nicklashansen/tdmpc2
DREG_BINS = None
def soft_ce(pred, target, cfg):
"""Computes the cross entropy loss between predictions and soft targets."""
pred = F.log_softmax(pred, dim=-1)
target = two_hot(target, cfg)
return -(target * pred).sum(-1, keepdim=True)
@torch.jit.script
def log_std(x, low, dif):
return low + 0.5 * dif * (torch.tanh(x) + 1)
@torch.jit.script
def _gaussian_residual(eps, log_std):
return -0.5 * eps.pow(2) - log_std
@torch.jit.script
def _gaussian_logprob(residual):
return residual - 0.5 * torch.log(2 * torch.pi)
def gaussian_logprob(eps, log_std, size=None):
"""Compute Gaussian log probability."""
residual = _gaussian_residual(eps, log_std).sum(-1, keepdim=True)
if size is None:
size = eps.size(-1)
return _gaussian_logprob(residual) * size
@torch.jit.script
def _squash(pi):
return torch.log(F.relu(1 - pi.pow(2)) + 1e-6)
def squash(mu, pi, log_pi):
"""Apply squashing function."""
mu = torch.tanh(mu)
pi = torch.tanh(pi)
log_pi -= _squash(pi).sum(-1, keepdim=True)
return mu, pi, log_pi
@torch.jit.script
def symexp(x):
"""
Symmetric exponential function.
Adapted from https://github.com/danijar/dreamerv3.
"""
return torch.sign(x) * (torch.exp(torch.abs(x)) - 1)
@torch.jit.script
def symlog(x):
"""
Symmetric logarithmic function.
Adapted from https://github.com/danijar/dreamerv3.
"""
return torch.sign(x) * torch.log(1 + torch.abs(x))
@torch.jit.script
def log_std_fn(x, low, dif):
return low + 0.5 * dif * (torch.tanh(x) + 1)
def two_hot(x, cfg):
"""Converts a batch of scalars to soft two-hot encoded targets for discrete regression."""
if cfg.num_bins == 0:
return x
elif cfg.num_bins == 1:
return symlog(x)
x = torch.clamp(symlog(x), cfg.vmin, cfg.vmax).squeeze(1)
bin_idx = torch.floor((x - cfg.vmin) / cfg.bin_size).long()
bin_offset = ((x - cfg.vmin) / cfg.bin_size - bin_idx.float()).unsqueeze(-1)
soft_two_hot = torch.zeros(x.size(0), cfg.num_bins, device=x.device)
# print("x shape:", x.shape)
# print("bin_idx shape:", bin_idx.shape)
# print("bin_offset shape:", bin_offset.shape)
# print("soft_two_hot shape:", soft_two_hot.shape)
# from IPython import embed; embed()
soft_two_hot.scatter_(1, bin_idx.unsqueeze(1), 1 - bin_offset)
soft_two_hot.scatter_(1, (bin_idx.unsqueeze(1) + 1) % cfg.num_bins, bin_offset)
return soft_two_hot
def two_hot_inv(x, cfg):
"""Converts a batch of soft two-hot encoded vectors to scalars."""
global DREG_BINS
if cfg.num_bins == 0:
return x
elif cfg.num_bins == 1:
return symexp(x)
if DREG_BINS is None:
DREG_BINS = torch.linspace(cfg.vmin, cfg.vmax, cfg.num_bins, device=x.device)
x = F.softmax(x, dim=-1)
# clone bins to avoid the inference-tensor error
x = torch.sum(x * DREG_BINS.clone(), dim=-1, keepdim=True)
return symexp(x)
class Ensemble(nn.Module):
"""
Vectorized ensemble of modules.
"""
def __init__(self, modules, **kwargs):
super().__init__()
modules = nn.ModuleList(modules)
fn, params, _ = combine_state_for_ensemble(modules)
self.vmap = torch.vmap(fn, in_dims=(0, 0, None), randomness="different", **kwargs)
self.params = nn.ParameterList([nn.Parameter(p) for p in params])
self._repr = str(modules)
def forward(self, *args, **kwargs):
return self.vmap(list(self.params), (), *args, **kwargs)
def __repr__(self):
return "Vectorized " + self._repr
class ShiftAug(nn.Module):
"""
Random shift image augmentation.
Adapted from https://github.com/facebookresearch/drqv2
"""
def __init__(self, pad=3):
super().__init__()
self.pad = pad
def forward(self, x):
x = x.float()
n, _, h, w = x.size()
assert h == w
padding = tuple([self.pad] * 4)
x = F.pad(x, padding, "replicate")
eps = 1.0 / (h + 2 * self.pad)
arange = torch.linspace(-1.0 + eps, 1.0 - eps, h + 2 * self.pad, device=x.device, dtype=x.dtype)[:h]
arange = arange.unsqueeze(0).repeat(h, 1).unsqueeze(2)
base_grid = torch.cat([arange, arange.transpose(1, 0)], dim=2)
base_grid = base_grid.unsqueeze(0).repeat(n, 1, 1, 1)
shift = torch.randint(0, 2 * self.pad + 1, size=(n, 1, 1, 2), device=x.device, dtype=x.dtype)
shift *= 2.0 / (h + 2 * self.pad)
grid = base_grid + shift
return F.grid_sample(x, grid, padding_mode="zeros", align_corners=False)
class PixelPreprocess(nn.Module):
"""
Normalizes pixel observations to [-0.5, 0.5].
"""
def __init__(self):
super().__init__()
def forward(self, x):
return x.div_(255.0).sub_(0.5)
class SimNorm(nn.Module):
"""
Simplicial normalization.
Adapted from https://arxiv.org/abs/2204.00616.
"""
def __init__(self, cfg):
super().__init__()
self.dim = cfg.simnorm_dim
def forward(self, x):
shp = x.shape
x = x.view(*shp[:-1], -1, self.dim)
x = F.softmax(x, dim=-1)
return x.view(*shp)
def __repr__(self):
return f"SimNorm(dim={self.dim})"
class NormedLinear(nn.Linear):
"""
Linear layer with LayerNorm, activation, and optional dropout.
"""
def __init__(self, *args, dropout=0.0, act=None, **kwargs):
super().__init__(*args, **kwargs)
self.ln = nn.LayerNorm(self.out_features)
self.act = nn.Mish(inplace=True) if act is None else act
self.dropout = nn.Dropout(dropout, inplace=True) if dropout else None
def forward(self, x):
x = super().forward(x)
if self.dropout:
x = self.dropout(x)
return self.act(self.ln(x))
def __repr__(self):
repr_dropout = f", dropout={self.dropout.p}" if self.dropout else ""
return (
f"NormedLinear(in_features={self.in_features}, "
f"out_features={self.out_features}, "
f"bias={self.bias is not None}{repr_dropout}, "
f"act={self.act.__class__.__name__})"
)
def mlp(in_dim, mlp_dims, out_dim, act=None, dropout=0.0):
"""
Basic building block of TD-MPC2.
MLP with LayerNorm, Mish activations, and optional dropout.
"""
if isinstance(mlp_dims, int):
mlp_dims = [mlp_dims]
dims = [in_dim] + mlp_dims + [out_dim]
mlp = nn.ModuleList()
for i in range(len(dims) - 2):
mlp.append(NormedLinear(dims[i], dims[i + 1], dropout=dropout * (i == 0)))
mlp.append(NormedLinear(dims[-2], dims[-1], act=act) if act else nn.Linear(dims[-2], dims[-1]))
return nn.Sequential(*mlp)
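# Example (illustrative sketch): a small head built with the helper above, i.e.
# NormedLinear hidden layers (LayerNorm + Mish) followed by a plain Linear output.
_head = mlp(in_dim=32, mlp_dims=[128, 128], out_dim=5)
_out = _head(torch.randn(16, 32))  # shape (16, 5)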
def conv(in_shape, num_channels, act=None):
"""
Basic convolutional encoder for TD-MPC2 with raw image observations.
4 convolutional layers with ReLU activations, followed by a flatten and an optional output activation.
"""
# assert in_shape[-1] == 64  # assumes RGB observations are 64x64
layers = [
ShiftAug(),
PixelPreprocess(),
nn.Conv2d(in_shape[0], num_channels, 7, stride=2),
nn.ReLU(inplace=True),
nn.Conv2d(num_channels, num_channels, 5, stride=2),
nn.ReLU(inplace=True),
nn.Conv2d(num_channels, num_channels, 3, stride=2),
nn.ReLU(inplace=True),
nn.Conv2d(num_channels, num_channels, 3, stride=1),
nn.Flatten(),
]
if act:
layers.append(act)
return nn.Sequential(*layers)
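# Example (illustrative sketch): encode a batch of raw RGB frames. The flattened feature
# size depends on the input resolution (512 for 64x64 inputs with 32 channels).
_frames = torch.randint(0, 256, (8, 3, 64, 64), dtype=torch.uint8)
_feats = conv(in_shape=(3, 64, 64), num_channels=32)(_frames)  # shape (8, 512)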
class RunningScale:
"""Running trimmed scale estimator."""
def __init__(self, cfg):
self.cfg = cfg
self._value = torch.ones(1, dtype=torch.float32, device=torch.device("cuda"))
self._percentiles = torch.tensor([5, 95], dtype=torch.float32, device=torch.device("cuda"))
def state_dict(self):
return {"value": self._value, "percentiles": self._percentiles}
def load_state_dict(self, state_dict):
self._value.data.copy_(state_dict["value"])
self._percentiles.data.copy_(state_dict["percentiles"])
@property
def value(self):
return self._value.cpu().item()
def _percentile(self, x):
x_dtype, x_shape = x.dtype, x.shape
x = x.view(x.shape[0], -1)
in_sorted, _ = torch.sort(x, dim=0)
positions = self._percentiles * (x.shape[0] - 1) / 100
floored = torch.floor(positions)
ceiled = floored + 1
ceiled[ceiled > x.shape[0] - 1] = x.shape[0] - 1
weight_ceiled = positions - floored
weight_floored = 1.0 - weight_ceiled
d0 = in_sorted[floored.long(), :] * weight_floored[:, None]
d1 = in_sorted[ceiled.long(), :] * weight_ceiled[:, None]
return (d0 + d1).view(-1, *x_shape[1:]).type(x_dtype)
def update(self, x):
percentiles = self._percentile(x.detach())
value = torch.clamp(percentiles[1] - percentiles[0], min=1.0)
self._value.data.lerp_(value, self.cfg.tau)
def __call__(self, x, update=False):
if update:
self.update(x)
return x * (1 / self.value)
def __repr__(self):
return f"RunningScale(S: {self.value})"

View File

@@ -1,7 +1,9 @@
import logging
import pickle
import time
from dataclasses import dataclass, field, replace
from pathlib import Path
from typing import Sequence
import numpy as np
import torch
@@ -26,7 +28,6 @@ URL_TEMPLATE = (
# In nominal degree range ]-180, +180[
ZERO_POSITION_DEGREE = 0
ROTATED_POSITION_DEGREE = 90
GRIPPER_OPEN_DEGREE = 35.156
def assert_drive_mode(drive_mode):
@@ -165,6 +166,30 @@ class KochRobotConfig:
follower_arms: dict[str, MotorsBus] = field(default_factory=lambda: {})
cameras: dict[str, Camera] = field(default_factory=lambda: {})
# Optionally limit the magnitude of the relative positional target vector for safety purposes.
# Set this to a positive scalar to have the same value for all motors, or a list that is the same length
# as the number of motors in your follower arms (assumes all follower arms have the same number of
# motors).
max_relative_target: list[float] | float | None = None
# Optionally set the leader arm in torque mode with the gripper motor set to this angle. This makes it
# possible to squeeze the gripper and have it spring back to an open position on its own. If None, the
# gripper is not put in torque mode.
gripper_open_degree: float | None = None
def __setattr__(self, prop: str, val):
if prop == "max_relative_target" and val is not None and isinstance(val, Sequence):
for name in self.follower_arms:
if len(self.follower_arms[name].motors) != len(val):
raise ValueError(
f"len(max_relative_target)={len(val)} but the follower arm with name {name} has "
f"{len(self.follower_arms[name].motors)} motors. Please make sure that the "
f"`max_relative_target` list has as many parameters as there are motors per arm. "
"Note: This feature does not yet work with robots where different follower arms have "
"different numbers of motors."
)
super().__setattr__(prop, val)
class KochRobot:
# TODO(rcadene): Implement force feedback
@@ -206,7 +231,10 @@ class KochRobot:
},
),
}
robot = KochRobot(leader_arms, follower_arms)
robot = KochRobot(
leader_arms=leader_arms,
follower_arms=follower_arms,
)
# Connect motors buses and cameras if any (Required)
robot.connect()
@@ -218,7 +246,10 @@ class KochRobot:
Example of highest frequency data collection without camera:
```python
# Assumes leader and follower arms have been instantiated already (see first example)
robot = KochRobot(leader_arms, follower_arms)
robot = KochRobot(
leader_arms=leader_arms,
follower_arms=follower_arms,
)
robot.connect()
while True:
observation, action = robot.teleop_step(record_data=True)
@@ -236,7 +267,11 @@ class KochRobot:
}
# Assumes leader and follower arms have been instantiated already (see first example)
robot = KochRobot(leader_arms, follower_arms, cameras)
robot = KochRobot(
leader_arms=leader_arms,
follower_arms=follower_arms,
cameras=cameras,
)
robot.connect()
while True:
observation, action = robot.teleop_step(record_data=True)
@@ -245,7 +280,11 @@ class KochRobot:
Example of controlling the robot with a policy (without running multiple policies in parallel to ensure highest frequency):
```python
# Assumes leader and follower arms + cameras have been instantiated already (see previous example)
robot = KochRobot(leader_arms, follower_arms, cameras)
robot = KochRobot(
leader_arms=leader_arms,
follower_arms=follower_arms,
cameras=cameras,
)
robot.connect()
while True:
# Uses the follower arms and cameras to capture an observation
@@ -339,11 +378,12 @@ class KochRobot:
print(f"Activating torque on {name} follower arm.")
self.follower_arms[name].write("Torque_Enable", 1)
# Enable torque on the gripper of the leader arms, and move it to 45 degrees,
# so that we can use it as a trigger to close the gripper of the follower arms.
for name in self.leader_arms:
self.leader_arms[name].write("Torque_Enable", 1, "gripper")
self.leader_arms[name].write("Goal_Position", GRIPPER_OPEN_DEGREE, "gripper")
if self.config.gripper_open_degree is not None:
# Set the leader arm in torque mode with the gripper motor set to an angle. This makes it possible
# to squeeze the gripper and have it spring back to an open position on its own.
for name in self.leader_arms:
self.leader_arms[name].write("Torque_Enable", 1, "gripper")
self.leader_arms[name].write("Goal_Position", self.config.gripper_open_degree, "gripper")
# Connect the cameras
for name in self.cameras:
@@ -392,7 +432,7 @@ class KochRobot:
# Send action
for name in self.follower_arms:
before_fwrite_t = time.perf_counter()
self.follower_arms[name].write("Goal_Position", follower_goal_pos[name])
self.send_action(torch.tensor(follower_goal_pos[name]), [name])
self.logs[f"write_follower_{name}_goal_pos_dt_s"] = time.perf_counter() - before_fwrite_t
# Early exit when recording data is not requested
@@ -474,21 +514,55 @@ class KochRobot:
obs_dict[f"observation.images.{name}"] = torch.from_numpy(images[name])
return obs_dict
def send_action(self, action: torch.Tensor):
"""The provided action is expected to be a vector."""
def send_action(self, action: torch.Tensor, follower_names: list[str] | None = None):
"""Command the follower arms to move to a target joint configuration.
The relative action magnitude may be clipped depending on the configuration parameter
`max_relative_target`.
Args:
action: tensor containing the concatenated joint positions for the follower arms.
follower_names: Pass follower arm names to only control a subset of all the follower arms.
"""
if not self.is_connected:
raise RobotDeviceNotConnectedError(
"KochRobot is not connected. You need to run `robot.connect()`."
)
if follower_names is None:
follower_names = list(self.follower_arms)
elif not set(follower_names).issubset(self.follower_arms):
raise ValueError(
f"You provided {follower_names=} but only the following arms are registered: "
f"{list(self.follower_arms)}"
)
from_idx = 0
to_idx = 0
follower_goal_pos = {}
for name in self.follower_arms:
if name in self.follower_arms:
to_idx += len(self.follower_arms[name].motor_names)
follower_goal_pos[name] = action[from_idx:to_idx].numpy()
from_idx = to_idx
for name in follower_names:
to_idx += len(self.follower_arms[name].motor_names)
this_action = action[from_idx:to_idx]
if self.config.max_relative_target is not None:
if not isinstance(self.config.max_relative_target, list):
max_relative_target = [self.config.max_relative_target for _ in range(from_idx, to_idx)]
max_relative_target = torch.tensor(self.config.max_relative_target)
# Cap relative action target magnitude for safety.
current_pos = torch.tensor(self.follower_arms[name].read("Present_Position"))
diff = this_action - current_pos
safe_diff = torch.minimum(diff, max_relative_target)
safe_diff = torch.maximum(safe_diff, -max_relative_target)
safe_action = current_pos + safe_diff
if not torch.allclose(safe_action, this_action):
logging.warning(
"Relative action magnitude had to be clamped to be safe.\n"
f" requested relative action target: {diff}\n"
f" clamped relative action target: {safe_diff}"
)
follower_goal_pos[name] = safe_action.numpy()
from_idx = to_idx
for name in self.follower_arms:
self.follower_arms[name].write("Goal_Position", follower_goal_pos[name].astype(np.int32))

View File

@@ -14,6 +14,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.
import logging
import os
import os.path as osp
import random
from contextlib import contextmanager
@@ -27,6 +28,12 @@ import torch
from omegaconf import DictConfig
def inside_slurm():
"""Check whether the python process was launched through slurm"""
# TODO(rcadene): return False for interactive mode `--pty bash`
return "SLURM_JOB_ID" in os.environ
def get_safe_torch_device(cfg_device: str, log: bool = False) -> torch.device:
"""Given a string, return a torch.device with checks on whether the device is available."""
match cfg_device:
@@ -158,7 +165,6 @@ def init_hydra_config(config_path: str, overrides: list[str] | None = None) -> D
version_base="1.2",
)
cfg = hydra.compose(Path(config_path).stem, overrides)
return cfg

View File

@@ -120,7 +120,7 @@ eval:
# `batch_size` specifies the number of environments to use in a gym.vector.VectorEnv.
batch_size: 1
# `use_async_envs` specifies whether to use asynchronous environments (multiprocessing).
use_async_envs: false
use_async_envs: true
wandb:
enable: false

View File

@@ -2,6 +2,11 @@
fps: 50
eval:
# `use_async_envs` specifies whether to use asynchronous environments (multiprocessing).
# set it to false to avoid some issues with the aloha env
use_async_envs: false
env:
name: aloha
task: AlohaInsertion-v0

View File

@@ -2,6 +2,11 @@
fps: 15
eval:
# `use_async_envs` specifies whether to use asynchronous environments (multiprocessing).
# set it to false to avoid some issues with the xarm env
use_async_envs: false
env:
name: xarm
task: XarmLift-v0

View File

@@ -37,3 +37,10 @@ cameras:
fps: 30
width: 640
height: 480
# `max_relative_target` limits the magnitude of the relative positional target vector for safety purposes.
# Set this to a positive scalar to have the same value for all motors, or a list that is the same length as
# the number of motors in your follower arms.
max_relative_target: null
# Sets the leader arm in torque mode with the gripper motor set to this angle. This makes it possible
# to squeeze the gripper and have it spring back to an open position on its own.
gripper_open_degree: 35.156

View File

@@ -70,7 +70,13 @@ from lerobot.common.policies.factory import make_policy
from lerobot.common.policies.policy_protocol import Policy
from lerobot.common.policies.utils import get_device_from_parameters
from lerobot.common.utils.io_utils import write_video
from lerobot.common.utils.utils import get_safe_torch_device, init_hydra_config, init_logging, set_global_seed
from lerobot.common.utils.utils import (
get_safe_torch_device,
init_hydra_config,
init_logging,
inside_slurm,
set_global_seed,
)
def rollout(
@@ -79,7 +85,6 @@ def rollout(
seeds: list[int] | None = None,
return_observations: bool = False,
render_callback: Callable[[gym.vector.VectorEnv], None] | None = None,
enable_progbar: bool = False,
) -> dict:
"""Run a batched policy rollout once through a batch of environments.
@@ -109,7 +114,6 @@ def rollout(
are returned optionally because they typically take more memory to cache. Defaults to False.
render_callback: Optional rendering callback to be used after the environments are reset, and after
every step.
enable_progbar: Enable a progress bar over rollout steps.
Returns:
The dictionary described above.
"""
@@ -136,7 +140,7 @@ def rollout(
progbar = trange(
max_steps,
desc=f"Running rollout with at most {max_steps} steps",
disable=not enable_progbar,
disable=inside_slurm(),  # we don't want a progress bar when running under slurm, since it clutters the logs
leave=False,
)
while not np.all(done):
@@ -210,8 +214,6 @@ def eval_policy(
videos_dir: Path | None = None,
return_episode_data: bool = False,
start_seed: int | None = None,
enable_progbar: bool = False,
enable_inner_progbar: bool = False,
) -> dict:
"""
Args:
@@ -224,8 +226,6 @@ def eval_policy(
the "episodes" key of the returned dictionary.
start_seed: The first seed to use for the first individual rollout. For all subsequent rollouts the
seed is incremented by 1. If not provided, the environments are not manually seeded.
enable_progbar: Enable progress bar over batches.
enable_inner_progbar: Enable progress bar over steps in each batch.
Returns:
Dictionary with metrics and data regarding the rollouts.
"""
@@ -266,7 +266,8 @@ def eval_policy(
if return_episode_data:
episode_data: dict | None = None
progbar = trange(n_batches, desc="Stepping through eval batches", disable=not enable_progbar)
# we don't want a progress bar when running under slurm, since it clutters the logs
progbar = trange(n_batches, desc="Stepping through eval batches", disable=inside_slurm())
for batch_ix in progbar:
# Cache frames for rendering videos. Each item will be (b, h, w, c), and the list indexes the rollout
# step.
@@ -285,7 +286,6 @@ def eval_policy(
seeds=list(seeds) if seeds else None,
return_observations=return_episode_data,
render_callback=render_frame if max_episodes_rendered > 0 else None,
enable_progbar=enable_inner_progbar,
)
# Figure out where in each rollout sequence the first done condition was encountered (results after
@@ -454,6 +454,16 @@ def main(
else:
hydra_cfg = init_hydra_config(hydra_cfg_path, config_overrides)
if hydra_cfg.eval.batch_size > hydra_cfg.eval.n_episodes:
raise ValueError(
"The eval batch size is greater than the number of eval episodes "
f"({hydra_cfg.eval.batch_size} > {hydra_cfg.eval.n_episodes}). As a result, {hydra_cfg.eval.batch_size} "
f"eval environments will be instantiated, but only {hydra_cfg.eval.n_episodes} will be used. "
"This might significantly slow down evaluation. To fix this, you should update your command "
f"to increase the number of episodes to match the batch size (e.g. `eval.n_episodes={hydra_cfg.eval.batch_size}`), "
f"or lower the batch size (e.g. `eval.batch_size={hydra_cfg.eval.n_episodes}`)."
)
if out_dir is None:
out_dir = f"outputs/eval/{dt.now().strftime('%Y-%m-%d/%H-%M-%S')}_{hydra_cfg.env.name}_{hydra_cfg.policy.name}"
@@ -487,8 +497,6 @@ def main(
max_episodes_rendered=10,
videos_dir=Path(out_dir) / "videos",
start_seed=hydra_cfg.seed,
enable_progbar=True,
enable_inner_progbar=True,
)
print(info["aggregated"])

View File

@@ -66,6 +66,8 @@ def get_from_raw_to_lerobot_format_fn(raw_format: str):
from lerobot.common.datasets.push_dataset_to_hub.umi_zarr_format import from_raw_to_lerobot_format
elif raw_format == "aloha_hdf5":
from lerobot.common.datasets.push_dataset_to_hub.aloha_hdf5_format import from_raw_to_lerobot_format
elif "openx_rlds" in raw_format:
from lerobot.common.datasets.push_dataset_to_hub.openx_rlds_format import from_raw_to_lerobot_format
elif raw_format == "dora_parquet":
from lerobot.common.datasets.push_dataset_to_hub.dora_parquet_format import from_raw_to_lerobot_format
elif raw_format == "xarm_pkl":
@@ -197,9 +199,25 @@ def push_dataset_to_hub(
# convert dataset from original raw format to LeRobot format
from_raw_to_lerobot_format = get_from_raw_to_lerobot_format_fn(raw_format)
hf_dataset, episode_data_index, info = from_raw_to_lerobot_format(
raw_dir, videos_dir, fps, video, episodes, encoding
)
fmt_kwgs = {
"raw_dir": raw_dir,
"videos_dir": videos_dir,
"fps": fps,
"video": video,
"episodes": episodes,
"encoding": encoding,
}
if "openx_rlds." in raw_format:
# Support for official OXE dataset name inside `raw_format`.
# For instance, `raw_format="openx_rlds"` uses the default formatting (TODO: document the default),
# and `raw_format="openx_rlds.bridge_orig"` uses the bridge_orig formatting.
_, openx_dataset_name = raw_format.split(".")
print(f"Converting dataset [{openx_dataset_name}] from 'openx_rlds' to LeRobot format.")
fmt_kwgs["openx_dataset_name"] = openx_dataset_name
hf_dataset, episode_data_index, info = from_raw_to_lerobot_format(**fmt_kwgs)
lerobot_dataset = LeRobotDataset.from_preloaded(
repo_id=repo_id,
@@ -268,7 +286,7 @@ def main():
"--raw-format",
type=str,
required=True,
help="Dataset type (e.g. `pusht_zarr`, `umi_zarr`, `aloha_hdf5`, `xarm_pkl`, `dora_parquet`).",
help="Dataset type (e.g. `pusht_zarr`, `umi_zarr`, `aloha_hdf5`, `xarm_pkl`, `dora_parquet`, `openx_rlds`).",
)
parser.add_argument(
"--repo-id",
@@ -328,6 +346,13 @@ def main():
default=0,
help="When set to 1, resumes a previous run.",
)
parser.add_argument(
"--cache-dir",
type=Path,
required=False,
default="/tmp",
help="Directory to store the temporary videos and images generated while creating the dataset.",
)
parser.add_argument(
"--tests-data-dir",
type=Path,

View File

@@ -93,6 +93,16 @@ def make_optimizer_and_scheduler(cfg, policy):
elif policy.name == "tdmpc":
optimizer = torch.optim.Adam(policy.parameters(), cfg.training.lr)
lr_scheduler = None
elif policy.name == "tdmpc2":
params_group = [
{"params": policy.model._encoder.parameters(), "lr": cfg.training.lr * cfg.training.enc_lr_scale},
{"params": policy.model._dynamics.parameters()},
{"params": policy.model._reward.parameters()},
{"params": policy.model._Qs.parameters()},
{"params": policy.model._pi.parameters(), "eps": 1e-5},
]
optimizer = torch.optim.Adam(params_group, lr=cfg.training.lr)
lr_scheduler = None
elif cfg.policy.name == "vqbet":
from lerobot.common.policies.vqbet.modeling_vqbet import VQBeTOptimizer, VQBeTScheduler
@@ -241,6 +251,7 @@ def train(cfg: DictConfig, out_dir: str | None = None, job_name: str | None = No
raise NotImplementedError()
init_logging()
logging.info(pformat(OmegaConf.to_container(cfg)))
if cfg.training.online_steps > 0 and isinstance(cfg.dataset_repo_id, ListConfig):
raise NotImplementedError("Online training with LeRobotMultiDataset is not implemented.")
@@ -287,6 +298,16 @@ def train(cfg: DictConfig, out_dir: str | None = None, job_name: str | None = No
"you meant to resume training, please use `resume=true` in your command or yaml configuration."
)
if cfg.eval.batch_size > cfg.eval.n_episodes:
raise ValueError(
"The eval batch size is greater than the number of eval episodes "
f"({cfg.eval.batch_size} > {cfg.eval.n_episodes}). As a result, {cfg.eval.batch_size} "
f"eval environments will be instantiated, but only {cfg.eval.n_episodes} will be used. "
"This might significantly slow down evaluation. To fix this, you should update your command "
f"to increase the number of episodes to match the batch size (e.g. `eval.n_episodes={cfg.eval.batch_size}`), "
f"or lower the batch size (e.g. `eval.batch_size={cfg.eval.n_episodes}`)."
)
# log metrics to terminal and wandb
logger = Logger(cfg, out_dir, wandb_job_name=job_name)

View File

@@ -75,7 +75,7 @@
{% for video_info in videos_info %}
<div class="max-w-96">
<p class="text-sm text-gray-300 bg-gray-800 px-2 rounded-t-xl truncate">{{ video_info.filename }}</p>
<video autoplay muted loop type="video/mp4" class="min-w-64" @timeupdate="() => {
<video muted loop type="video/mp4" class="min-w-64" @canplay="videoCanPlay" @timeupdate="() => {
if (video.duration) {
const time = video.currentTime;
const pc = (100 / video.duration) * time;
@@ -107,12 +107,12 @@
<!-- Controllers -->
<div class="flex gap-1 text-3xl items-center">
<button x-ref="btnPlay" class="-rotate-90 hidden" class="-rotate-90" title="Play. Toggle with Space" @click="() => {
<button x-ref="btnPlay" class="-rotate-90" class="-rotate-90" title="Play. Toggle with Space" @click="() => {
videos.forEach(video => video.play());
$refs.btnPlay.classList.toggle('hidden');
$refs.btnPause.classList.toggle('hidden');
}">🔽</button>
<button x-ref="btnPause" title="Pause. Toggle with Space" @click="() => {
<button x-ref="btnPause" class="hidden" title="Pause. Toggle with Space" @click="() => {
videos.forEach(video => video.pause());
$refs.btnPlay.classList.toggle('hidden');
$refs.btnPause.classList.toggle('hidden');
@@ -125,7 +125,6 @@
@click="() => (videos.forEach(video => (video.currentTime = 0.0)))">↩️</button>
<input x-ref="slider" max="100" min="0" step="1" type="range" value="0" class="w-80 mx-2" @input="() => {
const sliderValue = $refs.slider.value;
$refs.btnPause.click();
videos.forEach(video => {
const time = (video.duration * sliderValue) / 100;
video.currentTime = time;
@@ -208,6 +207,8 @@
videos: null,
video: null,
colors: null,
nVideos: {{ videos_info | length }},
nVideoReadyToPlay: 0,
// alpine initialization
init() {
@@ -343,7 +344,6 @@
window.history.replaceState({}, '', url.toString());
},
formatTime(time) {
var hours = Math.floor(time / 3600);
var minutes = Math.floor((time % 3600) / 60);
@@ -351,6 +351,14 @@
return (hours > 0 ? hours + ':' : '') + (minutes < 10 ? '0' + minutes : minutes) + ':' + (seconds <
10 ?
'0' + seconds : seconds);
},
videoCanPlay() {
this.nVideoReadyToPlay += 1;
if(this.nVideoReadyToPlay == this.nVideos) {
// start autoplay all videos in sync
this.$refs.btnPlay.click();
}
}
};
}

View File

@@ -84,5 +84,7 @@ if __name__ == "__main__":
"lerobot/pusht",
"lerobot/aloha_sim_insertion_human",
"lerobot/xarm_lift_medium",
"lerobot/nyu_franka_play_dataset",
"lerobot/cmu_stretch",
]:
save_dataset_to_safetensors("tests/data/save_dataset_to_safetensors", repo_id=dataset)

View File

@@ -303,6 +303,9 @@ def test_flatten_unflatten_dict():
"lerobot/pusht",
"lerobot/aloha_sim_insertion_human",
"lerobot/xarm_lift_medium",
# (michel-aractingi) commenting the two datasets from openx as test is failing
# "lerobot/nyu_franka_play_dataset",
# "lerobot/cmu_stretch",
],
)
def test_backward_compatibility(repo_id):
@@ -318,6 +321,11 @@ def test_backward_compatibility(repo_id):
new_frame = dataset[i] # noqa: B023
old_frame = load_file(test_dir / f"frame_{i}.safetensors") # noqa: B023
# ignore language instructions (if they exist) in language-conditioned datasets
# TODO (michel-aractingi): transform language obs to language embeddings via tokenizer
new_frame.pop("language_instruction", None)
old_frame.pop("language_instruction", None)
new_keys = set(new_frame.keys())
old_keys = set(old_frame.keys())
assert new_keys == old_keys, f"{new_keys=} and {old_keys=} are not the same"

View File

@@ -1,5 +1,6 @@
import random
from typing import Callable
from uuid import uuid4
import numpy as np
import pytest
@@ -13,6 +14,7 @@ from lerobot.common.datasets.utils import (
)
from lerobot.common.utils.utils import (
get_global_random_state,
init_hydra_config,
seeded_context,
set_global_random_state,
set_global_seed,
@@ -83,3 +85,10 @@ def test_reset_episode_index():
correct_episode_index = [0, 0, 1, 2, 2, 2]
dataset = reset_episode_index(dataset)
assert dataset["episode_index"] == correct_episode_index
def test_init_hydra_config_empty():
test_file = f"/tmp/test_init_hydra_config_empty_{uuid4().hex}.yaml"
with open(test_file, "w") as f:
f.write("\n")
init_hydra_config(test_file)