nit

cleaned eval_on_robot.py; readded policy; fixed doc strings
Extend reward classifier for multiple camera views (#626 )
2025-01-22 10:26:52 +01:00 · 2025-01-22 10:26:52 +01:00 · 2025-01-22 10:26:52 +01:00 · 2025-01-22 10:26:52 +01:00 · 2025-01-22 10:26:50 +01:00 · 2025-01-22 10:25:52 +01:00
30 changed files with 3603 additions and 28 deletions
--- a/examples/12_train_hilserl_classifier.md
+++ b/examples/12_train_hilserl_classifier.md
@@ -0,0 +1,83 @@
+# Training a HIL-SERL Reward Classifier with LeRobot
+
+This tutorial provides step-by-step instructions for training a reward classifier using LeRobot.
+
+---
+
+## Training Script Overview
+
+LeRobot includes a ready-to-use training script located at [`lerobot/scripts/train_hilserl_classifier.py`](../../lerobot/scripts/train_hilserl_classifier.py). Here's an outline of its workflow:
+
+1. **Configuration Loading**
+   The script uses Hydra to load a configuration file for subsequent steps. (Details on Hydra follow below.)
+
+2. **Dataset Initialization**
+   It loads a `LeRobotDataset` containing images and rewards. To optimize performance, a weighted random sampler is used to balance class sampling.
+
+3. **Classifier Initialization**
+   A lightweight classification head is built on top of a frozen, pretrained image encoder from HuggingFace. The classifier outputs either:
+   - A single probability (binary classification), or
+   - Logits (multi-class classification).
+
+4. **Training Loop Execution**
+   The script performs:
+   - Forward and backward passes,
+   - Optimization steps,
+   - Periodic logging, evaluation, and checkpoint saving.
+
+---
+
+## Configuring with Hydra
+
+For detailed information about Hydra usage, refer to [`examples/4_train_policy_with_script.md`](../examples/4_train_policy_with_script.md). However, note that training the reward classifier differs slightly and requires a separate configuration file.
+
+### Config File Setup
+
+The default `default.yaml` cannot launch the reward classifier training directly. Instead, you need a configuration file like [`lerobot/configs/policy/hilserl_classifier.yaml`](../../lerobot/configs/policy/hilserl_classifier.yaml), with the following adjustment:
+
+Replace the `dataset_repo_id` field with the identifier for your dataset, which contains images and sparse rewards:
+
+```yaml
+# Example: lerobot/configs/policy/reward_classifier.yaml
+dataset_repo_id: "my_dataset_repo_id"
+## Typical logs and metrics
+```
+When you start the training process, you will first see your full configuration being printed in the terminal. You can check it to make sure that you config it correctly and your config is not overrided by other files. The final configuration will also be saved with the checkpoint.
+
+After that, you will see training log like this one:
+
+```
+[2024-11-29 18:26:36,999][root][INFO] -
+Epoch 5/5
+Training:  82%|██████████████████████████████████████████████████████████████████████████████▋                 | 91/111 [00:50<00:09,  2.04it/s, loss=0.2999, acc=69.99%]
+```
+
+or evaluation log like:
+
+```
+Validation: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 28/28 [00:20<00:00,  1.37it/s]
+```
+
+### Metrics Tracking with Weights & Biases (WandB)
+
+If `wandb.enable` is set to `true`, the training and evaluation logs will also be saved in WandB. This allows you to track key metrics in real-time, including:
+
+- **Training Metrics**:
+  - `train/accuracy`
+  - `train/loss`
+  - `train/dataloading_s`
+- **Evaluation Metrics**:
+  - `eval/accuracy`
+  - `eval/loss`
+  - `eval/eval_s`
+
+#### Additional Features
+
+You can also log sample predictions during evaluation. Each logged sample will include:
+
+- The **input image**.
+- The **predicted label**.
+- The **true label**.
+- The **classifier's "confidence" (logits/probability)**.
+
+These logs can be useful for diagnosing and debugging performance issues.
--- a/lerobot/common/datasets/lerobot_dataset.py
+++ b/lerobot/common/datasets/lerobot_dataset.py
@@ -291,7 +291,7 @@ class LeRobotDatasetMetadata:
        obj.root.mkdir(parents=True, exist_ok=False)

        if robot is not None:
-            features = get_features_from_robot(robot, use_videos)
+            features = {**(features or {}), **get_features_from_robot(robot)}
            robot_type = robot.robot_type
            if not all(cam.fps == fps for cam in robot.cameras.values()):
                logging.warning(
--- a/lerobot/common/envs/factory.py
+++ b/lerobot/common/envs/factory.py
@@ -14,9 +14,13 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 import importlib
+from collections import deque

 import gymnasium as gym
+import numpy as np
+import torch
 from omegaconf import DictConfig
+from mani_skill.utils import common


 def make_env(cfg: DictConfig, n_envs: int | None = None) -> gym.vector.VectorEnv | None:
@@ -30,6 +34,10 @@ def make_env(cfg: DictConfig, n_envs: int | None = None) -> gym.vector.VectorEnv
    if cfg.env.name == "real_world":
        return

+    if "maniskill" in cfg.env.name:
+        env = make_maniskill_env(cfg, n_envs if n_envs is not None else cfg.eval.batch_size)
+        return env
+
    package_name = f"gym_{cfg.env.name}"

    try:
@@ -56,3 +64,86 @@ def make_env(cfg: DictConfig, n_envs: int | None = None) -> gym.vector.VectorEnv
    )

    return env
+
+
+def make_maniskill_env(cfg: DictConfig, n_envs: int | None = None) -> gym.vector.VectorEnv | None:
+    """Make ManiSkill3 gym environment"""
+    from mani_skill.vector.wrappers.gymnasium import ManiSkillVectorEnv
+
+    env = gym.make(
+        cfg.env.task,
+        obs_mode=cfg.env.obs,
+        control_mode=cfg.env.control_mode,
+        render_mode=cfg.env.render_mode,
+        sensor_configs=dict(width=cfg.env.image_size, height=cfg.env.image_size),
+        num_envs=n_envs,
+    )
+    # cfg.env_cfg.control_mode = cfg.eval_env_cfg.control_mode = env.control_mode
+    env = ManiSkillVectorEnv(env, ignore_terminations=True)
+    # state should have the size of 25
+    # env = ConvertToLeRobotEnv(env, n_envs)
+    # env = PixelWrapper(cfg, env, n_envs)
+    env._max_episode_steps = env.max_episode_steps = 50  # gym_utils.find_max_episode_steps_value(env)
+    env.unwrapped.metadata["render_fps"] = 20
+
+    return env
+
+
+class PixelWrapper(gym.Wrapper):
+    """
+    Wrapper for pixel observations. Works with Maniskill vectorized environments
+    """
+
+    def __init__(self, cfg, env, num_envs, num_frames=3):
+        super().__init__(env)
+        self.cfg = cfg
+        self.env = env
+        self.observation_space = gym.spaces.Box(
+            low=0,
+            high=255,
+            shape=(num_envs, num_frames * 3, cfg.env.render_size, cfg.env.render_size),
+            dtype=np.uint8,
+        )
+        self._frames = deque([], maxlen=num_frames)
+        self._render_size = cfg.env.render_size
+
+    def _get_obs(self, obs):
+        frame = obs["sensor_data"]["base_camera"]["rgb"].cpu().permute(0, 3, 1, 2)
+        self._frames.append(frame)
+        return {"pixels": torch.from_numpy(np.concatenate(self._frames, axis=1)).to(self.env.device)}
+
+    def reset(self, seed):
+        obs, info = self.env.reset()  # (seed=seed)
+        for _ in range(self._frames.maxlen):
+            obs_frames = self._get_obs(obs)
+        return obs_frames, info
+
+    def step(self, action):
+        obs, reward, terminated, truncated, info = self.env.step(action)
+        return self._get_obs(obs), reward, terminated, truncated, info
+
+class ConvertToLeRobotEnv(gym.Wrapper):
+    def __init__(self, env, num_envs):
+        super().__init__(env)
+    def reset(self, seed=None, options=None):
+        obs, info = self.env.reset(seed=seed, options={})
+        return self._get_obs(obs), info
+    def step(self, action):
+        obs, reward, terminated, truncated, info = self.env.step(action)
+        return self._get_obs(obs), reward, terminated, truncated, info
+    def _get_obs(self, observation):
+        sensor_data = observation.pop("sensor_data")
+        del observation["sensor_param"]
+        images = []
+        for cam_data in sensor_data.values():
+                images.append(cam_data["rgb"])
+
+        images = torch.concat(images, axis=-1)
+        # flatten the rest of the data which should just be state data
+        observation = common.flatten_state_dict(
+            observation, use_torch=True, device=self.base_env.device
+        )
+        ret = dict()
+        ret["state"] = observation
+        ret["pixels"] = images
+        return ret
--- a/lerobot/common/envs/utils.py
+++ b/lerobot/common/envs/utils.py
@@ -28,6 +28,9 @@ def preprocess_observation(observations: dict[str, np.ndarray]) -> dict[str, Ten
    """
    # map to expected inputs for the policy
    return_observations = {}
+    # TODO: You have to merge all tensors from agent key and extra key
+    # You don't keep sensor param key in the observation
+    # And you keep sensor data rgb
    if "pixels" in observations:
        if isinstance(observations["pixels"], dict):
            imgs = {f"observation.images.{key}": img for key, img in observations["pixels"].items()}
@@ -50,6 +53,8 @@ def preprocess_observation(observations: dict[str, np.ndarray]) -> dict[str, Ten
            img /= 255

            return_observations[imgkey] = img
+        # obs state agent qpos and qvel
+        # image

    if "environment_state" in observations:
        return_observations["observation.environment_state"] = torch.from_numpy(
@@ -60,3 +65,38 @@ def preprocess_observation(observations: dict[str, np.ndarray]) -> dict[str, Ten
    # requirement for "agent_pos"
    return_observations["observation.state"] = torch.from_numpy(observations["agent_pos"]).float()
    return return_observations
+
+
+def preprocess_maniskill_observation(observations: dict[str, np.ndarray]) -> dict[str, Tensor]:
+    """Convert environment observation to LeRobot format observation.
+    Args:
+        observation: Dictionary of observation batches from a Gym vector environment.
+    Returns:
+        Dictionary of observation batches with keys renamed to LeRobot format and values as tensors.
+    """
+    # map to expected inputs for the policy
+    return_observations = {}
+    # TODO: You have to merge all tensors from agent key and extra key
+    # You don't keep sensor param key in the observation
+    # And you keep sensor data rgb
+    q_pos = observations["agent"]["qpos"]
+    q_vel = observations["agent"]["qvel"]
+    tcp_pos = observations["extra"]["tcp_pose"]
+    img = observations["sensor_data"]["base_camera"]["rgb"]
+
+    _, h, w, c = img.shape
+    assert c < h and c < w, f"expect channel last images, but instead got {img.shape=}"
+
+    # sanity check that images are uint8
+    assert img.dtype == torch.uint8, f"expect torch.uint8, but instead {img.dtype=}"
+
+    # convert to channel first of type float32 in range [0,1]
+    img = einops.rearrange(img, "b h w c -> b c h w").contiguous()
+    img = img.type(torch.float32)
+    img /= 255
+
+    state = torch.cat([q_pos, q_vel, tcp_pos], dim=-1)
+
+    return_observations["observation.image"] = img
+    return_observations["observation.state"] = state
+    return return_observations
--- a/lerobot/common/logger.py
+++ b/lerobot/common/logger.py
@@ -25,6 +25,7 @@ from glob import glob
 from pathlib import Path

 import torch
+import wandb
 from huggingface_hub.constants import SAFETENSORS_SINGLE_FILE
 from omegaconf import DictConfig, OmegaConf
 from termcolor import colored
@@ -107,8 +108,6 @@ class Logger:
            self._wandb = None
        else:
            os.environ["WANDB_SILENT"] = "true"
-            import wandb
-
            wandb_run_id = None
            if cfg.resume:
                wandb_run_id = get_wandb_run_id_from_filesystem(self.checkpoints_dir)
@@ -232,7 +231,7 @@ class Logger:
        # TODO(alexander-soare): Add local text log.
        if self._wandb is not None:
            for k, v in d.items():
-                if not isinstance(v, (int, float, str)):
+                if not isinstance(v, (int, float, str, wandb.Table)):
                    logging.warning(
                        f'WandB logging of key "{k}" was ignored as its type is not handled by this wrapper.'
                    )
--- a/lerobot/common/policies/factory.py
+++ b/lerobot/common/policies/factory.py
@@ -66,6 +66,11 @@ def get_policy_and_config_classes(name: str) -> tuple[Policy, object]:
        from lerobot.common.policies.vqbet.modeling_vqbet import VQBeTPolicy

        return VQBeTPolicy, VQBeTConfig
+    elif name == "sac":
+        from lerobot.common.policies.sac.configuration_sac import SACConfig
+        from lerobot.common.policies.sac.modeling_sac import SACPolicy
+
+        return SACPolicy, SACConfig
    else:
        raise NotImplementedError(f"Policy with name {name} is not implemented.")

@@ -85,10 +90,10 @@ def make_policy(
            be provided when initializing a new policy, and must not be provided when loading a pretrained
            policy. Therefore, this argument is mutually exclusive with `pretrained_policy_name_or_path`.
    """
-    if not (pretrained_policy_name_or_path is None) ^ (dataset_stats is None):
-        raise ValueError(
-            "Exactly one of `pretrained_policy_name_or_path` and `dataset_stats` must be provided."
-        )
+    # if not (pretrained_policy_name_or_path is None) ^ (dataset_stats is None):
+    #     raise ValueError(
+    #         "Exactly one of `pretrained_policy_name_or_path` and `dataset_stats` must be provided."
+    #     )

    policy_cls, policy_cfg_class = get_policy_and_config_classes(hydra_cfg.policy.name)

--- a/lerobot/common/policies/hilserl/classifier/configuration_classifier.py
+++ b/lerobot/common/policies/hilserl/classifier/configuration_classifier.py
@@ -0,0 +1,35 @@
+import json
+import os
+from dataclasses import asdict, dataclass
+
+
+@dataclass
+class ClassifierConfig:
+    """Configuration for the Classifier model."""
+
+    num_classes: int = 2
+    hidden_dim: int = 256
+    dropout_rate: float = 0.1
+    model_name: str = "microsoft/resnet-50"
+    device: str = "cpu"
+    model_type: str = "cnn"  # "transformer" or "cnn"
+    num_cameras: int = 2
+
+    def save_pretrained(self, save_dir):
+        """Save config to json file."""
+        os.makedirs(save_dir, exist_ok=True)
+
+        # Convert to dict and save as JSON
+        config_dict = asdict(self)
+        with open(os.path.join(save_dir, "config.json"), "w") as f:
+            json.dump(config_dict, f, indent=2)
+
+    @classmethod
+    def from_pretrained(cls, pretrained_model_name_or_path):
+        """Load config from json file."""
+        config_file = os.path.join(pretrained_model_name_or_path, "config.json")
+
+        with open(config_file) as f:
+            config_dict = json.load(f)
+
+        return cls(**config_dict)
--- a/lerobot/common/policies/hilserl/classifier/modeling_classifier.py
+++ b/lerobot/common/policies/hilserl/classifier/modeling_classifier.py
@@ -0,0 +1,151 @@
+import logging
+from typing import Optional
+
+import torch
+from huggingface_hub import PyTorchModelHubMixin
+from torch import Tensor, nn
+
+from .configuration_classifier import ClassifierConfig
+
+logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(name)s - %(levelname)s - %(message)s")
+logger = logging.getLogger(__name__)
+
+
+class ClassifierOutput:
+    """Wrapper for classifier outputs with additional metadata."""
+
+    def __init__(
+        self, logits: Tensor, probabilities: Optional[Tensor] = None, hidden_states: Optional[Tensor] = None
+    ):
+        self.logits = logits
+        self.probabilities = probabilities
+        self.hidden_states = hidden_states
+
+    def __repr__(self):
+        return (
+            f"ClassifierOutput(logits={self.logits}, "
+            f"probabilities={self.probabilities}, "
+            f"hidden_states={self.hidden_states})"
+        )
+
+
+class Classifier(
+    nn.Module,
+    PyTorchModelHubMixin,
+    # Add Hub metadata
+    library_name="lerobot",
+    repo_url="https://github.com/huggingface/lerobot",
+    tags=["robotics", "vision-classifier"],
+):
+    """Image classifier built on top of a pre-trained encoder."""
+
+    # Add name attribute for factory
+    name = "classifier"
+
+    def __init__(self, config: ClassifierConfig):
+        from transformers import AutoImageProcessor, AutoModel
+
+        super().__init__()
+        self.config = config
+        self.processor = AutoImageProcessor.from_pretrained(self.config.model_name, trust_remote_code=True)
+        encoder = AutoModel.from_pretrained(self.config.model_name, trust_remote_code=True)
+        # Extract vision model if we're given a multimodal model
+        if hasattr(encoder, "vision_model"):
+            logging.info("Multimodal model detected - using vision encoder only")
+            self.encoder = encoder.vision_model
+            self.vision_config = encoder.config.vision_config
+        else:
+            self.encoder = encoder
+            self.vision_config = getattr(encoder, "config", None)
+
+        # Model type from config
+        self.is_cnn = self.config.model_type == "cnn"
+
+        # For CNNs, initialize backbone
+        if self.is_cnn:
+            self._setup_cnn_backbone()
+
+        self._freeze_encoder()
+        self._build_classifier_head()
+
+    def _setup_cnn_backbone(self):
+        """Set up CNN encoder"""
+        if hasattr(self.encoder, "fc"):
+            self.feature_dim = self.encoder.fc.in_features
+            self.encoder = nn.Sequential(*list(self.encoder.children())[:-1])
+        elif hasattr(self.encoder.config, "hidden_sizes"):
+            self.feature_dim = self.encoder.config.hidden_sizes[-1]  # Last channel dimension
+        else:
+            raise ValueError("Unsupported CNN architecture")
+
+        self.encoder = self.encoder.to(self.config.device)
+
+    def _freeze_encoder(self) -> None:
+        """Freeze the encoder parameters."""
+        for param in self.encoder.parameters():
+            param.requires_grad = False
+
+    def _build_classifier_head(self) -> None:
+        """Initialize the classifier head architecture."""
+        # Get input dimension based on model type
+        if self.is_cnn:
+            input_dim = self.feature_dim
+        else:  # Transformer models
+            if hasattr(self.encoder.config, "hidden_size"):
+                input_dim = self.encoder.config.hidden_size
+            else:
+                raise ValueError("Unsupported transformer architecture since hidden_size is not found")
+
+        self.classifier_head = nn.Sequential(
+            nn.Linear(input_dim * self.config.num_cameras, self.config.hidden_dim),
+            nn.Dropout(self.config.dropout_rate),
+            nn.LayerNorm(self.config.hidden_dim),
+            nn.ReLU(),
+            nn.Linear(self.config.hidden_dim, 1 if self.config.num_classes == 2 else self.config.num_classes),
+        )
+        self.classifier_head = self.classifier_head.to(self.config.device)
+
+    def _get_encoder_output(self, x: torch.Tensor) -> torch.Tensor:
+        """Extract the appropriate output from the encoder."""
+        # Process images with the processor (handles resizing and normalization)
+        processed = self.processor(
+            images=x,  # LeRobotDataset already provides proper tensor format
+            return_tensors="pt",
+        )
+        processed = processed["pixel_values"].to(x.device)
+
+        with torch.no_grad():
+            if self.is_cnn:
+                # The HF ResNet applies pooling internally
+                outputs = self.encoder(processed)
+                # Get pooled output directly
+                features = outputs.pooler_output
+
+                if features.dim() > 2:
+                    features = features.squeeze(-1).squeeze(-1)
+                return features
+            else:  # Transformer models
+                outputs = self.encoder(processed)
+                if hasattr(outputs, "pooler_output") and outputs.pooler_output is not None:
+                    return outputs.pooler_output
+                return outputs.last_hidden_state[:, 0, :]
+
+    def forward(self, xs: torch.Tensor) -> ClassifierOutput:
+        """Forward pass of the classifier."""
+        # For training, we expect input to be a tensor directly from LeRobotDataset
+        encoder_outputs = torch.hstack([self._get_encoder_output(x) for x in xs])
+        logits = self.classifier_head(encoder_outputs)
+
+        if self.config.num_classes == 2:
+            logits = logits.squeeze(-1)
+            probabilities = torch.sigmoid(logits)
+        else:
+            probabilities = torch.softmax(logits, dim=-1)
+
+        return ClassifierOutput(logits=logits, probabilities=probabilities, hidden_states=encoder_outputs)
+
+    def predict_reward(self, x):
+        if self.config.num_classes == 2:
+            return (self.forward(x).probabilities > 0.5).float()
+        else:
+            return torch.argmax(self.forward(x).probabilities, dim=1)
--- a/lerobot/common/policies/hilserl/configuration_hilserl.py
+++ b/lerobot/common/policies/hilserl/configuration_hilserl.py
@@ -0,0 +1,23 @@
+#!/usr/bin/env python
+
+# Copyright 2024 The HuggingFace Inc. team.
+# All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from dataclasses import dataclass
+
+
+@dataclass
+class HILSerlConfig:
+    pass
--- a/lerobot/common/policies/hilserl/modeling_hilserl.py
+++ b/lerobot/common/policies/hilserl/modeling_hilserl.py
@@ -0,0 +1,29 @@
+#!/usr/bin/env python
+
+# Copyright 2024 The HuggingFace Inc. team.
+# All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import torch.nn as nn
+from huggingface_hub import PyTorchModelHubMixin
+
+
+class HILSerlPolicy(
+    nn.Module,
+    PyTorchModelHubMixin,
+    library_name="lerobot",
+    repo_url="https://github.com/huggingface/lerobot",
+    tags=["robotics", "hilserl"],
+):
+    pass
--- a/lerobot/common/policies/sac/configuration_sac.py
+++ b/lerobot/common/policies/sac/configuration_sac.py
@@ -0,0 +1,83 @@
+#!/usr/bin/env python
+
+# Copyright 2024 The HuggingFace Inc. team.
+# All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from dataclasses import dataclass, field
+from typing import Any
+
+
+@dataclass
+class SACConfig:
+    input_shapes: dict[str, list[int]] = field(
+        default_factory=lambda: {
+            "observation.image": [3, 84, 84],
+            "observation.state": [4],
+        }
+    )
+    output_shapes: dict[str, list[int]] = field(
+        default_factory=lambda: {
+            "action": [2],
+        }
+    )
+    input_normalization_modes: dict[str, str] = field(
+        default_factory=lambda: {
+            "observation.image": "mean_std",
+            "observation.state": "min_max",
+            "observation.environment_state": "min_max",
+        }
+    )
+    output_normalization_modes: dict[str, str] = field(default_factory=lambda: {"action": "min_max"})
+    output_normalization_params: dict[str, dict[str, list[float]]] = field(
+        default_factory=lambda: {
+            "action": {"min": [-1, -1], "max": [1, 1]},
+        }
+    )
+    camera_number: int = 1
+    # Add type annotations for these fields:
+    image_encoder_hidden_dim: int = 32
+    shared_encoder: bool = False
+    discount: float = 0.99
+    temperature_init: float = 1.0
+    num_critics: int = 2
+    num_subsample_critics: int | None = None
+    critic_lr: float = 3e-4
+    actor_lr: float = 3e-4
+    temperature_lr: float = 3e-4
+    critic_target_update_weight: float = 0.005
+    utd_ratio: int = 1  # If you want enable utd_ratio, you need to set it to >1
+    state_encoder_hidden_dim: int = 256
+    latent_dim: int = 256
+    target_entropy: float | None = None
+    use_backup_entropy: bool = True
+    critic_network_kwargs: dict[str, Any] = field(
+        default_factory=lambda: {
+            "hidden_dims": [256, 256],
+            "activate_final": True,
+        }
+    )
+    actor_network_kwargs: dict[str, Any] = field(
+        default_factory=lambda: {
+            "hidden_dims": [256, 256],
+            "activate_final": True,
+        }
+    )
+    policy_kwargs: dict[str, Any] = field(
+        default_factory=lambda: {
+            "use_tanh_squash": True,
+            "log_std_min": -5,
+            "log_std_max": 2,
+        }
+    )
--- a/lerobot/common/policies/sac/modeling_sac.py
+++ b/lerobot/common/policies/sac/modeling_sac.py
@@ -0,0 +1,571 @@
+#!/usr/bin/env python
+
+# Copyright 2024 The HuggingFace Inc. team.
+# All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# TODO: (1) better device management
+
+from collections import deque
+from typing import Callable, Optional, Sequence, Tuple, Union
+
+import einops
+import numpy as np
+import torch
+import torch.nn as nn
+import torch.nn.functional as F  # noqa: N812
+from huggingface_hub import PyTorchModelHubMixin
+from torch import Tensor
+
+from lerobot.common.policies.normalize import Normalize, Unnormalize
+from lerobot.common.policies.sac.configuration_sac import SACConfig
+
+
+class SACPolicy(
+    nn.Module,
+    PyTorchModelHubMixin,
+    library_name="lerobot",
+    repo_url="https://github.com/huggingface/lerobot",
+    tags=["robotics", "RL", "SAC"],
+):
+    name = "sac"
+
+    def __init__(
+        self,
+        config: SACConfig | None = None,
+        dataset_stats: dict[str, dict[str, Tensor]] | None = None,
+        device: str = "cpu",
+    ):
+        super().__init__()
+
+        if config is None:
+            config = SACConfig()
+        self.config = config
+        if config.input_normalization_modes is not None:
+            self.normalize_inputs = Normalize(
+                config.input_shapes, config.input_normalization_modes, dataset_stats
+            )
+        else:
+            self.normalize_inputs = nn.Identity()
+
+        output_normalization_params = {}
+        for outer_key, inner_dict in config.output_normalization_params.items():
+            output_normalization_params[outer_key] = {}
+            for key, value in inner_dict.items():
+                output_normalization_params[outer_key][key] = torch.tensor(value)
+
+        # HACK: This is hacky and should be removed
+        dataset_stats = dataset_stats or output_normalization_params
+        self.normalize_targets = Normalize(
+            config.output_shapes, config.output_normalization_modes, dataset_stats
+        )
+        self.unnormalize_outputs = Unnormalize(
+            config.output_shapes, config.output_normalization_modes, dataset_stats
+        )
+
+        if config.shared_encoder:
+            encoder_critic = SACObservationEncoder(config)
+            encoder_actor: SACObservationEncoder = encoder_critic
+        else:
+            encoder_critic = SACObservationEncoder(config)
+            encoder_actor = SACObservationEncoder(config)
+        # Define networks
+        critic_nets = []
+        for _ in range(config.num_critics):
+            critic_net = Critic(
+                encoder=encoder_critic,
+                network=MLP(
+                    input_dim=encoder_critic.output_dim + config.output_shapes["action"][0],
+                    **config.critic_network_kwargs,
+                ),
+                device=device,
+            )
+            critic_nets.append(critic_net)
+
+        target_critic_nets = []
+        for _ in range(config.num_critics):
+            target_critic_net = Critic(
+                encoder=encoder_critic,
+                network=MLP(
+                    input_dim=encoder_critic.output_dim + config.output_shapes["action"][0],
+                    **config.critic_network_kwargs,
+                ),
+                device=device,
+            )
+            target_critic_nets.append(target_critic_net)
+
+        self.critic_ensemble = create_critic_ensemble(
+            critics=critic_nets, num_critics=config.num_critics, device=device
+        )
+        self.critic_target = create_critic_ensemble(
+            critics=target_critic_nets, num_critics=config.num_critics, device=device
+        )
+        self.critic_target.load_state_dict(self.critic_ensemble.state_dict())
+
+        self.actor = Policy(
+            encoder=encoder_actor,
+            network=MLP(input_dim=encoder_actor.output_dim, **config.actor_network_kwargs),
+            action_dim=config.output_shapes["action"][0],
+            device=device,
+            encoder_is_shared=config.shared_encoder,
+            **config.policy_kwargs,
+        )
+        if config.target_entropy is None:
+            config.target_entropy = -np.prod(config.output_shapes["action"][0]) / 2  # (-dim(A)/2)
+        # TODO: Handle the case where the temparameter is a fixed
+        self.log_alpha = torch.zeros(1, requires_grad=True, device=device)
+        self.temperature = self.log_alpha.exp().item()
+
+    def reset(self):
+        """Reset the policy"""
+        pass
+
+    @torch.no_grad()
+    def select_action(self, batch: dict[str, Tensor]) -> Tensor:
+        """Select action for inference/evaluation"""
+        actions, _, _ = self.actor(batch)
+        actions = self.unnormalize_outputs({"action": actions})["action"]
+        return actions
+    
+    def critic_forward(self, observations: dict[str, Tensor], actions: Tensor, use_target: bool = False) -> Tensor:
+        """Forward pass through a critic network ensemble
+        
+        Args:
+            observations: Dictionary of observations
+            actions: Action tensor
+            use_target: If True, use target critics, otherwise use ensemble critics
+        
+        Returns:
+            Tensor of Q-values from all critics
+        """
+        critics = self.critic_target if use_target else self.critic_ensemble
+        q_values = torch.stack([critic(observations, actions) for critic in critics])
+        return q_values
+
+
+    def critic_forward(
+        self, observations: dict[str, Tensor], actions: Tensor, use_target: bool = False
+    ) -> Tensor:
+        """Forward pass through a critic network ensemble
+
+        Args:
+            observations: Dictionary of observations
+            actions: Action tensor
+            use_target: If True, use target critics, otherwise use ensemble critics
+
+        Returns:
+            Tensor of Q-values from all critics
+        """
+        critics = self.critic_target if use_target else self.critic_ensemble
+        q_values = torch.stack([critic(observations, actions) for critic in critics])
+        return q_values
+
+    def forward(self, batch: dict[str, Tensor]) -> dict[str, Tensor | float]: ...
+    def update_target_networks(self):
+        """Update target networks with exponential moving average"""
+        for target_critic, critic in zip(self.critic_target, self.critic_ensemble, strict=False):
+            for target_param, param in zip(target_critic.parameters(), critic.parameters(), strict=False):
+                target_param.data.copy_(
+                    param.data * self.config.critic_target_update_weight
+                    + target_param.data * (1.0 - self.config.critic_target_update_weight)
+                )
+
+    def compute_loss_critic(self, observations, actions, rewards, next_observations, done) -> Tensor:
+        temperature = self.log_alpha.exp().item()
+        with torch.no_grad():
+            next_action_preds, next_log_probs, _ = self.actor(next_observations)
+
+            # 2- compute q targets
+            q_targets = self.critic_forward(
+                observations=next_observations, actions=next_action_preds, use_target=True
+            )
+
+            # subsample critics to prevent overfitting if use high UTD (update to date)
+            if self.config.num_subsample_critics is not None:
+                indices = torch.randperm(self.config.num_critics)
+                indices = indices[: self.config.num_subsample_critics]
+                q_targets = q_targets[indices]
+
+            # critics subsample size
+            min_q, _ = q_targets.min(dim=0)  # Get values from min operation
+            if self.config.use_backup_entropy:
+                min_q = min_q - (temperature * next_log_probs)
+
+            td_target = rewards + (1 - done) * self.config.discount * min_q
+
+        # 3- compute predicted qs
+        q_preds = self.critic_forward(observations, actions, use_target=False)
+
+        # 4- Calculate loss
+        # Compute state-action value loss (TD loss) for all of the Q functions in the ensemble.
+        td_target_duplicate = einops.repeat(td_target, "b -> e b", e=q_preds.shape[0])
+        # You compute the mean loss of the batch for each critic and then to compute the final loss you sum them up
+        critics_loss = (
+            F.mse_loss(
+                input=q_preds,
+                target=td_target_duplicate,
+                reduction="none",
+            ).mean(1)
+        ).sum()
+        return critics_loss
+
+    def compute_loss_temperature(self, observations) -> Tensor:
+        """Compute the temperature loss"""
+        # calculate temperature loss
+        with torch.no_grad():
+            _, log_probs, _ = self.actor(observations)
+        temperature_loss = (-self.log_alpha.exp() * (log_probs + self.config.target_entropy)).mean()
+        return temperature_loss
+
+    def compute_loss_actor(self, observations) -> Tensor:
+        temperature = self.log_alpha.exp().item()
+
+        actions_pi, log_probs, _ = self.actor(observations)
+
+        q_preds = self.critic_forward(observations, actions_pi, use_target=False)
+        min_q_preds = q_preds.min(dim=0)[0]
+
+        actor_loss = ((temperature * log_probs) - min_q_preds).mean()
+        return actor_loss
+
+
+class MLP(nn.Module):
+    def __init__(
+        self,
+        input_dim: int,
+        hidden_dims: list[int],
+        activations: Callable[[torch.Tensor], torch.Tensor] | str = nn.SiLU(),
+        activate_final: bool = False,
+        dropout_rate: Optional[float] = None,
+    ):
+        super().__init__()
+        self.activate_final = activate_final
+        layers = []
+
+        # First layer uses input_dim
+        layers.append(nn.Linear(input_dim, hidden_dims[0]))
+
+        # Add activation after first layer
+        if dropout_rate is not None and dropout_rate > 0:
+            layers.append(nn.Dropout(p=dropout_rate))
+        layers.append(nn.LayerNorm(hidden_dims[0]))
+        layers.append(activations if isinstance(activations, nn.Module) else getattr(nn, activations)())
+
+        # Rest of the layers
+        for i in range(1, len(hidden_dims)):
+            layers.append(nn.Linear(hidden_dims[i - 1], hidden_dims[i]))
+
+            if i + 1 < len(hidden_dims) or activate_final:
+                if dropout_rate is not None and dropout_rate > 0:
+                    layers.append(nn.Dropout(p=dropout_rate))
+                layers.append(nn.LayerNorm(hidden_dims[i]))
+                layers.append(
+                    activations if isinstance(activations, nn.Module) else getattr(nn, activations)()
+                )
+
+        self.net = nn.Sequential(*layers)
+
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        return self.net(x)
+    
+    
+class Critic(nn.Module):
+    def __init__(
+        self,
+        encoder: Optional[nn.Module],
+        network: nn.Module,
+        init_final: Optional[float] = None,
+        device: str = "cpu",
+    ):
+        super().__init__()
+        self.device = torch.device(device)
+        self.encoder = encoder
+        self.network = network
+        self.init_final = init_final
+        
+        # Find the last Linear layer's output dimension
+        for layer in reversed(network.net):
+            if isinstance(layer, nn.Linear):
+                out_features = layer.out_features
+                break
+        
+        # Output layer
+        if init_final is not None:
+            self.output_layer = nn.Linear(out_features, 1)
+            nn.init.uniform_(self.output_layer.weight, -init_final, init_final)
+            nn.init.uniform_(self.output_layer.bias, -init_final, init_final)
+        else:
+            self.output_layer = nn.Linear(out_features, 1)
+            orthogonal_init()(self.output_layer.weight)
+        
+        self.to(self.device)
+
+    def forward(
+        self,
+        observations: dict[str, torch.Tensor],
+        actions: torch.Tensor,
+    ) -> torch.Tensor:
+        # Move each tensor in observations to device
+        observations = {k: v.to(self.device) for k, v in observations.items()}
+        actions = actions.to(self.device)
+        
+        obs_enc = observations if self.encoder is None else self.encoder(observations)
+            
+        inputs = torch.cat([obs_enc, actions], dim=-1)
+        x = self.network(inputs)
+        value = self.output_layer(x)
+        return value.squeeze(-1)
+
+class Policy(nn.Module):
+    def __init__(
+        self,
+        encoder: Optional[nn.Module],
+        network: nn.Module,
+        action_dim: int,
+        log_std_min: float = -5,
+        log_std_max: float = 2,
+        fixed_std: Optional[torch.Tensor] = None,
+        init_final: Optional[float] = None,
+        use_tanh_squash: bool = False,
+        device: str = "cpu",
+        encoder_is_shared: bool = False,
+    ):
+        super().__init__()
+        self.device = torch.device(device)
+        self.encoder = encoder
+        self.network = network
+        self.action_dim = action_dim
+        self.log_std_min = log_std_min
+        self.log_std_max = log_std_max
+        self.fixed_std = fixed_std.to(self.device) if fixed_std is not None else None
+        self.use_tanh_squash = use_tanh_squash
+        self.parameters_to_optimize = []
+
+        self.parameters_to_optimize += list(self.network.parameters())
+
+        if self.encoder is not None and not encoder_is_shared:
+            self.parameters_to_optimize += list(self.encoder.parameters())
+        # Find the last Linear layer's output dimension
+        for layer in reversed(network.net):
+            if isinstance(layer, nn.Linear):
+                out_features = layer.out_features
+                break
+        # Mean layer
+        self.mean_layer = nn.Linear(out_features, action_dim)
+        if init_final is not None:
+            nn.init.uniform_(self.mean_layer.weight, -init_final, init_final)
+            nn.init.uniform_(self.mean_layer.bias, -init_final, init_final)
+        else:
+            orthogonal_init()(self.mean_layer.weight)
+
+        self.parameters_to_optimize += list(self.mean_layer.parameters())
+        # Standard deviation layer or parameter
+        if fixed_std is None:
+            self.std_layer = nn.Linear(out_features, action_dim)
+            if init_final is not None:
+                nn.init.uniform_(self.std_layer.weight, -init_final, init_final)
+                nn.init.uniform_(self.std_layer.bias, -init_final, init_final)
+            else:
+                orthogonal_init()(self.std_layer.weight)
+            self.parameters_to_optimize += list(self.std_layer.parameters())
+
+        self.to(self.device)
+
+    def forward(
+        self, 
+        observations: torch.Tensor,
+    ) -> Tuple[torch.Tensor, torch.Tensor]:
+                
+        # Encode observations if encoder exists
+        obs_enc = observations if self.encoder is None else self.encoder(observations)
+
+        # Get network outputs
+        outputs = self.network(obs_enc)
+        means = self.mean_layer(outputs)
+        
+        # Compute standard deviations
+        if self.fixed_std is None:
+            log_std = self.std_layer(outputs)
+            assert not torch.isnan(log_std).any(), "[ERROR] log_std became NaN after std_layer!"
+
+            if self.use_tanh_squash:
+                log_std = torch.tanh(log_std)
+                log_std = self.log_std_min + 0.5 * (self.log_std_max - self.log_std_min) * (log_std + 1.0)
+            else:
+                log_std = torch.clamp(log_std, self.log_std_min, self.log_std_max)
+        else:
+            log_std = self.fixed_std.expand_as(means)
+
+        # uses tanh activation function to squash the action to be in the range of [-1, 1]
+        normal = torch.distributions.Normal(means, torch.exp(log_std))
+        x_t = normal.rsample()  # Reparameterization trick (mean + std * N(0,1))
+        log_probs = normal.log_prob(x_t)  # Base log probability before Tanh
+
+        if self.use_tanh_squash:
+            actions = torch.tanh(x_t)
+            log_probs -= torch.log((1 - actions.pow(2)) + 1e-6)  # Adjust log-probs for Tanh
+        else:
+            actions = x_t  # No Tanh; raw Gaussian sample
+
+        log_probs = log_probs.sum(-1)  # Sum over action dimensions
+        means = torch.tanh(means) if self.use_tanh_squash else means
+        return actions, log_probs, means
+
+    def get_features(self, observations: torch.Tensor) -> torch.Tensor:
+        """Get encoded features from observations"""
+        observations = observations.to(self.device)
+        if self.encoder is not None:
+            with torch.inference_mode():
+                return self.encoder(observations)
+        return observations
+
+
+class SACObservationEncoder(nn.Module):
+    """Encode image and/or state vector observations.
+    TODO(ke-wang): The original work allows for (1) stacking multiple history frames and (2) using pretrained resnet encoders.
+    """
+
+    def __init__(self, config: SACConfig):
+        """
+        Creates encoders for pixel and/or state modalities.
+        """
+        super().__init__()
+        self.config = config
+        if "observation.image" in config.input_shapes:
+            self.image_enc_layers = nn.Sequential(
+                nn.Conv2d(
+                    in_channels=config.input_shapes["observation.image"][0],
+                    out_channels=config.image_encoder_hidden_dim,
+                    kernel_size=7,
+                    stride=2,
+                ),
+                nn.ReLU(),
+                nn.Conv2d(
+                    in_channels=config.image_encoder_hidden_dim,
+                    out_channels=config.image_encoder_hidden_dim,
+                    kernel_size=5,
+                    stride=2,
+                ),
+                nn.ReLU(),
+                nn.Conv2d(
+                    in_channels=config.image_encoder_hidden_dim,
+                    out_channels=config.image_encoder_hidden_dim,
+                    kernel_size=3,
+                    stride=2,
+                ),
+                nn.ReLU(),
+                nn.Conv2d(
+                    in_channels=config.image_encoder_hidden_dim,
+                    out_channels=config.image_encoder_hidden_dim,
+                    kernel_size=3,
+                    stride=2,
+                ),
+                nn.ReLU(),
+            )
+            self.camera_number = config.camera_number
+            self.aggregation_size: int = 0
+
+            dummy_batch = torch.zeros(1, *config.input_shapes["observation.image"])
+            with torch.inference_mode():
+                out_shape = self.image_enc_layers(dummy_batch).shape[1:]
+            self.image_enc_layers.extend(
+                sequential=nn.Sequential(
+                    nn.Flatten(),
+                    nn.Linear(
+                        in_features=np.prod(out_shape) * self.camera_number, out_features=config.latent_dim
+                    ),
+                    nn.LayerNorm(normalized_shape=config.latent_dim),
+                    nn.Tanh(),
+                )
+            )
+
+            self.aggregation_size += config.latent_dim * self.camera_number
+        if "observation.state" in config.input_shapes:
+            self.state_enc_layers = nn.Sequential(
+                nn.Linear(
+                    in_features=config.input_shapes["observation.state"][0], out_features=config.latent_dim
+                ),
+                nn.LayerNorm(normalized_shape=config.latent_dim),
+                nn.Tanh(),
+            )
+            self.aggregation_size += config.latent_dim
+
+        if "observation.environment_state" in config.input_shapes:
+            self.env_state_enc_layers = nn.Sequential(
+                nn.Linear(
+                    in_features=config.input_shapes["observation.environment_state"][0],
+                    out_features=config.latent_dim,
+                ),
+                nn.LayerNorm(normalized_shape=config.latent_dim),
+                nn.Tanh(),
+            )
+
+            self.aggregation_size += config.latent_dim
+        self.aggregation_layer = nn.Linear(in_features=self.aggregation_size, out_features=config.latent_dim)
+
+    def forward(self, obs_dict: dict[str, Tensor]) -> Tensor:
+        """Encode the image and/or state vector.
+
+        Each modality is encoded into a feature vector of size (latent_dim,) and then a uniform mean is taken
+        over all features.
+        """
+        feat = []
+        # Concatenate all images along the channel dimension.
+        image_keys = [k for k in self.config.input_shapes if k.startswith("observation.image")]
+        for image_key in image_keys:
+            feat.append(flatten_forward_unflatten(self.image_enc_layers, obs_dict[image_key]))
+        if "observation.environment_state" in self.config.input_shapes:
+            feat.append(self.env_state_enc_layers(obs_dict["observation.environment_state"]))
+        if "observation.state" in self.config.input_shapes:
+            feat.append(self.state_enc_layers(obs_dict["observation.state"]))
+        # TODO(ke-wang): currently average over all features, concatenate all features maybe a better way
+        # return torch.stack(feat, dim=0).mean(0)
+        features = torch.cat(tensors=feat, dim=-1)
+        features = self.aggregation_layer(features)
+
+        return features
+
+    @property
+    def output_dim(self) -> int:
+        """Returns the dimension of the encoder output"""
+        return self.config.latent_dim
+
+
+def orthogonal_init():
+    return lambda x: torch.nn.init.orthogonal_(x, gain=1.0)
+
+
+def create_critic_ensemble(critics: list[nn.Module], num_critics: int, device: str = "cpu") -> nn.ModuleList:
+    """Creates an ensemble of critic networks"""
+    assert len(critics) == num_critics, f"Expected {num_critics} critics, got {len(critics)}"
+    return nn.ModuleList(critics).to(device)
+
+# borrowed from tdmpc
+def flatten_forward_unflatten(fn: Callable[[Tensor], Tensor], image_tensor: Tensor) -> Tensor:
+    """Helper to temporarily flatten extra dims at the start of the image tensor.
+
+    Args:
+        fn: Callable that the image tensor will be passed to. It should accept (B, C, H, W) and return
+            (B, *), where * is any number of dimensions.
+        image_tensor: An image tensor of shape (**, C, H, W), where ** is any number of dimensions and 
+        can be more than 1 dimensions, generally different from *.
+    Returns:
+        A return value from the callable reshaped to (**, *).
+    """
+    if image_tensor.ndim == 4:
+        return fn(image_tensor)
+    start_dims = image_tensor.shape[:-3]
+    inp = torch.flatten(image_tensor, end_dim=-4)
+    flat_out = fn(inp)
+    return torch.reshape(flat_out, (*start_dims, *flat_out.shape[1:]))
--- a/lerobot/common/robot_devices/control_utils.py
+++ b/lerobot/common/robot_devices/control_utils.py
@@ -11,6 +11,7 @@ from copy import copy
 from functools import cache

 import cv2
+import numpy as np
 import torch
 import tqdm
 from deepdiff import DeepDiff
@@ -120,14 +121,22 @@ def predict_action(observation, policy, device, use_amp):
    return action


-def init_keyboard_listener():
-    # Allow to exit early while recording an episode or resetting the environment,
-    # by tapping the right arrow key '->'. This might require a sudo permission
-    # to allow your terminal to monitor keyboard events.
+def init_keyboard_listener(assign_rewards=False):
+    """
+    Initializes a keyboard listener to enable early termination of an episode
+    or environment reset by pressing the right arrow key ('->'). This may require
+    sudo permissions to allow the terminal to monitor keyboard events.
+
+    Args:
+        assign_rewards (bool): If True, allows annotating the collected trajectory
+        with a binary reward at the end of the episode to indicate success.
+    """
    events = {}
    events["exit_early"] = False
    events["rerecord_episode"] = False
    events["stop_recording"] = False
+    if assign_rewards:
+        events["next.reward"] = 0

    if is_headless():
        logging.warning(
@@ -152,6 +161,13 @@ def init_keyboard_listener():
                print("Escape key pressed. Stopping data recording...")
                events["stop_recording"] = True
                events["exit_early"] = True
+            elif assign_rewards and key == keyboard.Key.space:
+                events["next.reward"] = 1 if events["next.reward"] == 0 else 0
+                print(
+                    "Space key pressed. Assigning new reward to the subsequent frames. New reward:",
+                    events["next.reward"],
+                )
+
        except Exception as e:
            print(f"Error handling key press: {e}")

@@ -272,6 +288,8 @@ def control_loop(

        if dataset is not None:
            frame = {**observation, **action}
+            if "next.reward" in events:
+                frame["next.reward"] = events["next.reward"]
            dataset.add_frame(frame)

        if display_cameras and not is_headless():
@@ -301,6 +319,8 @@ def reset_environment(robot, events, reset_time_s):

    timestamp = 0
    start_vencod_t = time.perf_counter()
+    if "next.reward" in events:
+        events["next.reward"] = 0

    # Wait if necessary
    with tqdm.tqdm(total=reset_time_s, desc="Waiting") as pbar:
@@ -313,6 +333,14 @@ def reset_environment(robot, events, reset_time_s):
                break


+def reset_follower_position(robot: Robot, target_position):
+    current_position = robot.follower_arms["main"].read("Present_Position")
+    trajectory = torch.from_numpy(np.linspace(current_position, target_position, 30)) # NOTE: 30 is just an aribtrary number 
+    for pose in trajectory:
+        robot.send_action(pose)
+        busy_wait(0.015)
+
+
 def stop_recording(robot, listener, display_cameras):
    robot.disconnect()

@@ -343,12 +371,16 @@ def sanity_check_dataset_name(repo_id, policy):


 def sanity_check_dataset_robot_compatibility(
-    dataset: LeRobotDataset, robot: Robot, fps: int, use_videos: bool
+    dataset: LeRobotDataset, robot: Robot, fps: int, use_videos: bool, extra_features: dict = None
 ) -> None:
+    features_from_robot = get_features_from_robot(robot, use_videos)
+    if extra_features is not None:
+        features_from_robot.update(extra_features)
+
    fields = [
        ("robot_type", dataset.meta.robot_type, robot.robot_type),
        ("fps", dataset.fps, fps),
-        ("features", dataset.features, get_features_from_robot(robot, use_videos)),
+        ("features", dataset.features, features_from_robot),
    ]

    mismatches = []
--- a/lerobot/configs/policy/hilserl_classifier.yaml
+++ b/lerobot/configs/policy/hilserl_classifier.yaml
@@ -0,0 +1,48 @@
+# @package _global_
+
+defaults:
+  - _self_
+
+seed: 13
+dataset_repo_id: aractingi/pick_place_lego_cube_1
+train_split_proportion: 0.8
+
+# Required by logger
+env:
+  name: "classifier"
+  task: "binary_classification"
+
+
+training:
+  num_epochs: 5
+  batch_size: 16
+  learning_rate: 1e-4
+  num_workers: 4
+  grad_clip_norm: 10
+  use_amp: true
+  log_freq: 1
+  eval_freq: 1  # How often to run validation (in epochs)
+  save_freq: 1  # How often to save checkpoints (in epochs)
+  save_checkpoint: true
+  image_keys: ["observation.images.top", "observation.images.wrist"]
+  label_key: "next.reward"
+
+eval:
+  batch_size: 16
+  num_samples_to_log: 30  # Number of validation samples to log in the table
+
+policy:
+  name: "hilserl/classifier/pick_place_lego_cube_1"
+  model_name: "facebook/convnext-base-224"
+  model_type: "cnn"
+  num_cameras: 2 # Has to be len(training.image_keys)
+
+wandb:
+  enable: false
+  project: "classifier-training"
+  job_name: "classifier_training_0"
+  disable_artifact: false
+
+device: "mps"
+resume: false
+output_dir: "outputs/classifier"
--- a/lerobot/configs/policy/sac_manyskill.yaml
+++ b/lerobot/configs/policy/sac_manyskill.yaml
@@ -0,0 +1,97 @@
+# @package _global_
+
+# Train with:
+#
+# python lerobot/scripts/train.py \
+#   +dataset=lerobot/pusht_keypoints
+#   env=pusht \
+#   env.gym.obs_type=environment_state_agent_pos \
+
+seed: 1
+dataset_repo_id: null 
+
+
+training:
+  # Offline training dataloader
+  num_workers: 4
+
+  # batch_size: 256
+  batch_size: 512
+  grad_clip_norm: 10.0
+  lr: 3e-4
+
+  eval_freq: 2500
+  log_freq: 500
+  save_freq: 50000
+
+  online_steps: 1000000
+  online_rollout_n_episodes: 10
+  online_rollout_batch_size: 10
+  online_steps_between_rollouts: 1000
+  online_sampling_ratio: 1.0
+  online_env_seed: 10000
+  online_buffer_capacity: 1000000
+  online_buffer_seed_size: 0
+  online_step_before_learning: 5000
+  do_online_rollout_async: false
+  policy_update_freq: 1
+
+  # delta_timestamps:
+  #   observation.environment_state: "[i / ${fps} for i in range(${policy.horizon} + 1)]"
+  #   observation.state: "[i / ${fps} for i in range(${policy.horizon} + 1)]"
+  #   action: "[i / ${fps} for i in range(${policy.horizon})]"
+  #   next.reward: "[i / ${fps} for i in range(${policy.horizon})]"
+
+policy:
+  name: sac
+
+  pretrained_model_path:
+
+  # Input / output structure.
+  n_action_repeats: 1
+  horizon: 1
+  n_action_steps: 1
+
+  shared_encoder: true
+  input_shapes:
+    # # TODO(rcadene, alexander-soare): add variables for height and width from the dataset/env?
+    observation.state: ["${env.state_dim}"]
+    observation.image: [3, 64, 64]
+  output_shapes:
+    action: ["${env.action_dim}"]
+
+  # Normalization / Unnormalization
+  input_normalization_modes: null
+  output_normalization_modes:
+    action: min_max
+  output_normalization_params:
+    action:
+      min: [-1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0]
+      max: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
+
+  # Architecture / modeling.
+  # Neural networks.
+  image_encoder_hidden_dim: 32
+  # discount: 0.99
+  discount: 0.80
+  temperature_init: 1.0
+  num_critics: 2
+  num_subsample_critics: null
+  critic_lr: 3e-4
+  actor_lr: 3e-4
+  temperature_lr: 3e-4
+  # critic_target_update_weight: 0.005
+  critic_target_update_weight: 0.01
+  utd_ratio: 1
+
+
+  # # Loss coefficients.
+  # reward_coeff: 0.5
+  # expectile_weight: 0.9
+  # value_coeff: 0.1
+  # consistency_coeff: 20.0
+  # advantage_scaling: 3.0
+  # pi_coeff: 0.5
+  # temporal_decay_coeff: 0.5
+  # # Target model.
+  # target_model_momentum: 0.995
--- a/lerobot/configs/policy/sac_pusht_keypoints.yaml
+++ b/lerobot/configs/policy/sac_pusht_keypoints.yaml
@@ -0,0 +1,89 @@
+# @package _global_
+
+# Train with:
+#
+# python lerobot/scripts/train.py \
+#   env=pusht \
+#   +dataset=lerobot/pusht_keypoints
+
+seed: 1
+dataset_repo_id: lerobot/pusht_keypoints
+
+training:
+  offline_steps: 0
+
+  # Offline training dataloader
+  num_workers: 4
+
+  batch_size: 128
+  grad_clip_norm: 10.0
+  lr: 3e-4
+
+  eval_freq: 50000
+  log_freq: 500
+  save_freq: 50000
+
+  online_steps: 1000000
+  online_rollout_n_episodes: 10
+  online_rollout_batch_size: 10
+  online_steps_between_rollouts: 1000
+  online_sampling_ratio: 1.0
+  online_env_seed: 10000
+  online_buffer_capacity: 40000
+  online_buffer_seed_size: 0
+  do_online_rollout_async: false
+
+  delta_timestamps:
+    observation.environment_state: "[i / ${fps} for i in range(${policy.horizon} + 1)]"
+    observation.state: "[i / ${fps} for i in range(${policy.horizon} + 1)]"
+    action: "[i / ${fps} for i in range(${policy.horizon})]"
+    next.reward: "[i / ${fps} for i in range(${policy.horizon})]"
+
+policy:
+  name: sac
+
+  pretrained_model_path:
+
+  # Input / output structure.
+  n_action_repeats: 1
+  horizon: 5
+  n_action_steps: 5
+
+  input_shapes:
+    # TODO(rcadene, alexander-soare): add variables for height and width from the dataset/env?
+    observation.environment_state: [16]
+    observation.state: ["${env.state_dim}"]
+  output_shapes:
+    action: ["${env.action_dim}"]
+
+  # Normalization / Unnormalization
+  input_normalization_modes:
+    observation.environment_state: min_max
+    observation.state: min_max
+  output_normalization_modes:
+    action: min_max
+
+  # Architecture / modeling.
+  # Neural networks.
+  # image_encoder_hidden_dim: 32
+  discount: 0.99
+  temperature_init: 1.0
+  num_critics: 2
+  num_subsample_critics: None
+  critic_lr: 3e-4
+  actor_lr: 3e-4
+  temperature_lr: 3e-4
+  critic_target_update_weight: 0.005
+  utd_ratio: 2
+
+
+  # # Loss coefficients.
+  # reward_coeff: 0.5
+  # expectile_weight: 0.9
+  # value_coeff: 0.1
+  # consistency_coeff: 20.0
+  # advantage_scaling: 3.0
+  # pi_coeff: 0.5
+  # temporal_decay_coeff: 0.5
+  # # Target model.
+  # target_model_momentum: 0.995
--- a/lerobot/configs/robot/koch.yaml
+++ b/lerobot/configs/robot/koch.yaml
@@ -10,7 +10,7 @@ max_relative_target: null
 leader_arms:
  main:
    _target_: lerobot.common.robot_devices.motors.dynamixel.DynamixelMotorsBus
-    port: /dev/tty.usbmodem575E0031751
+    port: /dev/tty.usbmodem58760430441
    motors:
      # name: (index, model)
      shoulder_pan: [1, "xl330-m077"]
@@ -23,7 +23,7 @@ leader_arms:
 follower_arms:
  main:
    _target_: lerobot.common.robot_devices.motors.dynamixel.DynamixelMotorsBus
-    port: /dev/tty.usbmodem575E0032081
+    port: /dev/tty.usbmodem585A0083391
    motors:
      # name: (index, model)
      shoulder_pan: [1, "xl430-w250"]
--- a/lerobot/configs/robot/so100.yaml
+++ b/lerobot/configs/robot/so100.yaml
@@ -18,7 +18,7 @@ max_relative_target: null
 leader_arms:
  main:
    _target_: lerobot.common.robot_devices.motors.feetech.FeetechMotorsBus
-    port: /dev/tty.usbmodem585A0077581
+    port: /dev/tty.usbmodem58760433331
    motors:
      # name: (index, model)
      shoulder_pan: [1, "sts3215"]
--- a/lerobot/scripts/control_robot.py
+++ b/lerobot/scripts/control_robot.py
@@ -109,6 +109,7 @@ from lerobot.common.robot_devices.control_utils import (
    log_control_info,
    record_episode,
    reset_environment,
+    reset_follower_position,
    sanity_check_dataset_name,
    sanity_check_dataset_robot_compatibility,
    stop_recording,
@@ -191,6 +192,7 @@ def record(
    single_task: str,
    pretrained_policy_name_or_path: str | None = None,
    policy_overrides: List[str] | None = None,
+    assign_rewards: bool = False,
    fps: int | None = None,
    warmup_time_s: int | float = 2,
    episode_time_s: int | float = 10,
@@ -204,6 +206,7 @@ def record(
    num_image_writer_threads_per_camera: int = 4,
    display_cameras: bool = True,
    play_sounds: bool = True,
+    reset_follower: bool = False, 
    resume: bool = False,
    # TODO(rcadene, aliberts): remove local_files_only when refactor with dataset as argument
    local_files_only: bool = False,
@@ -214,6 +217,9 @@ def record(
    policy = None
    device = None
    use_amp = None
+    extra_features = (
+        {"next.reward": {"dtype": "int64", "shape": (1,), "names": None}} if assign_rewards else None
+    )

    if single_task:
        task = single_task
@@ -242,7 +248,7 @@ def record(
            num_processes=num_image_writer_processes,
            num_threads=num_image_writer_threads_per_camera * len(robot.cameras),
        )
-        sanity_check_dataset_robot_compatibility(dataset, robot, fps, video)
+        sanity_check_dataset_robot_compatibility(dataset, robot, fps, video, extra_features)
    else:
        # Create empty dataset or load existing saved episodes
        sanity_check_dataset_name(repo_id, policy)
@@ -254,13 +260,16 @@ def record(
            use_videos=video,
            image_writer_processes=num_image_writer_processes,
            image_writer_threads=num_image_writer_threads_per_camera * len(robot.cameras),
+            features=extra_features,
        )

    if not robot.is_connected:
        robot.connect()
+    listener, events = init_keyboard_listener(assign_rewards=assign_rewards)

-    listener, events = init_keyboard_listener()
-
+    if reset_follower:
+        initial_position = robot.follower_arms["main"].read("Present_Position")
+        
    # Execute a few seconds without recording to:
    # 1. teleoperate the robot to move it in starting position if no policy provided,
    # 2. give times to the robot devices to connect and start synchronizing,
@@ -303,6 +312,8 @@ def record(
            (recorded_episodes < num_episodes - 1) or events["rerecord_episode"]
        ):
            log_say("Reset the environment", play_sounds)
+            if reset_follower:
+                reset_follower_position(robot, initial_position)
            reset_environment(robot, events, reset_time_s)

        if events["rerecord_episode"]:
@@ -469,12 +480,12 @@ if __name__ == "__main__":
        default=1,
        help="Upload dataset to Hugging Face hub.",
    )
-    parser_record.add_argument(
-        "--tags",
-        type=str,
-        nargs="*",
-        help="Add tags to your dataset on the hub.",
-    )
+    # parser_record.add_argument(
+    #     "--tags",
+    #     type=str,
+    #     nargs="*",
+    #     help="Add tags to your dataset on the hub.",
+    # )
    parser_record.add_argument(
        "--num-image-writer-processes",
        type=int,
@@ -517,6 +528,18 @@ if __name__ == "__main__":
        nargs="*",
        help="Any key=value arguments to override config values (use dots for.nested=overrides)",
    )
+    parser_record.add_argument(
+        "--assign-rewards",
+        type=int,
+        default=0,
+        help="Enables the assignation of rewards to frames (by default no assignation). When enabled, assign a 0 reward to frames until the space bar is pressed which assign a 1 reward. Press the space bar a second time to assign a 0 reward. The reward assigned is reset to 0 when the episode ends.",
+    )
+    parser_record.add_argument(
+        "--reset-follower",
+        type=int,
+        default=0,
+        help="Resets the follower to the initial position during while reseting the evironment, this is to avoid having the follower start at an awkward position in the next episode",
+    )

    parser_replay = subparsers.add_parser("replay", parents=[base_parser])
    parser_replay.add_argument(
--- a/lerobot/scripts/control_sim_robot.py
+++ b/lerobot/scripts/control_sim_robot.py
@@ -183,8 +183,14 @@ def record(
    resume: bool = False,
    local_files_only: bool = False,
    run_compute_stats: bool = True,
+    assign_rewards: bool = False,
 ) -> LeRobotDataset:
    # Load pretrained policy
+
+    extra_features = (
+        {"next.reward": {"dtype": "int64", "shape": (1,), "names": None}} if assign_rewards else None
+    )
+
    policy = None
    if pretrained_policy_name_or_path is not None:
        policy, policy_fps, device, use_amp = init_policy(pretrained_policy_name_or_path, policy_overrides)
@@ -197,7 +203,7 @@ def record(
        raise ValueError("Either policy or process_action_fn has to be set to enable control in sim.")

    # initialize listener before sim env
-    listener, events = init_keyboard_listener()
+    listener, events = init_keyboard_listener(assign_rewards=assign_rewards)

    # create sim env
    env = env()
@@ -237,6 +243,7 @@ def record(
            }

        features["action"] = {"dtype": "float32", "shape": env.action_space.shape, "names": None}
+        features = {**features, **extra_features}

        # Create empty dataset or load existing saved episodes
        sanity_check_dataset_name(repo_id, policy)
@@ -288,6 +295,13 @@ def record(
                "timestamp": env_timestamp,
            }

+            # Overwrite environment reward with manually assigned reward
+            if assign_rewards:
+                frame["next.reward"] = events["next.reward"]
+
+                # Should success always be false to match what we do in control_utils?
+                frame["next.success"] = False
+
            for key in image_keys:
                if not key.startswith("observation.image"):
                    frame["observation.image." + key] = observation[key]
@@ -472,6 +486,13 @@ if __name__ == "__main__":
        default=0,
        help="Resume recording on an existing dataset.",
    )
+    parser_record.add_argument(
+        "--assign-rewards",
+        type=int,
+        default=0,
+        help="Enables the assignation of rewards to frames (by default no assignation). When enabled, assign a 0 reward to frames until the space bar is pressed which assign a 1 reward. Press the space bar a second time to assign a 0 reward. The reward assigned is reset to 0 when the episode ends.",
+    )
+
    parser_replay = subparsers.add_parser("replay", parents=[base_parser])
    parser_replay.add_argument(
        "--fps", type=none_or_int, default=None, help="Frames per second (set to None to disable)"
--- a/lerobot/scripts/eval_on_robot.py
+++ b/lerobot/scripts/eval_on_robot.py
@@ -0,0 +1,433 @@
+#!/usr/bin/env python
+
+# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Evaluate a policy by running rollouts on the real robot and computing metrics.
+
+This script supports performing human interventions during rollouts. 
+Human interventions allow the user to take control of the robot from the policy 
+and correct its behavior. It is specifically designed for reinforcement learning 
+experiments and HIL-SERL (human-in-the-loop reinforcement learning) methods.
+
+### How to Use
+
+To rollout a policy on the robot:
+```
+python lerobot/scripts/eval_on_robot.py \
+    --robot-path lerobot/configs/robot/so100.yaml \
+    --pretrained-policy-path-or-name path/to/pretrained_model \
+    --policy-config path/to/policy/config.yaml \
+    --display-cameras 1
+```
+
+If you trained a reward classifier on your task, you can also evaluate it using this script.
+You can annotate the collection with a pre-trained reward classifier by running:
+
+```
+python lerobot/scripts/eval_on_robot.py \
+    --robot-path lerobot/configs/robot/so100.yaml \
+    --pretrained-policy-path-or-name path/to/pretrained_model \
+    --policy-config path/to/policy/config.yaml \
+    --reward-classifier-pretrained-path outputs/classifier/checkpoints/best/pretrained_model \
+    --reward-classifier-config-file lerobot/configs/policy/hilserl_classifier.yaml \
+    --display-cameras 1
+```
+"""
+
+import argparse
+import logging
+import time
+
+import cv2
+import numpy as np
+import torch
+from tqdm import trange
+
+from lerobot.common.policies.policy_protocol import Policy
+from lerobot.common.policies.utils import get_device_from_parameters
+from lerobot.common.robot_devices.control_utils import busy_wait, is_headless, reset_follower_position, predict_action
+from lerobot.common.robot_devices.robots.factory import Robot, make_robot
+from lerobot.common.utils.utils import (
+    init_hydra_config,
+    init_logging,
+    log_say,
+)
+
+
+def get_classifier(pretrained_path, config_path):
+    if pretrained_path is None or config_path is None:
+        return
+
+    from lerobot.common.policies.factory import _policy_cfg_from_hydra_cfg
+    from lerobot.common.policies.hilserl.classifier.configuration_classifier import ClassifierConfig
+    from lerobot.common.policies.hilserl.classifier.modeling_classifier import Classifier
+
+    cfg = init_hydra_config(config_path)
+
+    classifier_config = _policy_cfg_from_hydra_cfg(ClassifierConfig, cfg)
+    classifier_config.num_cameras = len(cfg.training.image_keys)  # TODO automate these paths
+    model = Classifier(classifier_config)
+    model.load_state_dict(Classifier.from_pretrained(pretrained_path).state_dict())
+    return model
+
+
+def rollout(
+    robot: Robot,
+    policy: Policy,
+    reward_classifier,
+    fps: int,
+    control_time_s: float = 20,
+    use_amp: bool = True,
+    display_cameras: bool = False,
+    device: str = "cpu"
+) -> dict:
+    """Run a batched policy rollout on the real robot.
+
+    This function executes a rollout using the provided policy and robot interface, 
+    simulating batched interactions for a fixed control duration. 
+
+    The returned dictionary contains rollout statistics, which can be used for analysis and debugging.
+
+    Args:
+        "robot": The robot interface for interacting with the real robot hardware.
+        "policy": The policy to execute. Must be a PyTorch `nn.Module` object.
+        "reward_classifier": A module to classify rewards during the rollout.
+        "fps": The control frequency at which the policy is executed.
+        "control_time_s": The total control duration of the rollout in seconds.
+        "use_amp": Whether to use automatic mixed precision (AMP) for policy evaluation.
+        "display_cameras": If True, displays camera streams during the rollout.
+        "device": The device to use for computations (e.g., "cpu", "cuda" or "mps").
+
+    Returns:
+        Dictionary of the statisitcs collected during rollouts.
+    """
+    # define keyboard listener
+    listener, events = init_keyboard_listener()
+
+    # Reset the policy. TODO (michel-aractingi) add real policy evaluation once the code is ready.
+    if policy is not None:
+        policy.reset()  
+
+    # NOTE: sorting to make sure the key sequence is the same during training and testing.
+    observation = robot.capture_observation()
+    image_keys = [key for key in observation if "image" in key]
+    image_keys.sort() # CG{T}
+
+    
+    all_actions = []
+    all_rewards = []
+    
+    indices_from_policy = []
+
+    start_episode_t = time.perf_counter()
+    init_pos = robot.follower_arms["main"].read("Present_Position")
+    timestamp = 0.0
+    while timestamp < control_time_s:
+        start_loop_t = time.perf_counter()
+
+        # Apply the next action.
+        while events["pause_policy"] and not events["human_intervention_step"]:
+            busy_wait(0.5)
+
+        if events["human_intervention_step"]:
+            # take over the robot's actions
+            observation, action = robot.teleop_step(record_data=True)
+            action = action["action"]  # teleop step returns torch tensors but in a dict
+        else:
+            # explore with policy
+            with torch.inference_mode():
+                # TODO (michel-aractingi) in placy temporarly for testing purposes
+                if policy is None:
+                    action = robot.follower_arms["main"].read("Present_Position")
+                    action = torch.from_numpy(action)
+                    indices_from_policy.append(False)
+                else:
+                    action = predict_action(observation, policy, device, use_amp)
+                    indices_from_policy.append(True)
+                
+                robot.send_action(action)
+                observation = robot.capture_observation()
+                
+
+        images = []
+        for key in image_keys:
+            if display_cameras:
+                cv2.imshow(key, cv2.cvtColor(observation[key].numpy(), cv2.COLOR_RGB2BGR))
+                cv2.waitKey(1)
+            images.append(observation[key].to(device))
+
+        reward = reward_classifier.predict_reward(images) if reward_classifier is not None else 0.0
+
+        # TODO send data through the server as soon as you have it       
+
+        all_rewards.append(reward)
+        all_actions.append(action)
+
+        dt_s = time.perf_counter() - start_loop_t
+        busy_wait(1 / fps - dt_s)
+        timestamp = time.perf_counter() - start_episode_t
+        if events["exit_early"]:
+            events["exit_early"] = False
+            events["human_intervention_step"] = False
+            events["pause_policy"] = False
+            break
+
+    reset_follower_position(robot, target_position=init_pos)
+
+    dones = torch.tensor([False] * len(all_actions))
+    dones[-1] = True
+    # Stack the sequence along the first dimension so that we have (batch, sequence, *) tensors.
+    ret = {
+        "action": torch.stack(all_actions, dim=1),
+        "next.reward": torch.stack(all_rewards, dim=1),
+        "done": dones,
+    }
+
+    listener.stop()
+
+    return ret
+
+
+def eval_policy(
+    robot: Robot,
+    policy: torch.nn.Module,
+    fps: float,
+    n_episodes: int,
+    control_time_s: int = 20,
+    use_amp: bool = True,
+    display_cameras: bool = False,
+    reward_classifier_pretrained_path: str | None = None,
+    reward_classifier_config_file: str | None = None,
+    device: str | None = None,
+) -> dict:
+    """
+    Evaluate a policy on a real robot by running multiple episodes and collecting metrics.
+
+    This function executes rollouts of the specified policy on the robot, computes metrics 
+    for the rollouts, and optionally evaluates a reward classifier if provided.
+
+    Args:
+        "robot": The robot interface used to interact with the real robot hardware.
+        "policy": The policy to be evaluated. Must be a PyTorch neural network module.
+        "fps": Frames per second (control frequency) for running the policy.
+        "n_episodes": The number of episodes to evaluate the policy.
+        "control_time_s": The max duration for each episode in seconds.
+        "use_amp": Whether to use automatic mixed precision (AMP) for policy evaluation.
+        "display_cameras": Whether to display camera streams during rollouts.
+        "reward_classifier_pretrained_path": Path to the pretrained reward classifier. 
+            If provided, the reward classifier will be evaluated during rollouts.
+        "reward_classifier_config_file": Path to the configuration file for the reward classifier. 
+            Required if `reward_classifier_pretrained_path` is provided.
+        "device": The device for computations (e.g., "cpu", "cuda" or "mps"). 
+
+    Returns:
+        "dict": A dictionary containing the following rollout metrics and data:
+            - "metrics": Evaluation metrics such as cumulative rewards, success rates, etc.
+            - "rollout_data": Detailed data from the rollouts, including observations, actions, rewards, and done flags.
+    """
+    # TODO (michel-aractingi) comment this out for testing with a fixed policy
+    # assert isinstance(policy, Policy)
+    # policy.eval()
+
+    sum_rewards = []
+    max_rewards = []
+    rollouts = []
+
+    start_eval = time.perf_counter()
+    progbar = trange(n_episodes, desc="Evaluating policy on real robot")
+    reward_classifier = get_classifier(reward_classifier_pretrained_path, reward_classifier_config_file).to(device)
+
+    device = get_device_from_parameters(policy) if device is None else device
+
+    for _ in progbar:
+        rollout_data = rollout(
+            robot, policy, reward_classifier, fps, control_time_s, use_amp, display_cameras, device
+        )
+
+        rollouts.append(rollout_data)
+        sum_rewards.append(sum(rollout_data["next.reward"]))
+        max_rewards.append(max(rollout_data["next.reward"]))
+
+    info = {
+        "per_episode": [
+            {
+                "episode_ix": i,
+                "sum_reward": sum_reward,
+                "max_reward": max_reward,
+            }
+            for i, (sum_reward, max_reward) in enumerate(
+                zip(
+                    sum_rewards[:n_episodes],
+                    max_rewards[:n_episodes],
+                    strict=False,
+                )
+            )
+        ],
+        "aggregated": {
+            "avg_sum_reward":    float(np.nanmean(torch.cat(sum_rewards[:n_episodes]))),
+            "avg_max_reward": float(np.nanmean(torch.cat(max_rewards[:n_episodes]))),
+            "eval_s": time.time() - start_eval,
+            "eval_ep_s": (time.time() - start_eval) / n_episodes,
+        },
+    }
+
+    if robot.is_connected:
+        robot.disconnect()
+
+    return info
+
+
+def init_keyboard_listener():
+    """
+    Initialize a keyboard listener for controlling the recording and human intervention process.
+
+    Keyboard controls: (Note that this might require sudo permissions to monitor keyboard events)
+        - Right Arrow Key ('->'): Stops the current recording and exits early, useful for ending an episode 
+          and moving the next episode recording. 
+        - Left Arrow Key ('<-'): Re-records the current episode, allowing the user to start over.
+        - Space Bar: Controls the human intervention process in three steps:
+            1. First press pauses the policy and prompts the user to position the leader similar to the follower.
+            2. Second press initiates human interventions, allowing teleop control of the robot.
+            3. Third press resumes the policy rollout.
+    """
+    events = {}
+    events["exit_early"] = False
+    events["rerecord_episode"] = False
+    events["pause_policy"] = False
+    events["human_intervention_step"] = False
+
+    if is_headless():
+        logging.warning(
+            "Headless environment detected. On-screen cameras display and keyboard inputs will not be available."
+        )
+        listener = None
+        return listener, events
+
+    # Only import pynput if not in a headless environment
+    from pynput import keyboard
+
+    def on_press(key):
+        try:
+            if key == keyboard.Key.right:
+                print("Right arrow key pressed. Exiting loop...")
+                events["exit_early"] = True
+            elif key == keyboard.Key.left:
+                print("Left arrow key pressed. Exiting loop and rerecord the last episode...")
+                events["rerecord_episode"] = True
+                events["exit_early"] = True
+            elif key == keyboard.Key.space:
+                # check if first space press then pause the policy for the user to get ready
+                # if second space press then the user is ready to start intervention
+                if not events["pause_policy"]:
+                    print(
+                        "Space key pressed. Human intervention required.\n"
+                        "Place the leader in similar pose to the follower and press space again."
+                    )
+                    events["pause_policy"] = True
+                    log_say("Human intervention stage. Get ready to take over.", play_sounds=True)
+                elif events["pause_policy"] and not events["human_intervention_step"]:
+                    events["human_intervention_step"] = True
+                    print("Space key pressed. Human intervention starting.")
+                    log_say("Starting human intervention.", play_sounds=True)
+                elif events["human_intervention_step"]:
+                    events["human_intervention_step"] = False
+                    events["pause_policy"] = False
+                    print("Space key pressed. Human intervention ending, policy resumes control.")
+                    log_say("Policy resuming.", play_sounds=True)
+
+        except Exception as e:
+            print(f"Error handling key press: {e}")
+
+    listener = keyboard.Listener(on_press=on_press)
+    listener.start()
+
+    return listener, events
+
+
+if __name__ == "__main__":
+    init_logging()
+
+    parser = argparse.ArgumentParser(
+        description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter
+    )
+    group = parser.add_mutually_exclusive_group(required=True)
+    group.add_argument(
+        "--robot-path",
+        type=str,
+        default="lerobot/configs/robot/koch.yaml",
+        help="Path to robot yaml file used to instantiate the robot using `make_robot` factory function.",
+    )
+    group.add_argument(
+        "--robot-overrides",
+        type=str,
+        nargs="*",
+        help="Any key=value arguments to override config values (use dots for.nested=overrides)",
+    )
+    group.add_argument(
+        "-p",
+        "--pretrained-policy-name-or-path",
+        help=(
+            "Either the repo ID of a model hosted on the Hub or a path to a directory containing weights "
+            "saved using `Policy.save_pretrained`. If not provided, the policy is initialized from scratch "
+            "(useful for debugging). This argument is mutually exclusive with `--config`."
+        ),
+    )
+    group.add_argument(
+        "--config",
+        help=(
+            "Path to a yaml config you want to use for initializing a policy from scratch (useful for "
+            "debugging). This argument is mutually exclusive with `--pretrained-policy-name-or-path` (`-p`)."
+        ),
+    )
+    parser.add_argument("--revision", help="Optionally provide the Hugging Face Hub revision ID.")
+    parser.add_argument(
+        "--out-dir",
+        help=(
+            "Where to save the evaluation outputs. If not provided, outputs are saved in "
+            "outputs/eval/{timestamp}_{env_name}_{policy_name}"
+        ),
+    )
+    parser.add_argument(
+        "--display-cameras", help=("Whether to display the camera feed while the rollout is happening")
+    )
+    parser.add_argument(
+        "--reward-classifier-pretrained-path",
+        type=str,
+        default=None,
+        help="Path to the pretrained classifier weights.",
+    )
+    parser.add_argument(
+        "--reward-classifier-config-file",
+        type=str,
+        default=None,
+        help="Path to a yaml config file that is necessary to build the reward classifier model.",
+    )
+
+    args = parser.parse_args()
+
+    robot_cfg = init_hydra_config(args.robot_path, args.robot_overrides)
+    robot = make_robot(robot_cfg)
+    if not robot.is_connected:
+        robot.connect()
+
+    eval_policy(
+        robot,
+        None,
+        fps=40,
+        n_episodes=2,
+        control_time_s=100,
+        display_cameras=args.display_cameras,
+        reward_classifier_config_file=args.reward_classifier_config_file,
+        reward_classifier_pretrained_path=args.reward_classifier_pretrained_path,
+    )
--- a/lerobot/scripts/train.py
+++ b/lerobot/scripts/train.py
@@ -93,6 +93,17 @@ def make_optimizer_and_scheduler(cfg, policy):
    elif policy.name == "tdmpc":
        optimizer = torch.optim.Adam(policy.parameters(), cfg.training.lr)
        lr_scheduler = None
+
+    elif policy.name == "sac":
+        optimizer = torch.optim.Adam(
+            [
+                {"params": policy.actor.parameters(), "lr": policy.config.actor_lr},
+                {"params": policy.critic_ensemble.parameters(), "lr": policy.config.critic_lr},
+                {"params": policy.temperature.parameters(), "lr": policy.config.temperature_lr},
+            ]
+        )
+        lr_scheduler = None
+
    elif cfg.policy.name == "vqbet":
        from lerobot.common.policies.vqbet.modeling_vqbet import VQBeTOptimizer, VQBeTScheduler

@@ -311,6 +322,11 @@ def train(cfg: DictConfig, out_dir: str | None = None, job_name: str | None = No

    logging.info("make_dataset")
    offline_dataset = make_dataset(cfg)
+    # TODO (michel-aractingi): temporary fix to avoid datasets with task_index key that doesn't exist in online environment
+    # i.e., pusht
+    if "task_index" in offline_dataset.hf_dataset[0]:
+        offline_dataset.hf_dataset = offline_dataset.hf_dataset.remove_columns(["task_index"])
+
    if isinstance(offline_dataset, MultiLeRobotDataset):
        logging.info(
            "Multiple datasets were provided. Applied the following index mapping to the provided datasets: "
--- a/lerobot/scripts/train_hilserl_classifier.py
+++ b/lerobot/scripts/train_hilserl_classifier.py
@@ -0,0 +1,320 @@
+#!/usr/bin/env python
+
+# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import logging
+import time
+from contextlib import nullcontext
+from pathlib import Path
+from pprint import pformat
+
+import hydra
+import torch
+import torch.nn as nn
+import wandb
+from deepdiff import DeepDiff
+from omegaconf import DictConfig, OmegaConf
+from termcolor import colored
+from torch import optim
+from torch.cuda.amp import GradScaler
+from torch.utils.data import DataLoader, WeightedRandomSampler, random_split
+from tqdm import tqdm
+
+from lerobot.common.datasets.factory import resolve_delta_timestamps
+from lerobot.common.datasets.lerobot_dataset import LeRobotDataset
+from lerobot.common.logger import Logger
+from lerobot.common.policies.factory import _policy_cfg_from_hydra_cfg
+from lerobot.common.policies.hilserl.classifier.configuration_classifier import ClassifierConfig
+from lerobot.common.policies.hilserl.classifier.modeling_classifier import Classifier
+from lerobot.common.utils.utils import (
+    format_big_number,
+    get_safe_torch_device,
+    init_hydra_config,
+    set_global_seed,
+)
+
+
+def get_model(cfg, logger):  # noqa I001
+    classifier_config = _policy_cfg_from_hydra_cfg(ClassifierConfig, cfg)
+    model = Classifier(classifier_config)
+    if cfg.resume:
+        model.load_state_dict(Classifier.from_pretrained(str(logger.last_pretrained_model_dir)).state_dict())
+    return model
+
+
+def create_balanced_sampler(dataset, cfg):
+    # Creates a weighted sampler to handle class imbalance
+
+    labels = torch.tensor([item[cfg.training.label_key] for item in dataset])
+    _, counts = torch.unique(labels, return_counts=True)
+    class_weights = 1.0 / counts.float()
+    sample_weights = class_weights[labels]
+
+    return WeightedRandomSampler(weights=sample_weights, num_samples=len(sample_weights), replacement=True)
+
+
+def support_amp(device: torch.device, cfg: DictConfig) -> bool:
+    # Check if the device supports AMP
+    # Here is an example of the issue that says that MPS doesn't support AMP properply
+    return cfg.training.use_amp and device.type in ("cuda", "cpu")
+
+
+def train_epoch(model, train_loader, criterion, optimizer, grad_scaler, device, logger, step, cfg):
+    # Single epoch training loop with AMP support and progress tracking
+    model.train()
+    correct = 0
+    total = 0
+
+    pbar = tqdm(train_loader, desc="Training")
+    for batch_idx, batch in enumerate(pbar):
+        start_time = time.perf_counter()
+        images = [batch[img_key].to(device) for img_key in cfg.training.image_keys]
+        labels = batch[cfg.training.label_key].float().to(device)
+
+        # Forward pass with optional AMP
+        with torch.autocast(device_type=device.type) if support_amp(device, cfg) else nullcontext():
+            outputs = model(images)
+            loss = criterion(outputs.logits, labels)
+
+        # Backward pass with gradient scaling if AMP enabled
+        optimizer.zero_grad()
+        if cfg.training.use_amp:
+            grad_scaler.scale(loss).backward()
+            grad_scaler.step(optimizer)
+            grad_scaler.update()
+        else:
+            loss.backward()
+            optimizer.step()
+
+        # Track metrics
+        if model.config.num_classes == 2:
+            predictions = (torch.sigmoid(outputs.logits) > 0.5).float()
+        else:
+            predictions = torch.argmax(outputs.logits, dim=1)
+        correct += (predictions == labels).sum().item()
+        total += labels.size(0)
+
+        current_acc = 100 * correct / total
+        train_info = {
+            "loss": loss.item(),
+            "accuracy": current_acc,
+            "dataloading_s": time.perf_counter() - start_time,
+        }
+
+        logger.log_dict(train_info, step + batch_idx, mode="train")
+        pbar.set_postfix({"loss": f"{loss.item():.4f}", "acc": f"{current_acc:.2f}%"})
+
+
+def validate(model, val_loader, criterion, device, logger, cfg, num_samples_to_log=8):
+    # Validation loop with metric tracking and sample logging
+    model.eval()
+    correct = 0
+    total = 0
+    batch_start_time = time.perf_counter()
+    samples = []
+    running_loss = 0
+
+    with (
+        torch.no_grad(),
+        torch.autocast(device_type=device.type) if support_amp(device, cfg) else nullcontext(),
+    ):
+        for batch in tqdm(val_loader, desc="Validation"):
+            images = [batch[img_key].to(device) for img_key in cfg.training.image_keys]
+            labels = batch[cfg.training.label_key].float().to(device)
+
+            outputs = model(images)
+            loss = criterion(outputs.logits, labels)
+
+            # Track metrics
+            if model.config.num_classes == 2:
+                predictions = (torch.sigmoid(outputs.logits) > 0.5).float()
+            else:
+                predictions = torch.argmax(outputs.logits, dim=1)
+            correct += (predictions == labels).sum().item()
+            total += labels.size(0)
+            running_loss += loss.item()
+
+            # Log sample predictions for visualization
+            if len(samples) < num_samples_to_log:
+                for i in range(min(num_samples_to_log - len(samples), len(images))):
+                    if model.config.num_classes == 2:
+                        confidence = round(outputs.probabilities[i].item(), 3)
+                    else:
+                        confidence = [round(prob, 3) for prob in outputs.probabilities[i].tolist()]
+                    samples.append(
+                        {
+                            "image": wandb.Image(images[i].cpu()),
+                            "true_label": labels[i].item(),
+                            "predicted": predictions[i].item(),
+                            "confidence": confidence,
+                        }
+                    )
+
+    accuracy = 100 * correct / total
+    avg_loss = running_loss / len(val_loader)
+    print(f"Average validation loss {avg_loss}, and accuracy {accuracy}")
+
+    eval_info = {
+        "loss": avg_loss,
+        "accuracy": accuracy,
+        "eval_s": time.perf_counter() - batch_start_time,
+        "eval/prediction_samples": wandb.Table(
+            data=[[s["image"], s["true_label"], s["predicted"], f"{s['confidence']}"] for s in samples],
+            columns=["Image", "True Label", "Predicted", "Confidence"],
+        )
+        if logger._cfg.wandb.enable
+        else None,
+    }
+
+    return accuracy, eval_info
+
+
+@hydra.main(version_base="1.2", config_path="../configs/policy", config_name="hilserl_classifier")
+def train(cfg: DictConfig) -> None:
+    # Main training pipeline with support for resuming training
+    logging.info(OmegaConf.to_yaml(cfg))
+
+    # Initialize training environment
+    device = get_safe_torch_device(cfg.device, log=True)
+    set_global_seed(cfg.seed)
+
+    out_dir = Path(cfg.output_dir)
+    out_dir.mkdir(parents=True, exist_ok=True)
+    logger = Logger(cfg, out_dir, cfg.wandb.job_name if cfg.wandb.enable else None)
+
+    # Setup dataset and dataloaders
+    dataset = LeRobotDataset(cfg.dataset_repo_id)
+    logging.info(f"Dataset size: {len(dataset)}")
+
+    train_size = int(cfg.train_split_proportion * len(dataset))
+    val_size = len(dataset) - train_size
+    train_dataset, val_dataset = random_split(dataset, [train_size, val_size])
+
+    sampler = create_balanced_sampler(train_dataset, cfg)
+    train_loader = DataLoader(
+        train_dataset,
+        batch_size=cfg.training.batch_size,
+        num_workers=cfg.training.num_workers,
+        sampler=sampler,
+        pin_memory=True,
+    )
+
+    val_loader = DataLoader(
+        val_dataset,
+        batch_size=cfg.eval.batch_size,
+        shuffle=False,
+        num_workers=cfg.training.num_workers,
+        pin_memory=True,
+    )
+
+    # Resume training if requested
+    step = 0
+    best_val_acc = 0
+
+    if cfg.resume:
+        if not Logger.get_last_checkpoint_dir(out_dir).exists():
+            raise RuntimeError(
+                "You have set resume=True, but there is no model checkpoint in "
+                f"{Logger.get_last_checkpoint_dir(out_dir)}"
+            )
+        checkpoint_cfg_path = str(Logger.get_last_pretrained_model_dir(out_dir) / "config.yaml")
+        logging.info(
+            colored(
+                "You have set resume=True, indicating that you wish to resume a run",
+                color="yellow",
+                attrs=["bold"],
+            )
+        )
+        # Load and validate checkpoint configuration
+        checkpoint_cfg = init_hydra_config(checkpoint_cfg_path)
+        # Check for differences between the checkpoint configuration and provided configuration.
+        # Hack to resolve the delta_timestamps ahead of time in order to properly diff.
+        resolve_delta_timestamps(cfg)
+        diff = DeepDiff(OmegaConf.to_container(checkpoint_cfg), OmegaConf.to_container(cfg))
+        # Ignore the `resume` and parameters.
+        if "values_changed" in diff and "root['resume']" in diff["values_changed"]:
+            del diff["values_changed"]["root['resume']"]
+        if len(diff) > 0:
+            logging.warning(
+                "At least one difference was detected between the checkpoint configuration and "
+                f"the provided configuration: \n{pformat(diff)}\nNote that the checkpoint configuration "
+                "takes precedence.",
+            )
+        # Use the checkpoint config instead of the provided config (but keep `resume` parameter).
+        cfg = checkpoint_cfg
+        cfg.resume = True
+
+    # Initialize model and training components
+    model = get_model(cfg=cfg, logger=logger).to(device)
+
+    optimizer = optim.AdamW(model.parameters(), lr=cfg.training.learning_rate)
+    # Use BCEWithLogitsLoss for binary classification and CrossEntropyLoss for multi-class
+    criterion = nn.BCEWithLogitsLoss() if model.config.num_classes == 2 else nn.CrossEntropyLoss()
+    grad_scaler = GradScaler(enabled=cfg.training.use_amp)
+
+    # Log model parameters
+    num_learnable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
+    num_total_params = sum(p.numel() for p in model.parameters())
+    logging.info(f"Learnable parameters: {format_big_number(num_learnable_params)}")
+    logging.info(f"Total parameters: {format_big_number(num_total_params)}")
+
+    if cfg.resume:
+        step = logger.load_last_training_state(optimizer, None)
+
+    # Training loop with validation and checkpointing
+    for epoch in range(cfg.training.num_epochs):
+        logging.info(f"\nEpoch {epoch+1}/{cfg.training.num_epochs}")
+
+        train_epoch(model, train_loader, criterion, optimizer, grad_scaler, device, logger, step, cfg)
+
+        # Periodic validation
+        if cfg.training.eval_freq > 0 and (epoch + 1) % cfg.training.eval_freq == 0:
+            val_acc, eval_info = validate(
+                model,
+                val_loader,
+                criterion,
+                device,
+                logger,
+                cfg,
+            )
+            logger.log_dict(eval_info, step + len(train_loader), mode="eval")
+
+            # Save best model
+            if val_acc > best_val_acc:
+                best_val_acc = val_acc
+                logger.save_checkpoint(
+                    train_step=step + len(train_loader),
+                    policy=model,
+                    optimizer=optimizer,
+                    scheduler=None,
+                    identifier="best",
+                )
+
+        # Periodic checkpointing
+        if cfg.training.save_checkpoint and (epoch + 1) % cfg.training.save_freq == 0:
+            logger.save_checkpoint(
+                train_step=step + len(train_loader),
+                policy=model,
+                optimizer=optimizer,
+                scheduler=None,
+                identifier=f"{epoch+1:06d}",
+            )
+
+        step += len(train_loader)
+
+    logging.info("Training completed")
+
+
+if __name__ == "__main__":
+    train()
--- a/lerobot/scripts/train_sac.py
+++ b/lerobot/scripts/train_sac.py
@@ -0,0 +1,586 @@
+#!/usr/bin/env python
+
+# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import logging
+import functools
+from pprint import pformat
+import random
+from typing import Optional, Sequence, TypedDict, Callable
+
+import hydra
+import torch
+import torch.nn.functional as F
+from torch import nn
+from tqdm import tqdm
+from deepdiff import DeepDiff
+from omegaconf import DictConfig, OmegaConf
+
+from lerobot.common.datasets.lerobot_dataset import LeRobotDataset
+
+# TODO: Remove the import of maniskill
+from lerobot.common.datasets.factory import make_dataset
+from lerobot.common.envs.factory import make_env, make_maniskill_env
+from lerobot.common.envs.utils import preprocess_observation, preprocess_maniskill_observation
+from lerobot.common.logger import Logger, log_output_dir
+from lerobot.common.policies.factory import make_policy
+from lerobot.common.policies.sac.modeling_sac import SACPolicy
+from lerobot.common.policies.utils import get_device_from_parameters
+from lerobot.common.utils.utils import (
+    format_big_number,
+    get_safe_torch_device,
+    init_hydra_config,
+    init_logging,
+    set_global_seed,
+)
+from lerobot.scripts.eval import eval_policy
+
+
+def make_optimizers_and_scheduler(cfg, policy):
+    optimizer_actor = torch.optim.Adam(
+        # NOTE: Handle the case of shared encoder where the encoder weights are not optimized with the gradient of the actor
+        params=policy.actor.parameters_to_optimize,
+        lr=policy.config.actor_lr,
+    )
+    optimizer_critic = torch.optim.Adam(
+        params=policy.critic_ensemble.parameters(), lr=policy.config.critic_lr
+    )
+    # We wrap policy log temperature in list because this is a torch tensor and not a nn.Module
+    optimizer_temperature = torch.optim.Adam(params=[policy.log_alpha], lr=policy.config.critic_lr)
+    lr_scheduler = None
+    optimizers = {
+        "actor": optimizer_actor,
+        "critic": optimizer_critic,
+        "temperature": optimizer_temperature,
+    }
+    return optimizers, lr_scheduler
+
+
+class Transition(TypedDict):
+    state: dict[str, torch.Tensor]
+    action: torch.Tensor
+    reward: float
+    next_state: dict[str, torch.Tensor]
+    done: bool
+    complementary_info: dict[str, torch.Tensor] = None
+
+
+class BatchTransition(TypedDict):
+    state: dict[str, torch.Tensor]
+    action: torch.Tensor
+    reward: torch.Tensor
+    next_state: dict[str, torch.Tensor]
+    done: torch.Tensor
+
+
+def random_crop_vectorized(images: torch.Tensor, output_size: tuple) -> torch.Tensor:
+    """
+    Perform a per-image random crop over a batch of images in a vectorized way.
+    (Same as shown previously.)
+    """
+    B, C, H, W = images.shape
+    crop_h, crop_w = output_size
+
+    if crop_h > H or crop_w > W:
+        raise ValueError(
+            f"Requested crop size ({crop_h}, {crop_w}) is bigger than the image size ({H}, {W})."
+        )
+
+    tops = torch.randint(0, H - crop_h + 1, (B,), device=images.device)
+    lefts = torch.randint(0, W - crop_w + 1, (B,), device=images.device)
+
+    rows = torch.arange(crop_h, device=images.device).unsqueeze(0) + tops.unsqueeze(1)
+    cols = torch.arange(crop_w, device=images.device).unsqueeze(0) + lefts.unsqueeze(1)
+
+    rows = rows.unsqueeze(2).expand(-1, -1, crop_w)  # (B, crop_h, crop_w)
+    cols = cols.unsqueeze(1).expand(-1, crop_h, -1)  # (B, crop_h, crop_w)
+
+    images_hwcn = images.permute(0, 2, 3, 1)  # (B, H, W, C)
+
+    # Gather pixels
+    cropped_hwcn = images_hwcn[torch.arange(B, device=images.device).view(B, 1, 1), rows, cols, :]
+    # cropped_hwcn => (B, crop_h, crop_w, C)
+
+    cropped = cropped_hwcn.permute(0, 3, 1, 2)  # (B, C, crop_h, crop_w)
+    return cropped
+
+
+def random_shift(images: torch.Tensor, pad: int = 4):
+    """Vectorized random shift, imgs: (B,C,H,W), pad: #pixels"""
+    _, _, h, w = images.shape
+    images = F.pad(input=images, pad=(pad, pad, pad, pad), mode="replicate")
+    return random_crop_vectorized(images=images, output_size=(h, w))
+
+
+class ReplayBuffer:
+    def __init__(
+        self,
+        capacity: int,
+        device: str = "cuda:0",
+        state_keys: Optional[Sequence[str]] = None,
+        image_augmentation_function: Optional[Callable] = None,
+        use_drq: bool = True,
+    ):
+        """
+        Args:
+            capacity (int): Maximum number of transitions to store in the buffer.
+            device (str): The device where the tensors will be moved ("cuda:0" or "cpu").
+            state_keys (List[str]): The list of keys that appear in `state` and `next_state`.
+            image_augmentation_function (Optional[Callable]): A function that takes a batch of images
+                and returns a batch of augmented images. If None, a default augmentation function is used.
+            use_drq (bool): Whether to use the default DRQ image augmentation style, when sampling in the buffer.
+        """
+        self.capacity = capacity
+        self.device = device
+        self.memory: list[Transition] = []
+        self.position = 0
+
+        # If no state_keys provided, default to an empty list
+        # (you can handle this differently if needed)
+        self.state_keys = state_keys if state_keys is not None else []
+        if image_augmentation_function is None:
+            self.image_augmentation_function = functools.partial(random_shift, pad=4)
+        self.use_drq = use_drq
+
+    def add(
+        self,
+        state: dict[str, torch.Tensor],
+        action: torch.Tensor,
+        reward: float,
+        next_state: dict[str, torch.Tensor],
+        done: bool,
+        complementary_info: Optional[dict[str, torch.Tensor]] = None,
+    ):
+        """Saves a transition."""
+        if len(self.memory) < self.capacity:
+            self.memory.append(None)
+
+        # Create and store the Transition
+        self.memory[self.position] = Transition(
+            state=state,
+            action=action,
+            reward=reward,
+            next_state=next_state,
+            done=done,
+            complementary_info=complementary_info,
+        )
+        self.position: int = (self.position + 1) % self.capacity
+
+    @classmethod
+    def from_lerobot_dataset(
+        cls,
+        lerobot_dataset: LeRobotDataset,
+        device: str = "cuda:0",
+        state_keys: Optional[Sequence[str]] = None,
+    ) -> "ReplayBuffer":
+        """
+        Convert a LeRobotDataset into a ReplayBuffer.
+
+        Args:
+            lerobot_dataset (LeRobotDataset): The dataset to convert.
+            device (str): The device . Defaults to "cuda:0".
+            state_keys (Optional[Sequence[str]], optional): The list of keys that appear in `state` and `next_state`.
+            Defaults to None.
+
+        Returns:
+            ReplayBuffer: The replay buffer with offline dataset transitions.
+        """
+        # We convert the LeRobotDataset into a replay buffer, because it is more efficient to sample from
+        # a replay buffer than from a lerobot dataset.
+        replay_buffer = cls(capacity=len(lerobot_dataset), device=device, state_keys=state_keys)
+        list_transition = cls._lerobotdataset_to_transitions(dataset=lerobot_dataset, state_keys=state_keys)
+        # Fill the replay buffer with the lerobot dataset transitions
+        for data in list_transition:
+            replay_buffer.add(
+                state=data["state"],
+                action=data["action"],
+                reward=data["reward"],
+                next_state=data["next_state"],
+                done=data["done"],
+            )
+        return replay_buffer
+
+    @staticmethod
+    def _lerobotdataset_to_transitions(
+        dataset: LeRobotDataset,
+        state_keys: Optional[Sequence[str]] = None,
+    ) -> list[Transition]:
+        """
+        Convert a LeRobotDataset into a list of RL (s, a, r, s', done) transitions.
+
+        Args:
+            dataset (LeRobotDataset):
+                The dataset to convert. Each item in the dataset is expected to have
+                at least the following keys:
+                {
+                    "action": ...
+                    "next.reward": ...
+                    "next.done": ...
+                    "episode_index": ...
+                }
+                plus whatever your 'state_keys' specify.
+
+            state_keys (Optional[Sequence[str]]):
+                The dataset keys to include in 'state' and 'next_state'. Their names
+                will be kept as-is in the output transitions. E.g.
+                ["observation.state", "observation.environment_state"].
+                If None, you must handle or define default keys.
+
+        Returns:
+            transitions (List[Transition]):
+                A list of Transition dictionaries with the same length as `dataset`.
+        """
+
+        # If not provided, you can either raise an error or define a default:
+        if state_keys is None:
+            raise ValueError("You must provide a list of keys in `state_keys` that define your 'state'.")
+
+        transitions: list[Transition] = []
+        num_frames = len(dataset)
+
+        for i in tqdm(range(num_frames)):
+            current_sample = dataset[i]
+
+            # ----- 1) Current state -----
+            current_state: dict[str, torch.Tensor] = {}
+            for key in state_keys:
+                val = current_sample[key]
+                current_state[key] = val.unsqueeze(0)  # Add batch dimension
+
+            # ----- 2) Action -----
+            action = current_sample["action"].unsqueeze(0)  # Add batch dimension
+
+            # ----- 3) Reward and done -----
+            reward = float(current_sample["next.reward"].item())  # ensure float
+            done = bool(current_sample["next.done"].item())  # ensure bool
+
+            # ----- 4) Next state -----
+            # If not done and the next sample is in the same episode, we pull the next sample's state.
+            # Otherwise (done=True or next sample crosses to a new episode), next_state = current_state.
+            next_state = current_state  # default
+            if not done and (i < num_frames - 1):
+                next_sample = dataset[i + 1]
+                if next_sample["episode_index"] == current_sample["episode_index"]:
+                    # Build next_state from the same keys
+                    next_state_data: dict[str, torch.Tensor] = {}
+                    for key in state_keys:
+                        val = next_sample[key]
+                        next_state_data[key] = val.unsqueeze(0)  # Add batch dimension
+                    next_state = next_state_data
+
+            # ----- Construct the Transition -----
+            transition = Transition(
+                state=current_state,
+                action=action,
+                reward=reward,
+                next_state=next_state,
+                done=done,
+            )
+            transitions.append(transition)
+
+        return transitions
+
+    def sample(self, batch_size: int) -> BatchTransition:
+        """Sample a random batch of transitions and collate them into batched tensors."""
+        list_of_transitions = random.sample(self.memory, batch_size)
+
+        # -- Build batched states --
+        batch_state = {}
+        for key in self.state_keys:
+            batch_state[key] = torch.cat([t["state"][key] for t in list_of_transitions], dim=0).to(
+                self.device
+            )
+            if key.startswith("observation.image") and self.use_drq:
+                batch_state[key] = self.image_augmentation_function(batch_state[key])
+
+        # -- Build batched actions --
+        batch_actions = torch.cat([t["action"] for t in list_of_transitions]).to(self.device)
+
+        # -- Build batched rewards --
+        batch_rewards = torch.tensor([t["reward"] for t in list_of_transitions], dtype=torch.float32).to(
+            self.device
+        )
+
+        # -- Build batched next states --
+        batch_next_state = {}
+        for key in self.state_keys:
+            batch_next_state[key] = torch.cat([t["next_state"][key] for t in list_of_transitions], dim=0).to(
+                self.device
+            )
+            if key.startswith("observation.image") and self.use_drq:
+                batch_next_state[key] = self.image_augmentation_function(batch_next_state[key])
+
+        # -- Build batched dones --
+        batch_dones = torch.tensor([t["done"] for t in list_of_transitions], dtype=torch.float32).to(
+            self.device
+        )
+        batch_dones = torch.tensor([t["done"] for t in list_of_transitions], dtype=torch.float32).to(
+            self.device
+        )
+
+        # Return a BatchTransition typed dict
+        return BatchTransition(
+            state=batch_state,
+            action=batch_actions,
+            reward=batch_rewards,
+            next_state=batch_next_state,
+            done=batch_dones,
+        )
+
+
+def concatenate_batch_transitions(
+    left_batch_transitions: BatchTransition, right_batch_transition: BatchTransition
+) -> BatchTransition:
+    """NOTE: Be careful it change the left_batch_transitions in place"""
+    left_batch_transitions["state"] = {
+        key: torch.cat([left_batch_transitions["state"][key], right_batch_transition["state"][key]], dim=0)
+        for key in left_batch_transitions["state"]
+    }
+    left_batch_transitions["action"] = torch.cat(
+        [left_batch_transitions["action"], right_batch_transition["action"]], dim=0
+    )
+    left_batch_transitions["reward"] = torch.cat(
+        [left_batch_transitions["reward"], right_batch_transition["reward"]], dim=0
+    )
+    left_batch_transitions["next_state"] = {
+        key: torch.cat(
+            [left_batch_transitions["next_state"][key], right_batch_transition["next_state"][key]], dim=0
+        )
+        for key in left_batch_transitions["next_state"]
+    }
+    left_batch_transitions["done"] = torch.cat(
+        [left_batch_transitions["done"], right_batch_transition["done"]], dim=0
+    )
+    return left_batch_transitions
+
+
+def train(cfg: DictConfig, out_dir: str | None = None, job_name: str | None = None):
+    if out_dir is None:
+        raise NotImplementedError()
+    if job_name is None:
+        raise NotImplementedError()
+
+    init_logging()
+    logging.info(pformat(OmegaConf.to_container(cfg)))
+
+    # Create an env dedicated to online episodes collection from policy rollout.
+    # online_env = make_env(cfg, n_envs=cfg.training.online_rollout_batch_size)
+    # NOTE: Off policy algorithm are efficient enought to use a single environment
+    logging.info("make_env online")
+    # online_env = make_env(cfg, n_envs=1)
+    # TODO: Remove the import of maniskill and unifiy with make env
+    online_env = make_maniskill_env(cfg, n_envs=1)
+    if cfg.training.eval_freq > 0:
+        logging.info("make_env eval")
+        # eval_env = make_env(cfg, n_envs=1)
+        # TODO: Remove the import of maniskill and unifiy with make env
+        eval_env = make_maniskill_env(cfg, n_envs=1)
+
+    # TODO: Add a way to resume training
+
+    # log metrics to terminal and wandb
+    logger = Logger(cfg, out_dir, wandb_job_name=job_name)
+
+    set_global_seed(cfg.seed)
+
+    # Check device is available
+    device = get_safe_torch_device(cfg.device, log=True)
+
+    torch.backends.cudnn.benchmark = True
+    torch.backends.cuda.matmul.allow_tf32 = True
+
+    logging.info("make_policy")
+    # TODO: At some point we should just need make sac policy
+    policy: SACPolicy = make_policy(
+        hydra_cfg=cfg,
+        # dataset_stats=offline_dataset.meta.stats if not cfg.resume else None,
+        # Hack: But if we do online traning, we do not need dataset_stats
+        dataset_stats=None,
+        pretrained_policy_name_or_path=str(logger.last_pretrained_model_dir) if cfg.resume else None,
+        device=device,
+    )
+    assert isinstance(policy, nn.Module)
+
+    optimizers, lr_scheduler = make_optimizers_and_scheduler(cfg, policy)
+
+    # TODO: Handle resume
+
+    num_learnable_params = sum(p.numel() for p in policy.parameters() if p.requires_grad)
+    num_total_params = sum(p.numel() for p in policy.parameters())
+
+    log_output_dir(out_dir)
+    logging.info(f"{cfg.env.task=}")
+    logging.info(f"{cfg.training.online_steps=}")
+    logging.info(f"{num_learnable_params=} ({format_big_number(num_learnable_params)})")
+    logging.info(f"{num_total_params=} ({format_big_number(num_total_params)})")
+
+    obs, info = online_env.reset()
+
+    # HACK for maniskill
+    # obs = preprocess_observation(obs)
+    obs = preprocess_maniskill_observation(obs)
+    obs = {key: obs[key].to(device, non_blocking=True) for key in obs}
+
+    replay_buffer = ReplayBuffer(
+        capacity=cfg.training.online_buffer_capacity, device=device, state_keys=cfg.policy.input_shapes.keys()
+    )
+
+    batch_size = cfg.training.batch_size
+
+    if cfg.dataset_repo_id is not None:
+        logging.info("make_dataset offline buffer")
+        offline_dataset = make_dataset(cfg)
+        logging.info("Convertion to a offline replay buffer")
+        offline_replay_buffer = ReplayBuffer.from_lerobot_dataset(
+            offline_dataset, device=device, state_keys=cfg.policy.input_shapes.keys()
+        )
+        batch_size: int = batch_size // 2  # We will sample from both replay buffer
+
+    # NOTE: For the moment we will solely handle the case of a single environment
+    sum_reward_episode = 0
+
+    for interaction_step in range(cfg.training.online_steps):
+        # NOTE: At some point we should use a  wrapper to handle the observation
+
+        if interaction_step >= cfg.training.online_step_before_learning:
+            action = policy.select_action(batch=obs)
+            next_obs, reward, done, truncated, info = online_env.step(action.cpu().numpy())
+        else:
+            action = online_env.action_space.sample()
+            next_obs, reward, done, truncated, info = online_env.step(action)
+            # HACK
+            action = torch.tensor(action, dtype=torch.float32).to(device, non_blocking=True)
+
+        # HACK: For maniskill
+        # next_obs = preprocess_observation(next_obs)
+        next_obs = preprocess_maniskill_observation(next_obs)
+        next_obs = {key: next_obs[key].to(device, non_blocking=True) for key in obs}
+        sum_reward_episode += float(reward[0])
+        # Because we are using a single environment
+        # we can safely assume that the episode is done
+        if done[0] or truncated[0]:
+            logging.info(f"Global step {interaction_step}: Episode reward: {sum_reward_episode}")
+            logger.log_dict({"Sum episode reward": sum_reward_episode}, interaction_step)
+            sum_reward_episode = 0
+            # HACK: This is for maniskill
+            logging.info(
+                f"global step {interaction_step}: episode success: {info['success'].float().item()} \n"
+            )
+            logger.log_dict({"Episode success": info["success"].float().item()}, interaction_step)
+
+        replay_buffer.add(
+            state=obs,
+            action=action,
+            reward=float(reward[0]),
+            next_state=next_obs,
+            done=done[0],
+        )
+        obs = next_obs
+
+        if interaction_step < cfg.training.online_step_before_learning:
+            continue
+        for _ in range(cfg.policy.utd_ratio - 1):
+            batch = replay_buffer.sample(batch_size)
+            if cfg.dataset_repo_id is not None:
+                batch_offline = offline_replay_buffer.sample(batch_size)
+                batch = concatenate_batch_transitions(batch, batch_offline)
+
+            actions = batch["action"]
+            rewards = batch["reward"]
+            observations = batch["state"]
+            next_observations = batch["next_state"]
+            done = batch["done"]
+
+            loss_critic = policy.compute_loss_critic(
+                observations=observations,
+                actions=actions,
+                rewards=rewards,
+                next_observations=next_observations,
+                done=done,
+            )
+            optimizers["critic"].zero_grad()
+            loss_critic.backward()
+            optimizers["critic"].step()
+
+        batch = replay_buffer.sample(batch_size)
+        if cfg.dataset_repo_id is not None:
+            batch_offline = offline_replay_buffer.sample(batch_size)
+            batch = concatenate_batch_transitions(
+                left_batch_transitions=batch, right_batch_transition=batch_offline
+            )
+
+        actions = batch["action"]
+        rewards = batch["reward"]
+        observations = batch["state"]
+        next_observations = batch["next_state"]
+        done = batch["done"]
+
+        loss_critic = policy.compute_loss_critic(
+            observations=observations,
+            actions=actions,
+            rewards=rewards,
+            next_observations=next_observations,
+            done=done,
+        )
+        optimizers["critic"].zero_grad()
+        loss_critic.backward()
+        optimizers["critic"].step()
+
+        training_infos = {}
+        training_infos["loss_critic"] = loss_critic.item()
+
+        if interaction_step % cfg.training.policy_update_freq == 0:
+            # TD3 Trick
+            for _ in range(cfg.training.policy_update_freq):
+                loss_actor = policy.compute_loss_actor(observations=observations)
+
+                optimizers["actor"].zero_grad()
+                loss_actor.backward()
+                optimizers["actor"].step()
+
+                training_infos["loss_actor"] = loss_actor.item()
+
+                loss_temperature = policy.compute_loss_temperature(observations=observations)
+                optimizers["temperature"].zero_grad()
+                loss_temperature.backward()
+                optimizers["temperature"].step()
+
+                training_infos["loss_temperature"] = loss_temperature.item()
+
+        if interaction_step % cfg.training.log_freq == 0:
+            logger.log_dict(training_infos, interaction_step, mode="train")
+
+        policy.update_target_networks()
+
+
+@hydra.main(version_base="1.2", config_name="default", config_path="../configs")
+def train_cli(cfg: dict):
+    train(
+        cfg,
+        out_dir=hydra.core.hydra_config.HydraConfig.get().run.dir,
+        job_name=hydra.core.hydra_config.HydraConfig.get().job.name,
+    )
+
+
+def train_notebook(out_dir=None, job_name=None, config_name="default", config_path="../configs"):
+    from hydra import compose, initialize
+
+    hydra.core.global_hydra.GlobalHydra.instance().clear()
+    initialize(config_path=config_path)
+    cfg = compose(config_name=config_name)
+    train(cfg, out_dir=out_dir, job_name=job_name)
+
+
+if __name__ == "__main__":
+    train_cli()
--- a/poetry.lock
+++ b/poetry.lock
@@ -3139,6 +3139,27 @@ dev = ["changelist (==0.5)"]
 lint = ["pre-commit (==3.7.0)"]
 test = ["pytest (>=7.4)", "pytest-cov (>=4.1)"]

+[[package]]
+name = "lightning-utilities"
+version = "0.11.9"
+description = "Lightning toolbox for across the our ecosystem."
+optional = true
+python-versions = ">=3.8"
+files = [
+    {file = "lightning_utilities-0.11.9-py3-none-any.whl", hash = "sha256:ac6d4e9e28faf3ff4be997876750fee10dc604753dbc429bf3848a95c5d7e0d2"},
+    {file = "lightning_utilities-0.11.9.tar.gz", hash = "sha256:f5052b81344cc2684aa9afd74b7ce8819a8f49a858184ec04548a5a109dfd053"},
+]
+
+[package.dependencies]
+packaging = ">=17.1"
+setuptools = "*"
+typing-extensions = "*"
+
+[package.extras]
+cli = ["fire"]
+docs = ["requests (>=2.0.0)"]
+typing = ["mypy (>=1.0.0)", "types-setuptools"]
+
 [[package]]
 name = "llvmlite"
 version = "0.43.0"
@@ -6798,6 +6819,38 @@ webencodings = ">=0.4"
 doc = ["sphinx", "sphinx_rtd_theme"]
 test = ["pytest", "ruff"]

+[[package]]
+name = "tokenizers"
+version = "0.21.0"
+description = ""
+optional = true
+python-versions = ">=3.7"
+files = [
+    {file = "tokenizers-0.21.0-cp39-abi3-macosx_10_12_x86_64.whl", hash = "sha256:3c4c93eae637e7d2aaae3d376f06085164e1660f89304c0ab2b1d08a406636b2"},
+    {file = "tokenizers-0.21.0-cp39-abi3-macosx_11_0_arm64.whl", hash = "sha256:f53ea537c925422a2e0e92a24cce96f6bc5046bbef24a1652a5edc8ba975f62e"},
+    {file = "tokenizers-0.21.0-cp39-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:6b177fb54c4702ef611de0c069d9169f0004233890e0c4c5bd5508ae05abf193"},
+    {file = "tokenizers-0.21.0-cp39-abi3-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:6b43779a269f4629bebb114e19c3fca0223296ae9fea8bb9a7a6c6fb0657ff8e"},
+    {file = "tokenizers-0.21.0-cp39-abi3-manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:9aeb255802be90acfd363626753fda0064a8df06031012fe7d52fd9a905eb00e"},
+    {file = "tokenizers-0.21.0-cp39-abi3-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:d8b09dbeb7a8d73ee204a70f94fc06ea0f17dcf0844f16102b9f414f0b7463ba"},
+    {file = "tokenizers-0.21.0-cp39-abi3-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:400832c0904f77ce87c40f1a8a27493071282f785724ae62144324f171377273"},
+    {file = "tokenizers-0.21.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:e84ca973b3a96894d1707e189c14a774b701596d579ffc7e69debfc036a61a04"},
+    {file = "tokenizers-0.21.0-cp39-abi3-musllinux_1_2_aarch64.whl", hash = "sha256:eb7202d231b273c34ec67767378cd04c767e967fda12d4a9e36208a34e2f137e"},
+    {file = "tokenizers-0.21.0-cp39-abi3-musllinux_1_2_armv7l.whl", hash = "sha256:089d56db6782a73a27fd8abf3ba21779f5b85d4a9f35e3b493c7bbcbbf0d539b"},
+    {file = "tokenizers-0.21.0-cp39-abi3-musllinux_1_2_i686.whl", hash = "sha256:c87ca3dc48b9b1222d984b6b7490355a6fdb411a2d810f6f05977258400ddb74"},
+    {file = "tokenizers-0.21.0-cp39-abi3-musllinux_1_2_x86_64.whl", hash = "sha256:4145505a973116f91bc3ac45988a92e618a6f83eb458f49ea0790df94ee243ff"},
+    {file = "tokenizers-0.21.0-cp39-abi3-win32.whl", hash = "sha256:eb1702c2f27d25d9dd5b389cc1f2f51813e99f8ca30d9e25348db6585a97e24a"},
+    {file = "tokenizers-0.21.0-cp39-abi3-win_amd64.whl", hash = "sha256:87841da5a25a3a5f70c102de371db120f41873b854ba65e52bccd57df5a3780c"},
+    {file = "tokenizers-0.21.0.tar.gz", hash = "sha256:ee0894bf311b75b0c03079f33859ae4b2334d675d4e93f5a4132e1eae2834fe4"},
+]
+
+[package.dependencies]
+huggingface-hub = ">=0.16.4,<1.0"
+
+[package.extras]
+dev = ["tokenizers[testing]"]
+docs = ["setuptools-rust", "sphinx", "sphinx-rtd-theme"]
+testing = ["black (==22.3)", "datasets", "numpy", "pytest", "requests", "ruff"]
+
 [[package]]
 name = "tomli"
 version = "2.0.2"
@@ -6863,6 +6916,34 @@ typing-extensions = ">=4.8.0"
 opt-einsum = ["opt-einsum (>=3.3)"]
 optree = ["optree (>=0.11.0)"]

+[[package]]
+name = "torchmetrics"
+version = "1.6.0"
+description = "PyTorch native Metrics"
+optional = true
+python-versions = ">=3.9"
+files = [
+    {file = "torchmetrics-1.6.0-py3-none-any.whl", hash = "sha256:a508cdd87766cedaaf55a419812bf9f493aff8fffc02cc19df5a8e2e7ccb942a"},
+    {file = "torchmetrics-1.6.0.tar.gz", hash = "sha256:aebba248708fb90def20cccba6f55bddd134a58de43fb22b0c5ca0f3a89fa984"},
+]
+
+[package.dependencies]
+lightning-utilities = ">=0.8.0"
+numpy = ">1.20.0"
+packaging = ">17.1"
+torch = ">=2.0.0"
+
+[package.extras]
+all = ["SciencePlots (>=2.0.0)", "gammatone (>=1.0.0)", "ipadic (>=1.0.0)", "librosa (>=0.10.0)", "matplotlib (>=3.6.0)", "mecab-python3 (>=1.0.6)", "mypy (==1.13.0)", "nltk (>3.8.1)", "numpy (<2.0)", "onnxruntime (>=1.12.0)", "pesq (>=0.0.4)", "piq (<=0.8.0)", "pycocotools (>2.0.0)", "pystoi (>=0.4.0)", "regex (>=2021.9.24)", "requests (>=2.19.0)", "scipy (>1.0.0)", "sentencepiece (>=0.2.0)", "torch (==2.5.1)", "torch-fidelity (<=0.4.0)", "torchaudio (>=2.0.1)", "torchvision (>=0.15.1)", "tqdm (<4.68.0)", "transformers (>4.4.0)", "transformers (>=4.42.3)", "types-PyYAML", "types-emoji", "types-protobuf", "types-requests", "types-setuptools", "types-six", "types-tabulate"]
+audio = ["gammatone (>=1.0.0)", "librosa (>=0.10.0)", "numpy (<2.0)", "onnxruntime (>=1.12.0)", "pesq (>=0.0.4)", "pystoi (>=0.4.0)", "requests (>=2.19.0)", "torchaudio (>=2.0.1)"]
+detection = ["pycocotools (>2.0.0)", "torchvision (>=0.15.1)"]
+dev = ["PyTDC (==0.4.1)", "SciencePlots (>=2.0.0)", "bert-score (==0.3.13)", "dython (==0.7.6)", "dython (>=0.7.8,<0.8.0)", "fairlearn", "fast-bss-eval (>=0.1.0)", "faster-coco-eval (>=1.6.3)", "gammatone (>=1.0.0)", "huggingface-hub (<0.27)", "ipadic (>=1.0.0)", "jiwer (>=2.3.0)", "kornia (>=0.6.7)", "librosa (>=0.10.0)", "lpips (<=0.1.4)", "matplotlib (>=3.6.0)", "mecab-ko (>=1.0.0,<1.1.0)", "mecab-ko-dic (>=1.0.0)", "mecab-python3 (>=1.0.6)", "mir-eval (>=0.6)", "monai (==1.3.2)", "monai (==1.4.0)", "mypy (==1.13.0)", "netcal (>1.0.0)", "nltk (>3.8.1)", "numpy (<2.0)", "numpy (<2.2.0)", "onnxruntime (>=1.12.0)", "pandas (>1.4.0)", "permetrics (==2.0.0)", "pesq (>=0.0.4)", "piq (<=0.8.0)", "pycocotools (>2.0.0)", "pystoi (>=0.4.0)", "pytorch-msssim (==1.0.0)", "regex (>=2021.9.24)", "requests (>=2.19.0)", "rouge-score (>0.1.0)", "sacrebleu (>=2.3.0)", "scikit-image (>=0.19.0)", "scipy (>1.0.0)", "sentencepiece (>=0.2.0)", "sewar (>=0.4.4)", "statsmodels (>0.13.5)", "torch (==2.5.1)", "torch-complex (<0.5.0)", "torch-fidelity (<=0.4.0)", "torchaudio (>=2.0.1)", "torchvision (>=0.15.1)", "tqdm (<4.68.0)", "transformers (>4.4.0)", "transformers (>=4.42.3)", "types-PyYAML", "types-emoji", "types-protobuf", "types-requests", "types-setuptools", "types-six", "types-tabulate"]
+image = ["scipy (>1.0.0)", "torch-fidelity (<=0.4.0)", "torchvision (>=0.15.1)"]
+multimodal = ["piq (<=0.8.0)", "transformers (>=4.42.3)"]
+text = ["ipadic (>=1.0.0)", "mecab-python3 (>=1.0.6)", "nltk (>3.8.1)", "regex (>=2021.9.24)", "sentencepiece (>=0.2.0)", "tqdm (<4.68.0)", "transformers (>4.4.0)"]
+typing = ["mypy (==1.13.0)", "torch (==2.5.1)", "types-PyYAML", "types-emoji", "types-protobuf", "types-requests", "types-setuptools", "types-six", "types-tabulate"]
+visual = ["SciencePlots (>=2.0.0)", "matplotlib (>=3.6.0)"]
+
 [[package]]
 name = "torchvision"
 version = "0.19.1"
@@ -6956,6 +7037,75 @@ files = [
 docs = ["myst-parser", "pydata-sphinx-theme", "sphinx"]
 test = ["argcomplete (>=3.0.3)", "mypy (>=1.7.0)", "pre-commit", "pytest (>=7.0,<8.2)", "pytest-mock", "pytest-mypy-testing"]

+[[package]]
+name = "transformers"
+version = "4.47.0"
+description = "State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow"
+optional = true
+python-versions = ">=3.9.0"
+files = [
+    {file = "transformers-4.47.0-py3-none-any.whl", hash = "sha256:a8e1bafdaae69abdda3cad638fe392e37c86d2ce0ecfcae11d60abb8f949ff4d"},
+    {file = "transformers-4.47.0.tar.gz", hash = "sha256:f8ead7a5a4f6937bb507e66508e5e002dc5930f7b6122a9259c37b099d0f3b19"},
+]
+
+[package.dependencies]
+filelock = "*"
+huggingface-hub = ">=0.24.0,<1.0"
+numpy = ">=1.17"
+packaging = ">=20.0"
+pyyaml = ">=5.1"
+regex = "!=2019.12.17"
+requests = "*"
+safetensors = ">=0.4.1"
+tokenizers = ">=0.21,<0.22"
+tqdm = ">=4.27"
+
+[package.extras]
+accelerate = ["accelerate (>=0.26.0)"]
+agents = ["Pillow (>=10.0.1,<=15.0)", "accelerate (>=0.26.0)", "datasets (!=2.5.0)", "diffusers", "opencv-python", "sentencepiece (>=0.1.91,!=0.1.92)", "torch"]
+all = ["Pillow (>=10.0.1,<=15.0)", "accelerate (>=0.26.0)", "av (==9.2.0)", "codecarbon (==1.2.0)", "flax (>=0.4.1,<=0.7.0)", "jax (>=0.4.1,<=0.4.13)", "jaxlib (>=0.4.1,<=0.4.13)", "kenlm", "keras-nlp (>=0.3.1,<0.14.0)", "librosa", "onnxconverter-common", "optax (>=0.0.8,<=0.1.4)", "optuna", "phonemizer", "protobuf", "pyctcdecode (>=0.4.0)", "ray[tune] (>=2.7.0)", "scipy (<1.13.0)", "sentencepiece (>=0.1.91,!=0.1.92)", "sigopt", "tensorflow (>2.9,<2.16)", "tensorflow-text (<2.16)", "tf2onnx", "timm (<=1.0.11)", "tokenizers (>=0.21,<0.22)", "torch", "torchaudio", "torchvision"]
+audio = ["kenlm", "librosa", "phonemizer", "pyctcdecode (>=0.4.0)"]
+benchmark = ["optimum-benchmark (>=0.3.0)"]
+codecarbon = ["codecarbon (==1.2.0)"]
+deepspeed = ["accelerate (>=0.26.0)", "deepspeed (>=0.9.3)"]
+deepspeed-testing = ["GitPython (<3.1.19)", "accelerate (>=0.26.0)", "beautifulsoup4", "cookiecutter (==1.7.3)", "datasets (!=2.5.0)", "deepspeed (>=0.9.3)", "dill (<0.3.5)", "evaluate (>=0.2.0)", "faiss-cpu", "nltk (<=3.8.1)", "optuna", "parameterized", "protobuf", "psutil", "pydantic", "pytest (>=7.2.0,<8.0.0)", "pytest-rich", "pytest-timeout", "pytest-xdist", "rjieba", "rouge-score (!=0.0.7,!=0.0.8,!=0.1,!=0.1.1)", "ruff (==0.5.1)", "sacrebleu (>=1.4.12,<2.0.0)", "sacremoses", "sentencepiece (>=0.1.91,!=0.1.92)", "tensorboard", "timeout-decorator"]
+dev = ["GitPython (<3.1.19)", "Pillow (>=10.0.1,<=15.0)", "accelerate (>=0.26.0)", "av (==9.2.0)", "beautifulsoup4", "codecarbon (==1.2.0)", "cookiecutter (==1.7.3)", "datasets (!=2.5.0)", "dill (<0.3.5)", "evaluate (>=0.2.0)", "faiss-cpu", "flax (>=0.4.1,<=0.7.0)", "fugashi (>=1.0)", "ipadic (>=1.0.0,<2.0)", "isort (>=5.5.4)", "jax (>=0.4.1,<=0.4.13)", "jaxlib (>=0.4.1,<=0.4.13)", "kenlm", "keras-nlp (>=0.3.1,<0.14.0)", "libcst", "librosa", "nltk (<=3.8.1)", "onnxconverter-common", "optax (>=0.0.8,<=0.1.4)", "optuna", "parameterized", "phonemizer", "protobuf", "psutil", "pyctcdecode (>=0.4.0)", "pydantic", "pytest (>=7.2.0,<8.0.0)", "pytest-rich", "pytest-timeout", "pytest-xdist", "ray[tune] (>=2.7.0)", "rhoknp (>=1.1.0,<1.3.1)", "rich", "rjieba", "rouge-score (!=0.0.7,!=0.0.8,!=0.1,!=0.1.1)", "ruff (==0.5.1)", "sacrebleu (>=1.4.12,<2.0.0)", "sacremoses", "scikit-learn", "scipy (<1.13.0)", "sentencepiece (>=0.1.91,!=0.1.92)", "sigopt", "sudachidict-core (>=20220729)", "sudachipy (>=0.6.6)", "tensorboard", "tensorflow (>2.9,<2.16)", "tensorflow-text (<2.16)", "tf2onnx", "timeout-decorator", "timm (<=1.0.11)", "tokenizers (>=0.21,<0.22)", "torch", "torchaudio", "torchvision", "unidic (>=1.0.2)", "unidic-lite (>=1.0.7)", "urllib3 (<2.0.0)"]
+dev-tensorflow = ["GitPython (<3.1.19)", "Pillow (>=10.0.1,<=15.0)", "beautifulsoup4", "cookiecutter (==1.7.3)", "datasets (!=2.5.0)", "dill (<0.3.5)", "evaluate (>=0.2.0)", "faiss-cpu", "isort (>=5.5.4)", "kenlm", "keras-nlp (>=0.3.1,<0.14.0)", "libcst", "librosa", "nltk (<=3.8.1)", "onnxconverter-common", "onnxruntime (>=1.4.0)", "onnxruntime-tools (>=1.4.2)", "parameterized", "phonemizer", "protobuf", "psutil", "pyctcdecode (>=0.4.0)", "pydantic", "pytest (>=7.2.0,<8.0.0)", "pytest-rich", "pytest-timeout", "pytest-xdist", "rich", "rjieba", "rouge-score (!=0.0.7,!=0.0.8,!=0.1,!=0.1.1)", "ruff (==0.5.1)", "sacrebleu (>=1.4.12,<2.0.0)", "sacremoses", "scikit-learn", "sentencepiece (>=0.1.91,!=0.1.92)", "tensorboard", "tensorflow (>2.9,<2.16)", "tensorflow-text (<2.16)", "tf2onnx", "timeout-decorator", "tokenizers (>=0.21,<0.22)", "urllib3 (<2.0.0)"]
+dev-torch = ["GitPython (<3.1.19)", "Pillow (>=10.0.1,<=15.0)", "accelerate (>=0.26.0)", "beautifulsoup4", "codecarbon (==1.2.0)", "cookiecutter (==1.7.3)", "datasets (!=2.5.0)", "dill (<0.3.5)", "evaluate (>=0.2.0)", "faiss-cpu", "fugashi (>=1.0)", "ipadic (>=1.0.0,<2.0)", "isort (>=5.5.4)", "kenlm", "libcst", "librosa", "nltk (<=3.8.1)", "onnxruntime (>=1.4.0)", "onnxruntime-tools (>=1.4.2)", "optuna", "parameterized", "phonemizer", "protobuf", "psutil", "pyctcdecode (>=0.4.0)", "pydantic", "pytest (>=7.2.0,<8.0.0)", "pytest-rich", "pytest-timeout", "pytest-xdist", "ray[tune] (>=2.7.0)", "rhoknp (>=1.1.0,<1.3.1)", "rich", "rjieba", "rouge-score (!=0.0.7,!=0.0.8,!=0.1,!=0.1.1)", "ruff (==0.5.1)", "sacrebleu (>=1.4.12,<2.0.0)", "sacremoses", "scikit-learn", "sentencepiece (>=0.1.91,!=0.1.92)", "sigopt", "sudachidict-core (>=20220729)", "sudachipy (>=0.6.6)", "tensorboard", "timeout-decorator", "timm (<=1.0.11)", "tokenizers (>=0.21,<0.22)", "torch", "torchaudio", "torchvision", "unidic (>=1.0.2)", "unidic-lite (>=1.0.7)", "urllib3 (<2.0.0)"]
+flax = ["flax (>=0.4.1,<=0.7.0)", "jax (>=0.4.1,<=0.4.13)", "jaxlib (>=0.4.1,<=0.4.13)", "optax (>=0.0.8,<=0.1.4)", "scipy (<1.13.0)"]
+flax-speech = ["kenlm", "librosa", "phonemizer", "pyctcdecode (>=0.4.0)"]
+ftfy = ["ftfy"]
+integrations = ["optuna", "ray[tune] (>=2.7.0)", "sigopt"]
+ja = ["fugashi (>=1.0)", "ipadic (>=1.0.0,<2.0)", "rhoknp (>=1.1.0,<1.3.1)", "sudachidict-core (>=20220729)", "sudachipy (>=0.6.6)", "unidic (>=1.0.2)", "unidic-lite (>=1.0.7)"]
+modelcreation = ["cookiecutter (==1.7.3)"]
+natten = ["natten (>=0.14.6,<0.15.0)"]
+onnx = ["onnxconverter-common", "onnxruntime (>=1.4.0)", "onnxruntime-tools (>=1.4.2)", "tf2onnx"]
+onnxruntime = ["onnxruntime (>=1.4.0)", "onnxruntime-tools (>=1.4.2)"]
+optuna = ["optuna"]
+quality = ["GitPython (<3.1.19)", "datasets (!=2.5.0)", "isort (>=5.5.4)", "libcst", "rich", "ruff (==0.5.1)", "urllib3 (<2.0.0)"]
+ray = ["ray[tune] (>=2.7.0)"]
+retrieval = ["datasets (!=2.5.0)", "faiss-cpu"]
+ruff = ["ruff (==0.5.1)"]
+sagemaker = ["sagemaker (>=2.31.0)"]
+sentencepiece = ["protobuf", "sentencepiece (>=0.1.91,!=0.1.92)"]
+serving = ["fastapi", "pydantic", "starlette", "uvicorn"]
+sigopt = ["sigopt"]
+sklearn = ["scikit-learn"]
+speech = ["kenlm", "librosa", "phonemizer", "pyctcdecode (>=0.4.0)", "torchaudio"]
+testing = ["GitPython (<3.1.19)", "beautifulsoup4", "cookiecutter (==1.7.3)", "datasets (!=2.5.0)", "dill (<0.3.5)", "evaluate (>=0.2.0)", "faiss-cpu", "nltk (<=3.8.1)", "parameterized", "psutil", "pydantic", "pytest (>=7.2.0,<8.0.0)", "pytest-rich", "pytest-timeout", "pytest-xdist", "rjieba", "rouge-score (!=0.0.7,!=0.0.8,!=0.1,!=0.1.1)", "ruff (==0.5.1)", "sacrebleu (>=1.4.12,<2.0.0)", "sacremoses", "sentencepiece (>=0.1.91,!=0.1.92)", "tensorboard", "timeout-decorator"]
+tf = ["keras-nlp (>=0.3.1,<0.14.0)", "onnxconverter-common", "tensorflow (>2.9,<2.16)", "tensorflow-text (<2.16)", "tf2onnx"]
+tf-cpu = ["keras (>2.9,<2.16)", "keras-nlp (>=0.3.1,<0.14.0)", "onnxconverter-common", "tensorflow-cpu (>2.9,<2.16)", "tensorflow-probability (<0.24)", "tensorflow-text (<2.16)", "tf2onnx"]
+tf-speech = ["kenlm", "librosa", "phonemizer", "pyctcdecode (>=0.4.0)"]
+tiktoken = ["blobfile", "tiktoken"]
+timm = ["timm (<=1.0.11)"]
+tokenizers = ["tokenizers (>=0.21,<0.22)"]
+torch = ["accelerate (>=0.26.0)", "torch"]
+torch-speech = ["kenlm", "librosa", "phonemizer", "pyctcdecode (>=0.4.0)", "torchaudio"]
+torch-vision = ["Pillow (>=10.0.1,<=15.0)", "torchvision"]
+torchhub = ["filelock", "huggingface-hub (>=0.24.0,<1.0)", "importlib-metadata", "numpy (>=1.17)", "packaging (>=20.0)", "protobuf", "regex (!=2019.12.17)", "requests", "sentencepiece (>=0.1.91,!=0.1.92)", "tokenizers (>=0.21,<0.22)", "torch", "tqdm (>=4.27)"]
+video = ["av (==9.2.0)"]
+vision = ["Pillow (>=10.0.1,<=15.0)"]
+
 [[package]]
 name = "transforms3d"
 version = "0.4.2"
@@ -7558,6 +7708,7 @@ dev = ["debugpy", "pre-commit"]
 dora = ["gym-dora"]
 dynamixel = ["dynamixel-sdk", "pynput"]
 feetech = ["feetech-servo-sdk", "pynput"]
+hilserl = ["torchmetrics", "transformers"]
 intelrealsense = ["pyrealsense2"]
 pusht = ["gym-pusht"]
 stretch = ["hello-robot-stretch-body", "pynput", "pyrealsense2", "pyrender"]
@@ -7569,4 +7720,4 @@ xarm = ["gym-xarm"]
 [metadata]
 lock-version = "2.0"
 python-versions = ">=3.10,<3.13"
-content-hash = "41344f0eb2d06d9a378abcd10df8205aa3926ff0a08ac5ab1a0b1bcae7440fd8"
+content-hash = "44c74163e398e8ff16973957f69a47bb09b789e92ac4d8fb3ab268defab96427"
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -71,6 +71,8 @@ pyrender = {git = "https://github.com/mmatl/pyrender.git", markers = "sys_platfo
 hello-robot-stretch-body = {version = ">=0.7.27", markers = "sys_platform == 'linux'", optional = true}
 pyserial = {version = ">=3.5", optional = true}
 jsonlines = ">=4.0.0"
+transformers = {version = ">=4.47.0", optional = true}
+torchmetrics = {version = ">=1.6.0", optional = true}


 [tool.poetry.extras]
@@ -86,6 +88,7 @@ dynamixel = ["dynamixel-sdk", "pynput"]
 feetech = ["feetech-servo-sdk", "pynput"]
 intelrealsense = ["pyrealsense2"]
 stretch = ["hello-robot-stretch-body", "pyrender", "pyrealsense2", "pynput"]
+hilserl = ["transformers", "torchmetrics"]

 [tool.ruff]
 line-length = 110
--- a/tests/conftest.py
+++ b/tests/conftest.py
@@ -14,9 +14,11 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.

+import random
 import traceback

 import pytest
+import torch
 from serial import SerialException

 from lerobot import available_cameras, available_motors, available_robots
@@ -124,3 +126,14 @@ def patch_builtins_input(monkeypatch):
            print(text)

    monkeypatch.setattr("builtins.input", print_text)
+
+
+def pytest_addoption(parser):
+    parser.addoption("--seed", action="store", default="42", help="Set random seed for reproducibility")
+
+
+@pytest.fixture(autouse=True)
+def set_random_seed(request):
+    seed = int(request.config.getoption("--seed"))
+    random.seed(seed)  # Python random
+    torch.manual_seed(seed)  # PyTorch
--- a/tests/policies/hilserl/classifier/check_hiserl_reward_classifier.py
+++ b/tests/policies/hilserl/classifier/check_hiserl_reward_classifier.py
@@ -0,0 +1,244 @@
+#!/usr/bin/env python
+
+# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import logging
+
+import numpy as np
+import torch
+from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss
+from torch.optim import Adam
+from torch.utils.data import DataLoader
+from torchmetrics import AUROC, Accuracy, F1Score, Precision, Recall
+from torchvision.datasets import CIFAR10
+from torchvision.transforms import ToTensor
+
+from lerobot.common.policies.hilserl.classifier.modeling_classifier import Classifier, ClassifierConfig
+
+BATCH_SIZE = 1000
+LR = 0.1
+EPOCH_NUM = 2
+
+if torch.cuda.is_available():
+    DEVICE = torch.device("cuda")
+elif torch.backends.mps.is_available():
+    DEVICE = torch.device("mps")
+else:
+    DEVICE = torch.device("cpu")
+
+
+def train_evaluate_multiclass_classifier():
+    logging.info(
+        f"Start multiclass classifier train eval with {DEVICE} device, batch size {BATCH_SIZE}, learning rate {LR}"
+    )
+    multiclass_config = ClassifierConfig(model_name="microsoft/resnet-18", device=DEVICE, num_classes=10)
+    multiclass_classifier = Classifier(multiclass_config)
+
+    trainset = CIFAR10(root="data", train=True, download=True, transform=ToTensor())
+    testset = CIFAR10(root="data", train=False, download=True, transform=ToTensor())
+
+    trainloader = DataLoader(trainset, batch_size=BATCH_SIZE, shuffle=True)
+    testloader = DataLoader(testset, batch_size=BATCH_SIZE, shuffle=False)
+
+    multiclass_num_classes = 10
+    epoch = 1
+
+    criterion = CrossEntropyLoss()
+    optimizer = Adam(multiclass_classifier.parameters(), lr=LR)
+
+    multiclass_classifier.train()
+
+    logging.info("Start multiclass classifier training")
+
+    # Training loop
+    while epoch < EPOCH_NUM:  # loop over the dataset multiple times
+        for i, data in enumerate(trainloader):
+            inputs, labels = data
+            inputs, labels = inputs.to(DEVICE), labels.to(DEVICE)
+
+            # Zero the parameter gradients
+            optimizer.zero_grad()
+
+            # Forward pass
+            outputs = multiclass_classifier(inputs)
+
+            loss = criterion(outputs.logits, labels)
+            loss.backward()
+            optimizer.step()
+
+            if i % 10 == 0:  # print every 10 mini-batches
+                logging.info(f"[Epoch {epoch}, Batch {i}] loss: {loss.item():.3f}")
+
+        epoch += 1
+
+    print("Multiclass classifier training finished")
+
+    multiclass_classifier.eval()
+
+    test_loss = 0.0
+    test_labels = []
+    test_pridections = []
+    test_probs = []
+
+    with torch.no_grad():
+        for data in testloader:
+            images, labels = data
+            images, labels = images.to(DEVICE), labels.to(DEVICE)
+            outputs = multiclass_classifier(images)
+            loss = criterion(outputs.logits, labels)
+            test_loss += loss.item() * BATCH_SIZE
+
+            _, predicted = torch.max(outputs.logits, 1)
+            test_labels.extend(labels.cpu())
+            test_pridections.extend(predicted.cpu())
+            test_probs.extend(outputs.probabilities.cpu())
+
+    test_loss = test_loss / len(testset)
+
+    logging.info(f"Multiclass classifier test loss {test_loss:.3f}")
+
+    test_labels = torch.stack(test_labels)
+    test_predictions = torch.stack(test_pridections)
+    test_probs = torch.stack(test_probs)
+
+    accuracy = Accuracy(task="multiclass", num_classes=multiclass_num_classes)
+    precision = Precision(task="multiclass", average="weighted", num_classes=multiclass_num_classes)
+    recall = Recall(task="multiclass", average="weighted", num_classes=multiclass_num_classes)
+    f1 = F1Score(task="multiclass", average="weighted", num_classes=multiclass_num_classes)
+    auroc = AUROC(task="multiclass", num_classes=multiclass_num_classes, average="weighted")
+
+    # Calculate metrics
+    acc = accuracy(test_predictions, test_labels)
+    prec = precision(test_predictions, test_labels)
+    rec = recall(test_predictions, test_labels)
+    f1_score = f1(test_predictions, test_labels)
+    auroc_score = auroc(test_probs, test_labels)
+
+    logging.info(f"Accuracy: {acc:.2f}")
+    logging.info(f"Precision: {prec:.2f}")
+    logging.info(f"Recall: {rec:.2f}")
+    logging.info(f"F1 Score: {f1_score:.2f}")
+    logging.info(f"AUROC Score: {auroc_score:.2f}")
+
+
+def train_evaluate_binary_classifier():
+    logging.info(
+        f"Start binary classifier train eval with {DEVICE} device, batch size {BATCH_SIZE}, learning rate {LR}"
+    )
+
+    target_binary_class = 3
+
+    def one_vs_rest(dataset, target_class):
+        new_targets = []
+        for _, label in dataset:
+            new_label = float(1.0) if label == target_class else float(0.0)
+            new_targets.append(new_label)
+
+        dataset.targets = new_targets  # Replace the original labels with the binary ones
+        return dataset
+
+    binary_train_dataset = CIFAR10(root="data", train=True, download=True, transform=ToTensor())
+    binary_test_dataset = CIFAR10(root="data", train=False, download=True, transform=ToTensor())
+
+    # Apply one-vs-rest labeling
+    binary_train_dataset = one_vs_rest(binary_train_dataset, target_binary_class)
+    binary_test_dataset = one_vs_rest(binary_test_dataset, target_binary_class)
+
+    binary_trainloader = DataLoader(binary_train_dataset, batch_size=BATCH_SIZE, shuffle=True)
+    binary_testloader = DataLoader(binary_test_dataset, batch_size=BATCH_SIZE, shuffle=False)
+
+    binary_epoch = 1
+
+    binary_config = ClassifierConfig(model_name="microsoft/resnet-50", device=DEVICE)
+    binary_classifier = Classifier(binary_config)
+
+    class_counts = np.bincount(binary_train_dataset.targets)
+    n = len(binary_train_dataset)
+    w0 = n / (2.0 * class_counts[0])
+    w1 = n / (2.0 * class_counts[1])
+
+    binary_criterion = BCEWithLogitsLoss(pos_weight=torch.tensor(w1 / w0))
+    binary_optimizer = Adam(binary_classifier.parameters(), lr=LR)
+
+    binary_classifier.train()
+
+    logging.info("Start binary classifier training")
+
+    # Training loop
+    while binary_epoch < EPOCH_NUM:  # loop over the dataset multiple times
+        for i, data in enumerate(binary_trainloader):
+            inputs, labels = data
+            inputs, labels = inputs.to(DEVICE), labels.to(torch.float32).to(DEVICE)
+
+            # Zero the parameter gradients
+            binary_optimizer.zero_grad()
+
+            # Forward pass
+            outputs = binary_classifier(inputs)
+            loss = binary_criterion(outputs.logits, labels)
+            loss.backward()
+            binary_optimizer.step()
+
+            if i % 10 == 0:  # print every 10 mini-batches
+                print(f"[Epoch {binary_epoch}, Batch {i}] loss: {loss.item():.3f}")
+        binary_epoch += 1
+
+    logging.info("Binary classifier training finished")
+    logging.info("Start binary classifier evaluation")
+
+    binary_classifier.eval()
+
+    test_loss = 0.0
+    test_labels = []
+    test_pridections = []
+    test_probs = []
+
+    with torch.no_grad():
+        for data in binary_testloader:
+            images, labels = data
+            images, labels = images.to(DEVICE), labels.to(torch.float32).to(DEVICE)
+            outputs = binary_classifier(images)
+            loss = binary_criterion(outputs.logits, labels)
+            test_loss += loss.item() * BATCH_SIZE
+
+            test_labels.extend(labels.cpu())
+            test_pridections.extend(outputs.logits.cpu())
+            test_probs.extend(outputs.probabilities.cpu())
+
+    test_loss = test_loss / len(binary_test_dataset)
+
+    logging.info(f"Binary classifier test loss {test_loss:.3f}")
+
+    test_labels = torch.stack(test_labels)
+    test_predictions = torch.stack(test_pridections)
+    test_probs = torch.stack(test_probs)
+
+    # Calculate metrics
+    acc = Accuracy(task="binary")(test_predictions, test_labels)
+    prec = Precision(task="binary", average="weighted")(test_predictions, test_labels)
+    rec = Recall(task="binary", average="weighted")(test_predictions, test_labels)
+    f1_score = F1Score(task="binary", average="weighted")(test_predictions, test_labels)
+    auroc_score = AUROC(task="binary", average="weighted")(test_probs, test_labels)
+
+    logging.info(f"Accuracy: {acc:.2f}")
+    logging.info(f"Precision: {prec:.2f}")
+    logging.info(f"Recall: {rec:.2f}")
+    logging.info(f"F1 Score: {f1_score:.2f}")
+    logging.info(f"AUROC Score: {auroc_score:.2f}")
+
+
+if __name__ == "__main__":
+    train_evaluate_multiclass_classifier()
+    train_evaluate_binary_classifier()
--- a/tests/policies/hilserl/classifier/test_modelling_classifier.py
+++ b/tests/policies/hilserl/classifier/test_modelling_classifier.py
@@ -0,0 +1,85 @@
+import torch
+
+from lerobot.common.policies.hilserl.classifier.modeling_classifier import (
+    ClassifierConfig,
+    ClassifierOutput,
+)
+from tests.utils import require_package
+
+
+def test_classifier_output():
+    output = ClassifierOutput(
+        logits=torch.tensor([1, 2, 3]), probabilities=torch.tensor([0.1, 0.2, 0.3]), hidden_states=None
+    )
+
+    assert (
+        f"{output}"
+        == "ClassifierOutput(logits=tensor([1, 2, 3]), probabilities=tensor([0.1000, 0.2000, 0.3000]), hidden_states=None)"
+    )
+
+
+@require_package("transformers")
+def test_binary_classifier_with_default_params():
+    from lerobot.common.policies.hilserl.classifier.modeling_classifier import Classifier
+
+    config = ClassifierConfig()
+    classifier = Classifier(config)
+
+    batch_size = 10
+
+    input = torch.rand(batch_size, 3, 224, 224)
+    output = classifier(input)
+
+    assert output is not None
+    assert output.logits.shape == torch.Size([batch_size])
+    assert not torch.isnan(output.logits).any(), "Tensor contains NaN values"
+    assert output.probabilities.shape == torch.Size([batch_size])
+    assert not torch.isnan(output.probabilities).any(), "Tensor contains NaN values"
+    assert output.hidden_states.shape == torch.Size([batch_size, 2048])
+    assert not torch.isnan(output.hidden_states).any(), "Tensor contains NaN values"
+
+
+@require_package("transformers")
+def test_multiclass_classifier():
+    from lerobot.common.policies.hilserl.classifier.modeling_classifier import Classifier
+
+    num_classes = 5
+    config = ClassifierConfig(num_classes=num_classes)
+    classifier = Classifier(config)
+
+    batch_size = 10
+
+    input = torch.rand(batch_size, 3, 224, 224)
+    output = classifier(input)
+
+    assert output is not None
+    assert output.logits.shape == torch.Size([batch_size, num_classes])
+    assert not torch.isnan(output.logits).any(), "Tensor contains NaN values"
+    assert output.probabilities.shape == torch.Size([batch_size, num_classes])
+    assert not torch.isnan(output.probabilities).any(), "Tensor contains NaN values"
+    assert output.hidden_states.shape == torch.Size([batch_size, 2048])
+    assert not torch.isnan(output.hidden_states).any(), "Tensor contains NaN values"
+
+
+@require_package("transformers")
+def test_default_device():
+    from lerobot.common.policies.hilserl.classifier.modeling_classifier import Classifier
+
+    config = ClassifierConfig()
+    assert config.device == "cpu"
+
+    classifier = Classifier(config)
+    for p in classifier.parameters():
+        assert p.device == torch.device("cpu")
+
+
+@require_package("transformers")
+def test_explicit_device_setup():
+    from lerobot.common.policies.hilserl.classifier.modeling_classifier import Classifier
+
+    config = ClassifierConfig(device="meta")
+    assert config.device == "meta"
+
+    classifier = Classifier(config)
+    for p in classifier.parameters():
+        assert p.device == torch.device("meta")
--- a/tests/test_train_hilserl_classifier.py
+++ b/tests/test_train_hilserl_classifier.py
@@ -0,0 +1,304 @@
+import os
+import tempfile
+from pathlib import Path
+from unittest.mock import MagicMock, patch
+
+import pytest
+import torch
+from hydra import compose, initialize_config_dir
+from torch import nn
+from torch.utils.data import Dataset
+
+from lerobot.common.policies.hilserl.classifier.configuration_classifier import ClassifierConfig
+from lerobot.common.policies.hilserl.classifier.modeling_classifier import Classifier
+from lerobot.scripts.train_hilserl_classifier import (
+    create_balanced_sampler,
+    train,
+    train_epoch,
+    validate,
+)
+
+
+class MockDataset(Dataset):
+    def __init__(self, data):
+        self.data = data
+        self.meta = MagicMock()
+        self.meta.stats = {}
+
+    def __getitem__(self, idx):
+        return self.data[idx]
+
+    def __len__(self):
+        return len(self.data)
+
+
+def make_dummy_model():
+    model_config = ClassifierConfig(
+        num_classes=2, model_name="hf-tiny-model-private/tiny-random-ResNetModel", num_cameras=1
+    )
+    model = Classifier(config=model_config)
+    return model
+
+
+def test_create_balanced_sampler():
+    # Mock dataset with imbalanced classes
+    data = [
+        {"label": 0},
+        {"label": 0},
+        {"label": 1},
+        {"label": 0},
+        {"label": 1},
+        {"label": 1},
+        {"label": 1},
+        {"label": 1},
+    ]
+    dataset = MockDataset(data)
+    cfg = MagicMock()
+    cfg.training.label_key = "label"
+
+    sampler = create_balanced_sampler(dataset, cfg)
+
+    # Get weights from the sampler
+    weights = sampler.weights.float()
+
+    # Check that samples have appropriate weights
+    labels = [item["label"] for item in data]
+    class_counts = torch.tensor([labels.count(0), labels.count(1)], dtype=torch.float32)
+    class_weights = 1.0 / class_counts
+    expected_weights = torch.tensor([class_weights[label] for label in labels], dtype=torch.float32)
+
+    # Test that the weights are correct
+    assert torch.allclose(weights, expected_weights)
+
+
+def test_train_epoch():
+    model = make_dummy_model()
+    # Mock components
+    model.train = MagicMock()
+
+    train_loader = [
+        {
+            "image": torch.rand(2, 3, 224, 224),
+            "label": torch.tensor([0.0, 1.0]),
+        }
+    ]
+
+    criterion = nn.BCEWithLogitsLoss()
+    optimizer = MagicMock()
+    grad_scaler = MagicMock()
+    device = torch.device("cpu")
+    logger = MagicMock()
+    step = 0
+    cfg = MagicMock()
+    cfg.training.image_keys = ["image"]
+    cfg.training.label_key = "label"
+    cfg.training.use_amp = False
+
+    # Call the function under test
+    train_epoch(
+        model,
+        train_loader,
+        criterion,
+        optimizer,
+        grad_scaler,
+        device,
+        logger,
+        step,
+        cfg,
+    )
+
+    # Check that model.train() was called
+    model.train.assert_called_once()
+
+    # Check that optimizer.zero_grad() was called
+    optimizer.zero_grad.assert_called()
+
+    # Check that logger.log_dict was called
+    logger.log_dict.assert_called()
+
+
+def test_validate():
+    model = make_dummy_model()
+
+    # Mock components
+    model.eval = MagicMock()
+    val_loader = [
+        {
+            "image": torch.rand(2, 3, 224, 224),
+            "label": torch.tensor([0.0, 1.0]),
+        }
+    ]
+    criterion = nn.BCEWithLogitsLoss()
+    device = torch.device("cpu")
+    logger = MagicMock()
+    cfg = MagicMock()
+    cfg.training.image_keys = ["image"]
+    cfg.training.label_key = "label"
+    cfg.training.use_amp = False
+
+    # Call validate
+    accuracy, eval_info = validate(model, val_loader, criterion, device, logger, cfg)
+
+    # Check that model.eval() was called
+    model.eval.assert_called_once()
+
+    # Check accuracy/eval_info are calculated and of the correct type
+    assert isinstance(accuracy, float)
+    assert isinstance(eval_info, dict)
+
+
+def test_train_epoch_multiple_cameras():
+    model_config = ClassifierConfig(
+        num_classes=2, model_name="hf-tiny-model-private/tiny-random-ResNetModel", num_cameras=2
+    )
+    model = Classifier(config=model_config)
+
+    # Mock components
+    model.train = MagicMock()
+
+    train_loader = [
+        {
+            "image_1": torch.rand(2, 3, 224, 224),
+            "image_2": torch.rand(2, 3, 224, 224),
+            "label": torch.tensor([0.0, 1.0]),
+        }
+    ]
+
+    criterion = nn.BCEWithLogitsLoss()
+    optimizer = MagicMock()
+    grad_scaler = MagicMock()
+    device = torch.device("cpu")
+    logger = MagicMock()
+    step = 0
+    cfg = MagicMock()
+    cfg.training.image_keys = ["image_1", "image_2"]
+    cfg.training.label_key = "label"
+    cfg.training.use_amp = False
+
+    # Call the function under test
+    train_epoch(
+        model,
+        train_loader,
+        criterion,
+        optimizer,
+        grad_scaler,
+        device,
+        logger,
+        step,
+        cfg,
+    )
+
+    # Check that model.train() was called
+    model.train.assert_called_once()
+
+    # Check that optimizer.zero_grad() was called
+    optimizer.zero_grad.assert_called()
+
+    # Check that logger.log_dict was called
+    logger.log_dict.assert_called()
+
+
+@pytest.mark.parametrize("resume", [True, False])
+@patch("lerobot.scripts.train_hilserl_classifier.init_hydra_config")
+@patch("lerobot.scripts.train_hilserl_classifier.Logger.get_last_checkpoint_dir")
+@patch("lerobot.scripts.train_hilserl_classifier.Logger.get_last_pretrained_model_dir")
+@patch("lerobot.scripts.train_hilserl_classifier.Logger")
+@patch("lerobot.scripts.train_hilserl_classifier.LeRobotDataset")
+@patch("lerobot.scripts.train_hilserl_classifier.get_model")
+def test_resume_function(
+    mock_get_model,
+    mock_dataset,
+    mock_logger,
+    mock_get_last_pretrained_model_dir,
+    mock_get_last_checkpoint_dir,
+    mock_init_hydra_config,
+    resume,
+):
+    # Initialize Hydra
+    test_file_dir = os.path.dirname(os.path.abspath(__file__))
+    config_dir = os.path.abspath(os.path.join(test_file_dir, "..", "lerobot", "configs", "policy"))
+    assert os.path.exists(config_dir), f"Config directory does not exist at {config_dir}"
+
+    with initialize_config_dir(config_dir=config_dir, job_name="test_app", version_base="1.2"):
+        cfg = compose(
+            config_name="hilserl_classifier",
+            overrides=[
+                "device=cpu",
+                "seed=42",
+                f"output_dir={tempfile.mkdtemp()}",
+                "wandb.enable=False",
+                f"resume={resume}",
+                "dataset_repo_id=dataset_repo_id",
+                "train_split_proportion=0.8",
+                "training.num_workers=0",
+                "training.batch_size=2",
+                "training.image_keys=[image]",
+                "training.label_key=label",
+                "training.use_amp=False",
+                "training.num_epochs=1",
+                "eval.batch_size=2",
+            ],
+        )
+
+    # Mock the init_hydra_config function to return cfg
+    mock_init_hydra_config.return_value = cfg
+
+    # Mock dataset
+    dataset = MockDataset([{"image": torch.rand(3, 224, 224), "label": i % 2} for i in range(10)])
+    mock_dataset.return_value = dataset
+
+    # Mock checkpoint handling
+    mock_checkpoint_dir = MagicMock(spec=Path)
+    mock_checkpoint_dir.exists.return_value = resume  # Only exists if resuming
+    mock_get_last_checkpoint_dir.return_value = mock_checkpoint_dir
+    mock_get_last_pretrained_model_dir.return_value = Path(tempfile.mkdtemp())
+
+    # Mock logger
+    logger = MagicMock()
+    resumed_step = 1000
+    if resume:
+        logger.load_last_training_state.return_value = resumed_step
+    else:
+        logger.load_last_training_state.return_value = 0
+    mock_logger.return_value = logger
+
+    # Instantiate the model and set make_policy to return it
+    model = make_dummy_model()
+    mock_get_model.return_value = model
+
+    # Call train
+    train(cfg)
+
+    # Check that checkpoint handling methods were called
+    if resume:
+        mock_get_last_checkpoint_dir.assert_called_once_with(Path(cfg.output_dir))
+        mock_get_last_pretrained_model_dir.assert_called_once_with(Path(cfg.output_dir))
+        mock_checkpoint_dir.exists.assert_called_once()
+        logger.load_last_training_state.assert_called_once()
+    else:
+        mock_get_last_checkpoint_dir.assert_not_called()
+        mock_get_last_pretrained_model_dir.assert_not_called()
+        mock_checkpoint_dir.exists.assert_not_called()
+        logger.load_last_training_state.assert_not_called()
+
+    # Collect the steps from logger.log_dict calls
+    train_log_calls = logger.log_dict.call_args_list
+
+    # Extract the steps used in the train logging
+    steps = []
+    for call in train_log_calls:
+        mode = call.kwargs.get("mode", call.args[2] if len(call.args) > 2 else None)
+        if mode == "train":
+            step = call.kwargs.get("step", call.args[1] if len(call.args) > 1 else None)
+            steps.append(step)
+
+    expected_start_step = resumed_step if resume else 0
+
+    # Calculate expected_steps
+    train_size = int(cfg.train_split_proportion * len(dataset))
+    batch_size = cfg.training.batch_size
+    num_batches = (train_size + batch_size - 1) // batch_size
+
+    expected_steps = [expected_start_step + i for i in range(num_batches)]
+
+    assert steps == expected_steps, f"Expected steps {expected_steps}, got {steps}"
Author	SHA1	Message	Date
Michel Aractingi	daa1480a91	nit	2025-01-22 10:26:52 +01:00
Michel Aractingi	71ec721e48	cleaned eval_on_robot.py; readded policy; fixed doc strings	2025-01-22 10:26:52 +01:00
Michel Aractingi	bbb5ba0adf	Extend reward classifier for multiple camera views (#626 )	2025-01-22 10:26:52 +01:00
Eugene Mironov	844bfcf484	[Port HIL_SERL] Final fixes for the Reward Classifier (#598 )	2025-01-22 10:26:52 +01:00
Michel Aractingi	13441f0d98	added temporary fix for missing task_index key in online environment	2025-01-22 10:26:50 +01:00
Michel Aractingi	41b377211c	split encoder for critic and actor	2025-01-22 10:25:52 +01:00
KeWang1017	9ceb68ee90	Refine SAC configuration and policy for enhanced performance - Updated standard deviation parameterization in SACConfig to 'softplus' with defined min and max values for improved stability. - Modified action sampling in SACPolicy to use reparameterized sampling, ensuring better gradient flow and log probability calculations. - Cleaned up log probability calculations in TanhMultivariateNormalDiag for clarity and efficiency. - Increased evaluation frequency in YAML configuration to 50000 for more efficient training cycles. These changes aim to enhance the robustness and performance of the SAC implementation during training and inference.	2025-01-22 10:23:33 +01:00
KeWang1017	d1baa5a82f	trying to get sac running	2025-01-22 10:20:56 +01:00
Michel Aractingi	04da4dd3e3	Added normalization schemes and style checks	2025-01-22 10:19:19 +01:00
Michel Aractingi	b0e2fcdba7	added optimizer and sac to factory.py	2025-01-22 10:17:48 +01:00
Eugene Mironov	1e2a757cd3	[Port Hil-SERL] Add unit tests for the reward classifier & fix imports & check script (#578 )	2025-01-22 10:14:06 +01:00
Michel Aractingi	ab842ba6ae	nit in control_robot.py	2025-01-22 10:06:39 +01:00
Michel Aractingi	94a7221a94	Update lerobot/scripts/train_hilserl_classifier.py Co-authored-by: Yoel <yoel.chornton@gmail.com>	2025-01-22 10:06:39 +01:00
Claudio Coppola	00dadcace0	LerobotDataset pushable to HF from any folder (#563 )	2025-01-22 10:06:39 +01:00
berjaoui	81a2f2958d	Update 7_get_started_with_real_robot.md (#559 )	2025-01-22 10:06:39 +01:00
Michel Aractingi	68b4fb60ad	Control simulated robot with real leader (#514 ) Co-authored-by: Remi <remi.cadene@huggingface.co>	2025-01-22 10:06:39 +01:00
Remi	96b2b62377	Fix missing local_files_only in record/replay (#540 ) Co-authored-by: Simon Alibert <alibert.sim@gmail.com>	2025-01-22 10:06:39 +01:00
Michel Aractingi	b5c98bbfd3	Refactor OpenX (#505 )	2025-01-22 10:06:39 +01:00
Eugene Mironov	58e12cf2e8	Fixup	2025-01-22 10:06:39 +01:00
Michel Aractingi	d8b5fae622	Add human intervention mechanism and eval_robot script to evaluate policy on the robot (#541 ) Co-authored-by: Yoel <yoel.chornton@gmail.com>	2025-01-22 10:06:39 +01:00
Yoel	67ac81d728	Reward classifier and training (#528 ) Co-authored-by: Daniel Ritchie <daniel@brainwavecollective.ai> Co-authored-by: resolver101757 <kelster101757@hotmail.com> Co-authored-by: Jannik Grothusen <56967823+J4nn1K@users.noreply.github.com> Co-authored-by: Remi <re.cadene@gmail.com> Co-authored-by: Michel Aractingi <michel.aractingi@huggingface.co>	2025-01-22 10:06:39 +01:00
Michel Aractingi	b5f1ea3140	nit	2025-01-22 10:06:39 +01:00
AdilZouitine	4d854a1513	Stable version of rlpd + drq	2025-01-22 09:00:16 +00:00
AdilZouitine	87da655eab	Add type annotations and restructure SACConfig class fields	2025-01-21 09:51:12 +00:00
Adil Zouitine	a8fda9c61a	Change SAC policy implementation with configuration and modeling classes	2025-01-17 09:39:04 +01:00
Adil Zouitine	55505ff817	Add rlpd tricks	2025-01-16 11:53:36 +01:00
Adil Zouitine	20d31ab8e0	SAC works	2025-01-16 11:53:27 +01:00
Adil Zouitine	e5b83aab5e	remove breakpoint	2025-01-16 11:52:03 +01:00
Adil Zouitine	a9d5f62304	[WIP] correct sac implementation	2025-01-16 11:51:18 +01:00
Adil Zouitine	72e1ed7058	Add rlpd tricks	2025-01-16 11:42:24 +01:00
Adil Zouitine	d8e67a2609	SAC works	2025-01-16 11:42:24 +01:00
Adil Zouitine	50e12376de	remove breakpoint	2025-01-16 11:42:23 +01:00
Adil Zouitine	73aa6c25f3	[WIP] correct sac implementation	2025-01-16 11:42:14 +01:00