forked from tangger/lerobot
* fix: sharing predicted chunk with user * [pre-commit.ci] pre-commit autoupdate (#1011) Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Revert "[pre-commit.ci] pre-commit autoupdate" (#1025) * fix(ci): Pin draccus (<0.10.0) and torch (<2.7) to fix pipeline (#1022) Co-authored-by: imstevenpmwork <steven.palma@huggingface.co> Co-authored-by: Simon Alibert <75076266+aliberts@users.noreply.github.com> * fix(ci): Pin `torchcodec` (==0.2.1) to fix pipeline temporarly (#1030) * Update tutorial (#1021) Co-authored-by: Simon Alibert <75076266+aliberts@users.noreply.github.com> * Add description motor order SO-101 leader (#1051) * feat(encoding): switching to PyAV for ffmpeg related tasks (#983) * feat(docs): Add new docs build process (#1046) Co-authored-by: Mishig Davaadorj <dmishig@gmail.com> Co-authored-by: Steven Palma <steven.palma@huggingface.co> * Docs: adapt text + fix video code (#1064) * Fix typos (#1070) * docs: minor corrections and clean-up (#1089) * Update 10_use_so100.md; use diff syntax (#944) Co-authored-by: Pepijn <138571049+pkooij@users.noreply.github.com> * Update 12_use_so101.md (#1081) Co-authored-by: Pepijn <138571049+pkooij@users.noreply.github.com> * bug fix for #1071 When --display_data=true, Failed running control_robot. 
(#1073) * Add editable -e for feetech install command (#1133) * Fix: emptying action queue between resets (#1117) * fix: typos and grammar (#1148) * Update README.md (#1160) * Update README.md (#1163) * [Fix] Unpin torch beyond 2.6.0 & torchcodec beyond 0.2.1 (#1127) * (hotfix): nightly CI by clipping pymunk version below 7.0.0 (#1182) * [pre-commit.ci] pre-commit autoupdate (#1048) Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Simon Alibert <simon.alibert@huggingface.co> * Add SmolVLA (#1175) Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: fracapuano <francesco.capuano@huggingface.co> Co-authored-by: Steven Palma <imstevenpmwork@ieee.org> Co-authored-by: Dana Aubakirova <118912928+danaaubakirova@users.noreply.github.com> Co-authored-by: Remi <remi.cadene@huggingface.co> * Fix SmolVLA loss not sent to wandb (#1198) * Hardware API redesign (#777) Co-authored-by: Pepijn <138571049+pkooij@users.noreply.github.com> Co-authored-by: Steven Palma <imstevenpmwork@ieee.org> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Steven Palma <steven.palma@huggingface.co> Co-authored-by: Adil Zouitine <adilzouitinegm@gmail.com> Co-authored-by: Pepijn <pepijn@huggingface.co> * fix(smolvla): update record.py, fix populate_queues and remove unused dependencies (#1208) * replaced OBS_ROBOT with OBS_STATE constant (#1211) * Fix test_teleoperate (#1216) * Fix LeKiwi example (#1217) * Fix smolVLA dependencies (#1218) * fix(pyserial): adding pyserial dependency to global ones (#1219) * Update SmolVLA README.md (#1228) * Fix unable to set camera width/height to non-default (#1225) * Update tutorial link (#1250) * update KochFollower.get_observation() so it returns same observation structure as SO101 (#1248) Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * [pre-commit.ci] 
pre-commit autoupdate (#1185) Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Simon Alibert <75076266+aliberts@users.noreply.github.com> * Proposal for fix for enter_pressed on Windows (#1230) Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Simon Alibert <75076266+aliberts@users.noreply.github.com> * fix: update pi0 dependency version constraint (#1247) Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Match motor names with ids lekiwi (#1261) * fix issues: checkpoints keys mismatch and 'task' tokenisation in smolvla (#1256) Co-authored-by: danaaubakirova <d.aubakirova@alumni.edu.kz> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Simon Alibert <75076266+aliberts@users.noreply.github.com> Co-authored-by: Simon Alibert <simon.alibert@huggingface.co> * fix(docs): update realsense documentation (#1268) * Use HF Papers (#1120) * Skip normalization parameters in load_smolvla (#1274) * fix(record): no teleop needed when running with policy (#1284) * Port HIL SERL (#644) Co-authored-by: Michel Aractingi <michel.aractingi@huggingface.co> Co-authored-by: Eugene Mironov <helper2424@gmail.com> Co-authored-by: s1lent4gnt <kmeftah.khalil@gmail.com> Co-authored-by: Ke Wang <superwk1017@gmail.com> Co-authored-by: Yoel Chornton <yoel.chornton@gmail.com> Co-authored-by: imstevenpmwork <steven.palma@huggingface.co> Co-authored-by: Simon Alibert <simon.alibert@huggingface.co> * fix(docs): SmolVLA fine-tuning getting started (#1201) Co-authored-by: Pepijn <138571049+pkooij@users.noreply.github.com> Co-authored-by: danaaubakirova <d.aubakirova@alumni.edu.kz> Co-authored-by: Simon Alibert <75076266+aliberts@users.noreply.github.com> Co-authored-by: Francesco Capuano <francesco_capuano@aol.com> Co-authored-by: Steven Palma <steven.palma@huggingface.co> * chore(teleop): print 
calibration path saved (#1286) * chore(dependencies): add gamepad support with pygame and hidapi (#1287) * Robot integration tutorial (#1285) * fix(docs): update send_feedback docstrings * Add sim tutorial, fix lekiwi motor config, add notebook links (#1275) Co-authored-by: AdilZouitine <adilzouitinegm@gmail.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Michel Aractingi <michel.aractingi@huggingface.co> Co-authored-by: s1lent4gnt <kmeftah.khalil@gmail.com> Co-authored-by: Michel Aractingi <michel.aractingi@gmail.com> Co-authored-by: Eugene Mironov <helper2424@gmail.com> Co-authored-by: imstevenpmwork <steven.palma@huggingface.co> Co-authored-by: Simon Alibert <75076266+aliberts@users.noreply.github.com> Co-authored-by: Steven Palma <imstevenpmwork@ieee.org> * Fixes on robot integration tutorial (#1290) * Add keyboard teleop device to control the end effector robot (#1289) * Improve type hints (#1293) * fix(record): no teleop arg in reset environment (#1294) * `learner.py` import so101_leader instead of so100 (#1295) Co-authored-by: Adil Zouitine <adilzouitinegm@gmail.com> * Fixing `PI0` Policy (#1297) * `gym_manipulator.py` Remove None value action_intervention of BaseLeaderTeleoperator (#1299) * (chore): incorrect resume parameter in recording documentation (#1301) * Update lekiwi.mdx (#1229) * bump `pi0` and `hil` transformers version (#1298) * docs: fix imitation learning robots docs command (#1308) * fix(benchmarks): remove .numpy() from frame in benchmark script (#1354) * add smolvla to the supported policies to run tests (: * add: chunk-level access for the policy * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add: smolvla in availables * remove: smolvla from library supported policies * fix: change env for training, xarm is broken as of now * add: predict_action_chunk to all supported policies * fix: add robot type constants * add: 
predict action chunk in base policy class * restore original Makefile * fix: minor * fix: dict keys come from lerobot/constants * fix: improve act encapsulation, properly supporting temporal ensembling * fix: smolvla action chunking * fix: very minor, but very annoying * fix: minor * fix minor naming Co-authored-by: Steven Palma <imstevenpmwork@ieee.org> Signed-off-by: Francesco Capuano <74058581+fracapuano@users.noreply.github.com> * fix: refactoring inference for single actions and chunks into different components * fix: minor * fix: temporal ensembling * fix: moving populate queues out of modular component for batch preparation * fix: minor for CI * fix: smovla debug * fix: reward classifier, maybe the last policy lacking? --------- Signed-off-by: Francesco Capuano <74058581+fracapuano@users.noreply.github.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Simon Alibert <75076266+aliberts@users.noreply.github.com> Co-authored-by: Adil Zouitine <adilzouitinegm@gmail.com> Co-authored-by: imstevenpmwork <steven.palma@huggingface.co> Co-authored-by: Pepijn <138571049+pkooij@users.noreply.github.com> Co-authored-by: Caroline Pascal <caroline8.pascal@gmail.com> Co-authored-by: Mishig Davaadorj <dmishig@gmail.com> Co-authored-by: omahs <73983677+omahs@users.noreply.github.com> Co-authored-by: CharlesCNorton <135471798+CharlesCNorton@users.noreply.github.com> Co-authored-by: masato-ka <jp6uzv@gmail.com> Co-authored-by: Ragnar <rodiondenmark@gmail.com> Co-authored-by: mshukor <mustafa.shukor97@gmail.com> Co-authored-by: Simon Alibert <simon.alibert@huggingface.co> Co-authored-by: Steven Palma <imstevenpmwork@ieee.org> Co-authored-by: Dana Aubakirova <118912928+danaaubakirova@users.noreply.github.com> Co-authored-by: Remi <remi.cadene@huggingface.co> Co-authored-by: Ben Zhang <5977478+ben-z@users.noreply.github.com> Co-authored-by: Pepijn <pepijn@huggingface.co> Co-authored-by: Dhruva 
<51377003+utterwqlnut@users.noreply.github.com> Co-authored-by: Daisuke Sato <tiryoh@gmail.com> Co-authored-by: Sarunas Kalade <sarunas.kalade@amd.com> Co-authored-by: koenvanwijk <koenvanwijk@users.noreply.github.com> Co-authored-by: Yushun Xiang <73413365+YushunXiang@users.noreply.github.com> Co-authored-by: danaaubakirova <d.aubakirova@alumni.edu.kz> Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Co-authored-by: Michel Aractingi <michel.aractingi@huggingface.co> Co-authored-by: Eugene Mironov <helper2424@gmail.com> Co-authored-by: s1lent4gnt <kmeftah.khalil@gmail.com> Co-authored-by: Ke Wang <superwk1017@gmail.com> Co-authored-by: Yoel Chornton <yoel.chornton@gmail.com> Co-authored-by: Michel Aractingi <michel.aractingi@gmail.com> Co-authored-by: tidely <43219534+tidely@users.noreply.github.com> Co-authored-by: David <17435126+DavidLMS@users.noreply.github.com>
324 lines
12 KiB
Python
#!/usr/bin/env python

# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import logging

import torch
from torch import Tensor, nn

from lerobot.common.constants import OBS_IMAGE, REWARD
from lerobot.common.policies.normalize import Normalize, Unnormalize
from lerobot.common.policies.pretrained import PreTrainedPolicy
from lerobot.common.policies.sac.reward_model.configuration_classifier import RewardClassifierConfig


class ClassifierOutput:
    """Wrapper for classifier outputs with additional metadata."""

    def __init__(
        self,
        logits: Tensor,
        probabilities: Tensor | None = None,
        hidden_states: Tensor | None = None,
    ):
        self.logits = logits
        self.probabilities = probabilities
        self.hidden_states = hidden_states

    def __repr__(self):
        return (
            f"ClassifierOutput(logits={self.logits}, "
            f"probabilities={self.probabilities}, "
            f"hidden_states={self.hidden_states})"
        )


class SpatialLearnedEmbeddings(nn.Module):
    def __init__(self, height, width, channel, num_features=8):
        """
        PyTorch implementation of learned spatial embeddings

        Args:
            height: Spatial height of input features
            width: Spatial width of input features
            channel: Number of input channels
            num_features: Number of output embedding dimensions
        """
        super().__init__()
        self.height = height
        self.width = width
        self.channel = channel
        self.num_features = num_features

        # Kernel layout is [C, H, W, F], so inputs must be channel-first.
        self.kernel = nn.Parameter(torch.empty(channel, height, width, num_features))

        nn.init.kaiming_normal_(self.kernel, mode="fan_in", nonlinearity="linear")

    def forward(self, features):
        """
        Forward pass for spatial embedding

        Args:
            features: Encoder output whose `last_hidden_state` has shape
                [B, C, H, W], or [C, H, W] if no batch
        Returns:
            Output tensor of shape [B, C*F] or [C*F] if no batch
        """

        features = features.last_hidden_state

        original_shape = features.shape
        if features.dim() == 3:
            features = features.unsqueeze(0)  # Add batch dim

        features_expanded = features.unsqueeze(-1)  # [B, C, H, W, 1]
        kernel_expanded = self.kernel.unsqueeze(0)  # [1, C, H, W, F]

        # Element-wise multiplication and spatial reduction
        output = (features_expanded * kernel_expanded).sum(dim=(2, 3))  # Sum H,W -> [B, C, F]

        # Reshape to combine channel and feature dimensions
        output = output.view(output.size(0), -1)  # [B, C*F]

        # Remove batch dim
        if len(original_shape) == 3:
            output = output.squeeze(0)

        return output


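The spatial reduction in `SpatialLearnedEmbeddings.forward` can be read as a per-channel weighted sum over the H x W grid, repeated once per embedding feature. The sketch below spells that out with plain Python loops for a single unbatched feature map. The helper name `spatial_embed` and the toy values are hypothetical and only illustrate the indexing; the module itself does the same computation with tensor broadcasting.

```python
def spatial_embed(features, kernel):
    """Pool one [C, H, W] feature map with a [C, H, W, F] kernel into C*F values."""
    channels = len(features)
    height = len(features[0])
    width = len(features[0][0])
    num_features = len(kernel[0][0][0])

    output = []
    for c in range(channels):
        for f in range(num_features):
            total = 0.0
            # Weighted sum over the spatial grid for channel c, feature f.
            for h in range(height):
                for w in range(width):
                    total += features[c][h][w] * kernel[c][h][w][f]
            output.append(total)
    # Flattened channel-major, matching a [C, F] -> [C*F] reshape: index = c * F + f
    return output


# Hypothetical toy input: 2 channels, a 2x2 grid, 3 embedding features.
features = [[[1.0, 2.0], [3.0, 4.0]], [[0.5, 0.5], [0.5, 0.5]]]
kernel = [[[[1.0, 0.0, -1.0] for _ in range(2)] for _ in range(2)] for _ in range(2)]

embedding = spatial_embed(features, kernel)
print(embedding)
```

Each channel contributes F consecutive entries, which is why the linear layer that follows the embedding takes `feature_dim * image_embedding_pooling_dim` inputs.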
class Classifier(PreTrainedPolicy):
    """Image classifier built on top of a pre-trained encoder."""

    name = "reward_classifier"
    config_class = RewardClassifierConfig

    def __init__(
        self,
        config: RewardClassifierConfig,
        dataset_stats: dict[str, dict[str, Tensor]] | None = None,
    ):
        from transformers import AutoModel

        super().__init__(config)
        self.config = config

        # Initialize normalization (standardized with the policy framework)
        self.normalize_inputs = Normalize(config.input_features, config.normalization_mapping, dataset_stats)
        self.normalize_targets = Normalize(
            config.output_features, config.normalization_mapping, dataset_stats
        )
        self.unnormalize_outputs = Unnormalize(
            config.output_features, config.normalization_mapping, dataset_stats
        )

        # Set up encoder
        encoder = AutoModel.from_pretrained(self.config.model_name, trust_remote_code=True)
        # Extract vision model if we're given a multimodal model
        if hasattr(encoder, "vision_model"):
            logging.info("Multimodal model detected - using vision encoder only")
            self.encoder = encoder.vision_model
            self.vision_config = encoder.config.vision_config
        else:
            self.encoder = encoder
            self.vision_config = getattr(encoder, "config", None)

        # Model type from config
        self.is_cnn = self.config.model_type == "cnn"

        # For CNNs, initialize backbone
        if self.is_cnn:
            self._setup_cnn_backbone()

        self._freeze_encoder()

        # Extract image keys from input_features
        self.image_keys = [
            key.replace(".", "_") for key in config.input_features if key.startswith(OBS_IMAGE)
        ]

        if self.is_cnn:
            self.encoders = nn.ModuleDict()
            for image_key in self.image_keys:
                encoder = self._create_single_encoder()
                self.encoders[image_key] = encoder

        self._build_classifier_head()

    def _setup_cnn_backbone(self):
        """Set up CNN encoder"""
        if hasattr(self.encoder, "fc"):
            self.feature_dim = self.encoder.fc.in_features
            self.encoder = nn.Sequential(*list(self.encoder.children())[:-1])
        elif hasattr(self.encoder.config, "hidden_sizes"):
            self.feature_dim = self.encoder.config.hidden_sizes[-1]  # Last channel dimension
        else:
            raise ValueError("Unsupported CNN architecture")

    def _freeze_encoder(self) -> None:
        """Freeze the encoder parameters."""
        for param in self.encoder.parameters():
            param.requires_grad = False

    def _create_single_encoder(self):
        """Build the per-camera encoder: frozen backbone, spatial embedding, and projection."""
        encoder = nn.Sequential(
            self.encoder,
            # The embedding kernel is fixed at 4x4, so the backbone's final
            # feature map is expected to be 4x4.
            SpatialLearnedEmbeddings(
                height=4,
                width=4,
                channel=self.feature_dim,
                num_features=self.config.image_embedding_pooling_dim,
            ),
            nn.Dropout(self.config.dropout_rate),
            nn.Linear(self.feature_dim * self.config.image_embedding_pooling_dim, self.config.latent_dim),
            nn.LayerNorm(self.config.latent_dim),
            nn.Tanh(),
        )

        return encoder

    def _build_classifier_head(self) -> None:
        """Initialize the classifier head architecture."""
        # Get input dimension based on model type
        if self.is_cnn:
            input_dim = self.config.latent_dim
        else:  # Transformer models
            if hasattr(self.encoder.config, "hidden_size"):
                input_dim = self.encoder.config.hidden_size
            else:
                raise ValueError("Unsupported transformer architecture since hidden_size is not found")

        self.classifier_head = nn.Sequential(
            nn.Linear(input_dim * self.config.num_cameras, self.config.hidden_dim),
            nn.Dropout(self.config.dropout_rate),
            nn.LayerNorm(self.config.hidden_dim),
            nn.ReLU(),
            nn.Linear(
                self.config.hidden_dim,
                1 if self.config.num_classes == 2 else self.config.num_classes,
            ),
        )

    def _get_encoder_output(self, x: torch.Tensor, image_key: str) -> torch.Tensor:
        """Extract the appropriate output from the encoder."""
        with torch.no_grad():
            if self.is_cnn:
                # The HF ResNet applies pooling internally
                outputs = self.encoders[image_key](x)
                return outputs
            else:  # Transformer models
                outputs = self.encoder(x)
                return outputs.last_hidden_state[:, 0, :]

    def extract_images_and_labels(self, batch: dict[str, Tensor]) -> tuple[list, Tensor]:
        """Extract image tensors and label tensors from batch."""
        # Check for both OBS_IMAGE and OBS_IMAGES prefixes
        images = [batch[key] for key in self.config.input_features if key.startswith(OBS_IMAGE)]
        labels = batch[REWARD]

        return images, labels

    def predict(self, xs: list) -> ClassifierOutput:
        """Forward pass of the classifier for inference."""
        encoder_outputs = torch.hstack(
            [self._get_encoder_output(x, img_key) for x, img_key in zip(xs, self.image_keys, strict=True)]
        )
        logits = self.classifier_head(encoder_outputs)

        if self.config.num_classes == 2:
            logits = logits.squeeze(-1)
            probabilities = torch.sigmoid(logits)
        else:
            probabilities = torch.softmax(logits, dim=-1)

        return ClassifierOutput(logits=logits, probabilities=probabilities, hidden_states=encoder_outputs)

    def forward(self, batch: dict[str, Tensor]) -> tuple[Tensor, dict[str, Tensor]]:
        """Standard forward pass for training compatible with train.py."""
        # Normalize inputs if needed
        batch = self.normalize_inputs(batch)
        batch = self.normalize_targets(batch)

        # Extract images and labels
        images, labels = self.extract_images_and_labels(batch)

        # Get predictions
        outputs = self.predict(images)

        # Calculate loss
        if self.config.num_classes == 2:
            # Binary classification
            loss = nn.functional.binary_cross_entropy_with_logits(outputs.logits, labels)
            predictions = (torch.sigmoid(outputs.logits) > 0.5).float()
        else:
            # Multi-class classification
            loss = nn.functional.cross_entropy(outputs.logits, labels.long())
            predictions = torch.argmax(outputs.logits, dim=1)

        # Calculate accuracy for logging
        correct = (predictions == labels).sum().item()
        total = labels.size(0)
        accuracy = 100 * correct / total

        # Return loss and metrics for logging
        output_dict = {
            "accuracy": accuracy,
            "correct": correct,
            "total": total,
        }

        return loss, output_dict

    def predict_reward(self, batch, threshold=0.5):
        """Eval method. Returns the predicted reward, using `threshold` as the decision threshold."""
        # Normalize the batch the same way as in forward()
        batch = self.normalize_inputs(batch)
        batch = self.normalize_targets(batch)

        # Extract images from batch dict (the OBS_IMAGE prefix check also covers OBS_IMAGES keys)
        images = [batch[key] for key in self.config.input_features if key.startswith(OBS_IMAGE)]

        if self.config.num_classes == 2:
            probs = self.predict(images).probabilities
            logging.debug(f"Predicted reward images: {probs}")
            return (probs > threshold).float()
        else:
            return torch.argmax(self.predict(images).probabilities, dim=1)

    def get_optim_params(self):
        """Return optimizer parameters for the policy."""
        return self.parameters()

    def select_action(self, batch: dict[str, Tensor]) -> Tensor:
        """
        This method is required by PreTrainedPolicy but not used for reward classifiers.
        The reward classifier is not an actor and does not select actions.
        """
        raise NotImplementedError("Reward classifiers do not select actions")

    def predict_action_chunk(self, batch: dict[str, Tensor]) -> Tensor:
        """
        This method is required by PreTrainedPolicy but not used for reward classifiers.
        The reward classifier is not an actor and does not produce action chunks.
        """
        raise NotImplementedError("Reward classifiers do not predict action chunks")

    def reset(self):
        """
        This method is required by PreTrainedPolicy but not used for reward classifiers.
        The reward classifier is not an actor, so there is nothing to reset here.
        """
        pass
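The classifier head emits a single logit when `num_classes == 2` and one logit per class otherwise, so `predict_reward` uses two different decision rules. The following is a minimal pure-Python sketch of those two rules for one sample. The function names `sigmoid`, `softmax`, and `decide` are hypothetical helpers written for this illustration and are not part of the module, which does the same thing on batched tensors.

```python
import math


def sigmoid(logit):
    """Map a single logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def decide(logits, num_classes, threshold=0.5):
    """Binary: threshold the sigmoid of one logit. Multi-class: argmax of softmax."""
    if num_classes == 2:
        return 1.0 if sigmoid(logits[0]) > threshold else 0.0
    probabilities = softmax(logits)
    return max(range(num_classes), key=probabilities.__getitem__)


# sigmoid(2.0) is roughly 0.88, so it passes the default threshold but not 0.9.
default_reward = decide([2.0], num_classes=2)
strict_reward = decide([2.0], num_classes=2, threshold=0.9)
multi_class = decide([0.1, 1.5, -0.3], num_classes=3)
print(default_reward, strict_reward, multi_class)
```

Raising the threshold in the binary case trades false positives for false negatives, which is why `predict_reward` takes it as an argument. The multi-class branch takes an argmax and ignores the threshold.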