Improve video benchmark (#282)

Co-authored-by: Alexander Soare <alexander.soare159@gmail.com>
Co-authored-by: Remi <re.cadene@gmail.com>
This commit is contained in:
Simon Alibert
2024-07-09 20:20:25 +02:00
committed by GitHub
parent cc2f6e7404
commit e410e5d711
11 changed files with 985 additions and 772 deletions

View File

@@ -1,334 +0,0 @@
# Video benchmark
## Questions
What is the optimal trade-off between:
- maximizing loading time with random access,
- minimizing memory space on disk,
- maximizing success rate of policies?
How to encode videos?
- How much compression (`-crf`)? Low compression with `0`, normal compression with `20` or extreme with `56`?
- What pixel format to use (`-pix_fmt`)? `yuv444p` or `yuv420p`?
- How many key frames (`-g`)? A key frame every `10` frames?
How to decode videos?
- Which `decoder`? `torchvision`, `torchaudio`, `ffmpegio`, `decord`, or `nvc`?
## Metrics
**Percentage of data compression (higher is better)**
`compression_factor` is the ratio of the memory space on disk taken by the original images to encode, to the memory space taken by the encoded video. For instance, `compression_factor=4` means that the video takes 4 times less memory space on disk compared to the original images.
**Percentage of loading time (higher is better)**
`load_time_factor` is the ratio of the time it takes to load original images at given timestamps, to the time it takes to decode the exact same frames from the video. Higher is better. For instance, `load_time_factor=0.5` means that decoding from video is 2 times slower than loading the original images.
**Average L2 error per pixel (lower is better)**
`avg_per_pixel_l2_error` is the average L2 error between each decoded frame and its corresponding original image over all requested timestamps, and also divided by the number of pixels in the image to be comparable when switching to different image sizes.
**Loss of a pretrained policy (higher is better)** (not available)
`loss_pretrained` is the result of evaluating with the selected encoding/decoding settings a policy pretrained on original images. It is easier to understand than `avg_l2_error`.
**Success rate after retraining (higher is better)** (not available)
`success_rate` is the result of training and evaluating a policy with the selected encoding/decoding settings. It is the most difficult metric to get but also the very best.
## Variables
**Image content**
We don't expect the same optimal settings for a dataset of images from a simulation, or from real-world in an appartment, or in a factory, or outdoor, etc. Hence, we run this benchmark on two datasets: `pusht` (simulation) and `umi` (real-world outdoor).
**Requested timestamps**
In this benchmark, we focus on the loading time of random access, so we are not interested in sequentially loading all frames of a video like in a movie. However, the number of consecutive timestamps requested and their spacing can greatly affect the `load_time_factor`. In fact, it is expected to get faster loading time by decoding a large number of consecutive frames from a video, than to load the same data from individual images. To reflect our robotics use case, we consider a few settings:
- `single_frame`: 1 frame,
- `2_frames`: 2 consecutive frames (e.g. `[t, t + 1 / fps]`),
- `2_frames_4_space`: 2 consecutive frames with 4 frames of spacing (e.g `[t, t + 4 / fps]`),
**Data augmentations**
We might revisit this benchmark and find better settings if we train our policies with various data augmentations to make them more robust (e.g. robust to color changes, compression, etc.).
## Results
**`decoder`**
| repo_id | decoder | load_time_factor | avg_per_pixel_l2_error |
| --- | --- | --- | --- |
| lerobot/pusht | <span style="color: #32CD32;">torchvision</span> | 0.166 | 0.0000119 |
| lerobot/pusht | ffmpegio | 0.009 | 0.0001182 |
| lerobot/pusht | torchaudio | 0.138 | 0.0000359 |
| lerobot/umi_cup_in_the_wild | <span style="color: #32CD32;">torchvision</span> | 0.174 | 0.0000174 |
| lerobot/umi_cup_in_the_wild | ffmpegio | 0.010 | 0.0000735 |
| lerobot/umi_cup_in_the_wild | torchaudio | 0.154 | 0.0000340 |
### `1_frame`
**`pix_fmt`**
| repo_id | pix_fmt | compression_factor | load_time_factor | avg_per_pixel_l2_error |
| --- | --- | --- | --- | --- |
| lerobot/pusht | yuv420p | 3.788 | 0.224 | 0.0000760 |
| lerobot/pusht | yuv444p | 3.646 | 0.185 | 0.0000443 |
| lerobot/umi_cup_in_the_wild | yuv420p | 14.391 | 0.388 | 0.0000469 |
| lerobot/umi_cup_in_the_wild | yuv444p | 14.932 | 0.329 | 0.0000397 |
**`g`**
| repo_id | g | compression_factor | load_time_factor | avg_per_pixel_l2_error |
| --- | --- | --- | --- | --- |
| lerobot/pusht | 1 | 2.543 | 0.204 | 0.0000556 |
| lerobot/pusht | 2 | 3.646 | 0.182 | 0.0000443 |
| lerobot/pusht | 3 | 4.431 | 0.174 | 0.0000450 |
| lerobot/pusht | 4 | 5.103 | 0.163 | 0.0000448 |
| lerobot/pusht | 5 | 5.625 | 0.163 | 0.0000436 |
| lerobot/pusht | 6 | 5.974 | 0.155 | 0.0000427 |
| lerobot/pusht | 10 | 6.814 | 0.130 | 0.0000410 |
| lerobot/pusht | 15 | 7.431 | 0.105 | 0.0000406 |
| lerobot/pusht | 20 | 7.662 | 0.097 | 0.0000400 |
| lerobot/pusht | 40 | 8.163 | 0.061 | 0.0000405 |
| lerobot/pusht | 100 | 8.761 | 0.039 | 0.0000422 |
| lerobot/pusht | None | 8.909 | 0.024 | 0.0000431 |
| lerobot/umi_cup_in_the_wild | 1 | 14.411 | 0.444 | 0.0000601 |
| lerobot/umi_cup_in_the_wild | 2 | 14.932 | 0.345 | 0.0000397 |
| lerobot/umi_cup_in_the_wild | 3 | 20.174 | 0.282 | 0.0000416 |
| lerobot/umi_cup_in_the_wild | 4 | 24.889 | 0.271 | 0.0000415 |
| lerobot/umi_cup_in_the_wild | 5 | 28.825 | 0.260 | 0.0000415 |
| lerobot/umi_cup_in_the_wild | 6 | 31.635 | 0.249 | 0.0000415 |
| lerobot/umi_cup_in_the_wild | 10 | 39.418 | 0.195 | 0.0000399 |
| lerobot/umi_cup_in_the_wild | 15 | 44.577 | 0.169 | 0.0000394 |
| lerobot/umi_cup_in_the_wild | 20 | 47.907 | 0.140 | 0.0000390 |
| lerobot/umi_cup_in_the_wild | 40 | 52.554 | 0.096 | 0.0000384 |
| lerobot/umi_cup_in_the_wild | 100 | 58.241 | 0.046 | 0.0000390 |
| lerobot/umi_cup_in_the_wild | None | 60.530 | 0.022 | 0.0000400 |
**`crf`**
| repo_id | crf | compression_factor | load_time_factor | avg_per_pixel_l2_error |
| --- | --- | --- | --- | --- |
| lerobot/pusht | 0 | 1.699 | 0.175 | 0.0000035 |
| lerobot/pusht | 5 | 1.409 | 0.181 | 0.0000080 |
| lerobot/pusht | 10 | 1.842 | 0.172 | 0.0000123 |
| lerobot/pusht | 15 | 2.322 | 0.187 | 0.0000211 |
| lerobot/pusht | 20 | 3.050 | 0.181 | 0.0000346 |
| lerobot/pusht | None | 3.646 | 0.189 | 0.0000443 |
| lerobot/pusht | 25 | 3.969 | 0.186 | 0.0000521 |
| lerobot/pusht | 30 | 5.687 | 0.184 | 0.0000850 |
| lerobot/pusht | 40 | 10.818 | 0.193 | 0.0001726 |
| lerobot/pusht | 50 | 18.185 | 0.183 | 0.0002606 |
| lerobot/umi_cup_in_the_wild | 0 | 1.918 | 0.165 | 0.0000056 |
| lerobot/umi_cup_in_the_wild | 5 | 3.207 | 0.171 | 0.0000111 |
| lerobot/umi_cup_in_the_wild | 10 | 4.818 | 0.212 | 0.0000153 |
| lerobot/umi_cup_in_the_wild | 15 | 7.329 | 0.261 | 0.0000218 |
| lerobot/umi_cup_in_the_wild | 20 | 11.361 | 0.312 | 0.0000317 |
| lerobot/umi_cup_in_the_wild | None | 14.932 | 0.339 | 0.0000397 |
| lerobot/umi_cup_in_the_wild | 25 | 17.741 | 0.297 | 0.0000452 |
| lerobot/umi_cup_in_the_wild | 30 | 27.983 | 0.406 | 0.0000629 |
| lerobot/umi_cup_in_the_wild | 40 | 82.449 | 0.468 | 0.0001184 |
| lerobot/umi_cup_in_the_wild | 50 | 186.145 | 0.515 | 0.0001879 |
**best**
| repo_id | compression_factor | load_time_factor | avg_per_pixel_l2_error |
| --- | --- | --- | --- |
| lerobot/pusht | 3.646 | 0.188 | 0.0000443 |
| lerobot/umi_cup_in_the_wild | 14.932 | 0.339 | 0.0000397 |
### `2_frames`
**`pix_fmt`**
| repo_id | pix_fmt | compression_factor | load_time_factor | avg_per_pixel_l2_error |
| --- | --- | --- | --- | --- |
| lerobot/pusht | yuv420p | 3.788 | 0.314 | 0.0000799 |
| lerobot/pusht | yuv444p | 3.646 | 0.303 | 0.0000496 |
| lerobot/umi_cup_in_the_wild | yuv420p | 14.391 | 0.642 | 0.0000503 |
| lerobot/umi_cup_in_the_wild | yuv444p | 14.932 | 0.529 | 0.0000436 |
**`g`**
| repo_id | g | compression_factor | load_time_factor | avg_per_pixel_l2_error |
| --- | --- | --- | --- | --- |
| lerobot/pusht | 1 | 2.543 | 0.308 | 0.0000599 |
| lerobot/pusht | 2 | 3.646 | 0.279 | 0.0000496 |
| lerobot/pusht | 3 | 4.431 | 0.259 | 0.0000498 |
| lerobot/pusht | 4 | 5.103 | 0.243 | 0.0000501 |
| lerobot/pusht | 5 | 5.625 | 0.235 | 0.0000492 |
| lerobot/pusht | 6 | 5.974 | 0.230 | 0.0000481 |
| lerobot/pusht | 10 | 6.814 | 0.194 | 0.0000468 |
| lerobot/pusht | 15 | 7.431 | 0.152 | 0.0000460 |
| lerobot/pusht | 20 | 7.662 | 0.151 | 0.0000455 |
| lerobot/pusht | 40 | 8.163 | 0.095 | 0.0000454 |
| lerobot/pusht | 100 | 8.761 | 0.062 | 0.0000472 |
| lerobot/pusht | None | 8.909 | 0.037 | 0.0000479 |
| lerobot/umi_cup_in_the_wild | 1 | 14.411 | 0.638 | 0.0000625 |
| lerobot/umi_cup_in_the_wild | 2 | 14.932 | 0.537 | 0.0000436 |
| lerobot/umi_cup_in_the_wild | 3 | 20.174 | 0.493 | 0.0000437 |
| lerobot/umi_cup_in_the_wild | 4 | 24.889 | 0.458 | 0.0000446 |
| lerobot/umi_cup_in_the_wild | 5 | 28.825 | 0.438 | 0.0000445 |
| lerobot/umi_cup_in_the_wild | 6 | 31.635 | 0.424 | 0.0000444 |
| lerobot/umi_cup_in_the_wild | 10 | 39.418 | 0.345 | 0.0000435 |
| lerobot/umi_cup_in_the_wild | 15 | 44.577 | 0.313 | 0.0000417 |
| lerobot/umi_cup_in_the_wild | 20 | 47.907 | 0.264 | 0.0000421 |
| lerobot/umi_cup_in_the_wild | 40 | 52.554 | 0.185 | 0.0000414 |
| lerobot/umi_cup_in_the_wild | 100 | 58.241 | 0.090 | 0.0000420 |
| lerobot/umi_cup_in_the_wild | None | 60.530 | 0.042 | 0.0000424 |
**`crf`**
| repo_id | crf | compression_factor | load_time_factor | avg_per_pixel_l2_error |
| --- | --- | --- | --- | --- |
| lerobot/pusht | 0 | 1.699 | 0.302 | 0.0000097 |
| lerobot/pusht | 5 | 1.409 | 0.287 | 0.0000142 |
| lerobot/pusht | 10 | 1.842 | 0.283 | 0.0000184 |
| lerobot/pusht | 15 | 2.322 | 0.305 | 0.0000268 |
| lerobot/pusht | 20 | 3.050 | 0.285 | 0.0000402 |
| lerobot/pusht | None | 3.646 | 0.285 | 0.0000496 |
| lerobot/pusht | 25 | 3.969 | 0.293 | 0.0000572 |
| lerobot/pusht | 30 | 5.687 | 0.293 | 0.0000893 |
| lerobot/pusht | 40 | 10.818 | 0.319 | 0.0001762 |
| lerobot/pusht | 50 | 18.185 | 0.304 | 0.0002626 |
| lerobot/umi_cup_in_the_wild | 0 | 1.918 | 0.235 | 0.0000112 |
| lerobot/umi_cup_in_the_wild | 5 | 3.207 | 0.261 | 0.0000166 |
| lerobot/umi_cup_in_the_wild | 10 | 4.818 | 0.333 | 0.0000207 |
| lerobot/umi_cup_in_the_wild | 15 | 7.329 | 0.406 | 0.0000267 |
| lerobot/umi_cup_in_the_wild | 20 | 11.361 | 0.489 | 0.0000361 |
| lerobot/umi_cup_in_the_wild | None | 14.932 | 0.537 | 0.0000436 |
| lerobot/umi_cup_in_the_wild | 25 | 17.741 | 0.578 | 0.0000487 |
| lerobot/umi_cup_in_the_wild | 30 | 27.983 | 0.453 | 0.0000655 |
| lerobot/umi_cup_in_the_wild | 40 | 82.449 | 0.767 | 0.0001192 |
| lerobot/umi_cup_in_the_wild | 50 | 186.145 | 0.816 | 0.0001881 |
**best**
| repo_id | compression_factor | load_time_factor | avg_per_pixel_l2_error |
| --- | --- | --- | --- |
| lerobot/pusht | 3.646 | 0.283 | 0.0000496 |
| lerobot/umi_cup_in_the_wild | 14.932 | 0.543 | 0.0000436 |
### `2_frames_4_space`
**`pix_fmt`**
| repo_id | pix_fmt | compression_factor | load_time_factor | avg_per_pixel_l2_error |
| --- | --- | --- | --- | --- |
| lerobot/pusht | yuv420p | 3.788 | 0.257 | 0.0000855 |
| lerobot/pusht | yuv444p | 3.646 | 0.261 | 0.0000556 |
| lerobot/umi_cup_in_the_wild | yuv420p | 14.391 | 0.493 | 0.0000476 |
| lerobot/umi_cup_in_the_wild | yuv444p | 14.932 | 0.371 | 0.0000404 |
**`g`**
| repo_id | g | compression_factor | load_time_factor | avg_per_pixel_l2_error |
| --- | --- | --- | --- | --- |
| lerobot/pusht | 1 | 2.543 | 0.226 | 0.0000670 |
| lerobot/pusht | 2 | 3.646 | 0.222 | 0.0000556 |
| lerobot/pusht | 3 | 4.431 | 0.217 | 0.0000567 |
| lerobot/pusht | 4 | 5.103 | 0.204 | 0.0000555 |
| lerobot/pusht | 5 | 5.625 | 0.179 | 0.0000556 |
| lerobot/pusht | 6 | 5.974 | 0.188 | 0.0000544 |
| lerobot/pusht | 10 | 6.814 | 0.160 | 0.0000531 |
| lerobot/pusht | 15 | 7.431 | 0.150 | 0.0000521 |
| lerobot/pusht | 20 | 7.662 | 0.123 | 0.0000519 |
| lerobot/pusht | 40 | 8.163 | 0.092 | 0.0000519 |
| lerobot/pusht | 100 | 8.761 | 0.053 | 0.0000533 |
| lerobot/pusht | None | 8.909 | 0.034 | 0.0000541 |
| lerobot/umi_cup_in_the_wild | 1 | 14.411 | 0.409 | 0.0000607 |
| lerobot/umi_cup_in_the_wild | 2 | 14.932 | 0.381 | 0.0000404 |
| lerobot/umi_cup_in_the_wild | 3 | 20.174 | 0.355 | 0.0000418 |
| lerobot/umi_cup_in_the_wild | 4 | 24.889 | 0.346 | 0.0000425 |
| lerobot/umi_cup_in_the_wild | 5 | 28.825 | 0.354 | 0.0000419 |
| lerobot/umi_cup_in_the_wild | 6 | 31.635 | 0.336 | 0.0000419 |
| lerobot/umi_cup_in_the_wild | 10 | 39.418 | 0.314 | 0.0000402 |
| lerobot/umi_cup_in_the_wild | 15 | 44.577 | 0.269 | 0.0000397 |
| lerobot/umi_cup_in_the_wild | 20 | 47.907 | 0.246 | 0.0000395 |
| lerobot/umi_cup_in_the_wild | 40 | 52.554 | 0.171 | 0.0000390 |
| lerobot/umi_cup_in_the_wild | 100 | 58.241 | 0.091 | 0.0000399 |
| lerobot/umi_cup_in_the_wild | None | 60.530 | 0.043 | 0.0000409 |
**`crf`**
| repo_id | crf | compression_factor | load_time_factor | avg_per_pixel_l2_error |
| --- | --- | --- | --- | --- |
| lerobot/pusht | 0 | 1.699 | 0.212 | 0.0000193 |
| lerobot/pusht | 5 | 1.409 | 0.211 | 0.0000232 |
| lerobot/pusht | 10 | 1.842 | 0.199 | 0.0000270 |
| lerobot/pusht | 15 | 2.322 | 0.198 | 0.0000347 |
| lerobot/pusht | 20 | 3.050 | 0.211 | 0.0000469 |
| lerobot/pusht | None | 3.646 | 0.206 | 0.0000556 |
| lerobot/pusht | 25 | 3.969 | 0.210 | 0.0000626 |
| lerobot/pusht | 30 | 5.687 | 0.223 | 0.0000927 |
| lerobot/pusht | 40 | 10.818 | 0.227 | 0.0001763 |
| lerobot/pusht | 50 | 18.185 | 0.223 | 0.0002625 |
| lerobot/umi_cup_in_the_wild | 0 | 1.918 | 0.147 | 0.0000071 |
| lerobot/umi_cup_in_the_wild | 5 | 3.207 | 0.182 | 0.0000125 |
| lerobot/umi_cup_in_the_wild | 10 | 4.818 | 0.222 | 0.0000166 |
| lerobot/umi_cup_in_the_wild | 15 | 7.329 | 0.270 | 0.0000229 |
| lerobot/umi_cup_in_the_wild | 20 | 11.361 | 0.325 | 0.0000326 |
| lerobot/umi_cup_in_the_wild | None | 14.932 | 0.362 | 0.0000404 |
| lerobot/umi_cup_in_the_wild | 25 | 17.741 | 0.390 | 0.0000459 |
| lerobot/umi_cup_in_the_wild | 30 | 27.983 | 0.437 | 0.0000633 |
| lerobot/umi_cup_in_the_wild | 40 | 82.449 | 0.499 | 0.0001186 |
| lerobot/umi_cup_in_the_wild | 50 | 186.145 | 0.564 | 0.0001879 |
**best**
| repo_id | compression_factor | load_time_factor | avg_per_pixel_l2_error |
| --- | --- | --- | --- |
| lerobot/pusht | 3.646 | 0.224 | 0.0000556 |
| lerobot/umi_cup_in_the_wild | 14.932 | 0.368 | 0.0000404 |
### `6_frames`
**`pix_fmt`**
| repo_id | pix_fmt | compression_factor | load_time_factor | avg_per_pixel_l2_error |
| --- | --- | --- | --- | --- |
| lerobot/pusht | yuv420p | 3.788 | 0.660 | 0.0000839 |
| lerobot/pusht | yuv444p | 3.646 | 0.546 | 0.0000542 |
| lerobot/umi_cup_in_the_wild | yuv420p | 14.391 | 1.225 | 0.0000497 |
| lerobot/umi_cup_in_the_wild | yuv444p | 14.932 | 0.908 | 0.0000428 |
**`g`**
| repo_id | g | compression_factor | load_time_factor | avg_per_pixel_l2_error |
| --- | --- | --- | --- | --- |
| lerobot/pusht | 1 | 2.543 | 0.552 | 0.0000646 |
| lerobot/pusht | 2 | 3.646 | 0.534 | 0.0000542 |
| lerobot/pusht | 3 | 4.431 | 0.563 | 0.0000546 |
| lerobot/pusht | 4 | 5.103 | 0.537 | 0.0000545 |
| lerobot/pusht | 5 | 5.625 | 0.477 | 0.0000532 |
| lerobot/pusht | 6 | 5.974 | 0.515 | 0.0000530 |
| lerobot/pusht | 10 | 6.814 | 0.410 | 0.0000512 |
| lerobot/pusht | 15 | 7.431 | 0.405 | 0.0000503 |
| lerobot/pusht | 20 | 7.662 | 0.345 | 0.0000500 |
| lerobot/pusht | 40 | 8.163 | 0.247 | 0.0000496 |
| lerobot/pusht | 100 | 8.761 | 0.147 | 0.0000510 |
| lerobot/pusht | None | 8.909 | 0.100 | 0.0000519 |
| lerobot/umi_cup_in_the_wild | 1 | 14.411 | 0.997 | 0.0000620 |
| lerobot/umi_cup_in_the_wild | 2 | 14.932 | 0.911 | 0.0000428 |
| lerobot/umi_cup_in_the_wild | 3 | 20.174 | 0.869 | 0.0000433 |
| lerobot/umi_cup_in_the_wild | 4 | 24.889 | 0.874 | 0.0000438 |
| lerobot/umi_cup_in_the_wild | 5 | 28.825 | 0.864 | 0.0000439 |
| lerobot/umi_cup_in_the_wild | 6 | 31.635 | 0.834 | 0.0000440 |
| lerobot/umi_cup_in_the_wild | 10 | 39.418 | 0.781 | 0.0000421 |
| lerobot/umi_cup_in_the_wild | 15 | 44.577 | 0.679 | 0.0000411 |
| lerobot/umi_cup_in_the_wild | 20 | 47.907 | 0.652 | 0.0000410 |
| lerobot/umi_cup_in_the_wild | 40 | 52.554 | 0.465 | 0.0000404 |
| lerobot/umi_cup_in_the_wild | 100 | 58.241 | 0.245 | 0.0000413 |
| lerobot/umi_cup_in_the_wild | None | 60.530 | 0.116 | 0.0000417 |
**`crf`**
| repo_id | crf | compression_factor | load_time_factor | avg_per_pixel_l2_error |
| --- | --- | --- | --- | --- |
| lerobot/pusht | 0 | 1.699 | 0.534 | 0.0000163 |
| lerobot/pusht | 5 | 1.409 | 0.524 | 0.0000205 |
| lerobot/pusht | 10 | 1.842 | 0.510 | 0.0000245 |
| lerobot/pusht | 15 | 2.322 | 0.512 | 0.0000324 |
| lerobot/pusht | 20 | 3.050 | 0.508 | 0.0000452 |
| lerobot/pusht | None | 3.646 | 0.518 | 0.0000542 |
| lerobot/pusht | 25 | 3.969 | 0.534 | 0.0000616 |
| lerobot/pusht | 30 | 5.687 | 0.530 | 0.0000927 |
| lerobot/pusht | 40 | 10.818 | 0.552 | 0.0001777 |
| lerobot/pusht | 50 | 18.185 | 0.564 | 0.0002644 |
| lerobot/umi_cup_in_the_wild | 0 | 1.918 | 0.401 | 0.0000101 |
| lerobot/umi_cup_in_the_wild | 5 | 3.207 | 0.499 | 0.0000156 |
| lerobot/umi_cup_in_the_wild | 10 | 4.818 | 0.599 | 0.0000197 |
| lerobot/umi_cup_in_the_wild | 15 | 7.329 | 0.704 | 0.0000258 |
| lerobot/umi_cup_in_the_wild | 20 | 11.361 | 0.834 | 0.0000352 |
| lerobot/umi_cup_in_the_wild | None | 14.932 | 0.925 | 0.0000428 |
| lerobot/umi_cup_in_the_wild | 25 | 17.741 | 0.978 | 0.0000480 |
| lerobot/umi_cup_in_the_wild | 30 | 27.983 | 1.088 | 0.0000648 |
| lerobot/umi_cup_in_the_wild | 40 | 82.449 | 1.324 | 0.0001190 |
| lerobot/umi_cup_in_the_wild | 50 | 186.145 | 1.436 | 0.0001880 |
**best**
| repo_id | compression_factor | load_time_factor | avg_per_pixel_l2_error |
| --- | --- | --- | --- |
| lerobot/pusht | 3.646 | 0.546 | 0.0000542 |
| lerobot/umi_cup_in_the_wild | 14.932 | 0.934 | 0.0000428 |

View File

@@ -1,90 +0,0 @@
#!/usr/bin/env python
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Capture video feed from a camera as raw images."""
import argparse
import datetime as dt
from pathlib import Path
import cv2
def display_and_save_video_stream(output_dir: Path, fps: int, width: int, height: int):
now = dt.datetime.now()
capture_dir = output_dir / f"{now:%Y-%m-%d}" / f"{now:%H-%M-%S}"
if not capture_dir.exists():
capture_dir.mkdir(parents=True, exist_ok=True)
# Opens the default webcam
cap = cv2.VideoCapture(0)
if not cap.isOpened():
print("Error: Could not open video stream.")
return
cap.set(cv2.CAP_PROP_FPS, fps)
cap.set(cv2.CAP_PROP_FRAME_WIDTH, width)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, height)
frame_index = 0
while True:
ret, frame = cap.read()
if not ret:
print("Error: Could not read frame.")
break
cv2.imshow("Video Stream", frame)
cv2.imwrite(str(capture_dir / f"frame_{frame_index:06d}.png"), frame)
frame_index += 1
# Break the loop on 'q' key press
if cv2.waitKey(1) & 0xFF == ord("q"):
break
# Release the capture and destroy all windows
cap.release()
cv2.destroyAllWindows()
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument(
"--output-dir",
type=Path,
default=Path("outputs/cam_capture/"),
help="Directory where the capture images are written. A subfolder named with the current date & time will be created inside it for each capture.",
)
parser.add_argument(
"--fps",
type=int,
default=30,
help="Frames Per Second of the capture.",
)
parser.add_argument(
"--width",
type=int,
default=1280,
help="Width of the captured images.",
)
parser.add_argument(
"--height",
type=int,
default=720,
help="Height of the captured images.",
)
args = parser.parse_args()
display_and_save_video_stream(**vars(args))

View File

@@ -1,409 +0,0 @@
#!/usr/bin/env python
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Assess the performance of video decoding in various configurations.
This script will run different video decoding benchmarks where one parameter varies at a time.
These parameters and theirs values are specified in the BENCHMARKS dict.
All of these benchmarks are evaluated within different timestamps modes corresponding to different frame-loading scenarios:
- `1_frame`: 1 single frame is loaded.
- `2_frames`: 2 consecutive frames are loaded.
- `2_frames_4_space`: 2 frames separated by 4 frames are loaded.
- `6_frames`: 6 consecutive frames are loaded.
These values are more or less arbitrary and based on possible future usage.
These benchmarks are run on the first episode of each dataset specified in DATASET_REPO_IDS.
Note: These datasets need to be image datasets, not video datasets.
"""
import json
import random
import shutil
import subprocess
import time
from pathlib import Path
import einops
import numpy as np
import PIL
import torch
from skimage.metrics import mean_squared_error, peak_signal_noise_ratio, structural_similarity
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset
from lerobot.common.datasets.video_utils import (
decode_video_frames_torchvision,
)
OUTPUT_DIR = Path("tmp/run_video_benchmark")
DRY_RUN = False
DATASET_REPO_IDS = [
"lerobot/pusht_image",
"aliberts/aloha_mobile_shrimp_image",
"aliberts/paris_street",
"aliberts/kitchen",
]
TIMESTAMPS_MODES = [
"1_frame",
"2_frames",
"2_frames_4_space",
"6_frames",
]
BENCHMARKS = {
# "pix_fmt": ["yuv420p", "yuv444p"],
# "g": [1, 2, 3, 4, 5, 6, 10, 15, 20, 40, 100, None],
# "crf": [0, 5, 10, 15, 20, None, 25, 30, 40, 50],
"backend": ["pyav", "video_reader"],
}
def get_directory_size(directory):
total_size = 0
# Iterate over all files and subdirectories recursively
for item in directory.rglob("*"):
if item.is_file():
# Add the file size to the total
total_size += item.stat().st_size
return total_size
def run_video_benchmark(
output_dir,
cfg,
timestamps_mode,
seed=1337,
):
output_dir = Path(output_dir)
if output_dir.exists():
shutil.rmtree(output_dir)
output_dir.mkdir(parents=True, exist_ok=True)
repo_id = cfg["repo_id"]
# TODO(rcadene): rewrite with hardcoding of original images and episodes
dataset = LeRobotDataset(repo_id)
if dataset.video:
raise ValueError(
f"Use only image dataset for running this benchmark. Video dataset provided: {repo_id}"
)
# Get fps
fps = dataset.fps
# we only load first episode
ep_num_images = dataset.episode_data_index["to"][0].item()
# Save/Load image directory for the first episode
imgs_dir = Path(f"tmp/data/images/{repo_id}/observation.image_episode_000000")
if not imgs_dir.exists():
imgs_dir.mkdir(parents=True, exist_ok=True)
hf_dataset = dataset.hf_dataset.with_format(None)
img_keys = [key for key in hf_dataset.features if key.startswith("observation.image")]
imgs_dataset = hf_dataset.select_columns(img_keys[0])
for i, item in enumerate(imgs_dataset):
img = item[img_keys[0]]
img.save(str(imgs_dir / f"frame_{i:06d}.png"), quality=100)
if i >= ep_num_images - 1:
break
sum_original_frames_size_bytes = get_directory_size(imgs_dir)
# Encode images into video
video_path = output_dir / "episode_0.mp4"
g = cfg.get("g")
crf = cfg.get("crf")
pix_fmt = cfg["pix_fmt"]
cmd = f"ffmpeg -r {fps} "
cmd += "-f image2 "
cmd += "-loglevel error "
cmd += f"-i {str(imgs_dir / 'frame_%06d.png')} "
cmd += "-vcodec libx264 "
if g is not None:
cmd += f"-g {g} " # ensures at least 1 keyframe every 10 frames
# cmd += "-keyint_min 10 " set a minimum of 10 frames between 2 key frames
# cmd += "-sc_threshold 0 " disable scene change detection to lower the number of key frames
if crf is not None:
cmd += f"-crf {crf} "
cmd += f"-pix_fmt {pix_fmt} "
cmd += f"{str(video_path)}"
subprocess.run(cmd.split(" "), check=True)
video_size_bytes = video_path.stat().st_size
# Set decoder
decoder = cfg["decoder"]
decoder_kwgs = cfg["decoder_kwgs"]
backend = cfg["backend"]
if decoder == "torchvision":
decode_frames_fn = decode_video_frames_torchvision
else:
raise ValueError(decoder)
# Estimate average loading time
def load_original_frames(imgs_dir, timestamps) -> torch.Tensor:
frames = []
for ts in timestamps:
idx = int(ts * fps)
frame = PIL.Image.open(imgs_dir / f"frame_{idx:06d}.png")
frame = torch.from_numpy(np.array(frame))
frame = frame.type(torch.float32) / 255
frame = einops.rearrange(frame, "h w c -> c h w")
frames.append(frame)
return frames
list_avg_load_time = []
list_avg_load_time_from_images = []
per_pixel_l2_errors = []
psnr_values = []
ssim_values = []
mse_values = []
random.seed(seed)
for t in range(50):
# test loading 2 frames that are 4 frames appart, which might be a common setting
ts = random.randint(fps, ep_num_images - fps) / fps
if timestamps_mode == "1_frame":
timestamps = [ts]
elif timestamps_mode == "2_frames":
timestamps = [ts - 1 / fps, ts]
elif timestamps_mode == "2_frames_4_space":
timestamps = [ts - 5 / fps, ts]
elif timestamps_mode == "6_frames":
timestamps = [ts - i / fps for i in range(6)][::-1]
else:
raise ValueError(timestamps_mode)
num_frames = len(timestamps)
start_time_s = time.monotonic()
frames = decode_frames_fn(
video_path, timestamps=timestamps, tolerance_s=1e-4, backend=backend, **decoder_kwgs
)
avg_load_time = (time.monotonic() - start_time_s) / num_frames
list_avg_load_time.append(avg_load_time)
start_time_s = time.monotonic()
original_frames = load_original_frames(imgs_dir, timestamps)
avg_load_time_from_images = (time.monotonic() - start_time_s) / num_frames
list_avg_load_time_from_images.append(avg_load_time_from_images)
# Estimate reconstruction error between original frames and decoded frames with various metrics
for i, ts in enumerate(timestamps):
# are_close = torch.allclose(frames[i], original_frames[i], atol=0.02)
num_pixels = original_frames[i].numel()
per_pixel_l2_error = torch.norm(frames[i] - original_frames[i], p=2).item() / num_pixels
per_pixel_l2_errors.append(per_pixel_l2_error)
frame_np, original_frame_np = frames[i].numpy(), original_frames[i].numpy()
psnr_values.append(peak_signal_noise_ratio(original_frame_np, frame_np, data_range=1.0))
ssim_values.append(
structural_similarity(original_frame_np, frame_np, data_range=1.0, channel_axis=0)
)
mse_values.append(mean_squared_error(original_frame_np, frame_np))
# save decoded frames
if t == 0:
frame_hwc = (frames[i].permute((1, 2, 0)) * 255).type(torch.uint8).cpu().numpy()
PIL.Image.fromarray(frame_hwc).save(output_dir / f"frame_{i:06d}.png")
# save original_frames
idx = int(ts * fps)
if t == 0:
original_frame = PIL.Image.open(imgs_dir / f"frame_{idx:06d}.png")
original_frame.save(output_dir / f"original_frame_{i:06d}.png")
image_size = tuple(dataset[0][dataset.camera_keys[0]].shape[-2:])
avg_load_time = float(np.array(list_avg_load_time).mean())
avg_load_time_from_images = float(np.array(list_avg_load_time_from_images).mean())
avg_per_pixel_l2_error = float(np.array(per_pixel_l2_errors).mean())
avg_psnr = float(np.mean(psnr_values))
avg_ssim = float(np.mean(ssim_values))
avg_mse = float(np.mean(mse_values))
# Save benchmark info
info = {
"image_size": image_size,
"sum_original_frames_size_bytes": sum_original_frames_size_bytes,
"video_size_bytes": video_size_bytes,
"avg_load_time_from_images": avg_load_time_from_images,
"avg_load_time": avg_load_time,
"compression_factor": sum_original_frames_size_bytes / video_size_bytes,
"load_time_factor": avg_load_time_from_images / avg_load_time,
"avg_per_pixel_l2_error": avg_per_pixel_l2_error,
"avg_psnr": avg_psnr,
"avg_ssim": avg_ssim,
"avg_mse": avg_mse,
}
with open(output_dir / "info.json", "w") as f:
json.dump(info, f)
return info
def display_markdown_table(headers, rows):
for i, row in enumerate(rows):
new_row = []
for col in row:
if col is None:
new_col = "None"
elif isinstance(col, float):
new_col = f"{col:.3f}"
if new_col == "0.000":
new_col = f"{col:.7f}"
elif isinstance(col, int):
new_col = f"{col}"
else:
new_col = col
new_row.append(new_col)
rows[i] = new_row
header_line = "| " + " | ".join(headers) + " |"
separator_line = "| " + " | ".join(["---" for _ in headers]) + " |"
body_lines = ["| " + " | ".join(row) + " |" for row in rows]
markdown_table = "\n".join([header_line, separator_line] + body_lines)
print(markdown_table)
print()
def load_info(out_dir):
with open(out_dir / "info.json") as f:
info = json.load(f)
return info
def one_variable_study(
var_name: str, var_values: list, repo_ids: list, bench_dir: Path, timestamps_mode: str, dry_run: bool
):
print(f"**`{var_name}`**")
headers = [
"repo_id",
"image_size",
var_name,
"compression_factor",
"load_time_factor",
"avg_per_pixel_l2_error",
"avg_psnr",
"avg_ssim",
"avg_mse",
]
rows = []
base_cfg = {
"repo_id": None,
# video encoding
"g": 2,
"crf": None,
"pix_fmt": "yuv444p",
# video decoding
"backend": "pyav",
"decoder": "torchvision",
"decoder_kwgs": {},
}
for repo_id in repo_ids:
for val in var_values:
cfg = base_cfg.copy()
cfg["repo_id"] = repo_id
cfg[var_name] = val
if not dry_run:
run_video_benchmark(
bench_dir / repo_id / f"torchvision_{var_name}_{val}", cfg, timestamps_mode
)
info = load_info(bench_dir / repo_id / f"torchvision_{var_name}_{val}")
width, height = info["image_size"][0], info["image_size"][1]
rows.append(
[
repo_id,
f"{width} x {height}",
val,
info["compression_factor"],
info["load_time_factor"],
info["avg_per_pixel_l2_error"],
info["avg_psnr"],
info["avg_ssim"],
info["avg_mse"],
]
)
display_markdown_table(headers, rows)
def best_study(repo_ids: list, bench_dir: Path, timestamps_mode: str, dry_run: bool):
"""Change the config once you deciced what's best based on one-variable-studies"""
print("**best**")
headers = [
"repo_id",
"image_size",
"compression_factor",
"load_time_factor",
"avg_per_pixel_l2_error",
"avg_psnr",
"avg_ssim",
"avg_mse",
]
rows = []
for repo_id in repo_ids:
cfg = {
"repo_id": repo_id,
# video encoding
"g": 2,
"crf": None,
"pix_fmt": "yuv444p",
# video decoding
"backend": "video_reader",
"decoder": "torchvision",
"decoder_kwgs": {},
}
if not dry_run:
run_video_benchmark(bench_dir / repo_id / "torchvision_best", cfg, timestamps_mode)
info = load_info(bench_dir / repo_id / "torchvision_best")
width, height = info["image_size"][0], info["image_size"][1]
rows.append(
[
repo_id,
f"{width} x {height}",
info["compression_factor"],
info["load_time_factor"],
info["avg_per_pixel_l2_error"],
]
)
display_markdown_table(headers, rows)
def main():
for timestamps_mode in TIMESTAMPS_MODES:
bench_dir = OUTPUT_DIR / timestamps_mode
print(f"### `{timestamps_mode}`")
print()
for name, values in BENCHMARKS.items():
one_variable_study(name, values, DATASET_REPO_IDS, bench_dir, timestamps_mode, DRY_RUN)
# best_study(DATASET_REPO_IDS, bench_dir, timestamps_mode, DRY_RUN)
if __name__ == "__main__":
main()

View File

@@ -16,6 +16,7 @@
import logging
import subprocess
import warnings
from collections import OrderedDict
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, ClassVar
@@ -69,7 +70,7 @@ def decode_video_frames_torchvision(
tolerance_s: float,
backend: str = "pyav",
log_loaded_timestamps: bool = False,
):
) -> torch.Tensor:
"""Loads frames associated to the requested timestamps of a video
The backend can be either "pyav" (default) or "video_reader".
@@ -77,9 +78,8 @@ def decode_video_frames_torchvision(
https://github.com/pytorch/vision/blob/main/torchvision/csrc/io/decoder/gpu/README.rst
(note that you need to compile against ffmpeg<4.3)
While both use cpu, "video_reader" is faster than "pyav" but requires additional setup.
See our benchmark results for more info on performance:
https://github.com/huggingface/lerobot/pull/220
While both use cpu, "video_reader" is supposedly faster than "pyav" but requires additional setup.
For more info on video decoding, see `benchmark/video/README.md`
See torchvision doc for more info on these two backends:
https://pytorch.org/vision/0.18/index.html?highlight=backend#torchvision.set_video_backend
@@ -142,6 +142,10 @@ def decode_video_frames_torchvision(
"It means that the closest frame that can be loaded from the video is too far away in time."
"This might be due to synchronization issues with timestamps during data collection."
"To be safe, we advise to ignore this item during training."
f"\nqueried timestamps: {query_ts}"
f"\nloaded timestamps: {loaded_ts}"
f"\nvideo: {video_path}"
f"\nbackend: {backend}"
)
# get closest frames to the query timestamps
@@ -158,22 +162,52 @@ def decode_video_frames_torchvision(
return closest_frames
def encode_video_frames(imgs_dir: Path, video_path: Path, fps: int):
"""More info on ffmpeg arguments tuning on `lerobot/common/datasets/_video_benchmark/README.md`"""
def encode_video_frames(
imgs_dir: Path,
video_path: Path,
fps: int,
video_codec: str = "libsvtav1",
pixel_format: str = "yuv420p",
group_of_pictures_size: int | None = 2,
constant_rate_factor: int | None = 30,
fast_decode: int = 0,
log_level: str | None = "error",
overwrite: bool = False,
) -> None:
"""More info on ffmpeg arguments tuning on `benchmark/video/README.md`"""
video_path = Path(video_path)
video_path.parent.mkdir(parents=True, exist_ok=True)
ffmpeg_cmd = (
f"ffmpeg -r {fps} "
"-f image2 "
"-loglevel error "
f"-i {str(imgs_dir / 'frame_%06d.png')} "
"-vcodec libx264 "
"-g 2 "
"-pix_fmt yuv444p "
f"{str(video_path)}"
ffmpeg_args = OrderedDict(
[
("-f", "image2"),
("-r", str(fps)),
("-i", str(imgs_dir / "frame_%06d.png")),
("-vcodec", video_codec),
("-pix_fmt", pixel_format),
]
)
subprocess.run(ffmpeg_cmd.split(" "), check=True)
if group_of_pictures_size is not None:
ffmpeg_args["-g"] = str(group_of_pictures_size)
if constant_rate_factor is not None:
ffmpeg_args["-crf"] = str(constant_rate_factor)
if fast_decode:
key = "-svtav1-params" if video_codec == "libsvtav1" else "-tune"
value = f"fast-decode={fast_decode}" if video_codec == "libsvtav1" else "fastdecode"
ffmpeg_args[key] = value
if log_level is not None:
ffmpeg_args["-loglevel"] = str(log_level)
ffmpeg_args = [item for pair in ffmpeg_args.items() for item in pair]
if overwrite:
ffmpeg_args.append("-y")
ffmpeg_cmd = ["ffmpeg"] + ffmpeg_args + [str(video_path)]
subprocess.run(ffmpeg_cmd, check=True)
@dataclass