Commit Graph

5 Commits

Author SHA1 Message Date
Pepijn
e82e7a02e9 feat(train): add accelerate for multi gpu training (#2154)
* Enhance training and logging functionality with accelerator support

- Added support for multi-GPU training by introducing an `accelerator` parameter in training functions.
- Updated `update_policy` to handle gradient updates based on the presence of an accelerator.
- Modified logging to prevent duplicate messages in non-main processes.
- Enhanced `set_seed` and `get_safe_torch_device` functions to accommodate accelerator usage.
- Updated `MetricsTracker` to account for the number of processes when calculating metrics.
- Introduced a new feature in `pyproject.toml` for the `accelerate` library dependency.
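
As an illustration of the `MetricsTracker` point above, here is a minimal sketch of how sample and epoch counts could scale with the number of processes. It is not the code from this commit: the function names `samples_seen` and `epochs_seen` are hypothetical, and it assumes each process consumes its own batch of `batch_size` samples per step.

```python
def samples_seen(step: int, batch_size: int, num_processes: int = 1) -> int:
    """Return the global number of samples consumed after `step` steps.

    Assumption (not taken from the commit): every process draws its own
    batch of `batch_size` samples at each step, so the global count is
    the per-process count multiplied by the number of processes.
    """
    if step < 0 or batch_size <= 0 or num_processes <= 0:
        raise ValueError("step must be >= 0; batch_size and num_processes must be > 0")
    return step * batch_size * num_processes


def epochs_seen(step: int, batch_size: int, dataset_size: int, num_processes: int = 1) -> float:
    """Return the fraction of the dataset covered so far, in epochs."""
    if dataset_size <= 0:
        raise ValueError("dataset_size must be > 0")
    return samples_seen(step, batch_size, num_processes) / dataset_size
```

With a single process the result reduces to the usual `step * batch_size`, so single-GPU metrics are unchanged under this assumption.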

* Initialize logging in training script for both main and non-main processes

- Added `init_logging` calls to ensure proper logging setup when using the accelerator and in standard training mode.
- This change enhances the clarity and consistency of logging during training sessions.
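
A minimal sketch of the main-process-only logging idea, using only the standard library. The helper name `init_logging_sketch` and its parameters are hypothetical and do not reproduce the repository's `init_logging`; in a real run the flag would presumably come from something like `Accelerator().is_main_process`.

```python
import logging


def init_logging_sketch(is_main_process: bool, name: str = "train_sketch") -> logging.Logger:
    """Configure a logger that is verbose only on the main process.

    Non-main processes still report errors, but INFO-level messages are
    suppressed so that each message appears once per training run.
    """
    logger = logging.getLogger(name)
    logger.handlers.clear()      # avoid stacking handlers on repeated calls
    logger.propagate = False     # keep output from being duplicated by the root logger
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO if is_main_process else logging.ERROR)
    return logger
```

Calling the helper on every process, main or not, keeps the setup consistent and leaves only the verbosity level dependent on the process role.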

* add docs and only push model once

* Place logging under accelerate and update docs

* fix pre-commit

* only log in main process

* main logging

* try with local rank

* add tests

* change runner

* fix test

* don't push to hub in multi-GPU tests

* pre-download dataset in tests

* small fixes

* fix path optimizer state

* update docs, and small improvements in train

* simplify accelerate main process detection

* small improvements in train

* fix OOM bug

* change accelerate detection

* add some debugging

* always use accelerate

* cleanup update method

* cleanup

* fix bug

* scale lr decay if we reduce steps

* cleanup logging

* fix formatting

* incorporate PR feedback

* add min memory to cpu tests

* use accelerate to determine logging

* fix pre-commit and fix tests

* chore: minor details

---------

Co-authored-by: AdilZouitine <adilzouitinegm@gmail.com>
Co-authored-by: Steven Palma <steven.palma@huggingface.co>
2025-10-16 17:41:55 +02:00
Steven Palma
0878c6880f fix(ci): inverted names (#1705) 2025-08-09 00:21:42 +02:00
Steven Palma
f6ec1d89a5 feat(ci): add release workflow (#1562) 2025-07-21 19:08:32 +02:00
Steven Palma
e88b30e6cc fix(ci): multiple fixes (#1549)
* fix(ci): tag of image when pushing to main

* fix(docs): remove symlink in docs folder

* chore(docs): move .mdx files to docs/ folder

* chore(docs): create symlink to docs files

* chore(ci): de-couple fast and full test pipeline

* fix(ci): skip GPU Tests for community PRs
2025-07-20 23:09:35 +02:00
Steven Palma
89f59b0703 refactor(ci): workflows improvements (#1535)
* refactor(ci): consolidate documentation workflows

* refactor(ci): improve quality workflow

* refactor(ci): edit security workflow

* refactor(ci): improve testing workflows

* fix(ci): several fixes

* chore(ci): renaming + permissions

* chore(ci): remove now unused dockerfiles

* chore(docs): add license headers to dockerfiles

* chore(ci): add cache-binary false to setup-buildx actions

* fix(ci): several fixes

* dbg(ci): explicit env in the workflow

* fix(ci): more explicit env vars for writing

* fix(ci): nightly gpu tag
2025-07-19 20:09:12 +02:00