
Conversation


@amorehead commented Sep 10, 2025

Summary:
Makes test_utils.py (and torchtnt in general) safe to use with start_method=fork for multi-GPU training via torchelastic. One project that benefits from this change is fairchem, which uses torchelastic and torchtnt together for multi-GPU training.

Test plan:
I verified that this change allows me to train models in the fairchem codebase with start_method=fork for elastic_launch. Without it, any process that imports torchtnt ends up with a CUDA context, so a parent process that imports the package can no longer fork CUDA-using workers for multi-GPU training.
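The underlying issue is general: a module-level CUDA query runs at import time, so any process that merely imports the package may initialize a CUDA context and can then no longer fork CUDA-using workers. Below is a minimal sketch of the fork-safe pattern of deferring the query to call time; the names are illustrative, not the actual torchtnt helpers, and whether a particular query initializes a context depends on the PyTorch version.

```python
import torch

# Fork-unsafe pattern (illustrative): a CUDA query evaluated at import time.
# Depending on the PyTorch version, this can initialize a CUDA context in
# whatever process performs the import, poisoning later fork() calls:
#
#   _HAS_GPU = torch.cuda.device_count() > 0  # runs on `import`


def has_gpu() -> bool:
    """Fork-safe alternative: the CUDA query runs only when a caller
    actually needs it, so a parent process that just imports this module
    never touches the CUDA driver and remains safe to fork."""
    return torch.cuda.is_available() and torch.cuda.device_count() > 0
```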

Fixes:
Together with this fairchem PR, this change fixes crashes in local (non-SLURM) multi-GPU model training in the fairchem codebase when start_method=fork.
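For context, here is a minimal sketch of the launch configuration this enables: a local, single-node elastic_launch with start_method="fork". The trainer function and worker count are placeholders, not fairchem's actual entry point.

```python
import os

from torch.distributed.launcher.api import LaunchConfig, elastic_launch


def train() -> None:
    # Each forked worker initializes CUDA itself; this only works if the
    # parent process never created a CUDA context before forking.
    import torch

    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchelastic
    torch.cuda.set_device(local_rank)
    print(f"worker {local_rank} using device {torch.cuda.current_device()}")


if __name__ == "__main__":
    config = LaunchConfig(
        min_nodes=1,
        max_nodes=1,
        nproc_per_node=2,      # e.g. one process per local GPU
        start_method="fork",   # requires a CUDA-free parent process
        rdzv_backend="c10d",
        rdzv_endpoint="localhost:29500",
    )
    elastic_launch(config, train)()
```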
