Description
🐛 Bug
https://github.com/pytorch/torchx/blob/main/torchx/examples/apps/lightning_classy_vision/train.py
does not work on the local scheduler with multiple nodes when GPUs are available (it does work on CPUs, however).
This is because the local scheduler does NOT mask GPUs by setting CUDA_VISIBLE_DEVICES for each
replica. This is expected behavior, since the local scheduler does not do any type of resource isolation, but it makes the example CV trainer incompatible with local_cwd
(or any other variant of the local scheduler) when running multiple replicas.

Interestingly, you can still work around it by specifying --nnodes=1 --nproc_per_node=8
instead of --nnodes=2 --nproc_per_node=4 or --nnodes=4 --nproc_per_node=2.
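For context, here is a minimal sketch of the kind of per-replica GPU masking the local scheduler would need to perform. The function name and the contiguous slicing scheme are illustrative assumptions, not TorchX's actual code:

```python
def gpu_env_for_replica(replica_id: int, gpus_per_replica: int) -> dict:
    """Illustrative sketch (NOT TorchX's implementation): compute the
    CUDA_VISIBLE_DEVICES value a local scheduler would have to set so that
    each replica only sees its own contiguous slice of the host's GPUs."""
    start = replica_id * gpus_per_replica
    devices = range(start, start + gpus_per_replica)
    return {"CUDA_VISIBLE_DEVICES": ",".join(str(d) for d in devices)}

# With --nnodes=2 --nproc_per_node=4 on an 8-GPU host, replica 1 would need:
print(gpu_env_for_replica(1, 4))  # {'CUDA_VISIBLE_DEVICES': '4,5,6,7'}
```

Because the local scheduler never sets this variable, every replica sees all host GPUs and they collide, which is why the single-node workaround above succeeds.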
Module (check all that apply):

- [ ] torchx.spec
- [ ] torchx.component
- [ ] torchx.apps
- [ ] torchx.runtime
- [ ] torchx.cli
- [x] torchx.schedulers
- [ ] torchx.pipelines
- [ ] torchx.aws
- [x] torchx.examples
- [ ] other
To Reproduce
Steps to reproduce the behavior:
On a host with GPUs, run:

```sh
torchx run -s local_cwd ./torchx/examples/apps/lightning_classy_vision/component.py:trainer_dist \
    --nnodes 2 \
    --nproc_per_node 2 \
    --rdzv_backend c10d \
    --rdzv_endpoint localhost:29500
```

(`--nnodes` can be anything greater than 1.)
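The failure mode can be illustrated in miniature: since local_cwd launches replicas without setting CUDA_VISIBLE_DEVICES, each replica enumerates the full set of host GPUs. The helper below is a hypothetical sketch of that check, not code from the repo:

```python
def replica_sees_all_gpus(env: dict, host_gpu_count: int) -> bool:
    """Hypothetical check: without CUDA_VISIBLE_DEVICES in its environment,
    a replica enumerates every GPU on the host rather than just its share."""
    masked = env.get("CUDA_VISIBLE_DEVICES")
    if masked is None:
        return True  # no masking applied: all host GPUs are visible
    return len(masked.split(",")) == host_gpu_count

# local_cwd launches each replica with no CUDA_VISIBLE_DEVICES set:
print(replica_sees_all_gpus({}, 8))  # True -> replicas collide on the same GPUs
```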
Expected behavior
The example should run with multiple replicas on a GPU host just as it does on CPU, with each replica restricted to its own GPUs.
Environment
- torchx version (e.g. 0.1.0rc1):
- Python version:
- OS (e.g., Linux):
- How you installed torchx (conda, pip, source, docker):
- Docker image and tag (if using docker):
- Git commit (if installed from source):
- Execution environment (on-prem, AWS, GCP, Azure etc):
- Any other relevant information: