
[torchx/examples] Fix classy vision trainer example on local, multi-node, gpu #297

@kiukchung

Description

🐛 Bug

https://github.com/pytorch/torchx/blob/main/torchx/examples/apps/lightning_classy_vision/train.py
does not work with the local scheduler when running multiple nodes on a host where GPUs are available (it works on CPU-only hosts, however).

This is because the local scheduler does NOT mask GPUs by setting CUDA_VISIBLE_DEVICES for each
replica, which is expected behavior since the local scheduler DOES NOT do any type of resource isolation. This makes the example CV trainer incompatible with local_cwd (or any other variant of the local scheduler) when running multiple replicas.
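
To make the failure concrete, here is a minimal sketch (not code from train.py; it assumes each replica picks its GPU from the torchelastic-provided LOCAL_RANK, as DDP-style scripts commonly do):

    import os

    import torch

    # Each replica picks its GPU from LOCAL_RANK (an assumption for
    # illustration; torchelastic sets LOCAL_RANK for every replica it spawns).
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    device = torch.device(f"cuda:{local_rank}" if torch.cuda.is_available() else "cpu")

    # On a single 8-GPU host, --nnodes=2 --nproc_per_node=4 yields LOCAL_RANK
    # values 0..3 in BOTH node groups, so two replicas land on each of
    # cuda:0..cuda:3 while cuda:4..cuda:7 sit idle -- because the local
    # scheduler never narrows CUDA_VISIBLE_DEVICES per replica.
    print(f"local_rank={local_rank} -> {device}")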

Interestingly, you can still work around it by specifying --nnodes=1 --nproc_per_node=8 instead of --nnodes=2 --nproc_per_node=4 or --nnodes=4 --nproc_per_node=2, as shown in the sketch below.
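
For example, on an 8-GPU host the workaround invocation would look roughly like this (same component and scheduler as the repro steps below, only the node/proc split changes):

 torchx run -s local_cwd ./torchx/examples/apps/lightning_classy_vision/component.py:trainer_dist \
   --nnodes 1 \
   --nproc_per_node 8 \
   --rdzv_backend c10d \
   --rdzv_endpoint localhost:29500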

Module (check all that apply):

  • torchx.spec
  • torchx.component
  • torchx.apps
  • torchx.runtime
  • torchx.cli
  • torchx.schedulers
  • torchx.pipelines
  • torchx.aws
  • torchx.examples
  • other

To Reproduce

Steps to reproduce the behavior:

ON A HOST WITH GPUs

 # note: any --nnodes value greater than 1 reproduces the failure
 torchx run -s local_cwd ./torchx/examples/apps/lightning_classy_vision/component.py:trainer_dist \
   --nnodes 2 \
   --nproc_per_node 2 \
   --rdzv_backend c10d \
   --rdzv_endpoint localhost:29500

Expected behavior

The trainer should run successfully with multiple node groups on a single GPU host, just as it does on CPU-only hosts or with --nnodes=1.

Environment

  • torchx version (e.g. 0.1.0rc1):
  • Python version:
  • OS (e.g., Linux):
  • How you installed torchx (conda, pip, source, docker):
  • Docker image and tag (if using docker):
  • Git commit (if installed from source):
  • Execution environment (on-prem, AWS, GCP, Azure etc):
  • Any other relevant information:

Additional context
