
[torchx/examples] Fix classy vision trainer example on local, multi-node, gpu #297

@kiukchung

Description

🐛 Bug

https://github.com/pytorch/torchx/blob/main/torchx/examples/apps/lightning_classy_vision/train.py
does not work with the local scheduler when running multiple nodes on a host where GPUs are available (it works on CPU-only hosts, however).

This is because the local scheduler does NOT mask GPUs by setting CUDA_VISIBLE_DEVICES for each
replica, which is expected behavior since the local scheduler DOES NOT do any type of resource isolation. This makes the example CV trainer incompatible with local_cwd (or any other variant of the local scheduler) when running multiple replicas.
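
To make the failure concrete, here is a minimal sketch (not code from train.py; it assumes each replica picks its GPU from the torchelastic-provided LOCAL_RANK, as DDP-style scripts commonly do):

    import os

    import torch

    # Each replica picks its GPU from LOCAL_RANK (an assumption for
    # illustration; torchelastic sets LOCAL_RANK for every replica it spawns).
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    device = torch.device(f"cuda:{local_rank}" if torch.cuda.is_available() else "cpu")

    # On a single 8-GPU host, --nnodes=2 --nproc_per_node=4 yields LOCAL_RANK
    # values 0..3 in BOTH node groups, so two replicas land on each of
    # cuda:0..cuda:3 while cuda:4..cuda:7 sit idle -- because the local
    # scheduler never narrows CUDA_VISIBLE_DEVICES per replica.
    print(f"local_rank={local_rank} -> {device}")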

Interestingly, you can still work around it by specifying --nnodes=1 --nproc_per_node=8 instead of --nnodes=2 --nproc_per_node=4 or --nnodes=4 --nproc_per_node=2, as shown in the sketch below.
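
For example, on an 8-GPU host the workaround invocation would look roughly like this (same component and scheduler as the repro steps below, only the node/proc split changes):

 torchx run -s local_cwd ./torchx/examples/apps/lightning_classy_vision/component.py:trainer_dist \
   --nnodes 1 \
   --nproc_per_node 8 \
   --rdzv_backend c10d \
   --rdzv_endpoint localhost:29500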

Module (check all that apply):

  • torchx.spec
  • torchx.component
  • torchx.apps
  • torchx.runtime
  • torchx.cli
  • torchx.schedulers
  • torchx.pipelines
  • torchx.aws
  • torchx.examples
  • other

To Reproduce

Steps to reproduce the behavior:

ON A HOST WITH GPUs

 # note: any --nnodes value greater than 1 reproduces the failure
 torchx run -s local_cwd ./torchx/examples/apps/lightning_classy_vision/component.py:trainer_dist \
   --nnodes 2 \
   --nproc_per_node 2 \
   --rdzv_backend c10d \
   --rdzv_endpoint localhost:29500

Expected behavior

The trainer should run successfully with multiple node groups on a single GPU host, just as it does on CPU-only hosts or with --nnodes=1.

Environment

  • torchx version (e.g. 0.1.0rc1):
  • Python version:
  • OS (e.g., Linux):
  • How you installed torchx (conda, pip, source, docker):
  • Docker image and tag (if using docker):
  • Git commit (if installed from source):
  • Execution environment (on-prem, AWS, GCP, Azure etc):
  • Any other relevant information:

Additional context
