Description
I have a single 8-GPU machine with a faulty GPU 0.
I'm running imagenet_example.py on the other 7 GPUs by specifying gpus=[1,2,3,4,5,6,7] in the Trainer, i.e. I do not want to use GPU 0.
However, when I run nvidia-smi, the Trainer's PID shows up on all 8 GPUs, just with lower memory usage on GPU 0 (see output below). Training is also roughly 4x slower than the equivalent non-Lightning code. I don't see this behavior if I manually set CUDA_VISIBLE_DEVICES=1,2,3,4,5,6,7 and then pass gpus=7 to the Trainer. It also works fine when using a single GPU with, say, gpus=[1].
I'm not sure if it's relevant, but I also see gpu=0 in the tqdm progress bar.
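For clarity, the relevant part of the setup is just the Trainer construction (the model is the one from imagenet_example.py; 'dp' as the backend is my assumption here, consistent with the single PID showing up on every GPU):

import pytorch_lightning as pl

# model = ...  (the LightningModule from imagenet_example.py)
trainer = pl.Trainer(
    gpus=[1, 2, 3, 4, 5, 6, 7],   # GPU 0 excluded on purpose (faulty)
    distributed_backend='dp',     # assumption; a single PID on all GPUs suggests dp
)
# trainer.fit(model)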
nvidia-smi with Trainer(gpus=[1,2,3,4,5,6,7]) and CUDA_VISIBLE_DEVICES unset
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     40155      C   python                                       719MiB |
|    1     40155      C   python                                      6003MiB |
|    2     40155      C   python                                      6019MiB |
|    3     40155      C   python                                      6019MiB |
|    4     40155      C   python                                      6019MiB |
|    5     40155      C   python                                      6019MiB |
|    6     40155      C   python                                      6019MiB |
|    7     40155      C   python                                      6019MiB |
+-----------------------------------------------------------------------------+
nvidia-smi with Trainer(gpus=7) and CUDA_VISIBLE_DEVICES=1,2,3,4,5,6,7
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    1     34452      C   python                                      6003MiB |
|    2     34452      C   python                                      6019MiB |
|    3     34452      C   python                                      6019MiB |
|    4     34452      C   python                                      6019MiB |
|    5     34452      C   python                                      6019MiB |
|    6     34452      C   python                                      6019MiB |
|    7     34452      C   python                                      6019MiB |
+-----------------------------------------------------------------------------+
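For comparison, this is a sketch of the manual workaround that behaves correctly (the key point is hiding GPU 0 before torch/CUDA is initialized; 'dp' is again my assumption):

import os

# Hide GPU 0 from CUDA before importing anything that initializes it,
# then count GPUs 0..6 inside the process.
os.environ['CUDA_VISIBLE_DEVICES'] = '1,2,3,4,5,6,7'

import pytorch_lightning as pl

trainer = pl.Trainer(gpus=7, distributed_backend='dp')
# trainer.fit(model)

With this setup nvidia-smi shows no allocation on GPU 0 and the ~4x slowdown disappears.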
Expected behavior
The process should only use the GPUs specified via gpus=[1,2,3,4,5,6,7], without requiring CUDA_VISIBLE_DEVICES to be set manually.
Environment
PyTorch version: 1.4.0
Is debug build: No
CUDA used to build PyTorch: 10.1
OS: Ubuntu 18.04.3 LTS
GCC version: (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
CMake version: version 3.10.2
Python version: 3.8
Is CUDA available: Yes
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: GeForce RTX 2080 Ti
GPU 1: GeForce RTX 2080 Ti
GPU 2: GeForce RTX 2080 Ti
GPU 3: GeForce RTX 2080 Ti
GPU 4: GeForce RTX 2080 Ti
GPU 5: GeForce RTX 2080 Ti
GPU 6: GeForce RTX 2080 Ti
GPU 7: GeForce RTX 2080 Ti
Nvidia driver version: 418.87.00
cuDNN version: Could not collect
Versions of relevant libraries:
[pip] numpy==1.18.1
[pip] pytorch-lightning==0.6.0
[pip] torch==1.4.0
[pip] torch-lr-finder==0.1.2
[pip] torchvision==0.5.0
[conda] blas 1.0 mkl
[conda] mkl 2020.0 166
[conda] mkl-service 2.3.0 py38he904b0f_0
[conda] mkl_fft 1.0.15 py38ha843d7b_0
[conda] mkl_random 1.1.0 py38h962f231_0
[conda] pytorch 1.4.0 py3.8_cuda10.1.243_cudnn7.6.3_0 pytorch
[conda] pytorch-lightning 0.6.0 pypi_0 pypi
[conda] torch-lr-finder 0.1.2 pypi_0 pypi
[conda] torchvision 0.5.0 py38_cu101 pytorch