
Process runs on more GPUs than specified #958

@sahnimanas

Description

I have a single 8-GPU machine with a faulty GPU0.
I'm running imagenet_example.py on 7 GPUs of this machine by passing gpus=[1,2,3,4,5,6,7] to the Trainer, i.e. I do not want to use GPU0.

However, when I run nvidia-smi, the Trainer's PID shows up on all 8 GPUs, just with lower memory usage on GPU0 (see output below). Training is also about 4x slower than equivalent non-Lightning code. I don't see this behavior if I manually set CUDA_VISIBLE_DEVICES=1,2,3,4,5,6,7 and then pass gpus=7 to the Trainer. Similarly, it works fine on a single GPU with, say, gpus=[1].
I'm not sure if it's relevant, but I also see gpu=0 in the tqdm progress bar.
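For reference, the workaround described above can be sketched as follows. Only the environment-variable part is meant to run; the Trainer usage is commented out and assumed from the pytorch-lightning 0.6.0 API:

```python
import os

# Workaround: hide the faulty GPU0 from CUDA before torch/Lightning
# initialize any device context. This must run before `import torch`.
os.environ["CUDA_VISIBLE_DEVICES"] = "1,2,3,4,5,6,7"

# With only 7 devices visible, CUDA remaps them to indices 0-6, so the
# Trainer is given a device count instead of explicit indices:
# from pytorch_lightning import Trainer  # assumed API, v0.6.0
# trainer = Trainer(gpus=7)
```

The key detail is ordering: the variable must be set before CUDA is first initialized, otherwise the process can still create a context on every device.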

nvidia-smi with Trainer(gpus=[1,2,3,4,5,6,7]) and CUDA_VISIBLE_DEVICES unset

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     40155      C   python                                       719MiB |
|    1     40155      C   python                                      6003MiB |
|    2     40155      C   python                                      6019MiB |
|    3     40155      C   python                                      6019MiB |
|    4     40155      C   python                                      6019MiB |
|    5     40155      C   python                                      6019MiB |
|    6     40155      C   python                                      6019MiB |
|    7     40155      C   python                                      6019MiB |
+-----------------------------------------------------------------------------+

nvidia-smi with Trainer(gpus=7) and CUDA_VISIBLE_DEVICES=1,2,3,4,5,6,7

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    1     34452      C   python                                      6003MiB |
|    2     34452      C   python                                      6019MiB |
|    3     34452      C   python                                      6019MiB |
|    4     34452      C   python                                      6019MiB |
|    5     34452      C   python                                      6019MiB |
|    6     34452      C   python                                      6019MiB |
|    7     34452      C   python                                      6019MiB |
+-----------------------------------------------------------------------------+
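To check programmatically which GPUs a PID holds a context on, the Processes table above can be parsed with a small helper (a sketch that assumes the fixed-width layout shown in these outputs):

```python
import re

def gpus_for_pid(smi_output: str, pid: int) -> list:
    """Return the sorted GPU indices on which `pid` appears in the
    Processes table of `nvidia-smi` output (layout as shown above)."""
    indices = []
    for line in smi_output.splitlines():
        # Rows look like: "|    1     40155      C   python   6003MiB |"
        m = re.match(r"\|\s+(\d+)\s+(\d+)\s+\S+\s+", line)
        if m and int(m.group(2)) == pid:
            indices.append(int(m.group(1)))
    return sorted(indices)

# Against the first table above, PID 40155 appears on GPU0 as well,
# even though GPU0 was excluded in the Trainer arguments:
table = """\
|    0     40155      C   python                                       719MiB |
|    1     40155      C   python                                      6003MiB |
"""
print(gpus_for_pid(table, 40155))  # → [0, 1]
```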

Expected behavior

The process should run only on the specified GPUs, without requiring CUDA_VISIBLE_DEVICES to be set manually.

Environment

PyTorch version: 1.4.0
Is debug build: No
CUDA used to build PyTorch: 10.1

OS: Ubuntu 18.04.3 LTS
GCC version: (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
CMake version: version 3.10.2

Python version: 3.8
Is CUDA available: Yes
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: GeForce RTX 2080 Ti
GPU 1: GeForce RTX 2080 Ti
GPU 2: GeForce RTX 2080 Ti
GPU 3: GeForce RTX 2080 Ti
GPU 4: GeForce RTX 2080 Ti
GPU 5: GeForce RTX 2080 Ti
GPU 6: GeForce RTX 2080 Ti
GPU 7: GeForce RTX 2080 Ti

Nvidia driver version: 418.87.00
cuDNN version: Could not collect

Versions of relevant libraries:
[pip] numpy==1.18.1
[pip] pytorch-lightning==0.6.0
[pip] torch==1.4.0
[pip] torch-lr-finder==0.1.2
[pip] torchvision==0.5.0
[conda] blas                      1.0                         mkl
[conda] mkl                       2020.0                      166
[conda] mkl-service               2.3.0            py38he904b0f_0
[conda] mkl_fft                   1.0.15           py38ha843d7b_0
[conda] mkl_random                1.1.0            py38h962f231_0
[conda] pytorch                   1.4.0           py3.8_cuda10.1.243_cudnn7.6.3_0    pytorch
[conda] pytorch-lightning         0.6.0                    pypi_0    pypi
[conda] torch-lr-finder           0.1.2                    pypi_0    pypi
[conda] torchvision               0.5.0                py38_cu101    pytorch
