On a multi-GPU machine shared by multiple people running Python code, such as a university cluster (in my case, the cluster offered by my university, the University of Twente: https://jupyter.utwente.nl/), it is important to specify which GPU your code should use. By default, PyTorch-based Machine Learning and Deep Learning libraries/packages/frameworks, and TensorFlow as well, will use GPU 0, the first GPU available on your system.
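As a quick illustration of that default (a minimal sketch of my own, assuming PyTorch is installed and at least one GPU is visible): a tensor moved to "cuda" without an explicit index lands on device 0.

import torch

# Minimal sketch: moving a tensor to "cuda" without an explicit index
# places it on device 0, the default current device.
if torch.cuda.is_available():
    x = torch.randn(3, 3).to("cuda")    # same as .to("cuda:0")
    print(x.device)                     # -> cuda:0
    print(torch.cuda.current_device())  # -> 0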
However, if someone else is already using that GPU, you will run into an error like RuntimeError: CUDA out of memory. Tried to allocate X MiB.... Therefore, you can first run nvidia-smi from the command line to check which GPUs are busy and which are free.
If I run nvidia-smi
on the cluster, I get the following output:
(KNLP) jovyan@4a206bd95566:~/thesis$ nvidia-smi
Sat Sep  2 09:22:02 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.108.03    Driver Version: 510.108.03    CUDA Version: 11.6   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A16          On   | 00000000:1B:00.0 Off |                    0 |
|  0%   45C    P0    27W /  62W |  14808MiB / 15356MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A16          On   | 00000000:1C:00.0 Off |                    0 |
|  0%   48C    P0    27W /  62W |    233MiB / 15356MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A16          On   | 00000000:1D:00.0 Off |                    0 |
|  0%   40C    P0    26W /  62W |   5060MiB / 15356MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A16          On   | 00000000:1E:00.0 Off |                    0 |
|  0%   37C    P0    26W /  62W |    233MiB / 15356MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A16          On   | 00000000:CE:00.0 Off |                    0 |
|  0%   44C    P0    27W /  62W |    233MiB / 15356MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA A16          On   | 00000000:CF:00.0 Off |                    0 |
|  0%   47C    P0    27W /  62W |    233MiB / 15356MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA A16          On   | 00000000:D0:00.0 Off |                    0 |
|  0%   36C    P0    21W /  62W |    233MiB / 15356MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA A16          On   | 00000000:D1:00.0 Off |                    0 |
|  0%   35C    P0    27W /  62W |    233MiB / 15356MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
As you can see, the first GPU (GPU 0) is almost fully occupied by a process that I am not running myself; if I had started that process, it would appear at the bottom of the output under Processes:.
Thus, I have to change the default GPU used by my code to something other than 0, based on GPU memory availability. I use the following code snippet for that:
import torch

device_id = 2
print(f"You have {torch.cuda.device_count()} available GPUs")
print(f"Your current device ID is {torch.cuda.current_device()}")

torch.cuda.set_device(device_id)  # The device ID you want to use
print(f"Your new device ID is {torch.cuda.current_device()}")  # Verify that the chosen device is being used

"""
Output (in my case):
You have 8 available GPUs
Your current device ID is 0
Your new device ID is 2
"""
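Two alternatives worth knowing (a sketch of my own, not part of the snippet above): restricting which GPUs the process can see via the CUDA_VISIBLE_DEVICES environment variable, or passing an explicit device object around instead of relying on the global default.

import os

# Option 1: hide every GPU except GPU 2 from this process.
# Set this before the first CUDA call (ideally before importing torch);
# inside the process the chosen GPU then shows up as cuda:0.
os.environ["CUDA_VISIBLE_DEVICES"] = "2"

import torch

# Option 2: carry an explicit device object instead of relying on the default device.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
x = torch.randn(1, 10).to(device)
print(x.device)

The advantage of option 1 is that the restriction applies to everything running in the process that talks to the CUDA runtime, not just to PyTorch.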
Extra
I have written the following Python code to get information about the available GPUs, the current NVIDIA driver version, and the CUDA version installed on the system. To run it, you first need to install the nvidia-ml-py package (the Python bindings for the NVIDIA Management Library, maintained by NVIDIA) with pip install nvidia-ml-py.
from pynvml import *

# Initialize NVML and print the driver and CUDA versions
nvmlInit()
print(f"Driver Version: {nvmlSystemGetDriverVersion()}")
print(f"CUDA version {nvmlSystemGetCudaDriverVersion()}\n")

deviceCount = nvmlDeviceGetCount()

def convert_bytes_to_MB(value):
    return round(value / 1000000, 2)

# Print total/used/free memory for every GPU
for i in range(deviceCount):
    handle = nvmlDeviceGetHandleByIndex(i)
    print(f"Device {i} : {nvmlDeviceGetName(handle)}:")
    total_mem_label = "Total Memory(MB)"
    used_mem_label = "Used Memory(MB)"
    free_mem_label = "Free Memory(MB)"
    print(f"{total_mem_label:<20}{used_mem_label:<20}{free_mem_label:<20}")
    mem_inf = nvmlDeviceGetMemoryInfo(handle)
    total = convert_bytes_to_MB(mem_inf.total)
    used = convert_bytes_to_MB(mem_inf.used)
    free = convert_bytes_to_MB(mem_inf.free)
    print(f"{total:<20}{used:<20}{free:<20}\n")
The code above outputs the following information:
Driver Version: 510.108.03
CUDA version 11060

Device 0 : NVIDIA A16:
Total Memory(MB)    Used Memory(MB)     Free Memory(MB)
16101.93            15566.57            535.36

Device 1 : NVIDIA A16:
Total Memory(MB)    Used Memory(MB)     Free Memory(MB)
16101.93            716.64              15385.3

Device 2 : NVIDIA A16:
Total Memory(MB)    Used Memory(MB)     Free Memory(MB)
16101.93            716.64              15385.3

Device 3 : NVIDIA A16:
Total Memory(MB)    Used Memory(MB)     Free Memory(MB)
16101.93            716.64              15385.3

Device 4 : NVIDIA A16:
Total Memory(MB)    Used Memory(MB)     Free Memory(MB)
16101.93            716.64              15385.3

Device 5 : NVIDIA A16:
Total Memory(MB)    Used Memory(MB)     Free Memory(MB)
16101.93            716.64              15385.3

Device 6 : NVIDIA A16:
Total Memory(MB)    Used Memory(MB)     Free Memory(MB)
16101.93            716.64              15385.3

Device 7 : NVIDIA A16:
Total Memory(MB)    Used Memory(MB)     Free Memory(MB)
16101.93            716.64              15385.3
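Building on the same NVML bindings, you could also pick the GPU with the most free memory automatically instead of hard-coding a device ID. The helper below is my own sketch of that idea (not part of the original snippet); note that NVML and CUDA can enumerate devices in different orders on some systems, so treat it as a heuristic.

import torch
from pynvml import (
    nvmlInit,
    nvmlShutdown,
    nvmlDeviceGetCount,
    nvmlDeviceGetHandleByIndex,
    nvmlDeviceGetMemoryInfo,
)

def pick_freest_gpu() -> int:
    """Return the index of the GPU with the most free memory (simple heuristic)."""
    nvmlInit()
    try:
        free_mem = [
            nvmlDeviceGetMemoryInfo(nvmlDeviceGetHandleByIndex(i)).free
            for i in range(nvmlDeviceGetCount())
        ]
        return max(range(len(free_mem)), key=free_mem.__getitem__)
    finally:
        nvmlShutdown()

device_id = pick_freest_gpu()
torch.cuda.set_device(device_id)
print(f"Using GPU {device_id}")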
I hope you have learned something new 🙂