On a multi-GPU machine shared by multiple people running Python code, such as a university cluster (in my case, the cluster offered by my university, the University of Twente: https://jupyter.utwente.nl/), it is important to specify which GPU your code should use. By default, PyTorch-based Machine Learning and Deep Learning libraries/packages/frameworks, and TensorFlow as well, will use GPU 0, the first GPU available on your system.
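As a quick illustration of that default (a minimal sketch of my own, assuming PyTorch is installed and at least one GPU is visible): a tensor moved to "cuda" without an explicit index lands on device 0.

import torch

# Minimal sketch: moving a tensor to "cuda" without an explicit index
# places it on device 0, the default current device.
if torch.cuda.is_available():
    x = torch.randn(3, 3).to("cuda")    # same as .to("cuda:0")
    print(x.device)                     # -> cuda:0
    print(torch.cuda.current_device())  # -> 0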
However, if someone else is already using that GPU, you will run into an error like RuntimeError: CUDA out of memory. Tried to allocate X MiB.... Therefore, you can first run nvidia-smi from the command line to check which GPUs are busy and which are free.
If I run nvidia-smi
on the cluster, I get the following output:
(KNLP) jovyan@4a206bd95566:~/thesis$ nvidia-smi
Sat Sep  2 09:22:02 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.108.03    Driver Version: 510.108.03    CUDA Version: 11.6   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A16          On   | 00000000:1B:00.0 Off |                    0 |
|  0%   45C    P0    27W /  62W |  14808MiB / 15356MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A16          On   | 00000000:1C:00.0 Off |                    0 |
|  0%   48C    P0    27W /  62W |    233MiB / 15356MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A16          On   | 00000000:1D:00.0 Off |                    0 |
|  0%   40C    P0    26W /  62W |   5060MiB / 15356MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A16          On   | 00000000:1E:00.0 Off |                    0 |
|  0%   37C    P0    26W /  62W |    233MiB / 15356MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A16          On   | 00000000:CE:00.0 Off |                    0 |
|  0%   44C    P0    27W /  62W |    233MiB / 15356MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA A16          On   | 00000000:CF:00.0 Off |                    0 |
|  0%   47C    P0    27W /  62W |    233MiB / 15356MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA A16          On   | 00000000:D0:00.0 Off |                    0 |
|  0%   36C    P0    21W /  62W |    233MiB / 15356MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA A16          On   | 00000000:D1:00.0 Off |                    0 |
|  0%   35C    P0    27W /  62W |    233MiB / 15356MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
As you can see, the first GPU (GPU 0) is almost fully occupied by a process that I am not running myself; if I had started that process, it would appear at the bottom of the output under Processes:.
Thus, I have to change the default GPU used by my code to something other than 0, based on GPU memory availability. I use the following code snippet for that:
import torch

device_id = 2
print(f"You have {torch.cuda.device_count()} available GPUs")
print(f"Your current device ID is {torch.cuda.current_device()}")

torch.cuda.set_device(device_id)  # The device ID you want to use
print(f"Your new device ID is {torch.cuda.current_device()}")  # Verify that the chosen device is being used

"""
Output (in my case):
You have 8 available GPUs
Your current device ID is 0
Your new device ID is 2
"""
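Two alternatives worth knowing (a sketch of my own, not part of the snippet above): restricting which GPUs the process can see via the CUDA_VISIBLE_DEVICES environment variable, or passing an explicit device object around instead of relying on the global default.

import os

# Option 1: hide every GPU except GPU 2 from this process.
# Set this before the first CUDA call (ideally before importing torch);
# inside the process the chosen GPU then shows up as cuda:0.
os.environ["CUDA_VISIBLE_DEVICES"] = "2"

import torch

# Option 2: carry an explicit device object instead of relying on the default device.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
x = torch.randn(1, 10).to(device)
print(x.device)

The advantage of option 1 is that the restriction applies to everything running in the process that talks to the CUDA runtime, not just to PyTorch.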
Extra
I have written the following Python code to get information about the available GPUs, the current NVIDIA driver version, and the CUDA version installed on the system. To run it, you first need to install the nvidia-ml-py package (the Python bindings for the NVIDIA Management Library, maintained by NVIDIA) with pip install nvidia-ml-py.
from pynvml import *

# Initialize NVML and print the driver and CUDA versions
nvmlInit()
print(f"Driver Version: {nvmlSystemGetDriverVersion()}")
print(f"CUDA version {nvmlSystemGetCudaDriverVersion()}\n")

deviceCount = nvmlDeviceGetCount()

def convert_bytes_to_MB(value):
    return round(value / 1000000, 2)

# Print total/used/free memory for every GPU
for i in range(deviceCount):
    handle = nvmlDeviceGetHandleByIndex(i)
    print(f"Device {i} : {nvmlDeviceGetName(handle)}:")
    total_mem_label = "Total Memory(MB)"
    used_mem_label = "Used Memory(MB)"
    free_mem_label = "Free Memory(MB)"
    print(f"{total_mem_label:<20}{used_mem_label:<20}{free_mem_label:<20}")
    mem_inf = nvmlDeviceGetMemoryInfo(handle)
    total = convert_bytes_to_MB(mem_inf.total)
    used = convert_bytes_to_MB(mem_inf.used)
    free = convert_bytes_to_MB(mem_inf.free)
    print(f"{total:<20}{used:<20}{free:<20}\n")
The code above outputs the following information:
Driver Version: 510.108.03
CUDA version 11060

Device 0 : NVIDIA A16:
Total Memory(MB)    Used Memory(MB)     Free Memory(MB)
16101.93            15566.57            535.36

Device 1 : NVIDIA A16:
Total Memory(MB)    Used Memory(MB)     Free Memory(MB)
16101.93            716.64              15385.3

Device 2 : NVIDIA A16:
Total Memory(MB)    Used Memory(MB)     Free Memory(MB)
16101.93            716.64              15385.3

Device 3 : NVIDIA A16:
Total Memory(MB)    Used Memory(MB)     Free Memory(MB)
16101.93            716.64              15385.3

Device 4 : NVIDIA A16:
Total Memory(MB)    Used Memory(MB)     Free Memory(MB)
16101.93            716.64              15385.3

Device 5 : NVIDIA A16:
Total Memory(MB)    Used Memory(MB)     Free Memory(MB)
16101.93            716.64              15385.3

Device 6 : NVIDIA A16:
Total Memory(MB)    Used Memory(MB)     Free Memory(MB)
16101.93            716.64              15385.3

Device 7 : NVIDIA A16:
Total Memory(MB)    Used Memory(MB)     Free Memory(MB)
16101.93            716.64              15385.3
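Building on the same NVML bindings, you could also pick the GPU with the most free memory automatically instead of hard-coding a device ID. The helper below is my own sketch of that idea (not part of the original snippet); note that NVML and CUDA can enumerate devices in different orders on some systems, so treat it as a heuristic.

import torch
from pynvml import (
    nvmlInit,
    nvmlShutdown,
    nvmlDeviceGetCount,
    nvmlDeviceGetHandleByIndex,
    nvmlDeviceGetMemoryInfo,
)

def pick_freest_gpu() -> int:
    """Return the index of the GPU with the most free memory (simple heuristic)."""
    nvmlInit()
    try:
        free_mem = [
            nvmlDeviceGetMemoryInfo(nvmlDeviceGetHandleByIndex(i)).free
            for i in range(nvmlDeviceGetCount())
        ]
        return max(range(len(free_mem)), key=free_mem.__getitem__)
    finally:
        nvmlShutdown()

device_id = pick_freest_gpu()
torch.cuda.set_device(device_id)
print(f"Using GPU {device_id}")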
I hope you have learned something new 🙂