To run multiple GPU jobs—each correctly mapped to its assigned GPU—on a single HTCondor node, here’s what to change:
TPV changes:

params:
  request_gpus: "{gpus or 0}"

Example conf (tools.yml):

params:
  docker_run_extra_arguments: ' --gpus all --env CUDA_VISIBLE_DEVICES=$_CONDOR_AssignedGPUs '

Example conf (tools.yml):

params:
  singularity_run_extra_arguments: ' --nv --env CUDA_VISIBLE_DEVICES=$_CONDOR_AssignedGPUs '

See the Singularity doc.
HTCondor config changes:

GPU_DISCOVERY_EXTRA = -extra -divide <N>

where N is an integer (the GPU memory will be divided equally between the slots), or

GPU_DISCOVERY_EXTRA = -extra -by-index

Use -by-index. See the GPU discovery command doc.
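To preview what the startd will advertise with either flag, the GPU discovery tool can be run by hand on the worker node (it typically lives in HTCondor's libexec directory; the exact path and output format vary by installation):

# Default report: UUID-based GPU IDs
/usr/libexec/condor/condor_gpu_discovery -extra
# With -by-index, GPUs are reported by index instead of UUID
/usr/libexec/condor/condor_gpu_discovery -extra -by-index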
Here, I describe how to configure and use multiple GPUs on a single worker node within an HTCondor compute environment. I outline my attempts at HTCondor configuration changes, the behavior of partitionable slots, troubleshooting steps, and the final solution for ensuring GPU visibility within jobs.
On a GPU worker node, update the HTCondor configuration to export the necessary GPU-related environment variables:
ENVIRONMENT_FOR_AssignedGPUs = CUDA_VISIBLE_DEVICES, GPU_DEVICE_ORDINAL
Apply the configuration change:
condor_reconfig
# and/or
systemctl restart condor
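To confirm that the knob is active after the reconfig, query it with condor_config_val on the worker node:

# Should print: CUDA_VISIBLE_DEVICES, GPU_DEVICE_ORDINAL
condor_config_val ENVIRONMENT_FOR_AssignedGPUs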
In the job submit file, request GPU resources as needed:
request_gpus = 1
This ensures that HTCondor allocates one GPU per job.
HTCondor advertises all available GPUs as part of a partitionable slot.
Example output from a GPU host with 4 GPUs:
condor_status -l tstgpu.bi.privat | grep -i gpu
AssignedGPUs = "GPU-b156e653,GPU-b2c83767,GPU-c62b119c,GPU-e58c2e11"
AvailableGPUs = { GPUs_GPU_b156e653,GPUs_GPU_b2c83767,GPUs_GPU_c62b119c,GPUs_GPU_e58c2e11 }
This reflects a single partitionable slot encompassing all 4 GPUs.
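A quick way to watch the partitionable slot as jobs claim GPUs is to print the relevant attributes directly (attribute names as in the listing above; GPUs is, roughly, the count not yet handed out to dynamic slots):

# One line per partitionable slot: name, remaining GPU count, assigned GPU IDs
condor_status -constraint 'PartitionableSlot' -af Name GPUs AssignedGPUs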
When a job is submitted with request_gpus = 1:
- One of the advertised GPUs is assigned to the job (e.g., GPU-b156e653).
- The AssignedGPUs attribute determines the exact GPU UUID.
- HTCondor sets the specified environment variables (CUDA_VISIBLE_DEVICES, etc.) accordingly, if configured correctly.
I first tried dividing the GPUs via the discovery options:

GPU_DISCOVERY_EXTRA = $(GPU_DISCOVERY_EXTRA) -divide 2

This did not behave as intended:
- Jobs still see all GPUs when running nvidia-smi, which is unexpected.
- $CUDA_VISIBLE_DEVICES appears empty or incorrect.
- The assigned GPU IDs are UUID-based (e.g., GPU-b156e653) rather than index-based (CUDA0, CUDA1, etc.), which does not map cleanly onto CUDA_VISIBLE_DEVICES.
- The environment variables (CUDA_VISIBLE_DEVICES) were populated with truncated or invalid values (e.g., 156653 from GPU-b156e653).

Check individual job ClassAds to confirm GPU assignment:
condor_q -l <job_id> | grep -i assigned
# Example output:
AssignedGPUs = "GPU-b156e653"
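To cross-check a running job, one can also look up which dynamic slot it landed on and confirm that slot's GPU assignment (the slot name below is illustrative):

# Slot the job is running in (e.g., slot1_2@tstgpu.bi.privat)
condor_q <job_id> -af RemoteHost
# GPU owned by that dynamic slot
condor_status -l "slot1_2@tstgpu.bi.privat" | grep -i AssignedGPUs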
The final working solution was to only set:
ENVIRONMENT_FOR_AssignedGPUs = CUDA_VISIBLE_DEVICES
Then, inside the job script, explicitly export the proper environment variable using the HTCondor-published _CONDOR_AssignedGPUs
variable:
export CUDA_VISIBLE_DEVICES=$_CONDOR_AssignedGPUs
HTCondor automatically publishes the $_CONDOR_AssignedGPUs environment variable, which contains the value of the slot’s AssignedGPUs attribute.
Finally, I submitted 5 GPU jobs (the job files and scripts can be found below and on the NFS path /data/misc06/test_pxe_gpu_jobs).
As expected, I can see that four jobs are running and one is idle; each of the four running jobs uses its own assigned GPU.
Further, the job output files (a few print and echo statements in the example script) show that each job is assigned only one GPU and that it uses that GPU:
- _CONDOR_AssignedGPUs contains the exact value of the slot's AssignedGPUs attribute (e.g., GPU-b156e653).
- CUDA_VISIBLE_DEVICES is set to the same value by the export in gpu_test.sh.

Example submit file:
universe = vanilla
executable = /data/misc06/test_pxe_gpu_jobs/gpu_test.sh
output = /data/misc06/test_pxe_gpu_jobs/job1.out
error = /data/misc06/test_pxe_gpu_jobs/job1.err
log = /data/misc06/test_pxe_gpu_jobs/job1.log
requirements = Machine == "tstgpu.bi.privat"
request_cpus = 2
request_memory = 2GB
request_GPUs = 1
queue 1
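A job using this file is then submitted and monitored in the usual way (the submit-file name here is just an assumed example):

# Submit the job and watch the queue; with four GPUs on the node,
# at most four such jobs run at the same time.
condor_submit job1.submit
condor_q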
gpu_test.sh
#!/bin/bash
echo "[$(date)] Starting job on host: $(hostname)"
echo "Raw Assigned GPU(s): $_CONDOR_AssignedGPUs"
echo "Assigned GPU(s): $CUDA_VISIBLE_DEVICES"
export CUDA_VISIBLE_DEVICES=$_CONDOR_AssignedGPUs
/data/misc06/test_pxe_gpu_jobs/tf-cuda-venv/bin/python3 /data/misc06/test_pxe_gpu_jobs/gpu_test_tf.py
echo "[$(date)] Job finished."
gpu_test_tf.py
# gpu_test_tf.py
import sys
import time

import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
if not gpus:
    print("No GPU found.")
    sys.exit(1)

# Log which GPU(s) TensorFlow sees
print("Visible GPU(s) to TensorFlow:", gpus)
print("Using GPU(s):", gpus)

# Create two large constant tensors
a = tf.random.normal([4096, 4096])
b = tf.random.normal([4096, 4096])

@tf.function
def matrix_multiply():
    return tf.matmul(a, b)

start = time.time()
print("Starting TensorFlow GPU workload (~100 seconds)...")

# Run matrix multiplication in a loop for ~100 seconds
while time.time() - start < 100:
    result = matrix_multiply()
    _ = result.numpy()  # Force evaluation on GPU

print("Workload complete.")
I ran one more test in which I removed ENVIRONMENT_FOR_AssignedGPUs = CUDA_VISIBLE_DEVICES from the HTCondor worker config. Without this setting, the variable CUDA_VISIBLE_DEVICES actually gets populated correctly (jobs see GPU-b156e653 instead of the truncated 156653), unlike the behavior described above, which had warranted the dedicated export CUDA_VISIBLE_DEVICES=$_CONDOR_AssignedGPUs in the test script.
Output from my latest test run (I submitted six jobs; four of them ran, and each job had access to its own individual GPU). I had added the following to the bash script (see the issue above for details):
echo "Raw Assigned GPU(s): $_CONDOR_AssignedGPUs"
echo "CUDA_VISIBLE_DEVICES: $CUDA_VISIBLE_DEVICES"
The corresponding output:

Raw Assigned GPU(s): GPU-b156e653
CUDA_VISIBLE_DEVICES: GPU-b156e653
This indicates that neither modifications to the HTCondor worker config nor additional handling of environment variables in the scripts are needed.
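With that finding, the explicit export in gpu_test.sh becomes redundant; a minimal sketch of the simplified job script (same paths as above) could look like this:

#!/bin/bash
# Simplified gpu_test.sh: relies on HTCondor populating CUDA_VISIBLE_DEVICES
# itself, so no explicit export from $_CONDOR_AssignedGPUs is needed.
echo "[$(date)] Starting job on host: $(hostname)"
echo "Raw Assigned GPU(s): $_CONDOR_AssignedGPUs"
echo "CUDA_VISIBLE_DEVICES: $CUDA_VISIBLE_DEVICES"
/data/misc06/test_pxe_gpu_jobs/tf-cuda-venv/bin/python3 /data/misc06/test_pxe_gpu_jobs/gpu_test_tf.py
echo "[$(date)] Job finished."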
With Docker, the containers are started with --gpus all, and Docker seems to handle this differently: the container sees all 4 GPUs when running nvidia-smi from within the container, and all the jobs end up using GPU index 0. Adding --env CUDA_VISIBLE_DEVICES=$_CONDOR_AssignedGPUs to the Docker run arguments makes the jobs use only their assigned GPUs. Note: in theory, one could also pass the assigned GPU ID directly to Docker's --gpus option, e.g., --gpus "device=$_CONDOR_AssignedGPUs". However, due to complex shell quoting and variable interpolation challenges, this approach seems cumbersome and error-prone. Using the environment variable to control GPU visibility is simpler and more robust (a sketch of both variants follows below).

One can also use -divide <N> to split each of the 4 GPUs into N slots (N = 3 in this example, so we get 12 GPU slots). The HTCondor config for this is: GPU_DISCOVERY_EXTRA = -extra -divide 3.
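To make the Docker note above concrete, here is a minimal sketch of the two variants run by hand (the CUDA image name is only a placeholder; whether --gpus accepts the short UUID form may depend on the NVIDIA container toolkit version):

# Variant used here: expose all GPUs, restrict CUDA applications to the
# assigned one via CUDA_VISIBLE_DEVICES. Note that nvidia-smi ignores
# CUDA_VISIBLE_DEVICES and will still list all GPUs.
docker run --rm --gpus all \
    --env CUDA_VISIBLE_DEVICES="$_CONDOR_AssignedGPUs" \
    nvidia/cuda:12.2.0-base-ubuntu22.04 \
    bash -c 'echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"; nvidia-smi -L'

# Alternative (not used): pass the assigned UUID directly to --gpus; here
# only that device is mounted into the container. Quoting becomes fragile
# once this string is embedded in TPV/Galaxy configuration values.
docker run --rm --gpus "device=$_CONDOR_AssignedGPUs" \
    nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi -L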