The techniques described here depend on HTCondor version 8.1.4 or later. Those running older versions may try an older, less flexible technique, described at HowToManageGpusInSeriesSeven.
The general technique to manage GPUs has 3 steps:
- Advertise the GPU: configure HTCondor so that an execute node includes information about its available GPUs in its machine ClassAd.
- Request a GPU: the job requests a GPU, adding any further requirements it has, in order to acquire a suitable GPU.
- Identify the GPU: the job learns which GPU it may use through its arguments or environment.
1. Advertise the GPU
The availability of GPU resources must be advertised in the machine's ClassAd, so that jobs that need GPUs can be matched with machines that have GPUs. As of HTCondor version 8.2.0 (the feature first appeared in 8.1.6), the detection and advertisement of GPUs is automated by adding a single line, representing a metaknob, to the configuration of the execute node.
use feature : GPUs
Advertise additional attributes of the GPUs by also setting
GPU_DISCOVERY_EXTRA = -extra
The feature:GPUs metaknob will invoke the condor_gpu_discovery tool to generate and populate the machine ClassAd for a custom resource identified by the GPUs tag within ClassAd attribute names.
The HTCondor condor_gpu_discovery tool is designed to assist in detecting GPUs and in providing details that help to set up the advertisement of GPU information. This tool detects CUDA and OpenCL devices, and it outputs a list of GPU identifiers for all detected devices.
HTCondor has a general mechanism for declaring user-defined slot resources. GPUs are a user-defined slot resource, so this same mechanism is used to define them. The metaknob always uses the resource type name GPUs. This resource type name is case insensitive, but all characters within the name are significant, so be consistent.
What the metaknob configures
The feature:GPUs metaknob defines a custom resource with this configuration of the execute node:
MACHINE_RESOURCE_INVENTORY_GPUs = $(LIBEXEC)/condor_gpu_discovery -properties
ENVIRONMENT_FOR_AssignedGPUs = GPU_DEVICE_ORDINAL=/(CUDA|OCL)// CUDA_VISIBLE_DEVICES
MACHINE_RESOURCE_INVENTORY_GPUs tells HTCondor to run the condor_gpu_discovery tool, and use its output to define a custom resource with the resource tag GPUs.
ENVIRONMENT_FOR_AssignedGPUs tells HTCondor to publish the value of machine ClassAd attribute AssignedGPUs for a slot in the job's environment using the environment variables GPU_DEVICE_ORDINAL and CUDA_VISIBLE_DEVICES. In addition, AssignedGPUs will always be published into the job's environment as _CONDOR_AssignedGPUs.
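The /(CUDA|OCL)// part of the GPU_DEVICE_ORDINAL setting reads like a Perl-style substitution that strips the library prefix from each GPU id. The following Python sketch illustrates that reading. It assumes the substitution is applied to each id separately and that ids are comma separated; the function name to_env_value is hypothetical and is not part of HTCondor.

```python
import re


def to_env_value(assigned_gpus: str, pattern: str = r"(CUDA|OCL)", repl: str = "") -> str:
    """Apply a substitution to each GPU id and rejoin the ids with commas.

    Assumption: this mirrors a Perl-style s/(CUDA|OCL)// applied per id.
    """
    ids = [gpu_id.strip() for gpu_id in assigned_gpus.split(",") if gpu_id.strip()]
    return ",".join(re.sub(pattern, repl, gpu_id) for gpu_id in ids)


# A slot with AssignedGPUs = "CUDA0,CUDA1" would, under this reading,
# give the job GPU_DEVICE_ORDINAL=0,1
print(to_env_value("CUDA0,CUDA1"))  # prints 0,1
```

Under this assumption, a job sees plain device numbers in GPU_DEVICE_ORDINAL, while _CONDOR_AssignedGPUs keeps the full ids.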
The output of the condor_gpu_discovery tool reports DetectedGPUs and lists the GPU id of each one. GPU ids will be CUDA<n> or OCL<n>, where <n> is an integer, and CUDA or OCL indicates whether the CUDA library or the OpenCL library is used to communicate with the device.
The -properties argument in the condor_gpu_discovery command tells it to also list significant attributes of the device(s). These attributes will then be published in each slot ClassAd.
Here is typical output of condor_gpu_discovery:
> condor_gpu_discovery -properties
DetectedGPUs="CUDA0, CUDA1, CUDA2, CUDA3"
CUDACapability=3.0
CUDADeviceName="GeForce GTX 690"
CUDADriverVersion=5.50
CUDAECCEnabled=false
CUDAGlobalMemory=2048
CUDARuntimeVersion=5.0
This output indicates that 4 GPUs were detected, all of which have the same properties.
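To inspect this output from a script, it can be read as a set of name=value pairs. The sketch below is a minimal illustration in Python, assuming the tool prints one pair per line as in the sample above; the function name parse_discovery_output is hypothetical, and the sample text is copied from this page rather than captured from a live machine.

```python
def parse_discovery_output(text: str) -> dict:
    """Parse name=value lines into a dict, dropping surrounding double quotes."""
    attrs = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue  # skip blank lines and anything that is not a pair
        name, _, value = line.partition("=")
        attrs[name.strip()] = value.strip().strip('"')
    return attrs


# Sample output copied from the example on this page.
SAMPLE = """\
DetectedGPUs="CUDA0, CUDA1, CUDA2, CUDA3"
CUDACapability=3.0
CUDADeviceName="GeForce GTX 690"
CUDADriverVersion=5.50
CUDAECCEnabled=false
CUDAGlobalMemory=2048
CUDARuntimeVersion=5.0
"""

attrs = parse_discovery_output(SAMPLE)
gpu_ids = [gpu_id.strip() for gpu_id in attrs["DetectedGPUs"].split(",")]
print(len(gpu_ids), attrs["CUDADeviceName"])  # prints 4 GeForce GTX 690
```

All values are kept as strings here; a caller that needs numbers would convert them explicitly.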
Extra configuration
If using a static slot configuration, to control how many GPUs are assigned to each slot, use the SLOT_TYPE_<n> configuration syntax to specify Gpus, the same as would be done for Cpus or Memory. If not specified, slots default to GPUS=auto, which assigns GPUs proportionally to slots until there are no more GPUs to assign, and then assigns 0 GPUs to the remaining slots. So a machine with NUM_CPUS=8 and DetectedGPUs="CUDA0, CUDA1, CUDA2, CUDA3" will assign 1 GPU each to the first 4 slots, and no GPUs to the remaining slots. Slot ClassAds with GPUs assigned will include the following attributes:
Cpus=1
GPUs=1
TotalCpus=8
TotalGPUs=4
TotalSlotCpus=1
TotalSlotGPUs=1
CUDACapability=3.0
CUDADeviceName="GeForce GTX 690"
CUDADriverVersion=5.50
CUDAECCEnabled=false
CUDAGlobalMemory=2048
CUDARuntimeVersion=5.0
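The GPUS=auto example above can be modeled with a few lines of Python. This is only a sketch of one reading of "proportionally", chosen because it reproduces the 8-slot, 4-GPU example; it is not HTCondor's actual allocation code. The function name auto_assign and the rounding rule (each slot takes the ceiling of GPUs divided by slots) are assumptions.

```python
import math


def auto_assign(gpu_ids, num_slots):
    """Hand out GPU ids to slots in order until none remain.

    Assumption: each slot takes ceil(len(gpu_ids) / num_slots) ids while
    ids are available, and later slots get an empty list.
    """
    share = math.ceil(len(gpu_ids) / num_slots) if num_slots else 0
    remaining = list(gpu_ids)
    slots = []
    for _ in range(num_slots):
        take, remaining = remaining[:share], remaining[share:]
        slots.append(take)
    return slots


slots = auto_assign(["CUDA0", "CUDA1", "CUDA2", "CUDA3"], 8)
for number, gpus in enumerate(slots, start=1):
    print(f"slot{number}: GPUs={len(gpus)} {gpus}")
```

With 8 slots and 4 GPUs, this model gives one GPU to each of the first 4 slots and none to the rest, matching the behavior described above.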
When using partitionable slots, by default the partitionable slot will be assigned all GPUs. Dynamic slots created from the partitionable slot will be assigned GPUs when the job requests them.
2. A job requests a GPU
User jobs that require a GPU must specify this requirement. In a job's submit description file, the simple request is
request_GPUs = 1
A more complex request, such as:
request_GPUs = 2
requirements = CUDARuntimeVersion >= 5.5 \
    && (CUDACapability >= 3.0) \
    && (CUDAGlobalMemoryMb >= 1500)
specifies that the job requests 2 GPUs and requires CUDA GPUs with at least 1500 MB of memory, CUDA runtime version 5.5 or later, and a CUDA Capability of 3.0 or greater.
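When many jobs share the same kind of GPU request, the submit description can be generated rather than written by hand. The following Python sketch builds one from the attribute names used in the request above. The function name gpu_submit_description and the executable name my_gpu_job are hypothetical placeholders; the attribute names are taken from this page, so adjust them to whatever the slot ClassAds on your pool actually advertise.

```python
def gpu_submit_description(executable, gpus, min_runtime=None,
                           min_capability=None, min_memory_mb=None):
    """Return the text of a submit description file that requests GPUs."""
    clauses = []
    if min_runtime is not None:
        clauses.append(f"(CUDARuntimeVersion >= {min_runtime})")
    if min_capability is not None:
        clauses.append(f"(CUDACapability >= {min_capability})")
    if min_memory_mb is not None:
        clauses.append(f"(CUDAGlobalMemoryMb >= {min_memory_mb})")

    lines = [f"executable = {executable}", f"request_GPUs = {gpus}"]
    if clauses:
        # Only emit a requirements line when there is something to require.
        lines.append("requirements = " + " && ".join(clauses))
    lines.append("queue")
    return "\n".join(lines) + "\n"


# "my_gpu_job" is a placeholder executable name.
print(gpu_submit_description("my_gpu_job", 2, min_runtime=5.5,
                             min_capability=3.0, min_memory_mb=1500))
```

The generated text can be written to a file and passed to condor_submit in the usual way.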
3. Identify the GPU
Once a job matches to a given slot, it needs to know which GPUs to use, if multiple are present. The GPUs that the job is permitted to use are listed in the slot ClassAd attribute AssignedGPUs. They are also published into the job's environment with variable _CONDOR_AssignedGPUs. In addition, if ENVIRONMENT_FOR_AssignedGPUs is set in the configuration, environment variables CUDA_VISIBLE_DEVICES and GPU_DEVICE_ORDINAL are published. The AssignedGPUs attribute value can also be passed to the job as arguments using the $$() substitution macro syntax. For example, if the job takes an argument "--device=X" where X is the device to use, specify this in the submit description file with
arguments = "--device=$$(AssignedGPUs)"
Alternatively, the job might look to the environment variable _CONDOR_AssignedGPUs, or CUDA_VISIBLE_DEVICES.
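Inside the job, reading these variables might look like the following Python sketch. It checks _CONDOR_AssignedGPUs first and falls back to CUDA_VISIBLE_DEVICES; the function name assigned_gpus is hypothetical, and the comma-separated format is an assumption. Note that the two variables may not use the same form of id (for example CUDA0 versus 0), so the job should be prepared for either.

```python
import os


def assigned_gpus(environ=None):
    """Return the GPU ids assigned to this job, or an empty list.

    Assumption: the value is a comma-separated list of ids.
    """
    env = os.environ if environ is None else environ
    for name in ("_CONDOR_AssignedGPUs", "CUDA_VISIBLE_DEVICES"):
        value = env.get(name, "").strip()
        if value:
            return [item.strip() for item in value.split(",") if item.strip()]
    return []


if __name__ == "__main__":
    gpus = assigned_gpus()
    print("GPUs assigned to this job:", gpus if gpus else "none found")
```

Passing a dictionary instead of the real environment makes the function easy to exercise outside of HTCondor.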