We are developing a way to run targeted CHTC jobs on the Cooley cluster at ALCF (Argonne Leadership Computing Facility). This is done via "hobble-in", a human-assisted form of glide-in that submits HTCondor startds as user jobs to Cooley's batch scheduler, Cobalt. The setup mimics how condor annex allocates resources from AWS to run CHTC jobs (see UsingCondorAnnexOnChtc).
Each machine in Cooley has the following resources:
- Two 2.4 GHz Intel Haswell E5-2620 v3 processors (6 cores per CPU, 12 cores total)
- One NVIDIA Tesla K80 (with two GPUs)
- 384 GB RAM, 24 GB GPU RAM (12 GB per GPU)
- FDR Infiniband interconnect
- 345 GB local scratch space
- Red Hat Enterprise Linux Server release 6.8 (Santiago)
As with other resources outside of CHTC, jobs do not run on Cooley by default. They must meet certain conditions regarding where and how they are submitted.
Jobs that want to be able to run on Cooley need to be submitted to a schedd configured to flock to the CHTC Condor Annex collector. Currently, that means submitting from these machines:
The jobs' resource requests must not exceed the available resources of a Cooley machine (detailed above).
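The resource limit can be checked before submission. The following is a minimal sketch, not part of the hobble-in tooling: the names COOLEY_LIMITS and exceeds_cooley are hypothetical, and the limits are the nominal hardware figures listed above (the memory a startd actually advertises may be somewhat lower).

```python
# Nominal per-machine limits, copied from the hardware list above.
# Memory and disk are in GB.
COOLEY_LIMITS = {"cpus": 12, "memory_gb": 384, "gpus": 2, "disk_gb": 345}


def exceeds_cooley(request):
    """Return the names of requested resources that exceed one Cooley machine.

    `request` maps resource names (the keys of COOLEY_LIMITS) to requested
    amounts. Resources missing from the request count as zero.
    """
    return [
        name
        for name, limit in COOLEY_LIMITS.items()
        if request.get(name, 0) > limit
    ]


if __name__ == "__main__":
    # A job asking for 16 cores cannot match any single Cooley machine.
    print(exceeds_cooley({"cpus": 16, "memory_gb": 64}))
```

An empty list means the request fits on one machine under these nominal limits.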
The jobs must indicate that they want to run on resources outside of CHTC and that they're willing to run on Cooley specifically. To do so, the user must add the following lines to their submit file:
+WantFlocking = True
+MayUseCooley = True
If the user wants the jobs to run only on Cooley, they can add a clause to the Requirements expression:
requirements = Facility =?= "Cooley"
If the user wants the jobs to run only on a specific Cooley annex, they can add a different clause to the Requirements expression:
requirements = AnnexName =?= "Test1"
These two additions to the Requirements expression can be used individually or together.
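When both clauses are used together, they are joined with the ClassAd logical AND operator (&&). The sketch below builds the submit-file line for either case. It assumes the attribute values are quoted strings such as "Cooley" and "Test1"; the function name cooley_requirements is hypothetical.

```python
def cooley_requirements(facility=None, annex_name=None):
    """Build a submit-file `requirements` line from the optional clauses.

    Returns an empty string when neither clause is requested, so the caller
    can skip the line entirely.
    """
    clauses = []
    if facility is not None:
        clauses.append(f'(Facility =?= "{facility}")')
    if annex_name is not None:
        clauses.append(f'(AnnexName =?= "{annex_name}")')
    if not clauses:
        return ""
    # && is the ClassAd logical AND: every clause must hold for a match.
    return "requirements = " + " && ".join(clauses)


if __name__ == "__main__":
    print(cooley_requirements(facility="Cooley"))
    print(cooley_requirements(facility="Cooley", annex_name="Test1"))
```

If the submit file already has a requirements expression, the generated clauses would need to be combined with it using && rather than added as a second requirements line.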
Submitting jobs to Cooley is done from the login machines (hostname cooley.alcf.anl.gov) using the Cobalt batch scheduler. Access to these machines is via ssh with two-factor authentication. Thus, submission and management of Cooley jobs must be done by a human who has a login account.
Before submitting Cooley jobs, a CHTC staff member must configure the system so that the proper CUDA libraries can be found by HTCondor and user jobs. To do so, they must create the file
with the following contents:
+mvapich2
+cuda-9.0.176
@default
Once logged into a Cooley login machine, a CHTC staff member can submit a new hobble-in job with a command like this:
% /home/jfrey/hobblein/bin/hobblein_submit 2 24:00:00 Test1 jfrey
1337326
%
This will submit a hobble-in job that uses 2 machines, runs for 24 hours, has an annex name of Test1, and will only run jobs owned by user jfrey. If the username argument is omitted, then the hobble-in will run jobs from any user.
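The argument order can be captured in a small helper that assembles the command line. This is a sketch only: the positional order (machine count, wall time, annex name, optional username) is inferred from the example above, the HH:MM:SS check is an assumption about the accepted wall-time format, and hobblein_command is a hypothetical name.

```python
import re

# Path taken from the example command above.
HOBBLEIN_SUBMIT = "/home/jfrey/hobblein/bin/hobblein_submit"


def hobblein_command(nodes, walltime, annex_name, owner=None):
    """Return the argument list for a hobblein_submit invocation.

    `owner` is optional; when omitted, the hobble-in runs jobs from any user.
    """
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    # Assumed wall-time format: hours:minutes:seconds, e.g. 24:00:00.
    if not re.fullmatch(r"\d+:\d{2}:\d{2}", walltime):
        raise ValueError("walltime must look like HH:MM:SS")
    argv = [HOBBLEIN_SUBMIT, str(nodes), walltime, annex_name]
    if owner is not None:
        argv.append(owner)
    return argv


if __name__ == "__main__":
    # Prints the command from the example, one shell word per list element.
    print(" ".join(hobblein_command(2, "24:00:00", "Test1", "jfrey")))
```

The helper only builds the argument list; it still has to be run by a logged-in staff member on a Cooley login machine.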
The command will print out the Cobalt job id (1337326 in this case). The status of the job can be checked via Cobalt with this command:
% qstat
JobID    User   WallTime  Nodes  State    Location
=======================================================
1337326  jfrey  00:10:00  2      running  cc024,cc124
%
You can check the status of all jobs you've submitted as well:
% qstat -u jfrey
JobID    User   WallTime  Nodes  State     Location
========================================================
1337314  jfrey  00:10:00  2      exiting   cc028,cc031
1337326  jfrey  00:10:00  2      starting  cc024,cc124
%
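For scripted monitoring, the qstat listing can be turned into records. The following is a minimal sketch that assumes the six-column layout shown in the examples above; parse_qstat is a hypothetical helper, not part of Cobalt, and other Cobalt configurations may print different columns.

```python
def parse_qstat(text):
    """Parse qstat output into a list of job dictionaries.

    Only rows with six whitespace-separated fields and a numeric job id are
    treated as jobs, which skips the prompt, the header, and the rule line.
    """
    jobs = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 6 and fields[0].isdigit():
            job_id, user, walltime, nodes, state, location = fields
            jobs.append(
                {
                    "job_id": job_id,
                    "user": user,
                    "walltime": walltime,
                    "nodes": int(nodes),
                    "state": state,
                    # Location is a comma-separated list of node names.
                    "location": location.split(","),
                }
            )
    return jobs


if __name__ == "__main__":
    sample = "1337326 jfrey 00:10:00 2 running cc024,cc124"
    for job in parse_qstat(sample):
        print(job["job_id"], job["state"], job["location"])
```

A job whose state is running should correspond to machine ads appearing in the annex collector shortly afterward.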
You can also delete a job from Cobalt:
% qdel 1337326
      Deleted Jobs
JobID    User
================
1337326  jfrey
%
While the hobble-in job is running at Cooley, machine ads will appear in the CHTC annex collector. They can be queried using the annex name specified to hobblein_submit.
% condor_status -pool annex-cm.chtc.wisc.edu -annex Test1
Name                            OpSys  Arch    State      Activity  LoadAv  Mem

firstname.lastname@example.org  LINUX  X86_64  Unclaimed  Idle      0.000   387137
email@example.com               LINUX  X86_64  Unclaimed  Idle      0.000   387137

              Machines  Owner  Claimed  Unclaimed  Matched  Preempting  Drain

X86_64/LINUX         2      0        0          2        0           0      0
       Total         2      0        0          2        0           0      0
%
The startds can be shut down in the same way as for a regular annex. Due to the current security configuration, this must be done as the root user on annex-cm.chtc.wisc.edu.
# condor_off -annex Test1
Sent "Kill-Daemon" command for "master" to master firstname.lastname@example.org
Sent "Kill-Daemon" command for "master" to master email@example.com
#