How to load balance users to one of many submit nodes
Known to work with HTCondor 8.6
It is common for users of HPC systems to have a shared file system and a pool of submit servers, with users assigned to a server dynamically and transparently based on load. It is transparent to users because all of their files are on a shared file system, so in many cases they need not know which machine is running their jobs.
This use model can be approximated with HTCondor, but there are some differences which should be noted.
The HTCondor job queue is not shared between submit nodes, but any node can query any other node's job queue by setting configuration parameters or using command-line arguments. It is also possible to query all job queues with a single command, although this is not fully transparent to the user.
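For example, condor_q's -name option queries the queue of one specific schedd, and -global queries every queue in the pool (the schedd name below is illustrative):

```
$ condor_q -name submit-4.chtc.wisc.edu   # the queue of one specific schedd
$ condor_q -global                        # every schedd's queue in the pool
```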
- other differences?
The basic strategy is to use HTCondor's ability to customize the configuration per user, so that a given user's jobs always go to a specific schedd regardless of which submit machine the user is currently logged in to. This can be almost completely transparent to the user if all of the user's files are on a shared file system.
The administrator, or the user, will add a file at ~/.condor/user_config (the default location of the per-user configuration file). That file will set the configuration variable SCHEDD_HOST to the name of the desired schedd. That schedd name is the same name that you would see in the output of the command condor_status -schedd, e.g.:
    Name                      Machine                   RunningJobs  IdleJobs  HeldJobs
    submit-3.chtc.wisc.edu    submit-3.chtc.wisc.edu           2638     47950      7870
    submit-4.chtc.wisc.edu    submit-4.chtc.wisc.edu           1990     49497      8977
    submit-5.chtc.wisc.edu    submit-5.chtc.wisc.edu          11586     28108      6966
    testsubmit.chtc.wisc.edu  testsubmit.chtc.wisc.edu            0         0         1
So, for instance, you might configure the user tj to use submit-3 by putting this line into /home/tj/.condor/user_config:

    SCHEDD_HOST = submit-3.chtc.wisc.edu
Now whenever user tj runs condor_submit or condor_q, the submit/query will go to the schedd called submit-3.chtc.wisc.edu. As long as user tj has no jobs in any of the HTCondor schedd queues, he can be moved to a new schedd merely by changing the contents of the user_config file. This can happen even while tj is logged in, since the file is re-parsed by each invocation of condor_submit and the other HTCondor tools.
Whether or not tj has any jobs in any of the schedds can be determined by running:

    condor_status -submitter firstname.lastname@example.org -af Machine 'HeldJobs+IdleJobs+RunningJobs+LocalJobsIdle+LocalJobsRunning'
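When assigning a brand-new user, an administrator could pick the least-loaded schedd from such totals. A minimal sketch in plain shell; the numbers are the Running+Idle+Held sums from the example condor_status -schedd output above, standing in for a live query:

```shell
#!/bin/sh
# "Machine total-jobs" pairs; in practice these would come from a
# condor_status query rather than being hard-coded.
totals='submit-3.chtc.wisc.edu 58458
submit-4.chtc.wisc.edu 60464
submit-5.chtc.wisc.edu 46660'

# Sort numerically by the totals column and take the least-loaded machine.
least=$(printf '%s\n' "$totals" | sort -k2 -n | head -n1 | cut -d' ' -f1)

# Print the line that would go into the new user's user_config file.
echo "SCHEDD_HOST = $least"
# prints: SCHEDD_HOST = submit-5.chtc.wisc.edu
```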
If the schedd's job queue is stored on the shared file system, HTCondor can be configured so that when a given machine fails, any schedds running on it will be automatically restarted on a backup machine, but with the same schedd name.
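The HTCondor manual describes a high-availability mode for the schedd along these lines. A sketch of the relevant configuration; the paths and the schedd name here are assumptions for illustration:

```
# Sketch only: paths are assumptions; see the HA section of the manual.
MASTER_HA_LIST = SCHEDD                   # let the master fail the schedd over
SCHEDD_NAME   = submit-3@                 # trailing @ keeps the name host-independent
SPOOL         = /shared/spool/submit-3    # job queue on the shared file system
HA_LOCK_URL   = file:/shared/spool/submit-3   # lock ensuring only one active schedd
VALID_SPOOL_FILES = $(VALID_SPOOL_FILES) SCHEDD.lock
```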
details to be written