# Fill in executable and max expected runtime in minutes.
# If the job runs longer than expected, it will go on hold,
# and then will be restarted on a different machine. After
# three runs on three different machines, the job will
# stay on hold.
#
executable = foo.exe
expected_runtime_minutes = 20
#
# Should not need to change the below...
#
job_machine_attrs = Machine
job_machine_attrs_history_length = 4
requirements = target.machine =!= MachineAttrMachine1 && \
   target.machine =!= MachineAttrMachine2 && \
   target.machine =!= MachineAttrMachine3
periodic_hold = JobStatus == 2 && \
   time() - EnteredCurrentStatus > 60 * $(expected_runtime_minutes)
periodic_hold_subcode = 1
periodic_release = HoldReasonCode == 3 && HoldReasonSubCode == 1 && \
   JobRunCount < 3
periodic_hold_reason = ifthenelse(JobRunCount<3,"Ran too long, will retry","Ran too long")
queue

Note the technique: periodic_hold puts the job on hold if it runs too long, which moves the job to the Held state. Next, periodic_release releases the job, so it goes back to Idle and is rescheduled. The requirements expression ensures the job runs on a different machine entirely, not just a different slot on the same machine.
Also note that the periodic_release expression only releases a job that was put on hold for a known cause, which we identify through the HoldReasonSubCode attribute: HoldReasonCode 3 means a periodic_hold expression fired, and subcode 1 is the value we set with periodic_hold_subcode. After all, we don't want to release a job that was put on hold for a different reason, such as the user running condor_hold. We also set periodic_hold_reason to something helpful, so typing condor_q -hold displays something informative.
Finally, note that we limit the number of times a job goes through the hold/release cycle.
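
To make the hold/release cycle concrete, here is a small Python sketch that mimics the policy above. It is a toy model, not HTCondor code: the Job class, the simulate function, and the machine names are all hypothetical, and it assumes JobRunCount goes up by one each time the job starts running and that the first acceptable machine is always chosen.

```python
from dataclasses import dataclass, field

EXPECTED_RUNTIME_MINUTES = 20  # mirrors expected_runtime_minutes in the submit file
MAX_RUNS = 3                   # mirrors "JobRunCount < 3" in periodic_release


@dataclass
class Job:
    """Hypothetical stand-in for the job ClassAd; not an HTCondor type."""
    run_count: int = 0                                    # plays the role of JobRunCount
    machine_history: list = field(default_factory=list)   # most recent machine first
    status: str = "Idle"
    hold_reason: str = ""


def requirements_ok(job, machine):
    # Mirrors the requirements expression: reject the last three machines used.
    return machine not in job.machine_history[:3]


def simulate(machines, runtime_minutes):
    """Follow a job that always needs runtime_minutes until it completes,
    stays held, or has no acceptable machine left (assumed behaviour)."""
    job = Job()
    while True:
        candidates = [m for m in machines if requirements_ok(job, m)]
        if not candidates:
            return job  # stays Idle: every known machine is excluded
        job.run_count += 1
        job.machine_history.insert(0, candidates[0])
        job.status = "Running"

        # periodic_hold: fires only if the run exceeds the expected runtime.
        if runtime_minutes * 60 <= 60 * EXPECTED_RUNTIME_MINUTES:
            job.status = "Completed"
            return job
        job.status = "Held"
        job.hold_reason = (
            "Ran too long, will retry" if job.run_count < MAX_RUNS else "Ran too long"
        )

        # periodic_release: only while the run count is below the limit.
        if job.run_count >= MAX_RUNS:
            return job  # stays Held
        job.status = "Idle"


if __name__ == "__main__":
    job = simulate(["node-a", "node-b", "node-c", "node-d"], runtime_minutes=30)
    print(job.status, job.run_count, job.machine_history, job.hold_reason)
```

With four machines available and a job that always needs 30 minutes, the sketch runs the job on three different machines and then leaves it Held with the reason "Ran too long". With only one machine, the job goes back to Idle after the first hold and never matches again, which is worth keeping in mind for small pools.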