How To Auto Retry Elsewhere

Suppose you know that your job should never run for longer than 20 minutes, and if it does, you want to presume something is wrong with the machine. Here is an example of a condor_submit file that tells HTCondor to restart the job on a different machine if the job runs for more than 20 minutes.

# Fill in executable and max expected runtime in minutes.
# If the job runs longer than expected, it will go on hold,
# and then will be restarted on a different machine.  After
# three attempts on three different machines, the job will
# stay on hold.
#
executable = foo.exe
expected_runtime_minutes = 20
#
# Should not need to change the below...
#
job_machine_attrs = Machine
job_machine_attrs_history_length = 4
requirements = target.machine =!= MachineAttrMachine1 && \
   target.machine =!= MachineAttrMachine2 && \
   target.machine =!= MachineAttrMachine3
periodic_hold = JobStatus == 2 && \
   time() - EnteredCurrentStatus > 60 * $(expected_runtime_minutes)
periodic_hold_subcode = 1
periodic_release = HoldReasonCode == 3 && HoldReasonSubCode == 1 && \
   JobRunCount < 3
periodic_hold_reason = ifthenelse(JobRunCount<3,"Ran too long, will retry","Ran too long")
queue

Note that the technique is to put the job on hold via periodic_hold if it runs too long, which moves the job to the Held state. The job is then released via periodic_release, causing it to go back to Idle and be rescheduled. The requirements expression ensures the job runs on a different machine entirely, not just a different slot on the same machine; see AvoidingBlackHoles.

Also note that the periodic_release expression only releases a job that was put on hold for a known cause, which we implement by setting periodic_hold_subcode and then checking for that value in HoldReasonSubCode. After all, we don't want to release a job that was put on hold for a different reason, such as the user running condor_hold. We also set periodic_hold_reason to something helpful, so that typing condor_q -hold displays something informative.
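To make the release condition concrete, here is a minimal sketch in plain Python that mirrors the periodic_release expression from the submit file. It is only a model for reasoning about the expression, not HTCondor code: should_release is a hypothetical helper, and the hold code used for the condor_hold case (1) is an assumption you should check against the documented job ClassAd attributes.

```python
# Illustrative model only: a plain-Python stand-in for the periodic_release
# expression in the submit file. should_release is a hypothetical helper,
# not part of any HTCondor API.

def should_release(hold_reason_code, hold_reason_subcode, job_run_count):
    """Mirror: HoldReasonCode == 3 && HoldReasonSubCode == 1 && JobRunCount < 3."""
    return (hold_reason_code == 3
            and hold_reason_subcode == 1
            and job_run_count < 3)


# Held by this recipe's periodic_hold after the first run: released.
print(should_release(3, 1, 1))

# Held by this recipe's periodic_hold after the third run: stays on hold.
print(should_release(3, 1, 3))

# Held for another reason (code 1 is assumed here to mean a user ran
# condor_hold): never released by this expression.
print(should_release(1, 0, 1))
```

The subcode check is what separates this recipe's holds from any other periodic hold policy that might also produce HoldReasonCode 3.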

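Finally, note that we limit the number of times a job goes through the hold/release cycle: the JobRunCount < 3 clause in periodic_release stops releasing the job once it has been started three times, so after that it stays on hold.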
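The effect of that limit can be traced with a toy simulation of the hold/release cycle. This is a sketch under simplifying assumptions, not HTCondor code: simulate and MAX_RUNS are hypothetical names, JobRunCount is assumed to increase by one each time the job starts, and machine selection is not modeled at all.

```python
# Toy simulation of the hold/release cycle described above; not HTCondor code.
# Assumption: JobRunCount goes up by one each time the job starts running.

MAX_RUNS = 3  # mirrors the "JobRunCount < 3" clause in periodic_release


def simulate(runs_that_overrun):
    """Return (final_state, job_run_count) for a job whose first
    `runs_that_overrun` attempts exceed the expected runtime."""
    job_run_count = 0
    while True:
        job_run_count += 1  # the job starts running
        if job_run_count > runs_that_overrun:
            # This attempt finishes within the expected runtime.
            return "Completed", job_run_count
        # periodic_hold fires: the job goes to Held (code 3, subcode 1).
        if job_run_count < MAX_RUNS:
            # periodic_release is true: back to Idle, then rescheduled.
            continue
        # periodic_release is false: the job stays on hold.
        return "Held", job_run_count


print(simulate(0))  # ('Completed', 1)
print(simulate(2))  # ('Completed', 3)
print(simulate(5))  # ('Held', 3)
```

Under these assumptions a job gets at most three starts: two hold/release cycles, and then a third hold that is not released.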

All of the mechanisms used in the above submit file are described on the condor_submit manual page. It may also be useful to browse the documented job ClassAd attributes.