How To Avoid Job Restarts

Known to work with version 7.4.0

By default, HTCondor manages jobs under the assumption that the user wants them to be run as many times as necessary in order to successfully finish. If all goes well, this means the job will only run once. However, various failures can require that the job be restarted in order to succeed. Examples of such failures include:

  • execute machine crashes while job is running
  • submit machine crashes while job is running and does not reconnect to job before the job lease expires
  • job is evicted by condor_vacate, PREEMPT, or is preempted by another job
  • output files from job fail to be transferred back to submit machine
  • input files fail to be transferred
  • network failures between execute and submit machine

In some cases, jobs should not be restarted: the user wants HTCondor to try to run the job once and, if that attempt fails for any reason, not make a second attempt. To achieve this, put the following in the job's submit file:

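# Only match the job to a machine if it has never started before.
requirements = NumJobStarts == 0
# Remove the job if it is idle (JobStatus 1) after having started at least once.
periodic_remove = JobStatus == 1 && NumJobStarts > 0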

Note that this does not guarantee that HTCondor will only start the job once. The NumJobStarts job attribute is updated shortly after the job starts running. Various types of failures can result in the job starting without this attribute being updated (e.g. network failure between submit and execute machine). By setting SHADOW_LAZY_QUEUE_UPDATE=false, the window of time between the job starting and the update of NumJobStarts can be decreased, but this still does not provide a guarantee that the job will never be started more than once. This policy is therefore to start the job at least once, and, with best effort but no strong guarantee, not more than once. As usual, HTCondor does provide a strong guarantee that the job is never running more than once at the same time.
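
The setting mentioned above might look like the following. This is a minimal sketch, assuming the variable is placed in the HTCondor configuration of the submit machine (where the condor_shadow runs) and that the daemons are reconfigured afterwards; the file location is site-specific and not stated here.

```
# HTCondor configuration on the submit machine (assumed location).
# Ask the shadow to update NumJobStarts in the job queue right away
# rather than waiting for a later update. This narrows the window
# described above; it does not close it.
SHADOW_LAZY_QUEUE_UPDATE = false
```

Even with this setting, the policy remains best effort: a failure between the job starting and the update reaching the job queue can still lead to a second start.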

Instead of NumJobStarts, you can key the policy off several other attributes that are incremented when a job starts up:

  • NumShadowStarts - Incremented when the condor_shadow starts, but before the job has started. Guarantees that a job will run at most once, but if a problem occurs between the shadow starting and the job starting, the job will never run.
  • NumJobMatches -
  • JobRunCount -
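
As an illustration of swapping in one of these attributes, the submit file lines could be adapted to NumShadowStarts roughly as follows. This is a sketch, not a tested recipe: it assumes NumShadowStarts may be undefined in the job ad until the first shadow start, which is why the requirements expression includes an explicit undefined check.

```
# Match only if no shadow has ever been started for this job.
# The "=?= UNDEFINED" guard is an assumption, covering the case where
# the attribute is not yet present in the job ad.
requirements = (NumShadowStarts =?= UNDEFINED || NumShadowStarts == 0)
# Remove the job if it is idle (JobStatus 1) after a shadow has started.
periodic_remove = JobStatus == 1 && NumShadowStarts > 0
```

The trade-off is the one described for NumShadowStarts in the list: the job runs at most once, but a failure between the shadow starting and the job starting means the job is removed without ever having run.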

Search terms: run once and only once, one, single