Note that rescue DAGs have gone through a number of significant changes over the history of DAGMan. I'm only going to document the most recent version here.
Naming of rescue DAG files
If the original DAG file is foo.dag, the rescue DAG files will be named foo.dag.rescue.001, foo.dag.rescue.002, etc. Each time a rescue DAG is written, the rescue DAG number is incremented. When a rescue DAG is run, the highest-numbered rescue DAG is used (by default).
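The naming scheme above can be sketched in a few lines of Python. This is only an illustration, not the DAGMan code: the helper names (`rescue_dag_name`, `find_last_rescue_num`, `next_rescue_dag_name`) are hypothetical, and the sketch assumes the number is always zero-padded to three digits, as in the examples above.

```python
import os
import re


def rescue_dag_name(dag_file: str, number: int) -> str:
    """Format a rescue DAG file name, e.g. foo.dag.rescue.001."""
    if number < 1:
        raise ValueError("rescue DAG numbers start at 1")
    return f"{dag_file}.rescue.{number:03d}"


def find_last_rescue_num(dag_file: str) -> int:
    """Return the highest existing rescue DAG number, or 0 if there is none."""
    directory = os.path.dirname(dag_file) or "."
    # Assumption: only names of the form <dag>.rescue.NNN count as rescue DAGs.
    pattern = re.compile(re.escape(os.path.basename(dag_file)) + r"\.rescue\.(\d{3})")
    highest = 0
    for entry in os.listdir(directory):
        match = pattern.fullmatch(entry)
        if match:
            highest = max(highest, int(match.group(1)))
    return highest


def next_rescue_dag_name(dag_file: str) -> str:
    """Name for the next rescue DAG to be written (highest existing number + 1)."""
    return rescue_dag_name(dag_file, find_last_rescue_num(dag_file) + 1)
```

Running a rescue DAG by default would then use `rescue_dag_name(dag, find_last_rescue_num(dag))`, provided the number found is greater than zero.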
Format of rescue DAG
As of version 7.7.2, a rescue DAG file is no longer a complete DAG file, as it used to be. Now, the rescue DAG only records the state of the DAG (which nodes are done, and the remaining retries for each node) and must be parsed in conjunction with the original DAG file. This is done so that if a user discovers an error in their DAG file when they have a rescue DAG, they only need to fix the original DAG file, rather than having to fix both the original file and the rescue DAG file. (The old behavior can be obtained by setting DAGMAN_WRITE_PARTIAL_RESCUE to false.)
Anyhow, in the default case, the only things (besides comments) in the rescue DAG file are lines marking nodes as done and lines resetting the remaining retries for nodes.
Creation of rescue DAG
Rescue DAGs are created in three cases:
- The DAGMan job itself is condor_rm'ed
- A node or nodes have failed, and the DAG has reached the point where it can make no more forward progress.
- Immediately after parsing (or attempting to parse) the DAG, if -DumpRescue is given on the command line.
In the first two cases,
gets called (if DAGMan is condor_rm'ed,
is called by
, which is called by daemoncore; otherwise, it's explicitly called at various places in the DAGMan code), and that calls
. Note that
flag to make sure you don't get into a recursion if you have an error while trying to write the rescue DAG.
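The recursion guard described above follows a common pattern, which can be sketched roughly as follows. This is a Python illustration of the idea only; the names `_in_shutdown_rescue` and `shutdown_rescue` are hypothetical and the real DAGMan code is C++.

```python
_in_shutdown_rescue = False  # hypothetical guard flag, set while a rescue write is in progress


def shutdown_rescue(write_rescue) -> bool:
    """Try to write the rescue DAG once.

    Returns False if a rescue write is already in progress (re-entrant call)
    or if the write fails; returns True on success.
    """
    global _in_shutdown_rescue
    if _in_shutdown_rescue:
        # We got here from an error raised while writing the rescue DAG;
        # bail out instead of recursing.
        return False
    _in_shutdown_rescue = True
    try:
        write_rescue()
        return True
    except OSError:
        return False
    finally:
        _in_shutdown_rescue = False
```

If the writer's own error handling ends up calling `shutdown_rescue` again, the nested call returns immediately rather than attempting a second write.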
In the third case,
is called at various places in
(man, that function is too big!!).
In any case,
, which checks all legal rescue DAG names and finds the highest-numbered one, and then
, which creates a properly-formatted rescue DAG file name, and then
, which actually writes the rescue DAG.
is pretty straightforward. The main complication is that there are a bunch of places where we don't write stuff out if
iterates through all of the nodes and calls
on each one. The main reason for breaking
was to reduce the excessive indentation, and make the code easier to read. Note that this
flag is passed to
, because most of the info printed by
printed to a partial rescue DAG (which is the default mode).
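To make the partial versus full distinction concrete, here is a rough Python sketch of a per-node writer. It is not the DAGMan implementation: the `Node` class and function names are hypothetical, the header comment is made up, and the rule for when a retry line is emitted (any unfinished node that has retries configured) is an assumption. The `JOB`, `DONE`, and `RETRY` keywords are taken from the DAG file language.

```python
from dataclasses import dataclass
from typing import Iterable, TextIO


@dataclass
class Node:
    name: str
    submit_file: str
    done: bool = False
    retries_left: int = 0
    max_retries: int = 0


def write_node_to_rescue(out: TextIO, node: Node, is_partial: bool) -> None:
    """Write one node's state; a partial rescue DAG omits the node definition."""
    if is_partial:
        if node.done:
            out.write(f"DONE {node.name}\n")
    else:
        # A full rescue DAG repeats the node definition from the original DAG.
        done = " DONE" if node.done else ""
        out.write(f"JOB {node.name} {node.submit_file}{done}\n")
    # Assumption: retries are recorded only for unfinished nodes that have any.
    if not node.done and node.max_retries > 0:
        out.write(f"RETRY {node.name} {node.retries_left}\n")


def write_rescue(out: TextIO, nodes: Iterable[Node], is_partial: bool = True) -> None:
    """Write a comment header, then the state of every node."""
    kind = "partial" if is_partial else "full"
    out.write(f"# Rescue DAG ({kind})\n")
    for node in nodes:
        write_node_to_rescue(out, node, is_partial)
```

Keeping the per-node logic in its own function mirrors the motivation given above: the loop stays flat and the partial-mode checks live in one place.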
Use/parsing of rescue DAG
(that huge function again!) calls
to find out if there are any rescue DAGs for the current DAG. (Oh, yeah -- the user now has to run a rescue DAG by re-submitting the original DAG, not by submitting the rescue DAG directly as they did a long time ago.) The user can also specify
on the command line to specify a rescue DAG to run, if they don't want to run the latest one. If you specify a rescue DAG number, any later rescue DAGs are renamed by
After parsing the original DAG file(s), DAGMan then parses the rescue DAG (which just sets the status of various nodes to DONE as appropriate, and also possibly changes the number of retries left on some nodes). At that point, we're ready to actually run the DAG.
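As a rough illustration of this second parsing pass, the Python sketch below applies a partial rescue DAG to nodes already loaded from the original DAG file. The function name `apply_rescue` and the dictionary layout are hypothetical, and the sketch assumes the rescue file contains only comments, `DONE <node>` lines, and `RETRY <node> <count>` lines.

```python
from typing import Dict, Iterable


def apply_rescue(nodes: Dict[str, dict], rescue_lines: Iterable[str]) -> None:
    """Update node state in place from the lines of a partial rescue DAG.

    `nodes` maps node names (from the original DAG) to dicts with
    "done" and "retries_left" entries.
    """
    for lineno, raw in enumerate(rescue_lines, start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        keyword = fields[0].upper()
        if len(fields) < 2 or fields[1] not in nodes:
            raise ValueError(f"line {lineno}: unknown or missing node in {line!r}")
        node = nodes[fields[1]]
        if keyword == "DONE" and len(fields) == 2:
            node["done"] = True
        elif keyword == "RETRY" and len(fields) == 3:
            node["retries_left"] = int(fields[2])
        else:
            raise ValueError(f"line {lineno}: unexpected rescue DAG line {line!r}")
```

Because the rescue file only refers to nodes by name, an error in the original DAG file can be fixed there without touching the rescue DAG.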
Additional actions on condor_rm of DAGMan
, if there are any node jobs in the queue, we call
to remove them. If any PRE or POST scripts are running, we call
to kill them. (Note that in the case of DAGMan having made all the progress it can in the face of node failures, there won't be any node jobs or scripts running at this point.)
removes any HTCondor jobs by using the constraint
"DAGManJobId == <id>"
; any Stork jobs are removed individually.
iterates through all of the nodes, and individually kills any running scripts via daemoncore.
we also call
, which starts the DAG final node if there is one and we haven't already started it.