Vanilla Starter Design

Control flow in the vanilla starter, from main_init onwards to the Create_Process job exec in the context of a vanilla job.

main_init():
+ WHEN_WIN32: SetConsoleCtrlHandler is called
+ We register a cleanup routine: exception_cleanup
+ We parseArgs our arguments and return a JobInfoCommunicator object.
  parseArgs():
   + See the code for details.
   + We return one of a specific JIC type of:
     - JICShadow for vanilla-ish universe jobs with a Shadow
     - JICLocalSchedd for local universe under schedd (no shadow)
     . These next two deal with running a job as specified in a file on disk.
     - JICLocalFile
     - JICLocalConfig
+ We call Starter->Init (type CStarter, a singleton-like object already
  defined as a global variable) with the JobInfoCommunicator object,
  and some global variables we set while parsing arguments. The notable
  one is whether or not we are a gidshell.
  Starter->Init():
   + Arrange that an internal field is set to the passed in JIC object.
   + Config().
   + Figure out working directory.
   + Register signal handlers:
     - DC_SIGSUSPEND -> CStarter::RemoteSuspend
     - DC_SIGCONTINUE -> CStarter::RemoteContinue
     - DC_SIGHARDKILL -> CStarter::RemoteShutdownFast
     - DC_SIGSOFTKILL -> this
     - DC_SIGPCKPT -> CStarter::RemotePeriodicCkpt
     - DC_SIGREMOVE -> CStarter::RemoteRemove
     - SIGUSR1 -> CStarter::RemoteRemove
     - DC_SIGHOLD -> CStarter::RemoteHold
   + Register reaper handler:
     - Reaper -> CStarter::Reaper
   + Register command handlers:
     - CA_CMD -> CStarter::classadCommand
     - UPDATE_GSI_CRED -> CStarter::updateX509Proxy
     - DELEGATE_GSI_CRED_STARTER -> CStarter::updateX509Proxy
     - STARTER_HOLD_JOB -> CStarter::remoteHoldCommand
     - CREATE_JOB_OWNER_SEC_SESSION -> CStarter::createJobOwnerSecSession
     - START_SSHD -> CStarter::startSSHD
   + Set the resource limits for the job with sysapi_set_resource_limits().
   + Initialize the JobInfoCommunicator object with jic->Init().
     jic->init() - There is one for each JIC, assuming JICShadow.
     + getJobAdFromShadow()
     + Maybe receiveMachineAd()
     + checkForStarterDebugging() and wait if it is true.
     + Create a Shadow object: DCShadow()
     + initShadowInfo() grabs some shadow version info from the job ad.
     + registerStarterInfo() tells the shadow about the starter.
     + initMatchSecuritySession() create a security session for file xfer.
     + initJobInfo() cache important stuff out of job ad.
     + initUserPriv() who is the owner and set up privstate codes.
     + Starter->createTempExecuteDir() makes a spot to run the job.
     + initIOProxy() as the user if they ask for it.
     + u_log->initFromJobAd() if the job wants a starter user log.
     + writeExecutionVisa() if the user job wanted it.
   + Have the JIC set up the job's environment.
     jic->setupJobEnvironment() - One for each JIC, assuming JICShadow.
     + Call beginFileTransfer() if file transfer is wanted.
       + Get ATTR_TRANSFER_EXECUTABLE
       + Change ATTR_TRANSFER_EXECUTABLE to ATTR_JOB_CMD/CONDOR_EXEC
       + Initialize the FileTransfer Object
         filetrans->Init(job_ad, false, PRIV_USER)
          + Initialization TBD
       + filetrans->setSecuritySession()
       + REGISTER FILETRANSFER CALLBACK -> JICShadow::transferCompleted()
       + Pay attention to the version to the shadow
         filetrans->setPeerVersion()
       + Do non blocking file transfer: filetrans->DownloadFiles(false)
       + END OF CONTROL, BACK TO DAEMON CORE, SEND CONTROL TO
         transferCompleted()

     + transferCompleted() FROM DAEMONCORE or directly called if no file xfer.
       + WHEN filetranser object exists
         + ftrans->GetInfo() and make sure the xfer succeeded.
         + Ensure the executable has the executable bit set.
    + Call parent's JobInfoCommunicator::setupJobEnvironment() function.
      + When job hooks are active.
        + m_hook_mgr->tryHookPrepareJob(), fail out if it didn't work.
      + Call Starter->jobEnvironmentReady():
        Starter->jobEnvironmentReady()
        + WHEN linux, initialize a helper object for glexec with
          ATTR_X509_USER_PROXY if it exists.
        + Change the sand box ownership to the user away from HTCondor.
        + Implement the job deferral policy:
          CStarter::jobWaitUntilExecuteTime()
          + The gist is that we register a timer for when the job should
            execute. The timer continues the control flow to the actual
            execution of the job. If there is no deferral time, the timer
            gets a time of 0, which means "now".
            REGISTER TIMER -> CStarter::SpawnPreScript()
          + TBD
          + END OF CONTROL, BACK TO DAEMONCORE

FROM DAEMONCORE:
CStarter::SpawnPreScript()
 + See if we need any pre/post scripts.
 + WHEN there is a pre script (eewww):
   + TBD for exactly what happens, but basically, when we enter this
     control path, we start the pre script, and go BACK TO DAEMON CORE.
     The reaper of the pre script calls SpawnJob() to actually start
     the job. A bit of interesting control flow, to say the least.
     END OF CONTROL FLOW

 From DAEMONCORE, or if there was no pre-script:
 + Call SpawnJob() to actually start the job.
   + SpawnJob()
     + Make a specific OsProc class for the job, it could be one of:
       - VanillaProc (assume this code path)
       - JavaProc
       - ParallelProc
       - MPIMasterProc or MPIComradeProc
       - VMProc
    + Start the job.
      job->StartJob() - Assuming VanillaProc
   	  + WHEN win32
        + TBD
      + Figure out the pid tracking interval.
      + Determine if we are on a dedicated account for pid tracking purposes.
      + WHEN Linux
        + Track by supplementary group ids.
      + Call parent's OsProc::StartJob()
        OsProc::StartJob()
        + Do some checks for validity, otherwise don't run the job.
        + Get a privsep helper object, if needed.
        + Set up the arguments to the job.
        + If we're gridshell, chmod the executable to be executable, since
          globus probably got it wrong.
        + If there is a USER_JOB_WRAPPER, use it.
        + Set up the job's environment variables.
        + Set up the job's std files (as the user's privstate).
        + If we want to renice the job, add in some more stuff to the job ad.
        + Write the job and machine ads into the execute directory if desired.
        + Handle if we desire to start the job in a suspended state.
        + Enforce coresize on the job.
        + WHEN win32, TBD
        + Create the process either with the privsep helper, or Create_Process.
        + Close any fds we passed to the child, since the child has them.
        + If there was an error, handle it.
        + Record the start time of the job.
 + If spawning of the job was not successful:
   + Bail with an error, and shut down the starter. END CONTROL.
 + Suspend the job right now if we wanted to start it that way.
 + Set up a ToolDaemon for the job, if desired.
 + Let the Job Info Communicator know the jobs was started.
   jic->allJobsSpawned();
   + Set up some timers to fire at a 8 second start, 5 minute interval while
     the job is running to manage it.
     JobInfoCommunicator::startUpdateTimer()
     + Configure the time slice system.
     + REGISTER TIMER -> JobInfoCommunicator::periodicJobUpdateTimerHandler().
     + END OF CONTROL, BACK TO DAEMON CORE.

ON TIMER CALLBACK:
JobInfoCommunicator::periodicJobUpdateTimerHandler()
+ This seems to be the only timer which is periodically running while the job
  is running and does things on behalf of the job.
+ Call the "real" timeout function. This is because default arguments to
  timer callbacks get mishandled by some other, possibly daemoncore, system.
  periodicJobUpdate()
  + WHEN have job hooks
    + Update the job info.
  END OF CONTROL.

Control flow in the vanilla starter, from job completing to shutdown.

TBD

Control flow in the vanilla starter, from signal reception or other external
notification to shutdown or change of state.

TBD