Version 1.0 : Initial version.
IF YOU MAKE ANY CHANGES, ADD A NEW VERSION NUMBER/LINE ABOVE
This document is associated with ticket #1095.
This document details the initial requirements of HTCondor's High Frequency Computing (HFC) effort. The goal of HTCondor HFC is to effectively manage, and obtain high throughput from, short (less-than-one-second) jobs, or tasks.
Note that what is presented here are the requirements for a 1.0 release, the smallest possible useful release; in the interest of a quick, successful initial implementation, we want to minimize feature creep. Furthermore, note that this document discusses just the requirements for this work, and purposefully avoids mentioning any particular implementation.
Three fundamental abstractions encompass HTCondor HFC: tasks, schedulers, and workers. Requirements for these three appear below, followed by more general requirements.
The task represents the unit of work, required worker attributes, desired worker attributes, current state, input, and output (if complete). Users may add custom attribute/value pairs to the task at submission time, and these will remain static for the rest of the task's lifetime. Note that the task (unlike the Scheduler or Worker) is just a set of data passed around the system.
There is no sandbox or file-transfer associated with any task. The assumption is that the worker selected by the task either has the correct data local to it, or can fetch it before the task runs. Similarly, output shall be a small XML document which is attached to the task itself. Any file, database or other output of the task is the responsibility of the task or worker.
Tasks will primarily contain payloads between 1 KB and 64 KB in size (4-16 KB is typical for benchmarking), but payloads of megabytes should be possible at far reduced performance.
The system defines at-least-once semantics for running a task; if there is some reason a task can't be started (or completed) more than once, it is the responsibility of the task to deal with multiple (perhaps concurrent) executions.
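The at-least-once guarantee means duplicate delivery is the task's problem to handle. As a minimal sketch of one way a worker-side handler could do so (all names here, such as run_task, are illustrative and not part of any specified HFC interface), a task can be made idempotent by recording completed GUIDs:

```python
# Sketch: tolerating at-least-once delivery by recording completed task
# GUIDs and returning the cached output on duplicate delivery.
# run_task/completed are illustrative names, not a defined HFC API.

completed = {}  # GUID -> output; acts as an idempotency record

def run_task(guid, payload, execute):
    """Execute a task at most once per GUID; repeats return the cached output."""
    if guid in completed:
        return completed[guid]      # duplicate delivery: reuse prior result
    output = execute(payload)       # the actual (possibly side-effecting) work
    completed[guid] = output
    return output
```

A concurrent implementation would additionally need locking around the record, which this sketch omits.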
Required Task Attributes:
|Attribute||Type||Set By||Description|
|GUID||String||Submitter||Must be globally unique|
|BatchID||String||Submitter||Allows management of batches of tasks|
|State||Integer||Scheduler||Enumeration of current state|
|Input||String||Submitter||Usually XML, of arbitrary size|
|Output||String||Worker||Usually XML, of arbitrary size|
|Service||String||Submitter||The service that should process the task|
|Delivery||String||Submitter||Description of plugin to deliver to|
|LastWorker||String||Worker||The last worker the task ran on|
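Since the task is just a set of data passed around the system, the required attributes above can be pictured as a plain record. The following is a minimal sketch (field types and the custom-attribute dict are illustrative assumptions, not a defined schema):

```python
from dataclasses import dataclass, field

# Sketch of the required task attributes as a plain data record.
# Field names mirror the table above; types are illustrative.

@dataclass
class Task:
    guid: str              # GUID: must be globally unique, set by the submitter
    batch_id: str          # BatchID: groups tasks for batch management
    state: int             # State: enumeration of current state, set by the scheduler
    input: str             # Input: usually XML, arbitrary size
    service: str           # Service: the service that should process the task
    delivery: str          # Delivery: description of the results-delivery plugin
    output: str = ""       # Output: usually XML, set by the worker on completion
    last_worker: str = ""  # LastWorker: the last worker the task ran on
    custom: dict = field(default_factory=dict)  # user attribute/value pairs, static after submit
```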
Along with Tasks and Workers, the Scheduler is one of the three fundamental abstractions in the system. It will be a service (or daemon).
There are three major user-accessible interfaces to the Scheduler: a submission interface, a query/management interface, and a results interface. All three interfaces should be identical in form, which should allow for arbitrary composition of the various parts. All interfaces will be stream-based, and this stream-based interface will support flow control, telling the sender to stop sending data during overload. Of course, there will be internally needed interfaces to the scheduler, but those are beyond the scope of these requirements.
All software interfaces will be defined in terms of a human-readable ASCII wire protocol over raw sockets. This allows for programming-language-agnostic implementations, though CLI-based reference implementations will be delivered. (TODO: firewall requirements -- should only the scheduler have inbound connections?)
The submission interface allows a client to submit a batch of tasks. The batch size may be small (or unitary), with reduced performance. The submission interface shall be synchronous: when the scheduler sends a positive acknowledgement to the client, it indicates that the scheduler has atomically received every task in the batch, and retains ownership from then on. If the ack is not received, the client should re-send the entire batch, with the same task GUIDs and batch IDs.
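Because a batch is received atomically and task GUIDs are stable, resending the identical batch after a missed ack is safe. A minimal sketch of the client-side retry rule (send_batch and AckTimeout stand in for the real wire protocol, which is not yet specified):

```python
# Sketch: client-side submission with the resend rule described above.
# send_batch is assumed to raise AckTimeout when no positive ack arrives;
# both names are placeholders for the real ASCII wire protocol.

class AckTimeout(Exception):
    pass

def submit_batch(send_batch, batch, max_attempts=3):
    """Send the same batch (same GUIDs, same BatchID) until a positive ack."""
    for _ in range(max_attempts):
        try:
            send_batch(batch)   # raises AckTimeout if the ack is not received
            return True         # acked: scheduler now owns every task in the batch
        except AckTimeout:
            continue            # ack lost: resend the entire batch unchanged
    return False
```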
All tasks in a batch are considered to have equal priority in the system, and the scheduler will run them in no particular order.
There will be a query/management interface to the scheduler which will allow the user or user programs to inquire as to the state of tasks, to remove idle and running tasks, and to edit certain attributes of tasks.
The scheduler will advertise the number of idle tasks per requested service, so that the higher-level scheduler will be able to adjust the number of workers appropriately.
There will be a plugin mechanism in the scheduler providing for multiple concurrent results-delivery mechanisms. Several predefined mechanisms may be baked into the scheduler.
The worker is one of the three major abstractions in the system. The role of the worker is to start up, execute any initial code, then sequentially execute tasks the scheduler assigns to it. Upon task completion, the worker attaches output to the finished task and returns the task to the scheduler. The worker is mainly a user-written program, so that it can execute tasks without the overhead of spawning a process and repeating any initialization steps.
Note that the worker is really in two halves: the "system" half, which is constant and provided by the system, and the "user" half, which is written by the user, and actually executes the task.
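The two-halves split above can be sketched as a constant system-half loop that is handed the user half once at startup, so per-task process-spawn and re-initialization overhead is avoided. All function names here (fetch_task, return_task, user_init, user_execute) are illustrative placeholders for the real wire protocol and user code:

```python
# Sketch of the worker split: the "system" half is this constant loop;
# the "user" half is supplied as user_init (one-time setup) and
# user_execute (per-task work). fetch_task/return_task stand in for the
# worker<->scheduler wire protocol.

def worker_loop(fetch_task, return_task, user_init, user_execute):
    state = user_init()                  # run any one-time initialization code
    while True:
        task = fetch_task()              # blocking: next task from the scheduler
        if task is None:                 # scheduler signalled shutdown
            break
        task["Output"] = user_execute(state, task["Input"])
        return_task(task)                # hand the completed task back
```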
The worker can have attributes that help tasks match to certain workers, or allow tasks to prefer certain workers. These attributes may be updated as a side-effect of running a task, and the updates can be delivered on a separate control channel from the task itself.
Workers may come and go during the course of a larger computation, according to the needs of a higher-level scheduler.
There will be a heartbeat (or keep alive) between the worker and the scheduler to detect hung workers, but not hung tasks.
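On the scheduler side, the heartbeat reduces to a liveness check over last-seen timestamps. A minimal sketch (the timeout value and names are illustrative, and note this detects hung workers, not hung tasks):

```python
import time

# Sketch: scheduler-side hung-worker detection. Each worker's most recent
# heartbeat time is recorded; a worker silent for longer than `timeout`
# seconds is treated as hung.

def hung_workers(last_beat, timeout, now=None):
    """Return the workers whose last heartbeat is older than `timeout` seconds."""
    now = time.time() if now is None else now
    return [w for w, t in last_beat.items() if now - t > timeout]
```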
For resource allocation purposes, the service will have an ad that indicates how many tasks/workers are currently present for a given service, so that preemption can be done to adjust worker counts using quotas, etc.
The interface to the worker will be specified by a programming-language neutral wire protocol, so that the worker can be implemented in any reasonable programming language.
There will be some mechanism to avoid workers which are acting as "black holes", i.e. workers that accept tasks but repeatedly fail to complete them.
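The mechanism itself is left unspecified above; one simple possibility is to exclude a worker after a run of consecutive failures. A sketch under that assumption (the threshold and names are illustrative):

```python
# Sketch of one possible "black hole" guard: a worker that fails several
# tasks in a row is excluded from further matching until it succeeds again.
# FAILURE_LIMIT is an arbitrary illustrative threshold.

FAILURE_LIMIT = 3
consecutive_failures = {}

def record_result(worker, succeeded):
    """Track consecutive failures per worker; a success resets the count."""
    if succeeded:
        consecutive_failures[worker] = 0
    else:
        consecutive_failures[worker] = consecutive_failures.get(worker, 0) + 1

def is_black_hole(worker):
    return consecutive_failures.get(worker, 0) >= FAILURE_LIMIT
```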
Workers can have "tags" to specify affinity with certain kinds of tasks -- e.g. this worker is for "red" tasks.
Other General Requirements
Performance and Lifecycle Requirements
The user shall be able to submit, execute, and complete 10 million zero-second tasks running concurrently on 1,000 workers in 8 hours. The system shall manage all 10 million tasks concurrently, though perhaps via multiple schedulers. Submissions will be streamed. Results, however, will be delivered as finished, and delivery of results will not be batched.
This implies a minimal sustained throughput of roughly 350 tasks per second (10,000,000 tasks ÷ 28,800 seconds ≈ 347). To handle bursts, the system must be able to handle completion of 1,000 echo tasks per second, and ideally 5,000. Idle tasks will be matched to pre-staged workers via a unilateral matching process. These pre-staged workers may start and stop during the course of running the whole computation. This matching process will support Requirements and Rank, where the attributes of the task and the worker may be predefined by the system, or custom extended attributes defined by the user. Worker attributes will represent (among other things) pre-staged local file sets, which will number in the low tens (less than 20?).
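The unilateral matching step described above can be sketched as follows: among the workers whose attributes satisfy a task's Requirements predicate, pick the one with the highest Rank score. The predicates are plain functions in this sketch; in the real system they would be expressions over predefined or user-defined task and worker attributes:

```python
# Sketch of unilateral matching with Requirements and Rank.
# task_requirements: predicate over a worker's attributes (must hold).
# task_rank: scoring function over a worker's attributes (higher is better).

def match(task_requirements, task_rank, workers):
    """Return the best-ranked worker satisfying the requirements, or None."""
    eligible = [w for w in workers if task_requirements(w)]
    if not eligible:
        return None
    return max(eligible, key=task_rank)
```

For example, a task could require a particular pre-staged file set and rank eligible workers by a (hypothetical) speed attribute.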
Task results will be delivered to the user via a custom plugin, which will be called for each task as soon as the results are ready. All performance benchmarks will be made with a null plugin. One potential action a plugin could implement is to submit more tasks, based on the result of a completed task. Or, it could remove existing tasks, if the completed task obviated certain pre-existing tasks.
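A minimal sketch of the per-task plugin hook: the scheduler calls the configured plugin once per finished task. The null plugin used for benchmarking discards the result; a follow-on plugin could submit more tasks based on a completed task's output. The deliver interface and field names here are assumptions, not a defined API:

```python
# Sketch of results-delivery plugins. The scheduler is assumed to call
# plugin.deliver(task) once per finished task, with task as a dict of
# attributes; these names are illustrative.

class NullPlugin:
    def deliver(self, task):
        pass                        # discard the result; used for benchmarks

class FollowOnPlugin:
    """Submits a follow-up task when a completed task's output asks for one."""
    def __init__(self, submit):
        self.submit = submit        # callable that submits a new task
    def deliver(self, task):
        if task.get("Output") == "needs-more-work":
            self.submit({"GUID": task["GUID"] + "-followup",
                         "Input": task["Output"]})
```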
In order to get substantially better performance than vanilla HTCondor jobs, we need to relax the reliability constraints that HTCondor normally maintains.
Task submission shall be atomic. That is, after the client submits to the scheduler and receives a positive acknowledgment, all tasks in the batch are under the control of the scheduler. It will be impossible for the scheduler to receive a truncated task or a truncated batch of tasks. If the client fails to receive the ack, it should resend the batch with the same GUIDs.
The scheduler shall not be durable. If the scheduler or the scheduler machine crashes, state information will be lost, and some tasks will need to be rerun. In the event that a worker or worker machine crashes, the scheduler will detect this and automatically re-run the task on a different worker which matches the task.
All code will be merged into the mainline HTCondor, and generally available to HTCondor users under HTCondor's normal licensing and distribution terms.
The work needs to run on both Windows and Linux execute machines and submit machines, with remote submission possible. As this will be an ongoing work in progress, we will not guarantee forward or backward compatibility between the various versions we release.
We assume that all clients, schedulers, and workers all run within one security domain, and that all networks and machines are trusted. Connections will not be authenticated nor communications encrypted.