PPT Slide
Mechanisms for Matchmaking and Parallel High Throughput Computing in
the Condor Distributed System
Rajesh Raman, raman@cs.wisc.edu
Todd Tannenbaum, tannenba@cs.wisc.edu
http://www.cs.wisc.edu/condor
Condor Project
- Overview
- What is Condor ?
- Projects and Collaborations
- High Throughput Computing
- ClassAds and MatchMaking
- Parallel Computing with Condor
What is Condor ?
- High Throughput Computing
- Distributed Resources
- Physically distributed
- Distributed ownership
- Resource Management
- Increase utilization of resources
- Simple interface to execution environment
- User level interface
- Application level interface
Important Mechanisms
- Checkpointing (and migration)
- Owner policies require resource reclamation
- Need to save (resumable) state of application
- Remote System Calls
- Preserves submission environment in execution environment.
The Condor Team
- Research Staff
- Todd Tannenbaum
- Derek Wright
- Adding 2 more...
Condor Team, cont.
- Graduate Students
- Rajesh Raman (MatchMaking)
- Jim Basney (Split Execution)
- Shrinivas Ashwin (Mr. Parallel)
- Adiel Yoaz (Accounting)
Condor Almuni
- Mike Litzkow
- David Dewitt
- Marvin Solomon
- Many others… (Produced XXX Masters and XXX PhDs]
Current Collaborators and Projects
- UW-Flock
- Intel Sponsorship: $4.2 Million
- Graduate School, Engineering
- metaNEOS: metacomputing environments for optimization
- with Prof. Michael Ferris
Condor Pool Installations
- Universites
- U of Wisconsin, U of Illinois, U of Michigan, Dartmouth, Duke, U of Washington, U of Virginia, U of California-Berkeley
- Government
- NCSA, Nasa, US Navy, NSA, NIKHEF (Amsterdam), INFN (Italy)
- Commercial
- Hewlett-Packard Labs, J.P. Morgan, Mercedez-Benz, Dragon Systems
Power of Computing Environments
- High Performance Computing
- Fixed amount of work; how much time?
- Response time/latency oriented
- Traditional Performance metrics: FLOPS, MIPS
- High Throughput Computing
- Fixed amount of time; how much work?
- Throughput oriented
- Application specific performance metrics
Distributed Ownership of Resources
- Commodity resources
- Underutilized: 70% of a pool's cycles are not utilized
- Fragmented: owned by different people
- Can provide HTC with these cycles, BUT
- Must not impact QOS to owner
- Owners specify access policy
- Expressed with control expressions
- The current state of the resource (e.g., load average)
- Characteristics of the request (e.g., who wants to use it?)
- Time of day, random numbers, etc
Condor Architecture
- Startds ( Represent owners of resources)
- Implement owner's access control policy
- Schedds( Represent customers of the system)
- Maintain persistent queues of resource requests
- Manager
- Collector: Database of resources
- Negotiator: Matchmaker
- Accountant: Priority maintenance
Condor Architecture, cont.
Matchmaking
- Customers
- Require resources with certain characteristics
- Discriminating customers
- Requests place constraints on resources
- Distributed ownership
- Resources service requests which match owner's policy
- Discriminating resources
- Resource offers place constraints on customers
Matchmaking with Classified Advertisements
- Parties requiring matchmaking advertise
- Characteristics and requirements (i.e., constraints)
- Advertisements matched by a Matchmaker
- Matched parties contact each other to "claim”
- Communication, authentication, constraint verification, negotiation of terms, etc.
- Claiming does not involve the Matchmaker
- Method is symmetric
- No client/server relation imposed
Classified Advertisement Matchmaking Framework
- Expression and evaluation of characteristics
- ClassAd, Closure, EvaluationContext
- Advertising Protocol
- Contents of advertisements
- Publication protocol
- Matchmaking Algorithm
- Relates ad contents to matching process
- Priority schemes, Ranking schemes, etc.
Classified Advertisement Matchmaking Framework (contd.)
- Matchmaking Protocol
- How are relevant parties informed of a successful match?
- What information are they given?
- Claiming Protocol
- How do matched parties claim each other to cooperate?
ClassAd: Mechanism for expressing characteristics
- A ClassAd is a set of names, each of which is bound to an expression. e.g., [ Name => "Joe Hacker" ; Height => 182 ; Sex => "Male" ; Disposition => (TimeOfDay() < 600) ? "Sour" : UNDEFINED ; Requirements => (other.Height < Height) && (other.Sex == "Female") ]
- Expressions
- Constants, attribute references, function calls
ClassAd (contd.)
- Attribute references may refer to attributes in other ads
- Attribute references "trigger" expression evaluation
- Scope resolution
- Evaluates to UNDEFINED if no such expression exists
- Values
- String, integer, real, UNDEFINED and ERROR types
- Operators are total (i.e., defined over all values)
Closure: Evaluation Environment for a ClassAd
- Determines which ClassAd's attributes to lookup
- Closure is
- ClassAd and an ordered mapping of (scope-name, closure) pairs
- No name may be repeated
EvaluationContext: Evaluation Environment for several ClassAds
- A set of closures which is self-contained
- No closure reference leaves the context
- Condor's "Standard Context" is a bit more complex
- Includes closures for a matchmaker "advertisement”
Matchmaking in Condor
- Opportunistic Resource Exploitation
- Resource availability is unpredictable
- Exploit resources as soon as they are available
- Return resources as soon as they are unavailable
- Matchmaking performed continuously
- Attractive for malleable parallel applications
- Request more resources after execution commences
- Granted immediately if resources are available, or
- As soon as resources become available
Matchmaking in Condor (contd.)
- Advertising protocol
- Startd's, Schedd's send classads to Collector
- Must contain a "Requirements” expression
- Optionally contain a"Rank” and “CurrentRank” expressions
- Startds send a "private ad" containing a capability
- Matchmaking protocol
- Give the matched Startd and Schedd the capability from the startd's private ad
Matchmaking in Condor (contd.)
- Matchmaking Algorithm
- Request ad A matched with offer ad B “iff”
- A's "Requirements" expression evaluates to TRUE, and
- B's "Requirements" expression evaluates to TRUE, and
- B‘s"Rank" expression value is greater than "CurrentRank", and
- A’s "Rank" expression value is its greatest when evaluated against B
- Claiming protocol
- Negotiate "heartbeat" frequency, checkpoint transfer, etc.
Condor Parallelism
- Job Level
- Condor clusters of processes
- DagMan
- Task Level
- Interfacing Condor and PVM
- PVM: Message Passing
- Condor: Resource Management
- PVM Resource Manager Interface
Interfacing Condor and PVM, cont.
Interfacing Condor and PVM, cont.
- CARMI -vs- PVM
- Resource Requests
- PVM: Synchronous
- CARMI: Asynchronous
- Resource Request Mechanism
- PVM: Hostname and Type String
- CARMI: ClassAd
- Task Management
- CARMI: Additional Notifications
- CARMI: Additional Operations
-
Master-Worker Model
- A good fit for an opportunistic environment
- Master
- Runs on Submit Machine
- Manages pool of tasks
- Worker
- Runs on remote machines
- Receives pieces of work from the Master, returns answer
Additional Condor/PVM Frameworks
- CoCheck
- Checkpoint a Worker or set of Workers
- Requirements for a consistent checkpoint
- Synchronize all processes
- Flush PVM messages in transit
- Perform Checkpoint (save image)
- Remap TIDs
- WoDi
- A framework for Master-Worker applications
- Performs optimizations
Future Work
Future Work Part II
- Matchmaking
- Aggregate Resources/Requests
Summary
- Condor is an implementation of a High Throughput Computing system in an opportunistic environment.
- Major Mechanisms to achieve HTC:
- Matchmaking
- Checkpointing
- Remote system calls
- Sandboxing