Condor Tutorial
First EuroGlobus Workshop
June 2001

Tutorial Outline

Tutorial Outline

The Condor Project (Established ’85)

What is High-Throughput Computing?

What is Condor?

The Condor System

Some HTC Challenges

What is ClassAd Matchmaking?

Upgrade to Condor-G

What Have We Done on the Grid Already?

NUG30 Solved on the Grid with Condor + Globus

NUG30 - Solved!!!

The Idea

Meet Frieda.

Frieda’s Application …

I have 600 simulations to run.

Where can I get help?

Installing Condor

So Frieda Installs Personal Condor on her machine…

Personal Condor?!

What’s the benefit of a Condor “Pool” with just one user and one machine?

Your Personal Condor will ...

Getting Started: Submitting Jobs to Condor

Making your job batch-ready

Creating a Submit Description File

Simple Submit Description File

Running condor_submit

Running condor_submit

Another Submit Description File

“Clusters” and “Processes”

Example Submit Description File for a Cluster

Submit Description File for a BIG Cluster of Jobs

Submit Description File for a BIG Cluster of Jobs

Using condor_rm

Temporarily halt a Job

Using condor_history

Getting Email from Condor

Getting Email from Condor (cont’d)

A Job’s life story: The “User Log” file

Sample Condor User Log

Uses for the User Log

Condor JobMonitor Screenshot

Job Priorities w/ condor_prio

Want other Scheduling possibilities? Extend with the Scheduler Universe

DAGMan

What is a DAG?

Defining a DAG

Submitting a DAG

Running a DAG

Running a DAG (cont’d)

Running a DAG (cont’d)

Recovering a DAG

Recovering a DAG (cont’d)

Finishing a DAG

Additional DAGMan Features

We’ve seen how Condor will

What if each job needed to run for 20 days?

What if I wanted to interrupt a job with a higher priority job?

Condor’s Standard Universe to the rescue!

Process Checkpointing

Relinking Your Job for submission to the Standard Universe

Limitations in the Standard Universe

When will Condor checkpoint your job?

What Condor Daemons are running on my machine, and what do they do?

Condor Daemon Layout

condor_master

condor_master (cont’d)

condor_startd

condor_schedd

condor_collector

condor_negotiator

Happy Day!  Frieda’s organization purchased a Beowulf Cluster!

Layout of the Condor Pool

condor_status

Frieda tries out parallel jobs…

The Boss says Frieda can add her co-workers’ desktop machines into her Condor pool as well… but only if they can also submit jobs.

Layout of the Condor Pool

Some of the machines in the Pool do not have enough memory or scratch disk space to run my job!

Specify Requirements!

Specify Rank!

How can my jobs access their data files?

Access to Data in Condor

Remote System Calls

Job Startup

condor_q -io

I am adding nodes to the Cluster… but the Engineering Department has priority on these nodes.

The Machine (Startd) Policy Expressions

Frieda’s Current Settings

Frieda’s New Settings for the Chemistry nodes

Submit file with Custom Attribute

What if “Department” not specified?

Another example

The Cluster is fine.  But not the desktop machines.  Condor can only use the desktops when they would otherwise be idle.

So Frieda decides she wants the desktops to:

Macros in the Config File

Desktop Machine Policy

Policy Review

General User Commands

Administrator Commands

CondorView Usage Graph

Back to the Story: Disaster Strikes!

Frieda Goes to the Grid!

How Flocking Works

Condor Flocking

Condor Flocking (cont’d)

Condor-G: Globus + Condor

Condor-G Installation: Tell it what you need…

… and watch it go!

Frieda Submits a Globus Universe Job

How It Works

How It Works

How It Works

How It Works

How It Works

Condor Globus Universe

Globus Universe Concerns

Changes to the Globus JobManager for Fault Tolerance

Globus Universe Fault-Tolerance: Submit-side Failures

Globus Universe Fault-Tolerance: Lost Contact with Remote Jobmanager

Globus Universe Fault-Tolerance: Credential Management

But Frieda Wants More…

Solution: Condor GlideIn

How It Works

How It Works

How It Works

How It Works

How It Works

How It Works

How It Works

GlideIn Concerns

Common Questions (cont’d)

In Review

Case Study: CMS Production

CMS Physics

CMS Physics

ENORMOUS Data Challenges Ahead

Leveraging Grid Resources

Challenges of a CMS Run

CMS Run on the Grid

CMS Run on the Grid

CMS Run on the Grid

CMS Run on the Grid

CMS Run on the Grid

CMS Run on the Grid

CMS Run Details

CMS Run Details

Future Directions

Thank you!