Checkpointing

HTCondor
Overview
Download
Documentation
Release Highlights
Release Schedule

HTCondor-CE
Overview
Download
Documentation
Release Highlights

General
News
Security
Documentation
Download

Documentation
HTCondor
HTCondor-CE

Support
HTCSS Contract Support
Community Mailing List
External Support Organizations

Email Lists
Users / Help
World / Announcements

Workshops
Throughput Computing 2025
Materials from past Workshops

HTCSS
What is HTCSS?
Logos

CHTC
Website ⬏
Staff ⬏
Jobs ⬏
Research & Publications ⬏

Overview

The ability to cause a job to vacate a non-idle workstation and migrate to an idle workstation is central to HTCondor’s job scheduling. To allow these migrating jobs to continue their progress, HTCondor must be able to continue a vacated job without a complete loss of computation time. HTCondor does this by using a checkpoint of a job’s state. A checkpoint is generated just before a job vacates a machine. The checkpoint is then used as a starting point for the job after migration.

HTCondor gives a program the ability to checkpoint itself by providing a checkpointing library which contains a signal handler for SIGTSTP, the HTCondor checkpoint signal. When a HTCondor job receives a checkpoint signal, it writes a checkpoint file, which contains the process’s data and stack segments, as well as information about open files, pending signals, and CPU state. When the job is restarted, the state contained in the checkpoint file is restored (by startup code in the checkpoint library) and the process resumes the computation at the point where the checkpoint was generated. Programs submitted to be run by the HTCondor system are re-linked (but not re-compiled) to include this library.

A technical description of the checkpointing and process migration mechanisms implemented in HTCondor is given in this technical report.

Standalone Checkpointing

The HTCondor distribution includes a checkpointing library, to provide checkpoints for Unix processes that are not submitted for execution under HTCondor. Information on using standalone checkpointing is available in the manual.

Periodic Checkpointing

In addition to writing a checkpoint at vacate time, HTCondor can be configured to write checkpoints periodically while the job is executing, to provide additional reliability.

Checkpoint Server

By default, a checkpoint is written to a file on the local disk of the machine where the job was submitted. A checkpoint server has been developed to serve as a repository for checkpoints. When a host is configured to use a checkpoint server, jobs submitted on that machine write and read checkpoints to and from the server rather than the local disk of the submitting machine, taking the burden of storing checkpoint files off of the submitting machines and placing it instead on server machines (with disk space dedicated to the purpose of storing checkpoints).

Other Checkpointing Resources

Checkpointing Research at the University of Tennessee

Unlocking New Frontiers in Drug Discovery: High Throughput Computing in the Thyme Lab

Bringing The National Research Platform Capacity to the HTC Community

Oral Roberts University Advances Cyberinfrastructure with OSPool Integration

IceCube Neutrino Observatory’s Use of High Throughput Computing: Making Discoveries in Astrophysics Using Neutrinos

Learning and adapting with OSG: Investigating the strong nuclear force

What is HTCSS?

Overview

Standalone Checkpointing

Periodic Checkpointing

Checkpoint Server

Other Checkpointing Resources