How to configure MPI on Windows
Known to work with HTCondor version: 7.5
Note: The good folks at Kitware have also posted a comprehensive webpage detailing their setup using MPICH2 with HTCondor on Windows.
Download a copy of MPICH2 from the Argonne site:
During the install, make sure to enable use of the tools by all users, and not just yourself. Once the installation is complete, ensure that the MPICH pass phrase is the same on all machines in the pool:
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SOFTWARE\MPICH\SMPD]
"phrase"="behappy"
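Since the pass phrase must match on every machine, it can help to generate the registry file from one place rather than typing it by hand. The following is a minimal Python sketch that renders the same file contents shown above. The function name render_smpd_phrase_reg is hypothetical, and how the file is saved and imported on each machine is left out.

```python
def render_smpd_phrase_reg(phrase: str) -> str:
    """Build the text of a .reg file that sets the SMPD pass phrase."""
    # Backslashes and double quotes are escaped inside .reg string values.
    escaped = phrase.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "Windows Registry Editor Version 5.00\n"
        "\n"
        "[HKEY_LOCAL_MACHINE\\SOFTWARE\\MPICH\\SMPD]\n"
        f'"phrase"="{escaped}"\n'
    )


if __name__ == "__main__":
    # Prints the same registry fragment used in this guide.
    print(render_smpd_phrase_reg("behappy"), end="")
```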
The final step is to disable the SMPD service. Using the Microsoft Management Console, open the Services Common Console Document. Find the entry labelled MPICH2 Process Manager, Argonne National Lab and modify its properties so that it is inactive and will no longer start on boot.
We do this so that HTCondor can manage the MPI processes itself. That is, we will configure HTCondor such that it can start, stop and monitor any or all of the MPI-related processes.
The aim of installing MPICH2 is to use it in conjunction with HTCondor. To this end, we need to make some changes to HTCondor's configuration, as well as edit some helper scripts.
The following will let HTCondor manage the smpd.exe daemon rather than the Windows Service Manager:
##
# Tell the condor_master what binary to use for MPICH2's
# communication daemon.
##
SMPD_SERVER = C:\Program Files\MPICH2\bin\smpd.exe
SMPD_SERVER_ARGS = -p 6666 -d 1

##
# We are letting HTCondor spawn and manage the smpd.exe server
# for us, so we need to tell the condor_master on this machine
# to spawn a smpd.exe server, in addition to any other daemons
# it is configured to run.
##
DAEMON_LIST = $(DAEMON_LIST), SMPD_SERVER
Unlike the HTCondor daemons, we cannot define a log file for the smpd.exe daemon in the HTCondor configuration file, so we must do this manually:
C:\> smpd.exe -set logfile C:\condor\log\SmpdLog
logfile = C:\condor\log\SmpdLog
Note that we only need to do this once, since the option will be saved persistently in the registry.
We use the following configuration on the dedicated pool machines that will be used for MPI jobs:
DAEMON_LIST = MASTER, STARTD
WANT_SUSPEND = False
WANT_VACATE = False
START = True
SUSPEND = False
CONTINUE = True
PREEMPT = False
KILL = False
RANK = 0
DedicatedScheduler = "DedicatedScheduler@real.host.name"
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler
This configuration allows HTCondor to always run jobs on these machines.
Once HTCondor has been configured to run parallel jobs, we must create a helper script such that HTCondor can use it to interact with the new MPICH2 installation.
The following can be used as a base for a more complex driver:
set _CONDOR_PROCNO=%_CONDOR_PROCNO%
set _CONDOR_NPROCS=%_CONDOR_NPROCS%

REM If not the head node, just sleep forever
if not [%_CONDOR_PROCNO%] == [0] copy con nothing

REM Set this to the bin directory of MPICH installation
set MPDIR="C:\Program Files\MPICH2\bin"

REM run the actual mpijob
%MPDIR%\mpiexec.exe -n %_CONDOR_NPROCS% -p 6666 %*

exit 0
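The decision the helper script makes is small: only node 0 launches mpiexec.exe, and every other node simply waits. The following Python sketch restates that logic so it can be reasoned about or tested away from a pool. It is an illustration under stated assumptions, not a replacement for the batch file. The function name head_node_command and its parameters are hypothetical, and the default path and port are copied from the script above.

```python
from typing import List, Optional


def head_node_command(
    procno: int,
    nprocs: int,
    job_args: List[str],
    mpi_dir: str = r"C:\Program Files\MPICH2\bin",
    port: int = 6666,
) -> Optional[List[str]]:
    """Return the mpiexec command line for node 0, or None for other nodes."""
    if procno != 0:
        # Non-head nodes start nothing; the batch script just blocks here.
        return None
    mpiexec = mpi_dir + r"\mpiexec.exe"
    # Mirrors: %MPDIR%\mpiexec.exe -n %_CONDOR_NPROCS% -p 6666 %*
    return [mpiexec, "-n", str(nprocs), "-p", str(port), *job_args]


if __name__ == "__main__":
    print(head_node_command(0, 2, ["cpi.exe"]))
    print(head_node_command(1, 2, ["cpi.exe"]))
```

In the batch script the two numeric values come from the _CONDOR_PROCNO and _CONDOR_NPROCS environment variables, and the job arguments come from %*.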
Submitting an MPI Job
For our purposes we will use the sample shipped with the Windows version of MPICH2 to demonstrate how to use HTCondor and MPI on Windows:
C:\demo> copy "C:\Program Files\MPICH2\examples\cpi.exe" .
        1 file(s) copied.
We also need to create an input file:
10
100
1000
10000000000000
0
Where the last input, 0, terminates the program.
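If the interval counts change often, the input file can be generated instead of edited by hand. The following Python sketch assumes one value per line followed by the terminating 0, as in the file above. The function name render_cpi_input is hypothetical.

```python
from typing import Iterable


def render_cpi_input(intervals: Iterable[int]) -> str:
    """Return input text: one interval count per line, then a final 0."""
    values = [int(n) for n in intervals]
    if any(n <= 0 for n in values):
        # 0 is reserved as the terminator, so it may not appear earlier.
        raise ValueError("interval counts must be positive")
    return "".join(f"{n}\n" for n in values) + "0\n"


if __name__ == "__main__":
    # Reproduces the input file used in this guide.
    print(render_cpi_input([10, 100, 1000, 10000000000000]), end="")
```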
All that remains is to define a submit file:
universe = parallel
executable = mp2script.bat
arguments = cpi.exe
machine_count = 1
input = input.file
output = out.$(NODE).log
error = error.$(NODE).log
log = work.$(NODE).log
should_transfer_files = yes
when_to_transfer_output = on_exit
transfer_input_files = cpi.exe
queue
Here mp2script.bat is the helper script, and input.file is the input file.
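To try the same job with a different node count, only machine_count needs to change. The following Python sketch renders the submit file above from a template. The names SUBMIT_TEMPLATE and render_submit are hypothetical, and a machine_count above 1 assumes the pool has that many dedicated machines configured as described earlier.

```python
# Template copied from the submit file in this guide; only the program
# name and machine_count are parameterised.
SUBMIT_TEMPLATE = """\
universe = parallel
executable = mp2script.bat
arguments = {program}
machine_count = {machine_count}
input = input.file
output = out.$(NODE).log
error = error.$(NODE).log
log = work.$(NODE).log
should_transfer_files = yes
when_to_transfer_output = on_exit
transfer_input_files = {program}
queue
"""


def render_submit(machine_count: int = 1, program: str = "cpi.exe") -> str:
    """Return the text of a parallel-universe submit file."""
    if machine_count < 1:
        raise ValueError("machine_count must be at least 1")
    return SUBMIT_TEMPLATE.format(program=program, machine_count=machine_count)


if __name__ == "__main__":
    print(render_submit(machine_count=2), end="")
```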
Now we can submit our MPI application to HTCondor:
C:\demo> condor_submit parallel.submit
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 18.
...
C:\demo> condor_q

-- Submitter: marge : <172.16.46.137:4301> : marge
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
 18.0    Administrator    2/17 16:10   0+00:00:11 R  0   0.0  mp2script.bat cpi.

1 jobs; 0 idle, 1 running, 0 held
...
C:\demo> condor_q

-- Submitter: marge : <172.16.46.137:4301> : marge
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD

0 jobs; 0 idle, 0 running, 0 held
Once the job has left the queue, inspect the output file to see the result of the run:
C:\demo> more out.log
Enter the number of intervals: (0 quits) pi is approximately 3.1424259850010987, \
Error is 0.0008333314113056
wall clock time = 0.000014
Enter the number of intervals: (0 quits) pi is approximately 3.1416009869231254, \
Error is 0.0000083333333323
wall clock time = 0.000007
Enter the number of intervals: (0 quits) pi is approximately 3.1415927369231227, \
Error is 0.0000000833333296
wall clock time = 0.000637
Enter the number of intervals: (0 quits) pi is approximately 3.1415926535898295, \
Error is 0.0000000000000364
wall clock time = 12.460265
Enter the number of intervals: (0 quits)
The four results correspond to the inputs 10, 100, 1000 and 10000000000000, respectively.
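To compare runs, the figures can be pulled out of the output file rather than read by eye. The following Python sketch assumes the output has the shape shown above, including the optional backslash where a line was wrapped. The names RESULT_RE and parse_cpi_output are hypothetical.

```python
import re
from typing import List, Tuple

# One result: the pi estimate, its error, and the wall clock time.
RESULT_RE = re.compile(
    r"pi is approximately\s+(?P<pi>\d+\.\d+),\s*\\?\s*"
    r"Error is\s+(?P<err>\d+\.\d+)\s+"
    r"wall clock time\s*=\s*(?P<wall>\d+\.\d+)"
)


def parse_cpi_output(text: str) -> List[Tuple[float, float, float]]:
    """Return (pi_estimate, error, wall_seconds) for each result in text."""
    return [
        (float(m["pi"]), float(m["err"]), float(m["wall"]))
        for m in RESULT_RE.finditer(text)
    ]


if __name__ == "__main__":
    sample = (
        "Enter the number of intervals: (0 quits) "
        "pi is approximately 3.1424259850010987, \\\n"
        "Error is 0.0008333314113056\n"
        "wall clock time = 0.000014\n"
        "Enter the number of intervals: (0 quits)\n"
    )
    for pi_estimate, error, wall in parse_cpi_output(sample):
        print(pi_estimate, error, wall)
```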