Job Fails Under Condor

"I have a job that runs on the command line, but it crashes when run under HTCondor." Many of us have probably faced a problem like this.

We recently worked with a user who had a job that exhibited this behavior (it segfaulted when run under HTCondor). It took us a while to figure out what the cause was -- environment variables under HTCondor differed only trivially from the command line (the job was using "getenv = true"), and the command line arguments were exactly the same.

We eventually figured out that the job was crashing because the file descriptor limit when run under HTCondor was higher than when it was run from the command line(!). This was a bit of a surprise, and it clearly indicates a problem in the program's code, but it also highlights an important and somewhat non-obvious way in which running a job under HTCondor differs from running it on the command line.

(HTCondor jobs inherit their limits from the HTCondor daemon that spawns them. In the case of the file descriptor limit, some HTCondor daemons need higher limits than most user jobs typically need. We are considering changing this in the future, but this is the current situation.)

At any rate, system limits are something to keep in mind when debugging this type of problem.
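
One way to test whether a limit is the culprit is to run the job through a small wrapper that logs the limits it inherited and lowers the soft file descriptor limit before starting the real program. The following is a minimal sketch in Bourne-style shell (bash) rather than csh. The script name run_with_limits.sh, the TARGET_NOFILE variable, and the value 1024 are assumptions for illustration; use whatever your interactive shell reports.

```shell
#!/bin/bash
# run_with_limits.sh (hypothetical name): log the inherited limits, then
# lower the soft descriptor limit before starting the real job.

# Assumed target value; replace it with what `ulimit -n` prints in your
# interactive shell.
TARGET_NOFILE=${TARGET_NOFILE:-1024}

lower_nofile() {
    # $1 = desired soft limit for open file descriptors
    local current
    current=$(ulimit -S -n)
    # Only ever lower the limit; raising it can fail if the requested
    # value is above the hard limit.
    if [ "$current" = "unlimited" ] || [ "$current" -gt "$1" ]; then
        ulimit -S -n "$1"
    fi
}

echo "Limits inherited by the job:" >&2
ulimit -a >&2

lower_nofile "$TARGET_NOFILE"
echo "Soft descriptor limit now: $(ulimit -S -n)" >&2

# Run the real program, if one was given, with the adjusted limits.
if [ "$#" -gt 0 ]; then
    exec "$@"
fi
```

To use it, the submit file would name the wrapper as the executable and pass the real program and its arguments through the arguments command. If the job stops crashing with the lower limit in place, the limit difference is the likely cause.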

Another thing that is likely to be different between running on the command line and running under HTCondor is the umask setting (controlling the permissions of files created by the job). This is one more thing to check if you are having problems with jobs not working correctly under HTCondor.
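
The umask can be checked and pinned in the same way. The sketch below prints the umask the job inherited, sets an explicit one, and creates a scratch file to show the resulting permissions. The EXPECTED_UMASK variable and the value 022 are assumptions; substitute whatever umask prints in your login shell.

```shell
#!/bin/bash
# Report the inherited umask, then set the one the job expects.
# 022 is an assumed value, not something HTCondor guarantees.
EXPECTED_UMASK=${EXPECTED_UMASK:-022}

echo "umask inherited by the job: $(umask)" >&2
umask "$EXPECTED_UMASK"
echo "umask now: $(umask)" >&2

# Files created by shell redirection are requested with mode 666 and
# then masked by the umask, so 022 yields rw-r--r-- (644).
scratch=$(mktemp -d)
: > "$scratch/probe"
ls -l "$scratch/probe" >&2
rm -rf "$scratch"
```

Running this once on the command line and once as an HTCondor job shows quickly whether the two environments create files with different permissions.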

Here's an example of a job that prints out the limits, changes the stack size limit, and prints out the limits again.

# File: change_limits.csh

  #! /bin/csh
  limit
  echo ""
  echo "Changing stacksize"
  limit stacksize 4096
  echo ""
  limit

# File: change_limits.sub

  universe = vanilla
  executable = change_limits.csh
  output = change_limits.out
  queue

# File: change_limits.out

  cputime      unlimited
  filesize     unlimited
  datasize     unlimited
  stacksize    unlimited
  coredumpsize unlimited
  memoryuse    unlimited
  vmemoryuse   unlimited
  descriptors  1024
  memorylocked 64 kbytes
  maxproc      1024

  Changing stacksize

  cputime      unlimited
  filesize     unlimited
  datasize     unlimited
  stacksize    4096 kbytes
  coredumpsize unlimited
  memoryuse    unlimited
  vmemoryuse   unlimited
  descriptors  1024
  memorylocked 64 kbytes
  maxproc      1024

Note that the limits on your process under HTCondor will depend on your HTCondor configuration. Also, the limits may vary according to which universe your job runs under.
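
Because the limits depend on configuration and universe, it can help to save a listing from the command line and another from the job, and compare the two. The helper below is a small sketch in Bourne-style shell; the function name compare_limits and the file names in the usage note are hypothetical. Both listings need to come from the same command (for example, both from csh's limit or both from ulimit -a), since the output formats differ.

```shell
#!/bin/bash
# compare_limits (hypothetical helper): report whether two saved limit
# listings are identical, and show the differing lines if they are not.
compare_limits() {
    # $1 = listing saved on the command line
    # $2 = listing saved by the job under HTCondor
    if diff "$1" "$2" > /dev/null; then
        echo "limits match"
    else
        echo "limits differ:"
        # diff exits non-zero when files differ; that is expected here.
        diff "$1" "$2" || true
    fi
}
```

For example, after saving the command-line listing to limits.cmdline and the job's listing to limits.condor, running compare_limits limits.cmdline limits.condor prints only the limits that changed.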