Java Universe

This was extracted from an email written by Doug Thain on 2-Dec-2001)

Java Universe Notes

Installation

Each startd defines JAVA and JAVA_ARGS parameters pointing to the local JVM binary. There a variety of ways the Java installation could be broken -- the classpath could be wrong, the binary could be wrong, shared libraries may be missing, and so on. To prevent "black holes," the startd does not simply advertise HasJava . Rather, it attempts to execute a small class ( CondorJavaTest ) that probes the JVM for properties and produces a ClassAd . If this program succeeds, the startd re-advertises itself with HasJava=True and JavaVersion and JavaVendor set appropriately. If it fails, the startd dumps the whole error log of the program to the debug file. At the moment, the setup only allows for one JVM to be installed. With new ClassAds , it would be very easy to advertise all of the JVMs available, along with their properties.

Execution

We don't simply want to execute the plain JVM and return its exit status as the job completion code. The exit code of the JVM does not give us enough information to determine exactly what happened to the job. Here's why:

If the program fails to run, the JVM returns 1. If the program runs and falls off the end of main(), it returns zero. If the program runs and exits with System.exit(x), then the JVM returns x. If the program runs and terminates with an exception, it returns 1. The exception case is particularly interesting, because certain exceptions ( OutOfMemoryError , InternalError ) indicate a problem with the local machine, while others ( NullPointerException ) indicate a problem with the program itself. If the JVM itself exits with a signal of any kind, then there is definitely a problem with installation, not the program.

To deal with this, the Java program is run inside a wrapper. The wrapper uses reflection to look up and invoke the program with its arguments. Before executing the program, the wrapper writes a file marking the beginning of execution. The program execution is performed in a try{} block that catches all exceptions thrown by the program. After the program completes, the wrapper writes an end marker that describes the termination -- "normal" if the program fell off of main(), "abnormal" if it threw an exception, and "noexec" if the program could not be executed. The latter two also include a printout of the exception hierarchy and a backtrace of the error.

After executing the program in the wrapper, the starter combines the exit status of the JVM with information from the start and end markers to figure out exactly what happened. There are quite a variety of cases that must be examined, and you can look in the code or ask me if you really want all the gory details.

Currently, the starter analyzes all this information, and then maps all of the exit possibilities into three cases that would apply to a binary program:

If the job completed normally, then it reports an exit with code n. The shadow should cause the job to return to the user.

If the job threw an exception indicating a program error, then it reports an exit with signal SIGSEGV. The shadow should cause the job to return to the user, and the exception information can be found in stderr.

If there was a problem in the execution environment (including exceptions derived from Error), then it reports an exit with signal SIGKILL. The shadow should cause the job to remain in the queue and execute elsewhere.

An improved mechanism would actually report the exception class to the shadow so that it could be interpreted by the ExitRequirements expression and be placed in the user's email. However, I think this can wait until we get some experience running jobs and evaluate whether we are trapping and dealing with errors correctly.

I/O Proxy

A secure I/O proxy provides fine-grained I/O services for naive applications. This I/O proxy is particularly suited to Java applications, but may be used in any universe.

If the job ClassAd contains WantIOProxy=True, then the starter creates a new listening TCP socket. In the job's execute directory, it leaves a file containing the address of the socket and a random "cookie." The user job finds this file, and connects to the address given. After authenticating by presenting the "cookie," the job may perform I/O using a simple protocol called Chirp.

Chirp is a lightweight, fine-grained I/O protocol that resembles the UNIX interface. For example, to open and read twenty bytes from /etc/hosts:

	client: open /etc/hosts r 0
	server: 5
	client: read 5 20
	server: 20
	server: [the data]
We have Chirp client libraries in C and Java, so the user can simply include these in a program instead of writing from scratch. We are also writing a protocol document so that other languages and implementations may be built.

The starter accepts incoming Chirp requests and converts them into remote I/O operations to the appropriate I/O device. This is currently the shadow process at the job submission site. However, it is entirely possible to re-route these to a nearby storage appliance such as a NeST.

The starter provides security services for the job's I/O by communicating over CEDAR, the standard HTCondor socket abstraction. This layer provides a wide range of encryption and authentication methods, including Kerberos, GSI, and SSPI. The exact protocol used depends on a negotiation between the execution and storage sites. By tunneling I/O through the starter, the user is able to take advantage of secure I/O with a minimum of burden on the application.

Implementation Status

The I/O proxy server is checked into the new starter/shadow on V6_3-branch.

The I/O proxy client code has it's own directory, condor_chirp, on V6_3-branch. We haven't decided exactly how to distribute it yet: part of the release, part of the SDK, or a separate distribution.

The Java support is coded up in the new starter/shadow. Pending some more testing, I will check it into the 6.3-branch.

Compiling Notes

The C portion of the Java Universe is built automatically with the rest of HTCondor. However, two small Java programs are necessary to build a complete release. If HAS_JAVA=YES in site.def, then these programs will be built using the JavaCompiler variable and placed in release_dir/lib. If HAS_JAVA=NO, then the release will be built with all the Java support except the two small Java programs. The startd will operate correctly, but will not advertise or run Java without the two classes. If we don't care to install a javac on all of our development platforms, these two programs could be compiled on Linux, placed in a standard directory, and then simply copied into the distribution from any platform.