Featured User

Testing GPU/ML Framework Compatibility

Hannah Cheren
July 6, 2022

Justin Hiemstra, a Machine Learning Application Specialist for CHTC’s GPU Lab, discusses the testing suite he developed to evaluate CHTC’s support for GPU and ML framework compatibility.

Photo by Ali Shah Lakhani on Unsplash.

Researchers at UW–Madison have increasingly required graphics processing units (GPUs) for their work. GPUs are specialized computing hardware that drives different data science technologies, including machine learning (ML). But what actually goes into running an ML job on the UW-Madison Center for High Throughput Computing (CHTC) using GPU capacity?

Justin Hiemstra, a graduate student in the Department of Electrical and Computer Engineering at the University of Wisconsin-Madison who currently works as an ML Application Specialist for CHTC’s GPU Lab, outlined the steps for running an ML job on CHTC using GPU capacity during HTCondor Week 2022.

Whenever a researcher has an ML job that they want to run on CHTC with a GPU, they need three things:

First, the researcher must write their ML code using a deep learning framework, such as PyTorch or TensorFlow.

Second, the researcher needs to pick a GPU type. “You can run ML jobs on a normal server without GPUs, but certain machine learning processes (e.g., neural networks) run much faster if you use one,” notes Christina Koch, one of Hiemstra’s supervisors for his work. When using the HTCondor Software Suite, the researcher can choose a specific GPU type by specifying a CUDA compute capability in the HTCondor job submit file.

Third, the researcher has to pick a CUDA runtime library. This library handles communication between the GPU and the application space, allowing the ML code to run its computations.

For an ML job to complete successfully, these components (ML framework, GPU type, CUDA runtime) must be compatible.

A few issues come into play with this setup. The first is a lack of documentation. “There’s no central resource we can go to to look at and see if different versions of deep learning frameworks, GPUs, and capabilities are compatible with each other,” Hiemstra notes.

The second issue is that, as these frameworks and GPU hardware evolve, Hiemstra and his team have noticed that newer releases have started to drop support for older frameworks and compute capabilities.

The third issue is that “whenever you have computing resources made up of discrete servers in a computing pool, you run into the issue of heterogeneous server configurations.” This heterogeneity adds to the difficulty of picking compatible versions.

Hiemstra has put together a suite of test jobs to explore this compatibility issue. The jobs test whether a single tuple of CUDA runtime, framework, and compute capability versions is compatible on a CHTC resource. To do so, he looks at three things:

First, did the job match, meaning, was it able to find the resources it requested? Second, did the Conda environment resolve, meaning, could all the versions be installed without any conflicting dependencies? Finally, was the framework able to communicate with the GPU? If all three of these things happen, the job should run as expected. If any of them fail, the test job prints an error to its error file.
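As an illustration of that last check, a test job might run a short script along these lines. This is only a sketch, assuming PyTorch as the framework; it is not Hiemstra’s actual test code.

```python
# Illustrative sketch (not the actual CHTC test code): the kind of check a
# test job might run to confirm the framework can talk to the GPU.
import sys

import torch

print(f"PyTorch version:    {torch.__version__}")
print(f"Built against CUDA: {torch.version.cuda}")

if not torch.cuda.is_available():
    # Written to stderr so the message lands in the job's error file.
    print("ERROR: framework cannot communicate with the GPU", file=sys.stderr)
    sys.exit(1)

# Report the compute capability of the GPU the job actually landed on.
major, minor = torch.cuda.get_device_capability(0)
print(f"GPU: {torch.cuda.get_device_name(0)} (compute capability {major}.{minor})")

# A trivial tensor operation on the GPU confirms end-to-end functionality.
x = torch.ones(1000, device="cuda")
print(f"Sanity check sum: {x.sum().item()}")
```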

When a job fails, it gives some indication of where the error occurred. These messages get recorded and later reviewed by Hiemstra so he can better understand what’s happening on the GPU servers in the CHTC GPU Lab.

The goal now for Hiemstra and his team is to “look at all of the different versions we might be interested in combining to see which ranges of compute capabilities, frameworks, and CUDA runtime libraries might work with each other.”

Issues arise when trying to analyze the entire version space. First, the version space to test grows combinatorially. “On an active system where research is being done, that’ll start gobbling up capacity and taking it away from the researchers,” Hiemstra remarks.

To prune the version space so they’re not testing tens of thousands of combinations, Hiemstra and his team limit the versions they test to the compute capabilities that CHTC actually has available. In addition, they assume that researchers use tools, such as Conda, to install their software, so they focus on framework and CUDA runtime versions that are available through Conda.
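A minimal sketch of that pruning idea is below; the version lists and the compatibility rule are hypothetical placeholders, not CHTC’s actual data.

```python
# Sketch of pruning the version space before generating test jobs.
from itertools import product

# Compute capabilities actually present in the pool (example values only).
chtc_capabilities = ["6.0", "7.5", "8.0"]

# Framework and CUDA runtime versions installable through Conda (example values only).
framework_versions = ["pytorch=1.10", "pytorch=1.11", "tensorflow=2.8"]
cuda_runtime_versions = ["cudatoolkit=10.2", "cudatoolkit=11.3"]

def plausible(framework, cuda_runtime, capability):
    """Placeholder filter: drop combinations known in advance to be untestable."""
    # For example, skip pre-11 runtimes on compute capability 8.0 cards;
    # the real pruning rules would go here.
    return not (capability == "8.0" and cuda_runtime == "cudatoolkit=10.2")

test_tuples = [
    (fw, rt, cap)
    for fw, rt, cap in product(framework_versions, cuda_runtime_versions, chtc_capabilities)
    if plausible(fw, rt, cap)
]

print(f"{len(test_tuples)} version tuples to test")
```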

The second issue is that the team needs some way of automatically collecting the different test parameters, so that no one in CHTC has to keep updating them by hand. Each job needs several files to run, and the team generates these files dynamically using Python string formatting.
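As a rough sketch of that idea, a string template might be filled in once per version tuple. The file names, submit-file attributes, and package names below are illustrative assumptions, not CHTC’s actual files, and the exact GPU-requirement syntax can vary with HTCondor version.

```python
# Sketch of generating per-test files with Python string formatting.
# Everything named here (run_test.sh, the submit attributes, the tags)
# is an assumption for illustration.

SUBMIT_TEMPLATE = """\
universe     = vanilla
executable   = run_test.sh
arguments    = {env_file}
request_gpus = 1
require_gpus = (Capability >= {capability})
output       = test_{tag}.out
error        = test_{tag}.err
log          = test_{tag}.log
queue
"""

ENV_TEMPLATE = """\
name: test-{tag}
channels:
  - conda-forge
dependencies:
  - {framework}
  - {cuda_runtime}
"""

def write_test_files(framework, cuda_runtime, capability):
    """Write a submit file and a Conda environment file for one version tuple."""
    tag = f"{framework}_{cuda_runtime}_{capability}".replace("=", "-").replace(".", "_")
    env_file = f"environment_{tag}.yml"
    with open(env_file, "w") as f:
        f.write(ENV_TEMPLATE.format(tag=tag, framework=framework, cuda_runtime=cuda_runtime))
    with open(f"test_{tag}.sub", "w") as f:
        f.write(SUBMIT_TEMPLATE.format(tag=tag, env_file=env_file, capability=capability))

write_test_files("pytorch=1.11", "cudatoolkit=11.3", "8.0")
```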

Finally, they need a way to manage all the jobs, since this process continues to “fire off” hundreds of them. To avoid having scripts run on the CHTC Access Point indefinitely, they decided on a “timeout” period of 24 hours.

Hiemstra and his team use DAGMan, HTCondor’s tool for Directed Acyclic Graph (DAG) workflows, to first spawn a parent process.

That parent process will do all the version space pruning and file generation. It’ll then submit all the jobs for testing, wait 24 hours for that timeout, and “run a postscript to interpret the output of those jobs.”

Next, they process all the output files to gain further insight into how the system works.
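A hedged sketch of what that post-processing might look like is below; the file naming convention and the pass/fail criterion are assumptions for illustration, not the team’s actual postscript.

```python
# Sketch of a postscript-style pass: scan each test job's output and error
# files and build a pass/fail table. Naming and criteria are assumptions.
import csv
import glob
import os

rows = []
for out_path in glob.glob("test_*.out"):
    tag = out_path[len("test_"):-len(".out")]
    err_path = f"test_{tag}.err"
    err_text = ""
    if os.path.exists(err_path):
        with open(err_path) as f:
            err_text = f.read()
    rows.append({
        "tuple": tag,
        "passed": "yes" if not err_text.strip() else "no",
        "first_error_line": err_text.splitlines()[0] if err_text.strip() else "",
    })

# Summarize the results as a table for later review.
with open("results.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["tuple", "passed", "first_error_line"])
    writer.writeheader()
    writer.writerows(rows)
```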

Currently, they run the test suite quarterly, look at the output table, and see whether anything unexpected pops up. Hiemstra explains that going through this process “should give us some tools to debug the system if something suddenly crashes or if different versions a researcher is using are not compatible with each other.”

Looking ahead, Hiemstra is curious to examine how a researcher’s choice of versions affects the runtime of their ML model, and whether it affects the model’s outcome or performance.

“Everything about machine learning approaches is diverse and changing quickly,” Koch remarks. “Having information about compatible frameworks and GPUs allows us to be more responsive and helpful to researchers who may be new to the field.”

The implementation of this tool that Hiemstra and his team have developed can be found on the CHTC GitHub Repository.

Watch a video recording of Justin Hiemstra’s talk at HTCondor Week 2022, and browse his slides.