Miron is interested in what it would take to make reproducible builds of HTCondor in the build and test environment. These builds should be "on-demand"--perhaps eventually via a web interface--and ideally should not rely on the HTCondor Flightworthy infrastructure. That is, we should be able to stash what is necessary for a build, and therefore be able to do the builds independently.
Summary of steps we know we need to take
These are the steps we need in the short to medium term--clearly there are more things we could do long-term.
Provide a repository that associates the source code and externals with a given version.
Update the HTCondor build script so that it (optionally) does not fetch externals from where they are now--either they are provided to the build, or it might fetch them from the repository. (This step is technically optional, but allows for the builds to happen without relying on an external service.)
Provide a command-line interface to request a build. Input includes:
- HTCondor version
- Platform (OS & architecture, such as x86_rhap_5)
- Desired packaging (tarball, RPM, Deb, Windows installer. Not all of them are available on all platforms.)
- Provide a web interface.
Today, builds get input files from a variety of locations. For the HTCondor builds these include the git source code repository for the HTCondor source code and the HTCondor web site for the external packages HTCondor depends on. Reproducing any given build requires that both git and the web site be available and complete. If an old external was deleted from the web site because HTCondor perceived it as no longer necessary, builds needing that external would not work.
The HTCondor source and documentation are pulled together on an NMI submit node. The externals are pulled on each individual build machine on an as needed basis, completely outside of NMI management. If the externals cache is being used, the input file will never be loaded.
Because each machine pulls together externals on its own, it's hard to say how much disk space a given build requires. The HTCondor source is about 20MB while the documentation is about 1MB. The externals, excluding glibc, are about 107MB; a given build may not use all of the externals, but different builds typically share externals. glibc is complex, only platforms supporting the standard universe include it, and different platforms may use different glibc. A given glibc source is about 163MB, and we might use 2 or 3 across all of our builds. This gives a total size of about 20+1+107+(163*3)= 617MB. These are all compressed tarballs, further compression is unlikely to provide significant savings. Assuming we archive two builds per night (one stable, one developer), we're looking at about 440GB per year of storage. Assuming a single backup, we're close to needing 1TB per year. These numbers ignore developer builds.
Do we archive every build, or just "important" builds like releases?
- If we're not archiving every build, do we use the archive system always (for consistency), but mark non-important builds for deletion "soon"?
Do we move old archives to offline or otherwise "slow" storage?
Do we maintain backups?
HTCondor's full(-ish) revision history is available via Git. Should an archive contain the entire repository? Chews up more space, but likely more useful in event of upstream disappearing. Something clever like a shared Git repository would theoretically work, but ignores risk that upstream repo would be corrupted, damaging our past copies.
What about Metronome?
- As Metronome evolves, do how to we insure that the archived HTCondor "glue" will work with the update Metronome?
Do we need to keep old Metronome versions around?
- What about the HTCondor binaries on which Metronome runs?
As Metronome evolves to use virtual execute machines, it becomes easier to archive off an entire build environment.
- Obviously, the size of the archive grows substantially if one chooses to do this
- Adds a dependency on virtual machines
Probably safer than relying on old hardware that could still run a particular complete image, though.
- The fact that you can now take a VMware image, and run it in Virtual Box would seem to support this.
- Is this something that's worth adding to the list of things to archive?
- For "special" releases?
- What about the NMI submit machine?
- What about the PostgresSQL version that Metronome uses?
- HTCondor pulls externals directly onto build machines, completely outside of NMI's knowledge. Logically this should be moved onto the NMI submit machine, where it can be archived easily and where duplicates can be merged. However, this increases disk and network utilization of the submit nodes, nodes that historically have been overloaded.
Input files will typically be compressed tarballs, so there isn't much advantage to further trying to package them up. It's easier to just give each archive its own directory.
If we are archiving a build, we should do the build out of the archive, identically to if we were trying to rebuild from the archive in the future. This ensures that the archive works and simplifies the code as we always follow the same code path. This will increase disk I/O on whatever machine it is done on, likely the already heavily loaded NMI submit node.
Proposal: Add a new machine: "Packager" which makes read-only disk space available. A submitted job is done on the packager. The packager collects all of the input into the archive location, then submits the "real" jobs the NMI submit point. Initially the packager would expose the archive location as NFS or similar to the submit point, otherwise looking like a "normal" build. This needlessly routes data through the submit node, but should work without Metronome changes. As HTCondor's capability to draw files from web or other inputs grows, individual execute nodes could pull files directly from the packager.
Todd M notes, "If having Metronome track what's fetched on the execute hosts would actually be helpful (note that we wouldn't want to manage the cache, since that's really HTCondor's job (HTCondor's, not Flightworthy), I could probably add a 'remote_fetch' step..."
Another note from Todd M:
- - Talked to Nick about long-term reproducibility. Because of compiler drift, even extraordinary efforts to maintain source code repositories wouldn't necessarily suffice. Likewise, hardware drift could easily make it hard if not impossible to install any copy of an old OS we may have kept. It seems like the best bet in terms of duplicating old bits is to use virtual machines, archiving them every so often, and forward-converting (and testing) them at the same time against the latest version of the virtual machine software. However, we may be better served by (automatically, machine-readable) maintaining knowledge about what used to work; then, at some point in the future, we have a place to start from that we can be certain /used/ to work. Depending on exactly what the goal is, a plan that quickly adapts old releases to the present day may be of more use than a plan that quickly cranks out (presumably patched) binaries for systems that nobody uses with dependencies that are obsolete. (In fact, it occurs to me as I write this up that there's no reason Flightworthy couldn't continue to submit builds and tests for stable releases long after it's stopped working on them. Even if you don't fix them when the rot into brokenness, at least you can keep the record of what broke them, and have a good place to start, later.)
Other than matching up, fetching and building externals, I think the main issue is that HTCondor is currently stored in a git repository. We have 2 alternatives here:
- Keep a copy of the current git source code and a snapshot of the git repo
- Copy the source code of the build out, and modify the glue logic to handle the source tarball. This should be similar to a workspace build, I think.