Magyar oldal English site

Site map
2023-08-24 18:32:43 (Originally published at: 2023-08-21)

You don't need a declarative language to have reproducible builds

Reproducible builds bring the idea of functional programming into the build process. Build tools essentially become "pure functions": their output depend only on their inputs and nothing else. So no query of the date/time, no relying on hardware random values, or filesystem layout, etc. That's why these build tools try to sandbox and things like that.

An useful property of the pure functions is that their output can be cached (memoized). When the function is called with previously known arguments, then the cached output can be used instead of evaluating (recompiling) the function again. Therefore the build process can be fast when there are no changes.

It can be done in imperative style

Pure functions and memoization is used in the context of functional programming. But we don't have to use functional paradigm to have pure functions, we can use imperative style as long as there are no side effects. This would allow us to crate a build system that use imperative paradigm, while still remain fast and reproducible.

Consider the following C#-ish pseudo-code for compiling a C++ program:

File buildCpp(File source, String[] options, Tree<FileName, File> headers)
{
    /* magic */
}

File linkProgram(File source, String[] linkerOptions, Tree<FileName, File> libs)
{
    /* magic */
}

File buildProgram(
    File[] sources,
    File[] libs,
    String[] compilerOptions,
    String[] linkerOptions,
    Tree<FileName, File> headers,
)
{
    File[] objects = new File[];
    foreach (src in sources)
    {
        objects.add(buildCpp(src, compilerOptions, cppDeps(src, compilerOptions, headers)));
    }
    return linkProgram(objects + libs, linkerOptions);
}

Here File is a data type that's represents a file. It would have the usual open, close, read, seek functions needed to read the file. But it does not contain the path of the corresponding file on the filesystem. The build system binds it into a real file. The function only sees and abstraction of it.

The String the text type in some well known encoding e.g. UTF-8. FileName is the type that represents filename. A filename is not a string type: on Linux it's a sequence of bytes that can contain any byte except the NUL and / characters, so invalid UTF-8 can be a valid filename. On Windows filenames are a sequence of 16 bit integers, that can contain unpaired surrogates therefore they would be invalid Unicode but still a valid filename.

The [] means it's an array. Arrays can be traversed with foreach and the add method can be used to add more elements to it.

The Tree<X, Y> is essentially a recursive dictionary: X is the key type, Y is the value type, but the value can also be another Tree<X, Y> so it's recursive. It can be used to represent a filesystem. It would represent the filesystem the compiler sees, so when the C++ code includes a file #include "foo.h" it would know what it refers to.

The cppDeps function creates a subset of the headers filesystem tree and only keeps those headers that the C++ source file depends on.

The buildCpp and linkProgram functions do the compilation and linking respectively and do so as a pure function. It theory the whole compiler can be implemented in these functions. But we don't have compilers and linkers that operate like this. So that's why build system need to improvise and make various kinds of sandboxes to ensure that the tools operate as a pure function. That's why we pass the filesystem as as a tree, because the tools can only see things we give them as function arguments, nothing else.

To actually build the program we would write something like this:

File[] sources = Directory.glob("*.cpp");
String[] options = ["-I/system_headers", "-iquote", "/my_project", "-Wall"];
String[] linkOptions = ["--no-undefined"];
Tree<FileName, File> headers = new Tree<FileName, File>{
    [FileName.fromString("system_headers")] = FsTree.fromDirectory("/usr/include"),
    [FileName.fromString("myproject")] = FsTree.fromDirectory("."),
};
File[] libs = [File.bind("libfoo.a"), File.bind("/usr/include/libstdc++.so")];

File program = buildProgram(sources, libs, options, linkOptions, headers);
program.persist("myprogram");

Let's dissect this: Directory.glob() can give use an array of files matching the given patterns. So the sources are bound to the .cpp files in the current directory. The options and linkOptions is just an array of GCC-like options to pass to the compiler. Note the filesystem paths there. The headers provide the file-system structure for the compiler to use when resolving the included headers. The FsTree.fromDirectory would return a filesystem tree with all the files bound to the real files within. In the libs you can we can manually create File objects from real files. Then finally as a result of the build we get our program as a File object. This file object is not bound into the real filesystem yet, because it's created by the build process, so we would call the persist method to create an actual file.

It should be noted that there are operations here, that access the system: Directory.glob, FsTree.fromDirectory, File.bind, program.persist. These would allow cheating and would allow a function to communicate outside and depend on things that are not passed as an argument. So these functions are impure and it's forbidden to call them from a pure function: you would need to mark the function as impure or eliminate the call.

The whole point is that the result of pure functions can be cached. So next time they are called the cached result is used. How can this be done? Every data type must have a way to be converted into a cryptographic hash. For example in C# the Object class have a GetHashCode method. That class returns with a simple 32 bit integer and it's used for hash based collections. In our case we need a GetIdentity function, which would return with a 256bit cryptographic hash and it is always calculated on the value, never on the reference of an object. For the File type it would calculate the hash of the underlying file. For arrays it would calculate the hash of each of the elements, concatenate the resulting hashes and calculate the hash of that. For trees it would recursively traverse and calculate the hash of each key and value, concatenate all of this and calculate the hash of it. And finally for functions each of the argument is hashed, the hashes are concatenated and hashed again, and we get the final hash. We can then consult the cache and check if there is a cached value for that hash. If there is a cached value, we simply return that, if there is no cached value then we execute the function.

Rebuild only what's needed

After the first build, another call to the buildProgram function would return the program immedately from cache, because the function is pure

If you change a .cpp file then the buildProgram function must run, because its arguments are different. The cppDeps function can return from cache for unchanged files. The buildCpp function can also return from cache for unchanged files. The buildCpp function run and return a new object file only for changed files. The linkProgram will run because some of the object files are changed.

If you change a header file, then the cppDeps function need to run for each source to check for any new header inclusion, and it will return a new pruned filesystem tree. This new tree is hashed, and if the changed header is among them, the buildCpp will miss the cache and the source file will be rebuilt.

Problems and challenges

Pure functions can only call pure functions so all operations you see in the buildProgram function is pure. Even the array type's add function is pure: it takes the previous state of the array plus the arguments and then creates a new state of array from that. In theory this operation can also be cached and fetched from cache instead of doing the actual addition. However in this case it's faster to simply add the element, so these functions need to be marked as "no cache" and would always execute.

Every time a pure function is called the hash on their arguments needs to be calculated, if the same file appears multiple places (eg. as a header dependency for a source) it's hash need to be calculated many times. So to mitigate this whenever a File object is created, it's hash is precalculated and is stored within the object and then the GetIdentity function would return it when needed instead of recalc. This also means when a tool creates a file it would create something like FileBuilder which it can write data to, and when finished, it would convert it to a File which is a read-only finished file.

There is no C++ compiler that can operate without making system calls or depend on any side-effects. So it's challenging to make build tools pure.

Conclusion

It's possible to have a reproducible build while at the same time the build script is written in an imperative style.

Recently updated:
Logo