Reflections on limits to verifiability of software integrity and on the role of reproducibility

(c) 2023 an; this text is being released to public domain, alternatively, when necessary, under 0BSD

first published: 2023-11-23 on rb-general@lists.reproducible-builds.org

last change: 2023-11-23

Summary

This discourse explores the absolute and practical limits to assurance of software integrity. Reproducibility appears to be crucial for strongly asserting the integrity of artifacts. The analysis emphasizes that human consensus plays a fundamental role in the application of reproducibility.

Moreover, combined with checking of consensus between diverse installations, reproducibility makes it possible to assert artifact integrity even without knowledge of the integrity of the software, or even of the hardware, used to generate the artifacts. This allows generation of a known-good starting-point toolchain, where “toolchain” is explicitly defined to include all post-boot[1] software needed for source-to-binary translation, i.e. the kernel, general purpose utilities etc.

Under the assumption that less than a majority of the installations used are corrupted specifically to attack the check, a majority consensus provides complete integrity assurance. Diversity of the initial toolchains makes it implausible that a majority could collude in an attack.

Known approaches which do not specifically involve both reproducibility and comparison between independent builds provide a level of integrity assurance of the resulting artifacts limited by the integrity of the a priori trusted prerequisites for running programs, which, at the very minimum, include the hardware. As an example, binary-seed-based solutions cannot exceed the lowest of the integrity levels of the binary seed, of the pre-boot[2] software and of the hardware being used.

Introduction

Computing devices and tools have become a crucial part of many vital activities of everyday life. As much as other real life components of highest importance (e.g. the water supply), they are a target of attacks promising high nefarious rewards and represent a point of vulnerability for society. One attack vector which is gaining prominence is subverting the computing components (compilers, libraries, build server installations) used to build applications, in order to overturn the behavior of the applications in real life activities. These are known as supply chain attacks[3]. Auditing the source code, combined with ensuring that the machine code representation of computing tools is functionally equivalent to the source, is a correspondingly powerful means of protection.

A practical approach to ensuring functional equivalence is confirming the equality of the contents of the programs (the data representing algorithms expressed in some language, human-readable or not). Historically this is routinely applied (usually via checksumming) by comparing source or binary code to a known-good master. Identifying a trustable master of binary data has proven to be a challenge by itself, because binary data is much harder to audit than source. This problem disappears if the functional equivalence of a binary representation to the source one can be reliably confirmed. One approach to a solution is the so-called reproducible builds. Given reproducibility, a known-good translator between the source and the binary representation of algorithms, aka “toolchain”, is enough to provide a known-good master of the binary result. Then traditional verification against the latter, e.g. by checksumming, can be applied[4].
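As a minimal sketch of this traditional verification step (the function names are illustrative, not tied to any particular tool), comparing an artifact against a known-good master digest can look like this:

```python
import hashlib

def sha256_of(path: str) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so arbitrarily large artifacts can be handled.
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def matches_known_good(path: str, known_good_digest: str) -> bool:
    """Compare an artifact against a known-good master checksum."""
    return sha256_of(path) == known_good_digest
```

The hard part, as the text notes, is not computing the digest but establishing the authenticity of `known_good_digest` in the first place[4].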

The starting point

Unfortunately, to ensure the integrity of a toolchain by the reproducible builds approach, we need a known-good toolchain to build it. This problem does not seem to be solvable unless we can break the recursion. The purpose of my analysis is, among other things, to find out to what degree, theoretically and in practice, a “known-good” starting point with minimal implicit trust can be chosen and used. The result, as discussed below, appears to be that no implicit trust is necessary at all.

Terminology

The word “integrity” is to be interpreted as “absence of hidden purposeful deviations from the expected/specified behavior”. I will treat conformance to specification as technically equivalent to integrity, without trying to distinguish between malicious versus spontaneous deviations. I will specifically discuss the effects of bugs and of underspecification.

A “toolchain” is to be understood to include its whole runtime environment (e.g. the OS kernel, libraries, the needed general purpose programs).

I will use the word “trust” to refer to the assumption that a component of a computing system behaves conforming to the concerned human party’s interpretation of the specification of the corresponding algorithm[5].

I take for granted that always at least some part of the environment[6] of any computing process (execution of an algorithm expressed in some way) for the purposes of this process has to be assumed, a priori, to be trusted[7]. (For example, there does not seem to exist any practical way to reliably analyze hardware, i.e. to verify its full conformance to its expected properties.)

This point of view does not distinguish between hard- and software. It is the user who chooses the sufficient contents of the a-priori-trusted part of her system, to make the system operational.

This could be, for example, trusting the Digital Equipment Corporation to implement their PDP-11 hardware honestly. At the time this was a reasonable expectation, when there hardly existed technologies capable of covertly and consistently subverting the CPU or the peripherals for some gain of the perpetrator. Today the situation is different, but the overwhelming majority of users still choose to trust both their hardware and, not least, the pre-boot software.

This presents a challenge of doing “sufficiently reliable” computations on systems which generally should not be assumed to be trustable.

I will hence suppose that a user has made a practically weighed decision regarding her choice of an a-priori-trusted set of resources (hard- and software) and, given this conscious choice, wants to ensure, i.e. to avoid taking for granted, that an additional part of her computing environment also follows an interpretation of the corresponding specifications acceptable to her. I will also assume that the border of the trusted set lies outside the possible gray area between hard- and software, in the realm of software.

The tentative goal is to find out, under which circumstances and to what degree the sought assurance can be achieved.

Reproducibility

The point of interest is reproducibility as a method to confirm integrity. A reproducible build based on a trusted toolchain allows one to safely expand one’s trusted subset of the computing tools.

The source code is the human-readable specification of the corresponding algorithm. Thus, if we like the specification (the reliability of auditing of the source code is out of the scope of this text) and are sure about the integrity of the toolchain which translates the source representation of the algorithm into another representation, suitable for adding to our trusted realm (e.g. suitable to be interpreted by the hardware present in that realm), then we can trust this resulting binary representation.

Reproducible builds allow us to base our trust in a binary program on a different, socially anchored, kind of trust, in people who have used their compilers to build the data independently of each other.

Are all of them trustworthy? Are their compilers trustworthy? Are the communication channels, for example mailing lists, protected from tampering? The answers are “definitely not always”. Fortunately this does not matter, as long as the consensus about which result is correct is strong.

The strength depends on the degree of corruption in each component above. Modern “social media” give us unfortunately an example of communication methods which can make it extremely hard to reach a consensus. Nevertheless, in practice consensus on reproducible artifacts converges well.

To use this, we need a trusted ability to reliably identify binary artifacts as trustable. This can be solved by checksumming, where the checksums become the confirmation of “known good” data.

Now let us explore the situation where neither a toolchain nor any means of binary contents verification (e.g. checksumming and displaying the result to the user) is included in the initial trusted set. The goal will be to find out whether this precludes successful application of the approach of reproducible builds (e.g. we know that there is known-good boot data corresponding to its source, but how do we know that our copy corresponds to it?).

As an example, the hardware and a bootloader would be included in the initial set, but not anything else. Booting some data without checking would be adding an extra a-priori-trusted component.

Verifiability

Does this imply that a checksumming tool must be present in the a-priori-trusted set? Fortunately, not necessarily.

It is possible to rely on a consensus approach again, in a slightly different form.

Let us look for a consensus between multiple independent and dissimilar computing installations. As a bonus, if these installations happen to be under control of the same user, then the matters of communication reliability and personal honesty disappear, from the point of view of that particular user.

Let us define the whole of a checksumming tool to include its runtime environment (say, an OS kernel, libraries, general purpose tools and similar). In other words, it shall be usable totally “standalone”, as a boot load. Let us specify its behavior in the human-readable form of a programming language. Let us use several independent interpreters of this specification which express their interpretation in the language understood by a tool present in our a-priori-trusted set (e.g. by the hardware). These interpreters are colloquially called compilers. If their results match, then we know that they share a common view on how the specification shall be expressed in the corresponding machine language.

Of course, independent translations of a specification from language to language are all but guaranteed to be different. To overcome this, let us compare the resulting functionality, not its implementation. Comparing the functionality exhaustively is, of course, impossible. Reproducible builds solve this problem: in the first stage we can build a toolchain from source and then let it translate the source specification of checksumming into binary.

It is irrelevant whether the differently built compilers behave identically in all situations - we will only use them once - but they must agree on the translation of the checksumming tool. A consensus indicates a common view on what the built compiler is supposed to do when translating the checksumming tool, a view expressed in the machine language. A dissent means that some of the toolchains have a differing view.

Let the resulting checksumming tools check themselves and each other. As long as at least one of their corresponding toolchains did not collude[8], they will not cover up each other’s dishonesty or disagreement in interpretation. In that case, a consensus will mean sanity, as strong as the strength of the checksumming function.
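The mutual cross-check can be sketched as a small model (illustrative only; the data layout and names are my own assumptions, with each installation reporting the digest it computed for every built binary, including its own):

```python
def cross_check(reported: dict) -> bool:
    """Mutual cross-check between independently built checksumming tools.

    reported[checker][binary] = digest that installation `checker`
    computed for artifact `binary`.

    Consensus holds only when every installation that examined a given
    binary reports the same digest for it.
    """
    binaries = {b for per_checker in reported.values() for b in per_checker}
    for binary in binaries:
        digests = {
            per_checker[binary]
            for per_checker in reported.values()
            if binary in per_checker
        }
        if len(digests) != 1:
            return False  # some toolchain dissents on this binary
    return True
```

A single corrupted toolchain cannot hide here: to escape detection it would have to produce the same digests as the honest ones, i.e. agree with the honest interpretation.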

What is the degree of trustability which can be achieved in this way? I argue that in the presence of diversity it is at least as good as the reliability of reproducible builds.

It can be higher when the human factor and communication channels are excluded. Under the control of a single party with access to a large variety of installations it can be arbitrarily high. If, for example, we choose to take for granted that specifically corrupted[9] installations represent at most a minority of the chosen installation pool, and at the same time we observe a majority consensus, then the integrity assertion could be said to be “complete”, i.e. limited only by the completeness of the specification and the possible ambiguity of its language.

This knowledge about the assurance being “complete” is, however, not transferable from the party conducting the check to others; for the others the assurance is only as good as social trust and/or consensus can provide.

A different aspect is how well the toolchains' interpretation of the sources corresponds to the one expected by the human. The programming language can be underspecified, and the implementation of the intermediate compiler and/or of the checksumming tool can happen to be expressed ambiguously.

This does not present any problem with lone outliers; the corresponding results can be discarded without caring whether they represent a rigged compiler, a flaky compiler, or a sane compiler taking a different stand on underspecified input language statements.

However, several differing consensus sets can also arise. I suggest treating all such sets but the largest one as non-consenting, and then observing whether the remaining consenting set is big enough to sufficiently exceed the expected maximum possible share of corruption. The question of the most efficient handling of such a case otherwise remains open.
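This selection rule can be sketched as follows (a minimal sketch; the threshold parameter and its default are my own illustrative assumptions, not prescribed by the text):

```python
from collections import Counter

def consensus_digest(digests, max_corrupt_share=0.5):
    """Keep only the largest agreeing set of results and accept it as
    the consensus when its share strictly exceeds the assumed maximum
    share of specifically corrupted installations.

    Returns the consensus digest, or None when no set is large enough.
    """
    if not digests:
        return None
    digest, votes = Counter(digests).most_common(1)[0]
    if votes > max_corrupt_share * len(digests):
        return digest
    return None
```

With the default threshold this is exactly the majority criterion discussed below: three out of five agreeing installations yield a consensus, while a two-against-two split does not.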

Given the presumably incomplete knowledge about the provenance and reliability of the available pre-existing toolchains, extending their number and diversity helps to ensure the validity of the result.

As long as there is at least one non-corrupted system in the chosen consenting set, the integrity is assured. How would we know that there is one there?

I suggest that the assumption that less than half of reasonably chosen installations with independent provenance are specifically corrupted will hold with practical (i.e. sufficiently small) sizes of installation pools. Then a majority consensus will provide assurance as good as that of reproducible builds in general, and up to “complete” as illustrated above.

I welcome suggestions of alternative and possibly more efficient criteria for ensuring a practically needed level of integrity assurance.

Given the generated and consented-upon data, the user can create a boot medium from the generated data and recheck that all systems checksum this data to an equal value. Then the integrity of the data on the medium is proven, without having to assert the integrity of any of the participating systems, not even of the corresponding hardware.

The availability of a known-good checksumming tool makes the user ready to rely on reproducible builds to expand the trusted set on any target installation under her control, at will.

Now, let us follow the routine described above and, side by side with a checksumming program, also generate a toolchain. Then we have also obtained a known-good toolchain - by relying on reproducibility consensus, but not on any prior known-good toolchain. The checksumming tool is not even necessary; the reciprocal cross-comparison can just as well be executed on the full data, by general purpose utilities, especially when the builds are under the control of a single user.
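The reciprocal cross-comparison of full build outputs by general purpose utilities might, for example, look like the following sketch, which byte-compares two build output trees (the text leaves the concrete utilities open; this is only one illustrative possibility):

```python
import filecmp
import os

def trees_identical(dir_a: str, dir_b: str) -> bool:
    """Byte-compare two build output trees: True only when no file is
    missing on either side and no common file differs."""
    cmp = filecmp.dircmp(dir_a, dir_b)
    if cmp.left_only or cmp.right_only or cmp.funny_files:
        return False
    # Compare contents byte by byte (shallow=False), not just metadata.
    _, mismatch, errors = filecmp.cmpfiles(
        dir_a, dir_b, cmp.common_files, shallow=False
    )
    if mismatch or errors:
        return False
    return all(
        trees_identical(os.path.join(dir_a, d), os.path.join(dir_b, d))
        for d in cmp.common_dirs
    )
```

Running such a comparison pairwise over the outputs of all installations gives the same consensus information as mutual checksumming, at the cost of transferring the full data.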

For a party who intends to build all binaries from source, the known-good toolchain provides the necessary starting point. Not least, this is well suited for reproducible builds: participating by generating and distributing binaries and their checksums, or otherwise confirming the validity of the checksums.

For a user who chooses to avoid compilation (by far the most common case) and relies on reproducible builds for integrity assurance, the corresponding starting point would be the known-good checksumming tool as above (a kernel and general purpose programs included).

Practical feasibility

The feasibility of consensus-based bootstrap has been demonstrated by its implementation, the successfully completed VSOBFS (“Verifiable Source-Only Bootstrap From Scratch”) project[10], published in March 2023.

To avoid confusion between the concept and the implementation, I suggest designating the concept as “self-sufficient source-to-binary equivalence assurance via consensus of diverse reproducible builds”, SSSBEA-CODRB.

What about reducing the a priori trusted set?

Given that several or all of the computing instances in the discussed scenario produced a trustable checksumming program, can we conclude that we have proven their innocence/trustability? Unfortunately not. We have only proven that they did not interfere with the production of the checksumming program.

On the other hand, if there is open-source pre-boot software for a certain hardware platform, then including this pre-boot software in the multiple compilation scenario above, to generate the corresponding artifacts, allows us to reduce the necessary a-priori-trusted set on that platform to include only the hardware.

Related projects

The practicality of SSSBEA-CODRB means that the efforts aimed at minimizing the starting binary seed and at creating an inspectable-in-binary-form seed, as e.g. in GNU Mes[11],[12], seem misdirected: such a seed is now known to be unnecessary for a full-integrity bootstrap.

The use of the misnomer “full-source bootstrap” to denote the intention to treat the binary seed as if it were source[13] may regrettably have contributed to the disorientation of this development.

Conclusions

Integrity assurance based on consensus of diverse reproducible builds allows expanding a trusted set of components in a computing environment, without adding any a priori trust.

The assurance is at least as good as that of reproducible builds in general (the latter being limited by the efficiency of human social communication and the capacity to reliably reach consensus).

The integrity of the produced artifacts is stronger than that of any computing process, because all implicit trust is excluded, even trust in hardware.

By contrast, the integrity of artifacts produced by binary-seed-based bootstrapping, without relying on consensus between independent builds, is at best as good as the integrity of the least trustable of the corresponding hardware, pre-boot software and binary seed.

Integrity-assured generation of pre-boot software allows minimizing the amount of non-hardware among the a priori trusted components of certain target systems, potentially to zero.

Some questions mentioned in the text remain open; suggestions and discussions are welcome.

[1] post-boot refers to all software except pre-boot (below)

[2] pre-boot denotes the software responsible for hardware initialization and for loading boot data supplied by the user

[3] https://en.wikipedia.org/wiki/Supply_chain_attack

[4] Both making a direct copy/comparison and using checksumming represent an integrity-proof problem by themselves: how to make sure that the copy is made from the correct master and is error-free, respectively how the authenticity of the checksum value is established. We assume that the traditional approach is being used, which is a consensus based on social trust, protecting against forging and against communication channel errors. An example is fetching data from several different mirror sites. Without their consensus, and a social consensus on their reliability, it is hard to distinguish malicious sites from legitimate ones and to identify transfer errors. The authenticity of checksums is the same problem, but with much less data to compare. Signing the data allows additional socially-anchored predicates in the consensus estimation, at the cost of extra consensus chains to consider (who/how many signed whose key).

[5] The authenticity of the specification does not matter. Its contents as seen by the human are the basis for verification of its implementation. If they do not match, the implementation is to be rejected.

[6] In the general sense, not referring to Unix- or otherwise OS-specific uses of this term.

[7] Note that this assumption concerns processes, not artifacts.

[8] “Collusion” in this context does not imply a common attack origin or payload, but denotes a common specific goal to subvert the contents check. Such attacks, even unrelated to each other, can happen to cover up each other’s corruptions, then the corresponding systems would look like they agree with each other.

[9] To be able to cause even a minimal chance of consensus, a corruption must be crafted against this specific method of checking the result, i.e. the checksumming application must falsely report specific outputs for specific input data, due to a corruption in itself or in its runtime environment, e.g. in the OS kernel. Such attacks, as a generalization of “trusting-trust” or via corrupted hardware, do not look impossible. I leave analysis or construction of such attacks as an open question for a possible further discussion.

[10] http://rbzfp7h25zcnmxu4wnxhespe64addpopah5ckfpdfyy4qetpziitp5qd.onion
otherwise
https://ipfs.io/ipfs/QmT2Mo4pcCGSf3iJ6NnU8nFv7yEUiM8mU62ArWbcdikEVn
otherwise
https://www.zq1.de/~bernhard/mirror/rbzfp7h25zcnmxu4wnxhespe64addpopah5ckfpdfyy4qetpziitp5qd.onion

[11] https://web.archive.org/web/20231120152543/https://www.gnu.org/software/mes/manual/html_node/Reduced-Binary-Seed-Bootstrap.html

[12] https://web.archive.org/web/20220926021019/https://www.gnu.org/software/mes/manual/html_node/Full-Source-Bootstrap.html

[13] The link above explains that Full-Source-Bootstrap means to treat certain binaries as if they were human-readable, in other words as if they were source: “… all binary seeds need to be inspectable and must be reviewed.”