Tiny Build Farm for Guix, part 2

Building science packages

In our efforts to create a Tiny Build Farm for Guix (TBFG), which is supposed to report on the status of the packages assigned to the science team, we have so far seen how to set up the required infrastructure. On a dedicated machine with Guix as its operating system, we have added several Shepherd services: the Guix Build Coordinator together with a build agent; and the web server part of the BFFE, which enables us to follow the activity of the builders. For performance reasons, we have decided against installing an instance of the Guix Data Service and instead opted to talk to the instance operated by the Guix project at https://data.guix.gnu.org/, which continually evaluates the Guix master branch and creates derivations for all packages in the distribution. The next step is to explore how to talk programmatically to the remote data service from a Guile script, how to extract the derivations we are interested in, and how to submit them for building to our instance of the build coordinator.

Getting information from the data service

On the TBFG machine, we need to install the two packages guix-data-service and (for later use) guix-build-coordinator, which contain Guile libraries with the necessary functionality.

⚠ If installed into a user profile, both packages pull in the guix package as a propagated input, which prevents the user from updating it through guix pull. It is thus recommended to run

guix shell guile-next guix-data-service guix-build-coordinator

instead. At the time of writing, the guile package in Guix is at version 3.0.9, while the data service library requires guile-next, which is at version 3.0.10.

Let us open a Guile REPL and execute the following code (to ease copy-pasting, I omit the prompt of the REPL; lines starting with a $ sign and a number correspond to results).

$ guile
(use-modules (guix-data-service client))
(define my-data-service "https://data.guix.gnu.org/")

(define json
  (guix-data-service-request my-data-service
                             "repository/1/branch/master.json"))
json
$1 = (("revisions" . #((("data_available" . #f) ("commit-hash" . "cb47639a8081e8e2d651ad1612bbd1e482766469") …

The call to guix-data-service-request is equivalent to opening the URL https://data.guix.gnu.org/repository/1/branch/master.json, which executes the same query as the URL https://data.guix.gnu.org/repository/1/branch/master without the .json at the end, but returns the result in JSON format. Moreover, the function call transforms the JSON into a Guile data structure through the guile-json library; in particular, JSON arrays become Guile vectors and JSON objects become Guile association lists, or alists for short (these are lists of key-value pairs, so brace yourself for lots of parentheses in a row). Parsing the result and extracting the information we are interested in thus amounts to unwrapping these successive layers; in true Scheme/Lisp style, we will also usually transform the vectors into lists using the vector->list function.

The JSON we asked for is an object with a single field revisions, which contains an array of revisions, that is, git commits on the master branch; every revision is an object with the three fields date and commit-hash (these are strings) and data_available, a boolean indicating whether the data service has computed the derivations for this commit (which corresponds to the green or grey badges on the website). This structure can be worked out by looking at and playing with the variables in the REPL, or probably more conveniently by opening the corresponding URL in a web browser, which usually displays the JSON in a readable, collapsible form. We can now write a small function (or maybe two even smaller functions) that query the data service and return a list of revisions for which the data service has computed the derivations:

(define (data-available? revision)
  ;; Given a REVISION, check whether it has been treated by the
  ;; data service.
  (assoc-ref revision "data_available"))

(define (get-revisions data-service)
  ;; Query DATA-SERVICE for the list of revisions it has successfully
  ;; treated in the master branch.
  (filter data-available?
    (vector->list
      (assoc-ref
        (guix-data-service-request data-service
          "repository/1/branch/master.json")
        "revisions"))))

(define revisions (get-revisions my-data-service))
revisions
$2 = ((("data_available" . #t) ("commit-hash" . " …

In the following, we will work with revisions in this form, although mainly the commit hashes are of interest. We can extract them as follows:

(define commits
  (map (lambda (revision)
         (assoc-ref revision "commit-hash"))
       revisions))
commits
$3 = ("b966f4007c8492ad89eedf32dd91b3352dba594e" "8a1f56cf8710fc142a2f8ef2e52be82e8aa9f53e" …
(length commits)
$4 = 46
(define commit (car commits))
commit
$5 = "b966f4007c8492ad89eedf32dd91b3352dba594e"

By default the data service returns 100 revisions (including those for which no data is available), which is more than enough for our purposes.

The next step is to obtain the derivations for a given revision, say the newest one with data available. Again this is most easily reverse-engineered from the web interface of the data service: click on the latest revision with a green badge, then on View package derivations; this shows how the URL is to be formed. Since we need all derivations, we also have to tick the All results checkbox; on the other hand, we may restrict the results to one architecture, choosing x86_64-linux as System, and exclude cross-compilation by choosing (no target) for Target. These choices add GET parameters to the query, which can be passed as an alist in the optional third parameter of guix-data-service-request. Again adding .json to the URL (in front of the ?) reveals the structure of the resulting JSON. It is then easy to end up with the following function; notice the use of the quasiquote ` and the unquote ,:

(define (get-derivations data-service commit system)
  ;; Query DATA-SERVICE for the list of derivations for the given COMMIT
  ;; and SYSTEM.
  (map
    (lambda (p)
      (assoc-ref p "derivation"))
    (vector->list
      (assoc-ref
        (guix-data-service-request data-service
          (string-append "revision/" commit "/package-derivations.json")
          `((system . ,system) (target . "none") (all_results . "on")))
        "derivations"))))

(define derivations
  (get-derivations my-data-service commit "x86_64-linux"))
(length derivations)
$6 = 29531
(car derivations)
$7 = "/gnu/store/000lxmn2d17bv2v6znvf6z5vi7ndy8q4-r-janeaustenr-1.0.0.drv"

So the derivations are simply strings pointing to files in the store (of the data service; they are not yet available on the TBFG machine).

Filtering out team packages

29000 derivations are more than our poor tiny machine can handle; the next step is to single out those that correspond to packages maintained by the science team. The team is responsible for certain package modules (or equivalently, for .scm files in the gnu/packages/ directory); which ones exactly can be seen in the file CODEOWNERS checked into the Guix git repository, itself derived from etc/teams.scm. As this list does not change very often, for simplicity we may determine the modules by hand, which may require us to resolve regular expressions (here: fortran(-.+|)) into lists of actually present modules; here we end up with the following:

(define my-locations
  '("algebra" "astronomy" "chemistry" "fortran-check" "fortran-xyz"
  "geo" "graph" "lean" "maths" "medical" "sagemath" "statistics"))

When starting the project, I had hoped to extract the interesting packages directly from the (strings representing) derivations, given a fixed list of package names. But it is a truth universally acknowledged that a programmer never has the singularly good fortune of such simplicity, whatever their feelings or views when first entering the neighbourhood of a problem. Here two reasons speak against it: First of all, the packages of a team may change over time as packages are added, removed or moved to a different module. More immediately, though, only the combination of package name and version can be easily recovered from the derivation by removing a fixed prefix, the hash and a fixed suffix, using the following function:

(define (derivation->name+version derivation)
  ;; Given a DERIVATION (as a string of the form "/gnu/store/..."),
  ;; return the part of it that encodes the name and the version
  ;; of the underlying package, by dropping the 32-character hash
  ;; and the hyphen following it.
  (string-drop (basename derivation ".drv") 33))

Thus /gnu/store/000lxmn2d17bv2v6znvf6z5vi7ndy8q4-r-janeaustenr-1.0.0.drv becomes r-janeaustenr-1.0.0, which is the concatenation of the package name (which is mostly fixed over different revisions) and the package version (which usually increases over time) with a hyphen in between. More often than not it is possible to guess the two components: here they are r-janeaustenr and 1.0.0. Package names often contain hyphens (as here, where one separates the language part, r, from the upstream name, janeaustenr; see the Guix naming conventions); this could be handled by splitting at the last hyphen, but versions may also contain hyphens. Both can contain alphabetic and numeric components. Thus it would be quite possible that the above derivation is for the flourishingly named version janeaustenr-1.0.0 of the r package.

So we need more code to extract the desired information. Luckily the data service knows about the packages in a revision, with their names and their versions in different fields; and also about their locations, that is, the files in which they are defined.

(define (get-packages data-service commit)
  ;; Query DATA-SERVICE for the list of packages for the given COMMIT.
  (vector->list
    (assoc-ref
      (guix-data-service-request data-service
        (string-append "revision/" commit "/packages.json")
        `((field . "version") (field . "location") (all_results . "on")))
      "packages")))

(define packages (get-packages my-data-service commit))
(car packages)
$8 = (("location" ("column" . 2) ("line" . 8273) ("file" . "gnu/packages/games.scm")) ("version" . "0.27.1") ("name" . "0ad"))

It is now enough to compare the file name with our list of locations to extract the packages we are interested in.

(define (location-package? package locations)
  ;; Check whether the PACKAGE comes from the list of LOCATIONS.
  (let* ((file (assoc-ref (assoc-ref package "location") "file"))
         (module (basename file ".scm")))
    (member module locations)))
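
;; For example, the 0ad package shown above comes from games.scm, which is
;; not among our locations, so (location-package? (car packages) my-locations)
;; should return #f.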

(use-modules (srfi srfi-26))
(define (packages-name-version data-service commit locations)
  ;; Query DATA-SERVICE for a list of packages for the given COMMIT
  ;; that come from the list of LOCATIONS. Return a list of two-element
  ;; lists with the names and versions of these packages.
  (map
    (lambda (package)
      (list (assoc-ref package "name") (assoc-ref package "version")))
    (filter
      (cut location-package? <> locations)
      (get-packages data-service commit))))

(define team-name-versions
  (packages-name-version my-data-service commit my-locations))
(car team-name-versions)
$9 = ("4ti2" "1.6.12")

Finally we just need to compare the extracted team package names and their versions with the derivations. Unfortunately this can be quite costly; the following code presents a somewhat optimised solution with memory usage linear in the result, but a quadratic number of comparisons (thanks to Liliana Prikler for suggesting the use of filter-map to me):

(use-modules (srfi srfi-1))
(define (special-cartesian-product X Y)
  ;; Let X and Y be lists of two element lists of the form (x z) and (y z),
  ;; respectively. Return a list of all the (x y) such that there is an
  ;; element z with (x z) in X and (y z) in Y.
  (fold cons '()
        (filter-map (lambda (xz)
                      (let ((yz (find (lambda (yz)
                                        (equal? (cadr xz) (cadr yz)))
                                      Y)))
                        (if yz
                            (list (car xz) (car yz))
                            #f)))
                    X)))
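
;; A toy example with made-up data may clarify what this computes; the
;; second elements act as the shared key z:
;;   (special-cartesian-product '(("a" "1") ("b" "2")) '(("c" "2") ("d" "3")))
;;   => (("b" "c"))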

(define (team-derivations data-service commit system locations)
  ;; Query DATA-SERVICE for the list of derivations for the given COMMIT
  ;; and SYSTEM, filtered by the LOCATIONS of the packages.
  ;; To keep the computed information around, return a list of two-element
  ;; lists, each containing a derivation and the corresponding package name.
  (let* ((derivations (get-derivations data-service commit system))
         (X (map
              (lambda (d)
                (list d (derivation->name+version d)))
            derivations))
         (name-versions
           (packages-name-version data-service commit locations))
         (Y (map
              (lambda (nv)
                (list (car nv) (string-append (car nv) "-" (cadr nv))))
            name-versions)))
    (special-cartesian-product X Y)))

(define (sort-derivation-names derivation-names)
  ;; Just for the fun of it, sort DERIVATION-NAMES, a list of two element
  ;; lists containing derivations and their names, by names.
  (sort derivation-names
        (lambda (x y)
          (string<? (cadr x) (cadr y)))))

(define good-derivation-names
  (sort-derivation-names
    (team-derivations my-data-service commit "x86_64-linux" my-locations)))
(define derivation-name
        (find (lambda (dn)
                (equal? (cadr dn) "lrslib"))
              good-derivation-names))
derivation-name
$10 = ("/gnu/store/3pxq1g2java4f8nwfq7n98qjvhkr1b34-lrslib-7.2.drv" "lrslib")

Strictly speaking, the function team-derivations is not correct; if there were simultaneously a derivation for the package r-janeaustenr at version 1.0.0 and a derivation for the package r at version janeaustenr-1.0.0, then either both or none of them would match, even though it is possible that only one of the packages is covered by the science team; this situation has not yet been encountered, and at worst we would capture one derivation too many. For testing purposes during the development of the TBFG, we additionally check whether the name equals lrslib; in this way only one derivation is returned (while at the time of writing there are more than 700 packages covered by the science team). Moreover the package in question is a self-contained C program (without any inputs), which compiles rather quickly.

Submitting builds

Now that we have a list of derivations, we would like to submit them from our Guile script to the build coordinator. This is not very different from the approach seen last time for submitting from the command line. Again it is recommended to open a browser window on the /activity page of the BFFE to see the build coordinator and the agent in action.

(use-modules (guix-build-coordinator client-communication))

(define my-build-coordinator "http://localhost:8746")
(define ignore-if-build-for-derivation-exists? #f)
(define ignore-if-build-for-outputs-exists? #f)
(define ensure-all-related-derivation-outputs-have-builds? #f)
(define priority 0)

(define (submit-build build-coordinator data-service derivation tags)
  ;; Given a DERIVATION (as a string), submit it to BUILD-COORDINATOR
  ;; together with TAGS;
  ;; DATA-SERVICE is passed through and used by the build coordinator to
  ;; obtain the derivation file and further references contained in
  ;; DERIVATION.
  (send-submit-build-request
    build-coordinator derivation (list data-service) 0 priority
    ignore-if-build-for-derivation-exists?
    ignore-if-build-for-outputs-exists?
    ensure-all-related-derivation-outputs-have-builds?
    tags))

(submit-build my-build-coordinator my-data-service (car derivation-name) '())
$11 = (("build-submitted" . "8f8f1cad-fe9c-462c-bc59-3d1f87abf942"))
$12 = #<<response> …

The global variables, which we pass on to the submit-build function, determine the behaviour of the build coordinator. If ignore-if-build-for-derivation-exists? is true, then the build will not be carried out a second time if it has already been tried (successfully or not) by the build coordinator before. In production, it will thus be preferable to set it to #t; while still experimenting, we are likely to submit the same derivation several times, and setting the value to #f also makes sense to check that rebuilding the same package works. The variable ignore-if-build-for-outputs-exists? goes a bit further; if set to #t, then the build will not be carried out if a different derivation with the same output has already been tried (a very technical distinction; I would recommend leaving it at #f). If ensure-all-related-derivation-outputs-have-builds? is #t, then the build coordinator will recursively submit builds for all the derivations required as inputs to a given derivation. While this sounds reasonable at first, it can go very far, since the coordinator does not look at the store, but at the builds it has handled itself and recorded in its database. This means that the first build submission, when the database is still empty, will entail a complete bootstrap of the Guix distribution; so I would recommend leaving this variable at #f as well.

A build then works as follows: the coordinator sends the derivation to an agent. The agent tries to download all required inputs from a substitute server and, if successful, builds only the derivation it has been asked to build. Otherwise, it reports back to the coordinator that it has encountered a set-up failure, together with a list of missing inputs. This triggers a hook in the coordinator, and the default hook adds the missing inputs to the list of outstanding builds, as well as the failed build itself, to try it again once the inputs are available. In this way, even if ensure-all-related-derivation-outputs-have-builds? is #f, all genuinely missing inputs will be built recursively, until the build succeeds or a real failure in one of its inputs is encountered.
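
Putting these recommendations together, a plausible (though by no means mandatory) choice of settings for production use would be:

(define ignore-if-build-for-derivation-exists? #t)
(define ignore-if-build-for-outputs-exists? #f)
(define ensure-all-related-derivation-outputs-have-builds? #f)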

The submission immediately returns two values, without waiting for the package build to finish. The first return value can be used to link the submitted derivation to the shown UUID of the build, which is a key in the build coordinator database. The second return value is the HTTP response, which we will ignore from now on.

Tags can be added in a parenthesis-rich format; the parameter is a list of tags, where each tag is a two-element list (not a pair!), in which both elements are pairs. The first one pairs the symbol key with a value, the second one pairs the symbol value with a value (the values are used to construct the URL and can be strings or numbers). So the following would work:

(define tags `(((key . "commit") (value . ,commit))
               ((key . "name") (value . ,(cadr derivation-name)))
               ((key . "build") (value . 2))))
(submit-build my-build-coordinator my-data-service (car derivation-name) tags)
$13 = (("build-submitted" . "82a56cac-1e93-4b4a-926f-d8762f919219"))
$14 = #<<response> …

The tags are shown in the activity window and are also recorded in the build coordinator database; as shown here, they can encode arbitrary additional information about a build, such as the commit it comes from, the package name or the submission count for a given derivation.
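
As a final illustration, a small helper (a sketch of my own, not yet part of the code shown so far) could submit every derivation collected in good-derivation-names in one go, tagging each build with the commit and the package name:

(define (submit-team-builds build-coordinator data-service commit
                            derivation-names)
  ;; Submit each element of DERIVATION-NAMES, a list of two-element lists
  ;; of a derivation and its package name as returned by team-derivations,
  ;; to BUILD-COORDINATOR, tagging the build with COMMIT and the name.
  (for-each
    (lambda (dn)
      (submit-build build-coordinator data-service (car dn)
                    `(((key . "commit") (value . ,commit))
                      ((key . "name") (value . ,(cadr dn))))))
    derivation-names))

;; (submit-team-builds my-build-coordinator my-data-service commit
;;                     good-derivation-names)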

Code

For ease of use, the code developed in this post is made available, under GPLv3 or later, in a dedicated git repository on Codeberg. More precisely, it is collected in the file tbfg.scm at commit 51eb5c6d45c66d15b7c14340ec3af0732b5b66fd.

Outlook

We have queried the data service and used the resulting information on packages and derivations to submit build jobs to the build coordinator. But so far we have no programmatic access to the build results; we have only seen the builds flicker by on the BFFE website. It would be nice to record success or failure, and more generally to keep track of the builds; this will be our next step. Since we do not want to operate a substitute server, but rather follow the state of the packages under the responsibility of the science team, we are, unlike the official build farms, not necessarily interested in obtaining the build results. These are sent from the build agents to the build coordinator; on the bordeaux build farm the nar herder shovels them to a separate substitute server. For us everything is on the same machine, which will thus contain the successfully built packages in its store (at least until the next guix gc run). If desired, these could be made available using guix publish.