Tiny Build Farm for Guix, part 2
Building science packages
In our effort to create a Tiny Build Farm for Guix (TBFG), which is meant
to report on the status of the packages assigned to the science team,
we have so far seen how to
set up
the required infrastructure.
On a dedicated machine with Guix as its operating system, we have added
several Shepherd services:
the Guix Build Coordinator together with a build agent;
and the web server part of the BFFE, which enables us to follow the
activity of the builders.
For performance reasons, we have refrained from installing an instance of the
Guix Data Service, and opted instead for talking to the instance operated
by the Guix project at https://data.guix.gnu.org/, which
continually evaluates the Guix master branch and creates derivations for
all packages in the distribution.
The next step is to explore how to programmatically talk to the remote
data server from a Guile script, how to extract derivations we are
interested in, and how to submit them for building to our instance of the
build coordinator.
Getting information from the data service
We need to install the two packages
guix-data-service and (for later use)
guix-build-coordinator on the TBFG machine, which contain
Guile libraries with the necessary functionality.
⚠ If installed into a user profile, both packages pull in the
guix package as a propagated input, which prevents the user
from updating it through guix pull.
It is thus recommended to run
guix shell guile-next guix-data-service guix-build-coordinator
instead. At the time of writing, the guile package in Guix
is at version 3.0.9, while the data service library requires
guile-next, which is at version 3.0.10.
Let us open a Guile REPL and execute the following code (to ease copy-pasting, I omit the prompt of the REPL; lines starting with a $ sign and a number correspond to results).
$ guile
(use-modules (guix-data-service client))
(define my-data-service "https://data.guix.gnu.org/")
(define json
(guix-data-service-request my-data-service
"repository/1/branch/master.json"))
json
$1 = (("revisions" . #((("data_available" . #f) ("commit-hash" . "cb47639a8081e8e2d651ad1612bbd1e482766469") …
The call to guix-data-service-request
is equivalent to opening the URL
https://data.guix.gnu.org/repository/1/branch/master.json,
which executes the same query as the URL
https://data.guix.gnu.org/repository/1/branch/master
without the .json at the end, but it returns the result in
JSON format.
Moreover, the function call transforms the JSON into a Guile data structure
through the
guile-json
library; in particular, JSON arrays become Guile
vectors
and JSON objects become Guile
association
lists, or alists for short (these are lists of key-value pairs,
so brace yourself for lots of parentheses in a row).
Thus parsing the result and extracting the information we are interested in
amounts to unwrapping these successive layers; in true Scheme/Lisp style
we will also usually transform the vectors into lists using the
vector->list function.
The JSON we asked for is an object with a single field
revisions, which contains an array of revisions, that is,
git commits on the master branch;
every revision is an object with the three fields
date, commit-hash (these are strings)
and data_available, a boolean indicating whether the data
service has computed the derivations for this commit or not
(which corresponds to the green or grey badges on the website).
This structure can be derived by looking at and playing with the variables
in the REPL, or probably more conveniently by opening the corresponding URL
in a web browser, which should show the JSON in a special mode.
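To experiment with this unwrapping without contacting the data service, one can work on a hand-written stand-in for the parsed JSON. The following is only a sketch: the variable names and the shortened commit hashes are made up for illustration, and the date field is left out.

```scheme
;; Hand-written stand-in for the Guile structure obtained from the JSON;
;; the commit hashes are invented placeholders, not real revisions.
(define sample-json
  '(("revisions"
     . #((("data_available" . #f) ("commit-hash" . "aaaa"))
         (("data_available" . #t) ("commit-hash" . "bbbb"))))))

;; Outer layer: an alist with the single key "revisions",
;; whose value is a vector that we turn into a list.
(define sample-revisions
  (vector->list (assoc-ref sample-json "revisions")))

;; Inner layer: every revision is again an alist.
(define first-revision (car sample-revisions))

(assoc-ref first-revision "commit-hash")      ; evaluates to "aaaa"
(assoc-ref first-revision "data_available")   ; evaluates to #f
```

The same two steps, assoc-ref on the outer alist followed by vector->list, reappear in all the query functions of this post.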
We can now write a small function (or maybe two even smaller functions)
that query the data service and return a list of revisions for which the
data service has computed the derivations:
(define (data-available? revision)
;; Given a REVISION, check whether it has been treated by the
;; data service.
(assoc-ref revision "data_available"))
(define (get-revisions data-service)
;; Query DATA-SERVICE for the list of revisions it has successfully
;; treated in the master branch.
(filter data-available?
(vector->list
(assoc-ref
(guix-data-service-request data-service
"repository/1/branch/master.json")
"revisions"))))
(define revisions (get-revisions my-data-service))
revisions
$2 = ((("data_available" . #t) ("commit-hash" . " …
In the following, we will work with revisions in this form, although mainly the commit hashes are of interest. We could print them as follows:
(define commits
(map (lambda (revision)
(assoc-ref revision "commit-hash"))
revisions))
commits
$3 = ("b966f4007c8492ad89eedf32dd91b3352dba594e" "8a1f56cf8710fc142a2f8ef2e52be82e8aa9f53e" …
(length commits)
$4 = 46
(define commit (car commits))
commit
$5 = "b966f4007c8492ad89eedf32dd91b3352dba594e"
By default the data service returns 100 revisions (including those for which no data is available), which will be more than enough for our purposes.
The next step is to obtain the derivations for a given revision, say the
newest one with data available. Again this is most easily
reverse-engineered from the web interface of the data service:
Click on the latest revision with a green badge, then on
View package derivations; this shows how the URL is to be formed.
Since we need all derivations, we also have to tick the All results
checkbox; on the other hand, we may limit ourselves to one architecture, say
x86_64-linux as System, and not consider
cross-compilation by choosing (no target) for Target.
These choices add GET parameters to the query, which can be passed
as an alist for the optional third parameter of
guix-data-service-request. Again adding .json
to the URL (in front of the ?) shows the structure of the
resulting JSON.
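To make the link between such an alist and the GET parameters visible, here is a small sketch that turns the alist into the query part of a URL. The helper alist->query-string and the variable my-system are made up for this illustration; how guix-data-service-request encodes its parameters internally is not examined here, so this should be read as an assumption about the general idea and not as a description of the library.

```scheme
(use-modules (web uri))

(define (alist->query-string parameters)
  ;; Hypothetical helper: turn an alist with symbols as keys and strings
  ;; as values into "key1=value1&key2=value2".
  (string-join
   (map (lambda (parameter)
          (string-append (uri-encode (symbol->string (car parameter)))
                         "="
                         (uri-encode (cdr parameter))))
        parameters)
   "&"))

(define my-system "x86_64-linux")

;; The quasiquote builds the list literally, except for the unquoted
;; MY-SYSTEM, which is replaced by its value.
(alist->query-string
 `((system . ,my-system) (target . "none") (all_results . "on")))
;; evaluates to "system=x86_64-linux&target=none&all_results=on"
```

Appending this string after a ? to the URL of the derivations page gives the kind of URL that the web interface shows after ticking the corresponding boxes.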
It is then easy to end up with the following function; notice the use
of the
quasiquote
` and the
unquote
,:
(define (get-derivations data-service commit system)
;; Query DATA-SERVICE for the list of derivations for the given COMMIT
;; and SYSTEM.
(map
(lambda (p)
(assoc-ref p "derivation"))
(vector->list
(assoc-ref
(guix-data-service-request data-service
(string-append "revision/" commit "/package-derivations.json")
`((system . ,system) (target . "none") (all_results . "on")))
"derivations"))))
(define derivations
(get-derivations my-data-service commit "x86_64-linux"))
(length derivations)
$6 = 29531
(car derivations)
$7 = "/gnu/store/000lxmn2d17bv2v6znvf6z5vi7ndy8q4-r-janeaustenr-1.0.0.drv"
So the derivations are simply strings pointing to files in the store (of the data service; so far they are not available on the TBFG machine).
Filtering out team packages
29000 derivations are more than our poor tiny machine can handle; the next
step is to keep only those that correspond to packages of the science team.
The team is responsible for certain package modules (or equivalently, for
.scm files in the gnu/packages/ directory);
which ones can be seen in the file CODEOWNERS checked into the
Guix git repository, itself derived from etc/teams.scm.
As it does not change very often, for simplicity we may determine the list
of modules by hand, which may require us to resolve regular expressions
(here: fortran(-.+|)) into lists of actually present modules;
here we end up with the following:
(define my-locations
'("algebra" "astronomy" "chemistry" "fortran-check" "fortran-xyz"
"geo" "graph" "lean" "maths" "medical" "sagemath" "statistics"))
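Resolving such a regular expression need not be done entirely by hand. As a sketch, and assuming we already have a list of module names at our disposal (the list candidate-modules below is invented for the example, as is the helper modules-matching), the matching modules could be selected as follows. The pattern is written as fortran(-.+)?, which describes the same strings as fortran(-.+|) while avoiding an empty alternative.

```scheme
(use-modules (ice-9 regex))

;; Invented list of module names, for illustration only.
(define candidate-modules
  '("algebra" "fortran-check" "fortran-xyz" "fortranish" "games"))

(define (modules-matching pattern modules)
  ;; Hypothetical helper: return the elements of MODULES that match
  ;; PATTERN as a whole (hence the anchors ^ and $).
  (filter (lambda (name)
            (string-match (string-append "^(" pattern ")$") name))
          modules))

(modules-matching "fortran(-.+)?" candidate-modules)
;; evaluates to ("fortran-check" "fortran-xyz")
```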
When starting the project, I had hoped to extract the interesting packages directly from the (strings representing) derivations, given a fixed list of package names. But it is a truth universally acknowledged that a programmer never has the singularly good fortune of such simplicity, whatever their feelings or views when first entering the neighbourhood of a problem. Here two reasons speak against it: First of all, the packages of a team may change over time as packages are added, removed or moved to a different module. More immediately, though, only the combination of package name and version can be easily recovered from the derivation by removing a fixed prefix, the hash and a fixed suffix, using the following function:
(define (derivation->name+version derivation)
;; Given a DERIVATION (by a string of the form "/gnu/store/..."),
;; return the part of it that encodes the name and the version
;; of the underlying package.
(string-drop (basename derivation ".drv") 33))
Thus
/gnu/store/000lxmn2d17bv2v6znvf6z5vi7ndy8q4-r-janeaustenr-1.0.0.drv
becomes
r-janeaustenr-1.0.0, which is the concatenation of the package
name (which is mostly fixed over different revisions) and the package
version (which usually increases over time) with a hyphen in-between.
More often than not it is possible to guess the two components: Here they
are r-janeaustenr and 1.0.0.
Package names often contain hyphens (like here, they serve to separate
a language part, r, and the upstream name,
janeaustenr, see the Guix
naming
conventions); this could be handled by splitting at the last hyphen,
but versions may also contain hyphens. Both can contain alphabetic and
numeric components. Thus it would be quite possible that the above
derivation is for the flourishingly named version
janeaustenr-1.0.0 of the r package.
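The limits of the naive approach can be illustrated by a small sketch. The helper split-at-last-hyphen is hypothetical and only serves to show the guesswork involved; the second input string is an invented example of a version containing a hyphen.

```scheme
(define (split-at-last-hyphen name+version)
  ;; Hypothetical helper: guess name and version by splitting
  ;; NAME+VERSION at its last hyphen; return them as a two element list.
  (let ((index (string-rindex name+version #\-)))
    (if index
        (list (substring name+version 0 index)
              (substring name+version (+ index 1)))
        (list name+version ""))))

;; The guess is right when the version contains no hyphen...
(split-at-last-hyphen "r-janeaustenr-1.0.0")
;; evaluates to ("r-janeaustenr" "1.0.0")

;; ...but goes wrong for an invented version "1.0-rc1" of a package "foo".
(split-at-last-hyphen "foo-1.0-rc1")
;; evaluates to ("foo-1.0" "rc1")
```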
So we need more code to extract the desired information. Luckily the data service knows about the packages in a revision, with their names and their versions in different fields; and also about their locations, that is, the files in which they are defined.
(define (get-packages data-service commit)
;; Query DATA-SERVICE for the list of packages for the given COMMIT.
(vector->list
(assoc-ref
(guix-data-service-request data-service
(string-append "revision/" commit "/packages.json")
`((field . "version") (field . "location") (all_results . "on")))
"packages")))
(define packages (get-packages my-data-service commit))
(car packages)
$8 = (("location" ("column" . 2) ("line" . 8273) ("file" . "gnu/packages/games.scm")) ("version" . "0.27.1") ("name" . "0ad"))
It is now enough to compare the file name with our list of locations to extract the packages we are interested in.
(define (location-package? package locations)
;; Check whether the PACKAGE comes from the list of LOCATIONS.
(let* ((file (assoc-ref (assoc-ref package "location") "file"))
(module (basename file ".scm")))
(member module locations)))
(use-modules (srfi srfi-26))
(define (packages-name-version data-service commit locations)
;; Query DATA-SERVICE for a list of packages for the given COMMIT
;; that come from the list of LOCATIONS. Return a list of two-element
;; lists with the names and versions of these packages.
(map
(lambda (package)
(list (assoc-ref package "name") (assoc-ref package "version")))
(filter
(cut location-package? <> locations)
(get-packages data-service commit))))
(define team-name-versions
(packages-name-version my-data-service commit my-locations))
(car team-name-versions)
$9 = ("4ti2" "1.6.12")
Finally we just need to compare the extracted team package names
and their versions with the derivations. Unfortunately this can be
quite costly; the following code presents a somewhat
optimised solution with memory usage linear in the result, but a quadratic
number of comparisons (thanks to Liliana Prikler for suggesting the
use of filter-map to me):
(use-modules (srfi srfi-1))
(define (special-cartesian-product X Y)
;; Let X and Y be lists of two element lists of the form (x z) and (y z),
;; respectively. For every (x z) in X such that some (y z) with the same z
;; occurs in Y, take the first such y; return the list of all these (x y).
(fold cons '()
(filter-map (lambda (xz)
(let ((yz (find (lambda (yz)
(equal? (cadr xz) (cadr yz)))
Y)))
(if yz
(list (car xz) (car yz))
#f)))
X)))
(define (team-derivations data-service commit system locations)
;; Query DATA-SERVICE for the list of derivations for the given COMMIT
;; and SYSTEM, filtered by the LOCATIONS of the packages.
;; To memorise the computed information, return a list of two element
;; lists, each containing a derivation and the corresponding name.
(let* ((derivations (get-derivations data-service commit system))
(X (map
(lambda (d)
(list d (derivation->name+version d)))
derivations))
(name-versions
(packages-name-version data-service commit locations))
(Y (map
(lambda (nv)
(list (car nv) (string-append (car nv) "-" (cadr nv))))
name-versions)))
(special-cartesian-product X Y)))
(define (sort-derivation-names derivation-names)
;; Just for the fun of it, sort DERIVATION-NAMES, a list of two element
;; lists containing derivations and their names, by names.
(sort derivation-names
(lambda (x y)
(string<? (cadr x) (cadr y)))))
(define good-derivation-names
(sort-derivation-names
(team-derivations my-data-service commit "x86_64-linux" my-locations)))
(define derivation-name
(find (lambda (dn)
(equal? (cadr dn) "lrslib"))
good-derivation-names))
derivation-name
$10 = ("/gnu/store/3pxq1g2java4f8nwfq7n98qjvhkr1b34-lrslib-7.2.drv" "lrslib")
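Should the quadratic number of comparisons ever become a bottleneck, one possible alternative is to index the second list in a hash table first, so that every element of the first list needs only one lookup. The following is merely a sketch of this idea and not the code used in the TBFG; the name join-by-second and the sample data (including the shortened store hashes) are made up for illustration.

```scheme
(use-modules (srfi srfi-1))

(define (join-by-second X Y)
  ;; Hypothetical alternative to a nested search: X and Y are lists of
  ;; two element lists (x z) and (y z); return the list of (x y) such
  ;; that z occurs in both, taking the first matching y for each z.
  (let ((table (make-hash-table)))
    ;; Index Y by z; keep the first y seen for each z.
    (for-each (lambda (yz)
                (unless (hash-ref table (cadr yz))
                  (hash-set! table (cadr yz) (car yz))))
              Y)
    ;; One hash table lookup per element of X.
    (filter-map (lambda (xz)
                  (let ((y (hash-ref table (cadr xz))))
                    (and y (list (car xz) y))))
                X)))

(join-by-second
 '(("/gnu/store/aaaa-lrslib-7.2.drv" "lrslib-7.2")
   ("/gnu/store/bbbb-0ad-0.27.1.drv" "0ad-0.27.1"))
 '(("lrslib" "lrslib-7.2")))
;; evaluates to (("/gnu/store/aaaa-lrslib-7.2.drv" "lrslib"))
```

Unlike a version ending in fold cons, this sketch keeps the elements in the order of X; since the result is sorted by name afterwards, the difference does not matter here.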
Strictly speaking, the function
team-derivations is not correct; if there were
simultaneously a derivation for the package
r-janeaustenr at version 1.0.0
and a derivation for the package
r at version janeaustenr-1.0.0,
then either both or neither of them would match, while it is possible that
only one of the packages is covered by the science team, a situation
not yet encountered; at worst, we would capture one too many derivations.
For testing purposes during the
development of the TBFG, we additionally check whether the name equals
lrslib; in this way only one derivation is returned (while
at the time of writing there are more than 700 packages covered by the
science team).
Moreover the package in question is a self-contained C program (without
any inputs), which compiles rather quickly.
Submitting builds
Now that we have a list of derivations, we would like to submit them from
our Guile script to the build coordinator. This is not very different from
the approach seen
last time
for submitting from the command line.
Again it is recommended to open a browser window on the
/activity page of the BFFE to see the build coordinator and
the agent in action.
(use-modules (guix-build-coordinator client-communication))
(define my-build-coordinator "http://localhost:8746")
(define ignore-if-build-for-derivation-exists? #f)
(define ignore-if-build-for-outputs-exists? #f)
(define ensure-all-related-derivation-outputs-have-builds? #f)
(define priority 0)
(define (submit-build build-coordinator data-service derivation tags)
;; Given a DERIVATION (as a string), submit it to BUILD-COORDINATOR
;; together with TAGS;
;; DATA-SERVICE is passed through and used by the build coordinator to
;; obtain the derivation file and further references contained in
;; DERIVATION.
(send-submit-build-request
build-coordinator derivation (list data-service) 0 priority
ignore-if-build-for-derivation-exists?
ignore-if-build-for-outputs-exists?
ensure-all-related-derivation-outputs-have-builds?
tags))
(submit-build my-build-coordinator my-data-service (car derivation-name) '())
$11 = (("build-submitted" . "8f8f1cad-fe9c-462c-bc59-3d1f87abf942"))
$12 = #<<response> …
The global variables, which we pass on to the submit-build
function, determine the behaviour of the build coordinator.
If ignore-if-build-for-derivation-exists? is true,
then the build will not be carried out a second time if it was already tried
(successfully or not) by the build coordinator before.
In production, it will thus be preferable to set it to #t;
while still experimenting, we are likely to submit the same derivation
several times. Setting the value to #f would also make sense
to check that rebuilding the same package works.
The variable ignore-if-build-for-outputs-exists? goes a bit
further; if set to #t, then the build will not be carried out
if a different derivation with the same output was already tried (a very
technical distinction; I would recommend leaving it at #f).
If ensure-all-related-derivation-outputs-have-builds? is
#t,
then the build coordinator will recursively submit builds for all the
derivations required as inputs to a given derivation. While this sounds
reasonable at first, it can go very far, since the coordinator does not
look at the store, but at the builds it has handled itself and recorded
in its database. This means that the first build submission, when the
database is still empty, will entail a complete bootstrap of the Guix
distribution. So I would also recommend leaving it at #f.
Then the build works as follows: The coordinator sends the derivation
to an agent. The agent tries to download all required inputs from a
substitute server and if successful, will build only the derivation it is
asked to build. Otherwise, it reports back to the coordinator that it has
encountered a set-up failure, together with a list of missing inputs.
This triggers a hook in the coordinator, and the default hook is to add
the missing inputs to the list of outstanding builds, as well as the
failed build itself to try it again once the inputs are available.
In this way, even if
ensure-all-related-derivation-outputs-have-builds? is
#f, all really missing inputs will be built recursively,
until the build succeeds or a real failure in one of its inputs is
encountered.
The submission immediately returns two values, without waiting for the package build to finish. The first return value can be used to link the submitted derivation to the shown UUID of the build, which is a key in the build coordinator database. The second return value is the HTTP response, which we will ignore from now on.
Tags can be added in a parenthesis-rich format; the parameter is a list
of tags, where each tag is a two element list (not a pair!), in which both
elements are pairs. The first one pairs the keyword key
to a value, the second one pairs the keyword value to a
value (the values are used to construct the URL and can be strings or
numbers). So the following would work:
(define tags `(((key . "commit")(value . ,commit))
((key . "name")(value . ,(cadr derivation-name)))
((key . "build")(value . 2))))
(submit-build my-build-coordinator my-data-service (car derivation-name) tags)
$13 = (("build-submitted" . "82a56cac-1e93-4b4a-926f-d8762f919219"))
$14 = #<<response> …
The tags are shown in the activity window and are also recorded in the build coordinator database; as shown here, they can encode arbitrary additional information of a build, such as the commit it comes from, the package name or the submission count for a given derivation.
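To spare ourselves some of the parentheses, the tag list could be generated from a plain alist. This is only a convenience sketch; the helper alist->tags is made up and not part of the build coordinator library.

```scheme
(define (alist->tags alist)
  ;; Hypothetical helper: turn an alist such as (("name" . "lrslib"))
  ;; into a list of tags in the format described above, where each tag
  ;; is a two element list of the pairs (key . KEY) and (value . VALUE).
  (map (lambda (entry)
         `((key . ,(car entry)) (value . ,(cdr entry))))
       alist))

(alist->tags '(("name" . "lrslib") ("build" . 2)))
;; evaluates to (((key . "name") (value . "lrslib"))
;;               ((key . "build") (value . 2)))
```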
Code
For ease of use, the code developed in this post is made available, under GPLv3 or later, in a dedicated git repository on Codeberg. More precisely, it is collected in the file tbfg.scm at commit 51eb5c6d45c66d15b7c14340ec3af0732b5b66fd.
Outlook
We have queried the data service and used the resulting information on
packages and derivations to submit build jobs to the build coordinator.
But so far we have no programmatic access to the build results; we only
saw the builds flicker by on the BFFE website.
It would be nice to record success or failure, and more generally to keep
track of the builds; this will be our next step.
Since we do not want to operate a substitute server, but rather follow the
state of the packages under the responsibility of the science team, unlike
the official build farms we are not necessarily interested in obtaining the
build results. These are sent from the build agents to the build coordinator;
on the bordeaux build farm the
nar herder
shovels them to a separate substitute server.
For us everything is on the same machine, which will thus contain
successfully built packages in its store (at least until the next
guix gc run). If desired, these could be made available using
guix
publish.