source distribution is broken

10 Aug 2014

motivation

source and binary distribution in linux distributions is outdated and nobody is doing anything [tm].

problem

today i was packaging prosody [0] for nixos and wondered why lua-socket 2.0.2 wasn’t working as expected. the solution was pretty easy: 2.0.2 is simply outdated, and other distributions such as ubuntu [2] and fedora [3] use [luasocket_3.0~rc1.orig.tar.gz] or lua-socket-3.0-0.6rc1.fc21.x86_64.rpm respectively. also, neither repo states where they got the source code from, which is not only annoying but also dangerous, since i now don’t know what modifications either distribution included!

the issue: [1] does not have a 3.x release, nor can i find the RCS used for development.

diego’s homepage [1] states:

Last modified by Diego Nehab on

Wed Oct 3 02:07:59 BRT 2007

now, what can we do?

seems like an abandoned project, as so often…

or a different example:

nixos-rebuild wanting to pull from github.com

pkgs/development/libraries/minmay/default.nix states:

meta = {
  homepage = "https://github.com/mazhe/minmay";
  license = stdenv.lib.licenses.lgpl21Plus;
  description = "An XMPP library (forked from the iksemel project)";
};

mazhe removed minmay, probably not knowing that the nixos distribution was still using the minmay 1.0.0 release from that repo, see https://github.com/mazhe/minmay/. or maybe mazhe didn’t find any other release of minmay, forked it on github.com and made a release himself.

in case [1] is abandoned it would be a good idea to:

in fact, we as a community don’t have any good, reliable way to access virtually any project over a long period of time.

software distribution today

first, let’s have a look at how software distribution is done in different linux distributions:

source and binary deployment

as shown in the picture above:

and most horrible yet:

proposal

the way we ship source/binary data must be redone.

shipping code/binaries

i thought about combining GIT with torrent technology, so that every time a new release happens, we just add it to a giant GIT database. i also propose a meta layer between upstream (the developers) and downstream (the distributions or end users). i’d like to call this midstream; ubuntu’s launchpad, github.com and similar platforms are pretty close to what midstream would do.

favoured technologies:

wanted features:

i want to point out that there is already apt-p2p [6], which AFAIK is used for binary distribution of deb files and not for source distribution. i could be wrong though.

hashing the downloads

when packaging software for nixos, we do ‘source deployment’ on the developer’s machine. when the nix expression gets into nixpkgs, hydra will do ‘source deployment’ again, and in most cases hydra will produce a binary substitute (a NAR file), which is like a DEB file on ubuntu.

how the hash is used

when doing ‘source deployment’, most often we would be using fetchurl, like below:

src = fetchurl {
  url = "http://downloads.sourceforge.net/project/pdfgrep/${version}/${name}.tar.gz";
  sha256 = "6e8bcaf8b219e1ad733c97257a97286a94124694958c27506b2ea7fc8e532437";
};

the sha256 checksum is provided by the packager, and every time someone needs to do ‘source deployment’ again, this sha256sum is used to verify the download fetched from the given url. ebuilds used by portage on gentoo also implement checksums of downloads this way.

the problem

github.com, sourceforge.net and similar sites often create/recreate containers like .tar.gz or .zip on demand (cause unknown), and therefore the same container changes its hash over time, although it still contains the same files. github.com features a new release system [4] which addresses this problem.

what we really need:

possible solution

since we have to checksum the input for sanity/reproducibility, we can’t just rely on an md5sum/sha256sum of the (compressed?) container. maybe we could ‘deserialize’ the (compressed?) archives into /nix/store, then build a NAR file of it and use the NAR’s hash to compare against the sha256. doing so would remove all superfluous attributes (timestamps, ownership, file order in the container, compression artifacts, compression mechanism, container format issues).

advantages/disadvantages:

note: a different interesting approach would be to maintain a list of files per container with a respective checksum per file. the overall checksum would be derived from all the single checksums.
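that per-file approach might look like this (again a toy sketch of my own; combining the sorted per-file checksums with a second hash, rather than literally summing them, keeps the result order-independent):

```python
import hashlib

def per_file_hashes(files):
    # files: {name: content bytes} -> one checksum per file
    return {name: hashlib.sha256(data).hexdigest()
            for name, data in files.items()}

def overall(hashes):
    # derive a single overall checksum from the sorted per-file list
    h = hashlib.sha256()
    for name in sorted(hashes):
        h.update(name.encode())
        h.update(hashes[name].encode())
    return h.hexdigest()
```

a nice side effect: if only one file changes between two releases, the per-file list shows exactly which one, which also helps with auditing what a distribution modified.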

summary

we do need distribution agnostic source code storage systems which scale well and have a high reliability. we also need to alter the way we build the checksums of source code containers.

update (18.8.2014): i've been talking to aszlig and he mentions that we have <tarballs.nixos.org>, which i didn't know of. so if hydra was able to build the nix expression, it would cache the tarball there. but i still don't think these tarballs can be accessed by a normal user doing source deployment on his development machine, so in case a tarball changes its hash again, we still have to alter the checksum in the nix expression.

i would love to see more direct GIT usage in nixpkgs, since i think that doing explicit releases via containers is a waste of human time. this should be automated, and container releases should only be used to make source or binary deployment more efficient.

links