Description
Describe the bug
Builtin fetchers have to be reproducible. Any changes to their output either break the build because of a hash mismatch or cause the sources to change, if no hash is provided. Implementation errors in the builtin fetchers invalidate Nix as a tool for reproducibility.
While a git commit hash appears solid at first, its translation to a store path is not unique and prone to impurities.
This shows in #5260 for example, where someone relied on Nix's previous behavior of applying git's smudge filters. This was arguably an implementation error that was recently corrected, but it is a very breaking change nonetheless.
For a more detailed description of the problem with git smudge filters, I kindly refer to the motivation section of #4635, a PR that attempted to fix the problem, at least in most cases.
Activity
roberth commented on Oct 1, 2021
As for the solution, perhaps it'd be best to make `builtins.fetchGit` behave exactly like Nix 2.3 and implement the reproducibility fixes only in `builtins.fetchTree`.

greedy commented on Oct 2, 2021
It’s not just smudge filters either. There are at least the `core.eol` and `core.symlinks` config options that affect working-copy contents. There’s also the possibility of a global gitattributes file that can cause files to be iconv’ed at checkout and substitute RCS-style ID tokens.

For true reproducibility it might be necessary to forgo the use of any “porcelain” and use the “plumbing” directly (either as commands or using libgit2).
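To illustrate the porcelain/plumbing distinction, here is a minimal sketch (hypothetical filter and file names) showing that a smudge filter changes what a porcelain checkout produces, while reading the blob via plumbing returns the stored object untouched:

```shell
set -eu
tmp=$(mktemp -d)
cd "$tmp"
git init -q repo && cd repo
git config user.email demo@example.com
git config user.name demo
# a smudge filter rewrites file contents on the way to the working copy
git config filter.demo.smudge 'sed s/PLAIN/SMUDGED/'
git config filter.demo.clean cat
printf 'file.txt filter=demo\n' > .gitattributes
printf 'PLAIN\n' > file.txt
git add . && git commit -qm init
rm file.txt
git checkout -- file.txt            # porcelain checkout: filter applied
cat file.txt                        # prints: SMUDGED
git cat-file blob HEAD:file.txt     # plumbing: raw object, prints: PLAIN
```

The stored blob never changed; only the materialization into the working copy did, which is exactly why checkout-time configuration is a reproducibility hazard.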
lilyball commented on Nov 26, 2021
If ditching the "porcelain" tools, please make sure that whatever the replacement is still has some solution for authentication for fetching from private repositories. Authentication does not affect reproducibility (just reliability) so there's no issue with reading authentication information from the environment or filesystem.
thufschmitt commented on Oct 26, 2023
The GitHub tarball API suffers from a similar problem: it runs `git archive` behind the scenes, which honours the `export-subst` attribute (which can perform some nondeterministic substitutions and gives a different tree than `git clone`). This isn't the case, however, if the API is used with a tree hash rather than a commit hash. GitLab has the same behaviour (and probably the other Git forges too, since that's inherited from `git archive`).

I looked at this with @tomberek, and it seems that the most reliable solution would be to use `git archive {treeHash}` for `fetchTree` (git).

An alternative for GitHub would be to just make it sugar for Git, but add some clever blob filtering to get some reasonable performance. I've no idea how possible that would be.
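A minimal sketch of the `export-subst` behaviour described above (hypothetical file names): archiving by commit expands the `$Format:...$` placeholder, while archiving by tree hash leaves it untouched, since no commit ID is available for expansion:

```shell
set -eu
tmp=$(mktemp -d)
cd "$tmp"
git init -q repo && cd repo
git config user.email demo@example.com
git config user.name demo
# mark VERSION for export-subst, so `git archive` expands $Format:...$
printf 'VERSION export-subst\n' > .gitattributes
printf '$Format:%%H$\n' > VERSION
git add . && git commit -qm init
# archiving a *commit*: the placeholder expands to the commit hash
git archive HEAD | tar -xOf - VERSION
# archiving a *tree*: no commit ID available, placeholder stays verbatim
git archive "$(git rev-parse 'HEAD^{tree}')" | tar -xOf - VERSION
```

This is why a tree-hash-based archive is deterministic where a commit-hash-based one is not.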
roberth commented on Oct 26, 2023
This only applies during locking and impure use. Most of the fetching will happen on locked `fetchTree` calls that already have a copy of the tree hash. Only manually pinned `fetchTree` calls will be affected, and fallback is possible, as mentioned.

Note though that by making fetching tree-based, we solve an opposite problem, which is that archive-based locking cannot fall back to git operations, because in the status quo we'd have to invoke or emulate `git archive`, which is not desirable.

Another possible way around the rate limits, which doesn't involve cloning, is perhaps to use `libgit2` to fetch only the relevant packfile and get the tree hash from there. I believe that's feasible in two requests to the git (non-"api") endpoint. Not sure if such functions are exposed, though.

Finally, I consider the false equivalence between the commit tarball and normal git fetching to be a serious bug.
thufschmitt commented on Oct 27, 2023
If we store the tree hash as part of the lock file, then yes. But it's problematic because it means that the commit hash isn't the ultimate source of truth any more (the tree hash is). So in a Flake context I could have `inputs.foo.url = "github:foo/bar?rev=abcde123"`, but a forged lockfile that maps that to a totally unrelated tree hash.

But indeed, we could fall back to plain Git if we get rate-limited.

@tomberek has been trying that (directly `curl`-ing the Git server, not through `libgit2`) with not much success. But it's probably theoretically possible indeed.

roberth commented on Oct 27, 2023
Can be done with the git CLI.

Tested with `https://github.com/NixOS/nixpkgs.git`. `master` takes 3 seconds to fetch the hashes for the first time. Also takes 3 MB of storage, for all historic commits of `master`. 1–3 seconds for syncing up when needed (i.e. commit hash not found locally). Network use also about 3 MB.

Crucially, we only incur the cost when using a truly new commit. E.g. if you have a flake with 12 versions of Nixpkgs, you only fetch the commits once.
Details
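The collapsed details aren't captured in this extract, but the described approach can be sketched as follows, with a local toy repository standing in for the real remote (partial clones over `file://` need `uploadpack.allowFilter` enabled on the serving side; hosted forges like GitHub already allow it):

```shell
set -eu
tmp=$(mktemp -d)
cd "$tmp"
# toy stand-in for the real remote (e.g. github.com/NixOS/nixpkgs)
git init -q remote
git -C remote config user.email demo@example.com
git -C remote config user.name demo
echo hello > remote/file.txt
git -C remote add file.txt
git -C remote commit -qm init
git -C remote config uploadpack.allowFilter true   # permit --filter fetches
rev=$(git -C remote rev-parse HEAD)
# blobless bare clone: commit and tree objects only, no file contents
git clone -q --bare --filter=blob:none "file://$tmp/remote" mirror.git
# each commit object records its tree hash, so this is a local lookup
git -C mirror.git rev-parse "$rev^{tree}"
```

The small on-disk and network footprint reported above follows from never transferring blobs: commit (and tree) objects for all of `master` compress to a few MB.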
That's already a problem with `rev` and `narHash`. Don't trust lockfile updates from untrusted sources.

Nonetheless, we could always check that tree and rev match. It's cheap.
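Such a rev/tree consistency check could look like this sketch (hypothetical setup; the point is that peeling `rev^{tree}` only reads the already-fetched commit object, so it is cheap and local):

```shell
set -eu
tmp=$(mktemp -d)
cd "$tmp"
git init -q repo && cd repo
git config user.email demo@example.com
git config user.name demo
echo data > f && git add f && git commit -qm init
# values a hypothetical lockfile would carry
rev=$(git rev-parse HEAD)
tree=$(git rev-parse 'HEAD^{tree}')
# the check: peel the locked commit to its tree and compare against the
# locked tree hash; a forged tree hash would fail this comparison
if [ "$(git rev-parse "$rev^{tree}")" = "$tree" ]; then
  echo "rev and tree match"
fi
```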
thufschmitt commented on Oct 27, 2023
Oh, that is great! I didn't expect there would be such an easy (and cheap-ish) solution.
Ericson2314 commented on Oct 29, 2023
Right, that's why I very much want to get my git hashing changes in at the same time we do this redesign: being able to […] is very useful, especially as we approach a world where there are quite a lot of different ways to fetch git things.

Ultimately, in a world with signed commits being the norm, we should be writing down commit hashes and public keys in the input spec, and then everything is verified via Merkle inclusion proofs from there.
roberth commented on Nov 4, 2023
Probably should be in `git_fetch_options` in `libgit2`, but I don't see how to do it.

It doesn't support sparse checkouts, so it wouldn't surprise me if clone filtering (fetch filtering?) isn't implemented yet either. Partial cloning is also still an open feature issue. (And the code suggests that the filter options are handled by the partial cloning feature.)

Neither `remote.h`, nor `fetch.h`, nor `clone.h` mentions "blob" either. I assume that it's just not implemented yet. We might want to use the CLI for this procedure until it is implemented in libgit2.
fetchGit #9327

roberth commented on Mar 26, 2024
New commands with `--filter=tree:0` instead of `--filter=blob:none`.

We can improve on #5313 (comment). Re-running it today gives `0m1.299s`, but `tree:0` is even faster.

Getting or checking an arbitrary revision is slower, but this is only needed when we don't have a lock file to cache the tree hash. For comparison, the GitHub API responds in 0.12 to 0.4 seconds for an arbitrary commit.
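A sketch of the `tree:0` variant with a local stand-in remote (hypothetical names; the filter recorded at clone time is reused by subsequent fetches, so syncing a new upstream commit transfers commit objects only):

```shell
set -eu
tmp=$(mktemp -d)
cd "$tmp"
git init -q remote
git -C remote config user.email demo@example.com
git -C remote config user.name demo
echo one > remote/f
git -C remote add f
git -C remote commit -qm c1
git -C remote config uploadpack.allowFilter true
# treeless mirror: only commit objects are transferred
git clone -q --bare --filter=tree:0 "file://$tmp/remote" mirror.git
# a new commit appears upstream...
echo two > remote/f
git -C remote commit -qam c2
rev=$(git -C remote rev-parse HEAD)
# ...sync it (the tree:0 filter is remembered for this remote), then
# read the new commit's tree hash locally from the commit object
git -C mirror.git fetch -q origin 'refs/heads/*:refs/heads/*'
git -C mirror.git rev-parse "$rev^{tree}"
```

Since a commit object embeds its tree hash, the treeless mirror answers the commit-to-tree question without ever downloading trees or blobs, which is why it beats `blob:none` for this lookup.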
thufschmitt commented on Mar 26, 2024
@roberth that's sweet :)
Does that mean that for fetching `github:nixos/nixpkgs/{commitHash}` we can use the `/archive` GitHub endpoint to fetch the archive corresponding to it?
roberth commented on Mar 26, 2024
@thufschmitt That's the plan. We could elaborate that a bit: use the `/archive` GitHub endpoint to fetch the archive corresponding to the tree hash.

With @DavHau we discussed two implementation strategies: in `GitArchiveInputScheme`, or in `GitInputScheme`. The latter appears more elegant, as we can re-frame the new fetching strategy as an alternate "git transport", somewhat similar to how git itself can deal with multiple protocols.

Even submodule support seems within reach that way, although for that we do need the slightly slower `blob:none`, which fetches a "wintery" tree that does contain references to submodules. But well, that's extra anyway, because `github:` doesn't support submodules yet. (Also probably a similar story for subtree fetching, like the `nixpkgs/lib` flake.)

nixos-discourse commented on Jul 15, 2025
This issue has been mentioned on NixOS Discourse. There might be relevant details there:
https://discourse.nixos.org/t/fetchfromgithub-and-the-versioneer-fixing-source-reproducibility/66539/2
`github:` fetcher to fetch subtrees using the trees API #14715