Description
Describe the bug
Builtin fetchers have to be reproducible. Any changes to their output either break the build because of a hash mismatch or cause the sources to change, if no hash is provided. Implementation errors in the builtin fetchers invalidate Nix as a tool for reproducibility.
While a git commit hash appears solid at first, its translation to a store path is not unique and prone to impurities.
This shows in #5260 for example, where someone relied on Nix's previous behavior of applying git's smudge filters. This was arguably an implementation error that was recently corrected, but it is a very breaking change nonetheless.
For a more detailed description of the problem with git smudge filters, I kindly refer to the motivation section of #4635, a PR that attempted to fix the problem, at least in most cases.
Activity
roberth commented on Oct 1, 2021
As for the solution, perhaps it'd be best to make `builtins.fetchGit` behave exactly like Nix 2.3 and implement the reproducibility fixes only in `builtins.fetchTree`.

greedy commented on Oct 2, 2021
It’s not just smudge filters either. There are at least the `core.eol` and `core.symlinks` config options that affect working-copy contents. There’s also the possibility of a global gitattributes file that can cause files to be iconv’ed at checkout and substitute RCS-style ID tokens.

For true reproducibility it might be necessary to forgo the use of any “porcelain” and use the “plumbing” directly (either as commands or using libgit2).
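To illustrate the porcelain/plumbing distinction, here is a minimal sketch (hypothetical filter and file names) showing that a smudge filter changes what a porcelain checkout produces, while reading the blob via plumbing returns the stored object untouched:

```shell
set -eu
tmp=$(mktemp -d)
cd "$tmp"
git init -q repo && cd repo
git config user.email demo@example.com
git config user.name demo
# a smudge filter rewrites file contents on the way to the working copy
git config filter.demo.smudge 'sed s/PLAIN/SMUDGED/'
git config filter.demo.clean cat
printf 'file.txt filter=demo\n' > .gitattributes
printf 'PLAIN\n' > file.txt
git add . && git commit -qm init
rm file.txt
git checkout -- file.txt            # porcelain checkout: filter applied
cat file.txt                        # prints: SMUDGED
git cat-file blob HEAD:file.txt     # plumbing: raw object, prints: PLAIN
```

The stored blob never changed; only the materialization into the working copy did, which is exactly why checkout-time configuration is a reproducibility hazard.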
lilyball commented on Nov 26, 2021
If ditching the "porcelain" tools, please make sure that whatever the replacement is still has some solution for authentication for fetching from private repositories. Authentication does not affect reproducibility (just reliability) so there's no issue with reading authentication information from the environment or filesystem.
thufschmitt commented on Oct 26, 2023
The GitHub tarball API suffers from a similar problem: it runs `git archive` behind the scenes, which honours the `export-subst` attribute (which can perform some nondeterministic substitutions and gives a different tree than `git clone`). This isn't the case, however, if the API is used with a tree hash rather than a commit hash. GitLab has the same behaviour (and probably the other Git forges too, since that's inherited from `git archive`).

I looked at this with @tomberek, and it seems that the most reliable solution would be to use `git archive {treeHash}` for `fetchTree` (git).

An alternative for GitHub would be to just make it sugar for Git, but add some clever blob filtering to get some reasonable performance. I've no idea how possible that would be.
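A minimal sketch of the `export-subst` behaviour described above (hypothetical file names): archiving by commit expands the `$Format:...$` placeholder, while archiving by tree hash leaves it untouched, since no commit ID is available for expansion:

```shell
set -eu
tmp=$(mktemp -d)
cd "$tmp"
git init -q repo && cd repo
git config user.email demo@example.com
git config user.name demo
# mark VERSION for export-subst, so `git archive` expands $Format:...$
printf 'VERSION export-subst\n' > .gitattributes
printf '$Format:%%H$\n' > VERSION
git add . && git commit -qm init
# archiving a *commit*: the placeholder expands to the commit hash
git archive HEAD | tar -xOf - VERSION
# archiving a *tree*: no commit ID available, placeholder stays verbatim
git archive "$(git rev-parse 'HEAD^{tree}')" | tar -xOf - VERSION
```

This is why a tree-hash-based archive is deterministic where a commit-hash-based one is not.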
roberth commented on Oct 26, 2023
This only applies during locking and impure use. Most of the fetching will happen on locked `fetchTree` calls that already have a copy of the tree hash. Only manually pinned `fetchTree` calls will be affected, and fallback is possible, as mentioned.

Note though that by making fetching tree-based, we solve an opposite problem, which is that archive-based locking cannot fall back to git operations, because in the status quo we'd have to invoke or emulate `git archive`, which is not desirable.

Another possible way around the rate limits, which doesn't involve cloning, is perhaps to use `libgit2` to fetch only the relevant packfile and get the tree hash from there. I believe that's feasible in two requests to the git (non-"api") endpoint. Not sure if such functions are exposed, though.

Finally, I consider the false equivalence between the commit tarball and normal git fetching to be a serious bug.
thufschmitt commented on Oct 27, 2023
If we store the tree hash as part of the lock file, then yes. But it's problematic because it means that the commit hash isn't the ultimate source of truth any more (the tree hash is). So in a Flake context I could have `inputs.foo.url = "github:foo/bar?rev=abcde123"`, but a forged lockfile that maps that to a totally unrelated tree hash.

But indeed, we could fall back to plain Git if we get rate-limited.

@tomberek has been trying that (directly `curl`-ing the Git server, not through `libgit2`) with not much success. But it's probably theoretically possible indeed.

roberth commented on Oct 27, 2023
Can be done with the git CLI.

Tested with `https://github.com/NixOS/nixpkgs.git`. `master` takes 3 seconds to fetch the hashes for the first time. Also takes 3 MB of storage, for all historic commits of `master`. 1–3 seconds for syncing up when needed (i.e. commit hash not found locally). Network use also about 3 MB.

Crucially, we only incur the cost when using a truly new commit. E.g. if you have a flake with 12 versions of Nixpkgs, you only fetch the commits once.
Details
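The collapsed details aren't captured in this extract, but the described approach can be sketched as follows, with a local toy repository standing in for the real remote (partial clones over `file://` need `uploadpack.allowFilter` enabled on the serving side; hosted forges like GitHub already allow it):

```shell
set -eu
tmp=$(mktemp -d)
cd "$tmp"
# toy stand-in for the real remote (e.g. github.com/NixOS/nixpkgs)
git init -q remote
git -C remote config user.email demo@example.com
git -C remote config user.name demo
echo hello > remote/file.txt
git -C remote add file.txt
git -C remote commit -qm init
git -C remote config uploadpack.allowFilter true   # permit --filter fetches
rev=$(git -C remote rev-parse HEAD)
# blobless bare clone: commit and tree objects only, no file contents
git clone -q --bare --filter=blob:none "file://$tmp/remote" mirror.git
# each commit object records its tree hash, so this is a local lookup
git -C mirror.git rev-parse "$rev^{tree}"
```

The small on-disk and network footprint reported above follows from never transferring blobs: commit (and tree) objects for all of `master` compress to a few MB.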
That's already a problem with `rev` and `narHash`. Don't trust lockfile updates from untrusted sources.

Nonetheless, we could always check that tree and rev match. It's cheap.
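Such a rev/tree consistency check could look like this sketch (hypothetical setup; the point is that peeling `rev^{tree}` only reads the already-fetched commit object, so it is cheap and local):

```shell
set -eu
tmp=$(mktemp -d)
cd "$tmp"
git init -q repo && cd repo
git config user.email demo@example.com
git config user.name demo
echo data > f && git add f && git commit -qm init
# values a hypothetical lockfile would carry
rev=$(git rev-parse HEAD)
tree=$(git rev-parse 'HEAD^{tree}')
# the check: peel the locked commit to its tree and compare against the
# locked tree hash; a forged tree hash would fail this comparison
if [ "$(git rev-parse "$rev^{tree}")" = "$tree" ]; then
  echo "rev and tree match"
fi
```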
thufschmitt commented on Oct 27, 2023
Oh, that is great! I didn't expect there would be such an easy (and cheap-ish) solution.
Ericson2314 commented on Oct 29, 2023
Right, that's why I very much want to get my git hashing changes in at the same time we do this redesign: being able to […] is very useful, especially as we approach a world where there are quite a lot of different ways to fetch git things.

Ultimately, in a world with signed commits being the norm, we should be writing down commit hashes and public keys in the input spec, and then everything is verified via Merkle inclusion proofs from there.
roberth commented on Nov 4, 2023
Probably should be in `git_fetch_options` in `libgit2`, but I don't see how to do it.

It doesn't support sparse checkouts, so it wouldn't surprise me if clone filtering (fetch filtering?) isn't implemented yet either. Partial cloning is also still an open feature issue. (And the code suggests that the filter options are handled by the partial cloning feature.)

Neither `remote.h`, nor `fetch.h`, nor `clone.h` mentions "blob" either. I assume that it's just not implemented yet. We might want to use the CLI for this procedure until it is implemented in libgit2.
fetchGit #9327

roberth commented on Mar 26, 2024
New commands with `--filter=tree:0` instead of `--filter=blob:none`.

We can improve on #5313 (comment). Re-running it today gives `0m1.299s`, but `tree:0` is even faster.

Getting or checking an arbitrary revision is slower, but this is only needed when we don't have a lock file to cache the tree hash. For comparison, the GitHub API responds in 0.12 to 0.4 seconds for an arbitrary commit.
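A sketch of the `tree:0` variant with a local stand-in remote (hypothetical names; the filter recorded at clone time is reused by subsequent fetches, so syncing a new upstream commit transfers commit objects only):

```shell
set -eu
tmp=$(mktemp -d)
cd "$tmp"
git init -q remote
git -C remote config user.email demo@example.com
git -C remote config user.name demo
echo one > remote/f
git -C remote add f
git -C remote commit -qm c1
git -C remote config uploadpack.allowFilter true
# treeless mirror: only commit objects are transferred
git clone -q --bare --filter=tree:0 "file://$tmp/remote" mirror.git
# a new commit appears upstream...
echo two > remote/f
git -C remote commit -qam c2
rev=$(git -C remote rev-parse HEAD)
# ...sync it (the tree:0 filter is remembered for this remote), then
# read the new commit's tree hash locally from the commit object
git -C mirror.git fetch -q origin 'refs/heads/*:refs/heads/*'
git -C mirror.git rev-parse "$rev^{tree}"
```

Since a commit object embeds its tree hash, the treeless mirror answers the commit-to-tree question without ever downloading trees or blobs, which is why it beats `blob:none` for this lookup.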
thufschmitt commented on Mar 26, 2024
@roberth that's sweet :)
Does that mean that for fetching `github:nixos/nixpkgs/{commitHash}` we can use the `/archive` GitHub endpoint to fetch the archive corresponding to it?
roberth commented on Mar 26, 2024
@thufschmitt That's the plan. We could elaborate that a bit: use the `/archive` GitHub endpoint to fetch the archive corresponding to the tree hash.

With @DavHau we discussed two implementation strategies: in `GitArchiveInputScheme`, or in `GitInputScheme`. The latter appears more elegant, as we can re-frame the new fetching strategy as an alternate "git transport", somewhat similar to how git itself can deal with multiple protocols.

Even submodule support seems within reach that way, although for that we do need the slightly slower `blob:none`, which fetches a "wintery" tree that does contain references to submodules. But well, that's extra anyway, because `github:` doesn't support submodules yet. (Also probably a similar story for subtree fetching, like the `nixpkgs/lib` flake.)

nixos-discourse commented on Jul 15, 2025
This issue has been mentioned on NixOS Discourse. There might be relevant details there:
https://discourse.nixos.org/t/fetchfromgithub-and-the-versioneer-fixing-source-reproducibility/66539/2
`github:` fetcher to fetch subtrees using the trees API #14715