Here at Microsoft we have teams of all shapes and sizes, and many of them are already using Git or are moving that way. For the most part, the Git client and Team Services Git repos work great for them. However, we also have a handful of teams with repos of unusual size! For example, the Windows codebase has over 3.5 million files and is over 270 GB in size. The Git client was never designed to work with repos with that many files or that much content. You can see that in action when you run “git checkout” and it takes up to 3 hours, or even a simple “git status” takes almost 10 minutes to run. That’s assuming you can get past the “git clone”, which takes 12+ hours.
Even so, we are fans of Git, and we were not deterred. That’s why we’ve been working hard on a solution that allows the Git client to scale to repos of any size. Today, we’re introducing GVFS (Git Virtual File System), which virtualizes the file system beneath your repo and makes it appear as though all the files in your repo are present, but in reality only downloads a file the first time it is opened. GVFS also actively manages how much of the repo Git has to consider in operations like checkout and status, since any file that has not been hydrated can be safely ignored. And because we do this all at the file system level, your IDEs and build tools don’t need to change at all!
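To make that idea a little more concrete, here is a minimal sketch of on-demand hydration. This is not the real GVFS implementation (which is a kernel-mode file system driver plus a managed client); every name below (fetch_blob, REMOTE_OBJECT_STORE, the zero-size placeholder check) is a hypothetical stand-in used purely to illustrate "download a file the first time it is opened."

```python
# Conceptual sketch only -- not the actual GVFS driver. All names here
# (REMOTE_OBJECT_STORE, fetch_blob, the placeholder check) are hypothetical.
import os

REMOTE_OBJECT_STORE = "https://example.com/objects"  # hypothetical endpoint


def fetch_blob(object_id: str) -> bytes:
    """Download a single Git blob from the remote object store on demand.

    Stubbed out here; a real client would issue an HTTP request and verify
    the returned content against the object id.
    """
    raise NotImplementedError("illustrative placeholder")


def read_file(repo_root: str, rel_path: str, object_id: str) -> bytes:
    """Return file contents, hydrating the placeholder on first access."""
    local_path = os.path.join(repo_root, rel_path)
    # Simplification: treat a missing or empty file as an unhydrated placeholder.
    if not os.path.exists(local_path) or os.path.getsize(local_path) == 0:
        data = fetch_blob(object_id)  # pull just this one blob, not the whole repo
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        with open(local_path, "wb") as f:
            f.write(data)
    with open(local_path, "rb") as f:
        return f.read()
```

The key point the sketch tries to capture is that the cost of a file is paid only when something actually opens it, which is also why Git can safely skip unhydrated files in operations like status.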
In a repo that is this large, no developer builds the entire source tree. Instead, they typically download the build outputs from the most recent official build, and only build a small portion of the sources related to the area they are modifying. Therefore, even though there are over 3 million files in the repo, a typical developer will only need to download and use about 50-100K of those files.
With GVFS, this means that they now have a Git experience that is much more manageable: clone now takes a few minutes instead of 12+ hours, checkout takes 30 seconds instead of 2-3 hours, and status takes 4-5 seconds instead of 10 minutes. And we’re working on making those numbers even better. (Of course, the tradeoff is that their first build takes a little longer because it has to download each of the files that it is building, but subsequent builds are no slower than normal.)
While GVFS is still in progress, we’re excited to announce that we are open sourcing the client code at https://github.com/Microsoft/gvfs. Feel free to give it a try, but please be aware that it still relies on a pre-release file system driver. The driver binaries are also available for preview as a NuGet package, and your best bet is to play with GVFS in a VM and not in any production environment.
In addition to the GVFS sources, we’ve also made some changes to Git to allow it to work well on a GVFS-backed repo, and those sources are available at https://github.com/Microsoft/git. And lastly, GVFS relies on a protocol extension that any service can implement; the protocol is available at https://github.com/Microsoft/gvfs/blob/master/Protocol.md.
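To give a rough feel for what a client of such a protocol does, here is a short Python sketch of requesting a batch of missing objects from a server. The endpoint path and JSON fields shown are illustrative assumptions, not a transcription of the spec; the authoritative definition lives in Protocol.md linked above.

```python
# Illustrative sketch of an "objects on demand" request. The endpoint path
# ("/gvfs/objects") and payload fields below are assumptions for illustration;
# consult Protocol.md for the real contract.
import json
import urllib.request


def download_objects(base_url: str, object_ids: list[str]) -> bytes:
    """Ask the server for a pack containing only the requested objects."""
    body = json.dumps({"objectIds": object_ids, "commitDepth": 1}).encode()
    req = urllib.request.Request(
        f"{base_url}/gvfs/objects",  # assumed endpoint path
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()  # e.g. pack data the client would index locally
```

Because the protocol is just an extension any service can implement, a server only needs to answer these object requests for a GVFS-style client to work against it.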
IMHO, the problems you described in the first paragraph of this article are caused for the most part not by Git itself, but by the NTFS file system used in Windows. Mac’s file system (NFS+) handles a very large number of small files (common to Git, npm repositories, etc.) without any issues, or at least much, much better than NTFS.
And probably, instead of inventing yet more weird stuff like GVFS, Microsoft would be better off investing its time and money in fixing the root issue, that is, the NTFS file system…
I can definitely see how improving NTFS could help developers download 270 GIGABYTES of data on an initial clone faster. The limit definitely isn’t the speed of the link between them and the git server.
(sarcasm, for the Americans amongst you)
How often have you tried to put 3.5M files / 270 GB on your NFS+ partition? If the answer is never, you might not be in any position to claim anything. Do show your biggest repos and benchmarks on how much faster they are on NFS+ compared to NTFS. It will be interesting to see how much performance would increase on a different FS. Though you could’ve included those in your original message, since you’ve clearly already done them.
Also, I’m sure macOS can magically fix the problem of transferring millions of files, as well as doing other operations on them. Do tell how this magic works; my macOS doesn’t seem to have it. Have I not turned some option on?
Not to mention GVFS is not some “weird stuff”, except maybe for macOS users, where you don’t get to use anything other than what’s given to you, so you must always claim that’s the best there is and nothing can beat it. But maybe if you opened your eyes a bit and thought about it, you might understand why this is actually useful in many different situations.
Have you ever worked with a codebase as large as Windows?
Neither NFSv4 nor HFS+ will be much help here.
300 gigabytes of source code will take some time to download; what’s done here is to download only the metadata and the files you actually need locally.
I have not worked with 280-gigabyte repos myself, but even tens of gigabytes are a pain to work with most of the time.
NTFS and HFS+ are both old filesystems, but they are stable, and it will take time to replace either of them. But no, I don’t think you understand the issue at hand here.
IMHO, if you’d used HFS+ or APFS as a comparison against NTFS, then there might be some value to your statements, if backed up with a performance comparison showing HFS+ with journaling enabled against NTFS. But NFS is a network file system; NTFS is a disk file system.
The order-of-magnitude performance gains this brings likely outweigh any OS-level file system differences. File systems are tricky to compare, as they have different features that impact performance.
Most people don’t deal with a repo this big… for those who do, this looks to be a valuable time-saving tool.
Yes, I meant HFS+, of course. “NFS+” was a typo…
Maybe I was wrong to call this GVFS tech “weird stuff”, as it will help improve Git performance by downloading only the files that are needed. But my point that NTFS is way slower than other file systems (e.g. HFS+) when dealing with a very large number of small files is still valid. I’m too lazy to do benchmarks and such, but maybe I will. The speed difference is noticeable to the naked eye, though.
I don’t suppose the problem could be the 3.5 million files and 270 GB of bloatware that is the Windows codebase?
Shots fired!
But I may just like the new OSS flavor of Microsoft.
This really doesn’t sound that crazy to me. Consider that, according to the blog, a single branch of Windows is roughly 50K files (which actually sounds pretty small to me; I regularly work on a product with a similar file count that is much smaller than Windows). Then consider that you have 30 years of branches, versions, etc., and 3.5 million files doesn’t sound particularly huge. That would mean each file has been modified just 70 times on average over the years.
What issue did you have using submodules? The scenario where a developer only needs to build a small part of the tree seems like a perfect use case for them.