A couple of years ago, Microsoft made the decision to begin a multi-year investment in revitalizing our engineering system across the company. We are a big company with tons of teams – each with their own products, priorities, processes and tools. There are some “common” tools but also a lot of diversity – with VERY MANY internally developed one-off tools (by team I kind of mean division – thousands of engineers).
There are a lot of downsides to this:
- Lots of redundant investments in teams building similar tooling
- Inability to fund any of the tooling to “critical mass”
- Difficulty for employees to move around the company due to different tools and processes
- Difficulty in sharing code across organizations
- Friction for new hires getting started due to an overabundance of “MS-only” tools
- And more…
We set out on an effort we call the “One Engineering System” or “1ES”. Just yesterday we had a 1ES day where thousands of engineers gathered to celebrate the progress we’ve made, to learn about the current state and to discuss the path forward. It was a surprisingly good event.
Aside… You might be asking yourself – hey, you’ve been telling us for years Microsoft uses TFS, have you been lying to us? No, I haven’t. Over 50K people have regularly used TFS but they don’t always use it for everything. Some use it for everything. Some use only work item tracking. Some only version control. Some build … We had internal versions (and in many cases more than one) of virtually everything TFS does and someone somewhere used them all. It was a bit of chaos, quite honestly. But, I think I can safely say, when aggregated and weighed – TFS had more adoption than any other set of tools.
I also want to point out that, when I say engineering system here, I am using the term VERY broadly. It includes but is not limited to:
- Source control
- Work management
- Builds
- Release
- Testing
- Package management
- Telemetry
- Flighting
- Incident management
- Localization
- Security scanning
- Accessibility
- Compliance management
- Code signing
- Static analysis
- and much, much more
So, back to the story. When we embarked on this journey, we had some heated debates about where we were going, what to prioritize, etc. You know, developers never have opinions. 🙂 There’s no way to address everything at once without failing miserably, so we agreed to start by tackling 3 problems:
- Work planning
- Source control
- Build
I won’t go into detailed reasons other than to say those are foundational and so much else integrates with them, builds on them etc. that they made sense. I’ll also observe that we had a HUGE amount of pain around build times and reliability due to the size of our products – some hundreds of millions of lines of code.
Over the intervening time those initial 3 investments have grown and, to varying degrees, the 1ES effort touches almost every aspect of our engineering process.
We put some interesting stakes in the ground. Some included:
The cloud is the future – Much of our infrastructure and tools were hosted internally (including TFS). We agreed that the cloud is the future – mobility, management, evolution, elasticity, all the reasons you can think of. A few years ago, that was very controversial. How could Microsoft put all our IP in the cloud? What about performance? What about security? What about reliability? What about compliance and control? What about… It took time but we eventually got a critical mass OK with the idea and as the years have passed, that decision has only made more and more sense and everyone is excited about moving to cloud.
1st party == 3rd party – This is an expression we use internally that means, as much as possible, we want to use what we ship and ship what we use. It’s not 100% and it’s not always concurrent but it’s the direction – the default assumption, unless there’s a good reason to do something else.
Visual Studio Team Services is the foundation – We made a bet on Team Services as the backbone. We need a fabric that ties our engineering system together – a hub from which you learn about and reach everything. That hub needs to be modern, rich, extensible, etc. Every team needs to be able to contribute and share their distinctive contributions to the engineering system. Team Services fits the bill perfectly. Over the past year, usage of Team Services within Microsoft has grown from a couple of thousand to over 50,000 committed users. Like with TFS, not every team uses it for everything yet, but momentum in that direction is strong.
Team Services work planning – Having chosen Team Services, it was pretty natural to choose the associated work management capabilities. We’ve on-boarded teams like the Windows group, with many thousands of users and many millions of work items, into a single Team Services account. We had to do a fair amount of performance and scale work to make that viable, BTW. At this point virtually every team at Microsoft has made this transition and all of our engineering work is being managed in Team Services.
Team Services Build orchestration & CloudBuild – I’m not going to drill on this topic too much because it’s a mammoth post in and of itself. I’ll summarize it to say we’ve chosen the Team Services Build service as our build orchestration system and the Team Services Build management experience as our UI. We have also built a new “make engine” (that we don’t yet ship) for some of our largest code bases that does extremely high scale and fine grained caching, parallelization and incrementality. We’ve seen multi-hour builds drop sometimes to minutes. More on this in a future post at some point.
After much backstory, on to the meat 🙂
Git for source control
Maybe the most controversial decision was what to use for source control. We had an internal source control system called Source Depot that virtually everyone used in the early 2000’s. Over time, TFS and its Team Foundation Version Control solution won over much of the company but never made progress with the biggest teams – like Windows and Office. Lots of reasons I think – some of it was just that the cost for such large teams to migrate was extremely high and the two systems (Source Depot and TFS) weren’t different enough to justify it.
But source control systems generate intense loyalty – more so than just about any other developer tool. So the argument between TFVC, Source Depot, Git, Mercurial, and more was ferocious and, quite honestly, we made a decision without ever getting consensus – it just wasn’t going to happen. We chose to standardize on Git for many reasons. Over time, that decision has gotten more and more adherents.
There were many arguments against choosing Git but the most concrete one was scale. There aren’t many companies with code bases the size of some of ours. Windows and Office, in particular (but there are others), are massive. Thousands of engineers, millions of files, thousands of build machines constantly building it – quite honestly, it’s mind boggling. To be clear, when I refer to Windows in this post, I’m actually painting with a very broad brush – it’s Windows for PC, Mobile, Server, HoloLens, Xbox, IOT, and more. And Git is a distributed version control system (DVCS). It copies the entire repo and all its history to your local machine. Doing that with Windows is laughable (and we got laughed at plenty). TFVC and Source Depot had both been carefully optimized for huge code bases and teams. Git had *never* been applied to a problem like this (or probably even within an order of magnitude of this) and many asserted it would *never* work.
The first big debate was – how many repos do you have – one for the whole company at one extreme or one for each small component? A big spectrum. Git is proven to work extremely well for a very large number of modest repos so we spent a bunch of time exploring what it would take to factor our large codebases into lots of tenable repos. Hmm. Ever worked in a huge code base for 20 years? Ever tried to go back afterwards and decompose it into small repos? You can guess what we discovered. The code is very hard to decompose. The cost would be very high. The risk from that level of churn would be enormous. And, we really do have scenarios where a single engineer needs to make sweeping changes across a very large swath of code. Trying to coordinate that across hundreds of repos would be very problematic.
After much hand wringing we decided our strategy needed to be “the right number of repos based on the character of the code”. Some code is separable (like microservices) and is ideal for isolated repos. Some code is not (like Windows core) and needs to be treated like a single repo. And, I want to emphasize, it’s not just about the difficulty of decomposing the code. Sometimes, in big highly related code bases, it really is better to treat the codebase as a whole. Maybe someday I’ll tell the story of Bing’s effort to componentize the core Bing platform into packages and the versioning problems that caused for them. They are currently backing away from that strategy.
That meant we had to embark upon scaling Git to work on codebases that are millions of files, hundreds of gigabytes and used by thousands of developers. As a contextual side note, even Source Depot did not scale to the entire Windows codebase. It had been split across 40+ depots so that we could scale it out but a layer was built over it so that, for most use cases, you could treat it like one. That abstraction wasn’t perfect and definitely created some friction.
We started down at least 2 failed paths to scale Git. Probably the most extensive one was to use Git submodules to stitch together lots of repos into a single “super” repo. I won’t go into details but after 6 months of working on that we realized it wasn’t going to work – too many edge cases, too much complexity and fragility. We needed a bulletproof solution that would be well supported by almost all Git tooling.
Close to a year ago we reset and focused on how we would actually get Git to scale to a single repo that could hold the entire Windows codebase (including estimates of growth and history) and support all the developers and build machines.
We tried an approach of “virtualizing” Git. Normally Git downloads *everything* when you clone. But what if it didn’t? What if we virtualized the storage under it so that it only downloaded the things you need? Then a clone of a massive 300GB repo becomes very fast. As I perform Git commands or read/write files in my enlistment, the system seamlessly fetches the content from the cloud (and then stores it locally so future accesses to that data are all local). The one downside to this is that you lose offline support. If you want that you have to “touch” everything to manifest it locally but you don’t lose anything else – you still get the 100% fidelity Git experience. And for our huge code bases, that was OK.
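To make the fetch-on-first-access idea concrete, here is a minimal sketch in Python. It is not how GVFS is implemented; the class name `LazyObjectStore`, the `REMOTE` dictionary standing in for the server, and the object IDs are all made up for illustration. It only shows the behavior described above: nothing is downloaded up front, the first read of an object goes to the "cloud", and every later read is local.

```python
from typing import Callable, Dict, List


class LazyObjectStore:
    """Hypothetical store: fetch an object on first read, then serve it locally."""

    def __init__(self, fetch_remote: Callable[[str], bytes]) -> None:
        self._fetch_remote = fetch_remote   # assumed stand-in for a server call
        self._local: Dict[str, bytes] = {}  # objects already downloaded
        self.remote_fetches = 0             # counts round trips to the "server"

    def read(self, object_id: str) -> bytes:
        # Only go to the remote side if this object has never been touched.
        if object_id not in self._local:
            self._local[object_id] = self._fetch_remote(object_id)
            self.remote_fetches += 1
        return self._local[object_id]

    def local_ids(self) -> List[str]:
        # What is available offline: only what has been touched so far.
        return sorted(self._local)


# Illustrative "server" content; a real repo would hold millions of objects.
REMOTE = {"a1": b"kernel source", "b2": b"shell source", "c3": b"huge binary"}

store = LazyObjectStore(REMOTE.__getitem__)
first = store.read("a1")    # downloaded from the remote side
second = store.read("a1")   # served from the local copy, no new fetch
```

After the two reads, `store.remote_fetches` is 1 and `store.local_ids()` contains only `"a1"`. That is also why offline support is lost: anything you have not touched yet simply is not on your machine.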
It was a promising approach and we began to prototype it. We called the effort Git Virtual File System or GVFS. We set out with the goal of making as few changes to git.exe as possible. For sure we didn’t want to fork Git – that would be a disaster. And we didn’t want to change it in a way that the community would never take our contributions back either. So we walked a fine line doing as much “under” Git with a virtual file system driver as we could.
The file system driver basically virtualizes 2 things:
- The .git folder – This is where all your pack files, history, etc. are stored. It’s the “whole thing” by default. We virtualized this to pull down only the files we needed when we needed them.
- The “working directory” – the place you go to actually edit your source, build it, etc. GVFS monitors the working directory and automatically “checks out” any file that you touch making it feel like all the files are there but not paying the cost unless you actually access them.
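The working directory half of this can be sketched the same way. The Python below is a toy model under stated assumptions, not the real driver: `VirtualWorkingDirectory`, `MANIFEST`, and `SERVER_OBJECTS` are hypothetical names, and a plain dictionary stands in for both the repo metadata and the server. The point it illustrates is that every path can be listed from metadata alone, while file contents are only pulled down for the paths you actually open.

```python
from typing import Callable, Dict, List


class VirtualWorkingDirectory:
    """Hypothetical model: all paths are visible, content arrives on first access."""

    def __init__(self, manifest: Dict[str, str],
                 read_object: Callable[[str], bytes]) -> None:
        self._manifest = manifest            # path -> object id (metadata only)
        self._read_object = read_object      # assumed stand-in for a download
        self._hydrated: Dict[str, bytes] = {}

    def listdir(self) -> List[str]:
        # Listing needs no file content, so nothing is downloaded here.
        return sorted(self._manifest)

    def read_file(self, path: str) -> bytes:
        # "Check out" the file the first time it is touched.
        if path not in self._hydrated:
            self._hydrated[path] = self._read_object(self._manifest[path])
        return self._hydrated[path]

    def hydrated_paths(self) -> List[str]:
        return sorted(self._hydrated)


# Illustrative data only.
SERVER_OBJECTS = {"id-1": b"int main() {}", "id-2": b"# docs", "id-3": b"\x00" * 1024}
MANIFEST = {"src/main.c": "id-1", "README.md": "id-2", "assets/big.bin": "id-3"}

workdir = VirtualWorkingDirectory(MANIFEST, SERVER_OBJECTS.__getitem__)
all_paths = workdir.listdir()               # all three paths appear
content = workdir.read_file("src/main.c")   # only this file is fetched
touched = workdir.hydrated_paths()
```

Here `all_paths` lists all three files, but `touched` contains only `src/main.c`; the large binary is never downloaded unless something reads it.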
As we progressed, as you’d imagine, we learned a lot. Among other things, we learned the Git server has to be smart. It has to pack the Git files in an optimal fashion so that it doesn’t have to send more to the client than absolutely necessary – think of it as optimizing locality of reference. So we made lots of enhancements to the Team Services/TFS Git server. We also discovered that Git has lots of scenarios where it touches stuff it really doesn’t need to. This never really mattered before because it was all local and used for modestly sized repos so it was fast – but when touching it means downloading it from the server or scanning 6,000,000 files, uh oh. So we’ve been investing heavily in performance optimizations to Git. Many of them also benefit “normal” repos to some degree but they are critical for mega repos. We’ve been submitting many of these improvements to the Git OSS project and have enjoyed a good working relationship with them.
So, fast forward to today. It works! We have all the code from 40+ Windows Source Depot servers in a single Git repo hosted on VS Team Services – and it’s very usable. You can enlist in a few minutes and do all your normal Git operations in seconds. And, for all intents and purposes, it’s transparent. It’s just Git. Your devs keep working the way they work, using the tools they use. Your builds just work. Etc. It’s pretty frick’n amazing. Magic!
As a side effect, this approach also has some very nice characteristics for large binary files. It doesn’t extend Git with a new mechanism like LFS does, no turds, etc. It allows you to treat large binary files like any other file but it only downloads the blobs you actually ever touch.
Git Merge
Today, at the Git Merge conference in Brussels, Saeed Noursalehi shared the work we’ve been doing – going into excruciating detail on what we’ve done and what we’ve learned. At the same time, we open sourced all our work. We’ve also included some additional server protocols we needed to introduce. You can find the GVFS project and the changes we’ve made to Git.exe in the Microsoft GitHub organization. GVFS relies on a new Windows filter driver (the moral equivalent of the FUSE driver in Linux) and we’ve worked with the Windows team to make an early drop of that available so you can try GVFS. You can read more and get more resources on Saeed’s blog post. I encourage you to check it out. You can even install it and give it a try.
While I’ll celebrate that it works, I also want to emphasize that it is still very much a work in progress. We aren’t done with any aspect of it. We think we have proven the concept but there’s much work to be done to make it a reality. The point of announcing this now and open sourcing it is to engage with the community to work together to help scale Git to the largest code bases.
Sorry for the long post but I hope it was interesting. I’m very excited about the work – both on 1ES at Microsoft and on scaling Git.
Brian
So interesting! Thanks for sharing
Incidentally, do you have stats you can share about the general trend of Git/TFVC usage in VSTS?
@Sam, Sure, in terms of active users on Team Services, TFVC is about twice as large as Git but Git is growing a little faster. I haven’t tried to plot the theoretical cross over point (where Git passes TFVC) but the growth rates are both good enough that it’s reasonably far out in the future – beyond any time horizon I would try to predict.
Brian
Thanks Brian. Really interesting!
Great post!
I would love to hear from you or one of your colleagues about how you organize all the VSTS artifacts to scale well. The Team Services docs have a great section on scaling work items and work management. But it lacks tips on how to use areas and iterations so the teams don’t get lost in poorly designed Area and Iteration hierarchies. Also, with multiple teams in a Team Project, we need to organize code, build definitions, release definitions, and so on. My naive approach is to simply create a top level in all the artifacts to mimic the large teams or product groups. There are many posts from over the years. But they get quickly outdated as VSTS evolves so quickly. Yes, that’s a high-class problem and congratulations on having it 🙂
Very cool!!
This is similar to the “sparse workarea” concept used in my previous project (except that they have had it for the last 20 years). Your workarea contains only the files you work on; everything else stays in the main repository.
It would have been great if the implementation were generic, or a separate file system (like aufs, but on Windows), so it could be extended outside Git.
Thanks, really interesting topic. Do you plan any Linux/MacOS support?
@Alexander, Yes, we are looking at Mac support now. The Office team needs it because they share a great deal of code between their Mac/iOS products and their Windows products – and those code bases are very large. The solution may not be identical in every way but we will create a good solution there too.
Brian
Does MS store documentation in this GIT repository as well? What about IT operational work item tracking? (tickets, etc.)
@Nate, Documentation – not a ton. We store most of our documents in SharePoint. We have some basic wiki like/markdown in Team Services. That will start to grow shortly as we expand our Wiki features. Tickets – Team Services doesn’t have an optimized ticketing experience. You can use it for a low sophistication ticketing system but it’s not that. We have an internal tool that we call ICM (Incident something Management). It integrates with Team Services and may even be partially built on Team Services but has many more capabilities like managing “on call” schedules, auto dialer, bridge management, etc.
Brian
Wow, that’s big, that’s great!
Microsoft and Facebook both landed on a single repo strategy. Facebook has given up and moved to Mercurial. I am very happy Microsoft has kept improving Git. Feel great about Microsoft.
https://code.facebook.com/posts/218678814984400/scaling-mercurial-at-facebook/
Microsoft takes truly distributed VCS (I mean git) and tries to make into centralized VCS with distributed flavor. Something tells me that this is not going to work well. Anyway, git fanboys and cultists are now excited. Good job, Microsoft.
That’s kind of funny.
This is great work. Also great to hear your voice again in all this. I’m impressed with the new direction Microsoft is taking, and with the willingness to add and collaborate in the open source ecosystem.
Hi Brian, as always great post !!!
I won’t share questions or concerns. I just want to let you know that you made my Saturday morning in Toronto, reading this while I’m having a cup of coffee. Keep up the work in progress; this is something I can really start to share at the enterprise level (I’ve been in the “Git is not enough” discussion several times)
Best regards
-El Bruno
What? This was a long post?! I expected something even more fine grained! 🙂
Is there any more detailed documentation on the Windows filter driver and changes done to git?
But seriously, this is very interesting solution. I guess it’s important only for 0,01% of companies but the engineering and design sounds really good.