Publication Data
Large-scale cluster management at Google with Borg
Abstract: Google's Borg system is a cluster manager that runs hundreds
of thousands of jobs, from many thousands of different applications, across a number of
clusters each with up to tens of thousands of machines. It achieves high utilization by
combining admission control, efficient task-packing, over-commitment, and machine
sharing with process-level performance isolation. It supports high-availability
applications with runtime features that minimize fault-recovery time, and scheduling
policies that reduce the probability of correlated failures. Borg simplifies life for
its users by offering a declarative job specification language, name service
integration, real-time job monitoring, and tools to analyze and simulate system
behavior.
We present a summary of the Borg system architecture and features, important design decisions, a quantitative analysis of some of its policy decisions, and a qualitative examination of lessons learned from a decade of operational experience with it.