A Q&A with Malte Schwarzkopf, co-author of the Omega paper.
tl;dr: Malte explains how the primary goal of Google Omega’s shared-state design was flexibility in software engineering, rather than scalability. He explains that shared-state scheduling enables useful policies, such as priority preemption and interference-aware scheduling. Prior two-level scheduler designs (e.g. in YARN and Mesos) offer similar flexibility benefits, but at the time of the Omega paper suffered from the problems of information hiding (not all state is visible to all schedulers) and hoarding when gang-scheduling (accumulating resources for a gang leaves them idle). He also comments on why Google chose the shared-state approach, and why Google does not use rigid global fairness policies. Finally, Malte discusses how the support for modular schedulers in systems like Mesos and Kubernetes makes it possible to bring Omega’s benefits to open-source cluster managers, and suggests that smarter scheduling will be the next key challenge for users of these systems who want to achieve the same efficiency as Google’s infrastructure.
Kismatic: You published the Omega paper in 2013, and were involved in the exploration of shared-state scheduling in Omega. Do the lessons from that paper still ring true today, and what has changed in the two years since?
MS: Absolutely. Cluster orchestration is currently a very fast-moving area, but I think that the core principles of Omega are, if anything, more timely and important today.
One absolutely huge change that has taken place since 2013 is that much of the infrastructure to run a cluster manager like Omega has been commodified, largely thanks to Kubernetes and Mesos. Back when we did the work in 2011-2012, it would have been quite tricky to test the hypotheses that underlie Omega’s shared-state scheduling outside Google. Today, it’s much easier, as one can just hack up an existing modular scheduler and try new approaches in a test-bed or use a public trace like the 2012 Google cluster trace. Moreover, the problems that we addressed in Omega — how to efficiently schedule different workloads atop a shared cluster, and how to enable different teams to develop their own schedulers with access to all cluster state — are much more commonly acknowledged across the industry today. Containers allow many different applications to be deployed to a shared cluster, people are starting to realise that not all workloads can or should be scheduled the same way, and orchestration is seen as a problem of managing shared distributed state (as attested by the use of ZooKeeper and etcd within many orchestration layers).
I believe that the idea at the core of Omega — modelling job scheduling, and cluster management operations more broadly, as transactions against a shared state — is still very much applicable. In fact, consider the principle at the heart of Kubernetes: the user proposes a desired state, and Kubernetes then moves the cluster reality towards this goal. That’s exactly what happens in Omega when schedulers propose changes to the cluster state in order to move towards a desired next state.
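To make the transaction idea concrete, here is a minimal sketch in Python. It is illustrative only, not Omega’s actual implementation, and all names (CellState, SharedState, the per-machine CPU counts) are invented: each scheduler takes a private snapshot of the full shared state, decides on placements, and then tries to commit them; the commit fails if another scheduler has claimed the same resources in the meantime.

```python
# Illustrative sketch only, not Omega's real code.
import copy

class CellState:
    """Shared cluster state: free CPU cores per machine (hypothetical fields)."""
    def __init__(self, free_cpus):
        self.free_cpus = free_cpus

class SharedState:
    def __init__(self, state):
        self.state = state

    def snapshot(self):
        # Each scheduler gets its own (possibly soon-stale) copy of the full state.
        return copy.deepcopy(self.state)

    def commit(self, claims):
        # Fine-grained conflict check: fail only if the specific resources this
        # transaction claims were taken by another scheduler in the meantime.
        if any(self.state.free_cpus.get(m, 0) < c for m, c in claims.items()):
            return False   # conflict: caller re-syncs its snapshot and retries
        for m, c in claims.items():
            self.state.free_cpus[m] -= c
        return True

cell = SharedState(CellState({"m1": 8, "m2": 4}))
my_view = cell.snapshot()                # a scheduler's private view of everything
print(cell.commit({"m1": 2}))            # True: place a 2-core task on m1
print(cell.commit({"m2": 6}))            # False: not enough cores left, retry elsewhere
```

The Omega paper discusses both coarse-grained and fine-grained conflict detection; the sketch above does the latter, failing only when the specifically claimed resources are gone.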
Kismatic: Could you explain why the shared-state model of Omega allows for more flexible and efficient resource management in heterogeneous, mixed workload scenarios?
MS: Sure. Let’s think about the environment that many organisations were running a few years ago: a single Hadoop cluster on which multiple users run MapReduce jobs. That’s a pretty easy scheduling problem: all machines have n slots for MapReduce workers, and the scheduler assigns map and reduce tasks to these workers until either no more work is left, or all slots are busy.
But a slot is, of course, a very coarse-grained unit of scheduling. Not all MapReduce jobs require the same amount of resources, and not all of them use all of their allocated resources. Indeed, some of the work on the cluster may not consist of MapReduce tasks at all! Instead of statically partitioning machines along the single dimension of “slots”, it’s a much better idea to “bin-pack” tasks onto machines to increase utilization. That’s what most modern schedulers do.
Multi-dimensional bin-packing is far from easy — in fact, it’s an NP-complete problem. The challenge is made even more difficult by the fact that different types of work prefer different ways of bin-packing tasks together: for example, it is perfectly okay for another MapReduce task to run on a machine whose network bandwidth is already 70% utilized, but it’s probably not okay to bin-pack a front-end web server there, since the high network load will affect request tail latency!
In Omega, each scheduler can look at all the available resources, existing tasks and current loads in the cluster (the “cluster state”) and make its decisions based on this. In other words, different schedulers can use different variations of bin-packing algorithms, but all have the same information available to them.
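As a rough illustration of that point, the sketch below shows two schedulers applying different fit functions to the exact same view of the cluster. The machines, fields, and thresholds are invented for the example, and real Borg/Omega scoring is far more sophisticated.

```python
# Hypothetical cluster state, fully visible to every scheduler.
machines = {
    "m1": {"free_cpu": 4, "net_util": 0.75},   # network already 75% utilised
    "m2": {"free_cpu": 2, "net_util": 0.10},
}

def batch_fit(m, task):
    # A MapReduce-style task only needs its CPU request to fit.
    return m["free_cpu"] >= task["cpu"]

def service_fit(m, task):
    # A latency-sensitive web server additionally avoids machines with busy
    # NICs, since contention there hurts request tail latency.
    return m["free_cpu"] >= task["cpu"] and m["net_util"] < 0.5

def place(task, fit):
    candidates = [n for n, m in machines.items() if fit(m, task)]
    # Simple bin-packing heuristic: pick the tightest-fitting machine.
    return min(candidates, key=lambda n: machines[n]["free_cpu"], default=None)

print(place({"cpu": 3}, batch_fit))      # 'm1': the batch job tolerates the busy network
print(place({"cpu": 2}, service_fit))    # 'm2': m1 is rejected to protect tail latency
```

Both schedulers see everything; they simply weigh it differently.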
Kismatic: It’s a fairly common assumption that Omega was developed as a replacement for Borg due to scalability limitations that Borg might soon hit. Could you clear up some of the confusion there?
MS: That is definitely not the case, and in retrospect, I wish we had made this clearer in the paper. A lot of follow-up work has effectively said things along the lines of “centralized schedulers cannot scale to large clusters, and therefore we must move to distributed scheduler architectures”. However, the primary reason for developing Omega’s shared-state model was not scalability but engineering flexibility. The goal was to let different teams develop schedulers independently, without being bound by assumptions in a shared code base, and this was much more important than the scalability aspect. The additional scalability from parallelizing schedulers was a nice add-on, but realistically, a well-engineered centralized scheduler can scale to very large clusters and workloads. The Borg paper actually makes this point: “We are not sure where the ultimate scalability limit to Borg’s centralized architecture will come from; so far, every time we have approached a limit, we’ve managed to eliminate it.” (§3.4)
There are some niche situations in which distributed scheduling is really required: a key use case is the rapid placement of latency-sensitive, short-term “requests” to existing workers. This is what the Sparrow scheduler targets, and a distributed design is absolutely appropriate for this workload. But if your tasks are not latency-sensitive, or the throughput does not exceed tens of thousands of tasks per second, a centralized scheduler will do just fine even for over 10,000 machines!
Kismatic: Many people are excited about the Mesos cluster manager, as it offers some of these features. In the Omega paper, you point out that Mesos’ two-level scheduling model had some drawbacks related to hoarding and information hiding. Could you please elaborate on that for our readers?
MS: First of all, I should be very clear about one thing: the description of Mesos in the Omega paper is based on Mesos as it was in 2012. It’s been a few years since, and some of the issues we identified have since been addressed and are no longer applicable in exactly the same way.
There are two key drawbacks we found with an offer-based model, and indeed with any model that exposes only a subset of the cluster state to different schedulers. Much of this also affects request-based designs, such as YARN.
The first is related to the fact that a scheduler can only see the resources it has been allocated or offered. In the Mesos offer system, the resource manager fundamentally says to an application scheduler “hey, here are some resources that you can have — which would you like?”, and the application-level scheduler then picks from those. However, the application-level scheduler cannot see other relevant information for its decisions: for example, who is using the resources that aren’t on offer (which may impact the offered resources’ quality), and whether there are more preferable resources available that the application scheduler would rather choose (in other words, it should reject the offer and wait for a better one). This is the problem of information hiding. Another practical example of a desirable feature complicated by information hiding is priority preemption: if high-priority tasks need to be able to “kick out” low-priority ones, any location where a low-priority task runs is also effectively a resource on offer for specific workloads — but the scheduler will never see it!
Of course, this specific problem can be solved: for example, Mesos could offer resources occupied by low-priority tasks to high-priority schedulers; or the schedulers’ filters could indicate that they want to be offered preemptible resources. But this would complicate the resource manager API and logic, and bake in policies at the resource manager level.
The second problem is hoarding, which especially affects “gang-scheduled” jobs. These are jobs that must acquire all of the requested resources before any of their tasks can start running. An MPI job is a traditional example of such a computation, but there are others, too: stateful stream processing systems also require their entire pipelines to be intact to get started. In a two-level model such as Mesos, this poses a problem: the application-level scheduler can either wait until it is offered sufficient resources in a single offer to get everything started (which may never happen), or accept a series of smaller offers to accumulate sufficient resources. If it does the latter, the resources are uselessly hoarded until sufficiently many offers have been accepted, even though they could have been used productively for other workloads (such as low-priority MapReduce jobs) in the meantime.
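A toy contrast of the two behaviours (slot counts and helper names are made up): with offers, everything accepted so far sits idle until the whole gang fits, whereas a scheduler with full visibility can place the gang all-or-nothing and hold nothing back when it does not fit.

```python
GANG_SIZE = 5   # the job needs 5 tasks placed before any of them can start

def offer_based(offers):
    # Accept offers one at a time; accepted resources are hoarded (held idle)
    # until the full gang can be launched.
    hoarded = []
    for offer in offers:
        hoarded.append(offer)
        if len(hoarded) >= GANG_SIZE:
            return hoarded            # only now can the job actually start
    return None                       # gang incomplete; the hoarded slots were wasted

def shared_state(free_slots):
    # With the full cluster state visible, place the gang atomically or not at
    # all, leaving the slots free for other work in the meantime.
    return free_slots[:GANG_SIZE] if len(free_slots) >= GANG_SIZE else None

print(offer_based(["slot-1", "slot-2", "slot-3"]))           # None, yet 3 slots were held idle
print(shared_state(["slot-%d" % i for i in range(1, 9)]))    # whole gang placed in one transaction
```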
I recently chatted with Ben Hindman and some of the core Mesos developers at Mesosphere about this, and they told me about some exciting planned changes to the core offer model that allow parallel offers and reservations, and which address the issues above. For example, Mesos will be able to make optimistic offers of the same resources to multiple schedulers, and “rescind” the losing scheduler’s offers. This requires a conflict resolution approach similar to Omega’s, and at that point, the two models almost converge: if Mesos offers all resources in the cluster to a scheduler (and rescinds those that are claimed by others), the scheduler has the same view of the cluster as it would in Omega (as Ben discusses in this slide deck). It looks like the implementation of this is still in its infancy, though.
Kismatic: In the Omega paper, you state that an offer-based two-level scheduling approach (as in Mesos) proved unsuitable for Google’s needs. Can you share why it was found to be very useful to share the full cluster state with all of the schedulers in order to manage dynamic shared workloads?
MS: Sure. Let’s take up the earlier example of priority preemption. At Google, it’s quite common for low-priority tasks to be terminated because their resources are required for something more important. In fact, when I ran my large MapReduce jobs, I would usually, over the total job runtime, expect to see a handful of workers fail and be rescheduled due to preemption. This happens because the Borg scheduler chose my tasks as preemption victims either because they were the least important ones, or because the specific resources that my tasks used were required for something else — for example, because a service job (such as GMail) experienced a load spike. But in order to choose tasks for eviction, the scheduler must be able to see them — even if they are not actually managed by this application-level scheduler. For example, an instance of Twitter’s Heron stream processor would want to be able to evict Hadoop or Spark jobs, even though they belong to a different “framework”.
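To make the visibility requirement concrete, here is a hedged sketch of victim selection. The task names, priorities, and the least-important-first policy are invented for illustration and are not Borg’s actual logic; the point is simply that the preempting scheduler must be able to see tasks placed by other schedulers or frameworks.

```python
# All tasks on one machine, including those owned by other schedulers/frameworks.
tasks_on_machine = [
    {"name": "gmail-frontend",      "priority": 200, "cpu": 4},
    {"name": "mapreduce-worker-17", "priority": 50,  "cpu": 2},
    {"name": "spark-executor-3",    "priority": 50,  "cpu": 2},
]

def pick_victims(tasks, cpu_needed, my_priority):
    # Only lower-priority tasks are eligible; evict the least important first.
    victims, freed = [], 0
    for t in sorted(tasks, key=lambda t: t["priority"]):
        if t["priority"] >= my_priority or freed >= cpu_needed:
            break
        victims.append(t["name"])
        freed += t["cpu"]
    return victims if freed >= cpu_needed else []

# A priority-100 service task needing 4 cores evicts the two batch workers:
print(pick_victims(tasks_on_machine, cpu_needed=4, my_priority=100))
```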
Another example is one that we mention in the Omega paper: instead of having statically sized MapReduce jobs, we looked into dynamically increasing their resource allocations at times of low cluster load. This makes total sense: why would the job use exactly 100 workers? That’s just a number that a human typed in — the scheduler has a much better idea of what a good number is at any particular point in time. In fact, other systems have since adopted similar ideas: for example, Microsoft’s Apollo scheduler supports “opportunistic tasks”, and research systems like Jockey and Quasar have shown that auto-scaling can really make a big difference. But to be able to make that decision, the scheduler must be able to see what the overall cluster state is, and hence an offer-based two-level design did not suit Google’s requirements with Omega.
Kismatic: You describe in the Omega paper that the Mesos Dominant Resource Fairness (DRF) algorithm is quite fast and takes around 1 ms to make a resource offer to the upstream frameworks/application schedulers. What are your thoughts on the DRF model overall, and why did Google choose a different approach?
MS: The reason why Google does not use DRF actually has nothing to do with algorithmic complexity or performance. It’s simply that rigid fairness is not required in the Google environment. Google values the flexibility of being able to make allocations temporarily unfair in order to get more work done and improve overall utility. Of course, Google also has to put safeguards in place to avoid people accidentally (or deliberately!) using enormous amounts of resources to the disadvantage of others. This works via a quota system: a virtual currency is used to budget and “buy” resources — if I run 100 high-priority tasks that each use 20 GB of memory for an hour, my team’s account will be debited 2,000 GB-hours of memory. Once the account runs out, the team has to talk to a human up the chain to increase its allocation or find other ways of getting the work done. In other words, splurging the monthly allocation in one day isn’t a good idea, and will get you into difficulty! The Borg paper describes the quota system in more detail (§2.6).
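For the arithmetic in that example (the numbers are simply the ones above; the real quota machinery described in the Borg paper is considerably richer):

```python
# Back-of-the-envelope quota debit for the example above.
tasks = 100
memory_gb_per_task = 20
hours = 1

debit_gb_hours = tasks * memory_gb_per_task * hours
print(debit_gb_hours, "GB-hours of memory debited")   # 2000 GB-hours
```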
This discussion raises a more fundamental question about the utility of fairness, though: would you rather maintain strict fairness and leave resources idle, or permit temporary unfairness if it brings higher utilization and work completes sooner? The right choice very much depends on the organisation and the workload, but for Google, efficiency trumps strict fairness.
Kismatic: Many people are excited about Mesos being heavily used by companies like Twitter and Apple to manage large compute clusters for some interesting applications. While Omega is an exciting paper to read, it is generally perceived that an open source implementation of something like Omega has yet to surface. Do you know of any efforts in this direction, especially perhaps with Kubernetes in mind?
MS: It’s true that no direct analogue to what we did in Omega exists in the open-source world right now. That said, Omega has had a major impact on several open-source projects. The first and most important is of course Kubernetes, Google’s open-source cluster manager, which was hugely influenced by both Borg and Omega. The key idea of Kubernetes is that the user should specify the desired state of their cluster, and leave the mechanics of making it happen to the cluster manager, which consists of various agents that together push towards the desired state. That’s a very similar concept to Omega’s approach — remember, it can take Kubernetes some time to converge towards the desired state, just as it does with scheduling decisions in Omega. In that sense, Kubernetes is an ideal platform to build an Omega-style scheduler upon. Indeed, the Kubernetes scheduler (kube-scheduler) is a pluggable component, so that’s very much possible: the architecture document states that the developers “expect to support multiple cluster schedulers and even user-provided schedulers in the future.” Since Kubernetes is completely componentized, one can certainly imagine multiple master controllers talking to multiple schedulers, or application-specific replication controllers making requests to application-specific schedulers.
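As a toy illustration of that convergence idea (this is not the Kubernetes API; the replica counts and names are invented), a single reconciliation pass just compares desired and observed state and emits the actions needed to close the gap:

```python
def reconcile_once(desired, observed):
    """One convergence step: compare desired vs. observed replica counts."""
    actions = []
    for name, want in desired.items():
        have = observed.get(name, 0)
        if have < want:
            actions.append(("start", name, want - have))
        elif have > want:
            actions.append(("stop", name, have - want))
    return actions

# The user declares what they want; agents repeatedly push reality towards it.
desired = {"web": 3, "batch-worker": 10}
observed = {"web": 1, "batch-worker": 12}
print(reconcile_once(desired, observed))
# [('start', 'web', 2), ('stop', 'batch-worker', 2)]
```

Repeated passes converge on the desired state, much as Omega’s scheduling transactions may need several retries before they commit.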
There have also been various other schedulers that take ideas from Omega — for example, Microsoft Research has built Mercury (paper coming out at USENIX ATC, but a TR is already available), which is a hybrid scheduler that allows workloads to choose between distributed decisions based on potentially stale state and centralized decisions based on accurate information. That code is currently being upstreamed into YARN, and certainly has a flavour of Omega to it.
Kismatic: At Kismatic, we are excited that Kubernetes supports pluggable components through its highly modular and extensible architecture. Since Kubernetes is fundamentally based on the design principles in Borg, do you think that an Omega-like shared-state scheduler could improve flexibility and efficiency in Kubernetes clusters?
MS: Absolutely. I’ve recently looked into what Kubernetes and Mesos do in terms of scheduling, and to be honest, it’s pretty naive compared to what Google does in Borg and Omega. However, what’s exciting is that pluggable scheduler APIs make it possible to improve upon the state of the art and for everyone to reap the benefits. So the foundation is there, and we can now do all the exciting stuff!
Just to illustrate how much of a change has happened here: at Cambridge, myself and a bunch of other folks actually built most of a full-blown cluster manager, Firmament, just in order to be able to do our research. We’ve got some pretty exciting results based on that work, but plugging into Kubernetes will make the benefits of such work much more readily available to the rest of the world.
I think a shared-state approach would actually help Kubernetes clusters really quite a lot: different applications and different pods really can have very different placement requirements, and integrating all of those into the kube-scheduler (or even variants of it) seems like a feature-explosion nightmare in waiting. Instead, having the ability to run multiple schedulers for specialized purposes seems like a good option for Kubernetes to me: it leads to better decisions, but keeps the code lean and modularized.
Kismatic: What other challenges do you see ahead for cluster schedulers? What are the next hard problems to solve?
MS: Containers have also made heterogeneous clusters shared between different workloads much more accessible. With Kubernetes or another container orchestration platform (be it Mesos, Tectonic or Docker Swarm), we can finally run entirely different applications on the same shared cluster without having to do contortions like starting native processes from map tasks in MapReduce or rebuilding applications against a particular API. I expect that this will lead to a push towards consolidation, and as a result, people will start to focus a lot more on scheduling things smartly. So far, most of the work in the open-source world seems to have been about getting things to run at all — which, at scale, is certainly hard enough — but I suspect better scheduling will be the next hurdle.
To name just a few hard challenges that we need to get better at:
- Avoiding co-location interference is super important with highly loaded machines. Performance for batch tasks can easily degrade by 3x if a poor scheduling decision puts pressure on a particular machine resource (e.g., disk, or the network, or even the L3 cache), and user-facing services may suffer even more (e.g., seeing huge tail latencies). While containers allow some degree of CPU and memory isolation, they fundamentally still share a lot of machine resources — hence, we need to get much better at predicting which workloads play nicely together and which are better kept apart.
- We need to get much better at what Google call “resource reclamation”: taking back surplus resources that a task (or container) reserved but isn’t actually using. It’s pretty well-known that humans tend to err on the side of caution and wildly overestimate their applications’ resource needs, or provision for peak load. But with a good scheduler, we can actually reclaim the slack reservations and use them for other, lower-priority workloads while they’re not required (see the sketch after this list). Google does this in Borg, and I believe it’s one of the key secrets to their clusters’ high efficiency. No open-source cluster manager currently has support for this, though.
- It’d be nice to increase flexibility: have the scheduler suspend containers, move them around and automatically scale deployments. For example, instead of preempting a container and destroying it to make way for another one that needs, say, CPU, we could also throttle the lower-priority container so that it is effectively suspended, but allow it to maintain its state in RAM if memory is not currently under pressure on the machine in question.
- We need to make it easier to have user-specified scheduling policies without users having to go all the way to writing their own scheduler from scratch. Realistically, most organisations — especially smaller or more traditional ones — don’t have the resources to write a framework scheduler, even against a pluggable API. We need a way for an SRE or a devops-type person to come up with a scheduling policy that works for their workload and organisation, and the cluster scheduler has to be a platform to express this policy, rather than just a resource allocation or container placement API.
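Here is the minimal reclamation sketch promised above. It is not Borg’s algorithm; the estimator (recent peak usage padded by a safety margin) is just one plausible choice.

```python
def reclaimable(reservation_gb, recent_usage_gb, safety_margin=1.2):
    # Estimate the task's effective need as its recent peak usage, padded by a
    # safety margin; anything reserved beyond that can temporarily back
    # lower-priority, best-effort work and be returned if usage climbs again.
    if not recent_usage_gb:
        return 0.0
    estimated_need = max(recent_usage_gb) * safety_margin
    return max(0.0, reservation_gb - estimated_need)

# A task reserved 16 GB of memory but has only peaked at 6 GB recently:
print(reclaimable(16.0, [4.2, 5.1, 6.0]))   # ~8.8 GB of slack to lend out
```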
We’re tackling some of these challenges in Firmament: for example, its pluggable cost models make it easy to specify a scheduling policy in an intuitive way, and we’ve developed cost models that avoid co-location interference before it even happens. We also show that it’s possible to make optimal decisions for a particular scheduling policy very rapidly (in hundreds of milliseconds) even at large scale (tens of thousands of machines). Those are super exciting results, and you’ll certainly hear me talk more about them in the near future. But there’s really so much more to do that it will take many people and many bright ideas to really make it possible to have “Google’s infrastructure for everyone else”!