At VMworld, various cool new technologies were previewed. In this series of articles, I will write about some of those previewed technologies. Unfortunately, I can’t cover them all as there are simply too many. This article is about DRS 2.0, which was session HBI2880BU. For those who want to see the session, you can find it here. This session was presented by Adarsh Jagadeeshwaran and Sai Inabattini. Please note that this is a summary of a session which is discussing a Technical Preview, this feature/product may never be released, and this preview does not represent a commitment of any kind, and this feature (or it’s functionality) is subject to change. Now let’s dive into it, what is DRS 2.0 all about?
The session started with an intro, DRS was first introduced in 2006. Since then datacenters, and workloads (cloud-native architectures), have changed a lot. DRS, however, has remained largely the same over the past 10 years. What we need is a resource management engine which is more workload-centric than it is cluster-centric, that is why we are planning on introducing DRS 2.0
What has changed? In general, the changes can be placed in 3 categories:
- New cost-benefit model
- Support for new resources and devices
- Faster and scalable
Let’s start with the cost-benefit model. First of all, it is essential to understand the algorithm and the DRS logic has changed. if you look at the slides below, hopefully, it makes it clear that DRS 2.0 is VM centric. Where 1.0 looks at the cluster state first, 2.0 focusses on the VM immediately. This makes a big difference because it can now potentially improve VM resource happiness sooner.
So now that we mentioned VM Happiness, what is this exactly? When computing the VM Happiness Score, DRS looks at between 10-15 metrics, the core metrics, however, are Host CPU Cache Cost, VM CPU Ready Time, VM Memory Swapped and Workload Burstiness. The score will be shown going forward in the UI as well on a per VM basis. If a migration is initiated, the key is to improve the VM Happiness score for that workload. Of course, you can also still see the cluster aggregated score. The VM Happiness score allows us to enables us to have a better balance while keeping the workloads happy.
Another improvement that is made is the frequency that DRS runs. DRS 2.0 will run every minute, instead of every 5 minutes. This means that if you have very dynamic workloads, DRS can simply respond faster. This has been made possible by remove the “cluster snapshot” mechanism that DRS 1.0 used, this was basically the limiting factor for DRS. Removing the cluster snapshot mechanism also means DRS 2.0 will use less memory and can work with a higher number of objects.
What I also found very interesting is that with DRS 2.0 you also now have the ability to change the VM demand interval, meaning that you have the ability to specify that the interval of default 15 minutes needs to be 40 minutes for instance. The benefit of increasing it from 15 to 40 would be the fact that spikes are averaged out. You can also decrease it down to 5 minutes if you want DRS to respond to spikey workloads. Pretty smart.
Another feature which is introduced in DRS 2.0 is the ability to do proper Network Load Balancing. Yes we had an option in the past to load balance on network load, but it was never a first-class citizen. It was a secondary metric which was only looked at after CPU and memory were considered. So if there was no need for balancing from a CPU or memory point of view, then the network utilization would not even be looked at. With 2.0 network is a primary metric, which means that VMs can be (and will be) migrated to balance networking utilization as shown in the screenshot below.
The cost-benefit model is also enhanced to include Network and PMEM. DRS 2.0 is also hardware aware, with that meaning that if you use vGPUs DRS 2.0 will take that into consideration for placement or balancing. On top of that, the cost-benefit model will also take workload behavior in to account (stable/unstable resource consumption), simply to avoid ping-pong’ing of VMs. Another useful feature that is added is that the cost-benefit model will take the benefit of a migration in to account when it comes to sustainability, how long will the benefit sustain is what it will look for. Also, something which Frank has discussed during various sessions when it comes to memory, DRS 2.0 no longer takes active memory into account as the world has changed. Most customers do not over-commit on memory as they used to in the past. this is why DRS 2.0 takes “granted memory” into account. This will result in significantly less vMotion’s.
The last thing that was discussed by Sai was the new Migration Threshold option in DRS 2.0. Again this is workload focussed and not cluster centric.
- Level 1 – No load balancing, only mandatory moves when rules are violated
- Level 2 – Highly stable workloads
- Level 3 – Stable workloads, focus on improving VM happiness (Default level)
- Level 4 – Bursty workloads
- Level 5 – Dynamic workloads
Adarsh came up next and he started with showing the performance benefits of DRS 2.0 using TPCx-HCI. What was clear is that with DRS 2.0 the performance improved compared to 1.0. This resulted in a 5-10% performance increase for VMs on average, which I feel is huge. Next Adarsh discussed troubleshooting, he discussed how using VM Level Automation could result in different behavior than before. In the past, DRS would consider all VMs when load balancing, in 2.0 DRS will only consider VM with the same automation level (or higher). DRS 2.0 may also cause an increase in migrations, but this can be controlled by changing the migration threshold for DRS. Also, DRS will also only do 1 migration between a specific source-destination pair.
When will it be available? Well, it already is available in VMware Cloud on AWS, it has been enabled for over a year without any issues. It will become available in a future release, dates and version numbers, of course, were not discussed. If you would like to move back to the old behavior, there will be an advanced setting (FastLoadBalance=0) to switch back. If you are interested in learning more, make sure to watch the recording, there’s a lot of good stuff in there.