VMworld Reveals: DRS 2.0 (#HBI2880BU)

At VMworld, various cool new technologies were previewed. In this series of articles, I will write about some of those previewed technologies. Unfortunately, I can’t cover them all as there are simply too many. This article is about DRS 2.0, which was session HBI2880BU. For those who want to see the session, you can find it here. This session was presented by Adarsh Jagadeeshwaran and Sai Inabattini. Please note that this is a summary of a session which is discussing a Technical Preview, this feature/product may never be released, and this preview does not represent a commitment of any kind, and this feature (or it’s functionality) is subject to change. Now let’s dive into it, what is DRS 2.0 all about?

The session started with an intro, DRS was first introduced in 2006. Since then datacenters, and workloads (cloud-native architectures), have changed a lot. DRS, however, has remained largely the same over the past 10 years. What we need is a resource management engine which is more workload-centric than it is cluster-centric, that is why we are planning on introducing DRS 2.0

What has changed? In general, the changes can be placed in 3 categories:

New cost-benefit model
Support for new resources and devices
Faster and scalable

Let’s start with the cost-benefit model. First of all, it is essential to understand the algorithm and the DRS logic has changed. if you look at the slides below, hopefully, it makes it clear that DRS 2.0 is VM centric. Where 1.0 looks at the cluster state first, 2.0 focusses on the VM immediately. This makes a big difference because it can now potentially improve VM resource happiness sooner.

So now that we mentioned VM Happiness, what is this exactly? When computing the VM Happiness Score, DRS looks at between 10-15 metrics, the core metrics, however, are Host CPU Cache Cost, VM CPU Ready Time, VM Memory Swapped and Workload Burstiness. The score will be shown going forward in the UI as well on a per VM basis. If a migration is initiated, the key is to improve the VM Happiness score for that workload. Of course, you can also still see the cluster aggregated score. The VM Happiness score allows us to enables us to have a better balance while keeping the workloads happy.

Another improvement that is made is the frequency that DRS runs. DRS 2.0 will run every minute, instead of every 5 minutes. This means that if you have very dynamic workloads, DRS can simply respond faster. This has been made possible by remove the “cluster snapshot” mechanism that DRS 1.0 used, this was basically the limiting factor for DRS. Removing the cluster snapshot mechanism also means DRS 2.0 will use less memory and can work with a higher number of objects.

What I also found very interesting is that with DRS 2.0 you also now have the ability to change the VM demand interval, meaning that you have the ability to specify that the interval of default 15 minutes needs to be 40 minutes for instance. The benefit of increasing it from 15 to 40 would be the fact that spikes are averaged out. You can also decrease it down to 5 minutes if you want DRS to respond to spikey workloads. Pretty smart.

Another feature which is introduced in DRS 2.0 is the ability to do proper Network Load Balancing. Yes we had an option in the past to load balance on network load, but it was never a first-class citizen. It was a secondary metric which was only looked at after CPU and memory were considered. So if there was no need for balancing from a CPU or memory point of view, then the network utilization would not even be looked at. With 2.0 network is a primary metric, which means that VMs can be (and will be) migrated to balance networking utilization as shown in the screenshot below.

The cost-benefit model is also enhanced to include Network and PMEM. DRS 2.0 is also hardware aware, with that meaning that if you use vGPUs DRS 2.0 will take that into consideration for placement or balancing. On top of that, the cost-benefit model will also take workload behavior in to account (stable/unstable resource consumption), simply to avoid ping-pong’ing of VMs. Another useful feature that is added is that the cost-benefit model will take the benefit of a migration in to account when it comes to sustainability, how long will the benefit sustain is what it will look for. Also, something which Frank has discussed during various sessions when it comes to memory, DRS 2.0 no longer takes active memory into account as the world has changed. Most customers do not over-commit on memory as they used to in the past. this is why DRS 2.0 takes “granted memory” into account. This will result in significantly less vMotion’s.

The last thing that was discussed by Sai was the new Migration Threshold option in DRS 2.0. Again this is workload focussed and not cluster centric.

Level 1 – No load balancing, only mandatory moves when rules are violated
Level 2 – Highly stable workloads
Level 3 – Stable workloads, focus on improving VM happiness (Default level)
Level 4 – Bursty workloads
Level 5 – Dynamic workloads

Adarsh came up next and he started with showing the performance benefits of DRS 2.0 using TPCx-HCI. What was clear is that with DRS 2.0 the performance improved compared to 1.0. This resulted in a 5-10% performance increase for VMs on average, which I feel is huge. Next Adarsh discussed troubleshooting, he discussed how using VM Level Automation could result in different behavior than before. In the past, DRS would consider all VMs when load balancing, in 2.0 DRS will only consider VM with the same automation level (or higher). DRS 2.0 may also cause an increase in migrations, but this can be controlled by changing the migration threshold for DRS. Also, DRS will also only do 1 migration between a specific source-destination pair.

When will it be available? Well, it already is available in VMware Cloud on AWS, it has been enabled for over a year without any issues. It will become available in a future release, dates and version numbers, of course, were not discussed. If you would like to move back to the old behavior, there will be an advanced setting (FastLoadBalance=0) to switch back. If you are interested in learning more, make sure to watch the recording, there’s a lot of good stuff in there.

Comments

Михаил Соловьев says

3 September, 2019 at 10:47

Duncan, a little mistake – #HBI2880BU in a header
- Duncan Epping says
  
  3 September, 2019 at 13:26
  
  Thanks for pointing that out, typo indeed. should be fixed now.
Davoud Teimouri says

3 September, 2019 at 19:51

Network load balancing is very interesting also I guess, no one will move back to old behavior. Thanks for sharing it.
Fatih Solen says

4 September, 2019 at 18:03

Good news, thanks for sharing.
Zibi says

4 September, 2019 at 23:06

Hello Duncan
Very interesting read. I really would like to take this session on the Vmworld EU and discuss this.
I’d like to ask would there be a possibility to configure 2 things in this DRS:
1. VM Happiness target level for each particular VM – to give us the ability to setup something like performance classes and pick those VMs that are more equal than the others.
2. Migration frequency for each VM – something like “migrate it I don’t care, can be migrated, do not migrate”
My biggest grip with DRS so far was his fondness of taking the biggest VM it could find (usually critical production server) and vMotion it somewhere else, impacting the performance or worse.
I really would like to see in this new DRS the ability to differentiate between VMs or Namespaces
tronarCarlos says

5 September, 2019 at 17:50

Nice new feature. Is there any info on how the expected hapiness of a VM on a different host is calculated ? (i.e. how does DRS 2.0 decide where to the VM is migrated). Thanks!
- duncan@yellow-bricks says
  
  19 September, 2019 at 10:25
  
  there was some additional detail in the session, other than that I can’t share much at this point
Craig Tompkins says

18 September, 2019 at 13:34

Duncan, you say this is already running in VMC on AWS, is that because it’s part of vCenter 6.8?
- duncan@yellow-bricks says
  
  19 September, 2019 at 10:24
  
  I am not sure which version they are running in VMC right now, but yeah it could be that version. And yes it is part of vCenter Server.
Calvin So says

23 October, 2019 at 16:18

How about the happiness of those VMs currently reside on a particular host? Moving a big VM into that host will make it happy but all small VMs currently running on that host will become unhappy so how DRS 2.0 weight on that?
tayfundeger says

12 March, 2020 at 21:54

Thanks for sharing.
Gurdeep Singh Sandhu says

29 December, 2020 at 16:40

How DRS measures VM DRS score for a VM on a Host in actual?
- Duncan Epping says
  
  30 December, 2020 at 11:05
  
  details were published here: https://blogs.vmware.com/vsphere/2020/05/vsphere-7-a-closer-look-at-the-vm-drs-score.html

Related

Reader Interactions

Comments