Yellow Bricks

vSphere 7 and DRS Scalable Shares, how are they calculated?

Duncan Epping · Mar 16, 2020 ·

I wrote a post and recorded a short demo that explained this cool new feature called Scalable Shares, part of vSphere 7 / DRS, last week. I didn’t want to go too deep in the post, but now that I am getting more questions about how this actually works, I figured I would provide some examples to explain it. As mentioned in my previous post, Scalable Shares solves a problem many have been facing over the last decade or so, which is that DRS does not take the number of VMs in the pool into account when it comes to allocating resources. So as an example:

Just imagine you have a resource pool called “Test”, it is a resource pool with “normal” shares and has 4 VMs. Let’s say this resource pool has 4000 shares.

Now compare that resource pool to the “Production” pool, which has shares set to “high” but has 24 VMs. Let’s say this resource pool has 8000 shares.

This is a very extreme example, but it shows you immediately what the problem is. On a per-VM basis, the VMs in the Production resource pool would receive far fewer resources when there’s contention, as DRS would simply divvy up the resources based on the shares assigned to each resource pool. Test would receive 1/3 of the resources (4000 of 12000 total shares), and Production would receive 2/3 of the resources (8000 of 12000 total shares). If you divide that by the number of VMs in each pool (1/12 per VM in Test versus 1/36 per VM in Production), it is obvious that the VMs in Test are in a better situation than the VMs in Production.
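To make the arithmetic concrete, here is a minimal Python sketch of the traditional (non-scalable) split. It assumes every VM in a pool has identical shares and is actively demanding resources; the function name per_vm_fraction is made up for this illustration and is not part of any VMware tooling.

```python
from fractions import Fraction


def per_vm_fraction(pools):
    """Return the fraction of contended resources each VM is entitled to.

    pools maps a pool name to (pool_shares, vm_count). Assumption for this
    sketch: all VMs inside a pool have equal shares and are all busy.
    """
    total_shares = sum(shares for shares, _ in pools.values())
    return {
        name: Fraction(shares, total_shares) / vm_count
        for name, (shares, vm_count) in pools.items()
    }


# The example from the text: Test (normal, 4 VMs) and Production (high, 24 VMs).
pools = {"Test": (4000, 4), "Production": (8000, 24)}
result = per_vm_fraction(pools)

for name, fraction in result.items():
    print(f"{name}: each VM is entitled to {fraction} of the resources")
```

With these inputs a VM in Test is entitled to three times as much as a VM in Production, even though Production is the pool with “high” shares.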

So how would this work if you have Scalable Shares enabled? Well, let’s list some facts first:

A resource pool looks like a VM with 4 vCPUs and 16GB of memory to DRS
Scalable Shares looks at the total amount of shares in the Resource Pool (all vCPUs!)
For a resource pool high is 8000 shares, normal is 4000 shares and low is 2000 shares
- Note that this is based on 4 vCPUs, so the per-vCPU values are 2000, 1000 and 500.

The calculation would be as follows:

Resource Pool Shares = (4 vCPUs * Shares of Pool) * (Total number of shares of all vCPUs in resource pool)

So as an example, take Test with normal shares and 4 VMs, and Production with high shares and 24 VMs, where all VMs have a single vCPU with normal priority. The calculation for those two resource pools would be:

Test = (4 * 1000) * (4 * 1000) = 16,000,000 shares

Production = (4 * 2000) * (24 * 1000) = 192,000,000 shares

In other words, Production has 12 times as many shares as Test when Scalable Shares is enabled. I hope that clears things up!
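For those who prefer code over formulas, the calculation above can be sketched in a few lines of Python. This is only an illustration of the formula as described in this post, not the actual DRS implementation; the function and constant names are made up for the example.

```python
from fractions import Fraction

# Per-vCPU share values for high / normal / low, as listed above.
PER_VCPU_SHARES = {"high": 2000, "normal": 1000, "low": 500}

# To DRS a resource pool looks like a VM with 4 vCPUs.
POOL_VCPUS = 4


def scalable_pool_shares(pool_level, vms):
    """Apply the formula from the text.

    vms is a list of (vcpu_count, share_level) tuples, one per VM in the pool.
    """
    pool_part = POOL_VCPUS * PER_VCPU_SHARES[pool_level]
    vm_part = sum(vcpus * PER_VCPU_SHARES[level] for vcpus, level in vms)
    return pool_part * vm_part


test = scalable_pool_shares("normal", [(1, "normal")] * 4)
production = scalable_pool_shares("high", [(1, "normal")] * 24)

# Fraction of the contended resources each individual VM ends up with.
total = test + production
per_vm_test = Fraction(test, total) / 4
per_vm_production = Fraction(production, total) / 24

print(test, production, production // test)
print(per_vm_test, per_vm_production)
```

Dividing each pool’s portion by its number of VMs shows the effect on a per-VM basis: a VM in Production now ends up with twice the entitlement of a VM in Test, which matches the 2:1 ratio between high and normal.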

Introducing Scalable Shares – vSphere 7

Duncan Epping · Mar 12, 2020 ·

In early 2015, Frank Denneman and I had a discussion during a flight to San Francisco. We came up with this concept for Resource Pools where the number of shares would be determined by the number of VMs and the priority of the pool. In other words, we wanted to avoid the dilution of shares in an environment with resource pools and basically solve the resource pool pie paradox problem described here. We worked with the DRS team on describing the concept, filed a patent for it, and got the patent granted in 2019. Today I am happy to share that the feature made it into a release and will be part of vSphere 7.0.

Scalable Shares is a feature that is part of DRS. You either enable it on the Cluster or you enable it on the Resource Pool. Personally, I would always enable it at the Resource Pool level. So how does it work? Well, normally when you create a resource pool with High priority and one with Normal priority, the RP with High priority will have 8000 shares and the RP with Normal priority will get 4000 shares. In other words, a 2:1 ratio between High and Normal. Now if you have 8 VMs in High and 1 in Normal, you can imagine that this single VM in Normal will get more resources than each of the 8 VMs in High when there is contention, simply because the 8 VMs share the resources of their pool while the single VM has its pool to itself.

When Scalable Shares is enabled, DRS does a calculation using the ratio (4:2:1 – High:Normal:Low) and the number of VMs. In other words, 8 VMs with 1000 shares each in High would be multiplied by 4, and 1 VM with 1000 shares in Normal would be multiplied by 2. The result being:

Normal = (1 * 1000) * 2 = 2000
High = (8 * 1000) * 4 = 32000
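The same calculation as a small Python sketch, for those who want to play with the numbers. The 4:2:1 multipliers come from the text above; the function name and the per-VM comparison at the end are additions for illustration, assuming every VM is equally busy.

```python
from fractions import Fraction

# Ratio used by Scalable Shares: High:Normal:Low = 4:2:1
MULTIPLIER = {"high": 4, "normal": 2, "low": 1}


def scaled_shares(pool_level, vm_shares):
    """Sum the shares of the VMs in the pool and scale by the pool priority."""
    return sum(vm_shares) * MULTIPLIER[pool_level]


normal = scaled_shares("normal", [1000])      # 1 VM with 1000 shares
high = scaled_shares("high", [1000] * 8)      # 8 VMs with 1000 shares each

# Per-VM entitlement under contention.
total = normal + high
per_vm_normal = Fraction(normal, total) / 1
per_vm_high = Fraction(high, total) / 8

print(normal, high)
print(per_vm_normal, per_vm_high)
```

Each VM in High ends up with twice the entitlement of the VM in Normal, regardless of how many VMs are placed in either pool.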

I hope that makes a bit of sense. It took me a while to fully grasp. If you are wondering what this looks like in the product, and what the impact could/would be when you switch from “traditional” to “scalable shares”, I created a demo that shows this below.

Last day of my Take 3 with the @ProjectVXR team!

Duncan Epping · Feb 28, 2020 ·

Today is the last day of my Take 3 with the Project VXR team. Ridiculous how fast these 3 months went. It seems like only yesterday that I posted I was going to start this journey of learning more about the world of Spatial Computing and Remote Rendering of VR in particular. If I say so myself, I feel that over the past 3 months I managed to accomplish quite a lot. I spent the last three months figuring out how to virtualize a Virtual Reality application and how to stream the application to a head-mounted display. I tested different solutions in this space, some of which I discussed on my blog in the past months, and I was surprised how smoothly it worked, to be honest. If you are interested in this space, my recommendation would be to look into NVIDIA CloudXR in combination with vSphere 6.7 U3 and NVIDIA vGPU technology.

I can’t share all the details just yet. I wrote a white paper, which now needs to go through reviews and copy editing, and hopefully it will be published soon. I will, however, discuss some of my findings and my experience during some of the upcoming VMUGs I will be presenting at. Hopefully, people will enjoy it and appreciate it.

One thing I would like to do is thank a few people who helped me tremendously in the past few months. First of all, of course, the folks on the Project VXR team, who gave me all the pointers/hints/tips I needed and shared a wealth of knowledge on the topic of spatial computing. I would also like to thank Grid Factory, Ben in particular, for the many discussions, emails, etc. that we had. Of course, thanks also to NVIDIA for the discussions and help around the lab equipment. Last but not least, I want to thank the VMware OCTO team focussing on Dell Technologies for providing me with a Dell Precision Workstation, and shipping it out literally within a day or two. Much appreciated everyone!

Now it is time to get back to reality.

NVIDIA rendering issues? Look at the stats!

Duncan Epping · Feb 14, 2020 ·

I’ve been down in the lab for the last week doing performance testing with virtual reality workloads and streaming these over Wi-Fi to an Oculus Quest headset. In order to render the graphics remotely, we leveraged NVIDIA GPU technology (RTX 8000 in my case here in the lab). We have been getting very impressive results, but of course at some point hit a bottleneck. We tried to figure out which part of the stack was causing the problem and started looking at the various NVIDIA stats through nvidia-smi. We figured the bottleneck would be the GPU, so we looked at things like GPU utilization, FPS, etc. Funnily enough, this wasn’t really showing a problem.

We then started looking at different angles, and there are two useful commands I would like to share. I am sure the GPU experts will be aware of these, but for those who don’t consider themselves experts (like myself) it is good to know where to find the right stats. While whiteboarding the solution and the different parts of the stack, we realized that GPU utilization wasn’t the problem, and neither was the frame buffer size. But somehow we did see FPS (frames per second) drop and we did see encoder latency go up.

First I wanted to understand how many encoding sessions were actively running. This is very easy to find out by using the following command. The screenshot below shows the output of it.

nvidia-smi encodersessions

As you can see, this shows 3 encoding sessions: one H.264 session and two H.265 sessions. Now note that we have 1 headset connected at this point, but it leads to three sessions. Why? Well, we need a VM to run the application, and the headset has two displays, which results in three sessions. We can, however, disable the Horizon session that is using the encoder, which would save some resources. I tested that, but the savings were minimal.

I can, of course, also look a bit closer at the encoder utilization. I used the following command for that. Note that I filter for the device I want to inspect which is the “-i <identifier>” part of the below command.

nvidia-smi dmon -i 00000000:3B:00.0

The above command provides the following output. The “enc” column is what was important to me, as that shows the utilization of the encoder, which with the above 3 sessions was hitting roughly 40% utilization, as shown below.
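If you want to track that “enc” column over time rather than eyeball it, a small parser helps. The Python sketch below works on captured dmon text. Note that the sample text and its values are made up for illustration, and the column layout is an assumption: it can differ between driver versions, which is why the code looks the column up by name in the header instead of using a fixed position.

```python
# Illustrative sample only: these values are invented, and the real column
# layout of "nvidia-smi dmon" may differ depending on the driver version.
SAMPLE = """\
# gpu   pwr gtemp mtemp    sm   mem   enc   dec  mclk  pclk
# Idx     W     C     C     %     %     %     %   MHz   MHz
    0    95    54     -    35    12    40     0  6500  1395
    0    97    55     -    36    13    41     0  6500  1395
"""


def encoder_utilization(dmon_text, column="enc"):
    """Return the values of one column from captured dmon output."""
    columns = None
    samples = []
    for line in dmon_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Header lines start with '#'; only the one listing the column
            # names contains the name we are looking for.
            names = line.lstrip("#").split()
            if column in names:
                columns = names
            continue
        fields = line.split()
        if columns is None or len(fields) != len(columns):
            continue  # skip anything that does not match the header
        value = fields[columns.index(column)]
        if value.isdigit():  # '-' means the metric is not available
            samples.append(int(value))
    return samples


values = encoder_utilization(SAMPLE)
print(values, sum(values) / len(values))
```

In practice you would feed it the captured output of the nvidia-smi dmon command shown above instead of the sample string.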

How did I solve the problem of the encoder bottleneck in the end? Well, I didn’t. The only way around that is having a good understanding of your workload and proper capacity planning. Do I need an NVIDIA RTX 6000 or 8000? Or is there a different card with more encoding power like the V100 that makes more sense? Figuring out the cost, performance and the trade-off here is key.

Two more weeks until the end of my Take 3 experience, and what a ride it has been. If you work for VMware and have been with the company for 5 years… Consider doing a Take 3, it is just awesome!

vGPUs and vMotion, why the long stun times?

Duncan Epping · Feb 7, 2020 ·

Last week one of our engineers shared something which I found very interesting. I have been playing with Virtual Reality technology and NVIDIA vGPUs for 2 months now. One thing I noticed is that we (VMware) introduced support for vMotion of vGPU-enabled VMs in vSphere 6.7 and support for vMotion of multi vGPU VMs in vSphere 6.7 U3. In order to enable this, you need to set an advanced setting first. William Lam described in his blog how to set this via PowerShell or the UI. Now when you read the documentation there’s one thing that stands out, and that is the relatively high stun times for vGPU-enabled VMs. Just as an example, here are a few potential stun times with various sized vGPU frame buffers:

2GB – 16.5 seconds
8GB – 61.3 seconds
16GB – 100+ seconds (time out!)

This is all documented here for the various frame buffer sizes. Now there are a couple of things to know about this. First of all, the times mentioned were measured with the NVIDIA P40. This could be different for an RTX6000 or RTX8000 for instance. Secondly, they used a 10GbE NIC. If you use multi-NIC vMotion or for instance a 25GbE NIC then results may be different (times should be lower). But more importantly, the times mentioned assume the full frame buffer memory is consumed. If you have a 16GB frame buffer and only 2GB is consumed then, of course, the stun time would be lower than the above mentioned 100+ seconds.

Now, this doesn’t answer the question yet: why? Why on earth are these stun times this long? The vMotion process is described in depth in this blog post by Niels, so I am not going to repeat it. It is also described in our Clustering Deep Dive book which you can download here for free. The key reason why with vMotion the “down time” (stun time) can be kept low is that vMotion uses a pre-copy process and tracks which memory pages are changed. In other words, when vMotion is initiated we copy memory pages to the destination host, and if a page has changed during that copy process we mark it as changed and copy it again. vMotion does this until the amount of memory that still needs to be copied is extremely low, and this results in a seamless migration. Now here is the problem: it does this for VM memory. Unfortunately, this isn’t possible for vGPUs today.
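To get a feel for why pre-copy keeps the stun time so low, here is a toy model in Python. To be clear, this is not how vMotion is implemented; it is a simplified simulation of the idea described above, and the memory size, dirty rate, bandwidth and stop threshold are all made-up numbers.

```python
def precopy_rounds(memory_mb, dirty_mb_per_s, bandwidth_mb_per_s,
                   stop_threshold_mb=50, max_rounds=30):
    """Simulate iterative pre-copy with a constant dirty rate.

    Each round copies what is still outstanding while the VM keeps running;
    whatever the VM changes during that round has to be copied again.
    Returns (rounds, remaining_mb, stun_seconds).
    """
    remaining = float(memory_mb)
    rounds = []
    for _ in range(max_rounds):
        if remaining <= stop_threshold_mb:
            break
        copy_time = remaining / bandwidth_mb_per_s
        rounds.append((remaining, copy_time))
        # Memory changed while this round was being copied.
        remaining = min(float(memory_mb), dirty_mb_per_s * copy_time)
    # Only the small remainder is copied while the VM is stunned.
    stun_seconds = remaining / bandwidth_mb_per_s
    return rounds, remaining, stun_seconds


rounds, remaining, stun = precopy_rounds(
    memory_mb=16384, dirty_mb_per_s=100, bandwidth_mb_per_s=1000)

for number, (size, seconds) in enumerate(rounds, start=1):
    print(f"round {number}: copied {size:.1f} MB in {seconds:.3f} s")
print(f"copied during stun: {remaining:.1f} MB in {stun:.4f} s")
```

In this model each round only has to copy a tenth of the previous one, so after three rounds only a few MB are left for the stun phase. Note that it only converges when the dirty rate is lower than the available bandwidth.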

Okay, so what does that mean? Well, if you have a 16GB frame buffer and it is 100% consumed, the vMotion process will need to copy 16GB of frame buffer memory from the source to the destination host while the VM is stunned. Why while the VM is stunned? Well, simply because that is the point in time where the frame buffer memory will not change! This is why the stun can take a significant number of seconds today. Definitely something to consider when planning on using vMotion on (multi) vGPU-enabled VMs!
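As a rough back-of-the-envelope check, the Python sketch below calculates how long it takes just to push a given amount of frame buffer over the wire. It is an idealized lower bound under simple assumptions (decimal gigabytes, a fully utilized link, no protocol or device overhead), and the function name is made up for the example. The documented stun times listed earlier are considerably higher, so do not treat this as a prediction.

```python
def min_copy_seconds(consumed_gb, link_gbit_per_s):
    """Idealized time to transfer consumed_gb gigabytes over the link.

    Assumptions: 1 GB = 8 gigabits, the link is fully utilized, and there
    is no overhead of any kind. Real-world times will be higher.
    """
    return consumed_gb * 8 / link_gbit_per_s


for consumed_gb in (2, 8, 16):
    for link in (10, 25):
        seconds = min_copy_seconds(consumed_gb, link)
        print(f"{consumed_gb:>2} GB over {link} GbE: at least {seconds:.2f} s")
```

It does illustrate the two levers mentioned above: the less frame buffer is consumed and the faster the vMotion network, the less time is spent copying while the VM is stunned.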