
Yellow Bricks

by Duncan Epping


vSphere

vSphere 7 and DRS Scalable Shares, how are they calculated?

Duncan Epping · Mar 16, 2020 ·

Last week I wrote a post and recorded a short demo explaining a cool new vSphere 7 / DRS feature called Scalable Shares. I didn’t want to go too deep in that post, but now that I am getting more questions about how this actually works, I figured I would provide some examples to explain it. As mentioned in my previous post, Scalable Shares solves a problem many have been facing over the last decade or so, which is that DRS does not take the number of VMs in a pool into account when it comes to allocating resources. So as an example:

Just imagine you have a resource pool called “Test” with its shares set to “normal” and 4 VMs in it. Let’s say this resource pool has 4000 shares.

Now compare that resource pool to the “Production” pool, which has shares set to “high” but has 24 VMs. Let’s say this resource pool has 8000 shares.

This is a very extreme example, but it shows you immediately what the problem is. On a per-VM basis, the VMs in the Production resource pool would receive far fewer resources when there’s contention, as DRS would simply divvy up the resources based on the shares assigned to the resource pools. Test would receive 1/3 of the resources (4000 of 12,000 total shares) and Production would receive 2/3 of the resources (8000 of 12,000 total shares). Divide that by the number of VMs in each pool (1000 shares per Test VM versus roughly 333 shares per Production VM) and it is obvious that the VMs in Test are in a better situation than the VMs in Production.

So how would this work with Scalable Shares enabled? Well, let’s list some facts first:

  • To DRS, a resource pool looks like a VM with 4 vCPUs and 16GB of memory
  • Scalable Shares looks at the total amount of shares in the Resource Pool (all vCPUs!)
  • For a resource pool, high is 8000 shares, normal is 4000 shares, and low is 2000 shares
    • Note that this is based on 4 vCPUs, so the per-vCPU values are 2000, 1000, and 500.

The calculation would be as follows:

Resource Pool Shares = (4 vCPUs * per-vCPU share value of the pool) * (total number of shares of all vCPUs in the resource pool)

So as an example, in the case where I have Test with normal shares and 4 VMs, and Production with high shares and 24 VMs, and all VMs have a single vCPU with normal share priority, the calculation for those two resource pools would be:

Test = (4 * 1000) * (4 * 1000) = 16,000,000 shares

Production = (4 * 2000) * (24 * 1000) = 192,000,000 shares
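
For those who want to reproduce the math, here is a minimal Python sketch of the calculation above. The per-vCPU values (2000/1000/500 for high/normal/low) and the assumption that every VM runs a single vCPU with normal share priority come straight from this example; it simply illustrates the formula and is not how DRS implements this internally.

# Per-vCPU share values for the resource pool setting and for the VMs inside it
POOL_SHARES_PER_VCPU = {"high": 2000, "normal": 1000, "low": 500}
VM_SHARES_PER_VCPU = {"high": 2000, "normal": 1000, "low": 500}

def scalable_pool_shares(pool_priority, vm_vcpu_counts, vm_priority="normal"):
    # To DRS a resource pool looks like a 4 vCPU VM, so the pool side of the
    # formula is 4 * the per-vCPU share value of the pool's share setting.
    pool_side = 4 * POOL_SHARES_PER_VCPU[pool_priority]
    # The other factor is the total number of shares of all VM vCPUs in the pool.
    vm_side = sum(vcpus * VM_SHARES_PER_VCPU[vm_priority] for vcpus in vm_vcpu_counts)
    return pool_side * vm_side

test = scalable_pool_shares("normal", [1] * 4)       # (4 * 1000) * (4 * 1000)  = 16,000,000
production = scalable_pool_shares("high", [1] * 24)  # (4 * 2000) * (24 * 1000) = 192,000,000
print(test, production, production // test)          # 16000000 192000000 12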

In other words, Production has 12 times the number of shares that Test has when Scalable Shares is enabled. I hope that clears things up!

vGPUs and vMotion, why the long stun times?

Duncan Epping · Feb 7, 2020 ·

Last week one of our engineers shared something which I found very interesting. I have been playing with Virtual Reality technology and NVIDIA vGPUs for 2 months now. One thing I noticed is that we (VMware) introduced support for vMotion of vGPU-enabled VMs in vSphere 6.7, and support for vMotion of multi vGPU VMs in vSphere 6.7 U3. In order to enable this, you need to set an advanced setting first; William Lam described in his blog how to set this via PowerShell or the UI. Now, when you read the documentation, one thing stands out, and that is the relatively high stun times for vGPU-enabled VMs. Just as an example, here are a few potential stun times for various vGPU frame buffer sizes:

  • 2GB – 16.5 seconds
  • 8GB – 61.3 seconds
  • 16GB – 100+ seconds (time out!)

This is all documented here for the various frame buffer sizes. Now there are a couple of things to know about this. First of all, the times mentioned were measured with the NVIDIA P40; this could be different for an RTX6000 or RTX8000, for instance. Secondly, they used a single 10GbE NIC; if you use multi-NIC vMotion or, for instance, a 25GbE NIC, then results may be different (times should be lower). But more importantly, the times mentioned assume the full frame buffer memory is consumed. If you have a 16GB frame buffer and only 2GB is consumed then, of course, the stun time would be lower than the above-mentioned 100+ seconds.

Now, this doesn’t answer the question yet: why? Why on earth are these stun times this long? The vMotion process is described in-depth in this blog post by Niels, so I am not going to repeat it. It is also described in our Clustering Deep Dive book, which you can download here for free. The key reason the “down time” (stun time) with vMotion can be kept low is that vMotion uses a pre-copy process and tracks which memory pages have changed. In other words, when vMotion is initiated we copy memory pages to the destination host, and if a page changes during that copy process we mark it as changed and copy it again. vMotion does this until the amount of memory that still needs to be copied is extremely low, which results in a seamless migration. Now here is the problem: it does this for VM memory, but unfortunately this isn’t possible for vGPU frame buffer memory today.

Okay, so what does that mean? Well, if you have a 16GB frame buffer and it is 100% consumed, the vMotion process will need to copy 16GB of frame buffer memory from the source to the destination host while the VM is stunned. Why while the VM is stunned? Well, simply because that is the point in time where the frame buffer memory will not change! Hence the reason this can unfortunately take a significant number of seconds today. Definitely something to consider when planning on using vMotion on (multi) vGPU enabled VMs!
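
To get a feel for why the frame buffer size matters so much, here is a rough back-of-envelope sketch in Python. It only models the raw wire transfer of the consumed frame buffer over the vMotion network, with an assumed 90% link efficiency; the documented stun times above are considerably higher than this floor (the frame buffer also has to be read out of the GPU), so treat it purely as a way to reason about the scaling, not as a prediction.

def stun_time_floor_seconds(consumed_frame_buffer_gb, nic_gbit_per_sec=10.0, efficiency=0.9):
    # Lower bound only: the time needed to push the consumed frame buffer over
    # the vMotion NIC while the VM is stunned. The documented stun times above
    # are a lot higher, since the frame buffer also has to be read from the GPU.
    throughput_gb_per_sec = nic_gbit_per_sec / 8.0 * efficiency
    return consumed_frame_buffer_gb / throughput_gb_per_sec

for gb in (2, 8, 16):
    print(f"{gb} GB consumed -> stunned for at least {stun_time_floor_seconds(gb):.1f} seconds over 10GbE")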

Disabling the frame rate limiter for your vGPU

Duncan Epping · Feb 4, 2020 ·

I have been testing Virtual Reality apps within a VM for the past few days, leveraging NVIDIA vGPU technology on vSphere 6.7 U3. I was running into some undesired behavior and was pointed to the fact that this could be due to the frame rate being limited by default (thanks Ben!). I first checked the head-mounted display to see what frame rate it was running at; by leveraging “adbLink” (for Mac) and the logcat command I could see the current frame rate hovering between 55 and 60. For virtual reality apps that leads to problems when moving your head from left to right, as you will see black screens. For those wanting to play around with this on the Quest as well, I used the following command to list the current frame rate for the NVIDIA CloudXR application (note that this is specific to this app), where “-s” filters for the keyword “VrApi”:

logcat -s VrApi

The result will be a full string, but the important bit is the following:

FPS=72,Prd=32ms,Tear=0,Early=0
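
If you want to keep an eye on this value during a longer test run, a small script can tail the VrApi log lines and pull out the FPS field. This is just a convenience sketch of my own, assuming adb is installed on the machine you run it from and the headset is connected through adb:

import re
import subprocess

# Stream the VrApi log lines from the headset and print the reported frame rate.
proc = subprocess.Popen(["adb", "logcat", "-s", "VrApi"], stdout=subprocess.PIPE, text=True)
for line in proc.stdout:
    match = re.search(r"FPS=(\d+)", line)
    if match:
        print(f"reported frame rate: {match.group(1)}")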

I was digging through the NVIDIA documentation and it mentioned that if you use the Best Effort scheduler a frame rate limit is applied. I wanted to test with the different schedulers anyway, so I switched over to the Equal Share scheduler, which solved the problems listed above as it indeed disables the frame rate limit. I could see the frame rate going up to between 70 and 72. Of course, I also wanted to validate the Best Effort scheduler with the frame rate limiter disabled, which I did by adding an advanced setting to the VM:

pciPassthru0.cfg.frame_rate_limiter=0

This also resulted in a better experience, with the frame rate again going up to 70-72.

Can I still provision VMs when a vSAN Stretched Cluster site has failed? Part II

Duncan Epping · Dec 18, 2019 ·

3 years ago I wrote the following post: Can I still provision VMs when a VSAN Stretched Cluster site has failed? Last week I received a question on this subject, and although officially I am not supposed to work on vSAN in the upcoming three months, I figured I could easily test this in the evening within 30 minutes. The question was simple: in my blog I described the failure of the Witness Host, but what happens if a single host fails in one of the two “data” fault domains? And if I want to create a snapshot, for instance, will this still work?

So here’s what I tested:

  • vSAN Stretched Cluster
  • 4+4+1 configuration
    • Meaning, 4 hosts in each “data site” plus a witness host, so 8 data hosts in my vSAN cluster
  • Create a VM with cross-site protection and RAID-5 within the location

So I first failed a host in one of the two data sites. With that host failed, the following is what happens when I create a VM with RAID-1 across sites and RAID-5 within a site:

  • Without “Force Provisioning” enabled the creation of the VM fails
  • When “Force Provisioning” is enabled the creation of the VM succeeds, and the VM is created with a RAID-0 within 1 location

Okay, so this sounds similar to the scenario originally described in my 2016 blog post, where I failed the witness: vSAN will create a RAID-0 configuration for the VM. When the host returns for duty, the RAID-1 across locations and RAID-5 within each location are then automatically created. On top of that, you can snapshot VMs in this scenario; the snapshots will also be created as RAID-0. One thing to keep in mind: I would recommend removing “force provisioning” from the policy after the failure has been resolved! Below is a screenshot of the component layout of this scenario, by the way.

I also retried the witness-host-down scenario, and in that case you do not need to use the “force provisioning” option. One more thing to note: the above will only happen when you request a RAID configuration that is impossible to create as a result of the failure. If 1 host fails in a 4+4+1 stretched cluster and you want to create a RAID-1 across sites and a RAID-1 within sites, then the VM will be created with the requested RAID configuration, as demonstrated in the screenshot below.

Project nanoEDGE aka tiny vSphere/vSAN supported configurations!

Duncan Epping · Oct 10, 2019 ·

A few weeks ago VMware announced Project nanoEDGE on the Virtual Blocks blog. I had a whole bunch of questions in the following days from customers and partners interested in understanding what it is and what it does. I personally prefer to call Project nanoEDGE “a recipe”. The recipe states which configuration would be supported for both vSAN as well as vSphere. Let’s be clear, this is not a tiny version of VxRail or VMware Cloud Foundation; this is a hardware recipe that should help customers deploy tiny supported configurations to thousands of locations around the world.

Project nanoEDGE is a project by VMware principal system engineer Simon Richardson. The funny thing is that right around the time Simon started discussing this with customers to see if there would be interest in something like this, I had similar discussions within the vSAN organization. When Simon mentioned he was going to work on this project with support from the VMware OCTO organization I was thrilled. I personally believe there’s a huge market for this. I have had dozens of conversations over the years with customers who have 1000s of locations and are currently running single-node solutions. Many of those customers need to deliver new IT services to these locations and the requirements for those services have changed as well in terms of availability, which makes it a perfect play for vSAN and vSphere (with HA).

So first of all, what would nanoEDGE look like?

As you can see, these are tiny “desktop-like” boxes. These boxes are the Supermicro E300-9D and they come in various flavors. The recipe currently describes the solution as 2 full vSAN servers and 1 host used as the vSAN Witness for the 2-node configuration. Of course, you could also run the witness remotely, or even throw in a switch and go with a 3-node configuration. The important part here is that all used components are on both the vSphere as well as the vSAN compatibility guide! The benefit of the 2-node approach is that you can use cross-over cables between the vSAN hosts and as a result avoid the cost of a 10GbE switch! So what is in the box? The bill of materials is currently as follows:

  • 3x Supermicro E300-9D-8CN8TP
    • The box comes with 4x 1GbE NIC ports and 2x 10GbE NIC ports
    • 10GbE can be used for direct connect
    • It has an Intel® Xeon® D-2146NT processor – 8 cores
  • 6x 64GB RAM
  • 3x PCIe Riser Card (RSC-RR1U-E8)
  • 3x PCIe M.2 NVMe Add-on Card (AOC-SLG3-2M2)
  • 3x Capacity Tier – Intel M.2 NVMe P4511 1TB
  • 3x Cache Tier – Intel M.2 NVMe P4801 375GB
  • 3x Supermicro SATADOM 64GB
  • 1x Managed 1GbE Switch

From a software point of view, the paper lists that they tested with 6.7 U2, but of course, if the hardware is on the VCG for 6.7 U3 then it will also be supported to run that configuration. The team also did some performance tests, and they showed some pretty compelling numbers (40,000+ read IOPS and close to 20,000 write IOPS), especially when you consider that these types of configurations would usually run 15-20 VMs in total. One thing I do want to add: the bill of materials lists M.2 form factor flash devices, which allows nanoEDGE to avoid the use of the internal unsupported AHCI disk controller; this is key in the hardware configuration! Do note that in order to fit two M.2 devices in this tiny box, you will also need to order the listed PCIe riser card and the M.2 NVMe add-on card. William Lam has a nice article on this subject, by the way.

There are many other options on the vSAN HCL for both caching as well as capacity, so if you prefer to use a different device, make sure it is listed here.

I would recommend reading the paper, and if you have an interest in this solution please reach out to your local VMware representative for more detail/help.

