What’s new for HA in vSphere 6.0?

Instead of one generic post with a bunch of data, I picked a couple of features and dug a little deeper. Today I will be discussing what is new for HA in vSphere 6.0. Let's start with a list and then look at the features / enhancements individually:

  • Support for Virtual Volumes – With Virtual Volumes a new type of storage entity is introduced in vSphere 6.0.
  • VM Component Protection – This allows HA to respond to a scenario where the connection to the virtual machine’s datastore is impacted temporarily or permanently.
    • “Response for Datastore with All Paths Down”
    • “Response for Datastore with Permanent Device Loss”
  • Increased scale – Cluster limit has grown from 32 to 64 hosts and to a max of 8000 VMs per cluster
  • Registration of “HA Disabled” VMs on hosts after failure

Let's start with support for Virtual Volumes. It may sound like this is a given, but as the whole concept of a VMFS volume no longer exists with Virtual Volumes, and VMs have "virtual volumes" instead of VMDKs, you can imagine that some work was needed to allow HA to restart virtual machines stored on a VVOL-enabled storage system.

VM Component Protection (VMCP) is in my opinion THE big thing that got added to vSphere HA. What this feature basically allows you to do is protect yourself against storage failures. There are two types of failures VMCP will respond to, and those are PDL and APD. Before we look at some of the details, I want to point out that configuring it is extremely simple… just one tickbox to enable it.
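For those who prefer to script it, below is a minimal pyVmomi sketch of flipping that same tickbox programmatically. The vCenter address, credentials and cluster name are placeholders, and the vmComponentProtecting property name is my assumption of how the vSphere 6.0 API exposes the setting, so treat this as a sketch rather than a definitive recipe.

```python
# Minimal sketch: enable VM Component Protection (VMCP) on an existing HA cluster.
# Placeholders/assumptions: vCenter address, credentials, cluster name "Cluster01",
# and the DasConfigInfo.vmComponentProtecting property ("enabled"/"disabled").
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab use only; verify certificates in production
si = SmartConnect(host="vcenter.local", user="administrator@vsphere.local",
                  pwd="VMware1!", sslContext=ctx)
content = si.RetrieveContent()

# Locate the cluster object by name
view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.ClusterComputeResource], True)
cluster = next(c for c in view.view if c.name == "Cluster01")
view.Destroy()

# Build a partial cluster spec that only touches the HA (das) configuration
spec = vim.cluster.ConfigSpecEx()
spec.dasConfig = vim.cluster.DasConfigInfo()
spec.dasConfig.vmComponentProtecting = "enabled"  # the "tickbox" (assumed property name)

cluster.ReconfigureComputeResource_Task(spec=spec, modify=True)
Disconnect(si)
```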


In the case of a PDL (permanent device loss), which is something HA was already capable of handling when configured through the command line, a VM will be restarted instantly when a PDL signal is issued by the storage system. For an APD (all paths down) this is a bit different. A PDL more or less indicates that the storage array does not expect the device to return any time soon. An APD is more of an unknown situation: it may return… it may not… and there is no clue how long it will take. With vSphere 5.1 some changes were introduced to the way APD is handled by the hypervisor, and this mechanism is leveraged by HA to allow for a response. (Cormac wrote an excellent post about this APD handling here.)

When an APD occurs a timer starts. After 140 seconds the APD timeout is declared and the device is marked as APD timed out. When the 140 seconds have passed, HA will start counting. The HA timeout is 3 minutes. When the 3 minutes have passed, HA can restart the virtual machine, but you can configure VMCP to respond differently if you want it to. You could, for instance, specify that only events are issued when a PDL or APD has occurred. You can also specify how aggressively HA needs to try to restart VMs that are impacted by an APD. Note that aggressive / conservative refers to the likelihood of HA being able to restart VMs. When set to "conservative", HA will only restart the VM that is impacted by the APD if it knows another host can restart it. In the case of "aggressive", HA will try to restart the VM even if it doesn't know the state of the other hosts, which could lead to a situation where your VM is not restarted as there is no host that has access to the datastore the VM is located on.

It is also good to know that if the APD is lifted and access to the storage is restored within the total of roughly 5 minutes and 20 seconds it takes before HA restarts the VM, HA will not do anything unless you explicitly configure it to do so. This is where the "Response for APD recovery after APD timeout" setting comes into play.
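To make that arithmetic explicit, here is a tiny sketch of the default timeline. The 140 seconds is the host-level APD timeout and the 3 minutes is the default VMCP delay mentioned above; both are configurable, so the numbers below are simply the defaults.

```python
# Back-of-the-envelope sketch of the default APD timeline described above.
# Both values are configurable; these are simply the defaults from the text.
APD_TIMEOUT_SEC = 140          # host declares the APD timeout after 140 seconds
VMCP_APD_DELAY_SEC = 3 * 60    # HA/VMCP then waits another 3 minutes by default

def earliest_restart_after_apd() -> int:
    """Seconds after the APD starts before HA may terminate and restart the VM."""
    return APD_TIMEOUT_SEC + VMCP_APD_DELAY_SEC

total = earliest_restart_after_apd()
print(f"{total} seconds, i.e. {total // 60} minutes and {total % 60} seconds")
# -> 320 seconds, i.e. 5 minutes and 20 seconds
```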


Increased scale is pretty straightforward: from 32 to 64 hosts and a total of 8000 VMs per cluster. I don't know too many customers hitting these boundaries, but I do come across a request like this occasionally. So if you want to grow your cluster, you can now do so. Do note that you may hit other limits, like the LUN limit or the VM limit or…

Registration of HA Disabled VMs after a failure is a feature I requested a long time ago. I am glad to see it made it into the release. Basically, when you have HA disabled on a specific VM, this feature will make sure that the VM gets registered on another host after a failure. This allows you to easily power on that VM when needed without needing to manually re-register it yourself. Note that HA will not power on the VM; it will just register it for you.

That was it for now…

New fling released: VM Resource and Availability Service

I have the pleasure of announcing a brand new fling that was released today. This fling is called "VM Resource and Availability Service" and is something I came up with during a flight to Palo Alto while talking to Frank Denneman. When it comes to HA Admission Control, the one thing that always bugged me was why it was all based on static values. Yes, it is great to know my VMs will restart, but I would also like to know if they will receive the resources they were receiving before the fail-over. In other words, will my user experience be the same or not? After going back and forth with engineering we decided that this could be worth exploring further, and we decided to create a fling. I want to thank Rahul (DRS team), Manoj and Keith (HA team) for taking the time and going to this extent to explore this concept.

Something which I think is also unique is that this is a SaaS-based solution: it allows you to upload a DRM dump, simulate the failure of one or more hosts from a cluster (in vSphere) and identify how many:

  • VMs would be safely restarted on different hosts
  • VMs would fail to be restarted on different hosts
  • VMs would experience performance degradation after being restarted on a different host

With this information, you can better plan the placement and configuration of your infrastructure to reduce downtime of your VMs/services in case of host failures. Is that useful or what? I would like to ask everyone to go through the motions, and of course to provide feedback if you feel this is useful information or not. You can leave feedback on this blog post or the fling website; we are aiming to monitor both.

For those who don’t know where to find the DRM dump, Frank described it in his article on the drmdiagnose fling, which I also recommend trying out! There is also a readme file with a bit more in-depth info! The default locations are listed below, and a small helper sketch for grabbing the newest dump follows the list.

  • vCenter server appliance: /var/log/vmware/vpx/drmdump/clusterX/
  • vCenter server Windows 2003: %ALLUSERSPROFILE%\Application Data\VMware\VMware VirtualCenter\Logs\drmdump\clusterX\
  • vCenter server Windows 2008: %ALLUSERSPROFILE%\VMware\VMware VirtualCenter\Logs\drmdump\clusterX\
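As a convenience, here is a tiny helper sketch for picking the most recent dump on the vCenter Server Appliance. The path is the appliance default from the list above, and "clusterX" is a placeholder for your actual cluster folder.

```python
# Tiny helper sketch: find the most recent DRM dump for a given cluster on the
# vCenter Server Appliance. "clusterX" is a placeholder for your cluster folder.
import glob
import os

dump_dir = "/var/log/vmware/vpx/drmdump/clusterX/"
dumps = sorted(glob.glob(os.path.join(dump_dir, "*")), key=os.path.getmtime)
print("Most recent DRM dump:", dumps[-1] if dumps else "none found")
```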

So where can you find it? Well, that is really easy, no downloads as I said… it fully runs as a service:

  1. Open hasimulator.vmware.com to access the web service.
  2. Click on “Simulate Now” to accept the EULA terms, upload the DRM dump file and start the simulation process.
  3. Click on the help icon (at the top right corner) for a detailed description on how to use this service.

vSphere 5.1 Clustering Deep Dive promotion & major milestone

This week, when looking at the sales numbers of the vSphere Clustering Deep Dive series, Frank and I noticed that we hit a major milestone! In September 2014 we passed 45,000 copies distributed of the vSphere Clustering Deep Dive. Frank and I never ever expected this or even dared to dream of hitting this milestone.

When we first started writing the 4.1 book, we had discussions around what to expect from a sales point of view. I recall having a discussion with Frank around the sales numbers: Frank said he would be happy with 100, and I said well, 400 would be nice. Needless to say, we reset our expectations many times since then… We didn't really follow it closely in the last 12-18 months, and as we were discussing a potential update of the book today, we figured it was time to look at the numbers again just to get an idea. 45,000 copies distributed (ebook + printed) is just remarkable, and we are very humbled, baffled and honoured!

We’ve noticed that the ebook is still very popular, and decided to do a promo. As of Monday the 13th of October, the 5.1 ebook (Kindle) will be available for only $0.99 for 72 hours; after those 72 hours the price will go up to $3.99, and after another 72 hours it will be back to the normal price. Make sure to get it while it is low priced!

You can pick it up here on Amazon.com! The only other kindle store we could open the promotion up for was amazon.co.uk, so that is also an option.

das.maskCleanShutdownEnabled is set to true by default

I had a couple of questions on the topic of das.maskCleanShutdownEnabled today. For those who have not read the other articles I wrote about this topic, this is, in short, what it does, why it was introduced, and how I explained it in an email today:

When a virtual machine is powered off (or shut down) by a user, a property named runtime.cleanPowerOff is set to true. To vSphere HA this indicates that the virtual machine was powered off by a user, and as such, when a host fails it knows that it doesn’t need to take action for this virtual machine. By default this property is set to true. If for whatever reason the virtual machine is killed by ESXi, then this property is set to false.

vSphere HA provides the ability to respond to a storage failure (PDL). When a PDL occurs it can kill a virtual machine and then restart it. However, runtime.cleanPowerOff defaults to “true” and vSphere HA cannot access the datastore (PDL, remember) to change the property! So this means that if the VM is killed after the PDL, it won’t be restarted, as HA assumes it was cleanly powered off.

This is where das.maskCleanShutdownEnabled comes into play. By setting this to “true”, vSphere HA assumes that the VM was not cleanly powered off; only when you cleanly power it off is the property set. In other words, in a PDL situation it will now restart the VM even though the datastore was unavailable when the VM was killed!

Back to the original question: what is das.maskCleanShutdownEnabled set to in 5.1 and later, and do you need to set it manually? No, you do not; by default it is set to true! So when you configure a cluster, be aware of this… especially in a stretched cluster environment, where a PDL scenario is not unlikely.

** do not forget to also set terminateVMonPDL described in this blog post if you want VMs to be automatically killed when a PDL occurs! **
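For completeness, if you are on a release where this is not the default, or you simply want the setting to be explicit, HA advanced options can also be pushed from a script. Below is a minimal pyVmomi sketch; the vCenter address, credentials and cluster name are placeholders, so adapt it to your environment.

```python
# Minimal sketch: explicitly set the das.maskCleanShutdownEnabled advanced option.
# Placeholders/assumptions: vCenter address, credentials and cluster name "Cluster01".
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab use only
si = SmartConnect(host="vcenter.local", user="administrator@vsphere.local",
                  pwd="VMware1!", sslContext=ctx)
content = si.RetrieveContent()

view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.ClusterComputeResource], True)
cluster = next(c for c in view.view if c.name == "Cluster01")
view.Destroy()

# HA advanced options are passed as key/value pairs on the das configuration
spec = vim.cluster.ConfigSpecEx()
spec.dasConfig = vim.cluster.DasConfigInfo()
spec.dasConfig.option = [
    vim.option.OptionValue(key="das.maskCleanShutdownEnabled", value="true"),
]
cluster.ReconfigureComputeResource_Task(spec=spec, modify=True)
Disconnect(si)
```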

VPLEX Geosynchrony 5.2 supporting up to 10ms latency with HA/DRS

I was just informed that, as of last week, VPLEX Metro with Geosynchrony 5.2 has been certified for a round-trip time (RTT) latency of up to 10ms while running HA/DRS in a vMSC solution. Until now, all vMSC solutions had been certified at 5ms RTT, so this is a major breakthrough if you ask me. Great to see that EMC spent the time certifying this, including support for HA and DRS across this distance.

Round-trip-time for a non-uniform host access configuration is now supported up to 10 milliseconds for VPLEX Geosynchrony 5.2 and ESXi 5.5 with NMP and PowerPath

More details on this topic can be found here: