
Yellow Bricks

by Duncan Epping


cluster

vSphere HA respecting VM-Host should rules?

Duncan Epping · Mar 5, 2015 ·

A long time ago I authored this white paper around stretched clusters. During our testing, the one thing where we felt HA was lacking was the fact that it would not respect VM-Host should rules. So if you had these configured in a cluster and a host failed, VMs could be restarted on ANY given host in the cluster. The first time DRS ran, it would then move the VMs back to where they belonged according to the configured VM-Host should rules.

I guess one of the reasons for this was the fact that the affinity and anti-affinity rules were originally designed to be DRS rules. Over time we realized that these are not DRS rules but rather cluster rules. Based on the findings from authoring the white paper we filed a bunch of feature requests, and one of them just made it into vSphere 6.0. As of vSphere 6.0 it is possible to have vSphere HA respect VM-Host should rules through the use of an advanced setting called “das.respectVmHostSoftAffinityRules”.

When “das.respectVmHostSoftAffinityRules” is configured, vSphere HA will try to respect the rule when it can. So if there are any hosts in the cluster which belong to the same VM-Host group, HA will restart the respective VM on one of those hosts. Of course, as this is a “should rule”, HA has the ability to ignore the rule when needed. You can imagine a scenario where none of the hosts in the VM-Host should rule are available; in that case HA will restart the VM on any other host in the cluster. Useful? Yes, I think so!
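For those who prefer to script this: the advanced option can also be pushed into the cluster’s HA configuration through the vSphere API. Below is a minimal pyVmomi sketch; the vCenter address, credentials and the cluster name “StretchedCluster” are placeholders, and the same result can of course be achieved through the Web Client or PowerCLI.

    # Minimal sketch: enable das.respectVmHostSoftAffinityRules on a cluster.
    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVim.task import WaitForTask
    from pyVmomi import vim

    si = SmartConnect(host="vcenter.local", user="administrator@vsphere.local",
                      pwd="secret", sslContext=ssl._create_unverified_context())
    content = si.RetrieveContent()

    # Locate the cluster object by name (placeholder name).
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.ClusterComputeResource], True)
    cluster = next(c for c in view.view if c.name == "StretchedCluster")

    # Add the HA (das) advanced option and reconfigure the cluster.
    spec = vim.cluster.ConfigSpecEx()
    spec.dasConfig = vim.cluster.DasConfigInfo()
    spec.dasConfig.option = [vim.option.OptionValue(
        key="das.respectVmHostSoftAffinityRules", value="true")]
    WaitForTask(cluster.ReconfigureComputeResource_Task(spec=spec, modify=True))
    Disconnect(si)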

What is das.maskCleanShutdownEnabled about?

Duncan Epping · Apr 25, 2012 ·

I had a question today around what the vSphere HA advanced setting das.maskCleanShutdownEnabled is about. I described why it was introduced for Stretched Clusters but will give a short summary here:

Two advanced settings were introduced in vSphere 5.0 Update 1 to enable HA to fail over virtual machines which are located on datastores that are in a Permanent Device Loss (PDL) state. This is very specific to stretched cluster environments. The first setting is configured on a host level and is “disk.terminateVMOnPDLDefault”. This setting can be configured in /etc/vmware/settings and should be set to “True”. It ensures that a virtual machine is killed when the datastore it resides on is in a PDL state.

The second setting is a vSphere HA advanced setting called “das.maskCleanShutdownEnabled”. This setting is also not enabled by default and will need to be set to “True”. It allows HA to trigger a restart response for a virtual machine which has been killed automatically due to a PDL condition, by enabling HA to differentiate between a virtual machine which was killed due to the PDL state and a virtual machine which was powered off by an administrator.
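The HA advanced setting can be pushed to the cluster in the same way as any other das option; below is a compact pyVmomi sketch (vCenter address, credentials and cluster name are placeholders). Note that disk.terminateVMOnPDLDefault is a host-level setting and, as described above, goes into /etc/vmware/settings on each host rather than through this API call.

    # Compact sketch: set the das.maskCleanShutdownEnabled HA advanced option.
    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVim.task import WaitForTask
    from pyVmomi import vim

    si = SmartConnect(host="vcenter.local", user="administrator@vsphere.local",
                      pwd="secret", sslContext=ssl._create_unverified_context())
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.ClusterComputeResource], True)
    cluster = next(c for c in view.view if c.name == "StretchedCluster")

    spec = vim.cluster.ConfigSpecEx()
    spec.dasConfig = vim.cluster.DasConfigInfo()
    spec.dasConfig.option = [vim.option.OptionValue(
        key="das.maskCleanShutdownEnabled", value="True")]
    WaitForTask(cluster.ReconfigureComputeResource_Task(spec=spec, modify=True))
    Disconnect(si)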

But why is “das.maskCleanShutdownEnabled” needed for HA? From a vSphere HA perspective there are two different types of “operations”. The first is a user-initiated power-off (clean) and the other is a kill. When a virtual machine is powered off by a user, part of the process is setting the property “runtime.cleanPowerOff” to true.

Remember that when “disk.terminateVMOnPDLDefault” is configured, your VMs will be killed when they issue I/O against a datastore in a PDL state. This is where the problem arises: in a PDL scenario it is impossible to set “runtime.cleanPowerOff”, as the datastore, and as such the .vmx file, is unreachable. As the property defaults to “true”, vSphere HA would assume the VMs were cleanly powered off and would not take any action in a PDL scenario. By setting “das.maskCleanShutdownEnabled” to true you avoid a scenario where all VMs are killed but never restarted, as you are telling vSphere HA to assume that VMs were not shut down cleanly. In that case vSphere HA will assume VMs were killed UNLESS the property is set.
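For those who want to see that property for themselves, a hedged pyVmomi sketch is shown below. It simply lists powered-off VMs and the value of runtime.cleanPowerOff; connection details are placeholders.

    # Sketch: show which powered-off VMs were recorded as cleanly powered off.
    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVmomi import vim

    si = SmartConnect(host="vcenter.local", user="administrator@vsphere.local",
                      pwd="secret", sslContext=ssl._create_unverified_context())
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)

    for vm in view.view:
        if vm.runtime.powerState == vim.VirtualMachinePowerState.poweredOff:
            # cleanPowerOff is true for a user-initiated power-off; a VM killed
            # during a PDL condition cannot have this property updated.
            print(vm.name, "cleanPowerOff =", vm.runtime.cleanPowerOff)
    Disconnect(si)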

If you have a stretched cluster environment, make sure to configure these settings accordingly!

Management Cluster / vShield Resiliency?

Duncan Epping · Feb 14, 2011 ·

I was reading Scott’s article about using dedicated clusters for management applications, which was quickly followed by a bunch of quotes turned into an article by Beth P. from TechTarget. Scott mentions that he had posed the original question on Twitter: were people doing dedicated management clusters, and if so why?

As he mentioned, only a few responded, and the reason for that is simple: hardly anyone is doing dedicated management clusters these days. The few environments that I have seen doing it were large enterprise environments or service providers where this was part of an internal policy. Basically, in those cases a policy would state that “management applications cannot be hosted on the platform it is managing”, and some even went a step further where these management applications were not even allowed to be hosted in the same physical datacenter. Scott’s article was quickly turned into an “availability concerns” article by TechTarget, to which I want to respond. I am by no means a vShield expert, but I do know a thing or two about the product and the platform it is hosted on.

I’ll use vShield Edge and vShield Manager as an example, as Scott’s article mentions vCloud Director, which leverages vShield Edge. This means that vShield Manager needs to be deployed in order to manage the Edge devices. I was part of the team that was responsible for the vCloud Reference Architecture, but also part of the team that designed and deployed the first vCloud environment in EMEA. Our customer had their worries as well about the resiliency of vShield Manager and vShield Edge, but as they are virtual appliances they can easily be “protected” by leveraging vSphere features. One thing I want to point out though: if vShield Manager is down, vShield Edge will continue to function, so no need to worry there. I created the following table to display how vShield Manager and vShield Edge can be “protected”.

Product         | vShield Manager | VMware HA | VM Monitoring | VMware FT
vShield Manager | Yes (*)         | Yes       | Yes           | Yes
vShield Edge    | Yes (*)         | Yes       | Yes           | Yes
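As an illustration of what “protected” can mean in practice: per-VM HA overrides such as a higher restart priority and VM Monitoring can be applied to the vShield Manager VM. Below is a hedged pyVmomi sketch; the vCenter details, the cluster name “ManagementCluster” and the VM name “vShield-Manager” are placeholders, not names from the original environment.

    # Sketch: per-VM HA override for the vShield Manager appliance
    # (restart priority "high" plus VM Monitoring).
    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVim.task import WaitForTask
    from pyVmomi import vim

    si = SmartConnect(host="vcenter.local", user="administrator@vsphere.local",
                      pwd="secret", sslContext=ssl._create_unverified_context())
    content = si.RetrieveContent()
    clusters = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.ClusterComputeResource], True)
    cluster = next(c for c in clusters.view if c.name == "ManagementCluster")
    vms = content.viewManager.CreateContainerView(cluster, [vim.VirtualMachine], True)
    vsm = next(v for v in vms.view if v.name == "vShield-Manager")

    # Per-VM HA settings: restart priority "high" and VM (heartbeat) monitoring.
    settings = vim.cluster.DasVmSettings(
        restartPriority="high",
        vmToolsMonitoringSettings=vim.cluster.VmToolsMonitoringSettings(
            vmMonitoring="vmMonitoringOnly", clusterSettings=False))
    override = vim.cluster.DasVmConfigSpec(
        operation="add",
        info=vim.cluster.DasVmConfigInfo(key=vsm, dasSettings=settings))

    spec = vim.cluster.ConfigSpecEx(dasVmConfigSpec=[override])
    WaitForTask(cluster.ReconfigureComputeResource_Task(spec=spec, modify=True))
    Disconnect(si)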

Not only can you leverage these standard vSphere technologies, there is more that can be leveraged:

  • Scheduled live clone of vShield Manager through vCenter
  • Scheduled configuration backup of vShield Manager (*)

Please don’t get me wrong here, there are always ways to get locked out, but as Edward Haletky stated: “In fact, the way vShield Manager locks down the infrastructure upon failure is in keeping with longstanding security best practices.” (Quote from Beth P.’s article.) I also would not want my door to be opened up automatically when there is something wrong with my lock. The trick, though, is to prevent a “broken lock” situation from occurring and to utilize vSphere capabilities in such a way that the last known state can be safely recovered if it does occur.

As always, an architect/consultant will need to work with all the requirements and constraints and, based on the capabilities of a product, come up with a solution that offers maximum resiliency. With the options mentioned above you can’t tell me that VMware doesn’t provide them.

RE: Maximum Hosts Per Cluster (Scott Drummonds)

Duncan Epping · Nov 29, 2010 ·

I love blogging because of the discussions you sometimes get into. One of the bloggers I highly respect and closely follow is EMC’s vSpecialist Scott Drummonds (former VMware performance guru). Scott posted a question on his blog about what the size of a cluster should be. Scott discussed this with Dave Korsunksy and Dan Anderson, both VMware employees, and more or less came to the conclusion that 10 is probably a good number.

So, have I given a recommendation?  I am not sure.  If anything I feel that Dave, Dan and I believe that a minimum cluster size needs to be set to guarantee that the CPU utilization target, and not the HA failover capacity, is defining the number of wasted resources.  This means a minimum cluster of something like four or five hosts.  While neither of us claims a specific problem that will occur with very large clusters, we cannot imagine the value of a 32-host cluster.  So, we think the right cluster size is somewhere shy of 10.

And of course they have a whole bunch of arguments for both large (12+) and small (8-) clusters… which I summarized below for your convenience:

  • Pro Large: DRS efficiency.  This was my primary claim in favor of 32-host clusters.  My reasoning is simple: with more hosts in the cluster there are more CPU and memory resource holes into which DRS can place running virtual machines to optimize the cluster’s performance.  The more hosts, the more options to the scheduler.
  • Pro Small: DRS does not make scheduling decisions based on the performance characteristics of the server, so a new, powerful server in a cluster is just as likely to receive a mission-critical virtual machine as an older, slower host.  This would be unfortunate if a cluster contained servers with radically different–although EVC compatible–CPUs like the Intel Xeon 5400 and Xeon 5500 series.
  • Pro Small: By putting your mission-critical applications in a cluster of their own your “server huggers” will sleep better at night.  They will be able to keep one eye on the iron that can make or break their job.
  • Pro Small: Cumbersome nature of their change control.  Clusters have to be managed to a consistent state and the complexity of this process is dependent on the number of items being managed.  A very large cluster will present unique challenges when managing change.
  • Pro Small: To size a 4+1 cluster to 80% utilization after host failure, you will want to restrict CPU usage in the five hosts to 64%.  Going to a 5+1 cluster results in a pre-failure CPU utilization target of 66%.  The increases slowly approach 80% as the clusters get larger and larger.  But, you can see that the incremental resource utilization improvement is never more than 2%.  So, growing a cluster slightly provides very little value in terms of resource utilization.
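A quick back-of-the-envelope check of those utilization numbers, assuming the relationship the argument implies (pre-failure target = post-failure target × N / (N + 1) for an N+1 cluster):

    # Pre-failure CPU utilization target per host for an N+1 cluster, if the
    # goal is to stay at 80% utilization after losing one host.
    post_failure_target = 0.80
    for total_hosts in range(5, 11):            # 4+1 up to 9+1
        target = post_failure_target * (total_hosts - 1) / total_hosts
        print(f"{total_hosts - 1}+1 cluster: pre-failure target {target:.1%}")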

It is probably an endless debate, and the arguments for both “Pro Large” and “Pro Small” are all very valid, although I seriously disagree with their conclusion of not seeing the value of a 32-host cluster. As always, it fully depends. On what, you might say; why would you ever want a 32-host cluster? Well, for instance when you are deploying vCloud Director. Clusters are currently the boundary for your vDC, and who wants to give his customer 6 vDCs instead of just 1 because you limited your cluster size to 6 hosts instead of leaving the option open to go to the max? This might just be an exception and nowhere near reality for some of you, but I wanted to use it as an example to show that you will need to take many factors into account.
Now I am not saying you should, but at least leave the option open.

One of the arguments I do want to debate is the change control argument. Again, this used to be valid in a lot of Enterprise environments where ESX was used. Now I am deliberately using “ESX” and “Enterprise” here, as the reality is that many companies don’t even have a change control process in place. (I worked for a few large insurance companies which didn’t!) On top of that, there is a large discrepancy when it comes to the amount of work associated with patching ESX vs ESXi. I have spent many weekends upgrading ESX, but today literally spent minutes upgrading ESXi. The impact and risks associated with patching have most certainly decreased with ESXi in combination with VUM and the staging options. On top of that, many organizations treat ESXi as an appliance, and with stateless ESXi and the Auto-Deploy appliance being around the corner I guess that notion will only grow to become a best practice.

A couple of arguments that I have often seen being used to restrict the size of a cluster are the following:

  • HA limits (different maximum number of VMs when clusters are > 8 hosts)
  • SCSI Reservation Conflicts
  • HA Primary nodes

Let me start with saying that for every new design you create, you should challenge your design considerations and best practices… are they still valid?

The first one is obvious, as most of you know by now that there is no such thing anymore as an 8-host boundary with HA. The second one needs some explanation. Around the VI3 time frame, cluster sizes were often limited because of possible storage performance issues. These alleged issues were mainly blamed on SCSI Reservation Conflicts. The conflicts were caused by having many VMs on a single LUN in a large cluster. Whenever a metadata update was required, the LUN would be locked by a host and this would/could increase overall latency. To avoid this, people would keep the number of VMs per VMFS volume low (10/15) and keep the number of VMFS volumes per cluster low… also resulting in a fairly low consolidation factor, but hey, 10:1 beats physical.

Those arguments used to be valid, however things have changed. vSphere 4.1 brought us VAAI, which is a serious game changer in terms of SCSI reservations. I understand that for many storage platforms VAAI is currently not supported… However, the original mechanism used for SCSI reservations has also improved severely over time (optimistic locking), which in my opinion reduces the need to have many small LUNs, which would eventually limit you from a maximum-number-of-LUNs-per-host perspective. So with VAAI or optimistic locking, and of course NFS, the argument to have small clusters is not really valid anymore. (Yes, there are exceptions.)

The one design consideration that is missing in my opinion, and which is crucial, is HA primary node placement. Many have limited their cluster sizes because of hardware and HA primary node constraints. As hopefully known (if not, be ashamed), HA has a maximum of 5 primary nodes in a cluster and a primary is required for restarts to take place. In large clusters the chances of losing all primaries also increase if and when the placement of the hosts is not taken into account. The general consensus usually is: keep your cluster limited to 8 hosts and spread it across two racks or chassis so that each rack always has at least a single primary node to restart VMs. But why would you limit yourself to 8? Why, if you just bought 48 new blades, would you create 6 clusters of 8 hosts instead of 3 clusters of 16 hosts? By simply layering your design you can mitigate all risks associated with primary node placement while benefiting from additional DRS placement options. (Do note that if you “only” have two chassis, your options are limited.)

Which brings us to another thing I wanted to discuss… Scott’s argument against increased DRS placement was that hundreds of VMs in an 8-host cluster already lead to many placement options. Indeed you will have many load-balancing options in an 8-host cluster, but is it enough? In the field I also see a lot of DRS rules. DRS rules restrict the DRS load-balancing algorithm when looking for suitable options; as such, more options will more than likely result in a better balanced cluster. Heck, I have even seen cluster imbalances which could not be resolved due to DRS rules in a five-host cluster with 70 VMs.

Don’t get me wrong, I am not advocating to go big… but neither am I advocating a limited cluster size for reasons that might not even apply to your environment. Write down the requirements of your customer or your environment and don’t limit yourself to design considerations around compute alone. Think about storage, networking, update management, configuration maximums, DRS and DPM, HA, and resource and operational overhead.

Did you know? All hosts failed…

Duncan Epping · Oct 22, 2010 ·

** for vSphere 5.0 check this update! **

Today I received a very valid question around a full cluster failure. What happens when all the hosts in a cluster go down and at some point return? Will the VMs be restarted and what do I need to have in place to ensure they will?

It seems to be an urban myth that you need to use “auto-start” for a full cluster failure. But as you might have noticed, that won’t work when HA is enabled. So what will?

VMware HA

Is it really that simple? Yes it is! When a full cluster fails and nodes start powering up, HA will restart the VMs. As you know, HA (or to be precise, the primary nodes) maintains the host states, which include the status of all VMs on those hosts. When one of the primary nodes returns to duty, it will trigger the restarts based on the last known state. Make sure you set the restart priority correctly so that any VMs hosting “management apps” are booted first.

It can’t get any simpler than that can it!
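For those who want to script the restart priority piece, below is a hedged pyVmomi sketch that gives a handful of management VMs a “high” HA restart priority as a per-VM override. The vCenter details, cluster name and VM names are placeholders.

    # Sketch: raise the HA restart priority of management VMs so HA boots them
    # first after a (full cluster) failure. All names below are placeholders.
    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVim.task import WaitForTask
    from pyVmomi import vim

    si = SmartConnect(host="vcenter.local", user="administrator@vsphere.local",
                      pwd="secret", sslContext=ssl._create_unverified_context())
    content = si.RetrieveContent()
    clusters = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.ClusterComputeResource], True)
    cluster = next(c for c in clusters.view if c.name == "Cluster01")
    vms = content.viewManager.CreateContainerView(cluster, [vim.VirtualMachine], True)
    mgmt = [v for v in vms.view if v.name in ("vCenter", "AD01", "vShield-Manager")]

    overrides = [vim.cluster.DasVmConfigSpec(
                     operation="add",
                     info=vim.cluster.DasVmConfigInfo(
                         key=vm,
                         dasSettings=vim.cluster.DasVmSettings(restartPriority="high")))
                 for vm in mgmt]

    spec = vim.cluster.ConfigSpecEx(dasVmConfigSpec=overrides)
    WaitForTask(cluster.ReconfigureComputeResource_Task(spec=spec, modify=True))
    Disconnect(si)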

