
Yellow Bricks

by Duncan Epping



Two new HA Advanced Settings

Duncan Epping · Aug 23, 2010 ·

I just noticed a couple of new HA advanced settings in the vCenter Performance Best Practices whitepaper:
  • das.perHostConcurrentFailoversLimit
    When multiple VMs are restarted on one host, up to 32 VMs will be powered on concurrently by default. This is to avoid resource contention on the host. This limit can be changed through the HA advanced option: das.perHostConcurrentFailoversLimit. Setting a larger value will allow more VMs to be restarted concurrently and might reduce the overall VM recovery time, but the average latency to recover individual VMs might increase. We recommend using the default value.
  • das.sensorPollingFreq
    The das.sensorPollingFreq option controls the HA polling interval. HA polls the system periodically to update the cluster state with such information as how many VMs are powered on, and so on. The polling interval was 1 second in vSphere 4.0. A smaller value leads to faster VM power on, and a larger value leads to better scalability if a lot of concurrent power operations need to be performed in a large cluster. The default is 10 seconds in vSphere 4.1, and it can be set to a value between 1 and 30 seconds.

I want to note that I would not recommend changing these; there is a very good reason the defaults were selected, and changing them can lead to instability. When troubleshooting, however, they might come in handy, and for those who want to script it in a lab a rough sketch follows below.
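
Should you need to change them anyway, for instance while troubleshooting in a lab, below is a minimal sketch of how HA advanced options can be set through the vSphere API with pyVmomi. The helper name is mine, and I assume you have already retrieved the cluster object through an authenticated session; the vSphere Client works just as well.

```python
# Minimal pyVmomi sketch, assuming "cluster" is a vim.ClusterComputeResource
# you have already looked up. Helper name and example values are mine;
# stick to the defaults unless you have a very good reason not to.
from pyVmomi import vim

def set_ha_advanced_options(cluster, options):
    """Merge a dict of HA advanced options into the cluster configuration."""
    spec = vim.cluster.ConfigSpecEx()
    spec.dasConfig = vim.cluster.DasConfigInfo()
    spec.dasConfig.option = [
        vim.option.OptionValue(key=key, value=str(value))
        for key, value in options.items()
    ]
    # modify=True merges this change instead of replacing the entire config
    return cluster.ReconfigureComputeResource_Task(spec=spec, modify=True)

# Example (troubleshooting only):
# set_ha_advanced_options(cluster, {
#     "das.perHostConcurrentFailoversLimit": 64,
#     "das.sensorPollingFreq": 5,
# })
```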

HA and a Metrocluster

Duncan Epping · Aug 20, 2010 ·

I was reading an excellent article by Larry Touchette on NetApp MetroClusters and VM-Host affinity rules the other day. That article is based on tech report TR-3788, which covers the full solution but does not include the 4.1 enhancements.

The main focus of the article is on VM-Host Affinity Rules. Great stuff, and it will “ensure” you keep your I/O local. As explained, when a Fabric MetroCluster is used, the added latency of going across, for instance, 80 km of fibre is substantial. By using VM-Host Affinity Rules, where a group of VMs is linked to a group of hosts, this “overhead” can be avoided.
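
For those who are curious what such a rule looks like underneath the covers, here is a minimal pyVmomi sketch of a “should run on” VM-Host affinity rule. The group and rule names are placeholders I made up, and "cluster", "site_a_vms" and "site_a_hosts" are assumed to be managed objects you have already looked up.

```python
# Minimal pyVmomi sketch: create DRS groups plus a "should run on" rule.
# "cluster" is a vim.ClusterComputeResource, "site_a_vms" a list of
# vim.VirtualMachine and "site_a_hosts" a list of vim.HostSystem (assumptions).
from pyVmomi import vim

def create_vm_host_should_rule(cluster, site_a_vms, site_a_hosts):
    group_specs = [
        vim.cluster.GroupSpec(
            info=vim.cluster.VmGroup(name="SiteA-VMs", vm=site_a_vms),
            operation="add"),
        vim.cluster.GroupSpec(
            info=vim.cluster.HostGroup(name="SiteA-Hosts", host=site_a_hosts),
            operation="add"),
    ]
    rule_spec = vim.cluster.RuleSpec(
        info=vim.cluster.VmHostRuleInfo(
            name="SiteA-VMs-to-SiteA-Hosts",
            enabled=True,
            mandatory=False,  # "should" rule: HA can still restart VMs elsewhere
            vmGroupName="SiteA-VMs",
            affineHostGroupName="SiteA-Hosts"),
        operation="add")
    spec = vim.cluster.ConfigSpecEx(groupSpec=group_specs,
                                    rulesSpec=[rule_spec])
    return cluster.ReconfigureComputeResource_Task(spec=spec, modify=True)
```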

Now, the question of course is: what about HA? The example NetApp provided shows four hosts. With only four hosts we all know, hopefully at least, that all of these hosts will be primary nodes. So even if a set of hosts fails, one of the remaining hosts will be able to take over the failover coordinator role and restart the VMs. Up to an eight-host cluster that is still very much true: with a maximum of five primaries and four hosts on each side, at least a single primary will exist in each site.

But what about more than eight hosts? What happens when the link between the sites fails? How do I ensure each site has primaries left to restart VMs if needed?

Take a look at the following diagram I created to visualize all of this:

We have two datacenters here, Datacenter A and Datacenter B. Both have their own FAS with two shelves and their own set of VMs, which run on that FAS. Although storage is mirrored, there is still only one real active copy of each datastore. In this case VM-Host Affinity rules have been created to keep the VMs local, in order to avoid I/O going across the wire. This is very similar to what NetApp described.

However, in my case there are five hosts in total that are a darker shade of green. These hosts were specified as the preferred primary nodes, which means that each site will have at least two primary nodes.

Let's assume the link between Datacenter A and Datacenter B dies. Some might assume that this will trigger the HA Isolation Response, but it actually will not.

The reason is that an HA primary node still exists in each site. The Isolation Response is only triggered when no heartbeats are received, and as a primary node sends heartbeats to both the primary and the secondary nodes, a heartbeat will always be received. I can't emphasize this enough: an Isolation Response will not be triggered.

When the link between the datacenters dies, it will appear to Datacenter A as if Datacenter B is unreachable, and one of the primaries in Datacenter A will initiate restart tasks for the allegedly impacted VMs, and vice versa. However, as the Isolation Response has not been triggered, the locks on the VMDKs still exist and it will be impossible to restart those VMs.

These VMs will simply remain running within their own site. Although it might appear on both ends that the other datacenter has died, HA is “smart” enough to detect that it hasn't, and it will be up to you to decide whether you want to fail over those VMs or not.

I am just so excited about these developments that I can't get enough of it. Although the “das.preferredprimaries” setting is not supported as of this writing, I thought it was cool enough to share with you. I also want to point out that the diagram shows two isolation addresses; this is only needed when the specified gateway is not accessible at both ends while the network connection between the sites is down. If the gateway is accessible at both sites even during a network failure, only one isolation address, which can be the default gateway, is required.
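
For reference, all of these are plain HA advanced options, so they could be pushed with the same kind of helper sketched under the previous post. Treat the snippet below strictly as an illustration: the host names and addresses are made up, the comma-separated value format for das.preferredprimaries is my assumption, and the setting itself is unsupported.

```python
# Illustration only, reusing the set_ha_advanced_options() helper sketched in
# the post above. Host names, addresses and the das.preferredprimaries value
# format are assumptions; the setting is unsupported at the time of writing.
set_ha_advanced_options(cluster, {
    "das.preferredprimaries": "esx01,esx02,esx05,esx06,esx07",  # the five "dark green" hosts
    "das.usedefaultisolationaddress": "false",
    "das.isolationaddress1": "192.168.1.1",  # address reachable within Datacenter A
    "das.isolationaddress2": "192.168.2.1",  # address reachable within Datacenter B
})
```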

HA Cli

Duncan Epping · Aug 3, 2010 ·

I was just playing around with the HA Cli and noticed that when you issue an “ln” (listNodes) command, the failover coordinator (aka master primary) is also listed. I had never noticed this before, but I don't have a pre-vSphere 4.1 environment available to check whether this existed before 4.1. If you want to test it in your own environment, simply run “/opt/vmware/aam/bin/Cli” and issue the “ln” command as shown in the screenshot below:

I also tested demoting a node, just for fun. In this case I demoted the node “esxi1” from primary to secondary:

And of course I promoted it again to primary:

 

** Disclaimer: This article contains references to the words master and/or slave. I recognize these as exclusionary words. The words are used in this article for consistency because it’s currently the words that appear in the software, in the UI, and in the log files. When the software is updated to remove the words, this article will be updated to be in alignment. **

VMware related acronyms

Duncan Epping · Jul 29, 2010 ·

We were just talking about some random VMware acronyms during a lab day, and I thought I would write down the ones some of us didn't know. (Even Google did not have the answer for some of them.) I guess the most difficult one to figure out was VPXA/VPXD, which refers to VPX, the official name for vCenter back in the day…

  • FDM = Fault Domain Manager
  • CSI = Clustering Services Infrastructure
  • PAE = Propero Application Environment
  • ESX = Elastic Sky X
  • GSX = Ground Storm X or Ground Swell X
  • VPX = Virtual Provisioning X
  • VPXA = Virtual Provisioning X Agent
  • VPXD = Virtual Provisioning X Daemon
  • VMX = Virtual Machine eXecutable
  • AAM = Automated Availability Manager
  • VIX = Virtual Infrastructure eXtension
  • VIM = Virtual Infrastructure Management
  • DAS = Distributed Availability Service
  • ccagent = Control Center agent
  • vswif = Virtual Switch Interface
  • vami = Virtual Appliance Management Infrastructure
  • vob = VMkernel Observation
  • MARVIN = Modular Automated Rackable Infrastructure Node
  • WCP = Workload Control Plane

How about code names for releases? Well, we had a couple. Note that the first name usually refers to ESX and the second to vCenter, so for KL, “Kadinsky” was the code name for ESX and “Logan” for vCenter:

  • DM = Dali/McKinley = VI 3.0
  • NP = Neptune/Pluto = VI 3.5
  • KL = Kadinsky/Logan = vSphere 4.0
  • KL.next = vSphere 4.1
  • MN = Matisse/Newberry = vSphere 5.0
  • OP = Oliveira/Pikes = vSphere 5.5

Of course the big question is where the “X” comes from in ESX, GSX, etc. To be honest, I don't know, but according to VMware old-timer Mike Di Petrillo (source: this interview (21:30) by Rodney Haywood), the X was added by an engineer to make it sound technical and cool!

If there are any other VMware-related acronyms that you feel should be on the list and that are not too obvious… leave me a comment. (Too obvious would be something like vDS.)

HA/DRS and Flattened Shares

Duncan Epping · Jul 22, 2010 ·

A week ago I already touched on this topic, but I wanted to get a better understanding for myself of what could go wrong in these situations and how vSphere 4.1 solves the issue.

Pre-vSphere 4.1, an issue could arise when custom shares had been set on a virtual machine. When HA fails over a virtual machine, it powers the virtual machine on in the Root Resource Pool. However, the virtual machine's shares were scaled for its place in the resource pool hierarchy, not for the Root Resource Pool. This could cause the virtual machine to receive either too many or too few resources relative to its entitlement.

A scenario where this can occur would be the following:

VM1 has 1000 shares and Resource Pool A has 2000 shares. Resource Pool A contains two VMs, VM2 and VM3, and each will receive 50% of those 2000 shares.

When the host fails, both VM2 and VM3 will end up at the same level as VM1. However, as a custom shares value of 10000 was specified on both VM2 and VM3, they will completely blow away VM1 in times of contention. This is depicted in the following diagram:

This situation would persist until the next invocation of DRS re-parents the virtual machine to its original Resource Pool. To address this issue, as of vSphere 4.1 DRS flattens the virtual machine's shares and limits before failover. This flattening process ensures that the VM will get the resources it would have received had it been failed over into the correct Resource Pool. This scenario is depicted in the following diagram; note that both VM2 and VM3 are placed under the Root Resource Pool with a shares value of 1000.
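
To make the numbers concrete, here is a quick back-of-the-napkin calculation of the root-level share ratios in this scenario, assuming full contention and no reservations or limits:

```python
# Root-level share ratios (percent) under full contention for the scenario above.

def ratios(shares):
    total = sum(shares.values())
    return {name: round(100.0 * value / total, 1) for name, value in shares.items()}

# Normal operation: VM1 competes with Resource Pool A at the root level,
# and VM2/VM3 each get 50% of RP A's 2000 shares (1000 "root-level" shares each).
print(ratios({"VM1": 1000, "RP-A (VM2+VM3)": 2000}))
# {'VM1': 33.3, 'RP-A (VM2+VM3)': 66.7}

# Pre-4.1 failover: VM2/VM3 land in the Root Resource Pool with their custom 10000 shares.
print(ratios({"VM1": 1000, "VM2": 10000, "VM3": 10000}))
# {'VM1': 4.8, 'VM2': 47.6, 'VM3': 47.6}  -> VM1 is starved

# vSphere 4.1 flattened shares: VM2/VM3 land in the Root Resource Pool with 1000 shares each.
print(ratios({"VM1": 1000, "VM2": 1000, "VM3": 1000}))
# {'VM1': 33.3, 'VM2': 33.3, 'VM3': 33.3}
```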

Of course, when DRS is invoked, both VM2 and VM3 will be re-parented under Resource Pool A and will again receive the shares they were originally assigned. I hope this makes it a bit clearer what this “flattened shares” mechanism actually does.

