BC-DR

vSphere HA in 5.0 constantly pinging my gateway?

Duncan Epping · Jun 26, 2012 ·

I had this question today and noticed someone also dropped it on the community forums. The question was if vSphere HA is constantly pinging the default gateway or not. I knew HA would ping the gateway on a regular basis as of vSphere 5.0, and on a more frequently basis if a ping would fail but I wasn’t sure about the timing. I pointed Marc Sevigny from the HA engineering team to the thread on the community forums and he added added some nice juicy details to the it. I figured I would share them with you.

First of all, each ESXi host in a 5.x cluster will ping the isolation address every 5 minutes (300 seconds). Could this flood the isolation device?

There should be no “flood” of ICMP messages, and it should have little impact on network performance. The ICMP packet is 53 bytes long and sent once every 5 seconds from each of the HA hosts until the address(es) become pingable once again, at which point it returns to pinging once per hour.

If your default gateway is never pingable because of your firewall, you should open up the ports needed by HA. It is also possible to or disable the isolation address monitoring on the default gateway by using an advanced option (das.useDefaultIsolationAddress = false). It is recommended to specify a different isolation address (das.isolationaddress0) when the default gateway is a non-pingable device. Note that it is highly recommend to use a device as the default gateway which is as few hops removed from your hosts as possible!

Forced recovery option grayed out with Site Recovery Manager 5.0.1

Duncan Epping · Jun 22, 2012 ·

I was playing with Site Recovery Manager (SRM) 5.0.1 today and I wanted to trigger a fail-over. As I just wanted a quick test I figured I would use the “forced recovery” option. This option allows you to fail-over without SRM trying to sync the storage layer. In a normal situation I would probably try to sync my storage but as I knew the other site was dead and I just wanted to test it quickly I figured I would just tick it and get the recovery plan going. Unfortunately the option was grayed out.

You can enable this fairly simple though:

Right click in the left pane on your site
Click “advanced settings”
Click “Recovery”
Select the “recovery.forcedFailover” setting

Now when you run your recovery plan it will not try to power-off/shutdown VMs or sync the storage. Nice right.

Another option that I spotted which many of you might need is “storageProvider.hostRescanRepeatCnt”, in the past I often had to rescan my storage system at least twice before LUNs would appear. That is where this setting comes in handy as it will do that for you. There’s some more nice new SRM 5.0.1 features to be found in this article by Ken Werneburg, make sure to read it.

HA Admission Control the basics – Part 2/2

Duncan Epping · Jun 20, 2012 ·

In part one I described what HA Admission Control is and in part two I will explain what your options are when admission control is enabled. Currently there are three admission control policies:

Host failures cluster tolerates
Percentage of cluster resources reserved as failover spare capacity
Specify a failover host

Each of these work in a slightly different way. And lets start with “Specify a failover host” as it is the most simple one to explain. This admission control policy allows you to set aside 1 host that will only be used in case a fail-over needs to occur. This means that even if your cluster is overloaded DRS will not use it. In my opinion there aren’t many usecases for it, and unless you have very specific requirements I would avoid using it.

The most difficult one to explain is “Host failures cluster tolerates” but I am going to try to keep it simple. This admission control policy takes the worst case scenario in to account, and only the worst case scenario, and it does this by using “slots”. A slot is comprised of two components:

Memory
CPU

For memory it will take the largest reservation on any powered-on virtual machine in your cluster plus the memory overhead for this virtual machine. So if you have one virtual machine that has 24GB memory provisioned and 10GB out of that is reserved than the slot size for memory is ~10GB (reservation + memory overhead).

For CPU it will take the largest reservation on any powered-on virtual machine in your cluster, or it will use a default of 32MHz (5.0, pre 5.0 it was 256MHz) for the CPU slot size. If you have a virtual machine with 8 vCPUs assigned and a 2GHz reservation then the slot size will be 2GHz for CPU.

HA admission control will look at the total amount of resources and see how many “memory slots” there are by dividing the total amount of memory by the “memory slot size”. It will do the same for CPU. It will calculate this for each host. From the total amount of available memory and CPU slots it will take the worst case scenario again, so if you have 80 memory slots and 120 CPU slots then you can power on 80 VMs… well almost, cause the number of slots of the largest hosts is also subtracted. Meaning that if you have 5 hosts and each of those have 10 slots for memory and CPU instead of having 50 slots available in total you will end up with 40.

Simple right? So remember: reservations –> slot size –> worst case. Yes, a single large reservation could severely impact this algorithm!

So now what? Well this is where the third admission control policy comes in to play… “Percentage of cluster resources reserved as failover spare capacity”. This is not a difficult one to explain, but again misunderstood by many. First of all HA will add up all available resources to see how much it has available. It will now subtract the amount of resource specified for both memory and CPU. Then HA will calculate how much resources are currently reserved for both memory and CPU for powered-on virtual machines. For CPU, those virtual machines that do not have a reservation larger than 32Mhz a default of 32Mhz will be used. For memory a default of 0MB+memory overhead will be used if there is no reservation set. If a reservation is set for memory it will use the reservation+memory overhead.

That is it. Percentage based looks at “powered-on virtual machines” and its reservation or uses the above mentioned defaults. Nothing more than that. No. it doesn’t look at resource usage / consumption / active etc. It looks at reserved resources. Remember that!

What do I recommend? I always recommend using the percentage based admission control policy as it is the most flexible policy. It will do admission control on a per virtual machine reservation basis without the risk of skewing the numbers.

If you have any questions around this please don’t hesitate.

HA Admission Control the basics – Part 1/2

Duncan Epping · Jun 18, 2012 ·

Last week I received three different questions about vSphere HA Admission control and I figured I would lay out the basics once more. What is admission control?

vCenter Server uses admission control to ensure that sufficient resources are available in a cluster to provide failover protection and to ensure that virtual machine resource reservations are respected.

Almost every thing you need to know about admission control is in that single sentence. But lets break it down in to more consumable bites:

vCenter uses admission control to ensure that sufficient resources are available in a cluster to provide failover protection.
vCenter uses admission control to ensure that virtual machine resource reservations are respected.

So first and foremost… Admission control is not about resource management, I devoted a whole article to that so not going in to details, but HA admission control is all about reserving resources to allow for a failover.

Secondly, admission control ensures virtual machine resource reservations of powered-on VMs can be respected. This is because virtual machine resources reservations are required to be available in order for a power-on to successfully complete! Meaning that if you set a 5GB memory reservation there needs to be 5GB of unreserved memory available (+ reserved memory overhead) on a single host in order for this virtual machine to power-on. If that 5GB machine is actually actively using 40GB it might end up swapping / paging, as only those 5GB of reserved capacity is taken in to account!

Note the “+ reserved memory overhead”! Every virtual machine has a memory overhead. This is usually in the range of a couple hundred MBs. For a successful power-on attempt you will need to be able to reserve this memory. If there is not enough “unreserved memory capacity” the power-on attempt will fail. So in reality that 5GB could just be 5.15GB. Might seem irrelevant, but I will explain why it is relevant in a second. Did you spot the “powered-on”? Yes, admission control only takes the resource reservations of powered-on VMs in to account. So if you have a VM with a large memory reservation which is powered-off it will not impact your admission control calculations!

In summary:

Admission Control is about reserving resources to allow for a fail-over.
Admission Control is no resource management tool, it only takes reserved capacity of powered-on VMs in to account.

So now that you know what admission control is. There are three policies when it comes to admission control… and we will discuss these in Part 2 of this article.

How do I use das.isolationaddress[x]?

Duncan Epping · May 25, 2012 ·

Recently I received a question on twitter how the vSphere HA advanced option “das.isolationaddress” should be used. This setting is used when there is the desire or a requirement to specify an additional isolation address. The isolation address is used by a host which “believes” it is isolated. In other words, if a host isn’t receiving heartbeats anymore it pings the isolation address to validate if it still has network access or not. If it does still have network access (response from isolation address) then no action is taken, if the isolation address does not respond then the “isolation response” is triggered.

Out of the box the “default gateway” is used as an isolation address. In most cases it is recommended to specify at least one extra isolation address. This would be done as follows:

Right click your vSphere Cluster and select “Edit settings”
Go to the vSphere HA section and click “Advanced options”
Add “das.isolationaddress0” under the option column
And add the “IP Address” of the device you want to use as an isolation address under the value column

Now if you want to specify a second isolation address you should add “das.isolationaddress1”. In total 10 isolation addresses will be used (0 – 9). Keep in mind that all of these will be pinged in parallel! Many seem to be under the impression that this happens sequential, but that is not the case!

Now if for whatever reason the default gateway should not be used you could disable this by adding the “das.usedefaultisolationaddress” to “false”. A usecase for this would be when the default gateway is a “non-pingable” device, in most scenarios it is not needed though to use “das.usedefaultisolationaddress”.

I hope this helps when implementing your cluster,