
Yellow Bricks

by Duncan Epping



Replaced certificates and get vSphere HA Agent unreachable?

Duncan Epping · May 24, 2013 ·

Replaced certificates and get vSphere HA Agent unreachable? I have heard this multiple times in the last couple of weeks. I started looking into it, and it seems that in many of these scenarios the common issue was the thumbprints. The log files typically give a lot of hints that look like this:

[29904B90 verbose 'Cluster' opID=SWI-d0de06e1] [ClusterManagerImpl::IsBadIP] <ip of the ha primary> is bad ip

Also, note that the UI will state “vSphere HA agent unreachable” in many of these cases. Yes I know, these error messages can be improved for sure.

You can simply solve this by disconnecting and reconnecting the hosts. Yes, it really is as simple as that, and you can do this without any downtime. There is no need to even move the VMs off; just right-click the host and disconnect it. Then, when the disconnect task has finished, reconnect it.
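If you have more than a handful of hosts to cycle, this is easy to script. Below is a minimal sketch using pyVmomi, the open-source Python SDK for the vSphere API; the vCenter address, credentials, and host name are placeholders, and the method names follow the vSphere Web Services API reference. Treat it as a starting point, not a polished tool.

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

# Placeholder connection details; replace with your own environment
ctx = ssl._create_unverified_context()  # lab only, verify certificates in production
si = SmartConnect(host="vcenter.lab.local", user="administrator@vsphere.local",
                  pwd="password", sslContext=ctx)
content = si.RetrieveContent()

# Look the host up by name anywhere in the inventory
view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.HostSystem], True)
host = next(h for h in view.view if h.name == "esxi01.lab.local")
view.Destroy()

# Disconnect, wait for the task to finish, then reconnect.
# The VMs keep running on the host the whole time.
WaitForTask(host.DisconnectHost_Task())
WaitForTask(host.ReconnectHost_Task())

Disconnect(si)

The reconnect forces vCenter to re-validate the host, which is what clears the stale thumbprint and brings the HA agent state back to normal.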

Number of vSphere HA heartbeat datastores less than 2 error, while having more?

Duncan Epping · May 23, 2013 ·

Last week on Twitter someone mentioned he had received the error that he had fewer than two vSphere HA heartbeat datastores configured. I wrote an article about this error a while back, so I asked him if he had two or more. He did, so the next thing to do was a "reconfigure for HA" to hopefully clear the message.

The number of vSphere HA heartbeat datastores for this host is 1 which is less than required 2

Unfortunately, after reconfiguring for HA the error was still there. The next suggestion was to look at the "heartbeat datastore" section of the HA settings. For whatever reason HA was configured to "Select only from my preferred datastores" while no datastores were selected, just like in the screenshot below. HA does not override this setting, so when it is configured like this NO heartbeat datastores are used, resulting in this error within vCenter. Luckily the fix is easy: just set it to "Select any of the cluster datastores".

[Screenshot: the number of heartbeat datastores for host is 1]
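For those who would rather check or fix this setting from a script, here is a hedged pyVmomi sketch; the cluster name and connection details are placeholders, and per the vSphere API reference the value "allFeasibleDs" is what backs the "Select any of the cluster datastores" option in the UI.

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab only, verify certificates in production
si = SmartConnect(host="vcenter.lab.local", user="administrator@vsphere.local",
                  pwd="password", sslContext=ctx)
content = si.RetrieveContent()

view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.ClusterComputeResource], True)
cluster = next(c for c in view.view if c.name == "Cluster-01")
view.Destroy()

# Current policy; "userSelectedDs" with an empty list is the broken state
print(cluster.configurationEx.dasConfig.hBDatastoreCandidatePolicy)

# Switch back to "Select any of the cluster datastores"
das = vim.cluster.DasConfigInfo(hBDatastoreCandidatePolicy="allFeasibleDs")
spec = vim.cluster.ConfigSpecEx(dasConfig=das)
WaitForTask(cluster.ReconfigureComputeResource_Task(spec, modify=True))
Disconnect(si)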

vSphere HA – VM Monitoring sensitivity

Duncan Epping · May 14, 2013 ·

Last week there was a question on VMTN about VM Monitoring sensitivity. I could have sworn I had written an article on that exact topic, but I couldn't find it. I figured I would do a new one, with a table explaining the sensitivity levels that you can configure VM Monitoring to use.

The question was prompted by a false positive from VM Monitoring: in this case the virtual machine was frozen due to the consolidation of snapshots, and VM Monitoring responded by restarting the virtual machine. As you can imagine the admin wasn't too impressed, as it caused downtime for his virtual machine. He wanted to know how to prevent this from happening. The answer is simple: change the sensitivity, as it is set to "high" by default.

As shown in the table below, high sensitivity means that VM Monitoring responds to a missing "VMware Tools heartbeat" within 30 seconds. Before VM Monitoring restarts the VM, however, it will check whether there was any storage or networking I/O during the last 120 seconds (advanced setting: das.iostatsInterval). Only if the answer to both is no will the VM be restarted. So if you feel VM Monitoring is too aggressive, change it accordingly!

Sensitivity | Failure interval | Max failures | Max failures time window
------------|------------------|--------------|-------------------------
Low         | 120 seconds      | 3            | 7 days
Medium      | 60 seconds       | 3            | 24 hours
High        | 30 seconds       | 3            | 1 hour

Do note that you can also change the above settings individually in the UI, as seen in the screenshot below. For instance, you could manually increase the failure interval to 240 seconds. How you should configure it is not something I can answer for you: it should be based on what you feel is an acceptable response time to a failure, and on where the sweet spot is to avoid false positives. A lot to think about indeed when introducing VM Monitoring.
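The same settings can be applied cluster-wide from a script. Here is a hedged pyVmomi sketch that sets the cluster default to the "Low" row of the table above; the cluster name and connection details are placeholders, and the type and property names (vim.cluster.DasConfigInfo, VmToolsMonitoringSettings) are as documented in the vSphere API reference.

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab only, verify certificates in production
si = SmartConnect(host="vcenter.lab.local", user="administrator@vsphere.local",
                  pwd="password", sslContext=ctx)
content = si.RetrieveContent()
view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.ClusterComputeResource], True)
cluster = next(c for c in view.view if c.name == "Cluster-01")
view.Destroy()

# "Low" sensitivity: 120 second failure interval, 3 resets per 7 days
tools = vim.cluster.VmToolsMonitoringSettings(
    enabled=True,
    failureInterval=120,      # seconds without a VMware Tools heartbeat
    maxFailures=3,            # resets allowed within the window below
    maxFailureWindow=604800)  # 7 days, expressed in seconds
das = vim.cluster.DasConfigInfo(
    vmMonitoring="vmMonitoringOnly",
    defaultVmSettings=vim.cluster.DasVmSettings(vmToolsMonitoringSettings=tools))
WaitForTask(cluster.ReconfigureComputeResource_Task(
    vim.cluster.ConfigSpecEx(dasConfig=das), modify=True))
Disconnect(si)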

Guaranteeing availability through admission control, chip in!

Duncan Epping · Apr 9, 2013 ·

I have been having discussions with our engineering teams for the last year about guaranteed restarts of virtual machines in a cluster. In its current shape/form, we use Admission Control to guarantee virtual machines can be restarted. Today Admission Control is all about guaranteeing virtual machine restarts by keeping track of memory and CPU resource reservations, but you can imagine that in the Software-Defined Datacenter this could be expanded with, for instance, storage or networking reservations.

Now why am I having these discussions, and what is the problem with Admission Control today? Well, first of all there is the perception that many appear to have of Admission Control: many believe the Admission Control algorithm uses "used" resources. In reality, however, Admission Control is not that flexible; it uses resource reservations, and as you know those are static. So what is the result of using reservations?

By using reservations for admission control, vSphere HA has a simple way of guaranteeing that a restart is possible at all times: it checks whether sufficient "unreserved resources" are available and, if so, allows the virtual machine to be powered on. If not, it won't allow the power-on, just to ensure that all virtual machines can be restarted in case of a failure. But what is the problem? Although we guarantee a restart, we do not guarantee any type of performance after the restart! Unless, of course, you set your reservations equal to what you provisioned... but I don't know anyone doing that, as it eliminates any form of overcommitment and results in an increase in cost and a decrease in flexibility.
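To make that mechanic concrete, here is a toy model in Python; this is an illustration with made-up numbers, not VMware code, but it captures the essence of a reservation-based admission check.

def admit_power_on(vm_reservation_mb, total_mb, reserved_mb, failover_buffer_mb):
    # Admit the VM only if its reservation still fits without
    # eating into the capacity set aside for failover.
    unreserved_mb = total_mb - reserved_mb
    return unreserved_mb - vm_reservation_mb >= failover_buffer_mb

# 64 GB cluster, 16 GB set aside for failover, 44 GB already reserved:
print(admit_power_on(2048, 65536, 45056, 16384))   # True:  20480 - 2048 >= 16384
print(admit_power_on(8192, 65536, 45056, 16384))   # False: 20480 - 8192 <  16384

Note what the model does not look at: actual usage. A VM with no reservation always passes the check no matter how much memory it really consumes, which is exactly the gap described above.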

So that is the problem. Question is – what should we do about it? We (the engineering teams and I) would like to hear from YOU.

  • What would you like admission control to be?
  • What guarantees do you want HA to provide?
  • After a failure, what criteria should HA apply in deciding which VMs to restart?

One idea we have been discussing is to have Admission Control use something like “used” resources… or for instance an “average of resources used” per virtual machine. What if you could say: I want to ensure that my virtual machines always get at least 80% of what they use on average? If so, what should HA do when there are not enough resources to meet the 80% demand of all VMs? Power on some of the VMs? Power on all with reduced share values?

Also, something we have discussed is having vCenter show how many resources are used on average, taking your N-X high-availability setup into account. That should at least provide some insight into how your VMs (and applications) will perform after a fail-over. Is that something you see value in?

What do you think? Be open and honest, tell us what you think... don't be scared, we won't bite, and we are open to all suggestions.

What is: Current Memory Failover Capacity?

Duncan Epping · Mar 14, 2013 ·

I have been asked this question many times by now: what is the "Current Memory Failover Capacity" that is shown in the cluster summary when you have selected the "Percentage Based Admission Control Policy"? What is that percentage? 99% of what? Will it go down to 0%, or will it go down to the percentage that you reserved? Well, I figured it was time to put things to the test and stop guessing.

As shown in the screenshot above, I have selected 33% of memory to be reserved and currently have 99% memory failover capacity. Let's power on a bunch of virtual machines and see what happens. Below is the result, shown in a screenshot: "Current Memory Failover Capacity" went down from 99% to 94%.

Also, when I increase the reservation on a virtual machine, I can see "Current Memory Failover Capacity" drop even further. So it is not about "used" memory but about "unreserved/reserved" memory resources (including memory overhead), let that be absolutely clear! When will vCenter Server shout "Insufficient resources to satisfy configured failover level for vSphere HA"?

It shouldn't be too difficult to figure that one out: just power on new VMs until it says "stop". As you can see in the screenshot below, this happens when you reach the percentage you specified to reserve as "memory failover capacity". In other words, in my case I reserved 33%; when "Current Memory Failover Capacity" reaches 33%, vCenter doesn't allow the VM to be powered on, as this would violate the selected admission control policy.
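To make the math concrete, here is a back-of-the-envelope sketch in Python with made-up numbers (and ignoring the per-VM memory overhead that the real calculation does include):

def current_memory_failover_capacity(total_mb, reserved_mb):
    # Percentage of cluster memory that is still unreserved
    return (total_mb - reserved_mb) / total_mb * 100

TOTAL_MB = 98304     # e.g. three hosts with 32 GB of memory each
CONFIGURED_PCT = 33  # percentage reserved as failover capacity

for reserved_mb in (1024, 6144, 65536, 66560):
    capacity = current_memory_failover_capacity(TOTAL_MB, reserved_mb)
    verdict = "power-on allowed" if capacity > CONFIGURED_PCT else "power-on denied"
    print(f"reserved {reserved_mb:>6} MB -> capacity {capacity:5.1f}% -> {verdict}")

With these numbers the capacity starts around 99%, drops to roughly 94% as reservations grow, and once it would fall to the configured 33% any further power-on is refused.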

I agree, this is kind of confusing… But I guess when you run out of resources it will become pretty clear very quickly 😉
