vSphere 5.0 HA restarting of VMs with no access to storage?

I had a question today about whether HA restarts VMs that have no access to storage. The question was whether HA would try to restart the VM and give up after 5 attempts, with the follow-up question of whether HA would try again once the storage returned for duty.

By default HA will try to restart a VM up to 5 times in roughly 30 minutes. If the master does not succeed, it will stop trying. On top of that, HA manages a “compatibility list”. This list contains the details of which VM can be restarted and where; in other words, which hosts have access to the datastores and network portgroup required for this VM to successfully power on. If for whatever reason there are no compatible hosts available for the restart, HA will not try to restart the VM.

But what if the problem is resolved? As soon as the problem is resolved, and reported as such, the compatibility list will be updated. When the list is updated HA will continue with the restarts again.

It might also be good to know that if for whatever reason the master fails, a new master will continue trying to restart the VM. It will start with 5 new attempts and will not take into account the number of restart attempts the previous master made.
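
The behavior described above can be sketched as a small model. This is a rough illustration only, not how the HA agent is implemented: the `Master` class, `try_restart`, and the `power_on` callback are hypothetical names, and it assumes that an empty compatibility list does not consume one of the 5 attempts.

```python
from dataclasses import dataclass, field

MAX_RESTART_ATTEMPTS = 5  # default number of restart attempts described above


@dataclass
class Master:
    """Hypothetical stand-in for the HA master; attempt counts live only in this object."""
    attempts: dict = field(default_factory=dict)

    def try_restart(self, vm, compatible_hosts, power_on):
        """Return the host the VM was restarted on, or None if nothing happened."""
        if not compatible_hosts:
            # No host can reach the datastores/portgroup: no attempt is made.
            # Assumption: this does not count against the 5 attempts.
            return None
        used = self.attempts.get(vm, 0)
        if used >= MAX_RESTART_ATTEMPTS:
            return None  # this master has given up on the VM
        self.attempts[vm] = used + 1
        host = compatible_hosts[0]
        return host if power_on(vm, host) else None
```

Because the counter is kept per `Master` object, creating a new one after a master failure starts again at zero, which mirrors the “5 new attempts” behavior.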

VM Monitoring only using VMware Tools heartbeat?

I had this question twice this week. A quick search on my blog showed that I wrote an article about it a while back, but I figured it wouldn’t hurt to repeat some of that and expand on it. I copied and pasted this part from our book as I think it is spot on!

VM/App monitoring uses a heartbeat mechanism kind of similar to HA. If heartbeats, and, in this case, VMware Tools heartbeats, are not received for a specific (and configurable) amount of time, the virtual machine will be restarted. These heartbeats are monitored by the HA agent and are not sent over a network, but stay local to the host.

Although the heartbeat produced by VMware Tools is reliable, VMware added a further verification mechanism. To avoid false positives, VM Monitoring also monitors I/O activity of the virtual machine. When heartbeats are not received AND no disk or network activity has occurred over the last 120 seconds (the default), the virtual machine will be reset. This 120-second interval can be modified by changing the advanced setting “das.iostatsInterval”.
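
As a minimal sketch of that decision, assuming the two conditions are simply combined with a logical AND: the function name and the `failure_interval` parameter (the configurable heartbeat timeout) are hypothetical and chosen only for illustration.

```python
DEFAULT_IOSTATS_INTERVAL = 120  # seconds; default of das.iostatsInterval per the text


def should_reset_vm(seconds_since_heartbeat, seconds_since_io, failure_interval,
                    iostats_interval=DEFAULT_IOSTATS_INTERVAL):
    """Reset only when heartbeats are missing AND the VM shows no recent I/O."""
    heartbeats_missing = seconds_since_heartbeat >= failure_interval
    io_idle = seconds_since_io >= iostats_interval
    return heartbeats_missing and io_idle
```

A VM whose VMware Tools heartbeat stopped but which is still doing disk or network I/O is left alone, which is the false-positive protection described above.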

Which isolation response should I use?

I wrote this article about split brain scenarios for the vSphere Blog. Based on this article I received some questions around which “isolation response” to use. This is not something that can be answered by a simple “recommended practice” and applied to all scenarios out there. Note that the guidance below has everything to do with your infrastructure. Are you using IP-based storage? Do you have a converged network? All of these impact the decision around the isolation response.

The following table however could be used to make a decision:

| Likelihood that host will retain access to VM datastores | Likelihood that host will retain access to VM network | Recommended isolation policy | Explanation |
|---|---|---|---|
| Likely | Likely | Leave Powered On | VM is running fine so why power it off? |
| Likely | Unlikely | Either Leave Powered On or Shutdown | Choose Shutdown to allow HA to restart VMs on hosts that are not isolated and hence are likely to have access to storage |
| Unlikely | Likely | Power Off | Use Power Off to avoid having two instances of the same VM on the VM network |
| Unlikely | Unlikely | Leave Powered On or Power Off | Leave Powered On if the VM can recover from the network/datastore outage if it is not restarted because of the isolation, and Power Off if it likely can’t |
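
For those who prefer to keep this kind of guidance in a script or runbook, the table can be expressed as a simple lookup. This is only a restatement of the rows above; the function and variable names are made up for this example.

```python
# Keys: (host likely keeps datastore access, host likely keeps VM network access)
ISOLATION_POLICY = {
    (True, True): "Leave Powered On",
    (True, False): "Leave Powered On or Shutdown",
    (False, True): "Power Off",
    (False, False): "Leave Powered On or Power Off",
}


def recommend_isolation_response(keeps_datastore_access, keeps_vm_network):
    """Return the isolation policy from the table for the two likelihoods."""
    return ISOLATION_POLICY[(bool(keeps_datastore_access), bool(keeps_vm_network))]
```

For example, `recommend_isolation_response(False, True)` returns "Power Off", the row that avoids two instances of the same VM on the VM network.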

vSphere 4.1 HA/DRS Deepdive promo was a huge hit!

Thanks to each and every one of you who took the time to download the vSphere 4.1 HA/DRS Deepdive Kindle copy during our promo days. Over 6000 downloads in just 2 days is nothing short of amazing. Frank and I were talking about this promo opportunity for the 4.1 book a week ago and never anticipated these kinds of numbers. We expected a couple of hundred copies to be given away, maybe close to 1000, but definitely not 6000+. Just some facts about this promo:

  • 175+ retweets of my tweets
  • 600+ tweets
  • 30.000+ people reached
  • 6000+ Kindle copies

We were shocked by these numbers. Thanks to everyone who helped drive this. All the tweets and the Facebook and G+ mentions contributed to this huge success.

How do I use das.isolationaddress[x]?

Recently I received a question on Twitter about how the vSphere HA advanced option “das.isolationaddress” should be used. This setting is used when there is a desire or requirement to specify an additional isolation address. The isolation address is used by a host which “believes” it is isolated. In other words, if a host isn’t receiving heartbeats anymore, it pings the isolation address to validate whether it still has network access. If it does (the isolation address responds), no action is taken. If the isolation address does not respond, the “isolation response” is triggered.

Out of the box the “default gateway” is used as an isolation address. In most cases it is recommended to specify at least one extra isolation address. This would be done as follows:

  • Right click your vSphere Cluster and select “Edit settings”
  • Go to the vSphere HA section and click “Advanced options”
  • Add “das.isolationaddress0” under the option column
  • And add the “IP Address” of the device you want to use as an isolation address under the value column

Now if you want to specify a second isolation address you should add “das.isolationaddress1”. In total 10 isolation addresses can be used (0 to 9). Keep in mind that all of these will be pinged in parallel! Many seem to be under the impression that this happens sequentially, but that is not the case!

Now if for whatever reason the default gateway should not be used, you can disable it by setting “das.usedefaultisolationaddress” to “false”. A use case for this would be when the default gateway is a “non-pingable” device. In most scenarios, though, there is no need to use “das.usedefaultisolationaddress”.
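
To make the interplay of these options a bit more concrete, here is a small sketch of how the list of isolation addresses could be assembled and checked in parallel. It is an illustration under assumptions, not the HA agent’s code: the function names, the injectable `ping` callback, and the example IP addresses are hypothetical, and it assumes a host counts as isolated only when none of the addresses respond.

```python
from concurrent.futures import ThreadPoolExecutor


def build_isolation_addresses(advanced_options, default_gateway):
    """Collect the default gateway (unless disabled) plus das.isolationaddress0..9."""
    addresses = []
    use_default = str(advanced_options.get("das.usedefaultisolationaddress", "true"))
    if use_default.lower() != "false":
        addresses.append(default_gateway)
    for i in range(10):  # addresses 0 to 9
        value = advanced_options.get(f"das.isolationaddress{i}")
        if value:
            addresses.append(value)
    return addresses


def is_isolated(addresses, ping):
    """Ping every address in parallel; isolated only if none respond (assumption)."""
    with ThreadPoolExecutor(max_workers=10) as pool:
        results = list(pool.map(ping, addresses))
    # With an empty list nothing can respond, so this also returns True.
    return not any(results)


# Hypothetical cluster settings, in the same option/value form as the dialog above.
options = {
    "das.isolationaddress0": "192.168.1.10",
    "das.isolationaddress1": "192.168.1.11",
    "das.usedefaultisolationaddress": "false",
}
```

The `ping` callback is passed in so the sketch stays testable; in a real script it would wrap whatever reachability check you trust in your environment.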

I hope this helps when implementing your cluster,