
Yellow Bricks

by Duncan Epping


HA fails to initiate restart when a VM is SvMotioned and on a VDS!

Duncan Epping · Apr 11, 2012 ·

<Update>I asked William Lam if he could write a script to detect this problem and possibly even mitigate it. William worked on it over the weekend and just posted the result! Head over to his blog for the script! Thanks William for cranking it out this quick! For those who prefer PowerCLI… Alan Renouf just posted his version of the script! Both scripts provide the same functionality though!</Update>

A couple of weeks back Craig S. commented on my blog about an issue he ran into in his environment. He was using a Distributed vSwitch and testing certain failure scenarios. One of those scenarios was failing a host in the middle of a Storage vMotion of a virtual machine. After he failed the host he expected HA to restart the virtual machine, but unfortunately this did not happen. He also could not get the virtual machine up and running again himself. To make things worse, the virtual machine used in this test was the vCenter Server itself, which put him in a difficult position. This was the exact error:

Operation failed, diagnostics report: Failed to open file /vmfs/volumes/4f64a5db-b539e3b0-afed-001b214558a5/.dvsData/71 9e 0d 50 c8 40 d1 c3-87 03 7b ac f8 0b 6a 2d/1241 Status (bad0003)= Not found

Today I spotted a KB article which describes this scenario. The error mentioned in the KB article reveals a bit more about what is going wrong:

2012-01-18T16:23:17.827Z [FFE3BB90 error 'Execution' opID=host-6627:6-0] [FailoverAction::ReconfigureCompletionCallback]
Failed to load Dv ports for /vmfs/volumes/UUID/VM/VM.vmx: N3Vim5Fault19PlatformConfigFault9ExceptionE(vim.fault.PlatformConfigFault)
2012-01-18T16:23:17.827Z [FFE3BB90 verbose 'Execution' opID=host-6627:6-0] [FailoverAction::ErrorHandler]
Got fault while failing over vm. /vmfs/volumes/UUID/VM/VM.vmx: [N3Vim5Fault19PlatformConfigFaultE:0xecba148] (state = reconfiguring)

It seems that at the time of fail-over HA cannot load the “dvport” information, because the Storage vMotion process does not create the dvport file on the destination datastore. Please note that this applies to all VMs attached to a VDS which have been Storage vMotioned using vCenter 5.0; however, the problem will only be witnessed at the time of an HA fail-over.

This dvport info is what I mentioned in my “digging deeper into the VDS construct” article. As I explained there, this is what HA uses to reconnect the virtual machine to the Distributed vSwitch… and when files are moving around, you can imagine it is difficult to power on a virtual machine.

I reproduced the problem as shown in the following screenshot. The VM has port 139 assigned by the VDS, but on the datastore there is only a dvport file for 106. This is what happened after I simply Storage vMotioned the VM from Datastore-A to Datastore-B.
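If you want to spot this condition before an HA event does, the dvPort key assigned to each VDS-backed NIC can be pulled from the API and compared against the file names under .dvsData/<dvs-uuid>/ on the VM’s datastore. Below is a minimal pyVmomi sketch of that check (it is not William’s or Alan’s script); the vCenter address, credentials and inventory path are hypothetical placeholders.

# Minimal pyVmomi sketch: print the dvPort key used by each VDS-backed NIC of a VM,
# so it can be compared against the file names under .dvsData/<dvs-uuid>/ on the
# VM's datastore. Host name, credentials and inventory path are hypothetical.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

si = SmartConnect(host="vcenter.example.local", user="administrator",
                  pwd="secret", sslContext=ssl._create_unverified_context())
try:
    content = si.RetrieveContent()
    vm = content.searchIndex.FindByInventoryPath("Datacenter/vm/MyVM")
    for dev in vm.config.hardware.device:
        if not isinstance(dev, vim.vm.device.VirtualEthernetCard):
            continue
        backing = dev.backing
        if isinstance(backing,
                      vim.vm.device.VirtualEthernetCard.DistributedVirtualPortBackingInfo):
            print("%s -> dvSwitch %s, dvPort %s" %
                  (dev.deviceInfo.label, backing.port.switchUuid, backing.port.portKey))
finally:
    Disconnect(si)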

For now, if you are using a Distributed vSwitch and running a virtual vCenter Server and have Storage DRS enabled… I would recommend disabling Storage DRS for the vCenter VM specifically, just to avoid ending up in these scenarios.

Go to the Datastores & Datastore Clusters view, edit the properties of the Datastore Cluster and change the automation level:

The problem itself can be mitigated, as described by Michael Webster here, by simply selecting a different dvPortgroup or vSwitch for the virtual machine. After the reconfiguration has completed you can select the original portgroup again; this will recreate the dvport info on the datastore.
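For those who would rather script this workaround than click through the UI, the same flip can be done with a VM reconfigure call. The sketch below is a hedged pyVmomi illustration: it moves a NIC to a standard portgroup (the name “VM Network” is a hypothetical placeholder); once that task completes you would run a second reconfigure to put the NIC back on the original dvPortgroup, which recreates the dvPort file on the datastore.

# Sketch of the workaround via the API: temporarily move a NIC to a standard
# vSwitch portgroup, then (not shown) move it back to the original dvPortgroup.
from pyVmomi import vim

def move_nic_to_standard_portgroup(vm, portgroup_name="VM Network"):
    # "vm" is a vim.VirtualMachine object, e.g. looked up as in the previous sketch.
    for dev in vm.config.hardware.device:
        if isinstance(dev, vim.vm.device.VirtualEthernetCard):
            nic_spec = vim.vm.device.VirtualDeviceSpec()
            nic_spec.operation = vim.vm.device.VirtualDeviceSpec.Operation.edit
            nic_spec.device = dev
            nic_spec.device.backing = \
                vim.vm.device.VirtualEthernetCard.NetworkBackingInfo(
                    deviceName=portgroup_name)
            # Returns the reconfigure task for the first NIC found; wait for it,
            # then reconfigure the NIC back to the original dvPortgroup.
            return vm.ReconfigVM_Task(
                spec=vim.vm.ConfigSpec(deviceChange=[nic_spec]))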

Cluster Sizes – vSphere 5 style!?

Duncan Epping · Apr 10, 2012 ·

At the end of 2010 I wrote an article about cluster sizes… ever since it has been a popular article and I figured that it was time to update it. vSphere 5 changed the game when it comes to sizing/scaling of your clusters and this is an excellent opportunity to emphasize that. The key take-away of my 2010 article was the following:

I am not advocating to go big…. but neither am I advocating to have a limited cluster size for reasons that might not even apply to your environment. Write down the requirements of your customer or your environment and don’t limit yourself to design considerations around Compute alone. Think about storage, networking, update management, max config limits, DRS & DPM, HA, resource and operational overhead.

We all know that HA used to be a constraint for your cluster size… However, those times are long gone. I still occasionally see people referring to the old “max config limits” around the number of VMs per cluster when exceeding 8 hosts… This is not a concern anymore. I also still see people referring to the max 5 primary node limit… Again, not a concern anymore. Taking the 2010 article and applying it to vSphere 5, I guess we can come to the following conclusions:

  • HA does not limit the number of hosts in a cluster anymore! Using more hosts in a cluster results in less overhead. (N+1 for 8 hosts vs N+1 for 32 hosts; see the quick calculation below this list.)
  • DRS loves big clusters! More hosts equals more scheduling opportunities.
  • SCSI Locking? Hopefully all of you are using VAAI capable arrays by now… This should not be a concern. Even if you are not using VAAI, optimistic locking should have relieved this for almost all environments!
  • Max number of hosts accessing a file = 8! This is a constraint in an environment using linked clones like View
  • Max values in general (256 LUNs, 1024 Paths, 512 VMs per host, 3000 VMs per cluster)
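To put a number on the first bullet: with N+1 admission control you effectively reserve one host’s worth of capacity, so the relative overhead shrinks as the cluster grows. A quick back-of-the-envelope calculation:

# Rough illustration of N+1 overhead: one spare host as a fraction of the cluster.
for hosts in (8, 16, 32):
    print("N+1 with %2d hosts reserves ~%.1f%% of cluster capacity"
          % (hosts, 100.0 / hosts))
# 8 hosts -> 12.5%, 16 hosts -> 6.3%, 32 hosts -> ~3.1%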

Once again, I am not advocating scale-up or scale-out. I am merely showing that there are hardly any limiting factors anymore at this point in time. One of the few constraints that is still valid is the max of 8 hosts in a cluster using linked clones, or better said, a max of 8 hosts accessing a file concurrently. (Yes, we are working on fixing this…)

I would like to know from you guys what cluster sizes you are using, and if you are constrained somehow… what those constraints are… chip in!

The number of vSphere HA heartbeat datastores for this host is 1 which is less than required 2

Duncan Epping · Apr 5, 2012 ·

Today I noticed a lot of people end up on my blog by searching for an error that has to do with HA heartbeat datastores. Heartbeat datastores were introduced in vSphere 5.0 (vCenter 5.0 actually, as that is where the HA agent comes from!!) and I described what they are and where they come into play in my HA deepdive section. I just wanted to make the error message that pops up when the minimum heartbeat datastore requirement is not met easier to google… This is the error that is shown when you only have 1 shared datastore available to your hosts in an HA cluster:

The number of vSphere HA heartbeat datastores for this host is 1 which is
less than required 2

Or the other common error, when there are no shared datastores at all:

The number of vSphere HA heartbeat datastores for this host is 0 which is
less than required 2

You can either add a datastore or you can simply add an advanced option in your vSphere HA cluster settings. This advanced option is the following:

das.ignoreInsufficientHbDatastore = true

This advanced option suppresses the host config alarm stating that the number of heartbeat datastores is less than the configured das.heartbeatDsPerHost. By default this option is set to “false”; in this example it is set to “true”.
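If you prefer setting the option through the API rather than the UI, it is just a key/value pair in the cluster’s HA (das) configuration. A minimal pyVmomi sketch, with hypothetical connection details and inventory path:

# Sketch: set the vSphere HA advanced option das.ignoreInsufficientHbDatastore
# on a cluster via pyVmomi. Host name, credentials and inventory path are
# hypothetical placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

si = SmartConnect(host="vcenter.example.local", user="administrator",
                  pwd="secret", sslContext=ssl._create_unverified_context())
try:
    cluster = si.RetrieveContent().searchIndex.FindByInventoryPath(
        "Datacenter/host/Cluster")
    spec = vim.cluster.ConfigSpecEx()
    spec.dasConfig = vim.cluster.DasConfigInfo()
    spec.dasConfig.option = [
        vim.option.OptionValue(key="das.ignoreInsufficientHbDatastore",
                               value="true")
    ]
    # modify=True merges this change into the existing cluster configuration.
    task = cluster.ReconfigureComputeResource_Task(spec=spec, modify=True)
finally:
    Disconnect(si)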

Slight change in “restart” behavior for HA with vSphere 5.0 Update 1

Duncan Epping · Mar 27, 2012 ·

Although this is a corner-case scenario, I did want to discuss it to make sure people are aware of this change. Prior to vSphere 5.0 Update 1, a virtual machine would be restarted by HA when the master had detected that the state of the virtual machine had changed compared to the “protectedlist” file. In other words, a master would filter the VMs it thought had failed before trying to restart any. Prior to Update 1, a master used the protection state it read from the protectedlist; if the master did not know the on-disk protection state for a VM, it did not try to restart it. Keep in mind that only one master can open the protectedList file in exclusive mode.

In Update 1 this logic has changed slightly. HA can now retrieve the state information either from the protectedlist stored on the datastore or from vCenter Server, so now multiple masters could try to restart a VM. If one of those restarts fails, for instance because a “partition” does not have sufficient resources, the master in the other partition might be able to restart the VM. Although these scenarios are highly unlikely, this behavior change was introduced as a safety net!

 

** Disclaimer: This article contains references to the words master and/or slave. I recognize these as exclusionary words. The words are used in this article for consistency because it’s currently the words that appear in the software, in the UI, and in the log files. When the software is updated to remove the words, this article will be updated to be in alignment. **

Stretched Clusters and Site Recovery Manager

Duncan Epping · Mar 23, 2012 ·

My colleague Ken Werneburg, also known as “@vmKen“, just published a new white paper. (Follow him if you aren’t yet!) This white paper talks about both SRM and stretched cluster solutions and explains the advantages and disadvantages of each. In my opinion it provides a great overview of when a stretched cluster should be implemented and when SRM makes more sense. Various goals and concepts are discussed, and I think this is a must-read for everyone exploring the implementation of a stretched cluster or SRM.

http://www.vmware.com/resources/techresources/10262

This paper is intended to clarify concepts involved with choosing solutions for vSphere site availability, and to help understand the use cases for availability solutions for the virtualized infrastructure. Specific guidance is given around the intended use of DR solutions like VMware vCenter Site Recovery Manager and contrasted with the intended use of geographically stretched clusters spanning multiple datacenters. While both solutions excel at their primary use case, their strengths lie in different areas which are explored within.

