Yellow Bricks

by Duncan Epping


HA fails to initiate restart when a VM is SvMotioned and on a VDS!

Duncan Epping · Apr 11, 2012 ·

<Update>I asked William Lam if he could write a script to detect this problem and possibly even mitigate it. William worked on it over the weekend and just posted the result! Head over to his blog for the script! Thanks William for cranking it out so quickly! For those who prefer PowerCLI… Alan Renouf just posted his version of the script! Both scripts provide the same functionality though!</Update>

A couple of weeks back Craig S. commented on my blog about an issue he ran into in his environment. He was using a Distributed vSwitch and testing various failure scenarios, one of which was failing a host in the middle of a Storage vMotion of a virtual machine. After he had failed the host he expected HA to restart the virtual machine, but unfortunately this did not happen, and he could not get the virtual machine up and running again himself either. To make matters worse, the virtual machine used to test this scenario was the vCenter Server itself, which put him in a difficult position. This was the exact error:

Operation failed, diagnostics report: Failed to open file /vmfs/volumes/4f64a5db-b539e3b0-afed-001b214558a5/.dvsData/71 9e 0d 50 c8 40 d1 c3-87 03 7b ac f8 0b 6a 2d/1241 Status (bad0003)= Not found

Today I spotted a KB article which describes this scenario; the error mentioned in the KB article reveals a bit more about what is going wrong:

2012-01-18T16:23:17.827Z [FFE3BB90 error 'Execution' opID=host-6627:6-0] [FailoverAction::ReconfigureCompletionCallback]
Failed to load Dv ports for /vmfs/volumes/UUID/VM/VM.vmx: N3Vim5Fault19PlatformConfigFault9ExceptionE(vim.fault.PlatformConfigFault)
2012-01-18T16:23:17.827Z [FFE3BB90 verbose 'Execution' opID=host-6627:6-0] [FailoverAction::ErrorHandler]
Got fault while failing over vm. /vmfs/volumes/UUID/VM/VM.vmx: [N3Vim5Fault19PlatformConfigFaultE:0xecba148] (state = reconfiguring)

It seems that at the time of fail-over the “dvport” information cannot be loaded by HA, because the Storage vMotion process does not create the dvport file on the destination datastore. Please note that this applies to all VMs attached to a VDS which have been Storage vMotioned using vCenter 5.0; however, the problem will only be witnessed at the time of an HA fail-over.

This dvport info is what I described in my “digging deeper into the VDS construct” article; it is what HA uses to reconnect the virtual machine to the Distributed vSwitch… And when files are moved around, you can imagine it is difficult to power on a virtual machine.

I reproduced the problem as shown in the following screenshot. The VM has port 139 assigned by the VDS, but on the datastore there is only a dvport file for 106. This is what happened after I simply Storage vMotioned the VM from Datastore-A to Datastore-B.
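What William's and Alan's scripts essentially automate boils down to comparing the dvport a VM is attached to with the files under .dvsData on its datastore. Below is a minimal sketch of that check in Python (standard library only, run against the datastore mount point); the datastore path and VDS folder name are example values taken from the error above, and the function name is my own.

import os

def dvport_file_exists(datastore_mount, vds_uuid_folder, port_id):
    # HA reads /vmfs/volumes/<datastore>/.dvsData/<VDS UUID>/<port> at fail-over time
    path = os.path.join(datastore_mount, ".dvsData", vds_uuid_folder, str(port_id))
    return os.path.isfile(path)

# Example values taken from the error message and screenshot above
datastore = "/vmfs/volumes/4f64a5db-b539e3b0-afed-001b214558a5"
vds_folder = "71 9e 0d 50 c8 40 d1 c3-87 03 7b ac f8 0b 6a 2d"

for port in (106, 139):
    state = "present" if dvport_file_exists(datastore, vds_folder, port) else "MISSING"
    print("dvport %s: %s" % (port, state))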

For now, if you are using a Distributed vSwitch, running a virtual vCenter Server, and have Storage DRS enabled… I would recommend disabling Storage DRS for the vCenter Server VM specifically, just to avoid ending up in this scenario.

Go to the Datastores & Datastore Clusters view, edit the properties of the Datastore Cluster, and change the automation level for the virtual machine.

The problem itself can be mitigated, as described by Michael Webster here, by simply selecting a different dvPortgroup or vSwitch for the virtual machine. After the reconfiguration has completed you can select the original portgroup again; this will recreate the dvport info on the datastore.
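If you prefer to script this workaround rather than click through the UI, below is a rough pyVmomi sketch of the same reconfiguration, not the tested implementation that William and Alan published. It assumes vm, temp_pg and original_pg are a vim.VirtualMachine and two vim.dvs.DistributedVirtualPortgroup objects you already looked up through a connected ServiceInstance.

from pyVmomi import vim

def connect_nic_to_portgroup(vm, portgroup):
    # Take the VM's first virtual NIC and point its backing at the given dvPortgroup
    nic = next(d for d in vm.config.hardware.device
               if isinstance(d, vim.vm.device.VirtualEthernetCard))
    nic.backing = vim.vm.device.VirtualEthernetCard.DistributedVirtualPortBackingInfo(
        port=vim.dvs.PortConnection(
            portgroupKey=portgroup.key,
            switchUuid=portgroup.config.distributedVirtualSwitch.uuid))
    spec = vim.vm.ConfigSpec(deviceChange=[
        vim.vm.device.VirtualDeviceSpec(
            operation=vim.vm.device.VirtualDeviceSpec.Operation.edit,
            device=nic)])
    return vm.ReconfigVM_Task(spec=spec)

# connect_nic_to_portgroup(vm, temp_pg)      # wait for the task to complete first
# connect_nic_to_portgroup(vm, original_pg)  # should write the dvport info again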

The number of vSphere HA heartbeat datastores for this host is 1 which is less than required 2

Duncan Epping · Apr 5, 2012 ·

Today I noticed a lot of people end up on my blog by searching for an error related to HA heartbeat datastores. Heartbeat datastores were introduced in vSphere 5.0 (vCenter 5.0 actually, as that is where the HA agent comes from!!), and I described what they are and where they come into play in my HA deepdive section. I just wanted to make the error message that pops up when the minimum heartbeat datastore requirement is not met easier to google… This is the error that is shown when you only have 1 shared datastore available to your hosts in an HA cluster:

The number of vSphere HA heartbeat datastores for this host is 1 which is
less than required 2

Or the other common error, when there are no shared datastores at all:

The number of vSphere HA heartbeat datastores for this host is 0 which is
less than required 2

You can either add a datastore or you can simply add an advanced option in your vSphere HA cluster settings. This advanced option is the following:

das.ignoreInsufficientHbDatastore = true

This advanced option suppresses the host config alarm stating that the number of heartbeat datastores is less than the configured das.heartbeatDsPerHost. By default this option is set to “false”; in this example it is set to “true”.
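For those who want to set this option through the API instead of the client, below is a hedged pyVmomi sketch; it assumes cluster is a vim.ClusterComputeResource you already retrieved through a connected ServiceInstance, and the option key and value come straight from the text above.

from pyVmomi import vim

def suppress_heartbeat_datastore_warning(cluster):
    # Merge the advanced option into the cluster's existing vSphere HA (das) configuration
    spec = vim.cluster.ConfigSpecEx(
        dasConfig=vim.cluster.DasConfigInfo(
            option=[vim.option.OptionValue(
                key="das.ignoreInsufficientHbDatastore",
                value="true")]))
    return cluster.ReconfigureComputeResource_Task(spec=spec, modify=True)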

Permanent Device Loss (PDL) enhancements in vSphere 5.0 Update 1 for Stretched Clusters

Duncan Epping · Mar 16, 2012 ·

In the just released vSphere 5.0 Update 1 some welcome enhancements were added around vSphere HA and how a Permanent Device Loss (PDL) condition is handled. A PDL condition is communicated by the array to ESXi via a SCSI sense code and indicates that a device (LUN) is unavailable, and more than likely permanently unavailable. This condition is useful for “stretched storage cluster” configurations, where in the case of a failure in Datacenter-A the configuration in Datacenter-B can take over. An example of when a condition like this would be communicated by the array is when a LUN is “detached” during a site isolation. PDL is probably most common in non-uniform stretched solutions like EMC VPLEX. With VPLEX, site affinity is defined per LUN. If your VM resides in Datacenter-A while the LUN it is stored on has affinity to Datacenter-B, then in the case of a failure this VM could lose access to the LUN. These enhancements ensure the VM is killed and restarted on the other side.

Please note that action will only be taken when a PDL sense code is issued. When your storage completely fails, for instance, it is impossible to reach the PDL condition as no communication from the array to the ESXi host is possible anymore; the state will then be identified by the ESXi host as an All Paths Down (APD) condition. APD is a more common scenario in most environments. If you are testing these enhancements, please check the log files to validate which condition has been identified.

With vSphere 5.0 and prior, HA did not respond to a PDL condition, meaning that when a virtual machine resided on a datastore in a PDL condition the virtual machine would just sit there, unable to read from or write to disk. As of vSphere 5.0 Update 1 a new mechanism has been introduced which allows vSphere HA to take action when a datastore has reached a PDL state. Two advanced settings make this possible. The first setting is configured at the host level and is “disk.terminateVMOnPDLDefault”. It is configured in /etc/vmware/settings and should be set to “True”. This setting ensures that a virtual machine is killed when the datastore it resides on is in a PDL state; the virtual machine is killed as soon as it initiates disk I/O on a datastore which is in a PDL condition and all of the virtual machine's files reside on this datastore. Note that if a virtual machine does not initiate any I/O it will not be killed!

The second setting is a vSphere HA advanced setting called das.maskCleanShutdownEnabled. This setting is also not enabled by default and needs to be set to “True”. It allows HA to trigger a restart response for a virtual machine which has been killed automatically due to a PDL condition, by enabling HA to differentiate between a virtual machine which was killed due to the PDL state and a virtual machine which was powered off by an administrator.
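To summarize, these are the two settings described above; as mentioned, the first goes into /etc/vmware/settings on each host and the second is set as a vSphere HA advanced option on the cluster:

disk.terminateVMOnPDLDefault = True
das.maskCleanShutdownEnabled = true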

As soon as “disaster strikes” and the PDL sense code is sent, you will see the following pop up in the vmkernel.log, indicating the PDL condition and the kill of the VM:

2012-03-14T13:39:25.085Z cpu7:4499)WARNING: VSCSI: 4055: handle 8198(vscsi4:0):opened by wid 4499 (vmm0:fri-iscsi-02) has Permanent Device Loss. Killing world group leader 4491
2012-03-14T13:39:25.085Z cpu7:4499)WARNING: World: vm 4491: 3173: VMMWorld group leader = 4499, members = 1

As mentioned earlier, this is a welcome enhancement which can help in specific failure scenarios, especially in non-uniform stretched storage environments.
