In the just released vSphere 5.0 Update 1 some welcome enhancements were added around vSphere HA and how a Permanent Device Loss (PDL) condition is handled. A PDL condition is communicated by the array to ESXi via a SCSI sense code and indicates that a device (LUN) is unavailable, more than likely permanently. This condition is useful in “stretched storage cluster” configurations, where in the case of a failure in Datacenter-A the configuration in Datacenter-B can take over. An example of when the array would communicate a condition like this is when a LUN is “detached” during a site isolation. PDL is probably most common in non-uniform stretched solutions like EMC VPLEX. With VPLEX, site affinity is defined per LUN. If your VM resides in Datacenter-A while the LUN it is stored on has affinity to Datacenter-B, in case of a failure this VM could lose access to the LUN. These enhancements ensure the VM is killed and restarted on the other side.
Please note that action will only be taken when a PDL sense code is issued. When your storage completely fails, for instance, it is impossible to reach the PDL condition as no communication from the array to the ESXi host is possible anymore, and the state will be identified by the ESXi host as an All Paths Down (APD) condition. APD is a more common scenario in most environments. If you are testing these enhancements, please check the log files to validate which condition has been identified.
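One way to validate which of the two conditions you hit is to grep the vmkernel log for the messages each condition produces. A minimal sketch, assuming the ESXi 5.x log location and the log messages quoted elsewhere in this post; the `log_condition` helper name is my own:

```shell
# Classify a vmkernel log: a PDL kill logs "Permanent Device Loss",
# while APD retries log "Not found (APD)".
log_condition() {
  log="$1"
  if grep -q "Permanent Device Loss" "$log"; then
    echo PDL
  elif grep -q "(APD)" "$log"; then
    echo APD
  else
    echo none
  fi
}
# Example (on an ESXi host): log_condition /var/log/vmkernel.log
```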
With vSphere 5.0 and prior, HA had no response to a PDL condition, meaning that when a virtual machine resided on a datastore in a PDL condition, the virtual machine would just sit there, unable to read from or write to disk. As of vSphere 5.0 Update 1 a new mechanism has been introduced which allows vSphere HA to take action when a datastore has reached a PDL state. Two advanced settings make this possible. The first is configured on the host level: “disk.terminateVMOnPDLDefault”. This setting can be configured in /etc/vmware/settings and should be set to “True”. It ensures that a virtual machine is killed when the datastore it resides on is in a PDL state. The virtual machine is killed as soon as it initiates disk I/O on a datastore in a PDL condition and all of the virtual machine’s files reside on this datastore. Note that a virtual machine which does not initiate any I/O will not be killed!
The second setting is a vSphere HA advanced setting called das.maskCleanShutdownEnabled. This setting is also not enabled by default and needs to be set to “True”. It allows HA to trigger a restart response for a virtual machine which has been killed automatically due to a PDL condition, by enabling HA to differentiate between a virtual machine killed due to the PDL state and a virtual machine powered off by an administrator.
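Taken together, the configuration looks like this (a sketch: the first entry goes into /etc/vmware/settings on every host, the second is set as a vSphere HA advanced option on the cluster):

```
# /etc/vmware/settings (per host): kill VMs whose datastore hits a PDL
disk.terminateVMOnPDLDefault = "True"

# vSphere HA advanced options (cluster level): allow HA to restart killed VMs
das.maskCleanShutdownEnabled = "True"
```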
As soon as “disaster strikes” and the PDL sense code is sent, you will see the following in the vmkernel.log, indicating the PDL condition and the kill of the VM:
2012-03-14T13:39:25.085Z cpu7:4499)WARNING: VSCSI: 4055: handle 8198(vscsi4:0):opened by wid 4499 (vmm0:fri-iscsi-02) has Permanent Device Loss. Killing world group leader 4491
2012-03-14T13:39:25.085Z cpu7:4499)WARNING: World: vm 4491: 3173: VMMWorld group leader = 4499, members = 1
As mentioned earlier, this is a welcome enhancement which can help in specific failure scenarios, especially in non-uniform stretched storage environments.
That’s great news, I will see if EMC VNX will send a PDL in case of e.g. a mirror promotion.
Question in this connection: is there any way to manually cancel a pending I/O on an ESXi 4/5 host?
I think I’m a little confused. The post says that “disk.terminateVMOnPDLDefault” should be set to True by default, but it says of “das.maskCleanShutdownEnabled” that “This setting is also not enabled by default and it will need to be set to ‘True’.”
Are both parameters set to “False” by default and do both need to be set to “True” for the specific recovery scenario you have in mind?
I am a little curious how HA can recover a VM that was resident on a datastore/LUN that suffered a PDL. Doesn’t that mean the datastore is gone? I understand how something like a Vplex works and I’m guessing that we’re assuming that one side is down and HA will somehow know to stand the VM back up on a host at the other side that (hopefully) has access to that datastore. Is HA intelligent that way today, or will it have to find a suitable host by trial and error?
Thanks for the insight Duncan, I’m curious about this statement in the release notes: “Some third party HA solutions incorporate a VMX option”. Do you have a name?
Duncan Epping says
@doug: I changed the article. Both settings need to be manually set to “true” in order for this to work. As stated, these enhancements are for stretched cluster scenarios; VPLEX is probably the best example. I changed the post to reflect this.
@NITRO: I have no clue what the release note is referring to and will ask internally.
Thanks for the clarification!
Thanks for the article.
I have this exact issue on a group of ESXi 5 hosts… planning to upgrade to update 1.
If I were to manually kill the VMs that reside on the dead datastores, would it achieve the same result, halting I/O and restoring the host to normal health?
As esxcli is out of reach and esxcfg-rescan -d isn’t resolving it, I’m concerned that I could power off the VMs and still be in the same situation.
For info – the only way I managed to fix this was to reconnect the dead paths and do some hostd fixing.
– remove /var/usr/vmware/vmware-hostd.PID and vmware-watchdog.PID
– ps x | grep hostd to find the process id
– kill -9 process id
– OR use esxtop: press f, press c, press enter, then press m to list the gwid of hostd (about half way down); press k, enter the gwid of the process and press enter to kill it
– /sbin/services.sh restart
If it sticks at usbarbitrator you probably still have LUN dead issues.
esxcfg-mpath -l | grep state will list the LUNs
2 of the hosts resolved… on the other 2 I’m waiting for the storage scan to time out or pick up the new storage.
esxcfg-rescan -d shows an error that another rescan is already in progress. Unsure which process it’s running on after searching for a couple of days. Time to lab it up.
I hope this helps others.
In your article you add this option to /etc/vmware/settings; in the VMware KB it must be entered into /etc/vmware/options. We had tried both and neither worked, so we opened a VMware SR.
So just to let you know; we finally received this answer:
Unfortunately, the information in the KB you refer to, 2015681, is very misleading and the information in the article needs to be updated to reflect the correct expected behavior of HA after a PDL event.
A PDL or permanent device loss is a very specific event, and is detailed in the attached KB. If the sense codes highlighted in the KB are not reported in the environment, the host reports an APD event and not a PDL in the event of a SAN/LUN outage.
The updates in ESX5 update 1 all relate to PDL only. No changes were made to how either the ESX host or HA react to an APD outage.
As stated above, the KB article does not differentiate between an APD or a PDL event in a clear manner, and is very confusing.
This will be cleared up in the near future, to clearly identify the expected behavior.
In ESX5 Update 1, it is not expected that HA will start any VMs affected by an APD.
So perhaps you can add this information to your article.
Duncan Epping says
@Tom: this behavior is indeed targeted at PDL and not at APD, as mentioned above. I slightly altered the article to make that more obvious.
Would these two settings also help in a scenario where a storage admin made a configuration mistake and unpresented a LUN from a single host in the cluster, which would then lead to a PDL state?
Thanks for the clarification.
Duncan Epping says
Depends on the type of mistake made, but in 99% of the cases I would guess that all hosts will see the LUN in a PDL state. This means restarting will be impossible and all VMs will just go down.
Thiago Caires says
Hi Duncan, can you help me?
I am working in a deployed environment where we need to be able to rescan the adapter after a device is removed from the host. But when I rescan the adapter, the job freezes in the running tasks until the device is connected again. I cannot reboot the host to solve the problem of the dead device.
I tried the “disk.terminateVMOnPDLDefault” and das.maskCleanShutdownEnabled parameters but my VM does not go down with the device. I need to unmount the dead datastore to connect the same replicated LUN. When I remove the device the VM does not go down (its console is frozen). The vmkernel log shows:
2012-08-10T10:25:41.738Z cpu0:3546)WARNING: NMP: nmpDeviceAttemptFailover:658:Retry world failover device “naa.60c090a23de503c720fe7416314b9506” – failed to issue command due to Not found (APD), try again…
2012-08-10T10:25:41.738Z cpu0:3546)WARNING: NMP: nmpDeviceAttemptFailover:708:Logical device “naa.60c090a23de503c720fe7416314b9506”: awaiting fast path state update…
2012-08-10T10:25:49.167Z cpu1:2155)ScsiDeviceIO: 2322: Cmd(0x4124007ce7c0) 0x2a, CmdSN 0x134 from world 2050 to dev “naa.60c090a23de503c720fe7416314b9506” failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x2 0x3a 0x0.
2012-08-10T10:25:49.167Z cpu1:2155)WARNING: NMP: nmp_DeviceStartLoop:721:NMP Device “naa.60c090a23de503c720fe7416314b9506” is blocked. Not starting I/O from device.
2012-08-10T10:25:49.737Z cpu0:3546)WARNING: NMP: nmpDeviceAttemptFailover:562:Retry world restore device “naa.60c090a23de503c720fe7416314b9506” – no more commands to retry
Does the “disk.terminateVMOnPDLDefault” parameter only work when a PDL is identified?
This EMC document (page 124) says to make the configuration in /etc/vmware/settings, but in the video the config is in /etc/vmware/config. Is settings the correct file?
The configuration files of my host (I tried disk.terminateVMOnPDLDefault=TRUE in the config file and it does not work for me):
~ # cat /etc/vmware/settings
~ # cat /etc/vmware/config
libdir = “/usr/lib/vmware”
authd.proxy.vim = “vmware-hostd:hostd-vmdb”
authd.proxy.nfc = “vmware-hostd:ha-nfc”
authd.proxy.nfcssl = “vmware-hostd:ha-nfcssl”
authd.proxy.vpxa-nfcssl = “vmware-vpxa:vpxa-nfcssl”
authd.proxy.vpxa-nfc = “vmware-vpxa:vpxa-nfc”
authd.fullpath = “/sbin/authd”
authd.soapServer = “TRUE”
vmauthd.server.alwaysProxy = “TRUE”
Duncan Epping says
This setting only works when a PDL is identified. A PDL is a SCSI sense code which is sent by the array. In your case the path is gone and no SCSI sense code is sent down, so the scenario is recognized as APD (All Paths Down) and unfortunately no action is taken.
Did you remove the device manually? If so, no PDL will be generated. If you want to manually remove a device you should first unmount the datastore and then detach the device. Otherwise the ESX host will try to finish its I/O forever.
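That order can be sketched with esxcli (assuming the ESXi 5.x command namespaces; substitute your own datastore label and device ID, and treat this as an illustration rather than a tested procedure):

```shell
# Unmount the datastore first, then detach the device, so the host
# stops issuing I/O to it instead of retrying forever.
unmount_then_detach() {
  label="$1"    # datastore label, e.g. MyDatastore
  device="$2"   # device ID, e.g. naa.60c090a23de503c720fe7416314b9506
  esxcli storage filesystem unmount --volume-label="$label" &&
  esxcli storage core device set --state=off --device="$device"
}
# On an ESXi host: unmount_then_detach MyDatastore naa.60c090a23de503c720fe7416314b9506
```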
Thiago Caires says
Hi, thanks for the replies.
Yes, I removed the mapping on the storage. But when I shut down the whole storage array the device does not go to a PDL state. It only enters PDL with the specific sense code, correct? When is this sense code generated?
Yes, this is correct. Good question, when exactly this sense code will be generated. I think in the case of VPLEX it will be when the VNX/DMX/VMAX behind that specific VPLEX cluster goes down. In that case this VPLEX cluster will send a PDL, then ESX kills the VM and HA restarts it on the other side, where that VPLEX has a working VNX/DMX/VMAX behind it.
Would be nice to get some clarification from EMC², though.
Anders O says
Hi Duncan. I asked you a related question on the Q&A panel on VMworld Europe, which was something like “When is HA going to start monitoring anything else than primarily the hosts’ FDM heartbeats?”. You mentioned these PDL enhancements in 5.1, but as I understand there is still nothing that will recover/restart VMs in an APD scenario?
Do we still need to rely on MS clustering or third-party VM monitoring tools to get reliable VM/service failover in stretched clusters that experience APD scenarios?
I tried to frame my concerns in this thread: http://communities.vmware.com/thread/425584
I think you are looking for VM monitoring. Take a look at it!
Anders O says
Thanks Oliver, but VM monitoring doesn’t react to a loss of disk. Since the VMs (and their VMware Tools) are still running, they still send the heartbeats to their hosts.
Hello Duncan. You’ve written that “disk.terminateVMOnPDLDefault=true” should be used in stretched-cluster scenarios and that VMs residing on a datastore in a PDL state will be killed. I read in your clustering deepdive that VMs spanning multiple datastores aren’t supported with vSphere 5 U1. Just to clarify: this applies only to stretched clusters, vSphere 5 U1 and VMs with disks on multiple datastores, because a PDL state for one datastore would kill the whole VM even if only one of the VM’s datastores is affected, right?
David Hesse says
Thanks for posting this excellent article Duncan.
I had to make my own painful experiences with APD conditions in a vSphere 4.1 Environment last year and I am happy to see that there were some enhancements made in ESXi 5 and later.
Some weeks ago I experienced another APD condition in a vSphere 5.0 environment and again the hosts in the affected cluster became unmanageable and had to be restarted, because hostd became unresponsive after all possible worker threads were used up waiting for I/O to a device that was not responding.
Why does the customer have to modify the advanced settings in order to get the ESXi hosts reacting properly when a PDL or APD condition is experienced?
Should those settings not be enabled by default?
Has this been changed in ESX 5.1?
I don’t think this should be made a default value, because it is only sensible in a stretched cluster environment. In a non-stretched cluster I would like my hosts to retry I/O forever – or enable the timeout feature introduced in 5.1.
David Hesse says
I just found your follow-up blog post that answers my question.
Thanks for taking the time to write this up.
Your blog is first class. Keep up the good work!
Due to the way our software replicates its databases across 3 redundant nodes, if the VMs hang we could end up in a situation where a large database gets corrupted or deleted. We need a way to shut down VMs in an APD state. I could use esxcfg-mpath to get the state, or grep vmkernel.log from cron, but I would prefer something built in, or at least the option to shut a VM down. This is serious for us; we have a system due to ship to a customer in 2 weeks and I need to find a way to do this. Any ideas?
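Until something built in exists, a cron-driven watchdog along these lines might work. This is an untested sketch under several assumptions: the ESXi 5.x vim-cmd interface, APD retries showing up as “(APD)” in /var/log/vmkernel.log, and the `check_apd_and_poweroff` name being my own invention:

```shell
# Power off all registered VMs once an APD message is seen in the log.
check_apd_and_poweroff() {
  log="${1:-/var/log/vmkernel.log}"
  grep -q "(APD)" "$log" || return 0   # no APD logged: do nothing
  # getallvms prints a header line; the first column after it is the VM id
  for vmid in $(vim-cmd vmsvc/getallvms | awk 'NR > 1 {print $1}'); do
    vim-cmd vmsvc/power.off "$vmid"
  done
}
```

Run from cron at your own risk; powering off every VM on an APD that later recovers could be worse than the hang, so you would want to scope this to the affected datastores.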