I received two questions on the same topic last week: does it make sense to use the PDL enhancements in a non-stretched environment? The questions were linked to a scenario where, for instance, a storage admin makes a mistake and removes a specific host’s access to a LUN. For those who don’t know what a PDL (Permanent Device Loss) is, read this article, but in short it is a SCSI sense code issued by an array when it believes storage will be permanently unavailable.
First of all, the vSphere HA advanced option “das.maskCleanShutdownEnabled” is enabled by default as of vSphere 5.1. In other words, HA is going to assume a virtual machine needs to be restarted when it is powered off and isn’t able to update the config files. (The config files normally contain the details about the shutdown state: was it an admin-initiated shutdown?)
Now, one thing to note is that “disk.terminateVMOnPDLDefault” is not on by default. If this setting is not explicitly enabled, the virtual machine will not be killed and HA won’t be able to take action. In other words, if your storage admin changes the presentation of your LUNs and accidentally removes a host, the virtual machine will just sit there without access to disk. The OS might fail at some point, and your application will definitely not be happy, but that is it.
To answer the question: yes, even in a non-stretched environment it makes sense to enable both disk.terminateVMOnPDLDefault and das.maskCleanShutdownEnabled. Virtual machines will be automatically restarted by HA if they are killed by the VMkernel when a PDL has been detected.
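To make the interaction between the two settings easier to see, here is a minimal sketch in Python of the decision logic described above. It is only an illustrative model, not VMware code: the class `PdlSettings`, the function `outcome_on_pdl` and the outcome strings are hypothetical names invented for this example, and the behavior with termination enabled but masking disabled is my reading of how HA works rather than something documented here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PdlSettings:
    """Hypothetical model of the two settings discussed in this post."""
    terminate_vm_on_pdl: bool   # stands in for disk.terminateVMOnPDLDefault
    mask_clean_shutdown: bool   # stands in for das.maskCleanShutdownEnabled


def outcome_on_pdl(settings: PdlSettings) -> str:
    """Return a short description of what happens to a VM when a PDL is detected."""
    if not settings.terminate_vm_on_pdl:
        # The VMkernel does not kill the VM, so HA has nothing to react to.
        return "VM keeps running without disk access; HA takes no action"
    if not settings.mask_clean_shutdown:
        # Assumption: the VM is killed, but HA cannot read the shutdown state
        # from the inaccessible config files and may not restart it.
        return "VM is killed; HA may not restart it"
    # Both enabled: the VM is killed and HA assumes it needs a restart.
    return "VM is killed; HA restarts it"


if __name__ == "__main__":
    for terminate in (False, True):
        for mask in (False, True):
            s = PdlSettings(terminate_vm_on_pdl=terminate, mask_clean_shutdown=mask)
            print(f"terminate={terminate!s:5} mask={mask!s:5} -> {outcome_on_pdl(s)}")
```

Only the combination with both settings enabled leads to an automatic restart in this model, which is why the recommendation is to enable both.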
Pablo says
Hi Duncan! Thanks for the recommendation. What are the cons of configuring those options in a non-stretched cluster?
I mean, why are those advanced options not enabled by default??
Thanks again
Pablo.-
Duncan Epping says
Some might not want the virtual machines to be automatically killed and restarted. It does mean you incur downtime, and if the chances of ever hitting a PDL are slim and the PDL would be short-lived, why incur the downtime…
Chad Sakac says
Disclosure: EMCer here. I’m with Duncan. For me personally, I would rather have a PDL trigger a VM shutdown/restart. BTW, I got asked the other day why, if it makes sense, it’s not the default. The answer, IMO, is that people underestimate the caution needed when there are hundreds of thousands of customers and millions of installs. Step 1: introduce the changed behavior as non-default; then, later, make it the default.
IMO (and I’ve been advocating it for a while), I would make more conditions (APD) trigger HA response. For every case where an admin found “hey, I could fix the issue and everything came back to life”, there are cases where an HA response would have automated the fix.
Duncan Epping says
APD enhancements have been shown at VMworld by Keith Farkas in his session. Hopefully that will make it into the product soon:
http://www.yellow-bricks.com/2012/09/05/inf-bco2807-vsphere-ha-and-datastore-access-outages/
Agreed on the non-default part of your comment. Changing behavior like this by default is a huge risk…
David Hese says
Thanks for posting this update Duncan.
I personally would like to have this set by default, but that of course depends on the environment you are managing and the organizational structures in your company.
(VMware Admins unaware of what Storage Admins are doing, etc…)
Also, the kind of VMs and services you are managing, and their attached SLAs, are important factors to consider.
It is my opinion that those advanced settings, if they are not set by default, should be communicated better by VMware, so that every VMware Administrator out there is aware of them and can decide for himself whether to set them or not, depending on the specific requirements of the vSphere environment being managed.
Duncan says
They are typically mentioned in the release notes. Unfortunately, not everyone reads these from top to bottom…
I have requested this to be enabled by default and this is being considered.