All of you know by now that I have a love for availability-related topics… Hence the reason I needed to write something about INF-BCO2807. The session, titled “vSphere HA and Datastore Access Outages – Current Capabilities Deep-Dive and Tech Preview” and presented by Keith Farkas and Smriti Desai, discussed possible future HA enhancements that will address component failures. Those of you who read my whitepaper on stretched clusters can immediately see why this would be a nice enhancement!
Once again a big fat disclaimer: VMware gives absolutely no guarantees when or even if this will be released.
This session was all about inaccessible datastores. During our talk Lee Dilworth and I explained the difference between a Permanent Device Loss (PDL) and an All Paths Down (APD) condition. In short, PDL is a “SCSI sense code” issued by the storage system (or an iSCSI “login reject” for that matter). This SCSI sense code allows vSphere (both the kernel and HA) to respond and act upon it. In the case of an APD vSphere cannot respond… the LUN is gone on that host and we don’t know why, so what do we do? Well, with 5.1 and prior we do nothing. This results in zombied virtual machines, and that is not the state you want your virtual machines to be in, right?
So how is VMware planning to solve this? It is planning to enhance HA with what was referred to as “Component Protection”. Component Protection allows responses per virtual machine when an APD or PDL has been detected. This is not based on guest I/Os failing, but on the vSphere platform declaring that the device is in a PDL or APD condition.
When an APD scenario is detected, HA will be smart enough to understand which hosts can restart virtual machines, as in some cases multiple hosts might be impacted. Of course it will also only kill your virtual machine and restart it when it knows capacity is available for it.
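To make the decision flow described above a bit more concrete, here is a minimal Python sketch. It is purely illustrative and not how VMware implements (or will implement) this: the `Host` fields, the capacity units, and the function names `classify`, `pick_restart_host`, and `respond` are all hypothetical. It only models the points from the session: a PDL is an explicit signal from the storage system, an APD is the absence of any response, and a virtual machine is only killed and restarted when another host has both datastore access and capacity.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Condition(Enum):
    PDL = "permanent device loss"
    APD = "all paths down"


@dataclass
class Host:
    # All fields are hypothetical, for illustration only.
    name: str
    can_access_datastore: bool
    free_capacity: int  # abstract capacity units


def classify(got_pdl_signal: bool, paths_responding: bool) -> Optional[Condition]:
    """PDL: the array told us the device is gone. APD: we hear nothing at all."""
    if got_pdl_signal:
        return Condition.PDL
    if not paths_responding:
        return Condition.APD
    return None


def pick_restart_host(vm_demand: int, failed_host: str,
                      hosts: List[Host]) -> Optional[Host]:
    """Return the first other host that still sees the datastore and has room."""
    for host in hosts:
        if host.name == failed_host:
            continue
        if host.can_access_datastore and host.free_capacity >= vm_demand:
            return host
    return None


def respond(condition: Optional[Condition], vm_demand: int,
            failed_host: str, hosts: List[Host]) -> str:
    """Only kill and restart the VM when a suitable target host is known."""
    if condition is None:
        return "no action"
    target = pick_restart_host(vm_demand, failed_host, hosts)
    if target is None:
        return "leave VM running"
    return f"restart on {target.name}"


hosts = [
    Host("esx1", False, 10),  # host that hit the APD
    Host("esx2", False, 10),  # also impacted
    Host("esx3", True, 2),    # has access, but not enough capacity
    Host("esx4", True, 8),    # has access and capacity
]
decision = respond(classify(False, False), 4, "esx1", hosts)
print(decision)
```

In this example two hosts have lost access and a third lacks capacity, so the only valid restart target is the fourth host. If no host qualifies, the sketch leaves the virtual machine alone rather than killing it with nowhere to go.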
I don’t know about you, but I would rather see this implemented today than tomorrow!? APD is not common, but also not rare… and when disaster strikes, it strikes hard!
I don’t think this session is scheduled for VMworld Europe, so make sure to watch the recording as soon as it is available, as it is well worth your time. Keith and Smriti gave an excellent deep dive on the current vSphere HA and a nice look into the future!
Interesting, will look out for the recording. (Over the years, have unfortunately seen APD bring a cluster to its knees more than once.)
Chad Sakac says
Disclosure – EMCer here!
Thanks for posting this, Duncan. Have been working this for a while with various VMware folks. I think that the most desirable behavior is to trigger VM HA response on APD. The PDL changes in 5.0 U1 and 5.1 are a great step. (PDL, as you said, is a SCSI sense code, which means the target is still alive and responding, just with no LUN or device response. APD, on the other hand, is “I’m getting nada, including getting to the target itself”.)
For every case where someone fat-fingered something and yanked datastores, only to re-add them and have the zombie VMs come back to life, I can cite another example where VM HA response would have saved the day.
Glad to work with VMware to continue to refine and harden this set of use cases!
David Hesse says
Are there any updates on this?
Is VMware planning to implement a further enhancement in the next update release?
I can’t comment on futures, to be honest. Sorry about that. I suggest, if you have one, reaching out to a local VMware representative and asking for a roadmap update. (These are under NDA.)
David Hesse says
OK, I understand.
I guess we just have to wait and see….