Yesterday someone asked me a question on Twitter about an EMC whitepaper on stretched clusters and Permanent Device Loss (PDL) behavior. For those who don’t know what a PDL is, make sure to read this article first. The EMC whitepaper states the following on page 40:
In a full WAN partition that includes cross-connect, VPLEX can only send SCSI sense code (2/4/3+5) across 50% of the paths since the cross-connected paths are effectively dead. When using ESXi version 5.1 and above, ESXi servers at the non-preferred site will declare PDL and kill VM’s causing them to restart elsewhere (assuming advanced settings are in place); however ESXi 5.0 update 1 and below will only declare APD (even though VPLEX is sending sense code 2/4/3+5). This will result in a VM zombie state. Please see the section Path loss handling semantics (PDL and APD)
As far as I understood (and I tested this with 5.0 U1), the VMs would indeed not be killed when half of the paths were declared APD and the other half PDL. But I guess something has changed with vSphere 5.1. I knew about one change that isn’t clearly documented, so I figured I would do some digging and write a short article on this topic. Here are the changes in behavior:
Virtual Machine using multiple Datastores:
- vSphere 5.0 U1 and lower: When a virtual machine’s files are spread across multiple datastores, it might not be restarted when a Permanent Device Loss scenario occurs.
- vSphere 5.1 and higher: When a virtual machine’s files are spread across multiple datastores and a Permanent Device Loss scenario occurs, vSphere HA will restart the virtual machine, taking the availability of those datastores on the various hosts in your cluster into account.
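To picture what "taking the availability of those datastores into account" could mean, here is a minimal sketch. It is not vSphere HA's actual placement logic, just an illustration that assumes a host only qualifies as a restart target when it can reach every datastore the virtual machine uses. The function name, host names, and datastore names are all made up for the example.

```python
def candidate_restart_hosts(vm_datastores, host_access):
    """Return the hosts that can reach every datastore the VM uses.

    vm_datastores: iterable of datastore names holding the VM's files.
    host_access:   dict mapping host name -> set of datastores that host
                   can still access (hypothetical input, not an HA API).
    """
    needed = set(vm_datastores)
    # A host qualifies only if the needed datastores are a subset of
    # what it can access (assumption made for this illustration).
    return sorted(h for h, ds in host_access.items() if needed <= set(ds))


hosts = {
    "esx-a": {"ds1"},          # lost access to ds2
    "esx-b": {"ds1", "ds2"},   # still sees both datastores
}
print(candidate_restart_hosts(["ds1", "ds2"], hosts))
```

In this toy cluster only the host that still sees both datastores is returned.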
Half of the paths in APD state:
- vSphere 5.0 U1 and lower: When the datastore on which your virtual machine resides is not 100% declared to be in a PDL state (assume half of the paths are in APD), the virtual machine will not be killed and restarted.
- vSphere 5.1 and higher: When the datastore on which your virtual machine resides is not 100% declared to be in a PDL state (assume half of the paths are in APD), the virtual machine will be killed and restarted. This is a huge change compared to 5.0 U1 and lower.
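The difference between the two releases in the mixed APD/PDL case can be written down as a small decision function. This is only a sketch of the behavior described in the bullets above, not code from ESXi. It assumes the advanced settings mentioned in the whitepaper quote are in place, that a datastore with all paths in PDL leads to the VM being killed on both releases, and that a pure APD condition never does. The `PathState` enum and the function name are invented for the example.

```python
from enum import Enum


class PathState(Enum):
    ACTIVE = "active"
    APD = "apd"   # All Paths Down
    PDL = "pdl"   # Permanent Device Loss


def vm_killed_on_path_loss(version, path_states):
    """Model whether the VM is killed (so HA can restart it).

    version:     (major, minor) tuple, e.g. (5, 0) for 5.0 U1, (5, 1) for 5.1.
    path_states: non-empty list of PathState, one per path to the datastore.
    """
    if not path_states:
        raise ValueError("at least one path is required")
    if any(s is PathState.ACTIVE for s in path_states):
        return False  # the datastore is still reachable
    if not any(s is PathState.PDL for s in path_states):
        return False  # pure APD: no PDL declared (assumption)
    if all(s is PathState.PDL for s in path_states):
        return True   # 100% PDL: killed on both releases (assumption)
    # Mixed PDL/APD: only 5.1 and higher kill the VM.
    return version >= (5, 1)


mixed = [PathState.PDL, PathState.APD]
print(vm_killed_on_path_loss((5, 0), mixed))  # False
print(vm_killed_on_path_loss((5, 1), mixed))  # True
```

The last branch is the one that captures the change: the same half-APD, half-PDL input gives a different answer depending on the release.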
These are the changes in behavior I know about for vSphere 5.1. I have asked engineering to confirm these changes for vSphere Metro Storage Cluster environments, and I will update this blog when I have received an answer.