I’ve written about Permanent Device Loss (PDL) multiple times, but another scenario that some of you might have encountered is All Paths Down (APD). The name already describes the scenario; an example would be when, for whatever reason, the network between the host and the array fails. This results in an APD condition, meaning that the LUNs are unreachable because all paths to them are gone.
Some of you who have been in this scenario have probably also seen hosts being disconnected. In some cases I have even seen a host freeze up. This would typically happen when a lot of I/O was being sent to the affected datastore. This is of course something everyone wants to avoid, and hence a new mechanism to handle APD conditions has been introduced, controlled by a new advanced setting.
This brand new setting is called Misc.APDHandlingEnable. It can be set to 0 or 1. A value of 0 means that ESXi will stick to the “old” method, which is to always retry failed I/Os. A value of 1 enables the new behavior, which allows ESXi to “fast-fail” I/Os. Fast-failing I/Os is what prevents the host from being disconnected or freezing up. By default this happens after 140 seconds, but the timeout is configurable through Misc.APDTimeout, with a minimum value of 20 seconds. Note that you can set a filter in the Web Client to find the right advanced setting, as shown in the screenshot below.
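To make the interaction between the two settings concrete, here is a minimal Python sketch of the behavior described above. It is a toy model, not how ESXi implements APD handling: the names ApdSettings, io_action and esxcli_commands are made up for illustration, and treating the exact timeout boundary as “fast-fail” is a modeling choice. The esxcli command strings are my assumption of how these advanced settings are typically changed from the command line and are not taken from this article; verify them against your ESXi version before using them.

```python
from dataclasses import dataclass

APD_TIMEOUT_DEFAULT = 140  # seconds, the default mentioned in the article
APD_TIMEOUT_MIN = 20       # seconds, the minimum mentioned in the article


@dataclass(frozen=True)
class ApdSettings:
    """Hypothetical container for the two advanced settings."""
    handling_enable: int = 1            # Misc.APDHandlingEnable: 0 or 1
    timeout: int = APD_TIMEOUT_DEFAULT  # Misc.APDTimeout, in seconds

    def __post_init__(self):
        # Reject values outside the ranges described in the article.
        if self.handling_enable not in (0, 1):
            raise ValueError("Misc.APDHandlingEnable must be 0 or 1")
        if self.timeout < APD_TIMEOUT_MIN:
            raise ValueError("Misc.APDTimeout must be at least 20 seconds")


def io_action(settings, seconds_in_apd):
    """Toy model: what happens to an I/O issued while the device is in APD."""
    if settings.handling_enable == 0:
        # "Old" behavior: failed I/Os are always retried.
        return "retry"
    # New behavior: once the timeout has elapsed, I/Os are fast-failed.
    # Using >= at the exact boundary is a modeling choice, not a documented fact.
    return "fast-fail" if seconds_in_apd >= settings.timeout else "retry"


def esxcli_commands(settings):
    """Render the assumed esxcli equivalents as strings (nothing is executed)."""
    base = "esxcli system settings advanced set"
    return [
        f"{base} -o /Misc/APDHandlingEnable -i {settings.handling_enable}",
        f"{base} -o /Misc/APDTimeout -i {settings.timeout}",
    ]


if __name__ == "__main__":
    defaults = ApdSettings()
    for elapsed in (30, 139, 141):
        print(elapsed, io_action(defaults, elapsed))
    for command in esxcli_commands(ApdSettings(timeout=60)):
        print(command)
```

With the defaults, an I/O issued 30 seconds into an APD condition is still retried, while one issued after 141 seconds is fast-failed. With Misc.APDHandlingEnable set to 0 the model always returns “retry”, no matter how long the condition lasts.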
Cormac Hogan has a great article about APD with a lot more technical details; make sure to read it.